Description
Consider a comma-separated record in which
  • each field may be enclosed by a pair of double quotes,
  • fields are separated by commas, and
  • any character escaped by a backslash is treated literally.
In the following, we want to extract the 3rd field.
Raw Input                               Desired Output
0,1\,2,"3,4,5\",6,7",8,9                "3,4,5\",6,7"
aa,"bb,cc\",dd",ee\,ff,"gg",hh,ii,jj    ee\,ff
AA,BB,CC,DD,EE                          CC
Script and Comments
Script1
s/^((([^",\]|\\.)*|"([^"\]|\\.)*"),){2}(([^",\]|\\.)*|"([^"\]|\\.)*")(,.*|$)/\5/
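As a quick check, the script can be run against the three sample records from the Description. The following is a minimal sketch that assumes GNU sed is installed as `sed`; the function name `extract_third` is only an illustrative wrapper and is not part of the original script.

```shell
#!/bin/sh
# Hypothetical wrapper around Script1; -r makes GNU sed read the RE as an ERE.
extract_third() {
    sed -r 's/^((([^",\]|\\.)*|"([^"\]|\\.)*"),){2}(([^",\]|\\.)*|"([^"\]|\\.)*")(,.*|$)/\5/'
}

# A quoted here-document keeps the shell from touching the backslashes.
extract_third <<'EOF'
0,1\,2,"3,4,5\",6,7",8,9
aa,"bb,cc\",dd",ee\,ff,"gg",hh,ii,jj
AA,BB,CC,DD,EE
EOF
```

If everything is set up as assumed, the three lines printed should be the values listed under Desired Output. The sed script is kept in single quotes so that neither the double quotes nor the backslashes need extra shell escaping.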
Comments
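  1. GNU sed's `-r' option must be used so that sed interprets REs as extended regular expressions (EREs).
  2. \\ is used to express a literal backslash in REs; inside a bracket expression such as [^",\], a single \ is already literal.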
  3. ([^",\]|\\.)*|"([^"\]|\\.)*" matches a field, where
    • ([^",\]|\\.)* matches a non-enclosed field, while
    • "([^"\]|\\.)*" matches an enclosed one.
  4. The DFA (Deterministic Finite Automaton) shown in the following diagram is a machine that can recognize a field:
    • Each circle in the diagram stands for a state.
    • The circle with an incoming arrow labeled `start' is the initial state.
    • Reading a character that matches the label of an arc moves the machine along that arc to the next state.
    • An unlabeled arc lets the machine move from one state to another without consuming any character.
    • Reaching a concentric (double) circle means that a field has been recognized.
  5. To extract the 3rd field,
    • first we use ^((([^",\]|\\.)*|"([^"\]|\\.)*"),){2}(([^",\]|\\.)*|"([^"\]|\\.)*")(,.*|$) to match the entire record, where the RE can be divided into 3 parts:
      regular expression                       what it matches
      ^((([^",\]|\\.)*|"([^"\]|\\.)*"),){2}    the first two fields, each followed by a comma
      (([^",\]|\\.)*|"([^"\]|\\.)*")           the third field
      (,.*|$)                                  the rest of the record
    • then we replace the entire record with the 3rd field, whose contents are captured by the fifth pair of parentheses in the RE.
  6. Lines with fewer than 3 fields do not match the RE and are printed without any modification.
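The three-part decomposition in comment 5 and the pass-through behavior in comment 6 can be tried out with a small sketch. The variable names (`field`, `first_two`, `third`, `rest`) and the function `third_field` are assumptions made for illustration only; the assembled expression is intended to be the same RE as in Script1, and GNU sed is assumed.

```shell
#!/bin/sh
# Build the ERE from the three parts described in comment 5.
field='([^",\]|\\.)*|"([^"\]|\\.)*"'   # one field, non-enclosed or enclosed
first_two="^((${field}),){2}"          # two fields, each followed by a comma
third="(${field})"                     # the fifth pair of parentheses
rest='(,.*|$)'                         # the rest of the record

# Hypothetical helper: replace each matching record with its 3rd field.
third_field() {
    sed -r "s/${first_two}${third}${rest}/\\5/"
}

# A record with at least 3 fields is reduced to its 3rd field (here: CC).
printf '%s\n' 'AA,BB,CC,DD,EE' | third_field

# A record with only 2 fields does not match and is passed through as AA,BB.
printf '%s\n' 'AA,BB' | third_field
```

Splitting the RE into named pieces is only a readability aid; sed still receives one single substitution command, so the numbering of the parenthesized groups is unchanged.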