Manuel Collado notes that in addition to commas, a CSV field can also
contain double quotes, that have to be escaped by doubling them. The previously
described regexps fail to accept quoted fields with both commas and
double quotes inside. He suggests that the simplest FPAT
expression that
recognizes this kind of fields is /([^,]*)|("([^"]|"")+")/
. He
provides the following input data to test these variants:
p,"q,r",s p,"q""r",s p,"q,""r",s p,"",s p,,s
And here is his test program:
BEGIN { fp[0] = "([^,]+)|(\"[^\"]+\")" fp[1] = "([^,]*)|(\"[^\"]+\")" fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")" FPAT = fp[fpat+0] }
{ print "<" $0 ">" printf("NF = %s ", NF) for (i = 1; i <= NF; i++) { printf("<%s>", $i) } print "" }
When run on the third variant, it produces:
$ gawk -v fpat=2 -f test-csv.awk sample.csv -| <p,"q,r",s> -| NF = 3 <p><"q,r"><s> -| <p,"q""r",s> -| NF = 3 <p><"q""r"><s> -| <p,"q,""r",s> -| NF = 3 <p><"q,""r"><s> -| <p,"",s> -| NF = 3 <p><""><s> -| <p,,s> -| NF = 3 <p><><s>
In general, using FPAT
to do your own CSV parsing is like having
a bed with a blanket that’s not quite big enough. There’s always a corner
that isn’t covered. We recommend, instead, that you use Manuel Collado’s
CSVMODE
library for gawk
.