(See update 2.)
In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space.
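To make the task concrete, here is a minimal sketch of this kind of parsing, not my actual code (which is on my server): split each line on commas and trim surrounding white space from each field. The names `trim` and `parse_line` are illustrative.

```cpp
#include <string>
#include <vector>

// Strip leading and trailing white space from a field value.
std::string trim(const std::string &s) {
    size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Split one CSV line on commas, trimming each field.
// (No support for quoted fields: this is the "simple CSV" case.)
std::vector<std::string> parse_line(const std::string &line) {
    std::vector<std::string> fields;
    size_t start = 0;
    while (true) {
        size_t comma = line.find(',', start);
        size_t end = (comma == std::string::npos) ? line.size() : comma;
        fields.push_back(trim(line.substr(start, end - start)));
        if (comma == std::string::npos) break;
        start = comma + 1;
    }
    return fields;
}
```

Note that this naive version allocates fresh strings for every field, which matters for the discussion below.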
You can find my C++ code on my server.
A reader criticized my implementation as follows:
- I use the C++ getline function to read the lines. The reader commented that “getline does one heap allocation and copy for every line.” I doubt that getline triggers a heap allocation on each call: I reuse the same string object for every call.
- For each field value, I previously did two heap allocations and two copies. I now reuse the same string objects for the fields, thus limiting the number of heap allocations.
- The reader commented that I should use a custom allocator to avoid heap allocations. Currently, if the CSV file has x fields per line, I use x+1 string objects (a tiny number) and a small, constant number of heap allocations.
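Concretely, the changes amount to reusing the same buffers across the whole file. A hedged sketch of the pattern (again, not my exact code; the `CSVReader` name is mine):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reuse one line buffer and one vector of field strings for the
// whole file: once their capacities stabilize after the first few
// lines, the loop makes essentially no further heap allocations.
struct CSVReader {
    std::string line;                 // reused for every record
    std::vector<std::string> fields;  // reused string objects
    size_t nfields = 0;               // fields valid for the current record

    bool next(std::istream &in) {
        if (!std::getline(in, line)) return false;  // reuses line's buffer
        size_t start = 0, idx = 0;
        while (true) {
            size_t comma = line.find(',', start);
            size_t end = (comma == std::string::npos) ? line.size() : comma;
            // trim white space without creating a temporary string
            size_t b = start, e = end;
            while (b < e && line[b] == ' ') ++b;
            while (e > b && line[e - 1] == ' ') --e;
            if (idx == fields.size()) fields.emplace_back();
            fields[idx].assign(line, b, e - b);  // copies into existing buffer
            ++idx;
            if (comma == std::string::npos) break;
            start = comma + 1;
        }
        nfields = idx;
        return true;
    }
};
```

Keeping a separate `nfields` count, rather than shrinking the vector, avoids freeing the string buffers when consecutive lines have different field counts.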
Despite these changes, I still get that parsing CSV files is strongly CPU bound:
$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99
However, doing away with the heap allocations at every line did reduce the parsing time by a factor of two. It is not difficult to believe that the remaining gap could be closed. But I still see no evidence that parsing CSV files is strongly I/O bound, as some of my readers have stated. Consider that in real applications, I would need to convert field values to dates or numerical values. I might also need to filter values or support fancier CSV formats.
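Converting fields to numbers, for instance, adds per-field CPU work on top of the string handling. A sketch of what such a conversion step might look like (the `to_doubles` helper and its treatment of non-numeric fields are my own illustrative choices):

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert already-parsed string fields to doubles. strtod reports
// where parsing stopped, so non-numeric fields can be detected;
// here they are arbitrarily mapped to 0.0 for simplicity.
std::vector<double> to_doubles(const std::vector<std::string> &fields) {
    std::vector<double> values;
    values.reserve(fields.size());
    for (const std::string &f : fields) {
        char *end = nullptr;
        double v = std::strtod(f.c_str(), &end);
        values.push_back(end == f.c_str() ? 0.0 : v);
    }
    return values;
}
```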
Note: Yet again, I do not claim that my code is nearly optimal. My exact claim is that parsing CSV files may be CPU bound using reasonable code. I find this very surprising.