Parsing CSV files is CPU bound: a C++ test case (Update 2)

I am continuing my fun saga to determine whether parsing CSV files is CPU bound or I/O bound. Recall that I posted some C++ code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister correctly pointed out that using the clock() function is wrong. So I updated my code using his ZTimer class instead. The new numbers are 103 seconds for the full parsing and 57 seconds to just parse the lines.

Some anonymous reader claimed that my code was still grossly inefficient. I do not like arguing without evidence.

Ah! But Unix utilities can also parse CSV files. They are usually efficient. Let us use the cut command:

$ time cut -f 1,2,3,4 -d , ./netflix.csv > /dev/null
real 1m59.596s
user 1m53.163s
sys 0m3.775s

So, 120 seconds?

What about sorting the CSV file? Of course, it is a lot more expensive: 504 seconds.

$ time sort -t, ./netflix.csv > /dev/null
real 8m23.985s
user 2m28.855s
sys 1m1.467s

Finally, for a basis of comparison, let us just dump the file to /dev/null:

$ time cat ./netflix.csv > /dev/null
real 0m29.337s
user 0m0.029s
sys 0m2.541s

The final story:

parsing method time elapsed
cat Unix command 29 s
Daniel’s line parser 57 s
Daniel’s CSV parser 103 s
cut Unix command 120 s
sort Unix command 504 s

Analysis: My C++ code is not grossly inefficient. If the I/O cost of reading the file is about 30 seconds, parsing it takes about 100 seconds. My preliminary conclusion is that parsing CSV files is more CPU than I/O bound.

7 thoughts on “Parsing CSV files is CPU bound: a C++ test case (Update 2)”

  1. There are a couple of things that might speed things up. One is to avoid conventional I/O and mmap the file. Then you need a small DFA to parse the CSV format in memory.

    Eventually you should get an inner loop that looks something like:

    while( state = dfa[state].edges[*p++] )

    (State 0 would be the end/error state.)

    There are a few tricks to doing this, like padding the end of the mmap with sentinel characters to drive the DFA into the end state and end the loop. Or you can add a counter to bounds-check; the cpu may be able to ILP the extra instructions.

    I once did something similar, with a trie as the DFA, and it was quite fast. There are two main factors, IMHO, which are important: low instruction count, and low branch count. If your inner loop is constantly checking if it’s at the end of the input buffer, and checking other loop counters or termination conditions, the CPU’s branch prediction will degrade severely.

    You can also get into SSE instructions, and there are a few things with cache management that would probably be relevant.

  2. @KWillets
    Er, did you benchmark using memory-mapped files? Last time I did, the results I got (on both Linux and Windows) was SLOWER than simple sequential file access.

    This makes sense.

    Memory-mapped file access is optimized for random access. Sequential file access is optimized for sequential access. Operating systems do sneaky things under the covers to optimize sequential I/O (a VERY common case).

    (Not the first time I’ve run across this myth! Clearly not enough folks write benchmarks and collect measurements.)

  3. @Bannister

    Thank you. Maybe I will test out memory-mapped file later. For fun.

    I think that more open discussions about these issues is important.

    By “open” I mean “with open code”.

Leave a Reply

Your email address will not be published. Required fields are marked *