I am continuing my fun saga to determine whether parsing CSV files is CPU bound or I/O bound. Recall that I posted some C++ code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister correctly pointed out that using the clock() function is wrong. So I updated my code to use his ZTimer class instead. The new numbers are 103 seconds for the full parsing and 57 seconds to just read the lines without parsing.
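To illustrate why the choice of timer matters, here is a minimal sketch that measures the same piece of work with both std::clock() and a wall-clock timer. On POSIX systems, clock() reports processor time, so it can miss time spent waiting on the disk. This is not the ZTimer class: the names Timings, time_it and count_lines are hypothetical and exist only for this illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <ctime>
#include <fstream>
#include <string>

// Hypothetical helper (not the ZTimer class): holds both measurements.
struct Timings {
    double cpu_seconds;   // processor time reported by std::clock()
    double wall_seconds;  // elapsed real time from std::chrono::steady_clock
};

// Runs `work` once and reports CPU time and wall-clock time side by side.
template <typename F>
Timings time_it(F work) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    work();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    return {static_cast<double>(c1 - c0) / CLOCKS_PER_SEC,
            std::chrono::duration<double>(w1 - w0).count()};
}

// Counts the lines of a file; a stand-in for "read the lines without parsing".
// Returns 0 if the file cannot be opened.
std::size_t count_lines(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) ++n;
    return n;
}
```

Calling time_it([&] { n = count_lines("./netflix.csv"); }) and printing both fields would show how far the two clocks diverge when the process spends part of its time blocked on I/O.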
Some anonymous reader claimed that my code was still grossly inefficient. I do not like arguing without evidence.
Ah! But Unix utilities can also parse CSV files. They are usually efficient. Let us use the cut command:
$ time cut -f 1,2,3,4 -d , ./netflix.csv > /dev/null
So, 120 seconds?
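For reference, the per-line work that this cut invocation performs can be sketched in C++ as follows. This is an illustration only, not the code of cut nor of my parser; the function name first_fields is hypothetical. Like cut, it splits on every comma and does not understand quoted fields.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one line on `delim` and returns at most `max_fields` leading fields.
// Quoted fields are not handled: every delimiter starts a new field.
std::vector<std::string> first_fields(const std::string& line,
                                      std::size_t max_fields,
                                      char delim = ',') {
    std::vector<std::string> out;
    std::size_t start = 0;
    while (out.size() < max_fields) {
        const std::size_t end = line.find(delim, start);
        if (end == std::string::npos) {
            out.push_back(line.substr(start));  // last field on the line
            break;
        }
        out.push_back(line.substr(start, end - start));
        start = end + 1;  // skip past the delimiter
    }
    return out;
}
```

Applying first_fields(line, 4) to each line read with std::getline gives roughly the same selection as -f 1,2,3,4 -d , does.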
What about sorting the CSV file? Of course, it is a lot more expensive: 504 seconds.
$ time sort -t, ./netflix.csv > /dev/null
Finally, as a basis for comparison, let us just dump the file to /dev/null:
$ time cat ./netflix.csv > /dev/null
The final story:
| parsing method | time elapsed |
|---|---|
| cat Unix command | 29 s |
| Daniel’s line parser | 57 s |
| Daniel’s CSV parser | 103 s |
| cut Unix command | 120 s |
| sort Unix command | 504 s |
Analysis: My C++ code is not grossly inefficient: it is faster than the cut command. If the I/O cost of reading the file is about 30 seconds and the full parsing takes about 100 seconds, then roughly 70 seconds go to CPU work. My preliminary conclusion is that parsing CSV files is more CPU bound than I/O bound.