Parsing CSV files is CPU bound: a C++ test case

(These results were updated on December 19, 2008.)

In Parsing text files is CPU bound, I claimed that I had a C++ test case proving that parsing CSV files could be CPU bound. By CPU bound, I mean that the overhead of taking each line, finding where the commas are, and storing copies of the fields into an array dominates the running time.

How do I test this theory? I read the file twice. First, I just read each line and report the elapsed time. Then, I read each line, process it, and report the elapsed time. If the two times are similar, the process is I/O bound; if the second time is much larger, the process is CPU bound.
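For concreteness, here is a minimal sketch of this two-pass benchmark. It is my own illustration, not the downloadable test case: the field-splitting loop in the second pass is simply one straightforward way to copy the fields into an array.

    // Minimal sketch of the two-pass benchmark (illustration only).
    // Pass 1 reads lines and discards them; pass 2 also copies each
    // comma-separated field into a vector of strings.
    #include <ctime>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        std::string line;

        // Pass 1: read every line, do nothing else (I/O cost only).
        std::ifstream in1(argv[1]);
        std::time_t start = std::time(0);
        while (std::getline(in1, line)) { /* discard */ }
        std::cout << "without parsing: "
                  << std::difftime(std::time(0), start) << std::endl;

        // Pass 2: read every line and copy out each field
        // (I/O cost plus parsing cost).
        std::ifstream in2(argv[1]);
        start = std::time(0);
        while (std::getline(in2, line)) {
            std::vector<std::string> fields;
            std::istringstream ss(line);
            std::string field;
            while (std::getline(ss, field, ','))
                fields.push_back(field);
        }
        std::cout << "with parsing: "
                  << std::difftime(std::time(0), start) << std::endl;
        return 0;
    }

(Wall-clock seconds from std::time are coarse, but adequate when each pass takes tens of seconds.)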

I get the following result on a 2 GB file:


    $ ./parsecsv ./netflix.csv
    without parsing: 26.55
    with parsing: 95.99

Hence, parsing dominates the running time. At least in this case. At least with my C++ code.

Before you start arguing with me, please go download my reproducible test case. All you need is the GNU GCC compiler. I tested it on two machines, with two different versions of GCC.

Note: I do not claim that this is professional benchmarking.

Reference: This quest started with a post by Matt Casters, who reported that you could parse a CSV file faster using two CPU cores instead of just one.

3 thoughts on “Parsing CSV files is CPU bound: a C++ test case”

  1. You are most certainly right if you think that memory allocation is to blame. However, I specifically defined “parsing” as “copying the fields into new arrays”.

    There is no question that if I just read the bytes and do nothing with them, it is not going to end up being CPU bound, but that is not very representative of a real application, is it? Copying the fields and storing them into some array appears to me to be a basic operation.

  2. 1. tokenize does a heap allocation and a copy *per value* via the token string.

    Heap allocation is avoided in the latest version.

    2. A vector of strings also does an *extra* heap allocation and copy to copy-construct *each element*.

    This is fixed in the latest version.

    3. Shorter, completely I/O bound code (for files larger than physical memory) can be written, using a custom allocator (~10 lines of reusable code) that does only an amortized one heap allocation and memcpy *per file*. Remember that you’re using C++, not Java 🙂 The solution is left as an exercise for the reader 🙂

    I wrote in my blog post:

    «I do not claim that writing software where CSV parsing is strongly I/O bound is not possible, or even easy.»

  3. Major performance problems of the code:

    0. getline does one heap allocation and one copy for every line.

    1. tokenize does a heap allocation and a copy *per value* via the token string.

    2. A vector of strings also does an *extra* heap allocation and copy to copy-construct *each element*.

    3. Shorter, completely I/O bound code (for files larger than physical memory) can be written, using a custom allocator (~10 lines of reusable code) that does only an amortized one heap allocation and memcpy *per file*; a rough sketch of the idea follows below. Remember that you’re using C++, not Java 🙂 The solution is left as an exercise for the reader 🙂
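As a rough illustration of point 3 above (my own sketch, not the commenter’s code, and assuming the file fits in memory; the amortized scheme for larger-than-RAM files would stream through a reusable buffer instead): read the whole file with a single allocation and a single copy, then record each field as a pointer into that buffer, terminating fields in place so that nothing is copied per line or per value.

    // Single-allocation parsing sketch (assumes the file fits in memory).
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        std::FILE *f = std::fopen(argv[1], "rb");
        if (!f) return 1;
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);

        // One heap allocation and one copy for the whole file.
        std::vector<char> buf(size + 1);
        if (std::fread(&buf[0], 1, size, f) != (size_t)size) return 1;
        buf[size] = '\0';
        std::fclose(f);

        // Fields are pointers into the buffer: separators are
        // overwritten with '\0', so no field is ever copied.
        std::vector<const char *> fields;
        char *p = &buf[0];
        char *end = p + size;
        while (p < end) {
            fields.push_back(p);
            while (p < end && *p != ',' && *p != '\n') ++p;
            if (p < end) *p++ = '\0';
        }
        std::printf("fields: %lu\n", (unsigned long)fields.size());
        return 0;
    }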
