Parsing CSV files is CPU bound: a C++ test case (Update 1)

(See update 2.)

In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space.

You can find my C++ code on my server.

A reader criticized my implementation as follows:

  • I use the C++ getline function to read the lines. The reader commented that “getline does one heap allocation and copy for every line.” I doubt that getline triggers a heap allocation on each call: I reuse the same string object for every call.
  • For each field value, I did two heap allocations and two copies. I now reuse the same string objects for the fields, thus limiting the number of heap allocations (see the sketch after this list).
  • The reader commented that I should use a custom allocator to avoid heap allocations. Currently, if the CSV file has x fields, I use x+1 string objects (a tiny number) and a small constant number of heap allocations.
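
To make this concrete, here is a simplified sketch of the kind of loop I have in mind. It is not my actual parser: it skips the white-space trimming, assumes exactly four comma-separated fields, and reads whatever file you pass on the command line.

    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        std::ifstream in(argv[1]);
        std::string line;                   // reused by every getline call
        std::vector<std::string> fields(4); // one string per field, reused for every row
        unsigned long rows = 0;
        while (std::getline(in, line)) {
            std::string::size_type start = 0;
            for (std::size_t i = 0; i < fields.size(); ++i) {
                std::string::size_type comma = line.find(',', start);
                std::string::size_type len =
                    (comma == std::string::npos) ? line.size() - start : comma - start;
                // assign copies into the field's existing buffer: no per-field
                // heap allocation once its capacity is large enough
                fields[i].assign(line, start, len);
                start = (comma == std::string::npos) ? line.size() : comma + 1;
            }
            ++rows;
        }
        std::cout << rows << " rows" << std::endl;
        return 0;
    }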

Despite these changes, I still find that parsing CSV files is strongly CPU bound:


$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99

However, doing away with the heap allocations at every line did reduce the parsing running time by a factor of two. It is not difficult to believe that I could close the gap further. But I still see no evidence that parsing CSV files is strongly I/O bound, as some of my readers have stated. Consider that in real applications, I would need to convert field values to dates or to numerical values. I might also need to filter values, or support fancier CSV formats.
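
For example, converting a numeric field would add at least one conversion call per value. Here is a sketch of what that looks like (fieldToDouble is just an illustrative helper, not part of my test code):

    #include <cstdlib>
    #include <string>

    // Convert one already-extracted field to a double.
    // Returns 0.0 on malformed input; real code would report the error.
    double fieldToDouble(const std::string & field) {
        char * end = 0;
        double value = std::strtod(field.c_str(), &end);
        if (end == field.c_str()) return 0.0; // no digits were consumed
        return value;
    }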

My experiments are motivated by a post by Matt Casters. Some said that Java was to blame. I use C++ and I get a similar result, so far at least. Can you tell me where I went wrong?

Note: Yet again, I do not claim that my code is nearly optimal. My exact claim is that reading CSV files may be CPU bound using reasonable code. I find this very surprising.

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

6 thoughts on “Parsing CSV files is CPU bound: a C++ test case (Update 1)”

  1. “You’re still doing two heap allocations (one for the string object, one for the actual char buffer).”

    No. I do not. Here is the code:


    c.resize(fieldlength);
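    // c is a reused std::string: assign copies into its existing buffer (no new allocation once its capacity suffices)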
    c.assign(str, lastPos, fieldlength);

    “Also vector resizing is slow (O(n)).”

    No, absolutely not. Vector resize is amortized O(1).

    “Let me throw around a few numbers: lzop compression is 150MB/s.”

    I think not. More like 40 MB/s.

    $ time lzop -p ./netflix.csv

    real 0m50.057s
    user 0m22.245s
    sys 0m4.338s

    “Try harder.”

    Care to provide actual code? Or provide a patch for my code?

  2. But “despite these changes” is only one minor change to reduce one string alloc/copy per field. You’re still doing two heap allocations (one for the string object, one for the actual char buffer) and one copy *per field*, which is a far cry from the goal of an amortized one alloc + copy *per file*. Every heap allocation involves a mutex lock/unlock, which is expensive even without contention. Also, vector resizing is slow: it is O(n) and needs to reconstruct all the string elements. So, depending on the number of strings in the vector, you are doing two heap allocations and a copy for every element multiple times during a run.

    A 2GB file in 26.55s is 75MB/s, which is about right for the top speed of a sequential read on a single modern hard drive doing nothing else (otherwise it will slow down dramatically). 2GB in 96s, which is about 21MB/s, is very slow for parsing. Let me throw around a few numbers: lzop compression is 150MB/s, and decompression can exceed 500MB/s on a 2.6GHz C2D processor (see benchmarks on http://www.quicklz.com/).

    I’ve written an indexer (including HTML parsing, word breaking, and building small in-memory inverted indexes; the in-memory indexes do need to be small enough to fit into the L2 cache, otherwise the speed drops dramatically) that has a throughput near 100MB/s.

    Try harder.

  3. “Try to compress something that fits completely into memory multiple times and make sure it doesn’t write to disk (pipe to /dev/null).”

    I gained two seconds:


    $ time lzop -c ./netflix.csv > /dev/null

    real 0m47.757s
    user 0m22.230s
    sys 0m2.577s

    “Vector resize is definitely not amortized O(1), more like O(log n), because every time it resizes, if the storage is not enough, the storage is doubled. (What I meant by O(n) is that when the storage is not enough, the resize operation is O(n): you need to allocate new storage, copy-construct all the elements into the new storage, and delete the old storage. You have to do that O(log n) times.)”

    What is large is the number of rows, not the field size. But let us say that n is the size of a field. Ok, but n is no larger than 16, so log n = 4. Four (4) is a fixed number. I have millions of rows. If you amortize 4 over millions of rows, it no longer matters.

    “Changing your vector to a deque should make your program a lot faster even without a custom allocator, which is required for the next speed-up.”

    I do not see why this would be true. The vector contains 4 strings. It is hard to beat a simple vector when you are storing only 4 strings.

    “Make sure you compile your code with -O3 if you’re using g++”

    -O3 makes things slower.

  4. What’s the size of the physical RAM on your box? Your lzop number is certainly bound by your I/O speed. Try to compress something that fits completely into memory multiple times and make sure it doesn’t write to disk (pipe to /dev/null).

    Vector resize is definitely not amortized O(1), more like O(log n), because every time it resizes, if the storage is not enough, the storage is doubled. (What I meant by O(n) is that when the storage is not enough, the resize operation is O(n): you need to allocate new storage, copy-construct all the elements into the new storage, and delete the old storage. You have to do that O(log n) times.)

    Changing your vector to a deque should make your program a lot faster even without a custom allocator, which is required for the next speed-up.

    Make sure you compile your code with -O3 if you’re using g++

  5. Sorry, I misread your code. I thought you put all the fields in the vector instead of just doing a simple split 🙂 Your code should be a lot faster than 20MB/s on modern hardware.

    Can you post your platform specs? uname -a, gcc -v? CPU?

    Your numbers just don’t jibe with my experience. Your own code runs faster with -O3 on a two-year-old MacBook Pro:

    $ ./po2 largeuni.csv
    without parsing: 0.107499
    with parsing: 0.256668

    $ ./po3 largeuni.csv
    without parsing: 0.103201
    with parsing: 0.245105

    If you want to get to the bottom of it, generate a small set of CSV data (say 10MB, enough for testing parsing speed) that everyone can verify, and I’ll take a stab at the code.

  6. “Can you post your platform specs? uname -a, gcc -v? CPU?”

    I use a standard Mac Pro desktop. The specs are described in this recent paper:

    * Owen Kaser, Daniel Lemire, Kamel Aouiche, Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes, DOLAP 2008, 2008.
    http://arxiv.org/abs/0808.2083

    “If you want to get to the bottom of it, generate a small set of CSV data (say 10MB, enough for testing parsing speed) that everyone can verify, and I’ll take a stab at the code.”

    Just generate any large CSV file. If you use something “typical” (small fields, about 4 of them, and lots of rows), it should not matter which file you use exactly.
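
    If it helps, here is a rough sketch of a generator for such a file (the four-field layout, the field values, and the default row count are arbitrary choices; any similar file will do):

        #include <cstdio>
        #include <cstdlib>

        // Writes rows of four small comma-separated fields to stdout.
        // Hypothetical usage: ./gencsv 10000000 > test.csv
        int main(int argc, char **argv) {
            long rows = (argc > 1) ? std::atol(argv[1]) : 1000000;
            for (long i = 0; i < rows; ++i) {
                std::printf("%ld,%d,%d,%d\n", i, std::rand() % 100,
                            std::rand() % 10000, std::rand() % 5);
            }
            return 0;
        }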
