Parsing text files is CPU bound

Computer Science researchers often stress the importance of compression to get better performance. I believe this is a good illustration of an academic bias. Indeed, file size is easy to measure. It is oblivious to computer and CPU architectures. We even have a beautiful theory that tells you how far you stand from the optimal compression ratio. And better compression is important: YouTube would not be possible without aggressive video compression.

However, the following theorem is false:

Theorem 1. The time required to parse a file is proportional to its size.

Matt Casters showed that, using an open source data warehousing tool, parsing simple CSV files is CPU bound. That is, he gets (slightly) better parsing speed when using two CPU cores to parse the file than when using a single one. In a strongly I/O-bound process, using two threads to read a file would make things worse because it introduces random disk access.

I decided to verify this claim. I have an optimized CSV file parser written in C++. It may not be as fast as it can possibly be, but the 100 lines of C++ are reasonable. It is single-threaded. I launched the parser on a large CSV file and, sure enough, the command “top -o cpu” reported the process as using 100% of the CPU (on a dual-core system). So, yes, parsing CSV files is CPU bound!
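For illustration, here is a minimal sketch of the kind of single-threaded field splitting such a parser performs. This is not the actual parser: it ignores quoted fields and escaped delimiters, which real CSV data can contain.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields on commas.
// A deliberately simple sketch: quoted fields and embedded
// commas are not handled.
std::vector<std::string> parse_csv_line(const std::string &line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}
```

Even this simple loop copies every byte at least twice (once into the stream, once into each field), which hints at why the work keeps the CPU busy.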

This has serious implications. For example, comparable tests will reveal that XML parsing of large files is also CPU bound. Hence, it is a bit strange to see the binary XML people stress that they can compress XML by an average of 10 times, but report little regarding CPU usage.

Note: Your mileage may vary. I do not claim that file parsing is always CPU bound. Also, compression is an important technique to accelerate databases.

Update. Here are the file characteristics: number of columns = 4, size in GB = 2.6.

9 thoughts on “Parsing text files is CPU bound”

  1. @KWillets

    Compression isn’t just for I/O bandwidth.

    I agree.

    Getting optimally-sized operands is also a benefit, although I haven’t seen it researched much.

    Compression can be used to reduce the number of CPU cycles used. Definitely.

    If you change the question from “what is the minimum number of bits needed to store this data” to “what is the minimum number of bits needed to construct the output”, e.g., a join or intersection, it becomes more interesting.

    The minimum size of the output is used as a lower bound on the complexity of the problem in algorithmic design. So, it can be used to show that you have an optimal algorithm.

  2. @Bannister

    How big was your input file?

    Several GiBs. I work with large files.

    You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system.

    I ran the “top -o cpu” command in a separate process while the program was running.

    My observations are not meant to be scientific, but I can tell you that optimizing my CSV parsing code helped speed it up quite a bit. Had the process been very I/O bound, optimizing the code would not have mattered. My code is somewhere on google code (it is open source).

    Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.

    It should. Shouldn’t it?

    I made a blog post out of it because I find all of this puzzling. I’d be glad to see someone do a follow-up analysis. Maybe someone could prove me wrong? I’d like that.

  5. Actually, the binary XML people stress both compactness and processing efficiency. Some of the binary XML use cases are CPU bound because they have fast networks that are not congested. Others are I/O bound because they have slow, wireless, or congested networks.

    The EXI evaluation document you referenced is an early draft that does not yet include processing efficiency measurements. The next draft will include these measurements. In the meantime, see http://www.w3.org/TR/exi-measurements/ for a complete collection of binary XML compactness and processing efficiency measurements, including some taken over high-speed and wireless networks.

    Also, see http://www.agiledelta.com/efx_perffeatures.html for compactness and processing speed measurements from a commercial EXI implementation.

    All the best,

    John

  6. Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.

    Please refrain from using words such as “insanely” 🙂 Let’s put a number on it, shall we? Suppose we had one of these new SSD drives doing 200 MB/s.
    On a 2 GHz CPU, you would have 10 cycles to process a single byte of data. That is not much, considering that with those 10 cycles you need to handle String encoding (UTF-8, etc.) as well as Integer, Number, Date and Boolean encoding.

    Looking at these numbers, you see WHY reading CSV files is nearly always CPU-bound on high-end storage subsystems.
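    The arithmetic above can be checked in a couple of lines of C++. The 2 GHz and 200 MB/s figures are the assumptions from the comment, not measurements:

    ```cpp
    #include <cstdio>

    // Cycles available per input byte, given a CPU clock rate
    // and a sustained disk read rate.
    double cycles_per_byte(double cpu_hz, double disk_bytes_per_second) {
        return cpu_hz / disk_bytes_per_second;
    }

    int main() {
        // 2 GHz CPU, 200 MB/s SSD, as assumed above.
        std::printf("%.0f cycles per byte\n", cycles_per_byte(2.0e9, 200.0e6));
        // prints "10 cycles per byte"
        return 0;
    }
    ```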

  7. Matt, you will find I am very intentional in my use of language. 🙂

    What is the definition of sanity? Sanity is defined relative to what most people do most of the time. (Which might seem a bit odd – but that is another discussion.)

    When coming up with a solution, you have to make “sane” assumptions about the design space. You cannot assume unlikely resources (at least not usually). You have to assume what most of the target hardware will have, most of the time.

    If we are interested in the performance of parsing large files, that would be because we *have* large files. Most likely we have a series of very large files, generated as a part of some repeating process. (Otherwise the problem becomes uninteresting.)

    Combine the above two considerations with the cost of SSD storage, and you have an awkward fit. Also, after poking around a bit on the web, it looks as though the *sustained* read rates for SSDs are currently around 40-100 MB/s. (We are not interested in burst rates.)

    Yes, there have always been high-end (and expensive!) storage systems that deliver much higher than average performance. If you are designing a solution for a general population of machines, then it would be insane to assume performance present in only a very small minority.

    So the use of “insanely fast” can carry exactly the right freight when talking about a design. 🙂

  8. If I had to wager a guess, I’d say that all the memory management for storing the results is to blame: malloc()/free() are notoriously slow, so calling them the hundreds of millions of times a multi-gigabyte file requires is very likely the problem. Try using a static structure for the output, or else discard the output entirely (while still computing it, of course), and see what happens.
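    One way to sketch this suggestion: reuse the same output buffers across lines instead of allocating fresh storage per record. The parse_line function below is hypothetical, not taken from the parser under discussion:

    ```cpp
    #include <cstddef>
    #include <string>
    #include <vector>

    // Split a CSV line into a caller-provided vector, reusing the
    // capacity of both the vector and its strings across calls so
    // that steady-state parsing performs no allocations.
    void parse_line(const std::string &line, std::vector<std::string> &out) {
        std::size_t n = 0, start = 0;
        for (std::size_t i = 0; i <= line.size(); ++i) {
            if (i == line.size() || line[i] == ',') {
                if (n == out.size()) out.emplace_back();
                out[n++].assign(line, start, i - start); // reuses capacity
                start = i + 1;
            }
        }
        out.resize(n); // drop leftover fields from a longer previous line
    }
    ```

    Calling this in a loop with one long-lived vector avoids the per-record malloc()/free() traffic described above.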
