How fast is getline in C++?

A standard way to read a text file in C++ is to call the getline function. To iterate over all lines in file and sum up their length, you might do as follows:

while(getline(is, line)) {
    x += line.size();

How fast is this?

On a Skylake processor with a recent GNU GCC compiler without any disk access, it runs at about 2 GB/s. That’s slower than the maximal throughput of a good flash drive. These results suggest that reading a text file in C++ could be CPU bound in the sense that buying an even faster disk would not speed up your single-threaded throughput.

nanoseconds per byte 0.46 ns
speed 2.0 GB/s

My code is available.

If you write code that processes the strings generated by the getline function calls, in the worst case, the total time will be the sum of the time required by the getline function plus the time required by your code. In other words, you are unlikely the achieve speeds near 2 GB/s.

In comparison, a software library like simdjson can parse and validate JSON inputs, doing everything from Unicode validation to number parsing, at a speed of over 2 GB/s.

I have not tried to do so, but you can locate lines and iterate over them at much greater speeds than 2 GB/s.

Daniel Lemire, "How fast is getline in C++?," in Daniel Lemire's blog, June 18, 2019.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

28 thoughts on “How fast is getline in C++?”

  1. I wonder how much of a speed-up it is to memory-map a file(mmap) and implementing getline as a re-entrant function with just pointer arithmetic(string_view?).

        1. The benchmark is entirely in userspace: it is not reading from a file at all, but building a long stream in memory and passing that inside a std::istringstream to getline which works on any stream, not just files.

          1. I.e., Daniel’s point is “even cutting files entirely out of the equation, getline is still limited to 2 GB/s”.

          2. But then the whole stringstream mechanic will take a hold on the process. Iostreams use virtual calls all over the place and that is know to break the CPU pipeline.

  2. I think a better example would help drive the point home, as if you only care about the size of the file fseek+ftell or fstat should be O(1) instead of O(n).

    (Unless you really need the size of the file excluding newlines, but that seems to have somewhat limited utility)

  3. Is getline implemented in libstdc++? If so, one would think that its performance would definitely depend on what C++ library implementation you’re using.

    There could also be overheads when dealing with streams and strings, but I’m not knowledgeable enough with C++ to really know.

    1. getline is a C function, so it’s not implemented in libstdc++, but rather libc or whatever MSVCRT thing is used on Windows.

      1. Nevermind, I was mixing it up with the identically named getline(3) function in POSIX. Daniel is talking about std::getline in C++ and yes it would be implemented in libstc++.

    2. C++ streams are locale-aware, and have plenty of virtual methods (so that it’s possible to have streams that get their input from files, and others that get their input from strings, etc.), and C++ strings allocate memory when needed.

      That is, there are plenty of opportunities for overhead in std::getline. Whenever I write C++, I honestly ask “do I need this input to be fast or convenient?” anytime I read from a file.

  4. It varies by OS, but there are some hard limits depending on how the “IO” is done (in quotes, because it might not involve any IO in the usual case the pages are cached).

    If you use mmap, you need to fault-in every 4k page you read. Linux has “fault around” which means several pages are mapped in for every fault but you still end up limited to about 10 GB/s.

    The situation is worse for read(2) which needs to make a system call every time it needs to fill a buffer. No doubt getline() is ultimately using read(2) or something equivalent under the covers. Every system call costs ~1000 cycles or more after Spectre and Meltdown, so you need to use large buffers enough to amortize the cost – but large buffers only work to an extent: too large and you end up missing in cache. Finally, the kernel has to copy from the page cache to your process. I measured limits around 6 GB/s although I haven’t tried since Spectre & Meltdown.

    OTOH if you can get your file mapped into a 2 MIB page, the speed limit is 100 GB/s or more for mmap(). I’m not sure if that’s supported by any file system other than tmpfs yet.

    All of those limits are significantly higher than 2 GB/s, so I guess this benchmark is mostly CPU-bound inside the getline call.

  5. It’s worth pointing out (again) that this does not test IO at all. Getline is just memchr + memcpy with buffer overflow checking – the benchmark spends more time in memchr than in getline.

    If I do a loop with memchr to find the end of a line followed by memcpy into a fixed buffer, it is 70% faster. But with no checks whatsoever, it’s not safe or robust…

    However if I increase the average string size by 10x, the C++ version becomes more than 4 times as fast and the bare-bones version is 3 times as fast. The C++ overhead is now less than 25%.

    So clearly the C++ overhead is not what is killing us here, it’s the cost of doing many small unpredictable memchr/memcpy. Placing the line ends exactly 80 characters apart doubles performance of the memchr/memcpy.

  6. From my experience, there are two items that can speed up getline: faster memchr. This is my simple avx2 implementation for memchr function which has been used in fast-wc. Manage all memory related operations in the userspace if possible. Below are simple benchmark results for a very large log file in the mapped memory disk to demonstrate that both wc ad fast-wc can process lines with the speed of 4GB/s.

    processor : 87
    vendor_id : GenuineIntel
    cpu family : 6
    model : 79
    model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    stepping : 1
    microcode : 0xb00002e
    cpu MHz : 2200.081
    cache size : 56320 KB

    Linux> time wc /log/workqueue-execution-2019-06-17_00000
    ^C15.32user 0.22system 0:15.56elapsed 99%CPU (0avgtext+0avgdata 672maxresident)k
    0inputs+0outputs (0major+197minor)pagefaults 0swaps
    Linux> time wc -l /log/workqueue-execution-2019-06-17_00000
    9333291 /log/workqueue-execution-2019-06-17_00000
    1.48user 4.21system 0:05.70elapsed 99%CPU (0avgtext+0avgdata 644maxresident)k
    0inputs+0outputs (0major+191minor)pagefaults 0swaps

    Linux> ls -l /log/workqueue-execution-2019-06-17_00000
    -rw-r--r-- 1 hdang users 21650199150 Jun 19 15:08 /log/workqueue-execution-2019-06-17_00000
    Linux> time fast-wc /log/workqueue-execution-2019-06-17_00000
    Number of lines: 9333291
    Max line length: 978370
    Min line length: 132
    File size: 21650199150
    1.06user 3.82system 0:04.89elapsed 99%CPU (0avgtext+0avgdata 684maxresident)k
    0inputs+0outputs (0major+188minor)pagefaults 0swaps

    1. I’m not sure what this is supposed to do, however this code can be sped up a lot. For example with unrolling you only really need to check the buffer end once every N characters. Similarly you only need one compare to check all the special cases, eg. if ((ch ^ 2) <= 0x20) will catch all special cases. And when handling the special cases you’d use logical operations and conditional moves to avoid branch mispredictions. An advanced version would use SIMD instructions to do all this on N characters at a time.

  7. Using setvbuf(file, 0, _IOFBF, 102400); can do wonders to improve reads on a C FILE stream. Last time I measured (~20 years ago?), the knee of the curve was under 100K.

    I simply have found no reason to use the C++ streams, so I cannot comment on any equivalent.

    For bare-metal performance, I do read() into a very large buffer, and parse out the lines, as in:

    Recently used a very similar approach to parse raw radar data and dump in text form. Must be very fast to be useful to the customer, as the data can easily be tens of gigabytes per run. Tool is bottlenecked on storage performance, on the target system (eight spinning disks in RAID 0). Not code I can share, in this case.

  8. You may like to call std::ios_base::sync_with_stdio(false) before reading, it often speeds things up.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.