How expensive is it to parse numbers from a string in C++?

In C++, there are multiple ways to convert strings into numbers. E.g., to convert the string “1.0e2” into the number 100.

In software, we frequently have to parse numbers from strings. Numbers are typically represented in computers as 32-bit or 64-bit words whereas strings are variable-length sequences of bytes. We need to go from one representation to the other.

For example, given the 3-character string “123”, we want to recover the integer value 123, which is typically stored as a byte value 0x7b followed by some zero bytes.

Though parsing integers is not trivial, parsing floating-point numbers is trickier: e.g., the strings “1e10”, “10e9”, “100e+8” and “10000000000” all map to the same value. Typically, we store floating-point values as either 32-bit or 64-bit values using the IEEE 754 standard. Parsing may fail in various fun ways even if we set aside garbage inputs that have no reasonable number value (like “4a.14x.10.14”). For example, the string “1e500” cannot be represented, typically, in software (except as the infinity). The string “18446744073709551616” cannot be represented as either a 64-bit floating-point value or a 64-bit integer. In some instances, we need to “round” the value: the string “1.0000000000000005” cannot be represented exactly as a 64-bit floating-point value, so we need to round it down to “1.0000000000000004” while we might round “1.0000000000000006” up to 1.0000000000000007.

How expensive is it to parse numbers? To test it out, I wrote a long string containing space-separated numbers. In one case, I generated random 64-bit unsigned integers, and another I generated random 64-bit normal floating-point numbers. Then I benchmark standard C++ parsing code that simply sums up the numbers:

std::stringstream in(mystring);
while(in >> x) {
   sum += x;
}
return sum;

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor. I find that parsing numbers is relatively slow:

integers 360 cycles/integer 18 cycles/byte 200 MB/s
floats 1600 cycles/integer 66 cycles/byte 50 MB/s

Most disks and most networks can do much better than 50 MB/s. And good disks and good networks can beat 200 MB/s by an order of magnitude.

My source code is available.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

8 thoughts on “How expensive is it to parse numbers from a string in C++?”

  1. Can you also compute cycles/char?

    As you can see, ints take about 3x as many cycles, but the throughput is 20x worse, likely because they also use many more chars.

  2. This talk of the CppCon 2019 mentions C++17 std::from_chars being much faster:

    https://youtu.be/4P_kbF0EbZM

    Indeed, cppreference has this sentence: „This is intended to allow the fastest possible implementation that is useful in common high-throughput contexts such as text-based interchange (JSON or XML).“

    Maybe you‘d be interested in benchmarking this against your current implementation?

  3. I wonder if there would be a measurable benefit from simply breaking the dependency chain on sum variable, that is using multiple variables instead of one. After all, results with integers would be the same.

    There are certainly faster methods to parse integers than one that takes 18 cycles per byte, at least if you can vectorise! Parsing series of integers with SIMD (Wojciech Muła) finds out that 3-6 cycles per byte is entirely plausible to reach on smaller (shorter) integers.

  4. Look at Boost Spirit. Spirit X3 has essentially optimal (scalar) parsing of numeric strings at -O2 and will actually check for things like over/under flow and narrowing.

    Facebook’s Andrei Alexandrescu also gave a talk on speeding this up further (it’s probably implemented in Folly) by looking at the CPU pipeline and breaking dependencies.

  5. Additionally, it’s worth mentioning that there’s always going to be a divide between functions that respect the users locale , and parse strings like “3,786” and ones that don’t. Iostreams very much do handle this, which makes them inappropriate in general for parsing file formats where a grammar is known.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.