In C++, there are multiple ways to convert a string into a number: for example, to turn the string “1.0e2” into the number 100.
In software, we frequently have to parse numbers from strings. Numbers are typically represented in computers as 32-bit or 64-bit words whereas strings are variable-length sequences of bytes. We need to go from one representation to the other.
For example, given the 3-character string “123”, we want to recover the integer value 123, which is typically stored as a byte value 0x7b followed by some zero bytes.
Though parsing integers is not trivial, parsing floating-point numbers is trickier: e.g., the strings “1e10”, “10e9”, “100e+8” and “10000000000” all map to the same value. Typically, we store floating-point values as either 32-bit or 64-bit values using the IEEE 754 standard. Parsing may fail in various fun ways, even if we set aside garbage inputs that have no reasonable numerical value (like “4a.14x.10.14”). For example, the string “1e500” cannot be represented, typically, in software (except as infinity). The string “18446744073709551616” cannot be represented as either a 64-bit floating-point value or a 64-bit integer. In some instances, we need to “round” the value: the string “1.0000000000000005” cannot be represented exactly as a 64-bit floating-point value, so we need to round it down to 1.0000000000000004, while we might round “1.0000000000000006” up to 1.0000000000000007.
How expensive is it to parse numbers? To test it out, I wrote a long string containing space-separated numbers. In one case, I generated random 64-bit unsigned integers, and in another, random 64-bit normal floating-point numbers. Then I benchmarked standard C++ parsing code that simply sums up the numbers:
std::stringstream in(mystring);
while (in >> x) { sum += x; }
return sum;
I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor. I find that parsing numbers is relatively slow:
integers | 360 cycles/integer | 18 cycles/byte | 200 MB/s |
floats | 1600 cycles/float | 66 cycles/byte | 50 MB/s |
Most disks and most networks can do much better than 50 MB/s. And good disks and good networks can beat 200 MB/s by an order of magnitude.
Can you also compute cycles/char?
As you can see, floats take about 4.5x as many cycles per number, but only about 3.5x as many cycles per byte, likely because they also use many more chars.
Yes, this can be a major bottleneck.
It’s not for every application (since it doesn’t guarantee the 16th decimal place is preserved), but I’ve found ScanadvDouble() in https://github.com/chrchang/plink-ng/blob/master/2.0/plink2_string.cc to be very useful.
This CppCon 2019 talk mentions that C++17 std::from_chars is much faster:
https://youtu.be/4P_kbF0EbZM
Indeed, cppreference has this sentence: “This is intended to allow the fastest possible implementation that is useful in common high-throughput contexts such as text-based interchange (JSON or XML).”
Maybe you’d be interested in benchmarking this against your current implementation?
You may like watching “Floating-Point charconv: Making Your Code 10x Faster With C++17’s Final Boss”.
I wonder if there would be a measurable benefit from simply breaking the dependency chain on the sum variable, that is, using multiple accumulator variables instead of one. After all, the results with integers would be the same.
There are certainly faster methods to parse integers than one that takes 18 cycles per byte, at least if you can vectorise! “Parsing series of integers with SIMD” (Wojciech Muła) finds that 3-6 cycles per byte is entirely plausible to reach on smaller (shorter) integers.
But how about the standard library’s stoi / atoi thingies? I use them in my Value::deserialize() method.

Look at Boost Spirit. Spirit X3 has essentially optimal (scalar) parsing of numeric strings at -O2 and will actually check for things like overflow/underflow and narrowing.
Facebook’s Andrei Alexandrescu also gave a talk on speeding this up further (it’s probably implemented in Folly) by looking at the CPU pipeline and breaking dependencies.
Additionally, it’s worth mentioning that there’s always going to be a divide between functions that respect the user’s locale and parse strings like “3,786”, and ones that don’t. Iostreams very much do handle locales, which makes them inappropriate in general for parsing file formats where the grammar is known.