In our work parsing JSON documents as quickly as possible, we found that one of the most challenging problems is parsing numbers. That is, you want to take the string “1.3553e142” and convert it quickly to a double-precision floating-point number. You can use the strtod function from the standard C/C++ library, but it is quite slow. People who write fast parsers tend to roll their own number parsers (e.g., RapidJSON, sajson), and so we did. However, we sacrifice some standard compliance. You see, the floating-point standard that we all rely on (IEEE 754) has some hard-to-implement features like “round to even”. Sacrificing such fine points means that you can be off by one bit when decoding a string. In practice, this never matters: double-precision numbers have more accuracy than any engineering project will ever need, and a difference in the last bit is irrelevant. Nevertheless, it is mildly annoying.
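As a baseline, here is a minimal sketch of calling strtod on a string such as the one above. The helper name `parse_with_strtod` is hypothetical, and the error policy (reject trailing characters and out-of-range values) is an assumption made for illustration. It assumes a C++17 compiler.

```cpp
#include <cerrno>
#include <cstdlib>
#include <optional>

// Hypothetical helper: parse a whole NUL-terminated string as a double.
// Returns std::nullopt if nothing was parsed, if characters are left over,
// or if the value is out of range.
// Note: strtod depends on the current C locale and accepts more than JSON
// allows (leading whitespace, hexadecimal floats, "inf", "nan").
std::optional<double> parse_with_strtod(const char* text) {
    char* end = nullptr;
    errno = 0;
    double value = std::strtod(text, &end);
    if (end == text) return std::nullopt;      // no conversion was performed
    if (*end != '\0') return std::nullopt;     // trailing characters remain
    if (errno == ERANGE) return std::nullopt;  // overflow or underflow
    return value;
}
```

Calling `parse_with_strtod("1.3553e142")` yields the double nearest to that decimal value, while an input such as `"1.5x"` is rejected.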
A better alternative in C++ might be from_chars. Unfortunately, many standard libraries have not yet caught up with the standard and fail to support from_chars properly. One can get around this problem by using the excellent abseil library. Its from_chars tends to be much faster than the venerable strtod function.
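For readers who have not used it, a minimal sketch of std::from_chars for doubles follows. Since library support is uneven, the sketch checks the `__cpp_lib_to_chars` feature-test macro and otherwise falls back to strtod. The helper name `parse_double` is hypothetical, and a C++17 compiler is assumed. The abseil variant is not shown here.

```cpp
#include <charconv>
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse all of `text` as a double.
std::optional<double> parse_double(std::string_view text) {
#if defined(__cpp_lib_to_chars) && __cpp_lib_to_chars >= 201611L
    // The library advertises floating-point from_chars support.
    // from_chars is locale-independent, does not skip whitespace,
    // and does not accept a leading '+'.
    double value = 0.0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto result = std::from_chars(first, last, value);
    if (result.ec != std::errc() || result.ptr != last) return std::nullopt;
    return value;
#else
    // Fallback: strtod needs a NUL-terminated buffer, so copy the input.
    std::string copy(text);
    char* end = nullptr;
    double value = std::strtod(copy.c_str(), &end);
    if (end == copy.c_str() || *end != '\0') return std::nullopt;
    return value;
#endif
}
```

The two branches are not perfectly equivalent (for example, they treat out-of-range inputs differently), so this is only a sketch of how one might probe for support.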
Unfortunately, for our use cases, even abseil’s from_chars is much too slow. It can be two or three times slower than our fast-but-imperfect number parser.
I was going to leave it be. Yet Michael Eisel kept insisting that it should be possible to both follow the standard and achieve great speed. Michael gave me an outline. I was unconvinced. And then he gave me a code sample: it changed my mind. The full idea requires a whole blog post to explain, but the gist of it is that we can attempt to compute the answer, optimistically using a fast algorithm, and fall back on something else (like the standard library) as needed. It turns out that for the kind of numbers we find in JSON documents, we can parse 99% of them using a simple approach. All we have to do is correctly detect the error cases and bail out.
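To make the "try fast, bail out otherwise" pattern concrete, here is a minimal sketch. It is not the code from our parser: it uses the classic exact fast path (a decimal significand of at most 2^53 combined with a power of ten between 10^-22 and 10^22), which covers fewer inputs than the full approach. The names `try_fast_path` and `parse_number` are hypothetical. The sketch assumes round-to-nearest mode and that double arithmetic is not carried out in extended precision.

```cpp
#include <cstdint>
#include <cstdlib>

// Powers of ten that a double represents exactly (10^0 through 10^22).
static const double kExactPow10[] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
    1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};

static inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Hypothetical sketch. Accepts [-]digits[.digits][(e|E)[+|-]digits] in a
// NUL-terminated string. Returns false whenever the result is not known to
// be exact, so that the caller can fall back on a slower routine.
bool try_fast_path(const char* p, double* out) {
    bool negative = (*p == '-');
    if (negative) ++p;
    std::uint64_t mantissa = 0;
    int digits = 0;  // digit characters consumed so far
    int exp10 = 0;   // decimal exponent applied to mantissa
    for (; is_digit(*p); ++p, ++digits)
        mantissa = 10 * mantissa + static_cast<std::uint64_t>(*p - '0');
    if (*p == '.') {
        ++p;
        for (; is_digit(*p); ++p, ++digits, --exp10)
            mantissa = 10 * mantissa + static_cast<std::uint64_t>(*p - '0');
    }
    // No digits at all, or so many that the 64-bit integer may have overflowed.
    if (digits == 0 || digits > 19) return false;
    if (*p == 'e' || *p == 'E') {
        ++p;
        bool exp_negative = false;
        if (*p == '+' || *p == '-') { exp_negative = (*p == '-'); ++p; }
        if (!is_digit(*p)) return false;
        int e = 0;
        for (; is_digit(*p); ++p) {
            if (e > 1000) return false;  // huge exponent: let the fallback decide
            e = 10 * e + (*p - '0');
        }
        exp10 += exp_negative ? -e : e;
    }
    if (*p != '\0') return false;                                // unexpected trailing input
    if (mantissa > (std::uint64_t(1) << 53)) return false;       // not exact as a double
    if (exp10 < -22 || exp10 > 22) return false;                 // power of ten not exact
    // Both operands are exact, so one multiplication or division is
    // correctly rounded under the assumptions stated above.
    double value = static_cast<double>(mantissa);
    if (exp10 < 0) value /= kExactPow10[-exp10];
    else value *= kExactPow10[exp10];
    *out = negative ? -value : value;
    return true;
}

// Optimistic parse: fast path first, standard library otherwise.
double parse_number(const char* text) {
    double value = 0.0;
    if (try_fast_path(text, &value)) return value;
    return std::strtod(text, nullptr);
}
```

With this simplified version, an input like "3.14159" takes the fast path, whereas "1.3553e142" is outside its range and goes to strtod. Handling more inputs on the fast path is where the real work lies.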
Your results will vary, but the next table gives the speed numbers from my home iMac (2017). The source code is available along with everything necessary to test it out (Linux and macOS).
parser | speed |
---|---|
fast_double_parser (new) | 660 MB/s |
abseil, from_chars | 330 MB/s |
double_conversion | 250 MB/s |
strtod | 70 MB/s |
On Windows, from_chars is highly optimized (by Stephan T. Lavavej himself). Perhaps you could add which standard libraries are lagging or bad.
I’d be more interested in knowing which ones support it. The only one I have seen mentioned is Visual Studio. This will no doubt improve in the future, thankfully.
Currently, this would not produce portable code since most other standard libraries I have tried do not support it. One portable approach is to rely on abseil.
Implicitly, you’ve added it now to the comments.
With some fiddling (if it is very important and worth the trouble, and after some digging in the relevant docs), one could create an object file with clang-cl and link it in on Linux or with MinGW. The thing has more or less a C API anyway.
Seems easier to use abseil for the time being, no?
Could you please also check against the ‘new’ C++ format? They state that their implementation is also very fast.
As far as I can tell, the “new” way to parse floats in C++ is to use “from_chars” and I address this both in my benchmarks and my post. If you are thinking about something else, would you kindly elaborate?
Aha, ok. I didn’t know that it was based on the from_chars routine (I’m still on C++17). Seeing as this is also quite fast that sounds very good.
Is there any chance that std::format can use your algorithm, or does that have to go through the committee?
Standard libraries could certainly adopt the approach we have designed.
I’ve been doing some tests and I don’t think this is as fast as the algorithm in RapidJSON: https://github.com/Tencent/rapidjson/blob/7e68aa0a21b7800ec98133cb106e49bd6536e25c/include/rapidjson/internal/strtod.h#L131
Am I correct in understanding that the goal of your number parser is to produce identical results to the C++ standard implementation, and this is the source of the performance difference from RapidJSON?
Thanks
You are correct.
RapidJSON has at least two fast-parsing modes. The fast mode, which I think is what you refer to, is indeed quite fast, but it can be off by one ULP, so it is not standard compliant. Boost Spirit similarly offers fast parsing, but again it is not standard compliant.
Our very own simdjson also has a fast number parsing mode…
David Gay’s dtoa.c was updated (2016) with a 96-bit big float, which speeds up his earlier version by quite a bit: an order of magnitude or more in some cases.
See CHANGES dated 20160429:
http://www.netlib.org/fp/
Thank you.
Note for people reading this: dtoa goes in the opposite direction.
The file name is confusing; dtoa.c contains both dtoa() and strtod(). I suppose Albert meant the latter.
@Marcin Yes, and I include a benchmark against the updated code at https://github.com/lemire/simple_fastfloat_benchmark
I do not think that the string parsing has been made orders of magnitude faster. At least, my tests do not reveal much difference.