Parsing floats in C++: benchmarking strtod vs. from_chars

Programmers often need to convert a string into a floating-point number. For example, you might get the string “3.1416” and you would like to get the resulting value as a floating-point type.

In C/C++, the standard way is the strtod function:

#include <cstdlib> // strtod

const char * string = "3.1416";
char * string_end = nullptr;
double x = strtod(string, &string_end);
if (string_end == string) {
  // you have an error!
}

Unfortunately, the strtod function is locale-sensitive, like many of these string-processing functions. It means that it may behave differently depending on the system where you run the code. However, all runtime libraries appear to have locale-aware variants such as strtod_l (POSIX) or _strtod_l (Windows). You can use these functions to specify the default “C” locale explicitly and thus get behaviour that is the same on all systems.
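
For instance, here is a sketch of what that can look like on a POSIX-style system (the header locations differ across platforms, and on Windows the analogue is _strtod_l with _create_locale; this is not the benchmark's code):

// Sketch (POSIX): parse with an explicit "C" locale so the result does not
// depend on the user's locale. On glibc, strtod_l may require _GNU_SOURCE.
#include <cstdlib>
#include <locale.h> // newlocale, locale_t

double parse_double_c_locale(const char * str, char ** end) {
  // The "C" locale object is cached for the lifetime of the program.
  static locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
  return strtod_l(str, end, c_locale);
}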

One nice feature of the strtod function is that it gives you back a pointer to the end of the parsed number. In this manner, you can efficiently parse a sequence of comma-separated numbers, for example, as sketched below.
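
For example, a sketch of that pattern (error handling kept minimal):

#include <cstdlib>
#include <vector>

// Sketch: walk a comma-separated list such as "1.0,2.5,3.25" by resuming
// at the end pointer that strtod reports after each parsed number.
std::vector<double> parse_comma_separated(const char * str) {
  std::vector<double> values;
  const char * p = str;
  while (*p != '\0') {
    char * end = nullptr;
    double value = strtod(p, &end);
    if (end == p) { break; } // nothing parsed: stop (or report an error)
    values.push_back(value);
    p = end;
    if (*p == ',') { ++p; } // skip the separator
  }
  return values;
}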

You can also use C++ streams but they are typically not geared toward performance.
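
For reference, a stream-based version might look like this:

#include <sstream>
#include <string>

// Sketch: stream-based parsing is convenient, but typically slower.
double parse_with_stream(const std::string & s) {
  std::istringstream in(s);
  double x = 0;
  if (!(in >> x)) {
    // you have an error!
  }
  return x;
}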

In C++17, we got a nicer option: the from_chars function. It works similarly:

#include <charconv> // std::from_chars (C++17)
#include <string>
#include <system_error> // std::errc

std::string st = "3.1416";
double x;
auto [p, ec] = std::from_chars(st.data(), st.data() + st.size(), x);
if (p == st.data() || ec != std::errc()) {
  // you have an error!
}

The great thing about from_chars is that it is, by definition, locale-independent. Furthermore, it is standardized, so it should be available everywhere. Unfortunately, to my knowledge, the only C++ compiler to come with a fully implemented from_chars function is Visual Studio 2019. And you need to have a recent release.

Indeed, though C++17 has been out for some time, and many compilers claim to fully support it, the runtime libraries need to catch up! You can get independent from_chars implementations, of course, but it would be nice if runtime libraries supported C++17 out of the box.
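
If you want code that falls back gracefully in the meantime, one option is to test the library feature-test macro; here is a sketch (with the caveat that __cpp_lib_to_chars covers the charconv functions as a whole, so it is only an approximation of floating-point support):

#include <cstdlib>
#include <string>
#if defined(__has_include)
#  if __has_include(<charconv>)
#    include <charconv> // defines __cpp_lib_to_chars when supported
#  endif
#endif

// Sketch: prefer from_chars when the standard library advertises it,
// otherwise fall back to strtod.
double parse_double(const std::string & s) {
  double x = 0;
#if defined(__cpp_lib_to_chars)
  std::from_chars(s.data(), s.data() + s.size(), x);
#else
  x = strtod(s.c_str(), nullptr);
#endif
  return x;
}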

Still, we should celebrate that Microsoft is doing the right thing and eagerly supporting the latest standards.

I heard good things about the Visual Studio from_chars; among other things, that it is very fast.

I wanted to know how much faster it was. So I wrote a benchmark! I generate a large number of random values in the [0,1] interval, write them out as strings with 17 significant digits, and record how long it takes to parse them back.
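
The core of the measurement is along these lines (a simplified sketch, not the actual benchmark code, and assuming a standard library with floating-point from_chars; the strtod version is analogous):

#include <charconv>
#include <chrono>
#include <cstdio>
#include <random>
#include <string>
#include <vector>

int main() {
  // Generate random values in [0, 1) and write them out with 17 significant digits.
  std::mt19937_64 rng(1234);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  std::vector<std::string> numbers;
  double volume = 0;
  for (int i = 0; i < 100000; i++) {
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%.17g", dist(rng));
    numbers.emplace_back(buffer);
    volume += numbers.back().size();
  }
  // Time how long it takes to parse them back and report the throughput.
  auto start = std::chrono::steady_clock::now();
  double sum = 0;
  for (const std::string & s : numbers) {
    double x = 0;
    std::from_chars(s.data(), s.data() + s.size(), x);
    sum += x; // keep the result live so the loop is not optimized away
  }
  auto finish = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(finish - start).count();
  std::printf("%.2f MB/s (checksum %.3f)\n", volume / seconds / 1e6, sum);
  return 0;
}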

strtod (C) 120 MB/s
from_chars (C++17) 140 MB/s

You do get a nice 20% performance boost, at no cost whatsoever. Brilliant.

Further reading: Comparing strtod with from_chars (GCC 12)

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

27 thoughts on “Parsing floats in C++: benchmarking strtod vs. from_chars”

  1. 20% doesn’t seem like a lot considering how the marketing copy for the new C++17 number conversion routines claimed that “[if] your serialization code is bottlenecked by floating-point printing, this will accelerate your code by roughly 3x to 30x (yes, times, not percent).” https://isocpp.org/blog/2020/08/cppcon-2019-floating-point-charconv-making-your-code-10x-faster-with-cpp17s

    Also not obvious at first blush is why the standard (at least as quoted by cppreference) doesn’t guarantee interoperability when both sides use IEEE754 floating point: “The guarantee that std::from_chars can recover every floating-point value formatted by to_chars exactly is only provided if both functions are from the same implementation.” Why did they leave this wiggle room (vs stating something about a requirement about the std::numeric_limits<> of the types on the two platforms)? How could Linux and Windows, or Windows-on-x86 and Windows-on-x86_64 FAIL to round trip yet each individually conform to the standard? https://en.cppreference.com/w/cpp/utility/to_chars

    1. “Also not obvious at first blush is why the standard (at least as quoted by cppreference) doesn’t guarantee interoperability when both sides use IEEE754 floating point.”

      I take it that it is a pessimistic view.

    2. Printing ≠ parsing!

      Printing became an order of magnitude faster because itʼs using brand new algorithms to print floating point numbers.

      Parsing “only” omits locale handling, so thereʼs a limit to how much faster it could get.

  2. 20% performance boost? To be correct, this is a little above 14%, not 20%. And yes, it is a meaningful increase, but let’s be correct.

      1. I initially just calculated the decrease from the upper value to the lower one, not the increase from the lower to the upper.

    1. Take into account that both values (1.40 and 1.20) were rounded to two significant digits, and add the fact that the measurement has some uncertainty to begin with (~3%)… So I need to round the ratio to one significant digit. Hence 20% is fair.

      140 / 120 - 1 ≈ 0.17

  3. It would be interesting to see the speed vs. the classic good open source strtod from “dtoa.c” ; historically the MSVC CRT strtod has been about 10X slower than “dtoa.c” so I’d be more pleased if they fixed their 10X slowdown. (the MUSL CRT strtod was 100X slower)

      1. I don’t understand where I could have meant “atof”?

        dtoa.c (.c indicates a source code file) is the classic gold standard for a good strtod that is both accurate and fast. It can be updated with modern instructions (64 bit registers and clz) to be even faster.

        http://www.ampl.com/netlib/fp/dtoa.c

        Rough order of magnitude, 140 MB/s looks about 10X too slow for strtod.

        On most doubles, strtod only has to do *= 10, += digit, so it should be only a few cycles per digit, not hundreds of cycles per digit. Doubles that require more work than that are very rare.

        dtoa.c averages around 2 cycles per byte so these times look very slow indeed.

        1. I added the netlib strtod that Charles linked and ran it unmodified (except for changing the name), and it provides a modest speedup:

          # parsing random integers in the range [0,1)
          volume = 2.09808 MB
          strtod : 125.83 MB/s (+/- 1.4 %)
          netlib : 174.48 MB/s (+/- 4.1 %)

          This works out to about 15 cycles per byte on my machine, considerably more than the 2 cycles per byte Charles suggests it should achieve.

          Looking at the main loop, it seems like it is not going to get to 2 cycles (there are at least one or two mispredicts every time a 0 appears), but 15 does seem like too much. The compiler does a poor job in various places too.

          1. Hi everyone,

            The “random” part of the benchmark is why this is so far off from Charles’s 10x guess. (Travis: I assume you didn’t mean “random integers in [0, 1)” since there aren’t a lot of those to go around.)

            The standard fast path (which is in David Gay’s code) for correctly rounded decimal to double conversion is when you have 15 significant decimal digits (or 16 significant digits <= 2^53). Then you have an exactly represented integer (from the parsed decimal digits) divided by an exactly represented power of ten. That is a single IEEE-754 operation and hence correctly rounded. (You can also support ‘e’ on the fast path with a branch to either divide or multiply by a power of ten as long as the net power of ten is exactly representable.)

            If you’re parsing hand-written double literals in a programming language parser or calculator program then you’re almost never going to need more than 15 digits and hence the fast path will be hit almost all the time. On the other hand, there aren’t really enough hand-written double literals in a typical source file for the performance to be a bottleneck.

            Where performance really matters is when you’re trying to round-trip large arrays of doubles, and they will often have noise in the less significant digits (from accumulated round-off unrelated to any serialization issues) which can push you off the fast path most of the time.

            A relevant factor for hitting the fast path is whether the doubles you’re reading were printed with the minimum number of required digits. In Daniel’s data generator in the benchmark, his code appears to be printing the minimum (at most 17 digits for doubles). But if we assume a uniform distribution on [0, 1) and a fast path that only works with 15 digits, I think that means only 1% of cases will hit the fast path.
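
            A rough sketch of that fast path (an illustration only, not David Gay’s actual code, ignoring signs and exponents):

              #include <cstdint>

              // Sketch: with at most 15 significant digits, the parsed mantissa fits
              // exactly in a double (10^15 < 2^53) and 10^frac_digits is exactly
              // representable, so a single IEEE-754 division is correctly rounded.
              static const double powers_of_ten[] = {
                  1e0, 1e1, 1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
                  1e8, 1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15};

              bool fast_path_parse(const char * p, double * out) {
                std::uint64_t mantissa = 0;
                int digits = 0, frac_digits = 0;
                bool seen_dot = false;
                for (; *p != '\0'; p++) {
                  if (*p == '.' && !seen_dot) { seen_dot = true; continue; }
                  if (*p < '0' || *p > '9') { return false; }
                  mantissa = mantissa * 10 + std::uint64_t(*p - '0');
                  digits++;
                  if (seen_dot) { frac_digits++; }
                }
                if (digits == 0 || digits > 15) { return false; } // defer to a slow path
                *out = double(mantissa) / powers_of_ten[frac_digits];
                return true;
              }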

            1. From what I can tell most of the time is spent simply parsing the characters into an integer, not any slow path associated with more than 15 digits.

              That is, the loop that does:

              result = result * 10 + (*str - '0');

              It does a bit more than that because it handles zeros specially (I suppose to avoid overflow in the case of many trailing zeros).

              This parsing would need to happen regardless of 15 or 17 digits, so I’m not seeing the slow path unless it’s related to the trailing zero handling I mentioned.

              I did make some tweaks to the netlib algo and got a good speedup to around 600 MB/s, but still pretty far away from Charles’ estimate.

  4. Not that I think that it will increase the performance significantly, but the handling of ‘string_end’ is not correct. It doesn’t have to be initialized and the check for nullptr is not necessary. When strtod() succeeds, string_end will point to the terminating zero of string. So the error check should read: if (*string_end != '\0') { failed(); }

    1. “Why is the benchmark including the cost of the static c_locale?”

      I am not sure what you mean by cost. Both strtod_l and strtod depend on a locale. In this case, I specify the default locale.

      As explained in the post, strtod is locale-sensitive and we want a locale-insensitive version for comparison purposes.

      “Why not time based on a batch rather than on individual parses?”

      Because I am interested in throughput, as indicated by the fact that I express my results in MB/s. I am not interested in latency in this case.

      “Does the benchmark actually test and/or include all possible ways a float can be represented?”

      It does not. Have you checked the code?

      The exact performance number will depend on how the values were serialized, with the number of digits being the most important factor. However, if you want a full roundtrip, you need 17 digits in general for this problem. I have fixed the number of digits to a flat 17 throughout.

      In practice, if you are parsing a lot of floats, chances are good that they are following the same format.

      “Another flawed benchmark.”

      Can you point at a better from_chars to strtod benchmark?

      You offer a link to your own library, but it does not seem to include such a benchmark?

  5. I tested from_chars in a gcc 11 snapshot (g++ (GCC) 11.0.0 20200913 (experimental)) and currently it is actually considerably slower than strtod.

    Of course, it might speed up before release.

    # parsing random integers in the range [0,1)
    volume = 2.09808 MB
    strtod : 126.03 MB/s (+/- 2.5 %)
    from_chars : 82.09 MB/s (+/- 2.5 %)

    It is not a given that from_chars will be faster than strtod either: the former has a “round trip” requirement that the latter doesn’t.

  6. You may want to have a look at https://github.com/yb303/swar.
    It’s only a header really.
    I wrote it a few years back when I was bored between jobs and now cleaned up and uploaded.
    It provides a few variants of atoi / itoa, without using SSE.
    SSE may make it a bit faster for strings longer than 8 digits (ints greater than 99999999).
