Programmers often write numbers out as strings (e.g., 3.1416) and later need to read the numbers back from those strings. If you read and write JSON or CSV files, you do this work all of the time.
Previously, we showed that we could parse floating-point numbers at a gigabyte per second or better in C++ and in Rust, several times faster than the conventional approach. In Go 1.16, our approach improved parsing performance by up to a factor of two.
Not everyone programs in C++, Rust or Go. So what about porting the approach to C#? csFastFloat is the result!
For testing, we rely on two standard datasets, canada and mesh. The mesh dataset is made of “easy cases” whereas the canada dataset is more difficult. We use .NET 5 and an AMD Rome processor for testing.
| parser | canada | mesh |
|---|---|---|
| Double.Parse (standard) | 3 million floats/s | 11 million floats/s |
| csFastFloat (new) | 20 million floats/s | 35 million floats/s |
Importantly, the new approach should give exactly the same results as the standard parser. That is, we do not trade accuracy for speed.
Can this help in the real world? I believe that the most popular CSV (comma-separated values) parsing library in C# is probably CsvHelper. We patched CsvHelper so that it would use csFastFloat instead of the standard library. Across a set of five float-intensive benchmarks, we found gains ranging from 8% to 2x. Your mileage will vary depending on your data and your application, but you should see some benefits.
Why would you see only an 8% gain some of the time? Because, in that particular case, only about 15% of the total running time has to do with number parsing. The more you optimize the parsing in general, the more benefit you should get out of fast float parsing.
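The arithmetic behind this is Amdahl's law: if a fraction p of the running time is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). Here is a rough sketch in Python, chosen only for brevity. The function name is ours, the 15% figure is the one quoted above, and the parsing speedups in the loop are hypothetical values, not measurements.

```python
def overall_speedup(parse_fraction, parse_speedup):
    """Amdahl's law: overall speedup when only number parsing gets faster."""
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parse_speedup)

# About 15% of the running time is number parsing (figure from the text above).
p = 0.15

# Hypothetical speedups of the number parsing alone (not measured values).
for s in (2, 4, 8):
    gain = (overall_speedup(p, s) - 1.0) * 100
    print(f"parsing {s}x faster -> about {gain:.0f}% overall gain")

# Even infinitely fast number parsing cannot beat this bound.
bound = (1.0 / (1.0 - p) - 1.0) * 100
print(f"upper bound: about {bound:.0f}% overall gain")
```

Under these assumptions, doubling the speed of number parsing alone yields roughly an 8% overall gain, and no improvement to number parsing can push the overall gain past about 18% while the other 85% of the work is unchanged.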
The package is available on NuGet.
Credit: The primary author is Carl Verret. We would like to thank Egor Bogatov from Microsoft who helped us improve the speed, changing only a few lines of code, by making use of his deep knowledge of C#.
Hey Daniel, as always, really nice work!
Since C# is open source and your approach is accurate, have you considered making a PR into the C# runtime so everyone can benefit from that?
https://github.com/dotnet/runtime/blob/308ae6ad833089199b8afbf30a7b402f35190fc8/src/libraries/System.Private.CoreLib/src/System/Double.cs#L284
An issue has been opened. Meanwhile, we are working hard to make the library as usable as possible.
Hi Daniel, a couple of questions. I was just about to ask what data format you were using for some of the integer libraries when I realized that a lot of these are parsing text files.
So when you say “parsing” floats or integers, should I understand that this means parsing a text representation of these values? Is that implied in the term “parse”, such that we wouldn’t say we were parsing if the data was binary?
And then with these floats, I noticed the data files have a lot of content before the floats themselves. In many of the files, there are lots of leading zeros before each number (more than eight). What are those about? And then some files have a bunch of hex before each number, like this file:
https://github.com/CarlVerret/csFastFloat/blob/master/TestcsFastFloat/data_files/tencent-rapidjson.txt
What is that hex data? Are those supposed to be floats also? It seems like the floats come at the end of each row, after a lot of hex. An easier example is this one:
58A8 43150000 4062A00000000000 149
The float is 149, and the two longer strings in the middle are different hex representations of 149 as a float. But I don’t know what 58A8 is. Is csFastFloat doing anything with those hex strings? Which representation is actually being parsed?
We parse strings representing numbers in decimal form.
These files you are looking at are test files for internal use, and not part of the library. We use them for testing.
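For readers curious about the hexadecimal columns in the example row above: they appear to be the IEEE 754 bit patterns of 149 in half, single and double precision, in that order. This reading is our own and is not stated by the library authors. The short Python sketch below checks it with the standard `struct` module; Python is used only for convenience, and the helper name `ieee_hex` is hypothetical.

```python
import struct

def ieee_hex(value, fmt):
    """Big-endian IEEE 754 bit pattern of value, as uppercase hex."""
    return struct.pack(">" + fmt, value).hex().upper()

# Example row quoted in the comment above.
row = "58A8 43150000 4062A00000000000 149".split()

# Only the last column, the decimal string, is what a parser would read.
value = float(row[3])

# struct format codes: 'e' = binary16, 'f' = binary32, 'd' = binary64.
for expected, fmt in zip(row[:3], "efd"):
    actual = ieee_hex(value, fmt)
    print(fmt, actual, actual == expected)
```

If this interpretation holds, the hex columns serve as expected answers in the test files, while the decimal string in the last column is the only input that gets parsed.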