JSON is arguably the standard data interchange format on the Internet. It is text-based and needs to be “parsed”: the input string must be transformed into a convenient data structure. In some instances, when the data volume is large, ingesting JSON is a performance bottleneck.
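To make “parsing” concrete, here is a minimal illustration using Python's standard json module (simdjson itself is a C++ library; this is only meant to show what a parser does):

```python
import json

# The raw input: just a string of characters.
text = '{"name": "simdjson", "validated": true, "sizes": [50, 100]}'

# Parsing transforms the text into a convenient in-memory structure
# (dicts, lists, strings, numbers, booleans).
doc = json.loads(text)

print(doc["name"])      # a Python str: simdjson
print(doc["sizes"][1])  # a Python int: 100
```

A fast parser has to do this transformation, including validation of the input, while touching each input byte as few times as possible.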
Last week, we quietly made available a new library to parse JSON called simdjson. The objective of the library is to parse large volumes of JSON data as quickly as possible, possibly reaching speeds of gigabytes per second, while fully validating the input. We parse everything and still try to go fast.
As far as we can tell, it might be the first JSON parser able to process data at a rate of gigabytes per second.
The library quickly became popular, and for a few days it was the second most popular repository on GitHub. As I write these lines, more than a week later, the library is still the second “hottest” C++ library on GitHub, ahead of famous machine-learning libraries like tensorflow and opencv. I joked that we had, for a time, beaten deep learning: tensorflow is a massively popular deep-learning library.
As is sure to happen when a piece of software becomes a bit popular, many interesting questions get raised. Do the results hold up to scrutiny?
One point that we anticipated in our paper and in the software's documentation is that parsing small JSON inputs is outside our scope. There is a qualitative difference between parsing millions of tiny documents and a few large ones. We are only concerned with sizeable JSON documents (e.g., 50 kB and more).
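The distinction matters because every parse call carries fixed costs (setup, allocation, dispatch) that are amortized over a large input but can dominate on tiny ones. The following sketch, again using Python's standard json module rather than simdjson, shows one way to compare aggregate throughput on the same bytes presented as one large document versus many tiny ones; the exact numbers depend on the parser and the machine:

```python
import json
import time

def throughput_mb_per_s(docs):
    """Parse every document once and return aggregate throughput in MB/s."""
    total_bytes = sum(len(d) for d in docs)
    start = time.perf_counter()
    for d in docs:
        json.loads(d)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

record = '{"id": 1, "name": "x", "tags": ["a", "b"]}'
# The same payload twice: once as a single ~45 kB array,
# once as 1000 tiny independent documents.
large = "[" + ",".join([record] * 1000) + "]"
small = [record] * 1000

print(f"one large document:   {throughput_mb_per_s([large]):.1f} MB/s")
print(f"many tiny documents:  {throughput_mb_per_s(small):.1f} MB/s")
```

Benchmarks built on tiny documents therefore measure per-call overhead as much as parsing speed, which is why they fall outside the scope we set for simdjson.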
With this scope in mind, how are our designs and code holding up?
There is already a C# port by Egor Bogatov, a Microsoft engineer. He finds that in several instances, his port is several times faster than the alternatives. I should stress that his code is less than a week old.
Where next? I do not know. We have many exciting ideas. Porting this design to ARM processors is among them.