JSON is arguably the standard data interchange format on the Internet. It is text-based and needs to be “parsed”: the input string needs to be transformed into a convenient data structure. In some instances, when the data volume is large, ingesting JSON is a performance bottleneck.
Last week, we discreetly made available a new library to parse JSON called simdjson. The objective of the library is to parse large volumes of JSON data as quickly as possible, possibly reaching speeds in the gigabytes per second, while providing full validation of the input. We parse everything and still try to go fast.
As far as we can tell, it might be the first JSON parser able to process data at a rate of gigabytes per second.
The library quickly became popular, and for a few days it was the second most popular repository on GitHub. As I am writing these lines, more than a week later, the library is still the second “hottest” C++ library on GitHub, ahead of famous machine-learning libraries like tensorflow and opencv. I joked that we have beaten, for a time, deep learning: tensorflow is a massively popular deep-learning library.
We have a paper full of benchmarks and my co-author (Geoff) has a great blog post presenting our work.
As is sure to happen when a piece of software becomes a bit popular, a lot of interesting questions are raised. Do the results hold up to scrutiny?
One point that we have anticipated in our paper and in the documentation of the software is that parsing small JSON inputs is outside of our scope. There is a qualitative difference between parsing millions of tiny documents and a few large ones. We are only preoccupied with sizeable JSON documents (e.g., 50kB and more).
With this scope in mind, how are our designs and code holding up?
There is already a C# port by Egor Bogatov, a Microsoft engineer. He finds that in several instances, his port is several times faster than the alternatives. I should stress that his code is less than a week old.
The C++ library has been available for less than a week, so it is still early days. Nevertheless, clever programmers have published prototypical bindings in Python and JavaScript (Node). In both instances, we are able to beat the standard parsers in some reasonable tasks. A significant fraction of the running time is spent converting raw results from C++ into either Python or JavaScript objects. But even with this obstacle, the Python wrapper can be several times faster than the standard Python JSON parser.
It is always going to require significant engineering to get great performance when you need to interface tightly with a high-level language like Python or JavaScript… When you really need the performance, the trick is to push the computation down within the guts of the C/C++ code.
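To illustrate the principle with a toy example: this is not the actual binding code of any of these wrappers (pybind11 is real, but the module and function names below are made up); the idea is simply that the extension exposes a narrow entry point, runs the whole loop in C++, and hands a single integer back to Python.

    // Not the binding code of the Python or Node wrappers; just a pybind11
    // sketch of the principle, with made-up names. The loop runs entirely in
    // C++ and only one integer crosses back into Python, so no per-match
    // Python object is ever materialized.
    #include <pybind11/pybind11.h>
    #include <cstddef>
    #include <string>

    // Count how many times 'needle' occurs in 'haystack', all on the C++ side.
    static std::size_t count_occurrences(const std::string &haystack,
                                         const std::string &needle) {
        if (needle.empty()) return 0;
        std::size_t count = 0;
        for (std::size_t pos = haystack.find(needle); pos != std::string::npos;
             pos = haystack.find(needle, pos + needle.size())) {
            ++count;
        }
        return count;
    }

    PYBIND11_MODULE(pushdown_demo, m) {
        m.def("count_occurrences", &count_occurrences,
              "Count substring occurrences without returning to Python per match");
    }

From Python, this would be called as pushdown_demo.count_occurrences(text, pattern), and the only thing crossing the boundary on the way back is one integer.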
Where next? I do not know. We have many exciting ideas. Porting this design to ARM processors is among them.
Some of the Python guys are already exposing interesting possibilities of ‘partial’ or lazy materialization of the JSON into the language. What they are doing at this stage is just suppressing the sometimes expensive construction of things in the language (we are still parsing everything under the hood). But it might later be a nice hook to explore lazy materialization in more detail.
Small files are still a work in progress. I don’t want to break performance properties there, but I’m a bit skeptical that a workload that’s over in a microsecond or so is even properly measured by our infrastructure. That’s a good one for a HELP WANTED tag.
Along with the libraries in various languages, one popular JSON tool is jq (https://github.com/stedolan/jq).
jq is part of many data processing pipelines, and a boost to jq’s processing speeds via simdjson would be a great thing to see.
I agree, this seems worthwhile. Anyone want to look at it?
I wonder how it would perform if the 256 bit instructions were emulated by four (or more) non-simd 64 bit instructions.
While not optimal, performance may be good compared to bytewise implementations. It still processes more bytes in parallel and has low data dependencies.
I agree that it is an interesting question.
In the paper, we provide a “model” (using linear regression) and from this model it should be possible to reason about such things.
Note that the width is only part of the story. Some SIMD instructions are just very slow to emulate with scalar instructions because there is no scalar instruction equivalent. Consider for example PSHUFB, which effectively acts as 32 parallel lookups in a 16-entry table, and no doubt features heavily in simdjson (I could be wrong, but let’s be realistic: we’re talking about Geoff “middle name PSHUFB” Langdale and Daniel “always looking for a new place to apply PSHUFB” Lemire here).
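To make that concrete, the 16-entry lookup looks roughly like this with the AVX2 intrinsic (an illustrative sketch, not code from simdjson; note the table has to be duplicated in both 128-bit lanes because the instruction shuffles within lanes):

    #include <immintrin.h>

    // 32 parallel lookups in a 16-entry table with a single VPSHUFB.
    // Each byte of 'input' is masked down to its low nibble (0..15), which then
    // selects a byte from 'table16': the same 16 entries duplicated in both
    // 128-bit lanes, since _mm256_shuffle_epi8 only shuffles within each lane.
    static inline __m256i nibble_lookup(__m256i input, __m256i table16) {
        __m256i low_nibbles = _mm256_and_si256(input, _mm256_set1_epi8(0x0f));
        return _mm256_shuffle_epi8(table16, low_nibbles);
    }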
Using 64-bit instructions, assuming a 64-bit PSHUFB existed, you can only process (look up) 8 bytes at a time, which is the 4x reduction due to width, but also now your lookup table is only 8 entries, so if you actually needed 16-entry tables, you may have to do two lookups and a bunch of merging or something like that. So this is a case where a width reduction has a quadratic rather than linear effect (it’s 4 * 2 not 4 * 4 because of the in-lane behavior of PSHUFB in AVX2 but that’s not fundamental).
Even that scenario is only a fantasy though! There is of course no 64-bit scalar PSHUFB instruction: those kinds of shuffle instructions generally only exist in the SIMD extensions in the first place. So how are you going to emulate PSHUFB with the basic scalar operations you’d find in any ISA? It’s going to be really slow (unless there is some trick I’m missing).
This is why SIMD algorithms often look very different to their scalar counterparts: it’s not just the width that differs, but rather that whole operations are not available on both sides of the fence.
This difference flows both ways too! There are scalar techniques that don’t scale to SIMD, such as many things involving loads (e.g., larger lookup tables) or loops with divergent control flow per iteration. There are also scalar operations not even available in SIMD on x86 like the carry-less multiply thing, 64×64->128 multiplication, pdep and pext and various other bit-bashing ops (although future AVX-512 extensions should add some of these).
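For instance, pdep and pext exist only as scalar instructions today; a quick illustration of what they do (plain BMI2 intrinsics; the wrapper names are just for illustration):

    #include <immintrin.h> // BMI2 intrinsics: _pdep_u64 / _pext_u64
    #include <cstdint>

    // PEXT gathers the bits of 'value' selected by 'mask' and packs them into
    // the low bits of the result; PDEP is the inverse, scattering the low bits
    // of 'value' out to the positions selected by 'mask'. Neither currently has
    // a per-lane SIMD counterpart on x86.
    static inline uint64_t extract_bits(uint64_t value, uint64_t mask) {
        return _pext_u64(value, mask);
    }

    static inline uint64_t deposit_bits(uint64_t value, uint64_t mask) {
        return _pdep_u64(value, mask);
    }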
What the most efficient general-purpose PSHUFB emulation with scalar instructions looks like is actually an interesting question in itself…
The baseline could be something like writing the input bytes out to memory, then iterating over each 4-bit control in the shuffle mask, using it to look up the corresponding input byte and storing it to the output (either byte-by-byte, or in 64-bit chunks built up with some shift + OR).
So for 32-byte PSHUFB you are looking at 32 loads plus a bunch of other stuff, probably around 100 instructions at a minimum for this approach!
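For the 16-byte (SSE) case, that baseline boils down to something like the following sketch, one table lookup per output byte (just the emulation idea described above, not anything taken from simdjson):

    #include <cstdint>

    // Scalar emulation of a 16-byte PSHUFB: each control byte's low 4 bits pick
    // a source byte, and a set high bit in the control yields zero (matching the
    // SSE semantics). One lookup per output byte; a 32-byte version needs twice
    // as many, plus the loads and stores to spill and reload the vector data.
    static void pshufb_scalar(const uint8_t src[16], const uint8_t ctrl[16],
                              uint8_t out[16]) {
        for (int i = 0; i < 16; i++) {
            out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0f];
        }
    }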
It would be amusing to try to emulate PSHUFB with a shift and a conditional move. I think the load solution would be faster, though. You could merge 2 4-bit values together and look up a 256-way table, so only 16 loads. Also you might need to emulate the high-bit behavior.
This is, however, silly. To substitute the use of PSHUFB in simdjson one would just look up 8-bit values in an 8-bit table with scalar code. This would be way faster than faking up SWAR replacements for SIMD.
The SWAR approach to searching for particular bytes is reasonable – some silliness with carry, add and multiply and off you go. This was still worth doing in Hyperscan. There might be some arguably clever SWAR’isms for some of the other bits and pieces in simdjson, but I can’t imagine why you would want to do this.
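For reference, the classic SWAR test for “does this 64-bit word contain byte X anywhere” goes something like this (the textbook bit-twiddling trick, not code lifted from Hyperscan or simdjson):

    #include <cstdint>

    // SWAR byte search: the multiply broadcasts 'value' to every byte position,
    // the XOR turns matching bytes into 0x00, and the subtract/AND step detects
    // any all-zero byte through its borrow behaviour.
    static inline bool word_has_byte(uint64_t word, uint8_t value) {
        const uint64_t ones  = 0x0101010101010101ULL;
        const uint64_t highs = 0x8080808080808080ULL;
        uint64_t x = word ^ (ones * value); // bytes equal to 'value' become zero
        return ((x - ones) & ~x & highs) != 0;
    }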
All that being said, a scalar stage 1 would almost certainly perform best if built from scratch. I make a similar argument with ARM NEON.
Agreed, although it still supports my original point, which is that it can be misleading to look at, say, 256-bit wide SIMD and then suppose a 4x drop in efficiency when moving to 64-bit scalar instructions, because some SIMD patterns suffer a much larger drop: in the case of single-byte lookups as you suggest, it’s roughly a 32x gap.
So yeah, you’d not want to emulate PSHUFB directly in a scalar version of simdjson, but it did make me curious what the scalar emulation would look like, which is interesting in various scenarios such as source-compatible emulation of SIMD intrinsics on platforms that don’t support them, like this.
I bought a cheap laptop to use while traveling, which comes with a low-performance AMD A6-9225 CPU, and thought I’d give simdjson a shot, in part because I was curious about SSE performance.
simdjson : 4.027 cycles per input byte (best) 4.156 cycles per input byte (avg) 0.644 GB/s (error margin: 0.020 GB/s)
rapid : 9.478 cycles per input byte (best) 9.828 cycles per input byte (avg) 0.274 GB/s (error margin: 0.010 GB/s)
sasjon : 9.025 cycles per input byte (best) 11.105 cycles per input byte (avg) 0.288 GB/s (error margin: 0.054 GB/s)
simdjson (just parse) : 4.006 cycles per input byte (best) 4.103 cycles per input byte (avg) 0.648 GB/s (error margin: 0.015 GB/s)
rapid (just parse) : 8.303 cycles per input byte (best) 10.494 cycles per input byte (avg) 0.312 GB/s (error margin: 0.065 GB/s)
sasjon (just parse) : 7.871 cycles per input byte (best) 10.036 cycles per input byte (avg) 0.330 GB/s (error margin: 0.071 GB/s)
simdjson (just dom) : 0.620 cycles per input byte (best) 0.726 cycles per input byte (avg) 4.175 GB/s (error margin: 0.609 GB/s)
rapid (just dom) : 1.138 cycles per input byte (best) 1.510 cycles per input byte (avg) 2.278 GB/s (error margin: 0.562 GB/s)
sasjon (just dom) : 0.610 cycles per input byte (best) 0.809 cycles per input byte (avg) 4.249 GB/s (error margin: 1.044 GB/s)
Not bad at all: even on this slow chip it’s pushing 0.6 GB/s, with more than a 2x performance increase over the next fastest.
Thanks for trying it out.