JSON parsing: simdjson vs. JSON for Modern C++

JSON is the ubiquitous data format on the Internet. There is a lot of JSON that needs to be parsed and validated.

As we just released the latest version of our fast JSON parser (simdjson), a reader of this blog asked how it compared in speed against one of the nicest C++ JSON libraries: JSON for Modern C++ (nlohmann/json).

We have not reported on benchmarks against “JSON for Modern C++” because we knew that it was not designed for raw speed. It is designed for ease-of-use: it makes the life of the programmer as easy as possible. In contrast, simdjson optimizes for speed, even when it requires a bit more work from the programmer. Nevertheless, it is still interesting to compare speed, to know what your trade-off is.

Let us run some benchmarks… I use a Skylake processor with GNU GCC 8.3.

file simdjson JSON for Modern C++
apache_builds 2.3 GB/s 0.098 GB/s
github_events 2.5 GB/s 0.093 GB/s
gsoc-2018 3.3 GB/s 0.091 GB/s
twitter 2.2 GB/s 0.094 GB/s

Thus, roughly speaking, simdjson can parse and validate a JSON document, and build a DOM tree, at a speed of about 2 GB/s. The easier-to-use “JSON for Modern C++” has a speed of about 0.1 GB/s, so about 20x slower. As a reference, we can easily read files from disk or the network at speeds higher than 2 GB/s.

Link: simdjson.

A new release of simdjson: runtime dispatching, 64-bit ARM support and more

JSON is a ubiquitous data exchange format. It is found everywhere on the Internet. To consume JSON, software uses tools called JSON parsers.

Earlier this year, we released the first version of our JSON parsing library, simdjson. It is arguably the fastest standard-compliant parser in the world. It provides full validation. That is, while we try to go as fast as possible, we do not compromise on input validation, and we still process many files at gigabytes per second.

You may wonder why you would care about going fast. Consider that good disks and network chips can provide gigabytes per second of bandwidth. If you can’t process simple files like JSON at comparable speeds, all the bandwidth in the world won’t help you.

So how do we do it? Our trick is to leverage the advanced instructions found on commodity processors such as SIMD instructions. We are not the first ones to go this route, but I think we are the first ones to go so far.

We have published a new major release (0.2.0). It brings many improvements:

  • The first version only ran on recent Intel and AMD processors. We have added support for the 64-bit ARM processors found in mobile devices. You can run our library on an iPhone. We found that our great speed carries over: we are sometimes close to 2 GB/s.
  • We added support for older PC processors. In fact, we did better: we also introduced runtime dispatching on x64 processors. It means that the software will smartly recognize the features of your processor and adapt, running the best code paths. We support processors as far back as the Intel Westmere or the AMD Jaguar (found in the PlayStation 4).
  • We now support JSON Pointer queries.

Many people contributed to this new release. It is a true community effort.

There is a lot more coming in the future. Among other things, we expect to go even faster.

Link: GitHub.

Science and Technology links (July 27th 2019)

    1. There are thick ice deposits on the Moon. Water in space is important as it can be used to create fuel and to sustain life.
    2. Most animals do not suffer heart attacks and strokes, the way human beings do. It appears that the difference may be due to a specific genetic mutation. Heart attacks are the leading cause of death in the West. In turn, insulin resistance (diabetes) is the most important single cause of coronary artery disease.
    3. Restricting the time period during which you eat (say, from the morning until the early afternoon) reduces your appetite and may help you lose weight.

How fast can a BufferedReader read lines in Java?

In an earlier post, I asked how fast the getline function in C++ could run through the lines in a text file. The answer was about 2 GB/s, certainly over 1 GB/s. That is slower than some of the best disk drives and network connections. If you take into account that software rarely needs to “just” access the lines, it is easy to build a system where text-file processing is processor-bound, as opposed to disk-bound or network-bound.
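The C++ measurement boils down to a loop of the following kind. Here is a minimal self-contained sketch (the function name sum_line_lengths is mine, for illustration); it accepts any input stream, including an in-memory one:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Run through all lines, recording only the total number of characters:
// the cheapest per-line "processing" imaginable.
size_t sum_line_lengths(std::istream &in) {
  std::string line;
  size_t volume = 0;
  while (std::getline(in, line)) {
    volume += line.size(); // getline strips the newline characters
  }
  return volume;
}
```

Timing this loop over a large in-memory buffer (with, say, std::chrono::steady_clock) gives throughput numbers of the kind quoted above.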

What about Java? In Java, the standard way to access lines in a text file is to use a BufferedReader. To avoid system calls, I create a large string containing many lines of text, and then I call a very simple processing function that merely records the length of the strings…

import java.io.BufferedReader;
import java.io.StringReader;

// ...
StringReader fr = new StringReader(data);
BufferedReader bf = new BufferedReader(fr);
bf.lines().forEach(s -> parseLine(s));

// elsewhere:
long volume; // total number of characters seen

public void parseLine(String s) {
  volume += s.length();
}

The result is that Java is at least two times slower than C++, on the same system, for this benchmark:

BufferedReader.lines 0.5 GB/s

This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.

Many firms probably just throw more hardware at the problem.

My source code is available.

Update: An earlier version of the code had a small typo that would create just one giant line. This turns out not to impact the results too much. Some people asked for more technical details. I ran the benchmark on a Skylake processor using GNU GCC 8.1 as a C++ compiler and Java 12, all under Linux. Results will vary depending on your exact configuration.

Programming competition with $1000 in prizes: make my code readable!

Colm MacCárthaigh is organizing a programming competition with three prizes: $500, $300, $200. The objective? Produce the most readable, easy to follow, and well tested implementation of the nearly divisionless random integer algorithm. The scientific reference is Fast Random Integer Generation in an Interval (ACM Transactions on Modeling and Computer Simulation, 2019). I have a few blog posts on the topic such as Nearly Divisionless Random Integer Generation On Various Systems.
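For context, the core of the nearly divisionless algorithm fits in a few lines. Here is a sketch of the 64-bit variant in C++, following the paper; it relies on the GCC/Clang __uint128_t extension, and the rng parameter stands in for any uniform 64-bit generator:

```cpp
#include <cstdint>

// Sketch of the nearly divisionless algorithm: an unbiased random
// integer in [0, s), given a uniform 64-bit random source.
uint64_t nearly_divisionless(uint64_t s, uint64_t (*rng)()) {
  uint64_t x = rng();
  __uint128_t m = (__uint128_t)x * (__uint128_t)s;
  uint64_t l = (uint64_t)m;  // low 64 bits of the product
  if (l < s) {               // slow path, rarely taken
    uint64_t t = (-s) % s;   // 2^64 mod s (the only division)
    while (l < t) {          // rejection loop to remove bias
      x = rng();
      m = (__uint128_t)x * (__uint128_t)s;
      l = (uint64_t)m;
    }
  }
  return (uint64_t)(m >> 64);  // high 64 bits: the result in [0, s)
}
```

The expensive division ((-s) % s) is only reached when the low 64 bits of the product are smaller than s, which happens with probability s/2^64: hence “nearly” divisionless.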

This algorithm has been added to the Swift standard library by Pavol Vaskovic. It has also been added to the Go standard library by Josh Snyder. And it is part of the Python library Numpy thanks to Bernardt Duvenhage and others.

I have a C++ repository on GitHub with relevant experiments as well as a Python extension.

Colm wants the implementation to be licensed under the Apache Software License 2.0. It could be in any programming language you like. The deadline is September 1st 2019. You can find Colm on Twitter.

Arbitrary byte-to-byte maps using ARM NEON?

Modern processors have fast instructions that can operate on wide registers (e.g., 128-bit). ARM processors, the kind of processors found in your phone, have such instructions called “NEON”. Sebastian Pop pointed me to some of his work doing fast string transformations using NEON instructions. Sebastian has done some great work to accelerate the PHP interpreter on ARM processors. One of his recent optimizations is a way to transform the case of strings quickly.

It suggested the following problem to me. Suppose that you have a stream of bytes. You want to transform byte values in an arbitrary manner. Maybe you want to map the byte value 1 to the byte value 12, the byte value 2 to the byte value 53… and so forth.

Here is how you might implement such a function in plain C:

void transform(uint8_t *values, size_t volume, const uint8_t *map) {
  for (size_t i = 0; i < volume; i++) {
    values[i] = map[values[i]];
  }
}

For each byte, you need two loads (to get to map[values[i]]) and one store, assuming that the compiler does not do any magic.

To implement such a function on blocks of 16 bytes with NEON, we use the vqtbl4q_u8 function, which is essentially a way to do 16 independent look-ups in a 64-byte table. It uses the least significant 6 bits of each byte as a look-up index; if either of the two remaining bits is non-zero, it outputs zero. Because there are 256 different byte values, we need four distinct calls to the vqtbl4q_u8 function. One of them gives non-zero results for byte values in [0,64), another one for byte values in [64,128), another one for byte values in [128,192), and a final one for byte values in [192,256). We shift each range of input values down into [0,64) with a bitwise XOR (the veorq_u8 function). Finally, we just need to apply bitwise ORs to glue the results back together (via the vorrq_u8 function).

uint8x16_t simd_transform16(uint8x16x4_t * table, uint8x16_t input) {
  uint8x16_t  t1 = vqtbl4q_u8(table[0],  input);
  uint8x16_t  t2 = vqtbl4q_u8(table[1],  
       veorq_u8(input, vdupq_n_u8(0x40)));
  uint8x16_t  t3 = vqtbl4q_u8(table[2],  
       veorq_u8(input, vdupq_n_u8(0x80)));
  uint8x16_t  t4 = vqtbl4q_u8(table[3],  
       veorq_u8(input, vdupq_n_u8(0xc0)));
  return vorrq_u8(vorrq_u8(t1,t2), vorrq_u8(t3,t4));
}
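To see why the XOR trick selects the right table, it can help to spell out the same logic in scalar code. The sketch below is my own model of the behaviour on a single byte (tbl64 mimics the semantics of vqtbl4q_u8; the names are mine, for illustration):

```cpp
#include <cstdint>

// Scalar model of vqtbl4q_u8 applied to one byte: a 64-byte table
// lookup that returns zero when the index is out of range.
uint8_t tbl64(const uint8_t *table, uint8_t index) {
  return index < 64 ? table[index] : 0;
}

// Scalar model of simd_transform16 for one byte: map is the full
// 256-byte table, viewed as four 64-byte quarters.
uint8_t transform_byte(const uint8_t *map, uint8_t input) {
  // XORing with 0x40, 0x80, 0xc0 moves each 64-value range into
  // [0, 64); for any given input, three of the four lookups are
  // out of range and contribute zero.
  uint8_t t1 = tbl64(map + 0, input);
  uint8_t t2 = tbl64(map + 64, input ^ 0x40);
  uint8_t t3 = tbl64(map + 128, input ^ 0x80);
  uint8_t t4 = tbl64(map + 192, input ^ 0xc0);
  return t1 | t2 | t3 | t4;
}
```

For an input byte in [64,128), for instance, input ^ 0x40 falls in [0,64), so only the second lookup contributes; the three other lookups return zero and disappear in the OR.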

In terms of loads and stores, assuming that you have enough registers, you only need one load and one store per block of 16 bytes.

A more practical scenario might be to assume that all my byte values fit in [0,128), as is the case with a stream of ASCII characters…

uint8x16_t simd_transform16_ascii(uint8x16x4_t * table, 
               uint8x16_t input) {
  uint8x16_t  t1 = vqtbl4q_u8(table[0],  input);
  uint8x16_t  t2 = vqtbl4q_u8(table[1],  
     veorq_u8(input, vdupq_n_u8(0x40)));
  return vorrq_u8(t1,t2);
}

To test it out, I wrote a benchmark which I ran on a Cortex A72 processor. My source code is available. I get a sizeable speed bump when I use NEON with an ASCII input, but the general NEON scenario is slower than a plain C version.

plain C 1.15 ns/byte
neon 1.35 ns/byte
neon (ascii) 0.71 ns/byte

What about Intel and AMD processors? Most of them do not have 64-byte lookup tables. They are limited to 16-byte tables. We need to wait for AVX-512 instructions for wider vectorized lookup tables. Unfortunately, AVX-512 is only available on some Intel processors and it is unclear when it will appear on AMD processors.

Science and Technology links (July 20th 2019)

  1. Researchers solve the Rubik’s cube puzzle using machine learning (deep learning).
  2. There has been a rise in the popularity of “deep learning” following some major breakthroughs in tasks like image recognition. Yet, at least as far as recommender systems are concerned, there are reasons to be skeptical of the good results being reported:

    In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area.

  3. Your blood contains about four grams of glucose/sugar (it is a tiny amount).
  4. Our brain oscillates at a rate of about 40 Hz (40 times per second). Some researchers may have found the cells responsible for coordinating these waves.
  5. Are we suffering from record-setting heat waves? A recent American government report concludes that we are not:

    (…) the warmest daily temperature of the year increased in some parts of the West over the past century, but there were decreases in almost all locations east of the Rocky Mountains. In fact, all eastern regions experienced a net decrease (…), most notably the Midwest (about 2.2°F) and the Southeast (roughly 1.5°F). (…) As with warm daily temperatures, heat wave magnitude reached a maximum in the 1930s.

    The same report observes that cold extremes have become less common, however.

Science and Technology links (July 13th 2019)

  1. Drinking juice increases your risk of having cancer.
  2. College completion rates are increasing and this may have to do with lower standards.
  3. Australia is going to try a flu vaccine that was partially designed with help from an artificial intelligence called Sam. Sam acquires knowledge and has new ideas, according to its creators. I think that this is misleading wording. What is real, however, is that we routinely underutilize software: many designs that are made entirely by human beings could be greatly improved via software. We are nowhere close to the end of the information technology revolution.
  4. Videoconferencing is now common. We know that it does not work as well as it should in part because we keep looking at the screen, not at the camera, and this breaks eye contact… an important behavior for human beings. Apple is deploying a fix for this problem in its own videoconferencing software by slightly altering the image to make it appear as if you were looking straight at the camera. It should be available publicly in a few months. Hopefully, other videoconferencing software will follow the lead.
  5. Omega 3 supplements lower risk of heart attacks. Calcium with vitamin D increases the risk of strokes. Multivitamins and antioxidants have no significant effect on mortality.
  6. People who live in cities are not less civil than people who live in the country. I think that if you live in a city, you have more social interactions and are therefore more likely to have rude encounters per time period. However, people in a big city are just as nice as people in a village, on average.
  7. Women with children who drink alcohol and use illegal substances are more likely to be in a relationship.
  8. Researchers who study obesity frequently (50% of the time) confuse correlation or association with causality.
  9. Though the United States is often thought to be a leader in research, in a comparison between the leading economies (G20), the researchers found that the output and impact of American researchers are in decline and that the output per researcher has fallen below the G20 average.
  10. Most new drugs entering the market are useless. I think that this should be expected: progress is hard and most of our attempts fail, even if we can only recognize the failures much later. In my mind, we should not worry that too many useless drugs are marketed, we should fear that companies shy away from risky new ventures. Don’t blame the innovators when they take risks and fail, encourage them to fail more often, to take bigger risks.
  11. Though our brains make new brain cells all the time, this effect (neurogenesis) can be slowed or prevented by our own immune system.

Parsing JSON quickly on tiny chips (ARM Cortex-A72 edition)

I own an inexpensive card-size ROCKPro64 computer ($60). It has ARM Cortex-A72 processors, the same processors you find in the recently released Raspberry Pi 4. These processors are similar to those in your smartphone, although they are far less powerful than the best that Qualcomm and Apple have to offer.

A common task in data science is to parse JSON documents. We wrote a fast JSON parser in C++ called simdjson; to go fast, it uses fancy (‘SIMD‘) instructions. I wondered how well it would fare on the ARM Cortex-A72 processor.

I am comparing against two top-of-the-line fast parsers (RapidJSON and sajson). I cannot stress enough how good these parsers are: out of dozens of JSON parsers, they clearly stand out. I think it is fair to say that RapidJSON is the ‘standard’ parser in C++. Of course, our parser (simdjson) is good too.

Here are the speeds compared on some of our standard documents; I use GNU GCC 8.3.

file simdjson RapidJSON sajson*
twitter 0.39 GB/s 0.14 GB/s 0.27 GB/s
update-center 0.35 GB/s 0.14 GB/s 0.27 GB/s
github_events 0.39 GB/s 0.20 GB/s 0.29 GB/s
gsoc-2018 0.52 GB/s 0.20 GB/s 0.36 GB/s

So it is interesting to note that code optimized with fancy ‘SIMD‘ instructions runs fast even on small ARM processors.

Yet these speeds are far below what is possible on better processors. Our simdjson parser on an Intel processor can easily achieve speeds well beyond 2 GB/s. There are at least three reasons for the vast difference:

  1. Intel processors have ‘faster’ instructions (AVX2) that can process 256-bit registers. There is no equivalent on ARM. In our case, this means that the Intel processor needs only about 75% as many instructions as the ARM processor.
  2. Intel processors run at a higher clock frequency. It is easy to find an Intel processor that reaches 3.5 GHz or more, maybe much more. My ROCKPro64 runs at 1.8 GHz, so roughly half the speed.
  3. Intel processors easily retire 3 instructions per cycle on a task like JSON parsing whereas the ARM Cortex-A72 is capable of about half of that.

The last two points alone explain about a 4x difference in favour of Intel processors: roughly 2x from the clock frequency and 2x from the instructions retired per cycle.

To reproduce the results, using Docker, you can do…

git clone https://github.com/lemire/simdjson.git
cd simdjson
docker build -t simdjson . 
docker run --privileged -t simdjson

Credit: Geoff Langdale with contributions from Yibo Cai and Io Daza Dillon.

* sajson does not do complete UTF-8 validation, unlike the other two parsers.

Parsing JSON using SIMD instructions on the Apple A12 processor

Most modern processors have “SIMD instructions”. These instructions operate over wide registers, doing many operations at once. For example, you can easily subtract 16 values from 16 other values in a single SIMD instruction. It is a form of parallelism that can drastically improve the performance of some applications, like machine learning, image processing or numerical simulations.
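As an illustration (using the GCC/Clang vector extension rather than the intrinsics a real library would use), subtracting 16 byte values from 16 others is a single vector operation:

```cpp
#include <cstdint>

// A 16-byte vector type: the compiler maps arithmetic on it to one
// SIMD instruction (e.g., psubb on x64, sub on 64-bit ARM NEON).
typedef uint8_t byte16 __attribute__((vector_size(16)));

byte16 subtract16(byte16 a, byte16 b) {
  return a - b;  // 16 independent byte subtractions at once
}
```

As with scalar unsigned bytes, the subtraction wraps around modulo 256.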

A common data format on the Internet is JSON. It is a ubiquitous format for exchanging data between computers. If you are using some data-intensive web or mobile application written in this decade, it almost surely relies on JSON, at least in part.

Parsing JSON can become a bottleneck. The simdjson C++ library applies SIMD instructions to the problem of parsing JSON data. It works pretty well on modern Intel processors, the kind you have in your laptop if you bought it in the last couple of years. These processors have wide registers (256-bit) and a correspondingly powerful set of instructions (AVX2).

It is not the end of the road: the upcoming generation of Intel processors supports AVX-512, with its 512-bit registers. But such fancy processors are still uncommon, even though they will surely be everywhere in a few short years.

But what about processors that do not have such fancy SIMD instructions? The processor in your iPhone (an Apple A12) is an ARM processor and it “merely” has 128-bit registers, so half the width of current Intel processors and a quarter of the width of upcoming Intel processors.

It would not be fair to compare an ARM processor with its 128-bit registers to an Intel processor with 256-bit registers… but we can even the odds somewhat by disabling the AVX2 instructions on the Intel processor, forcing it to rely only on the smaller 128-bit registers (and the old SSE instruction sets).

Another source of concern is that mobile processors run at a lower clock frequency. You cannot easily compensate for differences in clock frequencies.

So let us run some code and look at a table of results! I make available the source code necessary to build an iOS app to test the JSON parsing speed. If you follow my instructions, you should be able to reproduce my results. To run simdjson on an Intel processor, you can use the tools provided with the simdjson library. I rely on GNU GCC 8.3.

file AVX (Intel Skylake 3.7 GHz) SSE (Intel Skylake 3.7 GHz) ARM (Apple A12 2.5 GHz)
gsoc-2018 3.2 GB/s 1.9 GB/s 1.7 GB/s
twitter 2.2 GB/s 1.4 GB/s 1.3 GB/s
github_events 2.4 GB/s 1.6 GB/s 1.2 GB/s
update-center 1.9 GB/s 1.3 GB/s 1.1 GB/s

So we find that the Apple A12 processor is somewhere between an Intel Skylake processor with AVX disabled and a full Intel Skylake processor, if you account for the fact that it runs at a lower frequency.

Having wider registers and more powerful instructions is an asset: no matter how you look at the numbers, AVX instructions are more powerful than ARM SIMD instructions. Once Intel makes AVX-512 widely available, it may leave many ARM processors in the dust as far as highly optimized codebases are concerned. In theory, ARM processors are supposed to get more modern SIMD instructions (e.g., via the Scalable Vector Extension), but we are not getting our hands on them yet. It would be interesting to know whether Qualcomm and Apple have plans to step up their game.

Note: There are many other differences between the two tests (Intel vs. Apple). Among them is the compiler. The SSE and NEON implementations have not been optimized. For example, I merely ran the code on the Apple A12 without even trying to make it run faster. We just checked that the implementations are correct and pass all our tests.

Credit: The simdjson port to 128-bit registers for Intel processors (SSE4.2) was mostly done by Sunny Gleason, the ARM port was mostly done by Geoff Langdale with code from Yibo Cai. Io Daza Dillon completed the ports, did the integration work and testing.