We released simdjson 0.3: the fastest JSON parser in the world is even better!

Last year (2019), we released the simdjson library. It is a C++ library available under a liberal license (Apache) that can parse JSON documents very fast. How fast? We reach and exceed 3 gigabytes per second in many instances. It can also parse millions of small JSON documents per second.

The new version is much faster. How much faster? Last year, we could parse a file like simdjson at a speed of 2.0 GB/s, and then we reached 2.2 GB/s. We are now reaching 2.5 GB/s. Why go so fast? In comparison, a fast disk can reach 5 GB/s and the best network adapters are even faster.

The following plot presents the 2020 simdjson library (version 0.3) compared with the fastest standard-compliant C++ JSON parsers (RapidJSON and sajson). It ran on a single Intel Skylake core, and the code was compiled with the GNU GCC 9 compiler. All tests are reproducible using Docker containers.

In this plot, RapidJSON and simdjson have exact number parsing, while RapidJSON (fast float) and sajson use approximate number parsing. Furthermore, sajson has only partial Unicode validation whereas the other parsers offer exact encoding (UTF-8) validation.

If we had only improved the performance, it would already be amazing. But our new release packs a whole lot of improvements:

  1. Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually.
  2. Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. In the past, using simdjson was a bit of a chore. The new approach is definitely modern; see for yourself:
    auto cars_json = R"( [
      { "make": "Toyota", "model": "Camry",  "year": 2018, 
           "tire_pressure": [ 40.1, 39.9 ] },
      { "make": "Kia",    "model": "Soul",   "year": 2012, 
           "tire_pressure": [ 30.1, 31.0 ] },
      { "make": "Toyota", "model": "Tercel", "year": 1999, 
           "tire_pressure": [ 29.8, 30.0 ] }
    ] )"_padded;
    dom::parser parser;
    dom::array cars = parser.parse(cars_json).get<dom::array>();
    
    // Iterating through an array of objects
    for (dom::object car : cars) {
      // Accessing a field by name
      cout << "Make/Model: " << car["make"] 
               << "/" << car["model"] << endl;
    
      // Casting a JSON element to an integer
      uint64_t year = car["year"];
      cout << "- This car is " << 2020 - year 
               << " years old." << endl;
    
      // Iterating through an array of floats
      double total_tire_pressure = 0;
      for (double tire_pressure : car["tire_pressure"]) {
        total_tire_pressure += tire_pressure;
      }
      cout << "- Average tire pressure: " 
          << (total_tire_pressure / 2) << endl;
    
      // Writing out all the information about the car
      for (auto [key, value] : car) {
        cout << "- " << key << ": " << value << endl;
      }
    }
    

  3. Exact Float Parsing: simdjson parses floats flawlessly at high speed.
  4. Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines. This means that you no longer need to check whether the system supports simdjson.
  5. Automatic allocation: as part of API simplification, the parser no longer has to be preallocated: it will adjust automatically when it encounters larger files.
  6. Runtime selection API: We have exposed simdjson’s runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
  7. Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to the value you need, and exceptions will be thrown automatically if an error happens. Use the same API either way!
  8. Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
  9. We now have a dedicated web site (https://simdjson.org) in addition to the GitHub site (https://github.com/simdjson/simdjson).
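
To make items 7 and 8 concrete, here is a minimal, self-contained sketch of the general pattern: a value paired with an error code, where chained lookups carry the first error forward and a conversion throws only if you skip the check. It does not use simdjson and is not simdjson's implementation. The names `error_code`, `node`, `result` and `make_document` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical error codes for this sketch; simdjson defines its own set.
enum class error_code { SUCCESS, NO_SUCH_FIELD };

// Hypothetical stand-in for a parsed JSON value: either a scalar (text)
// or an object whose i-th key maps to the i-th child.
struct node {
  std::string text;
  std::vector<std::string> keys;
  std::vector<node> children;
};

// A value paired with an error code, in the spirit of simdjson_result.
struct result {
  const node *value;
  error_code error;

  // Chaining: once a step has failed, later lookups just carry the error along.
  result operator[](const std::string &key) const {
    if (error != error_code::SUCCESS) { return *this; }
    for (std::size_t i = 0; i < value->keys.size(); i++) {
      if (value->keys[i] == key) {
        return {&value->children[i], error_code::SUCCESS};
      }
    }
    return {nullptr, error_code::NO_SUCH_FIELD};
  }

  // Error-code style: write the text to `out` only on success, return the code.
  error_code get(std::string &out) const {
    if (error == error_code::SUCCESS) { out = value->text; }
    return error;
  }

  // Exception style: converting straight to the value throws on failure.
  operator std::string() const {
    if (error != error_code::SUCCESS) {
      throw std::runtime_error("no such field");
    }
    return value->text;
  }
};

// Builds the document {"car": {"make": "Toyota"}} by hand (no parsing here).
node make_document() {
  node make{"Toyota", {}, {}};
  node car{"", {"make"}, {make}};
  return node{"", {"car"}, {car}};
}
```

Given a `result root` that wraps the document, `root["car"]["make"].get(make)` checks the error code once at the end of the chain, whereas `std::string make = root["car"]["make"];` relies on an exception being thrown if any step failed. Both styles go through the same `operator[]`.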

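For readers unfamiliar with the ndjson layout mentioned in item 1, the sketch below shows what "a bundle of JSON documents" looks like in memory: one document per line, separated by newline characters. The helper `split_ndjson` is a hypothetical name, and the code only locates document boundaries. It does not parse anything, it is not how simdjson implements multi-document parsing, and it says nothing about where the reported speed-up comes from.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split a buffer of newline-delimited JSON into one view per document.
// Assumption: documents are separated by '\n' and contain no raw newlines,
// as the ndjson convention requires. Blank lines are skipped.
std::vector<std::string_view> split_ndjson(std::string_view buffer) {
  std::vector<std::string_view> documents;
  std::size_t start = 0;
  while (start < buffer.size()) {
    std::size_t end = buffer.find('\n', start);
    if (end == std::string_view::npos) { end = buffer.size(); }
    std::string_view line = buffer.substr(start, end - start);
    // Tolerate Windows line endings.
    if (!line.empty() && line.back() == '\r') { line.remove_suffix(1); }
    if (!line.empty()) { documents.push_back(line); }
    start = end + 1;
  }
  return documents;
}
```

The returned views point into the original buffer, so the buffer must outlive them. No bytes are copied.
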
Credit: many people contributed to simdjson, but John Keiser played a substantial role worthy of mention.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

14 thoughts on “We released simdjson 0.3: the fastest JSON parser in the world is even better!”

  1. Congratulations is de rigueur here.

    I guess the next challenge will be an API to speed up YAML parsing. YAML files are an important part of deploying PyTorch on most platforms, so it would be worth seeing whether you can easily adapt this library to this type of parsing.

    1. There have been some XML parsers, at least at the prototype level, that have used SIMD instructions deliberately. However, I do not think that there has ever been something like simdjson for XML.

  2. Awesome work! When I looked at integrating the previous version into a higher level parser, it didn’t look like it handled streaming. Was that impression correct and if so, has that changed? Thanks!

    1. We do handle long inputs containing multiple JSON documents (e.g., line separated). We even have a nifty API for it (see “parse_many”).

      If you mean streaming as in “reading from a C++ istream”, then, no, we do not support this and won’t. It is too slow. We are faster than getline applied to an in-memory istream.

      1. Don’t care about C++ istream. In order to stream, the parser must be able to deal with partial/incomplete inputs and with resuming such an incomplete parse.

        Feeding it is then Somebody Else’s Problem.

        One of the things I learned early on in building high-perf components is that having the component itself be fast is (at most) half the battle. The crucial bit is that it must be possible, preferably easy/straightforward, to use it in such a way that the whole ensemble is fast.

        A lot of the “fast” XML parsers tended to fall flat in that regard.

        1. The way simdjson is currently designed is that it won’t let you access a document (at all) unless it has been fully validated. The rationale behind this is that many people do not want to start ingesting documents that are incorrect. And, of course, you can only know that a document is valid if you have seen all of it.

          For line-separated JSON documents, it is not an issue because you get to see the whole JSON document before returning it to the user, it is just that you have a long stream of them.

          We plan to offer more options in future releases.

  3. Minor inconsistency in the example: the average tire pressure of the cars will not be what one would expect (only half of it)!
