How fast can you parse JSON?

JSON has become the de facto standard exchange format on the web today. A JSON document is quite simple and is akin to a simplified form of JavaScript:

{
  "Image": {
    "Width": 800,
    "Height": 600,
    "Animated": false,
    "IDs": [116, 943, 234, 38793]
  }
}

These documents need to be generated and parsed on a large scale. Thankfully, we have many fast libraries to parse and manipulate JSON documents.

In a recent paper by Microsoft (Mison: A Fast JSON Parser for Data Analytics), the researchers report parsing JSON documents at 0.1 or 0.2 GB/s with common libraries such as RapidJSON. It is hard to tell the exact numbers because you have to read them off a tiny plot, but I believe the ballpark is right. They use a 3.5 GHz processor, so that’s 8 to 16 cycles per input byte of data.

Does it make sense?

I don’t have much experience processing lots of JSON, but I can use a library. RapidJSON is handy enough. If you have a JSON document in a memory buffer, all you need are a few lines:

#include "rapidjson/document.h"

// "buffer" must be a mutable, null-terminated char* holding the JSON text;
// ParseInsitu modifies it in place.
rapidjson::Document d;
if (!d.ParseInsitu(buffer).HasParseError()) {
  // you are done parsing
}

This “ParseInsitu” approach modifies the input buffer (so strings can be handled in place) and is the fastest option. If you have a buffer that you do not want to modify, you can call “Parse” instead.
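
For instance, here is a minimal sketch of the non-destructive call, assuming the JSON text sits in a std::string named json (the name is mine, not from the library):

#include "rapidjson/document.h"
#include <string>

// Non-destructive parsing: "Parse" copies the strings it needs into the
// document, so the input remains untouched.
bool parse_copy(const std::string& json) {
  rapidjson::Document d;
  d.Parse(json.c_str());
  return !d.HasParseError();
}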

To run an example, I am parsing one sizeable “twitter.json” test document. I am using a Linux server with a Skylake processor. I parse the document 10 times and check that the minimum and the average timings are close.
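
My actual benchmarking code is the one I make available below; as a rough sketch, assuming an x86-64 compiler where __rdtsc is available and treating the time-stamp counter as an approximate cycle count, the measurement loop looks something like this:

#include <x86intrin.h> // __rdtsc
#include <cstdint>
#include <vector>
#include "rapidjson/document.h"

// Parse the same document 10 times and report the best (minimum) cost,
// making a fresh copy each time because ParseInsitu modifies its input.
double best_cycles_per_byte(const std::vector<char>& original) {
  double best = 1e300;
  for (int i = 0; i < 10; i++) {
    std::vector<char> buffer(original);
    buffer.push_back('\0'); // ParseInsitu expects a null-terminated buffer
    uint64_t start = __rdtsc();
    rapidjson::Document d;
    d.ParseInsitu(buffer.data());
    uint64_t cycles = __rdtsc() - start;
    if (!d.HasParseError()) {
      double cost = double(cycles) / double(original.size());
      if (cost < best) best = cost;
    }
  }
  return best;
}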

ParseInsitu      4.7 cycles per byte
Parse            7.1 cycles per byte

This is the time needed to parse the whole document into a model. You can get even better performance if you use the streaming API that RapidJSON provides.
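
As a minimal sketch of that streaming (SAX) interface, here is a handler that merely counts keys instead of building a document model (the KeyCounter name is mine):

#include "rapidjson/reader.h"
#include <cstdio>

// SAX-style parsing: RapidJSON calls back into the handler for each event;
// nothing is allocated beyond what the handler itself keeps.
struct KeyCounter : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, KeyCounter> {
  size_t keys = 0;
  bool Key(const char*, rapidjson::SizeType, bool) {
    keys++;
    return true; // keep parsing
  }
};

int main() {
  const char* json = "{\"Image\":{\"Width\":800,\"Height\":600}}";
  rapidjson::Reader reader;
  rapidjson::StringStream ss(json);
  KeyCounter handler;
  reader.Parse(ss, handler);
  printf("keys: %zu\n", handler.keys);
}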

Though I admit that my numbers are preliminary and partial, they suggest to me that the Microsoft researchers might not have given RapidJSON its best chance: their numbers are closer to those of the slower “Parse” function. It is possible that they consider it unacceptable for the input buffer to be modified, but I cannot find any documentation to this effect, nor any related rationale. Given that they did not provide their code, it is hard to tell exactly what they did with RapidJSON.

The Microsoft researchers report results roughly 10x better than RapidJSON, equivalent to a fraction of a cycle per input byte. The caveat is that they only selectively parse the document, extracting only subcomponents of the document. As far as I can tell, their software is not freely available.

How would they fare against an optimized application of the RapidJSON library? I am not sure. At a glance, it does not seem implausible that they might have underestimated the speed of RapidJSON by a factor of two.

In their paper, the Java-based JSON libraries (GSON and Jackson) are fast, often faster than RapidJSON, even though RapidJSON is written in C++. Is that fair? I am not, in principle, surprised that Java can be faster than C++. And I am not very familiar with RapidJSON… but it looks like performance-oriented C++. C++ is not always faster than Java, but in the hands of the right people, I expect it to do well.

So I went looking for a credible performance benchmark that includes both C++ and Java JSON libraries and found nothing. Google is failing me.

In any case, to answer my own question, it seems that parsing JSON should take about 8 cycles per input byte on a recent Intel processor. Maybe less if you are clever. So you should expect to spend 2 or 3 seconds parsing one gigabyte of JSON data.
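
(As a quick sanity check on that figure, assuming a 3.5 GHz clock: 8 cycles per byte over one gigabyte is 8 × 10^9 cycles / (3.5 × 10^9 cycles per second) ≈ 2.3 seconds.)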

I make my code available.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

11 thoughts on “How fast can you parse JSON?”

  1. Another question (open to debate): should the cost of parsing include validation? Is it reasonable to quietly return “reasonable” results of a query on something that isn’t valid JSON?

    This is a question that affects Mison more (as far as I can tell). Mison’s ability to skip fields might allow it to pass over JSON syntax errors without noticing (they didn’t publish code, so I can’t tell for sure).

  2. How fast? It depends…

    If you consider Jackson from the JVM world then, please, try jsoniter-scala:

    https://github.com/plokhotnyuk/jsoniter-scala

    It parses input streams or byte arrays directly into Scala data structures, without any intermediate representation like strings, hash maps, etc.

    So jsoniter-scala is much safer and more efficient than any other JSON parser for Scala.

    It has methods for scanning through multi-GB value streams or JSON arrays and parsing values without needing to hold them all in memory:

    https://github.com/plokhotnyuk/jsoniter-scala/blob/master/core/src/main/scala/com/github/plokhotnyuk/jsoniter_scala/core/package.scala#L79

    Also, it has outstanding features like fast skipping of unneeded fields (key/value pairs) and crazily fast parsing and serialization of java.time.* classes.

    Just see the benchmark results below. BTW, they include results for parsing and serialization of messages from the Twitter API:

    http://jmh.morethan.io/?source=https://plokhotnyuk.github.io/jsoniter-scala/jdk8.json

    All results for JDK 8/10 and GraalVM CE/EE on one page:

    https://plokhotnyuk.github.io/jsoniter-scala/

    WARNING: Results for GraalVM CE/EE are only a rough evaluation of the potential of this new tech.
    Final results may change significantly once the JMH tool and GraalVM developers provide mutual compatibility.

    In most cases jsoniter-scala works on par with the best binary serializers for Java and Scala:

    https://github.com/dkomanov/scala-serialization/pull/8

    1. For some limited kinds of work, like parsing with projection or parsing arrays of UUIDs, jsoniter-scala can achieve 2 bytes per cycle (or ~2 GB per second on contemporary desktops), which is quite competitive with state-of-the-art filter/parsers like Mison or Sparser.

      Results and code of projection benchmark:

      https://github.com/guillaumebort/mison/pull/1

      Results (scroll down to the ArrayOfUUIDsBenchmark section) and code of the benchmark for parsing UUID arrays:

      http://jmh.morethan.io/?source=https://plokhotnyuk.github.io/jsoniter-scala/oraclejdk11.json

      https://github.com/plokhotnyuk/jsoniter-scala/blob/master/jsoniter-scala-benchmark/src/main/scala/com/github/plokhotnyuk/jsoniter_scala/macros/ArrayOfUUIDsBenchmark.scala

  3. Using a loop with the same input may give an unfair advantage to Java, as the JIT may kick in and optimize the code for that input. The effect could exist in C++ at a smaller scale, with the branch prediction getting better. As an example of such a flawed benchmark: https://github.com/nodejs/node/pull/1457#issuecomment-94188258.

    I didn’t read the Microsoft paper, so this is pure speculation on my side.

  4. It does make a lot of sense!

    So it looks like they claim 16 cycles per byte and you think it is 4? You may be right. Benchmarking is hard. But take a look at the abstract: a 10x improvement for “analytical”, “FSM” (and probably some SIMD)?

    Analytical? A DOM-style parser will always lose when you only read 3 fields out of 20 (per record). I would expect a streaming parser to do 2x+ better: no allocations, no copying, no parsing of digits into numbers nobody will use.

    How are streaming parsers implemented? Which one is not a loop reading character by character (byte by byte)? Can any compiler vectorize that (and remove the branching)?

    Modern regexp libraries (like Hyperscan) show great results with vectorized FSMs. I believe such techniques would speed up analytical queries over JSON (scanning for 50 bytes out of a 1000-byte record).

    Does it matter? JSON is the de facto standard format for data exchange between companies these days. Many open datasets and just about every startup use JSON extensively. So yeah, if someone (Microsoft?) spends the time doing it, please open source it! I would greatly appreciate having it in Spark!

    BTW: No, you will not get 10x in practice. Nobody does json; everybody does json.gz. But gzip is very slow. Thanks to Amdahl’s law, you will only get about a 2x improvement with a vectorized FSM JSON parser; gunzip will then take 90% of the CPU.
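
    A rough sketch of that arithmetic, assuming decompression and parsing each take about half the time today: speeding up the parsing half by 10x leaves 0.5 + 0.5/10 = 0.55 of the original time, i.e. only about a 1.8x overall speedup, with gunzip then accounting for roughly 0.5/0.55 ≈ 90% of the CPU.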

  5. Great pointer: it’s fast and header-only. Thanks a lot!
    Regarding Java vs C++: they are pretty close. In the land of search engines, we see more carefully implemented Java engines beat C++ implementations. We all know Lucene is very fast and hard to beat. There is another example: Galago is a Java re-implementation of Indri. So, it uses a similar query evaluation paradigm, but it is nevertheless about 2x faster.

    1. Yes. I am not assuming that Java is slower than C++ in actual systems, but RapidJSON is C++ written with performance in mind.

      When parsing bytes, there are tricks that are easy in C++ but hard in Java.

  6. Surely being stringly makes JSON inefficient from a processor standpoint anyway?

    Also are those 8 cycles / byte a simplification because of amortised costs? (unless it’s unique to Skylake lines).

    1. Surely being stringly makes JSON inefficient from a processor standpoint anyway?

      Not necessarily.

      Also are those 8 cycles / byte a simplification because of amortised costs? (unless it’s unique to Skylake lines).

      It is specific to the machine I tested on.
