Most modern processors have “SIMD instructions”. These instructions operate over wide registers, doing many operations at once. For example, you can easily subtract 16 values from 16 other values with a single SIMD instruction. It is a form of parallelism that can drastically improve the performance of some applications, such as machine learning, image processing or numerical simulations.
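For instance, here is a minimal C++ sketch using SSE2 intrinsics (available on any 64-bit x86 processor); the function name is mine, for illustration only:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Subtract 16 byte values from 16 other byte values: the _mm_sub_epi8
// intrinsic compiles down to a single SIMD instruction (psubb).
void subtract16(const uint8_t *a, const uint8_t *b, uint8_t *out) {
  __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
  __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
  __m128i diff = _mm_sub_epi8(va, vb); // 16 subtractions at once
  _mm_storeu_si128(reinterpret_cast<__m128i *>(out), diff);
}
```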
JSON is a ubiquitous format for exchanging data between computers on the Internet. If you are using a data-intensive web or mobile application written in this decade, it almost surely relies on JSON, at least in part.
Parsing JSON can become a bottleneck. The simdjson C++ library applies SIMD instructions to the problem of parsing JSON data. It works pretty well on modern Intel processors, the kind you have in your laptop if you bought it in the last couple of years. These processors have wide registers (256-bit) and a powerful corresponding instruction set (AVX2).
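To give a flavor of the approach, here is an illustrative AVX2 sketch (not simdjson’s actual code): you can compare 32 input bytes against a character such as the quote in one step and collapse the result into a 32-bit mask, one bit per byte.

```cpp
#include <immintrin.h> // AVX2 intrinsics
#include <cstdint>

// Hypothetical helper: return a 32-bit mask with one bit set for each
// input byte equal to the double-quote character. Compile with -mavx2.
uint32_t find_quotes(const uint8_t *src) {
  __m256i block  = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src));
  __m256i quotes = _mm256_cmpeq_epi8(block, _mm256_set1_epi8('"'));
  return static_cast<uint32_t>(_mm256_movemask_epi8(quotes));
}
```

Such bit masks make it cheap to locate the strings and structural characters in the input.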
And that is not the end of the road: the upcoming generation of Intel processors supports AVX-512 with its 512-bit registers. But such fancy processors are still uncommon, even though they will surely be everywhere in a few short years.
But what about processors that do not have such fancy SIMD instructions? The processor in your iPhone (an Apple A12) is an ARM processor and it “merely” has 128-bit registers, so half the width of current Intel processors and a quarter of the width of upcoming Intel processors.
It would not be fair to compare an ARM processor with its 128-bit registers to an Intel processor with 256-bit registers… but we can even the odds somewhat by disabling the AVX2 instructions on the Intel processor, forcing it to rely only on the smaller 128-bit registers (and the older SSE instruction sets).
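One practical difference goes beyond register width: NEON has no direct counterpart to the x86 movemask instruction used above. A common workaround, sketched below (illustrative, not simdjson’s exact code), collapses a byte comparison into a 16-bit mask with pairwise additions:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical helper: emulate _mm_movemask_epi8 on NEON. The input is
// assumed to come from a byte comparison (each lane 0x00 or 0xFF).
uint16_t neon_movemask(uint8x16_t input) {
  const uint8x16_t bit_weights = {1, 2, 4, 8, 16, 32, 64, 128,
                                  1, 2, 4, 8, 16, 32, 64, 128};
  uint8x16_t masked = vandq_u8(input, bit_weights); // keep one bit per lane
  // Three rounds of pairwise addition collapse 16 lanes into 2 mask bytes.
  uint8x8_t sum = vpadd_u8(vget_low_u8(masked), vget_high_u8(masked));
  sum = vpadd_u8(sum, sum);
  sum = vpadd_u8(sum, sum);
  return vget_lane_u16(vreinterpret_u16_u8(sum), 0);
}
```

For example, `neon_movemask(vceqq_u8(block, vdupq_n_u8('"')))` would produce the quote mask for a 16-byte block.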
Another source of concern is that mobile processors run at a lower clock frequency. You cannot easily compensate for differences in clock frequencies.
So let us run some code and look at a table of results! I make available the source code necessary to build an iOS app to test the JSON parsing speed. If you follow my instructions, you should be able to reproduce my results. To run simdjson on an Intel processor, you can use the tools provided with the simdjson library. I rely on GNU GCC 8.3.
file | AVX (Intel Skylake 3.7 GHz) | SSE (Intel Skylake 3.7 GHz) | ARM (Apple A12 2.5 GHz) |
---|---|---|---|
gsoc-2018 | 3.2 GB/s | 1.9 GB/s | 1.7 GB/s |
twitter | 2.2 GB/s | 1.4 GB/s | 1.3 GB/s |
github_events | 2.4 GB/s | 1.6 GB/s | 1.2 GB/s |
update-center | 1.9 GB/s | 1.3 GB/s | 1.1 GB/s |
So we find that the Apple A12 processor sits somewhere between an Intel Skylake processor with AVX disabled and a full Intel Skylake processor, once you account for the fact that it runs at a lower clock frequency.
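To make this concrete, normalize by clock frequency: on the gsoc-2018 file, the Skylake processor parses roughly 3.2/3.7 ≈ 0.86 bytes per cycle with AVX and 1.9/3.7 ≈ 0.51 bytes per cycle with SSE, while the Apple A12 reaches about 1.7/2.5 ≈ 0.68 bytes per cycle, squarely between the two.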
Having wider registers and more powerful instructions is an asset: no matter how you look at the numbers, AVX instructions are more powerful than ARM SIMD instructions. Once Intel makes AVX-512 widely available, it may leave many ARM processors in the dust as far as highly optimized codebases are concerned. In theory, ARM processors are supposed to get more modern SIMD instructions (e.g., via the Scalable Vector Extension), but we have yet to get our hands on them. It would be interesting to know whether Qualcomm and Apple have plans to step up their game.
Note: There are many other differences between the two tests (Intel vs. Apple). Among them is the compiler. The SSE and NEON implementations have not been optimized. For example, I merely ran the code on the Apple A12 without even trying to make it run faster. We just checked that the implementations are correct and pass all our tests.
Credit: The simdjson port to 128-bit registers for Intel processors (SSE4.2) was mostly done by Sunny Gleason, the ARM port was mostly done by Geoff Langdale with code from Yibo Cai. Io Daza Dillon completed the ports, did the integration work and testing.
Why not run the tests with the Skylake clocked at 2.5 GHz instead? The test would still have caveats, but at least the numbers would be “real”.
You could rescale everything from 3.7 GHz to 2.5 GHz if you’d like. Given that we are essentially CPU bound, the numbers scale linearly with frequency (to a good degree).
Actually, the result as it stands is rather nice.
Apple has 3 SIMD units in its more recent chips, while Intel has 2 SSE or AVX units. So you would expect, naively (and assuming perfect independence of all the important instructions), Apple at 2.5 GHz to more or less match Skylake at 1.5 × 2.5 GHz = 3.75 GHz…
The next step would be Apple with SVE, but SVE (like the first round of AVX) is primarily FP; you’d really want SVE2.
My guess is that this year we get SVE (as two units, 256 bits wide), with SVE2 in two years. But what do I know?
You could also try using Apple’s JSON libraries. One would hope those are optimized (including for SIMD), though who knows? And they may be optimized for error handling or a particular programming model rather than for absolute performance.
Recent Intel has three SIMD units, and AMD Zen (128-bit units) and Zen 2 (256-bit units) have four.
However, the Intel units are not symmetric: not all operations can occur on all units, although some, such as logical operations and some integer math, can. So depending on the mix of operations, an Intel chip might behave as if it has 1, 2 or 3 SIMD units.
I don’t think all of simdjson is vectorized, so the vector related scaling only affects a portion of the algorithm: the scaling of the other parts will depend on scalar performance.
Thanks for the clarification.
Given your statements, I’m then really surprised at the gap. Of course Apple is wider, but this doesn’t seem like code for which that would help THAT much.
Is this a case where the NEON intrinsics are just a nicer fit to the problem? Or where certain high latency ops (at least lookups and permutes, for example) run in fewer cycles on Apple?
What gap exactly? You mean the part where the Intel SSE throughput doesn’t exceed the A12 performance by 3.7/2.5?
The implementation has a scalar part and a SIMD part, so the problem doesn’t scale linearly with SIMD width (note also that the AVX performance is not double the SSE performance on the same chip). So you can’t apply your SIMD-width calculation to the overall performance. We already know the A12 usually does more scalar work per cycle, so this can explain it.
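To put rough numbers on it (assuming, purely for illustration, that half the cycles are spent in SIMD code): if doubling the register width halved only that half, Amdahl’s law gives a speedup of 1/(0.5 + 0.25) ≈ 1.33×, well short of 2×. The measured AVX/SSE ratio of about 3.2/1.9 ≈ 1.7× on the gsoc-2018 file sits between those extremes.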
Also, you can’t just count the number of SIMD EUs, because they are highly asymmetric on Intel and perhaps on Apple chips. It doesn’t matter that you have three EUs if you are primarily bound by, say, shuffles, which only one EU supports.
“It doesn’t matter that you have three EUs if you are primarily bound by, say, shuffles, which only one EU supports.”
OK, that’s the sort of thing I was after.
As far as I can tell the Apple cores are extremely symmetric except for the usual weirdness around integer division and multiplication. I’ve never seen anything to suggest asymmetric NEON support.
“We already know A12 usually does more scalar work per cycle, so this can explain it.”
This I’m less convinced by, in that I find it hard to believe either core is hitting even an IPC of 4. I’d expect that, even in carefully tweaked hand assembly, this is a hard problem to scale to high IPC.
Maybe I’m wrong! That’s just a gut feeling…
What does IPC > 4 have to do with anything?
A12 gets higher IPC (and higher “work per cycle” which is what we are really talking about) in general, but running at an IPC > 4 is not in any way a prerequisite for that.
In general the A12 does better than pure frequency scaling would suggest: both because the A12 is more of a brainiac (it does more work per cycle), and because scaling distorts things like misses to L3 or DRAM, which are at least partly measured not in CPU cycles but in real time (or DDR cycles or whatever, which does not scale with CPU frequency).
So if you are expecting an Intel chip at 3.7 GHz and an A12 chip at 2.5 GHz to perform in a ratio of 3.7/2.5 you’ll be disappointed most of the time and I don’t see any reason for this code to be different.
Nice work. Can you clarify your note about not optimizing for the A12? First, what do you mean by “ARM vs. Apple”? Weren’t they the same thing in this case? And what sort of optimizations did you not do for the A12 code? You used SIMD so I’m not sure what else what was on the table.
First, what do you mean by “ARM vs. Apple”?
It was a typo. It is Intel vs. Apple.
And what sort of optimizations did you not do for the A12 code? You used SIMD so I’m not sure what else what was on the table.
There are many design choices; there are often ten different ways to implement a given function.
The fact that we use SIMD instructions for part of the code is no guarantee that we are making full use of the processor. It is very likely that someone who knows ARM well could beat our implementation… by an untold margin.
The AVX implementation received more tuning so it is less likely that you could beat it by much.
For example, our UTF-8 validation on ARM is likely slower than it should be, and we even have better code samples (there is an open issue in the project); we just did not have time to assess them.
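To give a concrete flavor of the kind of tuning left on the table, here is an illustrative NEON sketch (my own assumption about a plausible fast path, not the project’s actual code): a 16-byte block that is pure ASCII can be accepted with a single horizontal maximum on AArch64.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical ASCII fast path for UTF-8 validation: if no byte has the
// high bit set, the block is valid ASCII and needs no multi-byte checks.
// vmaxvq_u8 (horizontal maximum) requires AArch64.
bool is_ascii_block(const uint8_t *src) {
  uint8x16_t block = vld1q_u8(src);
  return vmaxvq_u8(block) < 0x80;
}
```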
Great article! Maybe a small typo in the first section:
Should it not be:
Thanks.
In addition to Qualcomm/Apple, you may also want to try the new ARM eMAG core running at 3.3 GHz (32 cores).
Packet has a c2 instance type available with this CPU.
I actually own an Ampere box! (And I have covered it a few times on this blog.)
It does have lots of cores… but it is not really competitive in terms of single-threaded performance especially when NEON is involved.
(I am still a fan of the company and will find a way to buy their second generation systems.)
I did an experiment to reduce the amount of unnecessary work in stage 1. Rather than flatten the bitmap in flatten_bits, we can just write out the whole bitmap as is. Stage 2 then decodes it in UPDATE_CHAR one bit at a time. A simplistic implementation shows the following speedups on an AArch64 server for the four JSON files: 0.8%, -2.4%, 5.1%, 7.1%. Branch mispredictions are higher, of course, but it’s still faster overall.
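For context, here is a minimal sketch of what the flattening step does (names and signature are illustrative, not simdjson’s exact code): it converts each 64-bit structural bitmap into a list of byte positions, which is the work this experiment defers to stage 2.

```cpp
#include <cstdint>

// Illustrative bitmap flattening: extract the position of each set bit.
void flatten_bits(uint32_t *out, uint32_t &count, uint32_t base, uint64_t bits) {
  while (bits != 0) {
    out[count++] = base + __builtin_ctzll(bits); // index of lowest set bit
    bits &= bits - 1;                            // clear the lowest set bit
  }
}
```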
While stage 1 achieves a great IPC of ~3 with very few branch mispredictions, the work it performs doesn’t seem to be worthwhile enough to really help stage 2. Like I mentioned before, adding code to skip spaces in the parser should simplify stage 1 considerably and give larger speedups.
Thanks. I will investigate this possible optimization.
Apple’s processors are what make its devices unique and popular. They are very well optimized in both the iPhone and the Mac.