ARM MacBook vs Intel MacBook: a SIMD benchmark

In my previous blog post, I compared the performance of my new ARM-based MacBook Pro with my 2017 Intel-based MacBook Pro. I used a number parsing benchmark. In some cases, the ARM-based MacBook Pro was nearly twice as fast as the older Intel-based MacBook Pro.

I think that the Apple M1 processor is a breakthrough in the laptop industry. It has allowed Apple to sell the first ARM-based laptop that is really good. It is not just the chip, of course. It is everything around it. For example, I fully expect that most people who buy these new ARM-based laptops will never realize that they are not Intel-based. The transition is that smooth.

I am excited because I think it will drive other laptop makers to rethink their designs. You can buy a thin laptop from Apple with a 20-hour battery life and the ability to do intensive computations like a much larger and heavier laptop would.

(This blog post has been updated after I corrected a methodological mistake: I was running the Apple M1 processor under x64 emulation.)

Yet I did not think that the new Apple processor is better than Intel processors in all things. One obvious caveat is that I am comparing the Apple M1 (a 2020 processor) with an older Intel processor (released in 2017). But I thought that even older Intel processors could have an edge over the Apple M1 in some tasks, and I wanted to make this clear. I did not think it was controversial. Yet I was criticized for making the following remark:

In some respect, the Apple M1 chip is far inferior to my older Intel processor. The Intel processor has nifty 256-bit SIMD instructions. The Apple chip has nothing of the sort as part of its main CPU. So I could easily come up with examples that make the M1 look bad.

This rubbed many readers the wrong way. They pointed out that ARM processors do have 128-bit SIMD instructions called NEON. They do. In some ways, the NEON instruction set is nicer than the x64 SSE/AVX one. Recent Apple ARM processors have four execution units capable of SIMD processing while Intel processors only have three. Furthermore, the Intel execution units have more restrictions. Thus 64-bit ARM NEON routines will outperform comparable SSE2 (128-bit SIMD) Intel routines despite the fact that they both work over 128-bit registers. In fact, I have a blog post making this point by using the iPhone’s processor.
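
To give one concrete illustration of the "nicer" point: 64-bit ARM NEON has table-lookup instructions that index into as many as four 16-byte registers at once, whereas the SSE shuffle instruction (_mm_shuffle_epi8) is limited to a single 16-byte table. A minimal sketch, assuming an AArch64 target (the function name lookup64 is mine):

    #include <arm_neon.h>
    #include <stdint.h>

    // Look up 16 indexes (0..63) in a 64-byte table with one intrinsic;
    // out-of-range indexes yield zero, which is handy for classification.
    uint8x16_t lookup64(uint8x16x4_t table, uint8x16_t indexes) {
      return vqtbl4q_u8(table, indexes);
    }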

But it does not follow that the 128-bit ARM NEON instructions are generally a match for the 256-bit SIMD instructions Intel and AMD offer.

Let us test out the issue. The simdjson library offers SIMD-heavy functions to minify JSON and validate UTF-8 inputs. I wrote a benchmark program that loads a file into memory and then repeatedly calls the minify and validate functions, looking for the best possible speed. Anyone with a MacBook and Xcode should be able to reproduce my results.
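
For reference, here is a minimal sketch of such a benchmark program. It is not my exact benchmark, but it relies only on documented simdjson calls (simdjson::minify and simdjson::validate_utf8) and assumes the single-file amalgamation (simdjson.h/simdjson.cpp) plus a test file such as twitter.json:

    #include <algorithm>
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>
    #include "simdjson.h"

    int main(int argc, char **argv) {
      const char *filename = (argc > 1) ? argv[1] : "twitter.json";
      std::ifstream in(filename, std::ios::binary);
      std::string json((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
      std::vector<char> out(json.size()); // minified output is never larger
      double minify_gbps = 0, validate_gbps = 0;
      for (int trial = 0; trial < 10; trial++) { // keep the best of several runs
        auto t0 = std::chrono::steady_clock::now();
        size_t out_len = 0;
        auto error = simdjson::minify(json.data(), json.size(), out.data(), out_len);
        auto t1 = std::chrono::steady_clock::now();
        bool valid = simdjson::validate_utf8(json.data(), json.size());
        auto t2 = std::chrono::steady_clock::now();
        if (error || !valid) { std::cerr << "unexpected failure\n"; return 1; }
        minify_gbps = std::max(minify_gbps,
            json.size() / std::chrono::duration<double>(t1 - t0).count() / 1e9);
        validate_gbps = std::max(validate_gbps,
            json.size() / std::chrono::duration<double>(t2 - t1).count() / 1e9);
      }
      std::cout << "minify  : " << minify_gbps << " GB/s\n";
      std::cout << "validate: " << validate_gbps << " GB/s\n";
    }

Compile it natively (e.g., c++ -O3 benchmark.cpp simdjson.cpp) and, as the comments below point out, check with the file command that the resulting binary is an arm64 executable rather than an x86_64 one running under Rosetta 2.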

The vectorized UTF-8 validation algorithm is described in Validating UTF-8 In Less Than One Instruction Per Byte (published in Software: Practice and Experience).

The simdjson library relies on an abstraction layer so that functions are implemented using higher-level C++ which gets translated into efficient SIMD intrinsic functions specific to the targeted system. That is, we are not comparing different hand-tuned assembly functions. You can check out the UTF-8 validation code for yourself online.
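
As a sketch of what such an abstraction layer looks like (this is illustrative, not simdjson's actual code), picture a thin wrapper type with per-architecture implementations, so that a routine written once against the wrapper compiles to NEON on ARM and to SSE/AVX on x64:

    #include <cstdint>
    #if defined(__aarch64__)
    #include <arm_neon.h>
    struct simd8 { // 16 bytes at a time on 64-bit ARM
      uint8x16_t value;
      static simd8 load(const uint8_t *p) { return {vld1q_u8(p)}; }
      simd8 operator|(const simd8 o) const { return {vorrq_u8(value, o.value)}; }
    };
    #else
    #include <immintrin.h>
    struct simd8 { // 16 bytes at a time with SSE2; simdjson also has a 32-byte AVX2 kernel
      __m128i value;
      static simd8 load(const uint8_t *p) {
        return {_mm_loadu_si128(reinterpret_cast<const __m128i *>(p))};
      }
      simd8 operator|(const simd8 o) const { return {_mm_or_si128(value, o.value)}; }
    };
    #endif

    // Generic code is written once against the wrapper:
    inline simd8 combine(const uint8_t *a, const uint8_t *b) {
      return simd8::load(a) | simd8::load(b);
    }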

Let us look at the results:

                                          minify     UTF-8 validate
    Apple M1 (2020 MacBook Pro)           6.6 GB/s   33 GB/s
    Intel Kaby Lake (2017 MacBook Pro)    7.7 GB/s   29 GB/s
    Intel/M1 ratio                        1.2        0.9

As you can see, the older Intel processor is slightly superior to the Apple M1 in the minify test, while the Apple M1 comes out ahead in the UTF-8 validation test.

Of course, it is only one set of benchmarks. There are many confounding factors. Did the algorithmic choices favour the AVX2 ISA? It is possible. Thankfully all of the source code is available so any such bias can be assessed.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

43 thoughts on “ARM MacBook vs Intel MacBook: a SIMD benchmark”

    1. This is true. Whether you are an x86 loyalist or indifferent, the old assumptions are all being turned on their heads. I think we will see even more progress from AMD and Intel now that Apple is here to shake up the rankings.

      1. AFAIK, SVE is currently available on the Fugaku supercomputer. However, you can’t exactly get one at NewEgg.

        According to the roadmap published here, it appears the Neoverse-V1 and Neoverse-N2 will be the first two designs from ARM itself to sport SVE.

        This article from AnandTech corroborates what I just said.

        SVE2 doesn’t explicitly show up on any of those public roadmap slides, so it’s probably a couple years out—at least in cores designed by ARM. Although, as AnandTech points out, “SVE” in the slide may actually refer to SVE2 in some cases.

        ARM first disclosed SVE several years ago, but is only just now starting to make SVE-capable cores. I wouldn’t be surprised if we had to wait another few years to buy an end product that offers SVE2.

        Even though the Neoverse-V1 is “available now,” that doesn’t mean I can go buy a machine sporting one. It means silicon vendors can license and start building chips around it. It’ll be some time before you see volume product.

        Why such slow adoption? Wide SIMD in the CPU just wasn’t that important to cell phones. It’s too power hungry, and it was hard to keep the ARM CPUs fed. Dedicated accelerators were a better fit in that product space, particularly from an energy efficiency standpoint.

        In a workstation or server, you have a different set of constraints. And, now we have some decent interconnects.

        Challenges remain: it’s one thing to plop down the functional units for these wide vectors. Managing power—both peak and transient—is another kettle of fish.

        1. Neoverse-V1 is ARMv8.4-A + 2x 256-bit SVE. (and was finished this year)
          Neoverse-N2 is ARMv8.5-A + 2x 128-bit SVE2. (and will be in finished form next year)

          Of course, that means finished on Arm’s side; accordingly, we should expect Neoverse-V1 designs in 2021 and Neoverse-N2 designs in 2022.

      2. It looks like there is compiler and emulator support for SVE/SVE2 but the only available silicon is the Fujitsu A64FX (pdf) with SVE.

        You have identified an area where Apple/Amazon Arm64 silicon is playing catch-up to x64 on both desktop and server: vectorized SIMD algorithms.

        1. Calling this catch-up is misleading. SVE/SVE2 is not just wider NEON; it is a rethinking of how to design a vector ISA for much better compiler auto-vectorization. (A very rough figure of merit: 128-bit wide SVE would run a “broad suite” of autovectorized code about 1.3x faster than NEON.)
          If we want to use these sorts of terms, leapfrogging would be more appropriate.
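
          To make that concrete, here is roughly what a vector-length-agnostic SVE loop looks like with the ACLE intrinsics (a sketch; the function is mine, not from any particular library). The same binary runs on 128-bit, 256-bit, or 512-bit SVE hardware, with the tail handled by the predicate rather than by scalar cleanup code:

              #include <arm_sve.h>

              // x[i] += a * y[i], written once for any SVE vector width.
              void axpy(float *x, const float *y, float a, int64_t n) {
                for (int64_t i = 0; i < n; i += svcntw()) {
                  svbool_t pg = svwhilelt_b32(i, n); // predicate also covers the tail
                  svfloat32_t vx = svld1_f32(pg, &x[i]);
                  svfloat32_t vy = svld1_f32(pg, &y[i]);
                  svst1_f32(pg, &x[i], svmla_n_f32_x(pg, vx, vy, a));
                }
              }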

          1. I think you misunderstand RAD’s comment.

            My feeling is that he was basing his statement on my (erroneous) earlier results.

            I think that there is wide agreement that SVE is exciting new tech.

          2. SVE2 looks great but we are not going to see it in mainstream silicon until the next generation of Apple and Amazon chips at best. In every other area, the Apple M1 and Amazon Graviton 2 seem to offer the best bang-for-the-buck over x64. Until Neoverse V1/N2 silicon is available, I don’t think we will see a business case for a scale-up in-memory column store like SAP HANA moving away from Intel.

            Benchmarks using Daniel’s EWAH and/or Roaring Bitmap projects should be able to approximate when Arm ports make sense. We need more real-world SIMD-centric benchmarks; maybe Lucene/ElasticSearch, Apache Arrow, DuckDB, ClickHouse?

  1. Given the fact that NVIDIA is buying ARM, there is a non-negligible chance
    that they will change licensing policies…
    However, maybe the idea of successful ARM laptops will push somebody to try the same stunt with MIPS.
    This could be an extremely interesting development.

  2. I know you don’t usually read it, and I don’t know why they didn’t leave a comment here, but there are a few comments on HN that suggest you might have a significant bug in your benchmark: https://news.ycombinator.com/item?id=25408853.

    The summary would seem to be that ARM64 isn’t being properly detected by the macros in the simdjson code, resulting in the executable using the “generic fallback implementation”. The simple fix is to add an explicit “-DSIMDJSON_IMPLEMENTATION_ARM64=1” to the compilation. With this, one of the commenters got “minify” at 6.64796 GB/s, and “validate” at 16.4721 GB/s, concluding “That puts Intel at 1.17x and 1.15x for this specific test”.
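
    For anyone double-checking their own build, simdjson can report which kernel it selected at runtime (a sketch using the implementation-selection API as documented for simdjson v0.7):

        #include <iostream>
        #include "simdjson.h"

        int main() {
          // Prints e.g. "haswell" for an x64/Rosetta build or "arm64" for a native NEON build.
          std::cout << simdjson::active_implementation->name() << ": "
                    << simdjson::active_implementation->description() << std::endl;
        }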

    1. I believe that the big issue noted in the HN thread is that the Arm benchmarks appear to be using x86 code running under Rosetta. With real ARM64 code and more optimisation this gets the benchmarks to minify : 6.73381 GB/s and validate: 17.8548 GB/s so 1.16x and 1.06x.

      1. Hi Daniel,

        I see now that you got lots of notifications besides mine. Sorry for adding to the pile. To partially make up for it, I ran your benchmark on a MacBook Air with Ice Lake for a more direct comparison:

        % sysctl -n machdep.cpu.brand_string
        Intel(R) Core(TM) i5-1030NG7 CPU @ 1.10GHz
        % ./benchmark
        simdjson v0.7.0
        Detected the best implementation for your machine: haswell (Intel/AMD AVX2)
        loading twitter.json
        minify : 7.47081 GB/s
        validate: 34.4244 GB/s

        I was hoping that we might be able to see the effect of AVX512, but I see now that the simdjson code doesn’t yet support it. If you happen to have an unreleased version that has it, I’d be happy to test and report.

  3. Comment from HN you might be interested in:

    This article has a mistake. I actually ran the benchmark, and it doesn’t return a valid result on arm64 at all. The posted numbers match mine if I run it under Rosetta. Perhaps the author has been running their entire terminal in Rosetta and forgot.
    As I write this comment, the article’s numbers are: (minify: 4.5 GB/s, validate: 5.4 GB/s). These almost exactly match my numbers under Rosetta (M1 Air, no system load):
    % rm -f benchmark && make && file benchmark && ./benchmark
    c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
    benchmark: Mach-O 64-bit executable arm64
    minify : 1.02483 GB/s
    validate: inf GB/s

    % rm -f benchmark && arch -x86_64 make && file benchmark && ./benchmark
    c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11
    benchmark: Mach-O 64-bit executable x86_64
    minify : 4.44489 GB/s
    validate: 5.3981 GB/s

    Maybe this article is a testament to Rosetta instead, which is churning out numbers reasonable enough you don’t suspect it’s running under an emulator.

    Update, I re-ran with messe’s fix (from downthread):

    % rm -f benchmark && make && file benchmark && ./benchmark
    c++ -O3 -o benchmark benchmark.cpp simdjson.cpp -std=c++11 -DSIMDJSON_IMPLEMENTATION_ARM64=1
    benchmark: Mach-O 64-bit executable arm64
    minify : 6.64796 GB/s
    validate: 16.4721 GB/s

    That puts Intel at 1.17x and 1.15x for this specific test, not the 1.8x and 3.5x claimed in the article.

    Also I looked at the generated NEON for validateUtf8 and it doesn’t look very well interleaved for four execution units at a glance. I bet there’s still M1 perf on the table here.

    https://news.ycombinator.com/user?id=bacon_blood

    1. Thanks for the quick update on a Sunday afternoon!

      I’m looking forward to seeing how you can best make use of the new hardware.

  4. You’ve known for over an hour that your benchmark was grossly flawed, and that your results are farcical.

    This is embarrassing. If you had any credibility at all you’d at least put a mea culpa at the top, but if you’re cowardly, just delete this, full stop.

    The “critics”, it turns out, were absolutely right. You wrote some lazy nonsense, and when called on it, made even worse lazy nonsense. Ouch.

    1. You’ve known for over an hour that your benchmark was grossly flawed, and that your results are farcical.

      1. I edited my code inside Visual Studio Code. I opened a terminal within Visual Studio Code and compiled there, not realizing that Visual Studio Code itself was running under Rosetta 2. Whether it is “a gross” mistake is up for debate. I think it was an easy mistake to make…

      2. It is Sunday here and I was with my family. I saw on Twitter that there was a mistake, and so I replied to the person that raised the issue that I would revisit the numbers. I did, a few hours later. Again: it is Sunday and I was with my family. The post was literally fixed the same day.

      Yes. I made a mistake. I admit. I also corrected it as quickly as possible.

      This is embarrassing. If you had any credibility at all you’d at least put a mea culpa at the top, but if you’re cowardly, just delete this, full stop.

      I added a paragraph in this blog post that says: “(This blog post has been updated after I corrected a methodological mistake: I was running the Apple M1 processor under x64 emulation.)”

      The older blog post contains a note that describes how I was in error.

      How am I being cowardly?

      At no point did I try to hide that I made a mistake. In fact, I state it openly.

      The “critics”, it turns out, were absolutely right. You wrote some lazy nonsense, and when called on it, made even worse lazy nonsense. Ouch.

      I was wrong about SIMD performance on the Apple M1.

      I get stuff wrong sometimes, especially when I write a quick blog post on a Sunday morning… But even when I am super careful, I sometimes make mistakes. That’s why I always encourage people to challenge me, to revisit my numbers and so forth.

    2. Daniel, thank you for making simdjson available to the world. I think others would share my opinion that while rude, aggressive, and accusatory posts are unfortunately to be expected on the internet, no response is required. I hope this won’t discourage you from posting in the future. Don’t let the trolls get you down!

    3. Come on, dude, that’s not necessary. There are few enough academics investigating ALL aspects of performance across a range of real-life hardware.

      Let he who is without sin cast the first stone; motes and beams; those remain wise words.

    4. “As for literary criticism in general: I have long felt that any reviewer who expresses rage and loathing for a novel or a play or a poem is preposterous. He or she is like a person who has put on full armor and attacked a hot fudge sundae or a banana split.”

      ― Kurt Vonnegut

  5. Daniel, your work and research has an amazing impact on the engineering community.

    You do not have to answer angry trolls that do not know you but gladly use any opportunity to humiliate and laugh at someone.

  6. Because the M1 (and I’m assuming its future iterations) is an SoC, data processing that needs SIMD (matrices, vectors, etc.) is delegated to other specialised units such as the on-die GPU and Neural Processor. IIRC, the M1 features new SIMD instructions that complement both the GPU and the Neural Unit; to what degree they are put to use depends on how Apple employs them in their Metal 2 API. This type of distributed processing is the modern approach, and I believe it’s the way to go.

  7. To be completely fair, Kaby Lake processors were released in August 2016, so it is a roughly 4-year-old design compared to the just-released M1. Ice Lake processors are also a year old and, as the test done by Nathan Kurz above shows, the Ice Lake processor does a much better job: it exceeds the validate result of the latest M1 at roughly 34.4 GB/s.

    I wonder how much the Intel performance is impacted due to Meltdown and Spectre patches. Ice lake solved some but not all of those issues.

    1. That’s why I stress the dates of the MacBooks. It is not “fair”.

      But it is more complicated even than that because the M1 uses less power than comparable Intel processors. So you’d want to account for energy use as well… something I do not do.

  8. Since the GPU and CPU share the same memory, what is the latency impact of dispatching SIMD-heavy work to the GPU on the M1?

  9. The M1 performed much better than I expected in SIMD benchmarks; the difference between 128-bit and 256-bit vector widths was the reason I was initially skeptical about Apple’s performance claims. But looking at benchmarks makes me certain that Apple’s MacBooks are headed in the right direction.

    I’m excited for the next iteration of Apple Silicon.

  10. A late comment, but did you use the Accelerate Framework? Apparently this taps additional SIMD units that are not directly accessible and can have a significant performance impact.
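
    For instance, a minimal sketch of calling into Accelerate’s vDSP (this just shows the call style with a vector add; it is not the minify/validate workload):

        #include <Accelerate/Accelerate.h>
        #include <cstdio>

        int main() {
          float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
          vDSP_vadd(a, 1, b, 1, c, 1, 4); // c[i] = a[i] + b[i]
          std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
          return 0;
        }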
