Benchmarking ARM processors: Graviton 4, Graviton 3 and Apple M2

The world of commodity processors is roughly divided in two: x64 chips for servers and PCs, and ARM processors for mobile devices. However, ARM chips are increasingly common in servers and laptops. My own favorite laptop is an Apple MacBook with an M2 chip. Amazon has been producing its own ARM processors (Graviton) and it recently made available the latest of these chips, the Graviton 4. It is reportedly based on a Neoverse V2 design from ARM, while the previous generation (Graviton 3) was based on the Neoverse V1. Qualcomm has also released high-performance ARM chips for Windows laptops, the Snapdragon X Elite.

I decided to quickly test Amazon’s new Graviton 4 processor. I am deliberately not comparing against x64 processors. It is far easier to compare ARM against ARM given that you can run exactly the same instructions.

In a previous benchmark, I found that, at least for the type of work that I do, the Graviton 3 processor ran slower than an Apple M2, even after correcting for its lower frequency. The Graviton 3 runs at up to 2.6 GHz, while the Apple M2 can run at up to 3.5 GHz. The Graviton 4 runs at 2.8 GHz. Going by the clock speed alone, you would not expect much of a performance gain going from the Graviton 3 to the Graviton 4 (at most 10%).
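
As a quick sanity check on that bound, here is a minimal Python sketch of the gain one would expect if performance scaled purely with clock frequency. The function name is made up for illustration, and the frequencies are the ones quoted above.

```python
def frequency_only_speedup(old_ghz: float, new_ghz: float) -> float:
    """Speedup expected if performance scaled linearly with clock frequency."""
    return new_ghz / old_ghz


# Graviton 3 (2.6 GHz) to Graviton 4 (2.8 GHz), frequencies as quoted in the text.
gain = frequency_only_speedup(2.6, 2.8) - 1.0
print(f"expected gain from frequency alone: {gain:.1%}")
```

The result is a little under 8%, so any larger gain has to come from the core doing more work per cycle.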

Let us run benchmarks. Under AWS, I am going to use Ubuntu 24 with GCC 13. Under macOS, I am using the latest Apple LLVM (15).

URL parsing (C++)

Let us start with a URL parsing benchmark. We use the Ada URL parser to parse thousands of URLs, as fast as possible. To reproduce, do the following:

git clone https://github.com/ada-url/ada.git
cd ada
cmake -B build -DADA_BENCHMARKS=ON
cmake --build build
sudo ./build/benchmarks/benchdata

We focus on the BasicBench_AdaURL_aggregator_href results.

I find that the AWS Graviton 4, though it runs at a lower frequency, nearly matches the Apple M2 (168 ns/url against 160 ns/url).

system            ns/url    GHz    instructions/cycle
AWS Graviton 3    260       2.6    3.4
AWS Graviton 4    168       2.8    4.7
Apple M2          160       3.4    4.4
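
The three columns can be combined: time per URL, multiplied by cycles per nanosecond and by instructions per cycle, gives an estimate of the instructions retired per URL. The Python sketch below does this with the rounded values from the table. The function names are hypothetical, and the results are only as precise as the rounded inputs.

```python
def instructions_per_url(ns_per_url: float, ghz: float, ipc: float) -> float:
    """ns/url * cycles/ns * instructions/cycle = instructions/url."""
    return ns_per_url * ghz * ipc


def million_urls_per_second(ns_per_url: float) -> float:
    """Convert a time in ns/url into millions of URLs per second."""
    return 1e3 / ns_per_url


# (ns/url, GHz, instructions/cycle), copied from the table above.
results = {
    "AWS Graviton 3": (260, 2.6, 3.4),
    "AWS Graviton 4": (168, 2.8, 4.7),
    "Apple M2": (160, 3.4, 4.4),
}

for system, (ns, ghz, ipc) in results.items():
    print(f"{system}: {instructions_per_url(ns, ghz, ipc):.0f} instructions/url, "
          f"{million_urls_per_second(ns):.1f} Murl/s")
```

All three systems land between roughly 2200 and 2400 instructions per URL, so the differences in ns/url come mostly from frequency and instructions per cycle rather than from the amount of work executed.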

Unicode Validation (C#)

We recently published a fast Unicode validation library for .NET 8/C#. It contains many benchmarks, but let me consider the validation of a JSON file.

To reproduce:

sudo apt-get install -y dotnet-sdk-8.0
git clone https://github.com/simdutf/SimdUnicode.git
cd SimdUnicode/
cd benchmark/
dotnet run --configuration Release --filter "*Twitter*"

This time I find that the AWS Graviton 4 is significantly slower than the Apple M2, though it is much faster than the Graviton 3. Even adjusting for CPU frequency, the AWS Graviton 4 is slower: the frequency is about 20% lower, but the SimdUnicode throughput is about 24% lower (19 GB/s against 25 GB/s).

system            GB/s (SimdUnicode)    GB/s (.NET Runtime Library)
AWS Graviton 3    14                    9
AWS Graviton 4    19                    11
Apple M2          25                    14
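
To make the frequency adjustment explicit, the following Python sketch compares how far the Graviton 4 falls short of the Apple M2 in clock frequency and in throughput. It assumes the peak frequencies quoted earlier (2.8 GHz and 3.5 GHz) and the rounded throughputs from the table; the helper name is made up for illustration.

```python
def relative_deficit(value: float, reference: float) -> float:
    """Fraction by which `value` falls short of `reference`."""
    return 1.0 - value / reference


# Graviton 4 against Apple M2.
frequency_gap = relative_deficit(2.8, 3.5)    # peak GHz quoted in the text
simdunicode_gap = relative_deficit(19, 25)    # GB/s, SimdUnicode column
runtime_gap = relative_deficit(11, 14)        # GB/s, .NET Runtime Library column

print(f"frequency:    {frequency_gap:.0%} lower")
print(f"SimdUnicode:  {simdunicode_gap:.0%} lower")
print(f".NET runtime: {runtime_gap:.0%} lower")
```

With these rounded inputs, both throughput gaps come out larger than the frequency gap, which is what it means for the Graviton 4 to trail even after adjusting for frequency.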

JSON parsing (C++)

Let us try the simdjson library. To reproduce, do the following:

git clone https://github.com/simdjson/simdjson.git
cd simdjson
cmake -B build -D SIMDJSON_DEVELOPER_MODE=ON
cmake --build build -j
./build/benchmark/bench_ondemand --benchmark_filter="find_tweet"

On this benchmark, we find again that the AWS Graviton 4 is significantly better than the AWS Graviton 3, but somewhat behind the Apple M2, even after adjusting for its lower frequency.

system            GB/s    instructions/cycle
AWS Graviton 3    3.6     4.4
AWS Graviton 4    4.6     5.1
Apple M2          6.4     5.7
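
One way to adjust for frequency is to express throughput in bytes per cycle, and from there to estimate the instructions spent per input byte. The Python sketch below does so. This table does not list clock frequencies, so the sketch borrows the measured values from the URL parsing table (2.6, 2.8 and 3.4 GHz); that is an assumption, and the function names are hypothetical.

```python
def bytes_per_cycle(gb_per_s: float, ghz: float) -> float:
    """(10^9 bytes/s) / (10^9 cycles/s) = bytes/cycle."""
    return gb_per_s / ghz


def instructions_per_byte(gb_per_s: float, ghz: float, ipc: float) -> float:
    """(instructions/cycle) / (bytes/cycle) = instructions/byte."""
    return ipc / bytes_per_cycle(gb_per_s, ghz)


# (GB/s, assumed GHz, instructions/cycle)
results = {
    "AWS Graviton 3": (3.6, 2.6, 4.4),
    "AWS Graviton 4": (4.6, 2.8, 5.1),
    "Apple M2": (6.4, 3.4, 5.7),
}

for system, (gbps, ghz, ipc) in results.items():
    print(f"{system}: {bytes_per_cycle(gbps, ghz):.2f} bytes/cycle, "
          f"{instructions_per_byte(gbps, ghz, ipc):.2f} instructions/byte")
```

Under these assumptions, the three systems spend a similar number of instructions per byte (a little over 3), while the bytes processed per cycle increase from the Graviton 3 to the Graviton 4 to the Apple M2.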

Base64 encoding and decoding (C++)

This time we are going to use the simdutf library and its fast base64 encoding and decoding functions. To reproduce:

git clone https://github.com/simdutf/simdutf.git
cd simdutf
cmake -B build -D SIMDUTF_BENCHMARKS=ON
cmake --build build
./build/benchmarks/base64/benchmark_base64 -r README.md

We focus on the simdutf::arm64 results. Again, the AWS Graviton 4 is faster than the AWS Graviton 3, but slower than the Apple M2, even after adjusting for CPU frequency.

system            GB/s    instructions/cycle
AWS Graviton 3    2.8     2.2
AWS Graviton 4    3.5     2.6
Apple M2          6.7     3.7

Number Parsing (C++)

We can use a number-parsing benchmark used to assess the fast_float library. To reproduce:

git clone https://github.com/lemire/simple_fastfloat_benchmark.git
cd simple_fastfloat_benchmark
cmake -B build
cmake --build build
./build/benchmarks/benchmark

We care about the fast_float results. We get similar results, again.

system            GB/s
AWS Graviton 3    1.0
AWS Graviton 4    1.3
Apple M2          1.7

Bandwidth

I ran the memory-level parallelism benchmark. I find that the Graviton 4 is slightly better than the Graviton 3. However, the difference is small and you might not notice it in practice. This is a pointer-chasing benchmark in which you follow several chains at once. Once you get to over 10 ‘lanes’, it becomes sensitive to noise. The Graviton 3 ‘noise’ visible in the graph is likely measurement error.
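
To illustrate what following several pointer chains at once means, here is a small Python sketch of the access pattern. It is not the benchmark itself: Python's interpreter overhead would swamp memory latency, so a real measurement needs a compiled language and an array much larger than the caches. All names are hypothetical.

```python
import random


def random_cycle(n: int, seed: int = 1234) -> list[int]:
    """Sattolo's algorithm: a random permutation made of one cycle of length n."""
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, which guarantees a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def chase(next_index: list[int], lanes: int, steps: int) -> list[int]:
    """Follow `lanes` independent pointer chains for `steps` hops each."""
    n = len(next_index)
    # Spread the starting indexes evenly over the array.
    positions = [(lane * n) // lanes for lane in range(lanes)]
    for _ in range(steps):
        # Each lane's next load depends only on that lane's previous load,
        # so loads from different lanes can be in flight at the same time.
        positions = [next_index[p] for p in positions]
    return positions


print(chase(random_cycle(1 << 16), lanes=4, steps=1000))
```

Within one lane, every hop must wait for the previous load to complete. Adding lanes gives the processor independent loads to overlap, and the time per hop should keep falling until the core runs out of capacity for outstanding memory requests.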

Conclusion

These few tests suggest that the Graviton 4 processor is not quite a match for Apple Silicon on a per-core basis. However, it is a significant step up from the Graviton 3. Even though both Gravitons have nearly the same clock speed, the Graviton 4 is much faster (e.g., by 30%). The Graviton 4 can retire many more instructions per cycle than the Graviton 3.
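
The per-benchmark gains can be summarized with a geometric mean. The Python sketch below uses the rounded Graviton 3 and Graviton 4 figures reported in this post. The choice of a geometric mean is mine, as a conventional way to average ratios, and the names are made up for illustration.

```python
import math

# (Graviton 3, Graviton 4) results as reported in this post.
# URL parsing is a time in ns/url (lower is better); the others are GB/s.
url_ns = (260, 168)
throughputs = {
    "Unicode validation": (14, 19),
    "JSON parsing": (3.6, 4.6),
    "base64": (2.8, 3.5),
    "number parsing": (1.0, 1.3),
}


def speedups() -> dict[str, float]:
    """Graviton 4 speedup over Graviton 3 for each benchmark."""
    ratios = {"URL parsing": url_ns[0] / url_ns[1]}
    for name, (g3, g4) in throughputs.items():
        ratios[name] = g4 / g3
    return ratios


def geometric_mean(values) -> float:
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, ratio in speedups().items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean: {geometric_mean(speedups().values()):.2f}x")
```

The individual speedups range from about 1.25x to about 1.55x, with a geometric mean a little above 1.3x, while the clock frequency alone accounts for less than 1.1x.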

benchmark             Graviton 3    Graviton 4
clock frequency       2.6 GHz       2.8 GHz
URL parsing           3.8 Murl/s    5.9 Murl/s
Unicode validation    14 GB/s       19 GB/s
simdjson              3.6 GB/s      4.6 GB/s
base64                2.8 GB/s      3.5 GB/s
number parsing        1.0 GB/s      1.3 GB/s

Daniel Lemire, "Benchmarking ARM processors: Graviton 4, Graviton 3 and Apple M2," in Daniel Lemire's blog, July 10, 2024.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).
