Twitter user opdroid1234 remarked that they are getting more performance out of the ARM nodes than out of the Intel nodes on Amazon’s cloud (AWS).
I found previously that the Graviton 3 processors had less memory bandwidth than comparable Intel systems. However, I had not done much to compare raw compute power.
The Intel processors have the crazily good AVX-512 instructions: ARM processors have nothing close except for dedicated accelerators. But what about more boring computing?
We wrote a fast URL parser in C++. It does not do anything beyond portable C++: no assembly language, no explicit SIMD instructions, etc.
Can the ARM processors parse URLs faster?
I am going to compare the following node types:
- c6i.large: Intel Ice Lake (0.085 US$/hour)
- c7g.large: Amazon Graviton 3 (0.0725 US$/hour)
I am using Ubuntu 22.04 on both nodes. I make sure that cmake, ICU and GNU G++ are installed.
I run the following routine:
- git clone https://github.com/ada-url/ada
- cd ada
- cmake -B build -D ADA_BENCHMARKS=ON
- cmake --build build
- ./build/benchmarks/bench --benchmark_filter=Ada
The results are that the ARM processor is indeed slightly faster:
Intel Ice Lake | 364 ns/url
Graviton 3 | 320 ns/url
The Graviton 3 processor is about 15% faster. It is not the 20% to 30% that opdroid1234 reports, but the Graviton 3 nodes are also slightly cheaper.
Please note that (1) I am presenting just one data point, and I encourage you to run your own benchmarks; (2) I am sure that opdroid1234 is being entirely truthful; (3) I love all processors (Intel, ARM) equally; (4) I am not claiming that ARM is better than Intel or AMD.
Note: I do not own stock in ARM, Intel or Amazon. I do not work for any of these companies.
Further reading: Optimized PyTorch 2.0 inference with AWS Graviton processors
Graviton 3 has Arm SVE, which includes predication (in a 2×256-bit setup, though, so with significantly less throughput).
AVX-512 is far superior to SVE in terms of how powerful the instructions are. And though the Graviton 3 has 256-bit registers, it looks like most ARM designs are going back to 128-bit registers which will leave x64 processors with significantly more powerful SIMD instructions than ARM processors.
Hello,
A point worth considering is that hyperthreading is likely enabled on the x86 instances, which will have a negative impact even on single-threaded workloads if the machine is fully used.
For a fully loaded machine, 2 vCPU x86 are likely worse than 2 CPU Arm (until memory bandwidth hits the Arm machine with all its CPUs competing to access it :-).
I think this might explain why you don’t see the same speedup as opdroid1234 who seems to be running heavily threaded tasks (compilation).
One can always object that the comparison is biased in favour of Intel but in this instance, the x64 node is more expensive than the ARM node.
Granted, it is possible that Amazon is subsidizing its ARM hardware.
Here is an output from an Oracle cloud Neoverse-N1 Arm64 “Shared CPU” instance:
[root@instance-20220729-0825 ada]# ./build/benchmarks/bench --benchmark_filter=Ada
2023-03-02T14:44:17+00:00
Running ./build/benchmarks/bench
Run on (4 X 50 MHz CPU s)
Load Average: 0.61, 0.27, 0.10
ada spec: Ada follows whatwg/url
bytes/URL: 73.454545
curl : OMITTED
input bytes: 808
number of URLs: 11
performance counters: Enabled
Benchmark            Time      CPU      Iterations  UserCounters…
BasicBench_AdaURL    4152 ns   4145 ns      167020
   GHz=3.06742 cycle/byte=25.0557 cycles/url=1.84045k
   instructions/byte=41.4072 instructions/cycle=1.65261 instructions/ns=5.06924
   instructions/url=3.04155k ns/url=600 speed=194.92M/s
   time/byte=5.1303ns time/url=376.844ns url/s=2.65362M/s
That’s pretty useful: it demonstrates that Graviton 2 is essentially a vanilla implementation of the Neoverse N1, and that it can be replicated by any licensee.
In this test, the Oracle system does a bit worse (376.844ns/url vs 320 ns/url) but that’s a small difference. Furthermore, I used the Graviton 3, not 2.
No matter: your point is correct, I think; others can no doubt compete against Amazon’s Graviton processors. It just happens that it is easy for me to have access to AWS, so that’s what I use.
This makes Intel’s stance even more perilous, it seems to me.
The historical baggage of x86 (a complex CISC instruction set, heavy decoding logic, hard-to-predict behaviour, mode switches, workarounds for old bugs, and the extra logic and silicon that add size and latency) makes x86 less efficient.
Check this article by Erik.
https://erik-engheim.medium.com/arm-vs-risc-v-vector-extensions-992f201f402f
You know what metric would be super-interesting?
Watt-hours per URL, or maybe joules per URL.
I understand it’s impossible to estimate VM power consumption. Nevertheless, energy cost is important.
I agree.