ARM vs Intel on Amazon’s cloud: A URL Parsing Benchmark

Twitter user opdroid1234 remarked that they are getting more performance out of the ARM nodes than out of the Intel nodes on Amazon’s cloud (AWS).

I found previously that the Graviton 3 processors have less memory bandwidth than comparable Intel systems. However, I have not done much to compare raw compute power.

The Intel processors have the crazily good AVX-512 instructions: ARM processors have nothing close except for dedicated accelerators. But what about more boring computing?

We wrote a fast URL parser in C++. It uses nothing beyond portable C++: no assembly language, no explicit SIMD instructions, and so forth.
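To illustrate what "portable C++" parsing means here, consider extracting the scheme of a URL. This is a simplified sketch of the general technique, not the actual code from our parser: plain character-by-character logic over a std::string_view, with no intrinsics.

```cpp
#include <cctype>
#include <optional>
#include <string_view>

// Simplified sketch (not the actual parser code): extract the scheme of a
// URL using only portable C++. A valid scheme starts with an ASCII letter
// and continues with letters, digits, '+', '-' or '.' up to the first ':'.
std::optional<std::string_view> parse_scheme(std::string_view url) {
  size_t colon = url.find(':');
  if (colon == std::string_view::npos || colon == 0) {
    return std::nullopt;  // no colon, or empty scheme
  }
  std::string_view scheme = url.substr(0, colon);
  if (!std::isalpha(static_cast<unsigned char>(scheme.front()))) {
    return std::nullopt;
  }
  for (char c : scheme) {
    if (!std::isalnum(static_cast<unsigned char>(c)) && c != '+' &&
        c != '-' && c != '.') {
      return std::nullopt;
    }
  }
  return scheme;  // e.g. "https" for "https://example.com"
}
```

A full WHATWG-compliant parser is far more involved (host parsing, percent-encoding, and so on), but it is all in this same spirit.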

Can the ARM processors parse URLs faster?

I am going to compare the following node types:

  • c6i.large: Intel Ice Lake (0.085 US$/hour)
  • c7g.large: Amazon Graviton 3 (0.0725 US$/hour)

I am using Ubuntu 22.04 on both nodes. I make sure that cmake, ICU and GNU G++ are installed.

I run the following routine:

  • git clone https://github.com/ada-url/ada
  • cd ada
  • cmake -B build -D ADA_BENCHMARKS=ON
  • cmake --build build
  • ./build/benchmarks/bench --benchmark_filter=Ada

The results are that the ARM processor is indeed slightly faster:

Intel Ice Lake   364 ns/url
Graviton 3       320 ns/url

The Graviton 3 processor is about 15% faster. It is not the 20% to 30% that opdroid1234 reports, but the Graviton 3 nodes are also slightly cheaper.

Please note that (1) I am presenting just one data point, and I encourage you to run your own benchmarks; (2) I am sure that opdroid1234 is being entirely truthful; (3) I love all processors (Intel, ARM) equally; (4) I am not claiming that ARM is better than Intel or AMD.

Note: I do not own stock in ARM, Intel or Amazon. I do not work for any of these companies.

Further reading: Optimized PyTorch 2.0 inference with AWS Graviton processors

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “ARM vs Intel on Amazon’s cloud: A URL Parsing Benchmark”

  1. The Intel processors have the crazily good AVX-512 instructions: ARM processors have nothing close

    Graviton3 has Arm SVE, which includes predication. (in a 2x 256b setup, so significantly less throughput though)

    1. AVX-512 is far superior to SVE in terms of how powerful the instructions are. And though the Graviton 3 has 256-bit registers, it looks like most ARM designs are going back to 128-bit registers which will leave x64 processors with significantly more powerful SIMD instructions than ARM processors.

  2. Hello,

a point worth considering is that hyperthreading is likely enabled on the x86 instances, which will have a negative impact even on single-threaded workloads if the machine is fully used.

    For a fully loaded machine, 2 vCPU x86 are likely worse than 2 CPU Arm (until memory bandwidth hits the Arm machine with all its CPUs competing to access it :-).

    I think this might explain why you don’t see the same speedup as opdroid1234 who seems to be running heavily threaded tasks (compilation).

    1. One can always object that the comparison is biased in favour of Intel but in this instance, the x64 node is more expensive than the ARM node.

      Granted, it is possible that Amazon is subsidizing its ARM hardware.

Here is an output from Oracle cloud Neoverse-N1 Arm64 “Shared CPU”

[root@instance-20220729-0825 ada]# ./build/benchmarks/bench --benchmark_filter=Ada
    2023-03-02T14:44:17+00:00
    Running ./build/benchmarks/bench
    Run on (4 X 50 MHz CPU s)
    Load Average: 0.61, 0.27, 0.10
    ada spec: Ada follows whatwg/url
    bytes/URL: 73.454545
    curl : OMITTED
    input bytes: 808
    number of URLs: 11

    performance counters: Enabled

    Benchmark Time CPU Iterations UserCounters…

    BasicBench_AdaURL 4152 ns 4145 ns 167020 GHz=3.06742 cycle/byte=25.0557 cycles/url=1.84045k instructions/byte=41.4072 instructions/cycle=1.65261 instructions/ns=5.06924 instructions/url=3.04155k ns/url=600 speed=194.92M/s time/byte=5.1303ns time/url=376.844ns url/s=2.65362M/s

    1. Sorry about unreadable copy/paste

      performance counters: Enabled

      Benchmark Time CPU Iterations UserCounters…

      BasicBench_AdaURL 4152 ns 4145 ns 167020

      GHz=3.06742 cycle/byte=25.0557 cycles/url=1.84045k instructions/byte=41.4072 instructions/cycle=1.65261 instructions/ns=5.06924 instructions/url=3.04155k ns/url=600 speed=194.92M/s time/byte=5.1303ns time/url=376.844ns url/s=2.65362M/s

That’s pretty useful; it demonstrates well that Graviton 2 is nothing but a vanilla implementation of the Neoverse-N1, and that it can be replicated by any licensee.

In this test, the Oracle system does a bit worse (376.844 ns/url vs 320 ns/url), but that’s a small difference. Furthermore, I used the Graviton 3, not the Graviton 2.

No matter: your point is correct, I think; others can no doubt compete against Amazon’s Graviton processors. It just happens that it is easy for me to have access to AWS, so that’s what I use.

          This makes Intel’s stance even more perilous, it seems to me.

  4. You know what metric would be super-interesting?

    watt hours / URL or maybe joules / URL.

    I understand it’s impossible to estimate VM power consumption. Nevertheless, energy cost is important.
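The conversion itself is simple once you have a power figure; the wattage below is a purely hypothetical placeholder (per-vCPU power on a shared cloud node is unknown), so only the formula is meaningful:

```cpp
// Energy per URL: one watt sustained for one nanosecond is one nanojoule,
// so nJ/URL = watts * ns/URL.
double nanojoules_per_url(double watts, double ns_per_url) {
  return watts * ns_per_url;
}
// With a hypothetical 2 W attributed to one vCPU, the Graviton 3 figure of
// 320 ns/url would come to 640 nJ per URL.
```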
