Hasty comparison: Skylark (ARM) versus Skylake (Intel)

In a previous post, I ran a benchmark on an ARM server and again on an Intel-based server. My purpose was to illustrate that if one function is faster, even much faster, on one processor, you cannot assume that it will also be faster on a vastly different processor. It wasn’t meant to be a deep statement, but even simple facts need illustration. Nevertheless, it was interpreted as an ARM versus Intel comparison.

In the initial numbers that I offered, the ARM Skylark processor that I am using did very poorly compared to the Intel Skylake processor. Eric Wallace explained away the result: the default compiler on my Linux CentOS machine appears to be unaware of my processor architecture (64-bit ARM, AArch64) and, incredibly enough, compiles the code down to 32-bit ARM instructions.

So let us get serious and use a recent compiler (GNU GCC 8) from now on.

And while we are at it, let us do a Skylark versus Skylake, ARM versus Intel, benchmark. I am going to pick three existing C programs from the Computer Language Benchmarks Game:

  1. Binarytree is a memory access benchmark. The code constructs binary trees and must traverse them.
  2. Mandelbrot is a number crunching benchmark.
  3. Fasta is a randomized string generation benchmark.

The Skylark processor is from a 32-core box with a reported maximum frequency of 3.3 GHz. The Skylake processor is from a 4-core box with a maximum frequency of 4 GHz. Here are the numbers I get.

              Skylark (ARM)   Skylake (Intel)
Binarytree    80 s            16 s
Mandelbrot    15 s            24 s
Fasta         2.0 s           0.8 s

My benchmark is available.

What can we conclude from these numbers? Nothing except maybe that the Skylark box struggles with Binarytree. That benchmark is dominated by the cost of memory allocation/deallocation.

Let me try another benchmark, this time from the cbitset library:

              Skylark (ARM)   Skylake (Intel)
create        23 ms           4.0 ms
bitset_count  3.2 ms          4.4 ms
iterate       5.0 ms          4.0 ms

The “create” benchmark is basically a memory-intensive test, whereas the two other tests are computational. Again, it seems that the ARM server struggles with memory allocations.

Is that something that has to do with the processor or the memory subsystem? Or is it a matter of compiler and standard libraries?

Update: Though the ARM box runs a relatively recent CentOS distribution, it comes with an older C library. Early testing seems to suggest that this software difference accounts for a sizeable fraction (though not all) of the performance gap between Skylake and Skylark.

Update 2: Using ‘jemalloc’, the ‘Binarytree’ timing goes from 80 s to 44 s while the ‘create’ timing goes from 23 ms to 13 ms. This gives me confidence that some of the performance gap reported above between Skylake and Skylark is due to software differences.

Daniel Lemire, "Hasty comparison: Skylark (ARM) versus Skylake (Intel)," in Daniel Lemire's blog, March 26, 2019.


18 thoughts on “Hasty comparison: Skylark (ARM) versus Skylake (Intel)”

  1. On my Haswell Macbook the results are closer to your results from Skylark.

    create(): 16.143000 ms
    bitset_count(b1): 1.414000 ms
    iterate(b1): 5.797000 ms
    iterate2(b1): 1.704000 ms
    iterate3(b1): 3.577000 ms
    iterateb(b1): 4.935000 ms
    iterate2b(b1): 1.632000 ms
    iterate3b(b1): 4.668000 ms

    And the profiler shows that most of the time is spent in bzero (which I suppose is part of realloc)

    564.40 ms 100.0% 0 s lemirebenchmark (8498)
    559.40 ms 99.1% 0 s start
    559.20 ms 99.0% 276.80 ms main
    207.50 ms 36.7% 60.50 ms create
    138.90 ms 24.6% 138.90 ms _platform_bzero$VARIANT$Haswell

    1. Yes… memory allocations are slow and expensive under macOS compared to Linux. That’s a software issue (evidently).

      That’s why I am not convinced that the relative weakness of the Skylark processor that I find is related to the processor. It might be how memory allocations are implemented under Linux for ARM.

      1. Yes, it looks like the speed difference is in kernel-mode page-fault handling. A Linux test on Ivy Bridge shows performance similar to Skylake.

          1. My test platform, a Cortex-A53, has glibc 2.28 and Linux alarm 5.0.4-1-ARCH. Results:

            create(): 45.415000 ms
            bitset_count(b1): 8.408000 ms
            iterate(b1): 25.324000 ms
            iterate2(b1): 11.455000 ms
            iterate3(b1): 30.555000 ms
            iterateb(b1): 25.781000 ms
            iterate2b(b1): 21.812000 ms
            iterate3b(b1): 32.944000 ms

              1. People writing their own malloc love to compare against glibc malloc, because it is such an easy target to beat.

                You can try LD_PRELOAD with jemalloc or tcmalloc.

  2. As I noted on Twitter, I think one way of getting more reproducible results is to use a container setup to make your dependencies more specific.

  3. Seems like you may have removed #pragma omp parallel for from the mandelbrot program?

    Some people are cross-checking your code against the benchmarks game website and becoming a little confused by that difference, so it may help to say in the blog post whatever you did or did not change.

  4. Yes, it’s unfortunate that distros don’t install an up-to-date GCC/GLIBC. Worse, both have many useless security features enabled which can severely impact performance. However, it’s relatively easy to build your own GCC and GLIBC, so that’s what I strongly recommend for benchmarking. Use the newly built GCC for building GLIBC. You can statically link any application with GLIBC – this works without needing schroot/docker and avoids dynamic linking overheads.

    GLIBC malloc has been improved significantly in the last few years: a fast-path was added for small block handling, and single-threaded paths avoid all atomic operations. I’ve seen the latter speed up malloc intensive code like Binarytree by 3-5 times on some systems. Note GLIBC has a low level hack for x86 which literally jumps over the lock prefix byte of atomic instructions. So the gain is less on x86, but it avoids nasty predecode conflicts which appear expensive.

    1. Btw, just to add: binarytree shows sub-20-second results on an Arm server with a recent GLIBC. GLIBC is faster than jemalloc on this benchmark.
