In a previous post, I ran a benchmark on an ARM server and again on an Intel-based server. My purpose was to indicate that if one function is faster, even much faster, on one processor, you are not allowed to assume that it will also be faster on a vastly different processor. It wasn’t meant to be a deep statement, but even simple facts need illustration. Nevertheless, it was interpreted as an ARM versus Intel comparison.
In the initial numbers that I offered, the ARM Skylark processor that I am using did very poorly compared to the Intel Skylake processor. Eric Wallace explained away the result: The default compiler on my Linux CentOS machine appears to be unaware of my processor architecture (ARM Aarch64) and, incredibly enough, compiles the code down to 32-bit ARM instructions.
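A quick way to verify what a compiler is actually targeting is to print the pointer size: a 32-bit build reports 32-bit pointers. Here is a tiny check (my own snippet, not part of the benchmarks):

```c
#include <stdio.h>

int main(void) {
  /* 8 * sizeof(void *) is 32 on a 32-bit build and 64 on a 64-bit build */
  printf("compiled for %zu-bit pointers\n", 8 * sizeof(void *));
  return 0;
}
```

On a correctly configured Aarch64 toolchain, this should print 64.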
So let us get serious and use a recent compiler (GNU GCC 8) from now on.
And while we are at it, let us do a Skylark versus Skylake, ARM versus Intel, benchmark. I am going to pick three existing C programs from the Computer Language Benchmarks Game:
- Binarytree is a memory-access benchmark: the code constructs binary trees and must traverse them (see the sketch just after this list).
- Mandelbrot is a number crunching benchmark.
- Fasta is a randomized string generation benchmark.
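To give a flavor of why Binarytree is allocation-bound, here is a minimal sketch of the pattern (my simplified rendition, not the actual benchmarks game code): every node comes from its own malloc call, and the whole tree is traversed and then freed.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
  struct node *left, *right;
} node;

/* build a complete binary tree of the given depth, one malloc per node */
static node *make_tree(int depth) {
  node *n = malloc(sizeof(node));
  if (depth > 0) {
    n->left = make_tree(depth - 1);
    n->right = make_tree(depth - 1);
  } else {
    n->left = n->right = NULL;
  }
  return n;
}

/* traverse the tree, counting nodes */
static long check(const node *n) {
  if (n == NULL) return 0;
  return 1 + check(n->left) + check(n->right);
}

static void free_tree(node *n) {
  if (n == NULL) return;
  free_tree(n->left);
  free_tree(n->right);
  free(n);
}

int main(void) {
  node *t = make_tree(20); /* about two million small allocations */
  printf("%ld nodes\n", check(t));
  free_tree(t);
  return 0;
}
```

With roughly two million tiny nodes, nearly all the time goes to the allocator rather than to computation.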
The Skylark processor is from a 32-core box with a reported maximum frequency of 3.3 GHz. The Skylake processor is from a 4-core box with a maximum frequency of 4 GHz. Here are the numbers I get.
| | Skylark (ARM) | Skylake (Intel) |
|---|---|---|
| Binarytree | 80 s | 16 s |
| Mandelbrot | 15 s | 24 s |
| Fasta | 2.0 s | 0.8 s |
What can we conclude from these numbers? Nothing except maybe that the Skylark box struggles with Binarytree. That benchmark is dominated by the cost of memory allocation/deallocation.
Let me try another benchmark, this time from the cbitset library:
| | Skylark (ARM) | Skylake (Intel) |
|---|---|---|
| create | 23 ms | 4.0 ms |
| bitset_count | 3.2 ms | 4.4 ms |
| iterate | 5.0 ms | 4.0 ms |
The “create” benchmark is basically a memory-intensive test, whereas the other two tests are computational. Again, it seems that the ARM server struggles with memory allocations.
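For context, here is roughly what such a test looks like; I am using the cbitset API as described in its README (bitset_create, bitset_set, bitset_count, bitset_free), so treat the exact calls as an approximation rather than the benchmark's exact code:

```c
#include <stdio.h>
#include "bitset.h" /* from the cbitset library */

int main(void) {
  /* "create": setting ever-larger indices forces the backing
     array to be repeatedly reallocated and zeroed */
  bitset_t *b = bitset_create();
  for (size_t i = 0; i < 100000000; i += 100) {
    bitset_set(b, i);
  }
  /* "bitset_count": a purely computational pass over the words */
  printf("cardinality: %zu\n", bitset_count(b));
  bitset_free(b);
  return 0;
}
```

The repeated growth of the backing array is where the allocator, the memory subsystem, and the kernel's page-fault handling all come into play.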
Is that something that has to do with the processor or the memory subsystem? Or is it a matter of compiler and standard libraries?
Update: Though the ARM box has a relatively recent CentOS distribution, it comes with an older C library. Early testing seems to suggest that this software difference accounts for a sizeable fraction (though not all) of the performance gap between Skylake and Skylark.
Update 2: Using ‘jemalloc’, the ‘Binarytree’ timing goes from 80 s to 44 s while the ‘create’ timing goes from 23 ms to 13 ms. This gives me confidence that some of the performance gap reported above between Skylake and Skylark is due to software differences.
On my Haswell MacBook, the results are closer to your results from Skylark.
create(): 16.143000 ms
bitset_count(b1): 1.414000 ms
iterate(b1): 5.797000 ms
iterate2(b1): 1.704000 ms
iterate3(b1): 3.577000 ms
iterateb(b1): 4.935000 ms
iterate2b(b1): 1.632000 ms
iterate3b(b1): 4.668000 ms
And the profiler shows that most of the time is spent in bzero (which is called as part of realloc, I suppose):
564.40 ms 100.0% 0 s lemirebenchmark (8498)
559.40 ms 99.1% 0 s start
559.20 ms 99.0% 276.80 ms main
207.50 ms 36.7% 60.50 ms create
138.90 ms 24.6% 138.90 ms _platform_bzero$VARIANT$Haswell
Yes… memory allocations are slow and expensive under macOS compared to Linux. That’s a software issue (evidently).
That’s why I am not convinced that the relative weakness I find in the Skylark numbers is related to the processor. It might be how memory allocations are implemented under Linux for ARM.
Yes, it looks like the speed difference is in kernel-mode page-fault handling. A Linux test on Ivy Bridge shows performance similar to Skylake.
Further testing suggests that upgrading glibc might improve performance drastically.
My test platform, a Cortex-A53, has glibc 2.28 and Linux alarm 5.0.4-1-ARCH. Results:
create(): 45.415000 ms
bitset_count(b1): 8.408000 ms
iterate(b1): 25.324000 ms
iterate2(b1): 11.455000 ms
iterate3(b1): 30.555000 ms
iterateb(b1): 25.781000 ms
iterate2b(b1): 21.812000 ms
iterate3b(b1): 32.944000 ms
I need to find a way to upgrade my glibc to run my own tests.
People writing their own malloc love to compare against glibc malloc, because it is such an easy target to beat.
You can try LD_PRELOAD with jemalloc or tcmalloc.
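For example, a small allocation-churn test (illustrative only; the jemalloc library path varies by distribution) can be timed under each allocator without rebuilding:

```c
/* alloc_churn.c: a tiny allocation-heavy loop to compare allocators.
 *
 * Build:                   cc -O2 alloc_churn.c -o alloc_churn
 * Default allocator:       ./alloc_churn
 * jemalloc (path varies):  LD_PRELOAD=/usr/lib/libjemalloc.so ./alloc_churn
 */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  size_t sum = 0;
  for (int i = 0; i < 10000000; i++) {
    /* small short-lived blocks exercise the allocator fast path */
    char *p = malloc(64);
    p[0] = (char)i;
    sum += (size_t)(unsigned char)p[0];
    free(p);
  }
  printf("checksum: %zu\n", sum);
  return 0;
}
```

Since the binary is unchanged, any timing difference is attributable purely to the allocator.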
Probably the kernel version is more important here; memset spends most of its time in the kernel's page-fault handler.
See my “update 2”. I was able to drastically improve speed by switching to a new memory allocation library.
Interesting. I tried to build with jemalloc on my Cortex-A53 and the create test is slower than with glibc:
45 ms for glibc vs 66 ms for jemalloc
Here are straces for both cases; the syscalls used for memory allocation differ: https://gist.github.com/notorca/b8ab4ef1ef7780db8fa911b83aedac6f
My own results are similar in the sense that jemalloc seems to issue many more system calls (which I find surprising):
https://gist.github.com/lemire/7ca46ac9a28acce3f2654b9ce7a2350e
As I noted on Twitter, I think one way of getting more reproducible results is to use a container setup to make your dependencies more specific.
Yes, you are probably right.
Seems like you may have removed
#pragma omp parallel for
from the mandelbrot program? Some people are cross-checking your code against the benchmarks game website and becoming a little confused by that difference, so it may help to say in the blog post what you did or did not change.
Isaac: my code is available. You are correct that I did not go into the details, but I encourage you to review my code. It is implicit that the benchmarks are single-threaded. We are interested in the performance of each core, not of the whole system.
Yes, it’s unfortunate that distros don’t install an up-to-date GCC/GLIBC. Worse, both have many useless security features enabled which can severely impact performance. However, it’s relatively easy to build your own GCC and GLIBC, so that’s what I strongly recommend for benchmarking. Use the newly built GCC for building GLIBC. You can statically link any application with GLIBC; this works without needing schroot/docker and avoids dynamic linking overheads.
GLIBC malloc has been improved significantly in the last few years: a fast path was added for small block handling, and single-threaded paths avoid all atomic operations. I’ve seen the latter speed up malloc-intensive code like Binarytree by 3-5 times on some systems. Note GLIBC has a low-level hack for x86 which literally jumps over the lock prefix byte of atomic instructions. So the gain is less on x86, but it avoids nasty predecode conflicts which appear expensive.
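To illustrate, here is a sketch in the spirit of GLIBC's x86 atomics (the names are mine, not GLIBC's actual macros): a conditional branch tests a multithreaded flag and, in the single-threaded case, jumps over the one-byte lock prefix so the cmpxchg runs unlocked.

```c
#include <stdint.h>

/* stand-in for GLIBC's "is this process multithreaded?" flag */
extern int multiple_threads;

/* compare-and-swap that skips the lock prefix when single-threaded */
static inline int32_t cas32(int32_t *mem, int32_t oldval, int32_t newval) {
  int32_t ret;
  __asm__ __volatile__(
      "cmpl $0, %[mt]\n\t"
      "je 0f\n\t"          /* single-threaded: jump over the lock byte */
      "lock\n"             /* multithreaded: prefix applies to cmpxchg */
      "0:\tcmpxchgl %[nv], %[m]"
      : "=a"(ret), [m] "+m"(*mem)
      : [nv] "r"(newval), "0"(oldval), [mt] "m"(multiple_threads)
      : "memory", "cc");
  return ret;
}
```

The branch is cheap and well predicted, while skipping the lock byte avoids the cost of the bus lock entirely.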
Btw, just to add: binarytree shows sub-20-second results on an Arm server with a recent GLIBC. GLIBC is faster than Jemalloc on this benchmark.
> GLIBC is faster than Jemalloc on this benchmark.
That’s good to know. My intuition is that, at least on Linux, the GCC stack has great memory allocation.