ARM and Intel have different performance characteristics: a case study in random number generation

In my previous post, I reviewed a new fast random number generator called wyhash. I commented that I expected it to do well on x64 processors (Intel and AMD), but not so well on ARM processors.

Let us review wyhash again:

#include <stdint.h>

uint64_t wyhash64_x; // generator state (the seed)

uint64_t wyhash64() {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp;
  // full 64x64 -> 128-bit product, then XOR the high and low halves
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (tmp >> 64) ^ tmp;
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  uint64_t m2 = (tmp >> 64) ^ tmp;
  return m2;
}

(Source code)

It is only two multiplications (plus a few cheap operations like add and XOR), but these are full multiplications producing a 128-bit output.

Let us compare it with a similar but conventional generator (splitmix) developed by Steele et al. and part of the Java library:

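uint64_t splitmix64_x; // generator state (the seed)

uint64_t splitmix64(void) {
  splitmix64_x += 0x9E3779B97F4A7C15;
  uint64_t z = splitmix64_x;
  // each multiplication keeps only the low 64 bits of the product
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EB;
  return z ^ (z >> 31);
}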

We still have two multiplications, but many more operations. So you would expect splitmix to be slower. And it is, on my typical x64 processor.

Let me reuse my benchmark, where I simply sum up 524288 random integers and record how long it takes…

            Skylake x64    Skylark ARM
wyhash      0.5 ms         1.4 ms
splitmix    0.6 ms         0.9 ms

According to my tests, on the x64 processor, wyhash is faster than splitmix. When I switch to my ARM server, wyhash becomes slower.

The difference is that computing the most significant 64 bits of the product of two 64-bit integers on an ARM processor requires a separate and potentially expensive instruction.
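To make the point concrete, here is a minimal sketch that splits the full product into its two halves. The helper names (mul_lo, mul_hi, fold_product) are hypothetical and only serve as illustration. On 64-bit ARM, compilers typically compute the low half with a MUL instruction and the high half with a separate UMULH instruction, whereas on x64 a single MUL instruction produces both halves. The exact instructions you get depend on your compiler and its flags.

```c
#include <stdint.h>

// Low 64 bits of the 128-bit product: an ordinary 64-bit multiplication.
static uint64_t mul_lo(uint64_t a, uint64_t b) {
  return a * b;
}

// High 64 bits of the 128-bit product. This is the part that needs
// its own instruction on 64-bit ARM.
// Assumes a compiler that supports __uint128_t (GCC and clang do).
static uint64_t mul_hi(uint64_t a, uint64_t b) {
  return (uint64_t)(((__uint128_t)a * b) >> 64);
}

// The folding step that wyhash64 applies twice: XOR the two halves.
static uint64_t fold_product(uint64_t a, uint64_t b) {
  return mul_hi(a, b) ^ mul_lo(a, b);
}
```

By contrast, splitmix only ever needs the equivalent of mul_lo, so it never pays for the high half.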

Of course, your results will vary depending on your exact processor and exact compiler.

Note: I have about half a million integers, so if you double my numbers, you will get a rough estimate of the number of nanoseconds per 64-bit integer generated.
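For readers who want to try something similar, the following is a rough sketch of such a benchmark, not my actual benchmark code. The helpers sum_random, time_sum_ms and ns_per_value are hypothetical names introduced for illustration, and I assume that timing with the standard clock() function is good enough for a first look. A serious benchmark would repeat the measurement several times and use a higher-resolution timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

// Generator state and function, as given earlier in the post.
static uint64_t wyhash64_x;

static uint64_t wyhash64(void) {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp = (__uint128_t)wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  return (uint64_t)(tmp >> 64) ^ (uint64_t)tmp;
}

// Hypothetical helper: sum `count` outputs of `rng`. Returning the sum
// ensures that the generated values are actually used.
static uint64_t sum_random(uint64_t (*rng)(void), size_t count) {
  uint64_t sum = 0;
  for (size_t i = 0; i < count; i++) {
    sum += rng();
  }
  return sum;
}

// Hypothetical helper: run sum_random once, store the sum in *sum_out,
// and return the elapsed processor time in milliseconds.
static double time_sum_ms(uint64_t (*rng)(void), size_t count,
                          uint64_t *sum_out) {
  clock_t start = clock();
  *sum_out = sum_random(rng, count);
  clock_t end = clock();
  return 1e3 * (double)(end - start) / CLOCKS_PER_SEC;
}

// Convert a total time in milliseconds into nanoseconds per value.
static double ns_per_value(double elapsed_ms, size_t count) {
  return elapsed_ms * 1e6 / (double)count;
}
```

With 524288 values, ns_per_value(0.5, 524288) comes out at about 0.95, which is close to the 1 ns you get by simply doubling 0.5.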

Update 1: W. Dijkstra correctly pointed out that wyhash could not possibly be several times faster than splitmix in a fair competition. I initially reported bad results with splitmix, but after disabling autovectorization (-fno-tree-vectorize), the results are closer. He also points out that results are vastly different on other ARM processors like Falkor and ThunderX2.

Update 2: One reading of this blog post is that I am claiming to compare Intel and ARM and to declare one better than the other. That was never my intention. My main message is that the underlying hardware matters a great deal when trying to determine which code is fastest.

Update 3: My initial results made the ARM processor look bad. Switching to a more recent compiler (GNU GCC 8.3) resolved the issue.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “ARM and Intel have different performance characteristics: a case study in random number generation”

  1. Computing the high bits of 64×64 is how expensive on this ARM server? I mean there’s a 20x relative difference in performance…

    What type of ARM? “Skylarke ARM” doesn’t turn up many hits – mostly stuff about a nice farm that does weddings.

  2. Results from Pine64 on Cortex-A53:

    wyrng 0.013576 s
    bogus:14643616649108139168
    splitmix64 0.010964 s
    bogus:18305447471597396837

    1. And here are the numbers from my laptop, Intel i5-4250U.
      wyrng 0.000929 s
      bogus:15649925860098344998
      splitmix64 0.000842 s
      bogus:15901732380406292985

      1. I don’t benchmark on laptops, but here is what I get on my Haswell server (i7-4770):

        $ g++ --version
        g++ (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
        $ g++  -std=c++11 -O2 -fno-tree-vectorize -o fastestrng fastestrng.cpp && ./fastestrng
        wyrng       0.000431 s
        bogus:14643616649108139168
        splitmix64      0.000587 s
        bogus:18305447471597396837
        lehmer64    0.000569 s
        bogus:16285628012437095220
        lehmer64 (3)    0.000392 s
        bogus:15342908890590157271
        lehmer64 (3)    0.000379 s
        bogus:18372309517275774290
        
        Next we do random number computations only, doing no work.
        wyrng       0.000442 s
        bogus:15649925860098344998
        splitmix64      0.000567 s
        bogus:15901732380406292985
        lehmer64    0.000568 s
        bogus:6253507633689833227
        lehmer64 (2)    0.000459 s
        bogus:17457190375316347997
        lehmer64 (3)    0.000361 s
        bogus:4305661330232405915
        

        Email me if you want access to it.

        1. I have enough hardware to test; next I want to try on a 64-bit Atom. But my point here is that the performance of such things does not really depend on the instruction set (ARMv8 vs amd64). It depends on the internal CPU architecture. Cortex-A53 and Apple A11 are both ARMv8 CPUs, but on the A11 wyrng is faster and on the A53 splitmix64 is faster.

            1. Another important point in such comparisons is the compiler. On my Cortex-A53, lehmer64 (2) is fastest with gcc, and lehmer64 (3) is fastest with clang. It looks like gcc generates a full 128×128-bit multiplication, while clang generates 128×64.

    2. And on an iPhone X with an Apple A11, wyrng is faster:
      wyrng 0.000563 s
      bogus:12179112671541558566
      splitmix64 0.000728 s
      bogus:808196752756138662
