Amazon’s new ARM servers: Graviton 2

Most servers on the Internet run on x64 processors, mostly made by Intel. Meanwhile, most smartphones run ARM processors.

From a business perspective, these are different technologies. The x64 processors are mostly designed by only two companies (Intel and AMD), with one very large dominant player (Intel). In contrast, ARM processors come in many different flavours. Apple, Qualcomm, Samsung and others design their own ARM processors. This diversity can be either a blessing or a curse. The blessing is that you get a lot of direct competition and many custom solutions. The curse is that it is harder to support ARM systems because there are so many variations.

Amazon is the largest public cloud provider, and they are large enough to design their own servers and even their own silicon. Some time ago, they launched a service (Graviton) based on their own ARM-based servers. I tested them out, but the performance just was not there. Amazon just announced a second iteration of these servers, called Graviton 2, and they claim a 7-fold performance increase over their previous ARM servers. The new servers are based on ARM processor designs made for servers, called Neoverse. I do not yet have access to these servers, but Matthew Wilson from Amazon was kind enough to run my standard JSON parsing benchmark (using simdjson and the twitter.json data file).

Compiling the results in a table suggests that these new Amazon servers have processors that are nearly as good as those in a flagship smartphone.

device                     processor    frequency  simdjson speed
iPhone XR                  A12          2.5 GHz    1.3 GB/s
Graviton 2                 Neoverse N1  2.5 GHz    1.1 GB/s
Ampere (first generation)  Skylark      3.2 GHz    0.37 GB/s
Rockpro64                  Cortex-A72   1.8 GHz    0.32 GB/s

My fast JSON parsing benchmark is just something I happen to care about. It is probably not representative of whatever you care about. In particular, it is CPU intensive, whereas real server workloads face many other bottlenecks.

Nevertheless, I find these results quite encouraging. If I normalize the speed by the frequency, I get that the new Neoverse N1 processor is 2.5 times faster than the Cortex-A72. When they come out, they may be the fastest publicly accessible ARM servers.

Amazon is claiming that these Graviton 2 servers offer much better performance than Intel-based servers (EC2 M5) and that they will be slightly cheaper. My expectation is that the better performance will be due in large part to a higher number of cores.

Update: Matthew Wilson reports 1.3 GB/s following some optimization work.

Science and Technology links (December 7th 2019)

  1. Incredibly, there is a new, simpler way to solve quadratic equations. I used to rely on completing the square, but this is better! There is a video report on the finding.
  2. Surgeons are putting patients in suspended animation (biostasis) during surgery, replacing their blood with a cold saline solution, for time-sensitive interventions.
  3. Statins are a widely prescribed class of medications. They may adversely affect cognition.

AMD Zen 2 and branch mispredictions

Intel makes some of the very best processors money can buy. For a long time, its main rival (AMD) failed to compete. However, AMD's latest generation of processors (Zen 2) appears to roughly match Intel, at a lower price point.

In several benchmarks that I care about, my good old Intel Skylake (2015) processor beats my brand-new Zen 2 processor.

To try to narrow it down, I created a fun benchmark. I run the following algorithm: I generate random integers quickly and then check the two least significant bits. By design, no matter how good the processor is, there should be one mispredicted branch per loop iteration.

while (true) {
   r = random_integer()
   if (( r AND 1) == 1) {
     write r to output
   }
   if (( r AND 2) == 2) {
     write r to output
   }
}

I record the number of CPU cycles per loop iteration. This number is largely independent of processor frequency, memory accesses and so forth. The main bottleneck in this case is branch misprediction.
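
For concreteness, here is a minimal self-contained sketch of the kind of loop being measured; I use splitmix64 as a stand-in random generator, and the cycle measurement itself is omitted.

#include <cstddef>
#include <cstdint>

// splitmix64: a simple 64-bit generator used here as a stand-in; the actual
// benchmark may rely on a different (fast) random source.
static inline uint64_t splitmix64(uint64_t &state) {
  uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  return z ^ (z >> 31);
}

// Each iteration tests the two least significant bits of a random value, so
// about one branch per iteration is mispredicted no matter how good the
// predictor is. The out buffer must be able to hold up to 2 * count values.
size_t run_trial(uint64_t *out, size_t count, uint64_t seed) {
  size_t written = 0;
  for (size_t i = 0; i < count; i++) {
    uint64_t r = splitmix64(seed);
    if ((r & 1) == 1) { out[written++] = r; }
    if ((r & 2) == 2) { out[written++] = r; }
  }
  return written;
}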

processor      cycles per iteration
Intel Skylake  29.7 cycles
AMD Zen 2      31.7 cycles

Thus it appears that in this instance the AMD Zen 2 has two extra cycles of penalty per mispredicted branch. If you run the same benchmark without the branching, the difference in execution time is about 0.5 cycles in favour of the Skylake processor. This suggests that AMD Zen 2 might waste between one and two extra cycles per mispredicted branch.

My code is available. I define a docker container so my results are easy to reproduce.

Instructions per cycle: AMD Zen 2 versus Intel

The performance of a processor is determined by several factors. For example, processors with a higher frequency tend to do more work per unit of time. Physics makes it difficult to produce processors with ever higher frequencies.

Modern processors can execute many instructions per cycle. Thus a 3.4 GHz processor completes 3.4 billion cycles per second, but it might easily execute 7 billion instructions per second on a single core.

Up until recently, Intel produced the best commodity processors: its processors had the highest frequencies, the most instructions per cycle, the most powerful instructions and so forth. However, Intel is increasingly being challenged. One smaller company that is making Intel nervous is AMD.

It has been reported that the most recent AMD processors surpass Intel in terms of instructions per cycle. However, it is not clear whether these reports are genuinely based on measures of instructions per cycle. Rather, it appears that they are measures of the amount of work done per unit of time, normalized by processor frequency.

In theory, a processor limited to one instruction per cycle could beat a modern Intel processor on many tasks if it had powerful instructions and faster data access. Thus “work per unit of time normalized by CPU frequency” and “instructions per cycle” are distinct notions.

I have only had access to a recent AMD processor (Zen 2) for a short time, but in that short time, I have routinely found that it has a lower number of instructions per cycle than even older Intel processors.

Let us consider a piece of software that achieves a high number of instructions per cycle: the fast JSON parser simdjson. Using GNU GCC 8 under Linux, I process a test file called twitter.json with the benchmark command-line tool parse. I record the number of instructions per cycle, as measured by CPU counters, in the two stages of processing; the two stages together effectively parse a JSON document. This is an instruction-heavy benchmark: the numbers of mispredicted branches and cache misses are small. Of these processors, the Skylake has the highest frequency. On the AMD side, I use a Rome (server) processor.

I find that AMD is about 10% to 15% behind Intel.

processor                 IPC (stage 1)  IPC (stage 2)
Intel Skylake (2015)      3.5            3.0
Intel Cannon Lake (2018)  3.6            3.1
Zen 2 (2019)              3.2            2.8

Another problem that I like is bitset decoding. That is, given an array of bits (0s and 1s), I want to find the locations of the ones. See my blog post Really fast bitset decoding for “average” densities. I benchmark just the “basic” decoder.

#include <cstdint>
#include <x86intrin.h> // for _tzcnt_u64 (GCC/Clang on x64)

// Write the position of each set bit of 'bits' (offset by 'idx') into base_ptr,
// advancing the cursor 'base' by one per set bit.
void basic_decoder(uint32_t *base_ptr, uint32_t &base,
                   uint32_t idx, uint64_t bits) {
  while (bits != 0) {
    base_ptr[base] = idx + _tzcnt_u64(bits); // index of the lowest set bit
    bits = bits & (bits - 1);                // clear the lowest set bit
    base++;
  }
}
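
For illustration, here is a hypothetical usage example (not part of the benchmark), assuming the basic_decoder function above is in scope: it decodes the set bits of a single 64-bit word.

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t positions[64];
  uint32_t count = 0;
  // The word 0b10010001 has bits 0, 4 and 7 set.
  basic_decoder(positions, count, 0 /* bit offset of this word */, 0b10010001ULL);
  for (uint32_t i = 0; i < count; i++) {
    printf("%u\n", positions[i]); // prints 0, 4 and 7
  }
  return 0;
}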

My source code is available.

processor             IPC
Intel Skylake (2015)  2.1
Zen 2 (2019)          1.4

So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!

Of course, your results will vary. And I am quite willing to believe that in many, maybe even most, real-world cases, AMD Zen 2 can do more work per unit of time than the best Intel processors. However, I feel that we should qualify these claims. AMD customers should not simply expect better instructions-per-cycle numbers on the tasks that they care about; they may even find lower numbers. AMD Zen 2 does not dominate Intel Skylake; it is more reasonable to expect rough parity.

Further reading: AMD Zen 2 and branch mispredictions

Better computational complexity does not imply better speed

A recent magazine article presents a theoretical result: Harvey and van der Hoeven have shown that you can multiply two n-bit integers using O(n log n) complexity. Roughly speaking, this means that as n grows large, there is a mathematical model of how long the resulting algorithm might run, and that model grows like n log n in the worst case. That is, it does not get much worse as n gets larger, mathematically speaking. It is likely that this result will go into textbooks as an important contribution to theoretical computer science.

So far so good. But then the article goes on to state that the authors have presented a faster way to compute multiplications. That is incorrect. At this point in time, they have not made this demonstration, or even this claim, to my knowledge. Possibly they have an implementation somewhere and might be able to answer my questions, but I don’t expect so.

So what is the problem with claiming that they have a faster and more efficient way to do multiplications?

  • Computational complexities like O(n log n) are not a speed or even a time in the real world.
  • They are not even a model of the running time or speed. They are a mathematical asymptotic model for n large. How large must n be for a favorable model to theoretically beat a less favorable model mathematically? Possibly as large as the number of atoms in the universe. Or maybe even larger than that. I am not joking: as far as I can tell, the Harvey and van der Hoeven algorithm would require more digits per integer than there are atoms in the universe to have a chance at being practical. Please pause to consider: if you are a scientist, you need to make predictions that apply to our universe; otherwise you are doing mathematics. Mathematics is a fine activity, but it is not to be confused with science or engineering. I can hear people clamouring that, in the end, big n always wins out. Let me quote a famous theoretical computer scientist on this topic:

    Asymptotics will eventually win out, as long as everything else stays fixed. But that’s the precise problem. Everything else doesn’t stay fixed.

  • But it gets worse: these are not scientific models. A scientific model would predict the running time of the algorithm given some implementation, within some error margin. However, these models do nothing of the sort. They are purely mathematical. They are not falsifiable. As long as they are mathematically correct, then they are always true. To be fair, some researchers like Knuth came up with models that closely mimic reasonable computers, but that’s not what people pursuing computational complexity bounds do, typically.
  • Importantly, this means that these asymptotic models make no prediction about the running time of actual code. If I ask Harvey and van der Hoeven how long (in seconds) their algorithm takes to compute the multiplication of 1024-byte integers on my MacBook, they cannot tell me from their paper alone. Thus they don’t know whether it is actually faster than the big-integer library I am using (gmplib).
  • Maybe you would argue that, eventually, their algorithm would beat whatever gmplib has implemented, given large enough integers. But that’s not a scientific (i.e., falsifiable) statement. I could implement their algorithm, apply it to 100000-byte integers and find that their approach is no faster. You could then argue that I need to go even larger. If I do and fail again, you might argue that I need to go even larger. That is not how science or engineering should work. If you have a scientific model, it should make predictions that can be correct or false about the real world. And maybe you end up telling me that if I use 2^1024 bits per integer, then surely the implementation of the new algorithm is faster: but it is again an unscientific statement because there is no such thing as 2^1024-bit integers and there never will be in this universe.
  • Possibly Harvey and van der Hoeven have more precise bounds than just a big-O result. But even if they make falsifiable predictions about actual use cases, it does not follow that they are correct. They are working from a mathematical model. It is an idealized version of a computer. Without checking, you cannot tell whether your model is correct. You have to test it out. And, importantly, if you make flawed predictions, then your model is incorrect from a scientific point of view. And what are you modelling exactly? If you are assuming that a computer that manages petabytes of memory works the same as a computer that has gigabytes of memory, something is fishy.

I am not just nitpicking. This is of critical importance. Medical researchers cure Alzheimer’s or cancer in marvelous ways almost weekly using “animal models” or in vitro (in the laboratory). These breakthroughs almost never translate to the real world. It is ridiculously hard to go from models to applications.

If you do theoretical work and arrive at a model that suggests that you have a better, faster algorithm, then you are not nearly done. The map is not the territory. If you are good, then your model should closely match reality. But, no matter how smart you are, and no matter how clever your mathematical work is, you can only claim to be solving a problem faster after you have implemented the solution on an actual computer and recorded the elapsed time.

I am not diminishing Harvey and van der Hoeven’s accomplishments. If my understanding is correct, their names will be in textbooks for a very, very long time. It is well deserved based on the mathematical construction.

But we are not able to multiply integers faster thanks to their paper. This paper of theirs is not an engineering contribution. To be fair, it may lead to one such contribution. Working with models is often a useful way to do better engineering. However, you have to keep in mind that the model is not reality and that reality is the only thing that matters in the end. If you lose track of this important fact, then you are falling back into prescientific thinking.

What if you disagree with my post? What if you think that Harvey and van der Hoeven’s contribution is indeed a step forward engineering-wise? Then you have a clear path forward: provide an implementation, benchmark it against well established software (e.g., gmplib) and show the benefits. No amount of white board arguments can make this requirement go away: show me the code.
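
To make that requirement concrete, here is a minimal sketch of the kind of baseline measurement I have in mind: timing gmplib's own multiplication on large integers. The bit size and repetition count below are arbitrary choices; any claimed speedup would have to beat measurements like these on real hardware.

#include <chrono>
#include <cstdio>
#include <gmp.h>

// Compile with: g++ -O2 bench_mul.cpp -lgmp
int main() {
  const unsigned long n_bits = 1 << 20; // one-million-bit integers (adjust as needed)
  gmp_randstate_t rng;
  gmp_randinit_default(rng);
  mpz_t a, b, c;
  mpz_inits(a, b, c, NULL);
  mpz_urandomb(a, rng, n_bits);
  mpz_urandomb(b, rng, n_bits);
  const int repeat = 100;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeat; i++) {
    mpz_mul(c, a, b); // gmplib picks its own algorithm (schoolbook, Karatsuba, FFT...)
  }
  auto stop = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(stop - start).count() / repeat;
  printf("%lu-bit multiplication: %f s\n", n_bits, seconds);
  mpz_clears(a, b, c, NULL);
  gmp_randclear(rng);
  return 0;
}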

Memory parallelism: AMD Rome versus Intel

When thinking about “parallelism”, most programmers think about having multiple processors. However, even a single core in a modern processor has plenty of parallelism. It can execute many instructions per cycle and, importantly, it can issue multiple memory requests concurrently.

Our processors are becoming “more parallel” over time, as is evident from the recent increases in the number of cores. AMD sells 64-core processors. But each individual core is also becoming “more parallel”.

To demonstrate, let me use a memory access test.

The starting point is a shuffled array. You access one entry in the array, read the result, and it points to the next location in the array. You get a random walk through the array. The test terminates when you have visited every location in the array.

Then you can “parallelize” this problem. Divide the random walk into two equal-size paths. Or divide it into 10 equal-size paths.

If you can issue only one memory request at any one time, parallelizing the problem won’t help. However, processors can issue more than one memory request and so as you parallelize the problem, your running times get smaller and your effective bandwidth higher.
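
To make the setup concrete, here is a minimal sketch of such a test (an illustration, not the code behind the numbers below). With one lane you get a purely latency-bound walk; adding lanes exposes more memory-level parallelism.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build one big random cycle over n entries, then walk it with `lanes`
// independent cursors started on disjoint segments of the cycle. Because the
// lanes do not depend on one another, their cache-missing loads can be in
// flight at the same time.
uint64_t chase(size_t n, size_t lanes, std::mt19937 &gen) {
  std::vector<uint32_t> order(n);
  std::iota(order.begin(), order.end(), uint32_t(0));
  std::shuffle(order.begin(), order.end(), gen);
  std::vector<uint32_t> next(n);
  for (size_t i = 0; i < n; i++) {
    next[order[i]] = order[(i + 1) % n]; // each entry points to the next step of the walk
  }
  std::vector<uint32_t> cursor(lanes);
  for (size_t l = 0; l < lanes; l++) {
    cursor[l] = order[l * (n / lanes)]; // start each lane on its own segment
  }
  uint64_t sum = 0;
  for (size_t step = 0; step < n / lanes; step++) {
    for (size_t l = 0; l < lanes; l++) {
      cursor[l] = next[cursor[l]];
      sum += cursor[l];
    }
  }
  return sum; // when lanes divides n, every entry is visited once; the sum keeps the compiler honest
}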

How has this evolved over time? The very latest Intel processors (e.g., Cannon Lake) can sustain more than 20 memory requests at any one time. It is about twice what the prior generation (Skylake) could do. How do the latest AMD processors fare? About the same. They can sustain nearly 20 concurrent memory requests at any one time. AMD does not quite scale as well as Intel, but it is close. In these tests, I am hitting RAM: the array is larger than the CPU cache. I am also using huge pages.

The important lesson is that if you are thinking about your computer as a sequential machine, you can be dramatically wrong, even if you are just using one core.

And there are direct consequences. It appears that many technical interviews for engineering positions have to do with linked lists. In a linked list, you access the elements of the list one by one, as the location of the next entry is always coded in the current entry and nowhere else. Obviously, this is a potential problem performance-wise because it makes it hard to exploit memory-level parallelism. And the larger your list and the more recent your processor, the worse it gets.
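
To see why, compare the traversal below with the random walk above: each load depends on the previous one, so the processor cannot overlap the cache misses. (A small illustrative sketch, not benchmark code.)

#include <cstdint>

struct Node {
  uint64_t value;
  Node *next;
};

// Summing a linked list: every load of p->next must complete before the next
// one can even begin, so the core's memory-level parallelism goes unused.
uint64_t sum_list(const Node *head) {
  uint64_t sum = 0;
  for (const Node *p = head; p != nullptr; p = p->next) {
    sum += p->value;
  }
  return sum;
}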

I make the raw results available. I use the testingmlp software package.

Cloud computing: a story of incentives

Many businesses today run “in the cloud”. What this often means is that they have abstracted out the hardware entirely. Large corporations like Amazon, Google, Microsoft or IBM operate the servers. The business only needs to access the software, remotely.

In theory, this means that you can adjust your capacity for just your needs. If you need only twelve servers most of the year, then you pay for only twelve servers. And on the specific days when you need 100 servers, you pay for the 100 servers on these days only. You may even use “serverless” computing and pay just for what you use, saving even more money.

Is this the whole story?

I am not so sure.

A tremendous benefit of cloud computing for the developers and operation people is that it cuts through the red tape. If you are using the cloud, then a developer can get ten more servers at the click of a button. I have met credible people from well-known businesses who told me that their developers have practically an unlimited ability to allocate new cloud resources.

If we make it easy for developers to quickly use lots of computing resources, these developers might effectively think of computing and storage as infinite. It also wipes away all incentives to produce efficient systems.

You may end up warming up the planet. It is not a joke: training a single machine-learning model can have a carbon footprint over four times that of a car over its entire lifetime.

Your ten servers end up needing a much more complicated architecture than your single laptop. But that is not obviously a negative for some developers who get to try out fancier-than-needed software, always useful on a resume.

Developers are not alone. When I take pictures these days, I never delete them. They get uploaded to the cloud and I forget about them. Two decades ago, when I had to manage the scarce storage available on my PC for digital pictures, I would spend a lot of time deleting bad pictures. I no longer care.

Further reading: The computing power needed to train AI is now rising seven times faster than ever before and Facebook AI research’s latest breakthrough in natural language understanding is running up against the limits of existing computing power. (credit: Peter Turney)

Science and Technology links (November 16th 2019)

    1. We have new technology to do genetic engineering on human beings (CRISPR). In a small clinical trial, the researchers tested it on live human subjects and found it to be safe.
    2. Girls and boys have similar intelligence according to psychometric tests, but girls do much better in school, and this is not due to their better verbal ability.
    3. Being illiterate is associated with dementia and cognitive decline in old age.
    4. Dumber people tend to smoke more cannabis, but cannabis itself does not appear to make people dumber.
    5. We transplanted pig skin onto human beings.
    6. A large fraction of Ph.D. students seek psychological help for anxiety and depression. The students often blame the Ph.D. supervisors.
    7. Researchers have apparently been able to reverse aging in the eyes of old mice. The mice’s eyes can then recover from a crushed nerve and from glaucoma.

Unrolling your loops can improve branch prediction

Modern processors predict branches (e.g., if-then clauses), often many cycles ahead of time. When a prediction is incorrect, the processor must throw away the speculative work and start again, an expensive process. Adding a hard-to-predict branch can multiply your running time.

Does it mean that you should only care about hard-to-predict branches? Unfortunately no.

In a prior blog post, I showed how adding a predictable branch can degrade the branch prediction of other branches.

Conversely, removing a predictable branch might improve branch predictions. Wilco Dijkstra suggested a way to test it out. A common form of predictable branch is a loop. Though a loop does not look like an if-then clause, it still compiles to a branch. A loop is predictable if it iterates long enough. You can “unroll” loops, that is, reduce the number of iterations by doing more work with each iteration. Unrolling does more than just remove a predictable branch, but removing the branch is one of its effects.

Let us consider an example. I generate random numbers. I check whether they are odd, and when they are, I write the result to an array. It is a silly example meant to illustrate the effect of branch prediction. Because my random numbers are generated by a pseudo-random number generator, I can run many trials with the same random numbers. The more often I repeat the loop, the better the branch prediction gets.

  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

We can unroll this loop: instead of generating one random number per loop iteration, I generate two. You can also generate four or eight, and so forth. The unrolled loop has fewer branches because each iteration through the loop is an implicit branch.

  uint64_t randomval;
  while (howmany >= 2) {
    randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    randomval = rng(howmany - 1 + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany-=2;
  }  
  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

I implemented this benchmark using 1000 random numbers per trial. I record the number of mispredicted branches per random number generated on different processors.
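
For reference, here is a sketch of one way to count mispredicted branches on Linux with perf_event_open; the actual benchmark harness may do this differently.

#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open a hardware counter for mispredicted branches on the current thread.
static int open_branch_miss_counter() {
  perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_BRANCH_MISSES;
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
}

// Typical usage around the loop under test (run_loop is a placeholder):
//   int fd = open_branch_miss_counter();
//   ioctl(fd, PERF_EVENT_IOC_RESET, 0);
//   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
//   run_loop();
//   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
//   uint64_t misses; read(fd, &misses, sizeof(misses));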

ARM 64-bit (Skylark):

trial  basic loop  unrolled twice  unrolled four times
1      58%         57%             56%
2      48%         33%             26%
3      48%         28%             21%
15     45%         18%             8%

Intel x64 (Cannon Lake):

trial  basic loop  unrolled twice  unrolled four times
1      53%         51%             49%
2      50%         35%             30%
3      46%         20%             17%
15     40%         0.5%            0.4%


Adding a (predictable) branch to existing code can increase branch mispredictions

Software is full of “branches”. They often take the form of if-then clauses in code. Modern processors try to predict the outcome of branches, often long before evaluating them. Hard-to-predict branches are a challenge performance-wise because when a processor fails to correctly predict a branch, it does useless work that must be thrown away.

A convenient illustration is an algorithm that generates a random number and then only appends it to a list if the random number is odd*. When the numbers are genuinely random, half of the branches will be mispredicted. However, if we generate the same 2000 numbers using a pseudo-random number generator, the processor might learn to predict more accurately which numbers are odd.

while (howmany != 0) {
    randomval = random();
    if (randomval is odd)
      append randomval to array
    howmany--;
 }

What if we add a predictable branch? Let us say that we check whether the random 64-bit value is some arbitrary number. This new branch will be easily predicted as false.

while (howmany != 0) {
    randomval = random();
    if (randomval is 12313132)
       generate error
    if (randomval is odd)
      append randomval to array
    howmany--;
 }
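
In C-style code, reusing the rng helper and the seed, out and howmany variables from the benchmark code in the previous post, the second version might look like this sketch:

  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if (randomval == 12313132) // essentially never true, hence easy to predict
      abort();                 // "generate error" (abort is from <cstdlib>)
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }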

Since the new branch is predictable, maybe it comes nearly for free?

Let us run 10 trials of the first algorithm, then 10 trials of the second, and so forth repeatedly, until the branch predictor is practically stable.

Let us count the number of mispredicted branches per loop iteration. We added an easy-to-predict branch, so it should not contribute directly to the number of mispredicted branches. I get the following numbers…

processor      one hard branch  one hard, one easy branch
Intel Skylake  4% to 9%         30% to 40%
ARM A72        24% to 26%       49% to 51%

So, at least in this particular test, the mere addition of an easy-to-predict branch substantially increased the number of mispredicted branches.

My source code is available.

Note: The loop itself is an easily-predicted branch since the processor must determine whether it continues for another iteration or not at the end of each iteration.

* It is not a practical algorithm; it only serves to illustrate my point.