AMD Zen 2 and branch mispredictions

Intel makes some of the very best processors money can buy. For a long time, its main rival (AMD) failed to compete. However, AMD's latest generation of processors (Zen 2) appears to roughly match Intel, at a lower price point.

In several benchmarks that I care about, my good old Intel Skylake (2015) processor beats my brand-new Zen 2 processor.

To try to narrow it down, I created a fun benchmark. I run the following algorithm: generate random integers quickly, then check the two least significant bits. By design, no matter how good the processor is, there should be about one mispredicted branch per loop iteration.

while (true) {
    r = random_integer()
    if ((r AND 1) == 1) {
        write r to output
    }
    if ((r AND 2) == 2) {
        write r to output
    }
}

I record the number of CPU cycles per loop iteration. This number is largely independent of processor frequency, memory access and so forth. The main bottleneck in this case is branch misprediction.

Intel Skylake 29.7 cycles
AMD Zen 2 31.7 cycles

Thus it appears that in this instance the AMD Zen 2 has two extra cycles of penalty per mispredicted branch. If you run the same benchmark without the branching, the difference in execution time is about 0.5 cycles in favour of the Skylake processor. This suggests that AMD Zen 2 might waste between one to two extra cycles per mispredicted branch.

My code is available. I define a Docker container so my results are easy to reproduce.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “AMD Zen 2 and branch mispredictions”

  1. Well, the rng is in the timed loop as well, so I am not sure you can be certain the difference is solely in the branch prediction part.

    The timing looks more or less in line with what I’d expect: each branch can’t resolve until the rng value is calculated, and there are 3 shifts, 3 xors, and 2 muls in the rng, for a total of 12 cycles of latency, plus the & 1 and & 2 ops for 2 more cycles, so 14 cycles. The “standard” BP latency for Intel is usually quoted as 16 cycles, so 14 + 16 = 30, almost exactly in line with your results.

    A more precise test would move the rng outside of the timed loop, although the most obvious way to do that (read from an array) means you now have to be careful of caching effects. Another way would be to run the loop without mispredicts to see if the (latency-limited) baseline is the same.

      1. Indeed, my comment cuts both ways.

        Another way to check, would be to vary the mispredict rate from 0% to 100% (100% being what you have now) – if AMD and Intel times overlap at 0% and then diverge by 2 cycles at 100% you have strong proof.

  2. According to published measurements, the Skylake mispredict penalty is 16.5 cycles and Zen 1’s is 19 cycles. Zen 2 results aren’t listed, but the penalty may not have changed from Zen 1.

    Zen does seem to have a curiously long pipeline – about as long as the Bulldozer family’s. I’ve heard speculation that the intent was to allow the chip to clock fairly high (but the process didn’t allow it).
    It is interesting to note that Icelake increases the mispredict penalty by 1 cycle, so there may also be a trend towards longer pipelines.

    1. The penalty (apparently) also varies depending on whether the code after the mispredict hits in the uop cache or not. Intuitively this makes sense: if you need to decode the new target, you add all the decode stages to the penalty. I recall, however, that Agner said he couldn’t measure a difference.

      Those numbers seem fairly consistent with the gap Daniel found.

  3. While correct, your post is misleading…
    Real performance depends on both the branch misprediction cost and the number of mispredicted branches.

    In your example they are the same for Intel and AMD (since the code is totally unpredictable), but in more realistic scenarios it is possible that AMD has better branch prediction.

    You could benchmark that, but it is very tricky to say what “realistic branchy” code is.

  4. Well, some similar analysis has been done comparing the branch misprediction penalty of the Zen microarchitecture with Broadwell (BDW) and Skylake (SKL).

    “Zen’s branch prediction penalty is around 17 to 21 cycles, Kaby Lake is 16 to 20 cycles and Broadwell is 15 to 21 cycles. In general, Broadwell’s branch penalty is generally lower, about 15 cycles, KabyLake is slightly higher, about 17 cycles, and Zen’s predicted penalty value is generally about 19 cycles.
    Through testing, we infer that Zen’s pipeline number of bits should be around 19, but due to factors such as µOp-Cache (micro-operation cache), it can be as low as about 17 cycles.”

    This is not a new thing, since Zen (and Zen 2) has a deeper pipeline, and it is NOT necessarily related to IPC (BDW’s penalty is lower than Skylake’s, and if you are using an 8086 processor the penalty is only 4 cycles).

    If you want to take the branch misprediction penalty’s effect into consideration, it would be better to combine it with the branch misprediction rate on some real-world workload, instead of random integers where you get a 50% branch misprediction rate.

    Also, by saying “my good old Intel Skylake (2015) processor”, I hope you didn’t forget that Intel failed to ship a new architecture for the server and desktop platforms (where high performance is really needed) in 4 years, and the NEW architecture, Sunny Cove (Cannonlake is dead; Intel even wants everyone to forget about it together with the first-gen 10nm process), is limited to ultrabooks. That’s why Zen 2 is competing with Intel’s SKL microarchitecture. And for 2020, Intel will ship Cometlake (a Skylake refresh of a refresh of a refresh of a refresh) to survive the year.

  5. I am having trouble understanding why, when I use specific cpu/pid pinning in linux-perf-events.h, I get one cycle less on Zen+:

    #include <sys/types.h>
    pid_t pid2 = getpid();
    const int cpu = 1; // 0 indexed, second cpu
    fd = syscall(__NR_perf_event_open, &attribs, pid2, cpu, group, flags);

    then running via (taskset on debian):

    make clean && make && taskset 0x2 ./condrng

    gives me:

    cond 4.56 cycles/value 15.00 instructions/value branch misses/value 0.00
    cond 32.59 cycles/value 19.00 instructions/value branch misses/value 1.00

    But without this pinning I get:

    cond 4.51 cycles/value 15.00 instructions/value branch misses/value 0.00
    cond 33.90 cycles/value 19.00 instructions/value branch misses/value 1.00

    The man page suggests that the current version in the repo is valid; all I can suspect is that the cpu lookup is adding overhead.
