Software is full of “branches”. They often take the form of if-then clauses in code. Modern processors try to predict the result of branches often long before evaluating them. Hard-to-predict branches are a challenge performance-wise because when a processor fails to predict correctly a branch, it does useless work that must be thrown away.

A convenient illustration is an algorithm that generates a random number and then only appends it to a list if the random number is odd^{*}. When the numbers are genuinely random, half of the branches will be mispredicted. However, if we generate the same 2000 numbers using a pseudo-random number generator, the processor might learn to predict more accurately which number is odd.

while (howmany != 0) { randomval = random(); if (randomval is odd) append randomval to array howmany--; }

What if we add a predictable branch? Let us say that we check whether the random 64-bit value is some arbitrary number. This new branch will be easily predicted as false.

while (howmany != 0) { randomval = random(); if (randomval is 12313132) generate error if (randomval is odd) append randomval to array howmany--; }

Since the new branch is predictable, maybe it comes nearly for free?

Let us run 10 trials of the first algorithm, then 10 trials of the second, and so forth repeatedly, until the branch predictor is practically stable.

Let us count the number of mispredicted branches per loop iteration. We added an easy-to-predict branch, so it should not contribute directly to the number of mispredicted branches. I get the following numbers…

processor | one hard branch | one hard, one easy branch |
---|---|---|

Intel Skylake processor | 4% to 9% | 30% to 40% |

ARM A72 | 24% to 26% | 49% to 51% |

So at least in this particular test, the mere addition of an easy-to-predict branch increased substantially the number of mispredicted branches.

Note: The loop itself is an easily-predicted branch since the processor must determine whether it continues for another iteration or not at the end of each iteration.

*- It is a not a practical algorithm, it only serves to illustrate my point.

How did you measure the predicts/mispredicts?

Using the processor’s counters.

Wow that’s a low baseline mispredict rate on Intel.

Does that rng function result in an easily predictable even/odd pattern?

What matters is that these are always the same pseudo-random integers from run to run.

Yes this is kind of obvious since adding the 3rd branch adds useless entries in the branch history, reducing its effective length and thus the number of unique patterns it can remember.

Unrolling it 4x or 8x will have the opposite effect since it removes the always taken loop branch which has the same effect.