Benchmarking is hard: processors learn to predict branches

A lot of software is an intricate of branches (ifthen clauses). For performance reasons, modern processors predict the results of these branches.

In my previous post, I showed how the bulk of your running time could be due to mispredicted branches. My benchmark consisted in writing 64 million random integers to an array. When I tried to only write odd random integers, the performance became much worse, due to the mispredictions.

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Why 64 million integers and not, say, 2000? If you run just one test, then it should not matter. However, what if you are running multiple trials? Quickly, the number of mispredicted branches falls to zero. The numbers on an Intel Skylake processor speaks for themselves:

trial mispredicted branches (Intel Skylake)
1 48%
2 38%
3 28%
4 22%
5 14%

The “learning” keeps on going as the following plots shows… Eventually, the fraction of mispredicted branches falls to about 2%.

That is, as you keep measuring how long the same task takes, it gets faster and faster because the processor learns to better predicts the outcome. The quality of the “learning” depends on your exact processor, but I would expect that better, newer processors should learn better. Importantly, from run to run, we generate the same random integers.

The latest server processors from AMD learn to predict the branch perfectly (within 0.1%) in under 10 trials.

trial mispredicted branches (AMD Rome)
1 52%
2 18%
3 6%
4 2%
5 1%
6 0.3%
7 0.15%
8 0.15%
9 0.1%

This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.

You should probably avoid benchmarking branchy code over small problems.

My code is available.

Credit: The AMD Rome numbers were provided by Velu Erwan.

Further reading: A case for (partially) TAgged GEometric history length branch prediction (Seznec et al.)

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “Benchmarking is hard: processors learn to predict branches”

  1. So, the branch predictor incrementally learns a 2000-iteration branch sequence to a pretty high level of accuracy? I must say I’m curious about the internal state and evolution of this machinery. I suppose Intel doesn’t really provide much public details on it, could it be a Hopfield network or something similar, maybe mixing in some stochastic evolution?

      1. I suppose AMD has advertised their branch predictor using “neural network” in its operation – which on this kind of silicon would probably be more of a perceptron type than those tensor engines that hyped up NNs are nowadays.

        1. Oddly, they are using that ever before the current ML hype. On their recent chips they are also using TAGE, but still have the perception too – it isn’t clear to me what is responsible for what and if they are arranged in a hierarchy or what.

    1. Look up TAGE.

      It does well on this kind of test, because it uses a limited history of say 20-30 branches as a lookup into a history table. Here, each 20-30 branch sequence, with very high probability identifies the exact position within the random sequence. I.e., the “tag” has a high amount of information.

      On the other hand, TAGE does poorly for apparently much simpler cases. Consider a nested loop with a fixed number of iterations. Here, Intel chips fail to predict the inner loop exit with as few as 30ish iterations (the transition is sharp, from 100% to 0% basically when you cross the threshold).

      That’s because the tag is now very sparse in entropy/information. Assuming a taken branch is 1, the history looks like 1111111011111110…. for 7 iteration loop. I.e., mostly ones. Even though a 26-bit tag (say) could distinguish almost 70 million patterns (and in Daniel’s random test you could get close if you didn’t hit some other resource limit), in this case it’s basically all 1s. The only think you can key on is that occasional 0, and as soon as you get to 27 interactions, the 0 won’t appear in a 26-bit key, and all the predictor sees is a long string of 1s and it always fails.

      This is why you sometimes see recommendations to keep nested loops to less than 16.

      Intel chips used to have a specific loop predictor which could remember iteration counts for loops like this (indeed, a much easier problem in principle than remembering a series of 2,000 random branches) but it was removed sometime around the start of the Core series, IIRC.

      1. I wonder how these graphs would look like if one could completely flush (or reset to a known state, say, “all branches not taken”) the predictor state before the run. Maybe what I thought was not so much incremental “learning” as it might have been incremental “unlearning” of branch histories which were essentially random in the context of this benchmark.

        Which makes me actually wonder how much one might be able to fish information from the branch predictor in the regard of its past by careful timing or performance counter measurements.

        1. Yeah, the same things that make TAGE good at predicting long series of correlated branch directions makes it hard to flush: it’s not just a matter of executing a large number of branches at different addresses, since those will never access the longer history tables (or maybe will only access the longer tables).

          You basically need a TAGE-aware flusher that uses mutilple sequences each which targets a specific TAGE length (and you might need mutilple depending on if there are independent TAGE states eg hashed by address). Furthermore there are probably, fast-but dumb non-TAGE predictors to catch easy cases that you’ll have to flush and defeat as well.

          I think you can saw that most of the behavior is incremental learning, not unlearning. At least the long tail where the predictions become very good should be mostly “learning” since unlearning really just means you can go from 0% to 50% (totally wrong, to random guess i.e., right half the time). To get above 50 and towards 99 must be learning.

          That said, there is probably some unlearning right at the start: you’ll get a lot of false predictions (still 50% tho) for short TAGE lengths since you’ll be matching against earlier hits that don’t have any relationship. I don’t think that’s a major effect.

          Interesting challenge: design a series of branches which gets a 100% misprediction rate. You could do it if you knew the internal behavior of the predictor, but even if not you could maybe do it incrementally.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax