Unrolling your loops can improve branch prediction

Modern processors predict branches (e.g., if-then clauses), often many cycles a ahead of time. When predictions are incorrect, the processor has to start again, an expensive process. Adding a hard-to-predict branch can multiply your running time.

Does it mean that you should only care about hard-to-predict branches? Unfortunately no.

In a prior blog post, I showed how adding a predictable branch can degrade the branch prediction of other branches.

Conversely, removing a predictable branch might improve branch predictions. Wilco Dijkstra suggested a way to test it out. A common form of predictable branch is a loop. Though a loop does not look like an if-then clause, it still compiles to a branch. A loop is predictable if it iterates long enough. You can “unroll” loops, that is, reduce the number of iterations by doing more work with each iterations. Unrolling does not just remove a predictable branch, but that it does that among other things.

Let us consider an example. I generate random numbers. I check whether they are odd, and when they are, I write the result to an array. It is a silly example meant to illustrate the effect of branch prediction. Because my random numbers can be generated by a pseudo-random number generator, I can run many trials with the same random number. The more often I repeat the loop, the better the branch prediction gets.

  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

We can unroll this loop: instead of generating one random number per loop iteration, I generate two. You can also generate four or eight, and so forth. The unrolled loop has fewer branches because each iteration through the loop is an implicit branch.

  uint64_t randomval;
  while (howmany >= 2) {
    randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    randomval = rng(howmany - 1 + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany-=2;
  }  
  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

I implemented this benchmark using 1000 random numbers per trial. I record the number of mispredicted branch per random number generated on different processors.

ARM 64-bit (Skylark):

trial basic loop unrolled twice unrolled four times
1 58% 57% 56%
2 48% 33% 26%
3 48% 28% 21%
15 45% 18% 8%

Intel x64 (Cannon Lake):

trial basic loop unrolled twice unrolled four times
1 53% 51% 49%
2 50% 35% 30%
3 46% 20% 17%
15 40% 0.5% 0.4%

 

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

10 thoughts on “Unrolling your loops can improve branch prediction”

  1. Interestingly, with simpler stuff, loop unrolling stopped be helpful quite a while ago. Perhaps, a compiler has been doing a better job compared to manual unrolling. What difference does it make here? The fact that there’s a conditional?

    1. Loop unrolling significantly improves performance indeed. LLVM unrolls loops even with -O2 nowadays. Several targets are enabling it for -O2 in GCC10, motivated by higher SPEC scores.

      Generally unrolling reduces instruction counts and branches, improves load/store addressing (eg. merging 2 loads or stores into a load-pair or store-pair instruction) and enables more instruction level parallellism.

      The gain in this case is that removing the loop branch almost doubles the number of bits of the the branch history. This is used like a hash into the branch predictor tables, so more useful bits means it can not only remember more branches but also predict them better – much better in this case!

        1. Well GCC does not unroll at all at any optimization level if you don’t add -funroll-loops. However this will change with GCC10 for targets like AArch64 and Power which will unroll small loops with -O2 and higher.

            1. Basically yes. GCC does not unroll most loops even at -03. It will unroll if you ask it to with -funroll-loops, or if you use profile guided optimization.

              There are some exceptions:

              Small compile-time-known tripcount loops may be fully unrolled (no loop remains).
              When loops are vectorized, they are implicitly unrolled since you need 4 or 8 or whatever iterations of the scalar loop to make a single vector iteration (however, unlike clang, the vector body isn’t further unrolled).

              1. Travis, great observation. Ok, now I see why some loops are unrolled (I was recently interested primarily in those that CAN be vectorized). Clearly, it’s nearly impossible to vectorize if the loop has conditionals. But, if you unroll these, you get better branch prediction (great to know).

            2. Indeed, GCC’s x86 developers seem opposed to the idea of unrolling eventhough there is clear evidence it helps modern cores…

              Note eventhough LLVM unrolls loops by default, it’s not able to unroll this particular loop, not even with -funroll-all-loops, while GCC does it with -funroll-loops.

              1. There is clear evidence that unrolling some loops helps on modern codes.

                The evidence is less clear that aggressively unrolling all loops helps.

                Almost every loop is helped, in isolation, by unrolling. Many loops hurt, when in a real application, by unrolling.

                I don’t think the compiler vendors can solve that compromise: there is no absolutely correct answer. clang errs one way, gcc another. There is no happy medium because the right answer for a single function can be different depending on the application it is ultimately compiled into.

                Only full LTO combined with full and accurate PGO can solve this, so we might as well give on up that dream for now. We have to rely on developers to annotate key loops.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax