Mispredicted branches can multiply your running times

Modern processors are superscalar, meaning that they can execute many instructions at once. For example, some processors can retire four or six instructions per cycle. Furthermore, many of these processors can initiate instructions out-of-order: they can start working on instructions that appear much later in the code.

Meanwhile, most code contains branches (if-then clauses). These branches are often implemented as “jumps”: the processor either jumps to an instruction further away or continues on its current path.

It is hard to reconcile branches with out-of-order superscalar execution. To do so, processors rely on sophisticated branch predictors. That is, the processor tries to predict the future: when it sees a branch, and thus a possible jump, it tries to guess which way it will go.

This often works quite well. For example, most loops are actually implemented as branches. At the end of each iteration in the loop, the processor must predict whether there will be a next iteration. It is often safe for the processor to predict that the loop will continue (forever). The processor will only mispredict one branch per loop in this manner.

There are other common examples. If you are accessing the content of an array, many languages will add “bound checking”: before accessing the array value, there will be a hidden check to see whether the index is valid. If the index is not valid, then an error is generated, otherwise the code proceeds normally. Bound checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.
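To make the hidden check concrete, here is a minimal C++ sketch. In C++ the check is explicit rather than hidden: `std::vector::at` performs the bound check (and throws `std::out_of_range` on failure) while `operator[]` does not. The function name `checked_read` is mine, for illustration only.

```cpp
#include <stdexcept>
#include <vector>

// Bounds-checked array access: the same kind of check that many languages
// insert implicitly before every array read.
int checked_read(const std::vector<int>& v, size_t i) {
    return v.at(i); // throws std::out_of_range if i >= v.size()
}
```

Since valid code essentially never fails this check, the corresponding branch is almost perfectly predictable.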

What happens if the branch is hard to predict?

What happens inside the processor is that all of the instructions that were started after the mispredicted branch must be cancelled, and the computation must start anew. You should expect a penalty of more than 10 cycles for each mispredicted branch, which can multiply your running times.

Let us look at some simple code where we write out random integers to an output array:

while (howmany != 0) {
    out[index] = random();
    index += 1;
    howmany--;
}

We can generate a decent random number in about 3 cycles on average. The total latency of the random number generator might be 10 cycles, but because our processor is superscalar, several random-number computations can be in flight at once. Thus we may complete a new random number every 3 cycles or so.
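As a stand-in for `random()`, here is one generator with the right shape of cost: a xorshift64 generator takes only a handful of cheap instructions per call, and independent calls can overlap on a superscalar processor. (This is an illustrative sketch, not necessarily the generator used in the benchmark; `cheap_random` is my name for it.)

```cpp
#include <cstdint>

// xorshift64: three shifts and three XORs per number. Cheap enough that
// the throughput bottleneck is elsewhere.
static uint64_t rng_state = 88172645463325252ULL; // any nonzero seed works

uint64_t cheap_random() {
    uint64_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return rng_state = x;
}
```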

Let us modify this function slightly so that we only write out the odd integers:

while (howmany != 0) {
    val = random();
    if ((val & 1) == 1) { // val is an odd integer
      out[index] = val;
      index += 1;
    }
    howmany--;
}

Naively we might think that this new function could be faster. Indeed, we might only write out one out of two integers. We have a branch, but checking whether an integer is odd requires checking a single bit.

I benchmarked these two functions in C++ on a Skylake processor:

write all random numbers: 3.3 cycles per integer
write only odd random numbers: 15 cycles per integer

The second function takes about five times longer!

Is there something you can do? Yes. You can just remove the branch. An odd integer is characterized by the fact that its bitwise AND with the value 1 is one. The trick is to increment the array index by one only when the random value is odd.

while (howmany != 0) {
    val = random();
    out[index] = val;
    index += (val & 1);
    howmany--;
}

In this new version, we always write the random value to the output array, even when it is not needed. At a glance, it looks wasteful. However, it does away with the mispredicted branches. In practice the performance is nearly as good as the original code, and much better than the version with branches:

write all random numbers: 3.3 cycles per integer
write only odd random numbers: 15 cycles per integer
branch removed: 3.8 cycles per integer

Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.
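A conditional move is the classic case where the compiler succeeds on its own. When both sides of a branch are cheap and side-effect free, as in the selection below, compilers routinely emit a `cmov` on x86 instead of a jump. (A minimal sketch; `branchless_max` is an illustrative name, and whether a `cmov` is actually emitted depends on the compiler and flags.)

```cpp
#include <cstdint>

// A branch the compiler can typically replace with a conditional move:
// both arms are side-effect free, so speculatively computing both is safe.
int64_t branchless_max(int64_t a, int64_t b) {
    return (a > b) ? a : b; // often compiled without any jump
}
```

The conditional store in our loop is harder: the compiler cannot, in general, prove that writing to `out[index]` on the "discard" path is safe.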

An important take-away is that mispredicted branches are not a minor concern, they can have a large effect.

My source code is available.

Further reading: Processors learn to predict branches (follow-up blog post).

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

28 thoughts on “Mispredicted branches can multiply your running times”

  1. Removing branches can be pretty tricky. I’d be interested to hear your thoughts on __builtin_expect / __builtin_expect_with_probability / __builtin_unpredictable if you have any.

    I know MSVC has historically refused to implement anything similar, but with [[likely]] and [[unlikely]] coming in C++2a (which I’m guessing MSVC will support in VS 2021) it should be possible to have something portable to most modern compilers.

    1. If the branch is predictable and you want to tell the compiler so, this can help because the compiler can reorder the instructions so that the likely path stays close to the stream of instructions. I have never seen huge gains from this approach, but I have measured some worthwhile further gains… after optimizing almost everything else.

      However, if the branch is unpredictable, and there is no way to remove the branching, then what can the compiler do? I do not know.

      1. That pretty much matches up with my experience. IIRC I’ve only seen around ~1-2% improvements on x86/x86_64 in real code.

        It’s just so easy, especially compared to figuring out how to make something branchless (if possible). It’s also pretty handy for error checking macros since the improvement tends to get replicated all over the place for free.

        1. The likely/unlikely hints haven’t in principle or practice helped much with branch prediction problems like the one described here, because they indicate branches that are likely to go a specific way, hence “predictable”. At best, they will organize code so the “fall through” path is the likely taken one, as Daniel mentions, which helps for a variety of reasons. This often helps a bit – but you could certainly see some small loops that get re-arranged for a 100% benefit (e.g., doubled in speed), but due to front-end effects, not branch prediction per-se.

          __builtin_unpredictable on the other hand could have a big effect here, since in practice it says “try harder to make this branch free”. Compilers can make things branch free much more than they do – they are reluctant to do this because most branches are predictable, and a branchfree transformation often slows things down for predictable branches. So they are conservative and only choose branchfree for the simplest cases or where some (IMO often dubious) heuristic indicates it is worthwhile (e.g., PGO although that’s not really in the dubious category). Empirically, I have not seen gcc insert more than 2 cmov instructions in order to remove one branch.

          Hopefully __builtin_unpredictable can give a strong enough hint to make something branch free, even if it costs more than a couple of cmov instructions.

  2. Yes, compilers can’t do such tricks, because those two versions have different observable behavior. The version where stores to memory are guarded by the if statement can’t touch the memory of even elements. I tend to think this also prevents vectorization of the loop. Daniel, you probably checked that but still, is there an unexpected difference in the generated assembly?

  3. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.

    Actually many compilers have been doing exactly this routinely and automatically for many decades! GCC generates a branchless sequence for the example on Arm by default: https://www.godbolt.org/z/ZemW4j

    It can give major speedups in common cases, especially on simpler, in-order cores which don’t have advanced (read: complex and huge) branch predictors. On advanced CPUs the gain is more modest, a few percent on average.

    This particular example uses a conditional store which is non-trivial if you don’t have full conditional execution like Arm. Compilers can safely emit such a store using a conditional move:

    *(cond ? p : &dummy_var) = x;

    Loads can be done in the same way – so it’s always possible to remove branches from code if there is at least one conditional instruction (it’s not always faster though – that’s the hard part!).
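The dummy-location idiom the commenter describes can be sketched as follows: the store always executes, but a (likely branchless) pointer select decides whether it lands in the real slot or in a scratch variable. (This is my illustration of the idiom, not code from GLIBC; the name `conditional_store` is mine.)

```cpp
#include <cstdint>

// Branchless conditional store via the dummy-location trick: we always
// store, and a pointer select (a candidate for cmov) picks the destination.
void conditional_store(int64_t* out, bool cond, int64_t x) {
    int64_t dummy;                          // scratch sink for the discard case
    int64_t* target = cond ? out : &dummy;  // compiler may emit a cmov here
    *target = x;                            // unconditional store, no branch
}
```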

    As for giving branch and inlining hints to compilers, only worry about that for highly performance critical functions where it will help. It should be based on execution statistics of real code, not a “guess” as this is often incorrect. I’ve used these hints in GLIBC in the fast paths in malloc/free, highly optimized string and math functions with significant gains. However it can be counterproductive if you don’t know what you’re doing or forget to disassemble your code to verify the compiler optimized it the way you expected.

    1. The method of storing to a dummy memory location is a new one to me! I think it would translate into 1-3 extra decoded instructions (executed uops) per store on Intel, which would be trivially beneficial on large amount of unpredictable code, unless it would cause register spilling or the memory location would act as a dependency on a tight loop.

      Sadly it sounds like you can’t really make the compiler do it. I mean, you can hardcode this functionality on your code, but not make the compiler choose it or a branchy variant depending on which is deemed more profitable on basis of static analysis / profiling / hints.

    2. Interestingly I think the compiler could perform the dummy pointer optimization without violating any rule in the C standard, or such. Just allocate the dummy location by the runtime (either from the stack – in which case it could be an stack location already easily available on a register, or from a dedicated thread-local memory mapping), and choose either the real or dummy store address for the store. I don’t think this would violate any rules any more than something like register spilling or ABI-specific register handling does.

      Of course another question that arises from all this is that why x86 doesn’t have a (non-vectored) conditional store instruction for this purpose…

      1. The compiler could definitely do it, but as you’ve mentioned I’ve never seen it. Usually the compiler prefers branches. It knows how to remove some types of branches for “mathematical” expressions using conditional moves, but I’ve never seen it use dummy stores (or dummy loads, a similar technique) – although there’s nothing preventing it in the standard. Using a stack location would be fine.

        You can use the idiom by hand though (although maybe the optimizer will see though it and revert it).

        1. I think one reason why such an optimization is missing is that “unpredictable” hints are a rather recent thing and I haven’t really heard of branch prediction profiling driven optimizations. Also, this solution doesn’t arise spontaneously as a combination of other optimizations.

          I wonder how hard it would be to implement this on a modern compiler. They emit conditional stores on architectures that support them, so it shouldn’t be awfully hard to implement the variant with fixed dummy destination, somewhat harder to reuse pointers to valid but unused locations in existing registers.

          1. One of the main optimizations that PGO performs is branch related, i.e., calculating the taken percentage of branches, which affects code generation. I believe it can, for example, result in a cmov where no combination of hints would have produced one without PGO.

            Agree with everything else.

  4. There’s multiple bugs in the faster version. Firstly, it doesn’t output howmany elements, it can output way fewer or even none. Secondly, the last element it outputs can be even as it never gets overridden.

    Finally, in this toy problem the real answer is simply to set out[index] = random() | 1; to get uniform odd numbers.
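The commenter's one-liner works because forcing the low bit on maps each generator output to an odd number, with even and odd inputs collapsing pairwise onto the same odd value, so the result stays uniform over the odd integers. A sketch (using `std::rand` merely as a stand-in generator; `random_odd` is an illustrative name):

```cpp
#include <cstdint>
#include <cstdlib>

// Uniform odd numbers without filtering: setting the low bit makes every
// output odd, and pairs (2k, 2k+1) map to the same value 2k+1.
uint64_t random_odd() {
    return static_cast<uint64_t>(std::rand()) | 1;
}
```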

      1. I… eh.. what? That is simply not true, if we cross-reference your code to the numbers you report. You report “cycles per integer”, and you compute this (for all algorithms) as total_cycles / howmany, regardless of how many integers the algorithm actually output.

        Furthermore, we are looking at performance optimizations. If your optimized implementation isn’t functionally the same as the non-optimized one, then what is the point of it all?

        Worse, the bug in your optimized implementation causes it to simply do less work. Of course it is faster, it just decides to stop halfway through! About half of the random values will be odd. So at the end you can expect that your optimized buggy implementation has output about howmany/2 elements. And then it stops, having output about half as much as requested.

        This also makes your measurements bogus, the 3.8 cycles per integer for the optimized implementation you report is wrong. It is at least twice that, perhaps more.

          1. Alright, final attempt to communicate the issue by making the example more extreme. Let’s modify your code to only generate random integers from [0, 5) instead of odd numbers.

            while (howmany != 0) {
                val = random();
                out[index] = val;
                index += val < 5; // MODIFIED
                howmany--;
            }

            Will you still report “3.8 cycles per integer” for this random number generation code, even though it only successfully outputs (assuming val is uint64_t like in your actual code) an integer 5/2^64 of the time? You’ll have to run this code for hundreds of hours with an incredibly large howmany before it’ll even output a single number.

            1. Yes. It is the number of iterations, that is, the number of branches, that matters for this experiment. In fact, you could modify this problem to compute the last odd integer generated or something else. Actually, I pretty much assume that writing things out is free. If you read my blog post, at no point do I talk in terms of the size of the output.

              The point of this blog post is in the title “Mispredicted branches can multiply your running times”. It is not about generating random numbers quickly. That’s also an interesting problem but, as you point out, generating odd integers is certainly not interesting.

              If you look through the content of my blog post, you will find work on random number generation, and performance. This particular blog post is not about that.

  5. This will probably sound sacrilegious, but isn’t the real lesson here that branch prediction is a poor method of optimization? After all, branch prediction gave us the Spectre vulnerability. Wouldn’t it be better to either:

    - Explore both branches at the same time using two isolated cores/pipelines/compute units, and then clear the one that is invalid, or
    - Avoid the problem entirely by using a faster (photonic?) memory bus.

    I realize that may sound a bit naive, but surely it’s better than doing things like adding neural networks to CPUs for branch prediction…

    1. Exploring multiple paths at once can quickly become prohibitive… the number of code paths grows exponentially… so it is not feasible over a long window.

      I don’t think that the problem is memory speed per se. The problem can occur even if you are just in L1 cache.

      1. Thanks for the reply. I’m a software engineer, so my knowledge of the subject is limited. I just read this article, which seems to be a nice introduction to branch prediction:

        https://danluu.com/branch-prediction/

        Maybe the next step is to get rid of pipelines altogether? What if we had pools of pipeline stages and dynamically chained them together during thread execution? I’m imagining something analogous to how microservices are used to build distributed applications. (Of course, ideas are easy. Implementation is the hard part.)
