Why are unrolled loops faster?

A common optimization in software is to “unroll loops”. It is best explained with an example. Suppose that you want to compute the scalar product between two arrays:

  sum = 0;
  for (i = 0; i < length; i++)
    sum += x[i] * y[i];

An unrolled loop might look as follows:

  sum = 0;
  i = 0;
  if (length > 3)
    for (; i < length - 3; i += 4)
      sum += x[i] * y[i] + x[i + 1] * y[i + 1] +
             x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
  for (; i < length; i++)
    sum += x[i] * y[i];

Mathematically, both pieces of code are equivalent. However, the unrolled version is often faster. In fact, many compilers will happily (and silently) unroll loops for you (though not always).

Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding. They use more memory and instruction cache. Many processors have optimizations specific to small tight loops: manual loop unrolling generating dozens of instructions within the loop tend to defeat these optimizations.

But why would unrolled loops be faster in the first place? One reason for their increased performance is that they lead to fewer instructions being executed.

Let us estimate the number of instructions that we need to be executed with each iteration of the simple (rolled) loop. We need to load two values into registers. We need to execute a multiplication. And then we need to add the product to the sum. That is a total of four instructions. Unless you are cheating (e.g., by using SIMD instructions), you cannot do better than four instructions.

How many instruction do we measure per iteration of the loop? Using a state-of-the-art compiler (GNU GCC 8), I get 7 instructions. Where do these 3 extra instructions come from? We have a loop counter which needs to be incremented. Then this loop counter must be compared with the end-of-loop condition, and finally there is a branch instruction. These three instructions are “inexpensive”. There is probably some instruction fusion happening and other clever optimizations. Nevertheless, these instructions are not free.

Let us grab the numbers on an Intel (Skylake) processor:

amount of unrolling instructions per pair cycles per pair
1 7 1.6
2 5.5 1.6
4 5 1.3
8 4.5 1.4
16 4.25 1.6

My source code is available.

The number of instructions executed diminishes progressively (going toward 4) as the overhead of the loop becomes smaller and smaller due to unrolling. However, the speed, as measured in number of cycles, does not keep on decreasing: the sweet spot is about 4 or 8 unrolling. In this instance, unrolling is mostly beneficial because of the reduced instruction overhead of the loop… but too much unrolling will eventually harm the processing.

There are other potential benefits of loop unrolling in more complicated instances. For example, some loaded values can be carried between loop iterations, thus saving load instructions. If there are branches within the loop, it may help or harm branch prediction to unroll. However, I find that a reduced number of instructions is often in the cards.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Why are unrolled loops faster?”

    1. Though SIMD instructions and the corresponding vectorization of the code are a form of “loop unrolling”, I implicitly ignored them for the purpose of this blog post (if you check my code, you’ll notice that I deliberately disabled vectorization).

      This being said, it would not change the story much: the reason vectorization speed things up is because it drastically reduces the number of instructions.

  1. Why doesn’t the processor have for loops built in? Set the amount of instructions per loop and how many iterations and then start the loop. This gets you the best of both worlds. Less overhead from running compare and jump instructions and less overhead from the total size of the instructions.

  2. Well, I think the processor probably has a micro architecture component called Loop Stream Detector(LSD) to help detect the loop and cache all the instruction decoding results for next iteration. Therefore, the performance improvement of unrolled loop does not come from fewer instructions to be executed and on the contrary, it is the more instructions needed to be executed in one iteration enables the possibilities for CPU back-end to execute in a more parallel manner.

Leave a Reply to Pan Zhang Cancel reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax