# Why are unrolled loops faster?

A common optimization in software is to “unroll loops”. It is best explained with an example. Suppose that you want to compute the scalar product between two arrays:

```c
sum = 0;
for (i = 0; i < length; i++)
  sum += x[i] * y[i];
```

An unrolled loop might look as follows:

```c
sum = 0;
i = 0;
if (length > 3)
  for (; i < length - 3; i += 4)
    sum += x[i] * y[i] + x[i + 1] * y[i + 1] +
           x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
for (; i < length; i++)
  sum += x[i] * y[i];
```

Mathematically, both pieces of code are equivalent. However, the unrolled version is often faster. In fact, many compilers will happily (and silently) unroll loops for you (though not always).
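To make the equivalence concrete, here is a minimal sketch wrapping both versions as compilable functions (the names `dot_rolled` and `dot_unrolled` are my own, not from the post); either one returns the same scalar product for any input.

```c
#include <stddef.h>

/* Straightforward (rolled) scalar product. */
long dot_rolled(const long *x, const long *y, size_t length) {
  long sum = 0;
  for (size_t i = 0; i < length; i++)
    sum += x[i] * y[i];
  return sum;
}

/* Unrolled by four, with a scalar loop to handle the remainder. */
long dot_unrolled(const long *x, const long *y, size_t length) {
  long sum = 0;
  size_t i = 0;
  if (length > 3)
    for (; i < length - 3; i += 4)
      sum += x[i] * y[i] + x[i + 1] * y[i + 1] +
             x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
  for (; i < length; i++)  /* leftover 0-3 pairs */
    sum += x[i] * y[i];
  return sum;
}
```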

Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding. They use more memory and instruction cache. Many processors have optimizations specific to small tight loops: manual loop unrolling that generates dozens of instructions within the loop tends to defeat these optimizations.

But why would unrolled loops be faster in the first place? One reason for their increased performance is that they lead to fewer instructions being executed.

Let us estimate the number of instructions that need to be executed with each iteration of the simple (rolled) loop. We need to load two values into registers. We need to execute a multiplication. And then we need to add the product to the sum. That is a total of four instructions. Unless you are cheating (e.g., by using SIMD instructions), you cannot do better than four instructions.

How many instructions do we measure per iteration of the loop? Using a state-of-the-art compiler (GNU GCC 8), I get 7 instructions. Where do these 3 extra instructions come from? We have a loop counter which needs to be incremented. Then this loop counter must be compared with the end-of-loop condition, and finally there is a branch instruction. These three instructions are “inexpensive”. There is probably some instruction fusion happening and other clever optimizations. Nevertheless, these instructions are not free.

Let us grab the numbers on an Intel (Skylake) processor:

| amount of unrolling | instructions per pair | cycles per pair |
|---------------------|-----------------------|-----------------|
| 1                   | 7                     | 1.6             |
| 2                   | 5.5                   | 1.6             |
| 4                   | 5                     | 1.3             |
| 8                   | 4.5                   | 1.4             |
| 16                  | 4.25                  | 1.6             |

The number of instructions executed per pair diminishes progressively (going toward 4) as unrolling shrinks the overhead of the loop. However, the speed, as measured in number of cycles, does not keep on decreasing: the sweet spot is an unrolling factor of about 4 or 8. In this instance, unrolling is mostly beneficial because of the reduced instruction overhead of the loop… but too much unrolling will eventually harm performance.

There are other potential benefits of loop unrolling in more complicated instances. For example, some loaded values can be carried between loop iterations, thus saving load instructions. If there are branches within the loop, it may help or harm branch prediction to unroll. However, I find that a reduced number of instructions is often in the cards.
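As a sketch of the value-reuse point, consider summing products of adjacent elements: the rolled loop loads each element twice, while an unrolled-by-two version can carry the last load into the next iteration. The function names (`adjacent_rolled`, `adjacent_unrolled`) are hypothetical, chosen for illustration only.

```c
#include <stddef.h>

/* Sum of x[i] * x[i + 1]: the rolled version loads each element
   twice (once as x[i], once as x[i + 1] of the previous step). */
long adjacent_rolled(const long *x, size_t length) {
  long sum = 0;
  for (size_t i = 0; i + 1 < length; i++)
    sum += x[i] * x[i + 1];
  return sum;
}

/* Unrolled by two, carrying the previous load between iterations
   so that each element is loaded only once. */
long adjacent_unrolled(const long *x, size_t length) {
  if (length < 2) return 0;
  long sum = 0;
  long prev = x[0];                /* load carried across iterations */
  size_t i = 1;
  for (; i + 1 < length; i += 2) {
    long a = x[i], b = x[i + 1];   /* only two fresh loads... */
    sum += prev * a + a * b;       /* ...for two products */
    prev = b;
  }
  if (i < length)                  /* one product may remain */
    sum += prev * x[i];
  return sum;
}
```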

Daniel Lemire, "Why are unrolled loops faster?," in Daniel Lemire's blog, April 12, 2019.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 9 thoughts on “Why are unrolled loops faster?”

1. IraqiGeek says:

How does this compare with unrolling using fmadd in SSE/AVX?

1. Though SIMD instructions and the corresponding vectorization of the code are a form of “loop unrolling”, I implicitly ignored them for the purpose of this blog post (if you check my code, you’ll notice that I deliberately disabled vectorization).

This being said, it would not change the story much: the reason vectorization speeds things up is that it drastically reduces the number of instructions.
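To illustrate the reply's point (separately from the post's benchmark, which deliberately disabled vectorization), here is a minimal SSE2 sketch: each 128-bit operation handles two doubles, so the loop executes roughly half as many instructions per pair. SSE2 is part of the x86-64 baseline, so this compiles without extra flags on that platform; `dot_sse2` is a name I made up for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86-64 baseline) */
#include <stddef.h>

/* Scalar product where each loop iteration processes two pairs
   with a single vector load, multiply, and add. */
double dot_sse2(const double *x, const double *y, size_t length) {
  __m128d acc = _mm_setzero_pd();
  size_t i = 0;
  for (; i + 2 <= length; i += 2) {
    __m128d vx = _mm_loadu_pd(x + i);       /* load two doubles */
    __m128d vy = _mm_loadu_pd(y + i);
    acc = _mm_add_pd(acc, _mm_mul_pd(vx, vy));
  }
  double lanes[2];
  _mm_storeu_pd(lanes, acc);
  double sum = lanes[0] + lanes[1];         /* horizontal sum */
  for (; i < length; i++)                   /* scalar tail */
    sum += x[i] * y[i];
  return sum;
}
```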

2. Why doesn’t the processor have for loops built in? Set the number of instructions per loop and how many iterations, and then start the loop. This gets you the best of both worlds: less overhead from running compare and jump instructions, and less overhead from the total size of the instructions.

1. Processors do have optimizations specific to loops, but there is always some overhead. No free lunch.

3. Robert says:

Also, branch predictions …

4. Pan Zhang says:

Well, I think the processor probably has a micro-architecture component called the Loop Stream Detector (LSD) to help detect the loop and cache the instruction decoding results for the next iteration. Therefore, the performance improvement of the unrolled loop does not come from fewer instructions being executed; on the contrary, it is the larger number of instructions per iteration that gives the CPU back-end the opportunity to execute in a more parallel manner.

1. My expectation goes into the other direction. Techniques like LSD are meant to accelerate tight loops. If you unroll too much, you lose these processor optimizations.

5. There are multiple moving parts here:
– Manually unrolling the loop may reduce overhead penalties (compare trip count + jump).
– It may result in increased cache misses due to bloated loop bodies.
– While the method containing unrolled loops may execute faster, code bloating may diminish its chances of getting inlined into its calling context, since compile heuristics generally take into consideration compiled method size / IR counts.
– Certain micro-architectural features like the Loop Stream Detector have hard constraints on the number of micro-ops in the loop body for the entire loop to be streamed from the LSD cache, thus saving front-end power. In fact, the Intel Optimization Manual [1] also advises splitting a loop body into multiple loops to trigger the LSD, but this may have counter-intuitive effects due to the combined overhead of the multiple loops.
– Unrolling a vector kernel is generally useful.

[1] 3.4.2.4 Optimizing the Loop Stream Detector (LSD) https://www.intel.com/content/www/us/en/content-details/821612/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html

6. What about the case of loop unrolling to hide latency?

For example, we may have an FMA instruction with a latency of 4 cycles but a reciprocal throughput of 0.5. I assume we’d want to use something akin to Compiler Explorer to at least make sure the asm is issuing 8 FMA instructions per iteration? From what I gather, latency / reciprocal throughput ≈ optimal number of accumulators. [0] It also seems to be a strategy to “hide” latency [1]. I assume hiding latency and optimizing accumulator usage are closely related.
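A minimal sketch of the accumulator idea (my own illustration, not from the post): with a single accumulator, every addition depends on the previous one, so the loop runs at the latency of the add rather than its throughput. Splitting the sum across four independent accumulators (the name `dot_four_accumulators` is hypothetical) gives the processor several dependency chains it can keep in flight at once; whether the compiler then emits FMA instructions depends on the target and flags.

```c
#include <stddef.h>

/* Scalar product with four independent accumulators so that
   consecutive additions do not depend on each other. */
double dot_four_accumulators(const double *x, const double *y,
                             size_t length) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= length; i += 4) {
    s0 += x[i] * y[i];          /* four independent dependency */
    s1 += x[i + 1] * y[i + 1];  /* chains: the processor can   */
    s2 += x[i + 2] * y[i + 2];  /* overlap their latencies     */
    s3 += x[i + 3] * y[i + 3];
  }
  double sum = (s0 + s1) + (s2 + s3);
  for (; i < length; i++)       /* scalar tail */
    sum += x[i] * y[i];
  return sum;
}
```

Note that reassociating a floating-point sum this way can change the rounding slightly, which is one reason compilers will not do it by default without flags like `-ffast-math`.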

Regardless, this is a wonderful explanation! Thank you!
