# Software performance is… counterintuitive

There is a long tradition among computer scientists of counting the number of operations as a way to estimate the running cost of an algorithm. Many computer scientists and naive programmers believe that adding extra instructions to a piece of code necessarily slows it down.

This is just wrong. If the initial piece of code was dirt cheap, adding more instructions may come for free.

Let me give you an example. I work a lot with bitmaps. A bitmap, in my context, is just an array of bits (bitset) that we pack into machine-size words (e.g., 64-bit integers). We can use them to represent sets of integers: the integer i is in the set if the ith bit is set to true. The union between two such sets can be computed with a bitwise OR:

```
for (int k = 0; k < length; ++k) {
    output[k] = input1[k] | input2[k];
}
```
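As context, here is a minimal sketch of how such a bitmap represents a set of integers (the helper names are mine, not from the post): the integer i lives in word i / 64, at bit position i % 64.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mark integer i as present: set bit (i % 64) of word (i / 64). */
static void bitmap_set(uint64_t *words, int i) {
    words[i >> 6] |= UINT64_C(1) << (i & 63);
}

/* Test whether integer i is in the set. */
static bool bitmap_contains(const uint64_t *words, int i) {
    return (words[i >> 6] >> (i & 63)) & 1;
}
```

With this layout, the bitwise OR of the underlying words is exactly the set union.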

Another common operation on bitmaps is the computation of their cardinality: the number of bits set to true or 1. The compiler I am using (GCC) provides a useful function to compute the number of 1s in a 64-bit word (__builtin_popcountl). So we can compute both the union of the two bitmaps and the cardinality of the result in one pass:

```
int card = 0;
for (int k = 0; k < length; ++k) {
    output[k] = input1[k] | input2[k];
    card += __builtin_popcountl(output[k]);
}
return card;
```
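For readers on compilers without `__builtin_popcountl`, a naive bit-at-a-time count produces the same result (a sketch only; it is far slower than the single `popcnt` instruction discussed below):

```c
#include <stdint.h>

/* Count the bits set in x by repeatedly clearing the lowest set bit.
   Equivalent in result to __builtin_popcountl on a 64-bit word. */
static int popcount64_naive(uint64_t x) {
    int count = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```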

This new loop looks a lot more expensive than the previous one. It does a lot more work!

I implemented both in C and benchmarked them on a recent PC (Haswell, GCC 4.8). Here are the numbers of millions of iterations per second for the two loops:

| array size (one of 3 arrays) | bitwise OR | bitwise OR and cardinality |
|------------------------------|------------|----------------------------|
| 8192 kB                      | 550        | 550                        |
| 4096 kB                      | 655        | 655                        |
| 1024 kB                      | 1,400      | 1,310                      |
| 8 kB                         | 1,900      | 1,500                      |

You read this right: the two loops have identical cost for moderately big arrays!
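For reference, the two loops wrapped as callable functions look like this (my reconstruction for illustration; the actual benchmark harness is not shown, and `__builtin_popcountl` assumes GCC or Clang):

```c
#include <stdint.h>
#include <stddef.h>

/* Plain union of two bitmaps, word by word. */
static void bitmap_or(uint64_t *output, const uint64_t *input1,
                      const uint64_t *input2, size_t length) {
    for (size_t k = 0; k < length; ++k)
        output[k] = input1[k] | input2[k];
}

/* Union plus the cardinality of the result, computed in one pass. */
static int bitmap_or_card(uint64_t *output, const uint64_t *input1,
                          const uint64_t *input2, size_t length) {
    int card = 0;
    for (size_t k = 0; k < length; ++k) {
        output[k] = input1[k] | input2[k];
        card += __builtin_popcountl(output[k]);
    }
    return card;
}
```

Benchmarking amounts to timing many calls to each function over the same buffers and dividing by the elapsed time.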

What happens is that the __builtin_popcountl function is compiled down to a single instruction: popcnt, available on PC processors produced after 2008. This instruction is very efficient: it has a throughput of one instruction per CPU clock cycle. Moreover, recall that modern processors are superscalar: they can execute more than one operation at a time. To make things more complicated, the processor is also often starved for data as it needs to load data from RAM. In this case, the processor is able to compute the cardinality for free on megabytes of data. But even when the data mostly fits in cache, the penalty for doing extra work can be small. The computation of the bitwise OR is so cheap that it leaves much of the silicon on your processor free to do other work at the same time if needed.

Note that even for short arrays, we probably exaggerate the benefit of the short loop: in these particular tests, we loop many times over the same memory sections so that it remains in CPU cache whereas in an actual application we would suffer cache misses.

To recap, if you have to compute the bitwise OR between two arrays, you can do extra work (such as computing the number of 1s in the result) for free!

My observation will not surprise an expert programmer… However, many others are unaware that their mental model of computation (essentially a simple sequential Turing machine) is wrong. It is of little consequence… until they try to make sense of performance results… Sadly, the naive model of software performance is widespread in academia…

Update: To make things more complicated, Nathan Kurz showed that we can rewrite the short loop using AVX instructions, thus producing much faster code on short arrays that remain in cache.

1. You are absolutely right. You can even state it on a more abstract level: the performance of a piece of code should not be measured by its FLOPS but by the amount of work it actually does.

Putting it this way, performance is not reduced to a somewhat blind view at the instruction level (blind as in ‘It doesn’t matter what the code does as long as it’s doing it fast’). Instead, the algorithm itself is taken into consideration.

Comment by Peter — 6/6/2014 @ 9:58

2. Nice explanation!
Such effects are especially visible when we also look at memory latency. Sometimes ALU instructions can ‘hide’ because we are waiting for memory.

For instance, 256-bit AVX is not twice as fast as 128-bit SIMD.

Comment by fenbf — 6/6/2014 @ 13:33

3. Your explanation actually understates the counterintuitive behaviour, since not only are you sneaking in a ‘popcount’ per iteration, you are also getting an extra ‘add’ for free with no additional wall time.

A more charitable view of the academic approach would be that as the superscalar prowess of modern processors increases, the tradition of ignoring constants and assuming one O(n) algorithm is as good as another is finally becoming true.

I tried your code on Haswell i7-4770 with three different compilers, all using “-march=native -O3”. The different levels of performance for such a simple loop are fairly large:

icc-14.0.3: 873 873
gcc-4.8.1: 582 582
clang-3.4: 599 499

The numbers on Sandy Bridge E5-1620 are quite different:

icc-14.0.1: 748 748
gcc-4.8.0: 700 676
clang-3.2: 700 676

That is to say, icc shows a speedup from Sandy Bridge to Haswell, but gcc and clang slow down. I think this is because for everything except the icc/Haswell test, the memory access speed is the limiting factor. This SB test system has quad-channel RAM vs the dual-channel HSW. Still, it’s probably worth noting that on Haswell, icc somehow manages to be 50% faster on such a simple loop.

My personal conclusion from this isn’t that one should therefore switch to icc, but that even when doing something simple in C, one should realize that the compiler has tremendous influence on the results you see. If you really care about performance on a particular processor family, you should probably lock your loop down in assembly rather than trusting to the winds of fate.

Comment by Nathan Kurz — 6/6/2014 @ 14:08

4. Wondering whether the speed of reading from memory was a limiting factor, I just tried modifying the program to show what happens for a variety of buffer sizes. It presents a very different picture, with the simpler loop running much faster until the buffers get too large to fit in the different levels of cache (L1 = 32KB, L3 = ~10MB).

I used the same binary on both Haswell and Sandy Bridge, compiled with ‘icc -Wall -std=c99 -xAVX -O3 -o card-icc card.c’. There are probably some rounding issues involved, but I think the numbers are realistic. The amount of cache/memory used should be 3 * 8 * N (two input buffers and one output * 8 bytes per uint64_t * buffer size).

Haswell:
N=2048 repeat=200000 2925.714286 1241.212121 0.424242 156502015344
N=4096 repeat=100000 2925.714286 1241.212121 0.424242 91025568816
N=8192 repeat=50000 2925.714286 1204.705882 0.411765 77936987128
N=16384 repeat=25000 1861.818182 1412.413793 0.758621 76607756564
N=32768 repeat=12500 1780.869565 1321.290323 0.741935 77661022020
N=65536 repeat=6250 1780.869565 1365.333333 0.766667 88572242330
N=131072 repeat=3125 1706.666667 1365.333333 0.800000 92097612397
N=262144 repeat=1562 1574.880492 1204.320376 0.764706 94307193854
N=524288 repeat=781 787.440246 758.275793 0.962963 95454643365
N=1048576 repeat=390 567.978667 552.627892 0.972973 95913647420
N=2097152 repeat=195 567.978667 552.627892 0.972973 96253977139
N=4194304 repeat=97 573.024631 557.325326 0.972603 95938535739
N=8388608 repeat=48 575.218834 551.579704 0.958904 95029871800
N=16777216 repeat=24 575.218834 551.579704 0.958904 95073245884
N=33554432 repeat=12 575.218834 551.579704 0.958904 95092915964

Sandy Bridge:
N=2048 repeat=200000 2560.000000 952.558140 0.372093 156502015344
N=4096 repeat=100000 2409.411765 975.238095 0.404762 91025568816
N=8192 repeat=50000 2409.411765 975.238095 0.404762 77936987128
N=16384 repeat=25000 1517.037037 1050.256410 0.692308 76607756564
N=32768 repeat=12500 1365.333333 1024.000000 0.750000 77661022020
N=65536 repeat=6250 1412.413793 1024.000000 0.725000 88572242330
N=131072 repeat=3125 1365.333333 975.238095 0.714286 92097612397
N=262144 repeat=1562 1106.672778 890.149843 0.804348 94307193854
N=524288 repeat=781 772.582883 744.488960 0.963636 95454643365
N=1048576 repeat=390 717.446737 705.076966 0.982759 95913647420
N=2097152 repeat=195 693.126508 670.401049 0.967213 96253977139
N=4194304 repeat=97 666.963095 678.079147 1.016667 95938535739
N=8388608 repeat=48 671.088640 671.088640 1.000000 95029871800
N=16777216 repeat=24 682.463024 671.088640 0.983333 95073245884
N=33554432 repeat=12 671.088640 682.463024 1.016949 95092915964

This shows the Haswell system running slightly faster out of L1 and L3, and the Sandy Bridge running faster once we are reading from RAM. But the simple loop remains faster than the complex one until we start reading out of RAM.

Comment by Nathan Kurz — 6/6/2014 @ 15:30

5. I would love to see more test results, but so far AVX was as fast as SSE. Perhaps it is faster when you have many fewer store/load operations.

Comment by Leonid Boytsov — 6/6/2014 @ 17:07

6. @Nathan

I have updated my blog post and revised my code. I agree that the shorter loop becomes faster on pure cache processing.

Comment by Daniel Lemire — 6/6/2014 @ 17:23

7. @Nathan

BTW I am suspicious of your results with the Intel compiler. In my experience, it does a good job at defeating microbenchmarks. I’d like to see the code it generates and see how it differs from gcc. We may find very similar compiled code for the loops, but differences elsewhere.

Update: Turns out that the Intel compiler is not cheating.

Comment by Daniel Lemire — 6/6/2014 @ 17:43

8. It’s also worth keeping in mind that longer code may actually have a shorter code path (e.g., because it performs tests and optimizations).

Basically, no amount of straight-line code (i.e., outside of a loop or recursion) is likely to add significantly to run time.

Comment by Rich Morin — 7/6/2014 @ 6:06

