On feeding your CPU with data

Can you guess the speed difference between these two lines of code? The first line of code does N additions:

for (int i = 0; i < N; i++) sum += arr[i];

The second line of code does N/16 additions:

for (int i = 0; i < N; i += 16) sum += arr[i];

A naive programmer might expect the second option to be 16 times faster. The actual answer is much more complicated and worth further study. I have implemented a simple test.
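A minimal sketch of the kind of test I mean follows; the array size, the clock_gettime timer, and the 64-bit accumulator are illustrative choices, not necessarily what my actual benchmark uses, and a real test would repeat each loop many times and keep the best timing.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock timer; CLOCK_MONOTONIC is an assumption made for this sketch. */
static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(void) {
    const int N = 400 * 1000; /* one of the array sizes from the table below (400K) */
    int *arr = malloc(sizeof(int) * (size_t)N);
    if (arr == NULL) return 1;
    for (int i = 0; i < N; i++) arr[i] = i;

    long long sum = 0;
    double start = seconds();
    for (int i = 0; i < N; i++) sum += arr[i];        /* N additions */
    double t_full = seconds() - start;

    long long sum16 = 0;
    start = seconds();
    for (int i = 0; i < N; i += 16) sum16 += arr[i];  /* N/16 additions */
    double t_sparse = seconds() - start;

    printf("full: %.6f s, 1/16: %.6f s (sums: %lld, %lld)\n",
           t_full, t_sparse, sum, sum16);
    free(arr);
    return 0;
}

Compile with something like gcc -O2 (older glibc versions may require -lrt for clock_gettime).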

I used the GNU GCC 4.5 compiler with the -O2 optimization flag. We can compute the complete sum faster using different compiler flags (-O3 -funroll-loops) and I present these results in a separate column (sum-unroll). In this last version, the compiler makes aggressive use of SSE instructions to vectorize the problem.

N        sum     sum-unroll    1/16
20K      1100    6700          20,000
400K     1000    3700          5100
500M     2100    3900          4200

The speeds are expressed in millions of integers per second.
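To give a feel for what the compiler does with -O3 -funroll-loops, here is a hand-written sketch using SSE2 intrinsics that adds four 32-bit integers per instruction. It only illustrates the technique; the code GCC actually emits will differ.

#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Illustration only: sum an array with 128-bit SSE2 additions,
 * four 32-bit integers at a time. For simplicity, N is assumed
 * to be a multiple of 4. */
int32_t sum_sse2(const int32_t *arr, int N) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(arr + i));
        acc = _mm_add_epi32(acc, v);
    }
    /* Reduce the four lanes to a single scalar. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

Once the loop is vectorized like this, each iteration pulls in 16 bytes at a time, so the speed of the memory subsystem matters even more.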

On tiny arrays, most of the data resides in cache, close to the CPU. Hence, computations are essentially CPU bound: doing fewer computations means more speed.

The story with large arrays is different. There, skipping almost all of the data (15 integers out of 16) only buys you twice the speed! Moreover, once we take into account the vectorized version that the compiler produced (sum-unroll), the difference becomes almost insignificant!

My source code is available as usual.

Conclusion: We are in the big data era. Perhaps ironically, I sometimes get the impression that our processors are drinking data through a straw. Whereas a speed of 20,000 million integers per second is possible when the data is cached, I barely surpass 4000 million integers per second when reading from RAM.

Source: I stole the problem from Elazar Leibovich who posted it privately on Google+.

8 thoughts on “On feeding your CPU with data”

  1. I’m not sure whether it is true, but mainframes are often credited with having much bigger pipes (for moving data around).

    Way back, I was working in C and a friend was working in APL on a huge mainframe, and we’d write the same CPU-intensive algorithms and then compare performance. I never kept track of the numbers, but it was clear that a time slice on my friend’s machine was nearly equivalent in speed to my workstation (which was state of the art at the time) for smaller jobs, but for big bulk jobs his hardware was often stunningly faster.

    Somewhere in the beginning of the OO age, everything became optimized for one-offs, rather than for bulk processing. Usually when I’m optimizing code, the first thing I try is to deal with the data in bulk (followed by memoization) …

    Paul.

  2. RAM is another form of secondary storage, like disk used to be. Cache is now what RAM was conceived to be: a flat memory space with constant access time.

  3. How long do they run, on which OS, and at what priority? If the test process can be preempted by the OS, which is more likely to happen on longer runs (as with the large data arrays), then you might be measuring process context switches without meaning to, and you would probably want to eliminate that…

  4. And things are even more complicated if you’re programming a GPU, which has a much more complicated memory architecture. Or, potentially, if you’re doing multithreaded coding on a multicore machine.

    Time to relearn computational complexity.
