Is software prefetching (__builtin_prefetch) useful for performance?

Many software performance problems have to do with data access. You could have the most powerful processor in the world, but if the data is not available at the right time, the computation will be delayed.

It is intuitive. We used to locate libraries close to universities; in fact, universities were built around libraries. We also gather smart and knowledgeable people together.

Sadly, our computers are organized with processors on one side and memory on the other. To make matters worse, memory chips are simply unable to keep up with the processing power of processors. Thus we introduce caches. These are fast memory stores that are stuck to the processor. We have a whole hierarchy of CPU caches.

Happily, processor vendors like Intel have invested a lot of effort in making sure that the right data is made available at the right time. Thus if you access data according to a predictable pattern, the processor will do a good job. Also, the processor can reorder its instructions and execute a load instruction earlier if it is useful. Simply put, a lot of engineering went into making sure that data-access problems are minimized. You might think that, as a programmer, you can do better… but the truth is, most times you couldn’t. At least, most of us, most of the time, couldn’t.

You can still “blind” your processor. Some expensive instructions, like division, will apparently prevent processors from fetching the data in time. At the very least, they can prevent some instruction reordering. Intuitively, the processor is so busy processing a monstrous instruction that it has no time to see what is coming next.

You can also thoroughly confuse the processor by doing many different things in an interleaved manner.

In a comment to one of my earlier blog posts, someone suggested that I could improve the performance of my code by using software prefetches. That is, processor vendors like Intel provide us with instructions that can serve as hints to the processor… effectively telling the processor ahead of time that some data will be needed. With the GCC and Clang compilers, you can call the __builtin_prefetch intrinsic function to get this effect. I do not know if other compilers, like that of Visual Studio, have this function.
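
For reference, here is a minimal sketch of how the intrinsic can be called; the function name and the lookahead distance of 16 elements are made up for illustration, not tuned values. The second argument is 0 for a read and 1 for a write, and the third is a temporal-locality hint from 0 to 3; both are optional.

#include <stddef.h>

// Minimal sketch (illustration only): hint that data[i + 16] will be read soon.
long sum_with_hint(const long *data, size_t n) {
  long total = 0;
  for (size_t i = 0; i < n; i++) {
    if (i + 16 < n)
      __builtin_prefetch(&data[i + 16], 0, 3); // read hint (0), keep in all cache levels (3)
    total += data[i];
  }
  return total;
}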

The instructions generated by this function have a cost of their own, so they can definitely make your software slower. Still, when inserted in just the right manner in the right code, they can improve the performance.

Should you be using __builtin_prefetch? My experience says that you probably shouldn’t. I don’t like hard rules, but I have not yet seen a scenario where they are obviously a good idea. That is, I claim that if software prefetching is useful, that’s probably because your code could use some work.

I asked the commenter to provide an example where software prefetching was a good idea. He offered a scenario where we sum the values of bytes within an array…

for (int j = 0; j < ...; j++) {
    counter += bytes[p];             // accumulate one byte per iteration
    p = (p + 513) & (NUMBYTES - 1);  // stride of 513, wrapping around (NUMBYTES is a power of two)
}

This code, by itself, will not benefit from software prefetching. However, the commenter interleaved the sums…

for (int j = 0; j < ...; j++) {
  for (int i = 0; i < batchsize; i++) {  // advance batchsize independent streams in lockstep
    counter[i] += bytes[p[i]];
    p[i] = (p[i] + 513) & (NUMBYTES - 1);
  }
}

If you pick batchsize large enough, you can confuse the processor and get some benefits from using the __builtin_prefetch function. I suppose that it is because the access pattern looks too unpredictable to the processor.
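
For reference, here is roughly where the hint can go in the interleaved loop; this is a sketch of the idea, not the exact code from my benchmark:

for (int j = 0; j < ...; j++) {
  for (int i = 0; i < batchsize; i++) {
    counter[i] += bytes[p[i]];
    p[i] = (p[i] + 513) & (NUMBYTES - 1);
    __builtin_prefetch(&bytes[p[i]]);  // hint: needed the next time we return to stream i
  }
}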

Let us look at the numbers I get…

sequential sums                             3 s
interleaved sums                            4.4 s
interleaved sums with __builtin_prefetch    4.0 s

The prefetching improves the performance of the interleaved sums by 10%, but you can get much better performance simply by doing the sums one by one. The resulting code will be simpler, easier to debug, more modular and faster.

To provide evidence that __builtin_prefetch is useful, you need to take code, optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance.

My source code is available.

Credit: I’d like to thank Simon Hardy-Francis for his code contribution.

21 thoughts on “Is software prefetching (__builtin_prefetch) useful for performance?”

    1. Stockfish is a good example. It even uses some extra code to pre-generate the hashkey of the next move (before it is made) to calculate the prefetch address. After the prefetch, it does some more work on the current node, then makes the next move on its internal board (updating/calculating the hashkey again) and only then accesses the hashtable with a low hash-miss rate.
      The prefetch can increase the overall speed of a chess program by 2-5% (and maybe only 1-2% without the pre-calculated hashkey).
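
      A rough sketch of that pattern, with made-up names rather than Stockfish’s actual code: the entry for the next position is prefetched while other work on the current node overlaps the memory latency.

      #include <stdint.h>

      #define TABLE_SIZE (1 << 16)  // number of entries, assumed to be a power of two
      typedef struct { uint64_t key; int score; } Entry;
      static Entry hash_table[TABLE_SIZE];

      // Sketch only: the caller has already computed next_key for the move it is about to make.
      int probe_after_prefetch(uint64_t next_key, int work) {
        Entry *e = &hash_table[next_key & (TABLE_SIZE - 1)];
        __builtin_prefetch(e);                 // start loading the entry early
        int acc = 0;
        for (int i = 0; i < work; i++)         // stand-in for the work done on the current node
          acc += i;
        return (e->key == next_key) ? e->score : acc;  // probe: hopefully a cache hit by now
      }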

  1. Interesting topic. Prefetch is also available on MSVC as _mm_prefetch.
    Anecdotally, I am often disappointed by the results, and it’s worrisome that the required lookahead distance is so system-specific (perhaps can be auto-tuned).

    That said, we’ve seen a few speedups, e.g. 1.1x in a memory-bound checksum. It may also help to use regular loads (ensuring compiler doesn’t optimize them out) instead of prefetches.

    Generally agree with your recommendation to just optimize for the hardware prefetcher.

  2. This is a topic that I would like to better understand, as this can influence how we try to optimise cache usage, as cache appears to be important for effective use of vector instructions.
    Does the L1 cache store individual data items or memory pages (i.e., blocks of memory, say 64 KB)? A similar question for the L2 and L3 caches. If the cache stores active memory pages, then I think the discussion needs to address this, as a “fetch” would be for a memory page, while your loop example considers variables.
    The other unknown is for multi-thread coding, how is data shared between processor caches, especially if this data is changing.
    If cache stores “blocks” of memory, I think this should be more significant for pre-fetch than a counter loop.
    I would very much appreciate if you can explain this relationship.

  3. My apologies, as my previous post should have been clearer when referring to “processor”, “core” and “thread” and how L1, L2 & L3 cache relate to these… A supplementary question!

    1. On current Intel processors, I think L3 is always shared between the cores. L1 and L2 are on a per-core basis. This will definitely be different in other types of processors.

      A core can run 1 or more threads.

      If you have many cores doing prefetches, they’ll compete for cache space and for memory requests.

  4. I find this a very interesting topic.
    My apologies if I am redirecting this post idea, but I don’t think a prefetch instruction sufficiently explains the performance and especially, how to influence improvement.
    Processor data storage is described as a hierarchy of data storage types:
    • Memory
    • L3 cache
    • L2 cache
    • L1 cache
    • 32-bit registers
    • 64-bit registers
    • Vector registers, which can be 128-bit, 256-bit or 512-bit
    All these have different group sizes.
    Where does a prefetch instruction fit into this mix?
    When optimising data availability to vector registers, I expect this must depend on how data is moved through memory and the 3 levels of cache, in both directions.
    As I said initially, I don’t understand how this works, but a better understanding might help when devising coding strategies to achieve improvement. I have achieved huge variations in performance of vector instructions, often without a clear understanding as to why.
    I do think it is fortunate that we can’t interfere directly in how this is done, as I’m sure we would do a worse job.

  5. Hello, Daniel. I disagree with the points in this article, so I made an example with a regular loop that becomes faster with a prefetch. The answer is long, so I made a post on my blog.

    1. Quoting from your article…

      The performance improvement is 1.36%. We can say this is not a very good result and we will be right. The main problem is that prefetch itself takes time for execution.

      I am not sure we disagree in the end?

      1. but you can get much better performance simply by doing the sums one by one. The resulting code will be simpler, easier to debug, more modular and faster.

        I made an example where we can speed up summing one by one with prefetch. My first point is that prefetch should be a compiler’s optimization, not done by hand. And the other point is that regular memory access is not a good example for prefetch. Later I will write about cases where it is really good and cannot be improved by code refactoring.

        1. prefetch should be a compiler’s optimization

          But how would the compiler know (a) that the access pattern causes the automatic prefetch to be less effective, and (b) that the memory accessed is large enough to warrant the use of the prefetch instruction? Not to mention (c) that batch access with prefetch is more effective… would you imagine the compiler reorganizing non-batched code into batched code?

  6. As the person who came up with this simple example — and donated part of his Saturday morning — at the request of Daniel Lemire… I’m surprised I didn’t even receive a ‘thanks’.
    Instead the ‘algorithm’ is criticized: “prefetching improves the performance of the interleaved sums by 10%, but you can get much better performance simply by doing the sums one by one.” But I don’t think it’s possible to get a good example with a good algorithm in only a few dozen lines of code. What was requested: “a short C program (say 30 lines of code) where __builtin_prefetch() helps”. So this criticism seems somewhat unfair.
    Also, in my tests [1] I noticed a performance gain of up to 40% with a batch size of 64… which is way bigger than the 10% improvement reported in the blog post.
    The test code is designed to show the effects of randomly accessing non-cached RAM using different batch sizes. It may be that accessing RAM with a smaller batch size is generally faster — as Daniel Lemire points out — however, IMO that’s missing the point. The point is that although the test code shows a possible speed improvement between 10% and 40%, using prefetch allows the developer to execute other code while the prefetches are happening. And this can be A LOT of other code if it works on RAM which is already cached. Consider the fact that the same test code works nearly 14 times faster if operating on already cached RAM. That’s a huge amount of instructions that could be operating while waiting for the prefetch instructions which operate asynchronously to other instructions. So the pattern is (a) prefetch in a batch, (b) do some task on purely cached RAM, and (c) finally use the prefetched RAM (a rough sketch follows below).
    Here’s the problem: You obviously will never run into an ‘abc’ situation by optimizing as Daniel Lemire suggests: “optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance”. Instead you need to refactor the algorithm around prefetching to truly take advantage 🙂

    [1] https://gist.github.com/simonhf/caaa33ccb87c0bf0775a863c0d6843c2
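
    A rough illustration of that (a)/(b)/(c) pattern, reusing the article’s variable names; do_other_cached_work() is a made-up stand-in for the overlapping work:

    for (int i = 0; i < batchsize; i++)
      __builtin_prefetch(&bytes[p[i]]);  // (a) issue prefetches for the whole batch
    do_other_cached_work();              // (b) useful work that only touches already-cached RAM
    for (int i = 0; i < batchsize; i++)
      counter[i] += bytes[p[i]];         // (c) finally consume the prefetched data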

    1. As the person who came up with this simple example — and donated part of his Saturday morning — at the request of Daniel Lemire… I’m surprised I didn’t even receive a ‘thanks’.

      I sent you an email privately just now. I should have emailed you privately to discuss my upcoming blog post, I did not. For that, I apologize. The source code I link to refers to your original gist, but my code is substantially different: I kept the main idea, but I simplified the code for the purpose of producing a simple example. I appreciate your work. I kept your name out of the blog post because I did not want to personify the argument and because I am doing a different benchmark. I should have asked you whether you wanted me to use your name.

      I have appended a credit section to my blog post where I thank you for the code contribution.

      I’m sorry if I offended you. That was not my intention.

      But I don’t think it’s possible to get a good example with good algorithm in only a few dozen lines of code.

      It could be that software prefetching becomes very useful in complex code, even if it is not useful in simpler code.

      Also, in my tests I noticed a performance gain of up to 40% with a batch size of 64… which is way bigger than the 10% improvement reported in the blog post.

      My numbers are not very sensitive to the batch size beyond a certain point… 32, 64… it is all the same. I tried to pick the parameters so that the benefits of the prefetching are greatest. If you find better parameters, I’ll be glad to update my blog post.

      My code is different from your code, which is partly why I did not name you in the post. I am using a simplified version of your code.

      Consider the fact that the same test code works nearly 14 times faster if operating on already cached RAM. That’s a huge amount of instructions that could be operating while waiting for the prefetch instructions which operate asynchronously to other instructions.

      It shows that we can be “memory bound”. However, my argument is not whether we should prefetch the data or not… you should obviously prefetch it… my argument has to do with whether you need software prefetching. I think you do not. My claim is that you can almost always rewrite your code without software prefetches in such a way that it will be at least as fast.

      I’m willing to admit that there might be cases where software prefetching is useful, but I think that they are uncommon.

      So the pattern is (a) prefetch in a batch, (b) do some task on purely cached RAM, and (c) finally use the prefetched RAM. Here’s the problem: You obviously will never run into an ‘abc’ situation by optimizing as Daniel Lemire suggests: “optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance”. Instead you need to refactor the algorithm around prefetching to truly take advantage 🙂

      Your approach might well be practically sound, but I submit to you that you take it as an axiom that you need the software prefetching. This can work well when programming… but I am trying to determine whether the software prefetching is needed at all.

  7. Hello, Daniel.
    Is AVX512PF useful? It provides prefetching for random access in gather(). I wonder about the improvement from AVX512PF, but I don’t have a platform to test it.

      1. details from https://en.wikipedia.org/wiki/AVX-512

        Xeon Phi x200 (Knights Landing):[1][9] AVX-512 F, CD, ER, PF
        Xeon Phi Knights Mill:[7] AVX-512 F, CD, ER, PF, 4FMAPS, 4VNNIW, VPOPCNTDQ
        Skylake-SP, Skylake-X:[10][11][12] AVX-512 F, CD, VL, DQ, BW
        Cannonlake:[7] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI
        Ice Lake:[7] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES
