Getting 4 bytes or a full cache line: same speed or not?

Many software operations are believed to be “memory bound”, meaning that the processor spins empty while waiting for data to come from memory. To compensate, our processors rely on fast cache memory. If your problem fits in cache, it will typically run much faster than if the processor constantly needs to query the memory subsystem.

If you need to retrieve a block of data, the processor does not retrieve just the necessary bytes. It retrieves data in units of a “cache line” which is typically 64 bytes on Intel processors. So the first 64 bytes of memory addresses are on the first cache line, the next 64 bytes are on the second cache line, and so forth.

To be clear, it means that you should not expect it to be faster to retrieve only 8 bytes of data rather than 64 bytes of data, assuming that all the data in question lies in a single cache line. That is what I expect most programmers to tell you.

Let me state this as a “law”:

Given a large out-of-cache memory block, the time (latency) required to access 4, 8, 16 or 64 bytes of data from a cache line is the same.

How well does this mental model works in the real world?

Suppose that you build a large array of 32-bit integers. There are sixteen 32-bit integers per cache line. The array is large enough to exceed your cache by a wide margin. Then you repeatedly select a cache line at random and sum up a few 32-bit integers in the cache. Maybe you just pick one integer per cache line, maybe you pick four, maybe eight, maybe sixteen. Your code might look like the follow, where percache is the number of integer you access per cache line:

uint32_t counter = 0;
for(size_t i = 0; i < howmany; i++) {
    size_t idx = pick_address_of_random_cache_line();
    // there are 16 uint32_t per 64-byte cache line... 
    for(size_t j = 0; j < percache; j++) {
       counter += array[16 * idx + j];
    }
}

So, how sensitive is the running to the parameter percache? A simple theory is that if the array is large enough, it does not matter whether you access one, two, four or sixteen values per cache line.

Let us run an experiment to see. I generate an 8 GB array, it far exceeds my CPU cache. Then I vary the number of integers accessed per cache line, and I sum it all up.

Integer per cache lineCycles per cache line
138
460
870
16110

So even though I am randomly accessing cache lines, out of cache, whether I access one, eight, or sixteen 32-bit values per cache line matters a great deal.

To make things interesting, I can try to accelerate these operations by using SIMD instructions. SIMD instructions can do many additions at once. Let us see…

Integer per cache lineCycles per cache line
835
1641

Wow! So hand-tuned vectorized helped this problem a lot, even though I am randomly accessing cache lines from a large array.

Let me twist the problem around. After accessing each cache line, and updating my counter by summing up one, two, four, eight or sixteen value, I am going to change the state of my random number generator in a way that depends on the newly computed counter. This makes the problem harder for my computer because, before it can fetch the next cache line, it absolutely must wait for the computation of the counter to be finished.

Integer per cache lineCycles per cache line
1300
4310
8320
16330

As we can see, the problem is about an order of magnitude harder. However, even then, whether I access one or sixteen values per cache line matters: the difference is about 10%. It is not negligible.

Thus it might be a lot harder than you expect to be “memory bound”. Counting the number of cache lines accessed is not a sane way to measure the efficiency of an out-of-cache algorithm.

My code is available.

22 thoughts on “Getting 4 bytes or a full cache line: same speed or not?”

  1. Really interesting, do you have an explanation for why there’s so much sensitivity to the “percache” tunable? I’d honestly expect not that much difference in working with 1, 2, or 16 integers in the same cacheline once the cache miss penalty has been paid…

  2. So your “law” is true. Even before your change to make each loop iteration dependant on the previous iteration the latency for each cache line was around 300 cycles. But processors can fetch multiple cache lines in parallel and so an OOO processor is very good at hiding this latency. Basically the same as your conclusion.

  3. It is an understandable architectural optimization that cache line fills can provide data “early” instead of first reading the whole cache line (although I wonder how this goes with ECC memory). After all, SDRAM bandwidth, measured in whole cache lines, is sufficiently low to make this kind of a latency optimization worthwhile.

    Even more interesting (or maybe just more esoteric) experiment would be to look at the effect of read order of the filled cache line. SDRAM burst ordering can bring a specific “critical SDRAM word” on the front of the read, but may shuffle the rest of the burst in a somewhat unintuitive manner. On top of all this, modern processors may reorder reads, which may improve performance in this scenario, but make it less obvious what is going on.

    The clearest scenario is the following: reading bytes sequentially from beginning with a large alignment (such as a cache line boundary) onwards should result a sequential burst ordering, providing data on address-sequential manner. Starting the operation from some other alignment is likely not to do so, and a data-dependent address chaining might expose this fact.

    This information may be valuable to an extent when performing multiple, possibly interdepent accesses to a data structure likely not to reside in the cache hierarchy. At least I can fathom some microbenchmarks to expose such implementation details.

    1. One such a microbenchmark would be to construct all circular linked lists of eight pointers, eight bytes each (both the native x64 pointer size as well as the most likely SDRAM word size), each list aligned on a cache line. Then simply traverse off-cache linked lists each by eight steps; assembly code couldn’t be simpler. I would expect that there are eight time-optimal traversal orders among those 43020 possible (taking the starting offset into account).

      1. One more addition: reading just two pointers on such lists would be likely expose most of the oddity involved. Say, lists which have their first read entry on the beginning of the cache line and the second entry on the end of it. It is important to keep these reads dependent; otherwise the DRAM controller might (I don’t really know if they do it in reality) be able to find a burst ordering which brings latency of reading all required words down.

    2. Interesting I hadn’t thought of critical word first stuff being a contributor here – does modern hardware still even use it? I had heard that they don’t: the high bandwidth and relatively long latnecy (and the wide busses which means a lot of data arrives at the same time anyways) would make the relative benefit quite small.

      It would be an interesting test though. I’ll do it soon in uarch-bench.

      FWIW my impression of the primary effect here, in the first set of “throughput” tests, is that the extra work done by the versions with more than one read ends up reducing the memory level parallelism by clogging both the ROB and load buffers with instructions and loads so that after the first miss the CPU can’t get far enough ahead to max out MLP.

      GCC decided not to inline the test functions, but if you had the inline keyword it does, and the results are a lot closer, especially for 4 and 8, because the loop overhead disappears, cutting down instructions clogging the ROB. The 16 case probably maxes out the load buffers and is still fairly slow.

      1. Interesting I hadn’t thought of critical word first stuff being a contributor here – does modern hardware still even use it? I had heard that they don’t

        Frankly, I’m not certain if they do. It should be checked out by benchmarking, but if any of the effects seen in the tests in this blog post actually result from the latency caused by simple bandwidth bottleneck on the memory bus (there definitely are other plausible explanations on modern hardware), the effect could be sufficient to provide a low-hanging fruit for optimization on purely random dependent reads, such as traversing large graphs or executing DFAs. Unless, of course, critical word ordering has an overhead which makes it impractical to be a benefit these days.

        I quickly estimated that performance effect of this could be up to 10-20% (< 8 clock cycles), assuming sequential burst reads and critical word first reads are equal on all other aspects.

        My knowledge on the subject is at least a decade old. Carefully constructed benchmarks could smash my assumptions…

        1. I started writing a test along the lines you suggested, but before I even got to the critical word (from RAM) part, I found another weired effect: when multiple loads hit the same L2 line (after missing in L1) only the first loads gets the “expected” L2 latency of 12 cycles. The rest of the loads get a significantly longer latency of 19 cycles.

          This almost sounds like “critical word first” at the L2 to L1 level, but I don’t think it is: 7 extra cycles makes no sense for a 64-byte bus that should transfer the whole cache line in one shot and that has 64-byte bandwidth, and more importantly the effect occurs even if the other reads are for the exact same location.

          More likely what happens is something like an optimized path for waking up the load that triggered the L1 miss and providing it the value directly, while the other loads just get woken up without receiving a value and have to read from L1 (although the difference of 7 cycles is even a bit longer than the usual L1 latency).

          I put more details at RWT.

          This effect would also explain some of part of Daniel’s results: e.g., part of the final latency test: there would be at least an additional 7 cycles added on (which BTW exactly explains 310 cycles vs 300 for 4 vs 1 adds: 7 cycles extra + 3 cycles of adds).

          1. Uh-oh, that’s very curious! I think I have heard of such an effect before, but I can’t really get a grasp where or when that would have been.

          2. I think the Intel glitch Travis found on RWT is worth noting in this thread too. Essentially, a trivial linked list sum generated by all mainstream compilers seems to generate code which runs unexpectedly 19 cycles per iteration, when a similar, but more convoluted piece of code does the same at 12 cycles per iteration when data is in L2:

            https://www.realworldtech.com/forum/?threadid=178902&curpostid=178969

            Modern microarchitectures are occasionally strange indeed!

            1. Well, this is was partially a restatement of what Travis described above, but with a concrete, real-world example that probably would surprise many, more than a synthetic and partially a nonsensical benchmark written in assembly.

              Also, this must have been the case of “an effect” I had run into earlier on. Many others must have also run into it…

  4. With MSVC compiler, there’s very small difference when testing with 8GB vectors, only 8% between 1 and 16.

    I think the problem is GCC optimizer doesn’t like that code for some reason.

    Here’s my version that should be portable across compilers, scalar-only, C++, without assembly, and slightly more optimized: https://github.com/Const-me/MemoryAccessCosts

    The results are in result-gcc.txt and result-msvc.txt in that repo. They’re from the same PC, I’m using gcc 5.4.0 running in WSL in Windows 10.

    If you want to build my code, run cmake . then make then ./memory-access-demo.

    1. Thank you. It appears that MSVC is better at vectorizing this code than GCC. According to your assembly, MSVC relies on SIMD instructions. Note that my post does report what happens if you vectorize: a much smaller difference. The smaller difference is apparently related to the fact that we use far fewer instructions.

      1. I dunno what’s going on.

        I’d like to point out that even ignoring the caches, RAM delivers 128 bits / transfer on my PC. I have dual-channel DDR3 RAM, each channel delivers 64 bits / transfer. 128 bits is exactly an SSE register.

        One possible explanation is Intel optimizes performance of their chips for their own compiler, AFAIK ICC is even better at vectorizing than MSVC. Unfortunately I don’t own an ICC to test that. So the performance of scalar RAM loads just wasn’t a priority for Intel (just like performance of e.g. x87 floating point math, modern compilers tend to use SSE & SSE2 instead of x87).

        Another possible explanation is, it’s an unfortunate accident. CPU decodes memory load instruction, issues a load request, decodes another one, discovers it loads from the same line, repeats a few more times, and then hits a hardware limit of in-flight instructions, or in-flight RAM requests, and just stops and waits. Meanwhile with SSE there’re fewer instructions and fewer RAM loads, maybe the CPU does deeper pipelining.

        Another possible explanation is the prefetcher silicon screwing things up. CPU tries to detect consecutive RAM reads and optimize them by prefetching next memory addresses. Maybe when CPU decoded 16 scalar load instructions reading consecutive addresses the prefetcher kicked in and started prefetching unneeded data, while 4 vector load instructions isn’t enough for that.

        Optimization is hard. I’m not in academia, I’m a software developer often working on performance-sensitive code. I’ve experienced many weird things, e.g. couple months ago I’ve ported my relatively large C++ project from VS2015 to VS2017, and for some functions I saw up to 2x performance degradation, just because the compiler produced slightly different code from the same C++ source. I had to refactor a couple of things to maintain the performance (for that kind of work, a good profiler helps a lot).

        1. It’s not the prefetcher: the results are more or less the same with prefetching turned off.

          IMO the parallel results (the first set Daniel presents, with large relative difference) are explained simply by the higher “percache” runs having less MLP. You can measure this with perf and see the difference: MLP drops as you move to accessing more elements per cache. For example, I get MLP factors of 5.4, 4.0, 2.9, 2.1, for percache of 1, 4, 8, 16, respectively. At least on my machine this largely explains the average performance difference of ~35, ~51, ~69, ~94 cycles per operation (i.e,. the MLP is very nearly proportional to performance).

          You can measure this yourself with perf or ocperf with the events l1d_pend_miss_pending and l1d_pend_miss_pending_cycles: dividing the former by the latter gives the observed MLP.

          The MLP is lowered with higher percache values because of the extra work done. E.g., percache 16 executes 9 million instructions for my test versus 3 million for percache 1. It works out to about 90 instructions per cache line. Since the ROB (reorder buffer) is only ~200ish instructions, it means you can only fit at most 224 / 90 = 2.5 cache-line misses in the instruction window, compared to 3x that number for the percache 1 case. Note that 2.5 is approximately the observed MLP: in fact, the ROB-based max MLP approximately bounds by above all the observed MLP numbers.

          This also explains why vectorization is helpful: not because it is inherently “faster” (the time to execute the actual sums is fairly small compared to the observed per-line runtime, with or without vectorization) – but because it uses fewer instructions so the instruction window is much larger. For example, sumrandom in Daniel’s code, which is vectorized, only executes ~21 instructions per cache line despite summing the entire line, so the instruction window can fit roughly 10 or 11 lines and the observed MLP is 7.1: much higher than even percache 1.

  5. Something related and very interesting is the knock-on effect of kicking out data out of the cache. If your data is sparse every time you fetch a cache line and you want significantly less data (e.g. 4 bytes) you are kicking out of cache a previous cache line (e.g. 64 bytes).

    For data structures benefiting from caching it’s imperative to have the minimum memory access units/nodes be as compact as possible within a cache line. If this is not possible and the effects of caching are negligible, consider using non-temporal memory access so you keep the existing cached data for some other parts of the program.

    1. Definitely. This is also one source of the sometimes big gap between microbenchmark and real world results.

      Lets say you are tuning some lookup structure using a “realistic” microbenchmark (i.e., it reproduces the access pattern in your application, perhaps even using a trace of real application accesses). You’ll end up with something that balances cache use optimally only with the respect to the microbenchmark: i.e., kicking a line out only penalizes later iterations of the microbennchmark (if at all).

      Drop that think into the real application and you’re now kicking out lines used by the application code that will run after your lookup, which might have a very different cost than it did in your benchmark. As always, microbenchmarks are useful but have to be used very carefully, especially when comparing implementations with different memory impacts.

  6. Something related and very interesting is the knock-on effect of kicking out data out of the cache.

    When I was building benchmarks like this you would see this especially when testing array sizes that fit in the L2 cache. If you build a chain of pointers and then start walking the list immediately then all the data in the L2 is modified and then it gets pulled to the L1 cache it stays in the modified state. So then every read would need to do a dirty writeback and evict data from the L1 and put it back in the L2. This is much slower.

    So I moved to writing a large junk array of data on the side to flush the cache before starting my measurements. Works much better if my goal is simple cache miss latencies.

    before: https://www.dropbox.com/s/raqv6eubcti0ir3/mem_lat1.jpg?dl=0
    after: https://www.dropbox.com/s/t7ebunoxpzzyc18/mem_lat3.jpg?dl=0

    1. That’s interesting: I wouldn’t necessary expect that. It would seem that L1 could have its own dirty flag, separate from the L2 MESI state, so it would only be written back if actually changed. It seems you are saying that the line in L1 will inherit the L2’s modified state, permanently (until the line is flushed out of L2). Hmm…

      I will check out your code.

  7. Seems to me this means memory locality matters a lot! Ignoring the SIMD version, my estimate of the actual “work” performed is:

    Integers/line | Cycles/integer | Original cycles/line

    1 | 38 | 38
    4 | 15 | 60
    8 | 8.75 | 70
    16 | 6.875 | 110

    That is a 5.5X speed up, just by putting data closer together! I spent more time than I would like during my PhD tweaking an in-memory B-tree, which has similar performance characteristics.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax