Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

Memory is organized in cache lines, frequently blocks of 64 bytes. On Intel and AMD processors, you can store and load memory in blocks of various sizes, such as 64 bits, 128 bits or 256 bits.

In the old days, and on some limited devices today, reading and storing to memory required you to respect alignment. You could not simply write any block memory anywhere. Today you mostly can write wherever you like. There is a small penalty for misalignment but the penalty is typically under 10% as I argued in my 2012 post Data alignment for speed: myth or reality?

Yet writing or reading from two cache lines (what Intel calls a cache line split) instead of a single one is likely to be more expensive at least some of the time. Let us explore an interesting scenario. It is sometimes possible to avoid crossing a cache line boundary by doing two memory accesses instead of a single large one. Is it worth it?

Cache lines in memory are aligned on addresses that are divisible by 64 bytes. Suppose that you would want to store 256 bits of data every 64 bytes, at just the right offset so that the 256 bits overlap two cache lines. You hit last 16 bytes of one cache line and the first 16 bytes of the second one. You can achieve the desired results by starting with an offset of 48 bytes. That is, you find find a memory address that is divisible by 64 bytes, and then you add 48 bytes.

In code, using Intel intrinsics, it looks as follow:

char * p = ...
for (size_t i = 0; i < ... ; i++) {
  _mm256_storeu_si256(p + i * 64, vec);

You can avoid entirely crossing the cache line bounding by first storing 128-bit of data at the 48-byte offset, and then storing another 128-bit of data. The first store is at the end of the first cache line and the second store is at the beginning of the second one.

char * p = ...
for (size_t i = 0; i < ... ; i++) {
      _mm_storeu_si128(p + i * 64, vec);
      _mm_storeu_si128(p + i * 64 + 16, vec);

How do these two approaches fare? I wrote a simple benchmark that stores many blocks of 256-bit at a 48-byte offset. It either stores it in one 256-bit step or in two 128-bit steps. I record the number of cycles per iteration on an AMD Rome processor. I rely on the the pre-installed RedHat LLVM compiler (clang version 3.4.2).

A single 256-bit write 2.33 cycles
Two 128-bit writes 2.08 cycles

It is a gain of slightly over 10% for the two 128-bit writes. What if you remove the 48-byte offset (or set it to zero)? Then both benchmark clock at 2.08 cycles per iteration. I expect that the 48-byte offset is a best-case scenario for the two 128-bit writes: if you change it then both approaches have the same cache-line overlap problems. So this 10% gain requires you to choose the alignment carefully.

My source code is available. Of course, your results will depend on the processor and to some extend on the compiler. You should be able to run my benchmark program on your own Linux x64 system.

Be mindful that if you are getting worse results on a per cycle basis on comparable hardware, you might be limited by your compiler. An analysis of the assembly might be required.

Further reading: Travis Downs has an interesting complementary analysis. He finds that unaligned stores crossing a 32-byte boundary can be tremendously expensive (i.e., 5 cycles) on the type of processor I am using for these tests (Zen 2). The 32-byte boundary exists irrespective of cache lines. Meanwhile, he finds that stores aligned exactly on a 32-byte boundary are much cheaper (1 cycle).

Daniel Lemire, "Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores," in Daniel Lemire's blog, March 18, 2020.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

6 thoughts on “Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores”

  1. Multi-threading will influence that though, unless I’m misunderstanding things. Changes in a cache-line on one core are not immediately visible in that cache-line viewed from another core [and not necessarily in that order] and needs to be ‘synchronized’ [the tech-term is surely something else, but], which takes time/cycles.

    In C++, the alignment of the object will be at least it’s size, aligning on 48 bits is UB [casting a void pointer returned from malloc to a type, does not create (an) object(s) of that type and even for ‘objects’ of type int, this is technically UB, one needs to go through placement new, which imposes the alignment].

    Having said that, current compilers don’t seem to have a problem with any of the above.

    1. In this instance, I am relying on Intel intrinsics which have “unaligned” as part of their specification (look for the small “u” in the name). So my code is not relying on undefined behaviour.

  2. Thank you for sharing this is very interesting read. Yet still I have to split __m256 is it correct? Right now I was struggling with similar problem and got exited when I read this post but I think still not the answer to my problem, that I had 12% cache misses because I had tiled/vectorized a 3 dim large nested array. So each iteration is jumping way forward, without tiling got 1% but additional second on process time

    1. I don’t understand why you would get more cache misses… It should not matter how you read the data as far as cache misses go. You could read the data byte-by-byte… and still get the same number of cache misses.

      1. Sorry dont have exact answer and for cryptic code, still studying/working on it,now code looks like this:

        ..another loop...
        int siz = n1 - (n1 & 7);
        int mi = dMatrixInfo[i][1];
        for (int j = 0; j < siz; j = j + 8) {
        float *d2 = &(d[mi * m++]);
        float *d22 = &(d[mi * m++]);
        float *d23 = &(d[mi * m++]);
        float *d24 = &(d[mi * m++]);
        float *d25 = &(d[mi * m++]);
        float *d26 = &(d[mi * m++]);
        float *d27 = &(d[mi * m++]);
        float *d28 = &(d[mi * m]);
        int size = n2 - (n2 & 7);
        for (int k = 0; k < size; k = k + 8) {
        _mulAddBroadcast(&d2[k], &eVal, &n[k]);
        _mulAddBroadcast(&d22[k], &eVal2, &n[k]);
        _mulAddBroadcast(&d23[k], &eVal3, &n[k]);
        _mulAddBroadcast(&d24[k], &eVal4, &n[k]);
        _mulAddBroadcast(&d25[k], &eVal5, &n[k]);
        _mulAddBroadcast(&d26[k], &eVal6, &n[k]);
        _mulAddBroadcast(&d27[k], &eVal7, &n[k]);
        _mulAddBroadcast(&d28[k], &eVal8, &n[k]);

        which I have tiled from

        int size = n2 - (n2 & 7);
        for (int d = 0; d < size; d += 8) {
        _mulAddBroadcast(&d2[d], &eVal, &n[d]);
        for (int d = size; d < n2; d++) {
        d2[d] = fma(eVal, n[d], d2[d]);

  3. I’ve been using headers on various data types, so a string might be stored as a pointer to the character data, with some extra information stored immediately before the first character in memory. If I access the character data, then step backwards to get an item from the header, am I likely to get a cache miss / stall that I might not get if I stored a pointer to the start of the header and accessed the character data as an offset from that?

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.