Our computers do not read or write memory in units of bits or even bytes. Rather, memory is accessed in small blocks called “cache lines”. For a given system, the cache line size is usually fixed and small (e.g., 16 to 256 bytes). All Intel/AMD x64 systems I have used rely on a 64-byte cache line. My current Apple laptop, with its ARM-based M2 processor, relies on a 128-byte cache line.
How can you measure the size of the cache line? I asked on Twitter/X and a few people (e.g., Peter Cawley, Will Bickford) pointed to a phenomenon called false sharing. I will come back to false sharing in a future blog post: it depends on parallelism and it may be trickier than it seems. I want a simpler solution.
Many people (Robert Clausecker, Sam Westrick, Tomasz Kowalczewski, Vinoth Deivasigamani, Sergey Slotin and many others) proposed a ‘strided access’ benchmark.
I finally decided to test a strided copy: from a large array, I copy one byte every N bytes to another large array. The problem should be largely “memory access bound” as long as N is not too small. I start N at 16. Importantly, I never read my own writes, so I avoid concerns with 4K aliasing on Intel processors.
If N is at least twice the cache line, then I can effectively skip one cache line out of two. If N is smaller than the cache line, then every cache line must be accessed. A stride just above the cache line should not be sufficient to see large gains, but you expect the speed to almost double once you reach twice the size of the cache line, if the only thing that matters is cache lines.
Sadly, several other factors come into play on a modern system with such a benchmark. There is more than just the cache-line size as a variable! So we need to verify the model experimentally.
I wrote the benchmark in C, but the actual C compiler is unlikely to be relevant. The original code was in Go and I got the same results, but I switched to C to make sure I avoided Go-specific issues. Interestingly, ChatGPT converted the Go code to C code automatically for me, with just a couple of small mistakes. Mind you, it is deliberately simple code.
I run each experiment, for each stride size, 10 times, and I record the minimum, the average and the maximum. I use my laptop with its Apple M2 processor and an Intel-based server. I do not require a particular memory alignment when allocating memory. I do not make any other attempt to control the results.
The numbers on the server are quite nice, with hardly any difference between the average, the maximum and the minimum. If your stride is 129, you are 66% faster than when your stride is 64. This suggests that the cache-line size is 64 bytes. The gain is not 2x as I would have expected, but the processor might be loading cache lines speculatively. Observe how a stride that is a multiple of 64 (e.g., 128 or 256) is slightly bad: we see the performance dip visibly. It might be due to a caching issue: e.g., only half the cache-line addresses are used, which makes it more difficult for the processor to make full use of its cache (due to address aliasing).
The results on my laptop are much less clean, even though I kept the laptop unused during the benchmarking. In this instance, if your stride is 257, you are more than 2 times faster than when your stride is 128. It suggests that the cache-line size is 128 bytes. Just like the Intel system, a stride of 128 is unfortunate: there is a visible performance dip.
Note that we do not actually copy data at 300 GB/s: that would be plainly impossible on my hardware. But we can scan an array that fast if we only copy one byte per block of 512 bytes.
Thus you can empirically measure the size of the cache line with a strided copy over a large array… As soon as you use a stride that is twice the cache line, you should be more than 50% faster.