# Measuring the size of the cache line empirically

Our computers do not read or write memory in units of bits or even bytes. Rather, memory is accessed in small blocks called “cache lines”. For a given system, the cache line size is usually fixed and small (e.g., 16 to 256 bytes). All Intel/AMD x64 systems I have used relied on a 64-byte cache line. My current Apple laptop with its ARM-based M2 processor relies on a 128-byte cache line.

How can you measure the size of the cache line? I asked on Twitter/X and a few people (e.g., Peter Cawley, Will Bickford) pointed to a phenomenon called false sharing. I will come back to false sharing in a future blog post: it depends on parallelism and may be trickier than it seems. Here I want a simpler solution.

Many people (Robert Clausecker, Sam Westrick, Tomasz Kowalczewski, Vinoth Deivasigamani, Sergey Slotin and many others) proposed using a ‘strided access’ benchmark.

I finally decided to test a strided copy: from a large array, I copy one byte out of every N bytes to another large array. It is a problem that should be largely “memory access bound” as long as N is not too small. I start N at 16. Importantly, I never read my own writes, so I avoid concerns with 4K aliasing on Intel processors.
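The heart of such a benchmark is a tiny loop. Here is a minimal sketch in C, using the `stride_copy` name that comes up in the comments below (my own reconstruction, not necessarily the original code):

```c
#include <stddef.h>
#include <stdint.h>

// Copy one byte out of every `stride` bytes from `src` to `dst`.
// Both buffers must be at least `size` bytes long. We never read
// back our own writes, so 4K aliasing is not a concern.
void stride_copy(uint8_t *dst, const uint8_t *src, size_t size, size_t stride) {
    for (size_t i = 0; i < size; i += stride) {
        dst[i] = src[i];
    }
}
```

Each iteration touches exactly one byte per stride, but the hardware fetches a whole cache line for every touched byte.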

If N is larger than twice the cache line, then I can effectively skip one cache line out of two. If N is smaller than the cache line, then every cache line must be accessed. Having a stride value just above the cache line should not be sufficient to see large gains: but you expect the speed to almost double once you reach twice the size of the cache line, if the only thing that matters is the cache line.

Sadly, several other factors come into play on a modern system with such a benchmark. There is more than just the cache-line size as a variable! So we need to verify the model experimentally.

I wrote the benchmark in C, but the actual C compiler is unlikely to be relevant. The original code was in Go and I got the same results, but I switched to C to make sure I avoided Go-specific issues. Interestingly, ChatGPT converted the Go code to C code automatically for me, with just a couple of small mistakes. Mind you, it is deliberately simple code.

We are measuring the time it takes to copy one byte out of every N bytes, using a total buffer size of 32 MB. We report the numbers in GB/s, effectively dividing the total time by the buffer size. I stress that it is a strided copy, not an actual copy, so as the stride (N) grows large, the speed should increase. However, as long as the stride (N) is smaller than the cache line, we expect a flat speed.

I run each experiment, for each stride size, 10 times and I record the maximum, the minimum and the average. I use an Apple M2 processor running on my laptop and an Intel-based server. I do not require a particular memory alignment when allocating memory. I do not make any other attempt to control the results.

The numbers on the server are quite nice, with hardly any difference between the average, the maximum and the minimum. If your stride is 129, you are 66% faster than when your stride is 64. This suggests that the cache-line size is 64 bytes. The gain is not 2x as I would have expected, but the processor might be loading cache lines speculatively. Observe how a stride that is a multiple of 64 (e.g., 128 or 256) is slightly bad: we see the performance dip visibly. It might be due to a caching issue, e.g., only half the cache-line addresses are used, which makes it more difficult for the processor to make full use of its cache (due to address aliasing).

The results on my laptop are much less clean even though I kept the laptop unused during the benchmarking. In this instance, if your stride is 257, you are more than 2 times faster than when your stride is 128. It suggests that the cache-line size is 128 bytes. Just like the Intel system, a stride of 128 is unfortunate: there is a visible performance dip.

Note that we do not actually copy data at 300 GB/s, that would be impossible on its face on my hardware: but we can copy an array that fast if we just copy one byte per block of 512 bytes.
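The actual memory traffic is easy to estimate: with a stride of S bytes, only one cache line out of S divided by the line size is fetched, so the true bandwidth is the reported figure scaled down accordingly. A small helper makes the arithmetic explicit (a sketch of the reasoning, not part of the original benchmark):

```c
// Effective memory traffic behind a reported strided-copy speed:
// with `stride` bytes per copied byte, only (line_size / stride)
// of the buffer is actually fetched from memory.
double real_traffic_gbps(double reported_gbps, double line_size, double stride) {
    return reported_gbps * line_size / stride;
}
```

For instance, `real_traffic_gbps(300, 128, 512)` yields 75 GB/s of actual cache-line reads on a 128-byte-line machine, a far more believable figure than 300 GB/s.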

Thus you can empirically measure the size of the cache line with a strided copy of a large array… As soon as you use a stride that is twice the cache line, you should be more than 50% faster.

Daniel Lemire, "Measuring the size of the cache line empirically," in Daniel Lemire's blog, December 12, 2023.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 5 thoughts on “Measuring the size of the cache line empirically”

1. Marcin Pohl says:

What are the cache prefetcher settings set to?

2. Fernando Antunez says:

There is a step by step reasoning process to measure the size of the cache line in the amazing book “Understanding Software Dynamics” by Richard Sites

3. Matteo Mei says:

I don’t understand your C code. For every possible stride you do 10 runs to compute min max and average times. Each run is a repetition of 5 * slice invocations to the stride_copy function and yields the average time of an invocation. So the plot is misleading because the copy function is doing less and less operations in cache as stride increases. To see an effect we would have to divide the time of a run with the number of accesses of that run (size / stride)

1. We are measuring the speed of a strided copy, as explained in the blog post. If the stride is smaller than a cache line, then the speed of strided copy is akin to that of a standard copy, and relatively insensitive to the stride. As soon as the stride exceeds the cache line size, we should start to have faster and faster strided copy.

> So the plot is misleading because the copy function is doing less and less operations in cache as stride increases.

It is not misleading, it is done by design.

> To see an effect we would have to divide the time of a run with the number of accesses of that run (size / stride)

It is not clear to me what you mean, but if you write up your approach and post a link, readers will be able to compare your technique to detect the cache line with the one presented in this blog post. Please ensure that you provide your source code and your experimental results.
