Memory access on the Apple M1 processor

When a program is mostly just accessing memory randomly, a standard cost model is to count the number of distinct random accesses. The general idea is that memory access is much slower than most other computational tasks.

Furthermore, the cost model can be extended to count “nearby” memory accesses as free. That is, if I read a byte at memory address x and then I read a byte at memory address x+1, I can assume that the second byte comes “for free”.

This naive memory-access model is often sensible. However, you should always keep in mind that it is merely a model. A model can fail to predict real performance.

How might it fail? A CPU core can issue multiple memory requests at once. So if I need to access 7 memory locations at once, I can issue 7 memory requests and wait for them. It it is likely that waiting for 7 memory requests is slower than waiting for a single memory request, but is it likely to be 7 times slower?

The latest Apple laptop processor, the M1, has apparently a lot of memory-level parallelism. It looks like a single core has about 28 levels of memory parallelism, and possibly more.

Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.

To test it out, I designed the following benchmark where I compare three functions. The first one just grabs pairs of randomly selected bytes and it computes a bitwise XOR between them before adding them to a counter:

  for(size_t i = 0; i < 2*M; i+= 2) {
    answer += array[random[i]] ^ array[random[i + 1]];
  }

We compare against a 3-wise version of this function:

  for(size_t i = 0; i < 3*M; i+= 3) {
    answer += array[random[i]] ^ array[random[i + 1]] 
              ^ array[random[i + 2]];
  }

Our naive memory-access cost model predicts that the second function should be 50% more expensive. However many other models (such as a simple instruction count) would also predict a 50% overhead.

To give our naive memory-access model a run for its money, let us throw in a 2-wise version that also accesses nearby values (with one-byte offset):

  for(size_t i = 0; i < 2*M; i+= 2) {
    int idx1 = random[i];
    int idx2 = random[i + 1];
    
    answer += array[idx1] ^ array[idx1 + 1] 
           ^ array[idx2]  ^ array[idx2 + 1];
  }

Our naive memory-access cost model would predict that first and last function should have about the same running time while the second function should be 50% more expensive.

Let us measure it out. I use a 1GB array and I report the average time spent in nanosecond on each iteration.

2-wise 8.9 ns
3-wise 13.0 ns
2-wise + 12.5 ns

At first glance, our naive memory-access model is validated: the 3-wise function is 46% more expensive than the 2-wise function. Yet we should not be surprised because most reasonable models would make such a prediction since in almost every way, the function does 50% more work.

It is more interesting to compare the two 2-wise function… the last one is 40% more expensive than the first 2-wise function. It contradicts our prediction. And so, at least in this instance, our simple memory-access cost model fails us on the Apple M1 processor.

Notes:

  1. My source code is available. The run-to-run variability is relatively high on such a test, but the conclusion is robust, on my Apple M1 system.
  2. I posted the assembly online.
  3. Importantly, I do not predict that other systems will follow the same pattern. Please do not run this benchmark on your non-M1 PC and expect comparable results.
  4. This benchmark is meant to be run on an Apple MacBook with the M1 processor, compiled with Apple’s clang compiler. It is not meant to be used on other systems.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

27 thoughts on “Memory access on the Apple M1 processor”

  1. Hi,
    This is interesting, I ran this on my Mac, with processor:2,2 GHz Quad-Core Intel Core i7
    There are the results:

    $ ./two_or_three
    N = 1000000000, 953.7 MB
    starting experiments.
    two : 44.7 ns
    two+ : 45.0 ns
    three: 67.6 ns
    bogus 137531640

    Way too slow for my PC 🙂
    Thanks for sharing.
    Regards,
    Jongi

  2. In the first version, the compiler may have scheduled the first memory access to run in parallel with the second random calculation, and failed to do it in the second. Looking at the asm could shine some light on what’s going on.

  3. Since no-one else has speculated on why the observed results are what they are, I’ll take a stab. My guess is that the different functions are being limited by different factors. For “compute_two” and “compute_three”, we’re being limited by MLP. But for “compute_two_plus”, we’re not able to achieve maximum MLP because the extra instruction cause us to be limited by the size of the Reorder Buffer (ROB).

    Looking at the assembly (and noting that I don’t actually know how to read ARM assembly) it looks like (only) the “compute_two” function has been unrolled 2x, which slightly reduces the loop overhead. If we presume the sequential reads from random[] are approximately free, we are issuing 4 uncached reads in 15 instructions. If we assume the ROB holds about 200 instructions (Skylake holds 224), this means we can fit about 12 copies of this loop into the ROB, and thus the processor can look ahead to see 48 memory accesses. If we assume the M1 can do 30 parallel accesses, and 48 > 30, this means we max out on MLP and are memory bandwidth limited.

    The function “compute_three” is not unrolled, and the inner loop does 3 memory reads in 12 instructions. Applying the same logic, we can fit 15 copies of the loop into the ROB, and the processor has 45 memory accesses to choose from. 45 > 30, MLP limited. Since we are doing 50% more memory reads per iteration, we expect 50% greater time, which is just about exactly what we see.

    The function “computer_two_plus” is not unrolled, and does 2 full memory reads in each 15 instructions. We’ll assume for now that the adjacent reads to the same cache line are free. If 12 copies of the loop fit in the ROB, this means the processor sees only 24 full memory reads at a time. 24 < 30, and thus we are not able to take full advantage of the possible MLP! Naively, we might expect to see about 25% slowdown from this. In actuality it’s a bit more, but I’m willing to hand wave this difference away because our assumption that the adjacent accesses are free is probably not quite true.

    I’m guessing at the exact numbers here, but my guess would that that something like this explains the numbers you are seeing. But an argument against this theory are claims that the M1 actually has a much deeper ROB: https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive/2. They suggest that it’s more than 600! I haven’t really looked at what they are measuring, though, and I feel like your example suggests otherwise. It might be interesting to try some other methods of increasing the per-iteration instruction count, and seeing whether they cause the same slowdown as the adjacent accesses.

    1. @Nathan : Thanks for the analysis. I will happily test any code you’d like me to test…

      My only claim here is that the naive memory-access model fails. I have not investigated further to see what the limiting factors are. I often run tests under Docker with perf with macOS… but, to my knowledge, Docker still hasn’t released a version of Docker for the M1. I could run the benchmark under Xcode and get performance counters there but it is not as convenient to me.

  4. The problem is that the CPU can speculatively execute the memory access. That is, the branch predictor will help to unroll your loop and execute all the memory access at once so that the memory latency is
    as low as 10ns.

    The right way to do this benchmark is generating indexes by idx = random[i] ^ answer for preventing speculation.

    See https://gist.github.com/louchenyao/75c3a6a3eeb0d7d9b1e8af7e18aacb03

    The fixed benchmark result on my Macbook Air with the M1 processor is the following, which is pretty reasonable.

    two : 130.3 ns
    two+ : 131.0 ns
    three: 133.2 ns

    1. @Chenyao : You are suggesting that we turn the benchmark into a pointer chasing benchmark, but if you follow the link for the memory-level parallelism, you will notice that this is how we do it to measure memory parallelism… we just create several independent lanes that work as you describe…

      https://github.com/lemire/testingmlp

      This being said, the benchmark described at the end of this blog post was deliberately designed, so it is not a mistake. That is, I specifically did want the processor to be able to start issuing new memory request ahead of time.

      The pointer-chasing benchmark you suggest is also interesting and I might use such an approach in a follow-up post. Thanks!

      1. @Daniel @Nathan. Sorry, I misunderstood the point. Thanks for point out that.

        Then I would guess there is a limitation on #speculated memory accesses (cache line prefetch), so there is a difference between 2-wise and 3-wise. But due to some unknown reasons, the 2-wise+ performs similar to 3-wise. Kudo to Daniel for this finding!

    2. You are right about the speculation, but I think you are missing that Daniel has intentionally designed his benchmark to allow this speculation. This is why he included the graph about Memory Level Parallelism and has lines like “Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.”

      Daniel’s goal in this benchmark is to maximize the available MLP so as to highlight the difference between the M1 and Intel chips. One benchmark isn’t more “right” than they other, it’s just that they are measuring different things. Your numbers are interesting, though, since they show how just much benefit there is from the speculation. And note that there is still speculation in your example, which is why making two parallel random read requests per iteration takes almost exactly the same amount of as making three.

      1. To be clear, this benchmark is not meant to be used to compare different processors on different systems. It certainly was never meant to be run under Windows. Among other limitations is the fact that the rand() function can return values that are limited to a small range on some systems.

        It is fine if folks want to run it on different system, but they have to make it sufficiently robust.

        Even on macOS, the tool has limited robustness. Just enough to make the point…

  5. Hi, this is really interesting. I want to understand code and memory optimizations , any suggestions for a beginner?

  6. M1 is fast indeed.
    For comparison, this is what I get with a i7-8700K CPU @ 3.7GHz, xubuntu-18.04.5:

    $ ./two_or_three
    N = 1000000000, 953.7 MB
    starting experiments.
    two : 16.5 ns
    two+ : 18.0 ns
    three: 24.6 ns
    bogus 1422321000

    With clang-11:

    $ ./two_or_three
    N = 1000000000, 953.7 MB
    starting experiments.
    two : 17.0 ns
    two+ : 18.6 ns
    three: 25.3 ns
    bogus 1422321000

    Hmmm, the value “bogus” is different than on M1.
    rand() does not give the same results on different platforms.
    I wonder how much this matters when comparing.
    It would be better to use a deterministic random number generator.

  7. Hi,
    I’ve run your tests on my Windows laptop (i7-8665U) and results are much better:

    c:\work\test>>two_or_three.exe
    N = 1000000000, 953.7 MB
    starting experiments.
    two : 1.0 ns
    two+ : 1.4 ns
    three: 1.7 ns
    bogus 1458643000

    The variance of first test is pretty big (up to 1.4 ns) while the other two are stable.

    1. Interesting, although I think it’s extremely unlikely that your results are correct. I’d suspect a bug. Did you change anything in the code to allow it to run on Windows? Maybe reducing the sizes to everything fits in L1? If not, I’m guessing there is something wrong with the Daniel’s time measurement code on Windows. It probably hasn’t been tested nearly as much as on Mac or Linux. Alternatively, maybe the compiler on Windows has figured out a way to defeat the benchmark by optimizing out the actual memory accesses? If you have time, I think it would be useful to figure out what’s happening here.

  8. Apple M1 chip adopts “warehouse/workshop model”

    Warehouse: unified memory
    Workshop: CPU, GPU, and other cores
    Product( material): information, data

    there’s also a new unified memory architecture that lets the CPU, GPU, and other cores exchange information between one another, and with unified memory, the CPU and GPU can access memory simultaneously rather than copying data between one area and another. Accessing the same pool of memory without the need for copying speeds up information exchange for faster overall performance.

    reference: Developer Delves Into Reasons Why Apple’s M1 Chip is So Fast

  9. The 2 vs 2+ results is indeed interesting. However, I’m a bit not sure about the point of the 3 test.

    Should the 3 version be any different from the 2 version with a M that’s 50% larger? I feel like this should be the case (apart from the loop overhead) no matter what system this runs on. Since there’s no barriers or anything like that between loop iterations, shouldn’t the total number of memory references be the only thing that matters.

    1. However, I’m a bit not sure about the point of the 3 test. (…) shouldn’t the total number of memory references be the only thing that matters

      Naively, you would think so, wouldn’t you? But that’s not what happens in the M1.

      That’s the point of the three tests, to show that what you think might happen does not.

  10. Hello! Thanks for sharing.

    I have done my test on Ryzen 4500u on Lenovo Flex 5
    and Ubuntu OS.

    Here my results.
    N = 1000000000, 953.7 MB
    starting experiments.

    two : 40.4 ns
    two+ : 49.7 ns
    three: 59.9 ns
    bogus 1422321000

Leave a Reply

Your email address will not be published. Required fields are marked *