Counting cycles and instructions on the Apple M1 processor

When benchmarking software, we often start by measuring the time elapsed. If you are benchmarking data bandwidth or latency, it is the right measure. However, if you are benchmarking computational tasks where you avoid disk and network accesses and where you only access a few pages of memory, then the time elapsed is often not ideal because it can vary too much from run to run and it provides too little information.

Most processors will adjust their frequency in response to power and thermal constraints. Thus you should generally avoid benchmarking on a laptop. Yet even if you can get stable measures, it is hard to reason about your code from a time measurement. Processors operate in cycles, retiring instructions. They have branches, and sometimes they mispredict these branches. These are the measures you want!

You can, of course, translate the time into CPU cycles if you know the CPU frequency. But it might be harder than it sounds because even without physical constraints, processors can vary their frequency during a test. You can measure the CPU frequency using predictable loops. It is a little bit awkward.
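
As a rough sketch of the predictable-loop idea: time a chain of dependent additions and divide the number of additions by the elapsed nanoseconds. The function names are hypothetical, the empty inline assembly statement is a GCC/Clang extension, and the key assumption is that each dependent addition costs about one cycle, which is common on recent cores but not guaranteed.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// One cycle per nanosecond is exactly 1 GHz, so cycles / ns gives GHz.
double estimate_ghz(double cycles, double elapsed_ns) {
  assert(elapsed_ns > 0);
  return cycles / elapsed_ns;
}

// Time `iterations` dependent additions.
// ASSUMPTION: each addition in the chain takes about one cycle, so the
// loop runs for roughly `iterations` cycles when compiled with optimization.
double measure_ghz(uint64_t iterations) {
  uint64_t x = 0;
  auto start = std::chrono::steady_clock::now();
  for (uint64_t i = 0; i < iterations; i++) {
    x += 1;
    // Empty asm statement (GCC/Clang extension): tells the compiler that x
    // is read and modified here, so it cannot collapse the chain of additions.
    __asm__ volatile("" : "+r"(x));
  }
  auto stop = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  return estimate_ghz(static_cast<double>(iterations), ns);
}
```

Calling `measure_ghz` with a large iteration count and keeping the highest value over several runs gives a rough frequency estimate. Without optimization flags, each iteration costs more than one cycle and the estimate comes out too low.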

Most people then go to a graphical tool like Intel VTune or Apple Instruments. These are powerful tools that can provide fancy graphical displays, run samples, record precise instruction counts and so forth. They also tend to work across a wide range of programming languages.

These graphical tools use the fact that processor vendors include “performance counters” in their silicon. You can tell precisely how many instructions were executed between two points in time.

Sadly, these tools can be difficult to tailor to your needs and to automate. Thankfully, the Linux kernel exposes performance counters on most processors. If you write code for Linux, you can rather easily query the performance counters yourself: you put markers in your code and find out how many instructions or cycles were spent between these markers. We often refer to such code as being “instrumented”. It requires you to modify your code and it will not work in all programming languages, but it is precise and flexible. It even works under Docker if you are into containers. You may need privileged access to use the counters. Surely you can also access the performance counters from your own program under Windows, but I have never found any documentation or example.
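
On Linux, such markers can be built on the `perf_event_open` system call. What follows is a minimal sketch rather than a complete solution: the helper name `count_instructions` is hypothetical, it only counts user-space instructions on the calling thread, and it returns false when the kernel refuses access to the counter (for example, because of missing privileges or a restrictive container).

```cpp
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Runs f() and stores the number of user-space instructions retired in
// `count`. Returns false (leaving `count` untouched) if the counter is
// unavailable.
template <typename F>
bool count_instructions(F f, uint64_t &count) {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;
  attr.disabled = 1;       // start stopped, enable just before f()
  attr.exclude_kernel = 1; // count user-space work only
  attr.exclude_hv = 1;
  // pid = 0 and cpu = -1: measure the calling thread on whichever CPU it
  // runs. glibc provides no wrapper, hence the raw syscall.
  int fd = static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
  if (fd == -1) {
    return false;
  }
  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  f(); // the code between the two "markers"
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  uint64_t value = 0;
  bool ok = read(fd, &value, sizeof(value)) == static_cast<ssize_t>(sizeof(value));
  close(fd);
  if (ok) {
    count = value;
  }
  return ok;
}
```

Counting cycles or branch misses works the same way with `PERF_COUNT_HW_CPU_CYCLES` or `PERF_COUNT_HW_BRANCH_MISSES` as the `config` value.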

My main laptop these days is a new Apple MacBook with an M1 processor. This ARM processor is remarkable. In many ways, it is more advanced than comparable Intel processors. Sadly, until recently, I did not know how to instrument my code for the Apple M1.

Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, the approach is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.

To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.

I recommend recording both the minimum and the average of each counter. When the average is close to the minimum, you usually have reliable results. The maximum is less relevant in computational benchmarks. Observe that measures taken during a benchmark are not normally distributed: they are better described as following a log-normal distribution.
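
To make the comparison concrete, here is a small sketch that reduces a list of per-run measurements to a minimum, an average, and the gap between the two as a percentage of the minimum. The names `summary` and `summarize` are hypothetical, and the sketch assumes strictly positive measurements (such as cycle counts).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct summary {
  double min;
  double avg;
  double gap_percent; // how far the average sits above the minimum, in %
};

// ASSUMPTION: samples is non-empty and all values are positive.
summary summarize(const std::vector<double> &samples) {
  assert(!samples.empty());
  double lo = *std::min_element(samples.begin(), samples.end());
  double avg =
      std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
  return {lo, avg, (avg - lo) / lo * 100.0};
}
```

For example, four runs of 100, 100, 100 and 104 cycles give a minimum of 100, an average of 101 and a gap of about 1 %. A small gap suggests that the runs are consistent.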

The core of the benchmark looks like the following C++ code:

  performance_counters agg_min{1e300};
  performance_counters agg_avg{0.0};
  for (size_t i = 0; i < repeat; i++) {
    performance_counters start = get_counters();
    my_function();
    performance_counters end = get_counters();
    performance_counters diff = end - start;
    agg_min = agg_min.min(diff);
    agg_avg += diff;
  }
  agg_avg /= repeat;
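
For the loop above to compile, `performance_counters` needs a constructor taking a single value, a subtraction, a `min` method, and the `+=` and `/=` operators. The following is one possible sketch of such a type; the field names and layout are assumptions and may differ from the actual project. The `get_counters()` function is not sketched since it depends on the undocumented Apple interface.

```cpp
#include <algorithm>

struct performance_counters {
  double cycles;
  double branches;
  double missed_branches;
  double instructions;

  // Fill every field with the same value: 0.0 to start a sum,
  // or a huge value such as 1e300 to start a running minimum.
  explicit performance_counters(double init)
      : cycles(init), branches(init), missed_branches(init),
        instructions(init) {}

  performance_counters(double c, double b, double m, double i)
      : cycles(c), branches(b), missed_branches(m), instructions(i) {}

  performance_counters &operator+=(const performance_counters &o) {
    cycles += o.cycles;
    branches += o.branches;
    missed_branches += o.missed_branches;
    instructions += o.instructions;
    return *this;
  }

  performance_counters &operator/=(double d) {
    cycles /= d;
    branches /= d;
    missed_branches /= d;
    instructions /= d;
    return *this;
  }

  // Field-by-field minimum; the fields of the result may come from
  // different runs.
  performance_counters min(const performance_counters &o) const {
    return performance_counters(
        std::min(cycles, o.cycles), std::min(branches, o.branches),
        std::min(missed_branches, o.missed_branches),
        std::min(instructions, o.instructions));
  }
};

// Difference between two snapshots of the counters.
performance_counters operator-(const performance_counters &a,
                               const performance_counters &b) {
  return performance_counters(a.cycles - b.cycles, a.branches - b.branches,
                              a.missed_branches - b.missed_branches,
                              a.instructions - b.instructions);
}
```

Storing the counters as floating-point values keeps the averaging step simple, at the cost of exactness for very large counts.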

Afterward, it is simply a matter of printing the results. I decided to benchmark floating-point number parsers in C++. I get the following output:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB 
    strtod    375.92 instructions/float (+/- 0.0 %)
                75.62 cycles/float (+/- 0.1 %)
                4.97 instructions/cycle
                88.93 branches/float (+/- 0.0 %)
                0.6083 mis. branches/float

fastfloat    162.01 instructions/float (+/- 0.0 %)
                22.01 cycles/float (+/- 0.0 %)
                7.36 instructions/cycle
                38.00 branches/float (+/- 0.0 %)
                0.0001 mis. branches/float

As you can see, I get the average number of instructions, branches and mispredicted branches for every floating-point number. I also get the number of instructions retired per cycle. It appears that on this benchmark, the Apple M1 processor gets close to 8 instructions retired per cycle when parsing numbers with the fast_float library. That is a score far higher than anything possible on an Intel processor.
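
The instructions-per-cycle figures follow directly from the two counters: 162.01 instructions over 22.01 cycles is about 7.36, and 375.92 over 75.62 is about 4.97. A short sketch of the arithmetic, with hypothetical function names; the peak rate is passed as a parameter because the value of 8 is an assumption and not a documented figure.

```cpp
#include <cassert>

// Instructions retired per cycle, from two counter readings.
double instructions_per_cycle(double instructions, double cycles) {
  assert(cycles > 0);
  return instructions / cycles;
}

// Fraction of an assumed peak rate (e.g. 8 instructions per cycle).
double fraction_of_peak(double ipc, double peak_ipc) {
  assert(peak_ipc > 0);
  return ipc / peak_ipc;
}
```

With the fast_float numbers, `fraction_of_peak(instructions_per_cycle(162.01, 22.01), 8.0)` comes out at roughly 0.92, that is, about 92 % of the assumed peak.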

You should note how precise the results are: the minimum and the average number of cycles are almost identical. It is quite uncommon in my experience to get such consistent numbers on a laptop. But these Apple M1 systems seem to show remarkably little variation. It suggests that there is little in the way of thermal constraints. I usually avoid benchmarking on laptops, but I make an exception with these laptops.

You should be mindful that the ‘number of instructions’ is an ill-defined measure. Processors can fuse or split instructions, and they may or may not count speculatively executed instructions. In my particular benchmark, there are few mispredicted branches, so the difference between speculative instructions and actually retired instructions is not important.

To my knowledge, none of this performance-counter access is documented by Apple. Thus my code should be viewed with suspicion. It is possible that these numbers are not what I take them to be. However, the numbers are generally credible.

My source code is available.

Note: Though my code only works properly on the Apple M1 processor, I believe it could be fixed to support Intel processors.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

13 thoughts on “Counting cycles and instructions on the Apple M1 processor”

  1. I just tried it on my side on a macbook air m1, and am getting way lower results for instructions/float (not sure what it means). I am running latest version of osx.

    parsing random numbers

    model: generate random numbers uniformly in the interval [0.000000,1.000000]
    volume: 10000 floats
    volume = 0.0762939 MB
    strtod 376.04 instructions/float (+/- 0.0 %)
    75.53 cycles/float (+/- 0.0 %)
    4.98 instructions/cycle
    88.95 branches/float (+/- 0.0 %)
    0.6005 mis. branches/float
    fastfloat 162.01 instructions/float (+/- 0.0 %)
    22.01 cycles/float (+/- 0.0 %)
    7.36 instructions/cycle
    38.00 branches/float (+/- 0.0 %)
    0.0001 mis. branches/float

    Thanks a lot for the post. Very interesting.

  2. Do you know if the perf Linux tool works on the M1s (or any Mac)? It’s very easy to inspect performance monitors with perf.

  3. I found strange that you characterize 7.36 instructions by cycle as “close to 8”. Maybe you forgot to change this sentence when you updated your numbers?

    (There is also a typo in “then the time elapsed in often not ideal”: I believe the in should be an is. Also earlier “it is right measure” seems to be missing a “the”.)

    1. For context, 8 is the absolute maximum possible number for any combination of instructions. Sure, 7.36 is closer to seven, but 92% is really amazingly and surprisingly close to 100% of possible IPC for any real-world code.

      1. Also worth noting that what’s characterized as the “number of instructions” is, as far as I can tell, the number of DECODED instructions.
        This is not exactly the same thing as the number of RETIRED instructions because of misspeculation. (I haven’t done enough testing to be certain, but I am pretty sure that counter setting 8c increments at Decode, while the counter that’s locked as counter[1] is the number of Retired instructions.)

        Even putting speculation aside, the M1 does a fascinating job of splitting instructions for some purposes (primarily resource allocation where two registers are required like ldp, or a load or store with a pre/post increment) and then joining them again.
        So for example LDP will count as
        1 for Decode
        2 for Map/Rename (allocate two registers)
        1 for Execute
        2 for Retire (have to deallocate the two registers)

        Surprisingly many instructions can be performed at Map time (zero cycle moves, zero cycle immediates). A number of instructions that look like they would split (like ADDS) don’t because of a clever way of handling flags. A number of instructions that have to perform two tasks (like ADD(extend) ) split into two ops, but only require one register allocation because the temporary that’s generated is snarfed off the bypass bus, and never written out.
        etc etc

        The community is still figuring out all the details, but like so much else in computing, the simple models people have of “number of instructions executed” are not appropriate when you look closely; you have to be much more careful in exactly what you are asking, for what purpose.

        1. Thanks. I am aware that the number of instructions is not a precise phrase, especially if you have speculative execution and fused/split instructions.

          In my particular case, there is not much branch misprediction so it is not a good benchmark to test that effect.

          Accessing counter[1] seems to give me the same numbers (or very close).

  4. Great post – glad some of that code has been useful!

    If it’s of interest, these performance events (and the whitelist for this API), are described by Apple at https://github.com/apple/darwin-xnu/blob/main/osfmk/arm64/kpc.c

    Counters.app is the official way to access performance counters. I believe it can use a few more (non-whitelisted) events, which are described in /usr/share/kpep/a14.plist

    (And, for my own measurements, I use a kernel module to bypass the whitelist, which is even more likely to blow up the computer, and definitely not recommended: https://github.com/dougallj/applecpu/tree/main/timer-hacks )

    1. I’m surprised by the event numbers, they don’t match what the Arm Architecture Reference Manual lists (section D7.10).

      Are they doing some internal remapping (perhaps to match Intel numbers)?
