Counting cycles and instructions on ARM-based Apple systems

In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how to access “performance counters” to count how many cycles and instructions a given piece of code takes on ARM-based Mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors, and my current laptop runs on the M2 processor. Sadly, my original code only works for the M1 processor.

Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works both under Linux and on ARM-based Macs. The code has benefited from contributions from Wojciech Muła and John Keiser.

For the most part, you set up a global event_collector instance, and then you surround the code you want to benchmark with collector.start() and collector.end(), pushing the results into an event_aggregate:

#include "performancecounters/event_counter.h"

event_collector collector;

// repeat (number of rounds) and function() are supplied by your benchmark
void f() {
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
    collector.start();
    function(); // benchmark this function
    event_count allocate_count = collector.end();
    aggregate << allocate_count; // accumulate this round
  }
}

And then you can query the aggregate to get the average or best performance counters:

aggregate.elapsed_ns() // average time in ns
aggregate.instructions() // average # of instructions
aggregate.cycles() // average # of cycles
aggregate.best.elapsed_ns() // time in ns (best round)
aggregate.best.instructions() // # of instructions (best round)
aggregate.best.cycles() // # of cycles (best round)
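To make the average versus best-round distinction concrete, here is a minimal sketch of how such an aggregate could be built. It is purely illustrative and is not the library's implementation: sample_count, sample_aggregate, work and measure are hypothetical names, and the sketch only measures wall-clock time with std::chrono rather than reading hardware counters.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for one measured round (wall-clock time only).
struct sample_count {
  double elapsed_ns;
};

// Hypothetical stand-in for an aggregate: keeps a running total (for the
// average) and the fastest round seen so far (the "best" round).
struct sample_aggregate {
  std::size_t rounds = 0;
  double total_ns = 0;
  double best_ns = std::numeric_limits<double>::infinity();

  sample_aggregate &operator<<(const sample_count &c) {
    rounds++;
    total_ns += c.elapsed_ns;
    best_ns = std::min(best_ns, c.elapsed_ns);
    return *this;
  }

  // Average time per round; requires at least one recorded round.
  double elapsed_ns() const {
    assert(rounds > 0);
    return total_ns / static_cast<double>(rounds);
  }
};

// Toy workload; the volatile sink keeps the loop from being optimized away.
volatile double sink;
void work() {
  double s = 0;
  for (int i = 1; i <= 1000; i++) {
    s += 1.0 / i;
  }
  sink = s;
}

// Time `repeat` rounds of work() and fold each round into the aggregate.
sample_aggregate measure(std::size_t repeat) {
  sample_aggregate aggregate{};
  for (std::size_t i = 0; i < repeat; i++) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(end - start).count();
    aggregate << sample_count{ns};
  }
  return aggregate;
}
```

Calling measure(100) and then reading elapsed_ns() and best_ns gives the two views side by side. The best round is usually the less noisy figure, since interruptions can only make a round slower.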

I updated my original benchmark, which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
                           strtod     33.10 ns/float    428.06 instructions/float
                                      75.32 cycles/float
                                       5.68 instructions/cycle
                        fastfloat      9.53 ns/float    193.78 instructions/float
                                      27.24 cycles/float
                                       7.11 instructions/cycle
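The instructions/cycle lines in this report follow directly from the other two counters. As a small sketch, the arithmetic can be checked as follows; the per_float type and the helper names are hypothetical and chosen only for illustration, while the numbers are copied from the report above.

```cpp
// Hypothetical helper type: per-float averages as printed in the report.
struct per_float {
  double ns;
  double instructions;
  double cycles;
};

// Values copied from the benchmark output above.
const per_float strtod_row{33.10, 428.06, 75.32};
const per_float fastfloat_row{9.53, 193.78, 27.24};

// Instructions retired per cycle: instructions/float divided by cycles/float.
double instructions_per_cycle(const per_float &r) {
  return r.instructions / r.cycles;
}

// How many times faster the candidate is than the baseline, in wall-clock time.
double speedup(const per_float &baseline, const per_float &candidate) {
  return baseline.ns / candidate.ns;
}
```

Here instructions_per_cycle reproduces the printed 5.68 and 7.11 up to rounding. speedup(strtod_row, fastfloat_row) comes out at about 3.5: fast_float gains both by retiring fewer than half as many instructions and by sustaining more instructions per cycle.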

The code is freely available for research purposes.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

6 thoughts on “Counting cycles and instructions on ARM-based Apple systems”

  1. I have some doubts about linux-perf-events.h.

    Question 1: it seems that ids is never used. Why not a local variable?

    for (auto config : config_vec) {
      attribs.config = config;
      fd = static_cast<int>(syscall(__NR_perf_event_open, &attribs, pid, cpu, group, flags));
      if (fd == -1) {
        report_error("perf_event_open");
      }
      ioctl(fd, PERF_EVENT_IOC_ID, &ids[i++]);
      if (group == -1) {
        group = fd;
      }
    }

    Question 2: how to understand “our actual results are in slots 1,3,5”?

    for (uint32_t i = 1; i < temp_result_vec.size(); i += 2) {
      results[i / 2] = temp_result_vec[i];
    }
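One plausible reading of the “slots 1,3,5” remark rests on an assumption the snippets above do not show: that the counters are opened as a group with read_format set to PERF_FORMAT_GROUP | PERF_FORMAT_ID. The Linux perf_event_open documentation describes a group read in that mode as a count nr followed by nr (value, id) pairs, so the values land at odd indexes and the ids at even ones. The sketch below runs on a simulated buffer; extract_values is a hypothetical name, not a function from the header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed buffer layout for a group read with (value, id) pairs:
//   [ nr, value0, id0, value1, id1, ... ]
// The counter values therefore sit at the odd indexes 1, 3, 5, ...
std::vector<uint64_t> extract_values(const std::vector<uint64_t> &raw) {
  assert(!raw.empty());
  assert(raw.size() == 1 + 2 * raw[0]); // one header slot plus nr pairs
  std::vector<uint64_t> results(raw.size() / 2);
  for (std::size_t i = 1; i < raw.size(); i += 2) {
    results[i / 2] = raw[i]; // index 1 -> 0, 3 -> 1, 5 -> 2, ...
  }
  return results;
}
```

For a simulated buffer {3, 1000, 11, 2500, 12, 40, 13}, the extracted values are 1000, 2500 and 40, while 11, 12 and 13 are the event ids. Under the same assumption, the ids gathered with PERF_EVENT_IOC_ID would only matter if the code wanted to match each value to its event by id rather than by position.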

      1. I was able to reproduce the measurement on my end using Dougall’s code. Thank you for linking it in your code.

        On x86, rdtsc starts counting at CPU power-on and keeps increasing. But that is not the case for Dougall’s code. I think the counting starts with the program, as I’ve noticed it starts with a small number every time. Do you concur? Here[1] is the snippet I am running. Example run:

        // start, stop, stop-start
        1363160, 6005521463, 6004158303
        6005554311, 12010098583, 6004544272
        12010107912, 18013040904, 6002932992
        18013048020, 24017295657, 6004247637
        24017306678, 30023735547, 6006428869
        30023751252, 36031294826, 6007543574
        36031304510, 42037625498, 6006320988
        42037635149, 48046499216, 6008864067
        48046506980, 54050169260, 6003662280
        54050174555, 60058717182, 6008542627

        It’s mostly a copy-paste of the original code. I intend to expose the rdtsc function and call it from a Python program.

        [1] https://gist.github.com/quazi-irfan/3ee4789e9752bc8b3b958300157235a5
