Counting cycles and instructions on ARM-based Apple systems

In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how we could have access to “performance counters” to count how many cycles and instructions a given piece of code took on ARM-based mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors and my current laptop runs on the M2 processor. Sadly, my original code only works for the M1 processor.

Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works under both Linux and on ARM-based macs. The code has benefited from contributions from Wojciech Muła and John Keiser.

For the most part, you setup a global event_collector instance, and then you surround the code you want to benchmark by collector.start() and collector.end(), pushing the results into an event_aggregate:

#include "performancecounters/event_counter.h"

event_collector collector;

void f() {
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
   collector.start();
   function(); // benchmark this function
   event_count allocate_count = collector.end();
   aggregate << allocate_count;
  }
}

And then you can query the aggregate to get the average or best performance counters:

aggregate.elapsed_ns() // average time in ns
aggregate.instructions() // average # of instructions
aggregate.cycles() // average # of cycles
aggregate.best.elapsed_ns() // time in ns (best round)
aggregate.best.instructions() // # of instructions (best round)
aggregate.best.cycles() // # of cycles (best round)

I updated my original benchmark which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
                           strtod     33.10 ns/float    428.06 instructions/float
                                      75.32 cycles/float
                                       5.68 instructions/cycle
                        fastfloat      9.53 ns/float    193.78 instructions/float
                                      27.24 cycles/float
                                       7.11 instructions/cycle

The code is freely available for research purposes.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

2 thoughts on “Counting cycles and instructions on ARM-based Apple systems”

  1. I have some doubts about linux-perf-events.h.

    Question 1: it seems that ids is never used. Why not a local variable?

    for (auto config : config_vec) {
    attribs.config = config;
    fd = static_cast<int>(syscall(__NR_perf_event_open, &attribs, pid, cpu, group, flags));
    if (fd == -1) {
    report_error("perf_event_open");
    }
    ioctl(fd, PERF_EVENT_IOC_ID, &ids[i++]);
    if (group == -1) {
    group = fd;
    }
    }

    Question 2: how to understand “our actual results are in slots 1,3,5”?

    for (uint32_t i = 1; i < temp_result_vec.size(); i += 2) {
    results[i / 2] = temp_result_vec[i];
    }

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.