In my blog post "Counting cycles and instructions on the Apple M1 processor", I showed how we could access “performance counters” to count how many cycles and instructions a given piece of code took on ARM-based Mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors, and my current laptop runs on the M2 processor. Sadly, my original code only works on the M1 processor.
Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works both under Linux and on ARM-based Macs. The code has benefited from contributions from Wojciech Muła and John Keiser.
For the most part, you set up a global event_collector instance, and then you surround the code you want to benchmark with collector.start() and collector.end(), pushing the results into an event_aggregate:
#include "performancecounters/event_counter.h"

event_collector collector;

void f() {
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
    collector.start();
    function(); // benchmark this function
    event_count allocate_count = collector.end();
    aggregate << allocate_count;
  }
}
And then you can query the aggregate to get the average or best performance counters:
aggregate.elapsed_ns()        // average time in ns
aggregate.instructions()      // average # of instructions
aggregate.cycles()            // average # of cycles
aggregate.best.elapsed_ns()   // time in ns (best round)
aggregate.best.instructions() // # of instructions (best round)
aggregate.best.cycles()       // # of cycles (best round)
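To make the "average versus best round" distinction concrete, here is a minimal sketch of how such an aggregate could be built. The names toy_count and toy_aggregate are hypothetical and are not the types from the library. The sketch also assumes that the "best" round is the one with the smallest elapsed time, which may differ from what event_aggregate actually does.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-in for one measurement round (not the library's event_count).
struct toy_count {
  double elapsed_ns;
  double cycles;
  double instructions;
};

// Hypothetical stand-in for an aggregate: keeps running totals for the
// averages and remembers the single fastest round.
struct toy_aggregate {
  std::size_t rounds = 0;
  toy_count total{0.0, 0.0, 0.0};
  // Assumption: "best" means the round with the smallest elapsed time.
  toy_count best{std::numeric_limits<double>::infinity(), 0.0, 0.0};

  toy_aggregate &operator<<(const toy_count &c) {
    rounds++;
    total.elapsed_ns += c.elapsed_ns;
    total.cycles += c.cycles;
    total.instructions += c.instructions;
    if (c.elapsed_ns < best.elapsed_ns) {
      best = c;
    }
    return *this;
  }

  // Averages over all rounds; return 0 when nothing was recorded.
  double elapsed_ns() const { return rounds ? total.elapsed_ns / rounds : 0.0; }
  double cycles() const { return rounds ? total.cycles / rounds : 0.0; }
  double instructions() const { return rounds ? total.instructions / rounds : 0.0; }
};
```

The average is sensitive to noisy rounds (interrupts, frequency changes), whereas the best round tends to be closer to the cost of the code itself, which is why it is useful to report both.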
I updated my original benchmark which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:
# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
strtod      33.10 ns/float    428.06 instructions/float    75.32 cycles/float    5.68 instructions/cycle
fastfloat    9.53 ns/float    193.78 instructions/float    27.24 cycles/float    7.11 instructions/cycle
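The last column is derived from the two before it: instructions per cycle is the instruction count divided by the cycle count. The following sketch recomputes it from the figures above, along with the speedup of fast_float over strtod. The helper names are made up for illustration.

```cpp
// Hypothetical helpers: derive ratios from the reported per-float figures.

// Instructions retired per cycle, e.g. 428.06 / 75.32 for strtod.
double instructions_per_cycle(double instructions, double cycles) {
  return instructions / cycles;
}

// How many times faster the candidate is than the baseline, given
// the time per float of each, e.g. 33.10 / 9.53.
double speedup(double baseline_ns, double candidate_ns) {
  return baseline_ns / candidate_ns;
}
```

With the numbers above, strtod comes out at about 5.68 instructions per cycle and fast_float at about 7.11, and fast_float is roughly 3.5 times faster per float. So fast_float wins both by executing fewer instructions and by retiring more of them per cycle.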
I have some doubts about linux-perf-events.h.
Question 1: it seems that ids is never used. Why not a local variable?
for (auto config : config_vec) {
  attribs.config = config;
  fd = static_cast<int>(syscall(__NR_perf_event_open, &attribs, pid, cpu, group, flags));
  if (fd == -1) {
    report_error("perf_event_open");
  }
  ioctl(fd, PERF_EVENT_IOC_ID, &ids[i++]);
  if (group == -1) {
    group = fd;
  }
}
Question 2: how to understand “our actual results are in slots 1,3,5”?
for (uint32_t i = 1; i < temp_result_vec.size(); i += 2) {
  results[i / 2] = temp_result_vec[i];
}
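One plausible reading of the "slots 1,3,5" remark is the following. It assumes that the events are opened as a group with read_format set to PERF_FORMAT_GROUP | PERF_FORMAT_ID, which is an assumption about the header and is not visible in the excerpt above. With that read format, a read() on the group leader fills a buffer of 64-bit words laid out as: the number of events, then a (value, id) pair per event. The values therefore sit at the odd indices. The function name extract_values is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed buffer layout (PERF_FORMAT_GROUP | PERF_FORMAT_ID):
//   slot 0: nr, the number of events in the group
//   slot 1: value of event 0    slot 2: id of event 0
//   slot 3: value of event 1    slot 4: id of event 1
//   ...
// Hypothetical helper: keep only the counter values (odd slots).
std::vector<uint64_t> extract_values(const std::vector<uint64_t> &raw) {
  std::vector<uint64_t> values;
  for (std::size_t i = 1; i < raw.size(); i += 2) {
    values.push_back(raw[i]); // slot i holds a value, slot i + 1 its id
  }
  return values;
}
```

For a group of two events, a raw buffer such as {2, 1000, 11, 2500, 12} would yield the values 1000 and 2500. The index i / 2 in the loop quoted above maps slots 1, 3, 5 to results 0, 1, 2 in the same way.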
Pull requests invited!