In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how to access “performance counters” to count how many cycles and instructions a given piece of code takes on ARM-based Mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors, and my current laptop runs on the M2 processor. Sadly, my original code only works on the M1 processor.
Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works on both Linux and ARM-based Macs. The code has benefited from contributions from Wojciech Muła and John Keiser.
For the most part, you set up a global event_collector instance, and then you surround the code you want to benchmark with collector.start() and collector.end(), pushing the results into an event_aggregate:
#include "performancecounters/event_counter.h"

event_collector collector;

void f() {
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
    collector.start();
    function(); // benchmark this function
    event_count allocate_count = collector.end();
    aggregate << allocate_count;
  }
}
And then you can query the aggregate to get the average or best performance counters:
aggregate.elapsed_ns()         // average time in ns
aggregate.instructions()       // average # of instructions
aggregate.cycles()             // average # of cycles
aggregate.best.elapsed_ns()    // time in ns (best round)
aggregate.best.instructions()  // # of instructions (best round)
aggregate.best.cycles()        // # of cycles (best round)
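To make the difference between the average and the best round concrete, here is a minimal sketch of how such an aggregate could work. The types sample and sample_aggregate are hypothetical stand-ins that I made up for illustration; they are not the library's event_count and event_aggregate, and the real implementation may differ (for instance, in how it picks the best round).

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-in for one measurement; not the library's event_count.
struct sample {
  double elapsed_ns;
  double instructions;
  double cycles;
};

// Hypothetical accumulator: keeps running totals and the fastest round.
struct sample_aggregate {
  std::size_t count = 0;
  sample total{0.0, 0.0, 0.0};
  sample best{std::numeric_limits<double>::infinity(), 0.0, 0.0};

  // Record one round, mirroring the `aggregate << count` usage above.
  sample_aggregate &operator<<(const sample &s) {
    count++;
    total.elapsed_ns += s.elapsed_ns;
    total.instructions += s.instructions;
    total.cycles += s.cycles;
    // Assumption: the "best" round is the one with the smallest elapsed time.
    if (s.elapsed_ns < best.elapsed_ns) {
      best = s;
    }
    return *this;
  }

  // Averages over all recorded rounds (only meaningful when count > 0).
  double elapsed_ns() const { return total.elapsed_ns / count; }
  double instructions() const { return total.instructions / count; }
  double cycles() const { return total.cycles / count; }
};
```

With two rounds of 10 ns and 20 ns, this sketch would report an average of 15 ns and a best round of 10 ns. The best round is often the more stable figure, since it is less affected by interruptions.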
I updated my original benchmark which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:
# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
strtod       33.10 ns/float    428.06 instructions/float    75.32 cycles/float    5.68 instructions/cycle
fastfloat     9.53 ns/float    193.78 instructions/float    27.24 cycles/float    7.11 instructions/cycle
I have some doubts about linux-perf-events.h.
Question 1: it seems that ids is never used. Why not a local variable?
for (auto config : config_vec) {
  attribs.config = config;
  fd = static_cast<int>(syscall(__NR_perf_event_open, &attribs, pid, cpu, group, flags));
  if (fd == -1) {
    report_error("perf_event_open");
  }
  ioctl(fd, PERF_EVENT_IOC_ID, &ids[i++]);
  if (group == -1) {
    group = fd;
  }
}
Question 2: how should I understand “our actual results are in slots 1,3,5”?
for (uint32_t i = 1; i < temp_result_vec.size(); i += 2) {
  results[i / 2] = temp_result_vec[i];
}
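One plausible reading of this loop, assuming the event group was opened with read_format set to PERF_FORMAT_GROUP | PERF_FORMAT_ID: a single read() on the group leader then returns the number of events followed by one (value, id) pair per event, so the counter values sit in the odd slots. The sketch below mimics that layout with a plain vector; extract_values is a hypothetical helper and not part of linux-perf-events.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout of a grouped read with PERF_FORMAT_GROUP | PERF_FORMAT_ID:
//   buffer[0]         = number of events in the group (nr)
//   buffer[1 + 2 * k] = value of event k
//   buffer[2 + 2 * k] = id of event k
// extract_values is a hypothetical helper, not part of linux-perf-events.h.
std::vector<uint64_t> extract_values(const std::vector<uint64_t> &buffer) {
  std::vector<uint64_t> values;
  for (std::size_t i = 1; i < buffer.size(); i += 2) {
    values.push_back(buffer[i]); // odd slots hold the counter values
  }
  return values;
}
```

For a group of three events, a buffer such as {3, 100, 11, 200, 12, 300, 13} would yield the values 100, 200 and 300, taken from slots 1, 3 and 5. The even slots hold the ids, which would let you match each value to its event.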
Pull requests invited!
Why is the instruction count a double? Shouldn’t it be a whole number?
Yes. Of course, you can represent an integer using a floating-point number… which is convenient if you want to compute an average, for example.
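To illustrate the point, here is a minimal sketch of averaging whole-number counts. The helper average_count is hypothetical and not taken from the library. A double represents every integer up to 2^53 exactly, which is far more than the instruction counts of a typical benchmark.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: average of whole-number instruction counts.
// Assumes counts is non-empty.
double average_count(const std::vector<uint64_t> &counts) {
  double sum = 0.0;
  for (uint64_t c : counts) {
    sum += static_cast<double>(c); // exact for values up to 2^53
  }
  return sum / static_cast<double>(counts.size());
}
```

Two rounds of 428 and 429 instructions average to 428.5, a value that an integer type could not hold.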
I was able to reproduce the measurement on my end using Dougall’s code. Thank you for linking it in your code.
On x86, rdtsc starts counting when the CPU powers on and keeps increasing. That is not the case with Dougall’s code. I think the counting starts with the program, as I’ve noticed that it starts with a small number every time. Do you concur? Here[1] is the snippet I am running. Example run:
// start, stop, stop-start
1363160, 6005521463, 6004158303
6005554311, 12010098583, 6004544272
12010107912, 18013040904, 6002932992
18013048020, 24017295657, 6004247637
24017306678, 30023735547, 6006428869
30023751252, 36031294826, 6007543574
36031304510, 42037625498, 6006320988
42037635149, 48046499216, 6008864067
48046506980, 54050169260, 6003662280
54050174555, 60058717182, 6008542627
It’s mostly a copy-paste of the original code. I intend to expose an rdtsc function and call it from a Python program.
[1] https://gist.github.com/quazi-irfan/3ee4789e9752bc8b3b958300157235a5
You should make sure you understand what rdtsc outputs on modern CPUs.