Programmers use software benchmarking to measure the speed of software. We need to distinguish system benchmarking, where one seeks to measure the performance of a system, with microbenchmarking, where one seeks to measure the performance of a small, isolated operation.
For example, if you are building a browser, you might want to know how quickly the iPhone 7 can retrieve and display the Wikipedia home page. That’s what I call system benchmarking.
Microbenchmarking is much more narrow in scope. So maybe you are interested in how quickly your processor can divide two numbers. Or maybe you want to know how quickly your processor can generate a random number.
Microbenchmarking is almost entirely distinct from system benchmarking, even if it looks superficially the same.
If you are trying to find out how quickly your processor can divide two integers, you want to know that, nothing else. You don’t want to measure how quickly the processor can divide two numbers while loading a website and cleaning out its registry.
Microbenchmarking is a form of science. Think about the physicist trying to measure the speed of light in some medium. It usually calls for idealized conditions. Why would you want to use idealized conditions when you know that they will not incur in the normal course of system operations?
- Idealized conditions makes it easier to reason about the results. You are much more likely to know about the factors that influence your code if you have far fewer factors.
- Idealized conditions are more consistent and reproducible. Real systems are not stationary. Their performance tends to fluctuate. And to reproduce even your own results can be a challenge when working with real systems. A small idealized scenario is much easier to reproduce.
Though microbenchmarking can be used as part of an engineering project, the primary purpose is to gain novel insights.
You might object to idealized conditions… what if you want to know how fast your code would run “in practice”? The problem is that “in practice” is ill-defined. You need to tell us what the “practice” is. That is, you need to describe a system, and then you are back into system benchmarking.
If you are benchmarking one function, this function might have different performance characteristics in different contexts. Of course, if you know precisely how the function is used, with what data, in what server configuration… you can benchmark the system you care about and deduce the performance characteristics if you one function in that context.
System benchmarking is more typically engineering. You seek to make a system that people care about better. You can and should be able to get reproducible results with system benchmarking, but it is much more costly. For one thing, describing a whole system is much harder than describing a tiny function.
In the type of microbenchmarking I like to do, CPU microbenchmarking, the idealized conditions often take the following form:
- As much as possible, all the data is readily available to the processor. We often try to keep the data in cache.
- We focus on single-core performance.
- We avoid comparing functions that process vastly different memory layouts.
- We make sure that the processor runs at a flat clock frequency. In real systems, processors can run slower or faster depending on the system load. As far as I know, it is flat out impossible to know how many CPU cycles were spent on a given task if the CPU frequency varies on Intel processors. Let me quote Wikipedia on time-stamp counters (TSC):
Constant TSC behavior ensures that the duration of each clock tick is uniform and makes it possible to use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior for all later Intel processors.
Your processor can run faster or slower than the advertized frequency, unless you specifically forbid it.
- If the code performance depends on branching, then we make sure that the branch predictor has had the chance to see enough data to make sound predictions. An even better option is to avoid branching as much as possible (long loops are ok though).
- When using a programming language like Java, we make sure that the just-in-time compiler has done its work. Ideally, you’d want to avoid the just-in-time compiler entirely because it is typically not deterministic: you cannot count on it to compile the code the same way each time it runs.
- You keep memory allocation (including garbage collection) to a minimum.
- You keep system calls to a minimum.
- You must benchmark something that lasts longer than ~30 cycles, but you can use loops.
- You have to examine the assembly output from your compiler to make sure that your compiler is not defeating your benchmark. Optimizing compilers can do all sorts of surprising things. For example, your compiler could figure out that it does not need to compute the trigonometric functions you are calling very accurately, and so it could fall back on a cheaper approximation. If you can’t examine the assembly output, you should be very careful.
Of course, these are not requirements for a good microbenchmark, it is just a set of requirements that I have often found useful.
If you are lucky and can meet the above requirements, then you can run reliable CPU microbenchmarks in microseconds.
What do I mean by reliable? Last week, I reported that standard C code can interleave two 32-bit integers into a 64-bit integers using about 3.6 cycles on a Skylake processor. In the my original post, I only ran 5 trials, I picked the minimum and I divided the total clock time by the number of pairs. This was criticized: according to some, my benchmark did not last long enough to “stabilize” the results; according to others, my test is too short to have good accuracy.
I returned to this Skylake processor and computed a histogram. I run my short function (which computes 1000 interleaves) 2048 times, and each time I record the duration, before dividing by the number of pairs (1000). Running this whole experiment takes less than a second, and only requires a simple script. So let us look at the histogram:
The bulk of the results are in the 3.6 to 3.65 range (72% of all data points). The median is 3.636. As a first approximation, the noise follows a log-normal distribution. You might expect the measurement noise to follow a normal distribution (a bell curve), but normal distributions are uncommon when measuring computer performance.
From this histogram, one could argue that the “real” time is maybe slightly above 3.6 cycles per pair. It is maybe somewhere between 3.6 and 3.7 cycles. But that’s clearly a reasonable uncertainty.
Even though 72% of data points are between 3.6 and 3.65, the average is 3.6465… but that can be explained by the presence of outliers (for example, one measure was 17.2).
I like to report the minimum because it is easy to reason about: it is close to the best-case scenario for the processor. We know a lot about what processors can do in the best case… Yes, it is idealized but that’s fine.
My minimum is still the minimum of large averages (1000 pairs)… so it is more robust than you might expect. Also I can reasonably expect the noise to be strictly positive, so the minimum makes sense as an estimator.
If you want to know how fast a human being can run, you look at the world’s records and pick the smallest time. When I run microbenchmarks, I want to know how fast my processor can run my code in ideal conditions.
If you had a normal distribution (bell curve), then taking the minimum would be a terrible idea because you would tend to track unlikely events (also called sigma events). But that’s not the case in performance: with a proper testing methodology, your minimum will be consistent from run to run. Some people argue for the median, but the median is both harder to compute (more overhead) and not as precise (in practice) as the minimum.
Is the minimum necessarily a good metric? I think it is in CPU-bound microbenchmarks. If you can check that the average (on a quiet machine) is close to the minimum, then the minimum is surely sound. Intel recommends looking at the variance of the minimum from run to run. If the variance is small (e.g., 0), then the minimum is likely a sound metric.
There is nothing magical about the average because your distributions are almost never normal, it is just easily computed. If you are doing service benchmarks, percentiles (1%,20%, 50%,80%,99%) are very useful, and the minimum and maximum are just instances of percentiles.
When in doubt regarding which metric to use, just plot your data. If you can’t script it, just throw the numbers in a spreadsheet (like excel) and generate some graph.
Notice how I returned to this benchmark a week later, after the machine was updated and rebooted, and was able, without effort, to reproduce exactly my prior results.
Why don’t I include normality tests, standard errors and all that? Because I don’t need to. The best statistician is the one you never need. It is something of a myth that scientific results need fancy statistics: great scientific results require very little statistical sophistication. You just need to control the factors involved so that the numbers can speak for themselves.
A good analogy is how you weight people. If you do things right, that is, you weight the individual naked at always the same time at the beginning of the day, before any food was eaten, a single (short) measure each day is more than good enough. The “error” is probably small and irrelevant. If you weight people randomly during the day, after random meals, wearing random clothes, then you may need many measures to accurately estimate someone’s weight. Of course, once you throw in the complexity of having to deal with a whole population, then it becomes much harder to control all the factors involved and you may need fancy statistics again.
Wouldn’t I get more accurate results if I repeated my tests more often? Maybe not. If you pick a textbook, you might learn that averaging repeated measures brings you closer to a true value… but there are hidden assumptions behind this result that typically do not hold, and they typically do not hold in practice. Averaging many log-normal distributions does not lead to more precise results in theory, and averaging lots and lots of benchmark results often fails to provide higher precision. Moreover, it is very hard to keep a modern system busy doing just one small thing for a long time. You can do it, but it requires a lot more care.
So long-running benchmarks are often not good idealized microbenchmarks because they also measure all sorts of irrelevant operations like unnecessary cache misses, context switches and so forth. The numbers you get often depend on how quiet your machine was at the time. You can run the same benchmark for three days in a row and get different numbers (with error margins far above 1%). It is certainly possible to get good long-running microbenchmarks, it is just harder.
Think about the surface of the Earth. If you move very little, then you can safely assume that the ground is flat and that we live on a planar surface. If you move a lot, you will encounter montains and seas, and so forth.
(Of course, system benchmarks can and maybe should be run over a long time… but that’s different!)
Can it hurt to run really exhaustive tests? I think it can. The computer might not tire of waiting for seconds and minutes, but the human mind fares a lot better when answers keep coming quickly. I favor many small and cheap microbenchmarks over a few long-running ones, even when the long-running benchmarks are perfectly clean.
The idea that if a benchmark takes longer to run, it ought to be better might seem intuitively self-evident, but I think we should be critical of this view. Fast is good.