Programmers use software benchmarking to measure the speed of software. We need to distinguish system benchmarking, where one seeks to measure the performance of a system, with microbenchmarking, where one seeks to measure the performance of a small, isolated operation.
For example, if you are building a browser, you might want to know how quickly the iPhone 7 can retrieve and display the Wikipedia home page. That’s what I call system benchmarking.
Microbenchmarking is much more narrow in scope. So maybe you are interested in how quickly your processor can divide two numbers. Or maybe you want to know how quickly your processor can generate a random number.
Microbenchmarking is almost entirely distinct from system benchmarking, even if it looks superficially the same.
If you are trying to find out how quickly your processor can divide two integers, you want to know that, nothing else. You don’t want to measure how quickly the processor can divide two numbers while loading a website and cleaning out its registry.
Microbenchmarking is a form of science. Think about the physicist trying to measure the speed of light in some medium. It usually calls for idealized conditions. Why would you want to use idealized conditions when you know that they will not incur in the normal course of system operations?
- Idealized conditions makes it easier to reason about the results. You are much more likely to know about the factors that influence your code if you have far fewer factors.
- Idealized conditions are more consistent and reproducible. Real systems are not stationary. Their performance tends to fluctuate. And to reproduce even your own results can be a challenge when working with real systems. A small idealized scenario is much easier to reproduce.
Though microbenchmarking can be used as part of an engineering project, the primary purpose is to gain novel insights.
You might object to idealized conditions… what if you want to know how fast your code would run “in practice”? The problem is that “in practice” is ill-defined. You need to tell us what the “practice” is. That is, you need to describe a system, and then you are back into system benchmarking.
System benchmarking is more typically engineering. You seek to make a system that people care about better. You can and should be able to get reproducible results with system benchmarking, but it is much more costly. For one thing, describing a whole system is much harder than describing a tiny function.
In the type of microbenchmarking I like to do, CPU microbenchmarking, the idealized conditions often take the following form:
Of course, these are not requirements for a good microbenchmark, it is just a set of requirements that I have often found useful.
If you are lucky and can meet the above requirements, then you can run reliable CPU microbenchmarks in microseconds.
What do I mean by reliable? Last week, I reported that standard C code can interleave two 32-bit integers into a 64-bit integers using about 3.6 cycles on a Skylake processor. In the my original post, I only ran 5 trials, I picked the minimum and I divided the total clock time by the number of pairs. This was criticized: according to some, my benchmark did not last long enough to “stabilize” the results; according to others, my test is too short to have good accuracy.
I returned to this Skylake processor and computed a histogram. I run my short function (which computes 1000 interleaves) 2048 times, and each time I record the duration, before dividing by the number of pairs (1000). Running this whole experiment takes less than a second, and only requires a simple script. So let us look at the histogram:
The bulk of the results are in the 3.6 to 3.65 range (72% of all data points). The median is 3.636. As a first approximation, the noise follows a log-normal distribution. You might expect the measurement noise to follow a normal distribution (a bell curve), but normal distributions are uncommon when measuring computer performance.
From this histogram, one could argue that the “real” time is maybe slightly above 3.6 cycles per pair. It is maybe somewhere between 3.6 and 3.7 cycles. But that’s clearly a reasonable uncertainty.
Even though 72% of data points are between 3.6 and 3.65, the average is 3.6465… but that can be explained by the presence of outliers (for example, one measure was 17.2).
I like to report the minimum because it is easy to reason about: it is close to the best-case scenario for the processor. We know a lot about what processors can do in the best case… Yes, it is idealized but that’s fine.
My minimum is still the minimum of large averages (1000 pairs)… so it is more robust than you might expect. Also I can reasonably expect the noise to be strictly psitive, so the minimum makes sense as an estimator.
If you want to know how fast a human being can run, you look at the world’s records and pick the smallest time. When I run microbenchmarks, I want to know how fast my processor can run my code in ideal conditions.
If you had a normal distribution (bell curve), then taking the minimum would be a terrible idea because you would tend to track unlikely events (also called sigma events). But that’s not the case in performance: with a proper testing methodology, your minimum will be consistent from run to run.
Is the minimum necessarily a good metric? I think it is in CPU-bound microbenchmarks. If you can check that the average (on a quiet machine) is close to the minimum, then the minimum is surely sound. Intel recommends looking at the variance of the minimum from run to run. If the variance is small (e.g., 0), then the minimum is likely a sound metric.
There is nothing magical about the average because your distributions are almost never normal, it is just easily computed. If you are doing service benchmarks, percentiles (1%,20%, 50%,80%,99%) are very useful, and the minimum and maximum are just instances of percentiles.
When in doubt regarding which metric to use, just plot your data. If you can’t script it, just throw the numbers in a spreadsheet (like excel) and generate some graph.
Notice how I returned to this benchmark a week later, after the machine was updated and rebooted, and was able, without effort, to reproduce exactly my prior results.
Why don’t I include normality tests, standard errors and all that? Because I don’t need to. The best statistician is the one you never need. It is something of a myth that scientific results need fancy statistics: great scientific results require very little statistical sophistication. You just need to control the factors involved so that the numbers can speak for themselves.
A good analogy is how you weight people. If you do things right, that is, you weight the individual naked at always the same time at the beginning of the day, before any food was eaten, a single (short) measure each day is more than good enough. The “error” is probably small and irrelevant. If you weight people randomly during the day, after random meals, wearing random clothes, then you may need many measures to accurately estimate someone’s weight. Of course, once you throw in the complexity of having to deal with a whole population, then it becomes much harder to control all the factors involved and you may need fancy statistics again.
Wouldn’t I get more accurate results if I repeated my tests more often? Maybe not. If you pick a textbook, you might learn that averaging repeated measures brings you closer to a true value… but there are hidden assumptions behind this result that typically do not hold. Moreover, it is very hard to keep a modern system busy doing just one small thing for a long time. You can do it, but it requires a lot more care.
So long-running benchmarks are often not good idealized microbenchmarks because they also measure all sorts of irrelevant operations like unnecessary cache misses, context switches and so forth. The numbers you get often depend on how quiet your machine was at the time. You can run the same benchmark for three days in a row and get different numbers (with error margins far above 1%). It is certainly possible to get good long-running microbenchmarks, it is just harder.
Think about the surface of the Earth. If you move very little, then you can safely assume that the ground is flat and that we live on a planar surface. If you move a lot, you will encounter montains and seas, and so forth.
(Of course, system benchmarks can and maybe should be run over a long time… but that’s different!)
Can it hurt to run really exhaustive tests? I think it can. The computer might not tire of waiting for seconds and minutes, but the human mind fares a lot better when answers keep coming quickly. I favor many small and cheap microbenchmarks over a few long-running ones, even when the long-running benchmarks are perfectly clean.
The idea that if a benchmark takes longer to run, it ought to be better might seem intuitively self-evident, but I think we should be critical of this view. Fast is good.