When optimizing software, we routinely measure the time that a given function or task takes. Our goal is to decide whether the new code is more or less efficient. The typical assumption is that the timings follow a normal distribution, and so we should report the average time. If the average time goes up, our new code is less efficient.

I believe that the normality assumption is frequently violated in software benchmarks. I recently gave a talk on precise and accurate benchmarking (video, slides) at a “Benchmarking in the Data Center” workshop where I made this point in detail.

Why should it matter? If you have normal distributions, you can mathematically increase the accuracy of the measure by taking more samples: your error should go down as the square root of the number of measurements.
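Here is a small sketch of that square-root law under a normal model; the parameters (mean 100, standard deviation 10) are made up for illustration:

```python
import random
import statistics

# Under a normal model, the standard error of the average shrinks
# like 1/sqrt(n). Parameters are made up for illustration.
random.seed(42)

def average_of_sample(n):
    return statistics.fmean(random.gauss(100, 10) for _ in range(n))

spreads = {}
for n in (10, 100, 1000):
    # Repeat the n-sample experiment many times and look at how much
    # the computed average fluctuates between experiments.
    averages = [average_of_sample(n) for _ in range(2000)]
    spreads[n] = statistics.stdev(averages)
    print(n, round(spreads[n], 3))
# Each 10x increase in n shrinks the fluctuation by about sqrt(10).
```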

If you have ever tried to increase the number of samples in the hope of getting a more accurate result, you may have been severely disappointed. That is because the timings often more closely resemble a *log normal distribution*:

A *log normal distribution* is asymmetrical: the mean is relatively close to the minimum, and there is a long tail… as you take more and more samples, you may keep finding larger and larger values.
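A minimal sketch of this asymmetry, using Python's built-in log normal sampler (the parameters are arbitrary):

```python
import random
import statistics

# Draw log normal samples and compare how far the mean sits from the
# minimum versus the maximum. Parameters (mu=0, sigma=0.5) are arbitrary.
random.seed(1)
samples = [random.lognormvariate(0, 0.5) for _ in range(10000)]

low, avg, high = min(samples), statistics.fmean(samples), max(samples)
print("mean - min:", avg - low)   # small gap: the mean hugs the minimum
print("max - mean:", high - avg)  # large gap: the long tail
```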

You can often show that you do not have a normal distribution because you find 4-sigma, 5-sigma or even 13-sigma events: you measure values that are far above the average relative to your estimated standard deviation. Under a normal distribution, it is exceedingly unlikely to land several standard deviations away from the mean. However, it happens much more easily with a log normal.
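You can reproduce such many-sigma events with simulated data. In this sketch (arbitrary parameters), the log normal sample produces events several sigmas above the mean, while the normal sample does not:

```python
import random
import statistics

def max_sigma(samples):
    # How many estimated standard deviations the largest sample
    # sits above the sample mean.
    avg = statistics.fmean(samples)
    sd = statistics.stdev(samples)
    return (max(samples) - avg) / sd

random.seed(2)
normal = [random.gauss(0, 1) for _ in range(10000)]
lognormal = [random.lognormvariate(0, 0.5) for _ in range(10000)]

print("normal    :", max_sigma(normal))     # typically around 4 sigma
print("log normal:", max_sigma(lognormal))  # typically far beyond
```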

Of course, real data is messy. I am not claiming that your timings precisely follow a log normal distribution. It is merely a model. Nevertheless, it suggests that reporting the average and the standard error is inadequate. I like to measure the minimum and the average. You can then use the distance between the average and the minimum as an easy-to-measure error metric: if the average is far from the minimum, then your real-world performance could differ drastically from the minimum. The minimum value is akin to the frictionless model in physics: it is the performance you get after taking out all the hard-to-model features of the problem.

People object that they want the real performance, but that is often an ill-posed problem, like asking how fast a real ball rolls down a real slope, with friction and air resistance: you can measure it, but a ball rolling down a slope does not have inherent, natural friction and air resistance. Those come from your specific use case. In other words, a given piece of software does not have a natural performance distribution independent of your specific use case. However, it is possible to measure the best performance that a function might have on a given CPU. The function does not have an intrinsic performance distribution, but it does have a well-defined minimum time of execution.

The minimum is easier to measure accurately than the average. I routinely achieve a precision of 1% or better in practice. That is, if I rerun the same benchmark another day on the same machine, I can be almost certain that the result won’t vary by much more than 1%. The average is slightly more difficult to nail down. You can verify that it is so with a model: if you generate log-normally distributed samples, you will find it easier to determine the minimum than the average with high accuracy.

What about the median? Let us imagine that I take 30 samples from a log normal distribution. I repeat many times, each time measuring the average, the median and the minimum. Both the median and the average will have a greater relative standard error than the minimum. In practice, I also find that the median is both harder to compute (more overhead) and not as precise as the minimum.
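This can be checked with a simulation. The sketch below models a timing as a fixed floor plus log normal noise — that floor is my own assumption, chosen to mimic a function with a well-defined minimum execution time, and the parameters are illustrative:

```python
import random
import statistics

random.seed(0)

# Hypothetical timing model: a hard floor of 1.0 time unit plus
# log-normally distributed overhead. Parameters are illustrative.
def timing():
    return 1.0 + 0.1 * random.lognormvariate(0, 1)

def relative_error(stat, trials=1000, n=30):
    # Run a 30-sample benchmark many times and measure how much the
    # chosen summary statistic fluctuates between benchmarks.
    values = [stat([timing() for _ in range(n)]) for _ in range(trials)]
    return statistics.stdev(values) / statistics.fmean(values)

r_min = relative_error(min)
r_median = relative_error(statistics.median)
r_mean = relative_error(statistics.fmean)
print("min   :", r_min)
print("median:", r_median)
print("mean  :", r_mean)
```

Under this model, the minimum is the most stable of the three summaries from one benchmark run to the next.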

I find that under some conditions (fixed data structures/data layout, few system calls, single-core processing, no tiny function, and no JIT compiler) the minimum time elapsed is a good measure.

For my talk, I used a compute-bound routine: the performance of my function was not bound by the speed of the RAM, the network or the disk. I took data that was in CPU cache and I generated more data in CPU cache. For these types of problems, I find that timings often resemble a log normal distribution.

What about other types of tasks? What about memory-bound tasks?

I took a memory benchmark that I like to use. It consists of a large array spanning hundreds of megabytes, and the software must jump from location to location in it, incapable of predicting the next jump before completing the current read. It is often called a “pointer chasing” routine. I can interleave several pointer-chasing routines so that I have several loads in flight at any time: I call this the number of lanes. The more lanes I have, the more “bandwidth limited” I become; the fewer the lanes, the more “memory latency bound” I become. I am never compute bound in these tests, meaning that the number of instructions retired per cycle is always low.
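A sketch of the access pattern (illustrative only: an interpreted language cannot expose the memory-level parallelism that a compiled pointer-chasing benchmark actually measures):

```python
import random

def make_chain(n, rng):
    # Build a random cyclic permutation: each slot stores the index of
    # the next location to visit, so the next load cannot be predicted
    # before the current one completes.
    order = list(range(n))
    rng.shuffle(order)
    chain = [0] * n
    for i in range(n):
        chain[order[i]] = order[(i + 1) % n]
    return chain

def chase(chain, lanes, steps):
    # Interleave several independent chains ("lanes"): each step issues
    # one dependent load per lane, so up to `lanes` loads are in flight.
    rng = random.Random(7)
    positions = [rng.randrange(len(chain)) for _ in range(lanes)]
    for _ in range(steps):
        positions = [chain[p] for p in positions]
    return positions

chain = make_chain(1 << 12, random.Random(42))
positions = chase(chain, lanes=4, steps=1000)
print(positions)
```

In a real benchmark the array would span hundreds of megabytes so that every hop misses the cache.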

For each fixed number of lanes, I run the benchmark 30 times on an Amazon Intel c6i node running Ubuntu 22. The time elapsed varies from over 3 seconds per run (for one lane) down to about 1/6 s (for 30 lanes). I then estimate the standard deviation and compute the mean, the maximum and the minimum. I compute the bottom sigma as the gap (in number of standard deviations) between the minimum and the average, and the top sigma as the gap between the average and the maximum. If the distribution were normal, I should have roughly the same number of sigmas on either side (min and max), and I should not exceed 3 sigmas. I find that one side (the maximum) easily exceeds 3 sigmas, so it is not a normal distribution. It is also clearly not symmetrical.
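The computation looks like this on synthetic data (a stand-in for the 30 real runs; the floor-plus-log-normal model and its parameters are my own assumptions):

```python
import random
import statistics

random.seed(3)
# Thirty synthetic "timings": a fixed floor plus log normal noise.
timings = [1.0 + 0.1 * random.lognormvariate(0, 1) for _ in range(30)]

avg = statistics.fmean(timings)
sd = statistics.stdev(timings)
bottom_sigma = (avg - min(timings)) / sd  # gap below the mean
top_sigma = (max(timings) - avg) / sd     # gap above the mean

print("bottom:", bottom_sigma)
print("top   :", top_sigma)
# A normal distribution would give roughly symmetric gaps, both under 3.
```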

Yet my measures are relatively precise. Measured as the relative distance between the minimum and the mean, the margin is tiny, often under 1%.

It is interesting that the more lanes I have, the more accurate the results. Intuitively, this is not entirely surprising: more lanes break up the data dependency, so one bad step has less impact on the whole computation.

Thus, for a wide range of performance-related timings, you should not assume that you have a normal distribution without checking first! Computing the distance between the maximum and the mean divided by the standard deviation is a useful indicator. I personally find that a log normal distribution is a better model for my timings, at a high level.

**Further reading**: Log-normal Distributions across the Sciences: Keys and Clues. David Gilbertson has a related post: The mean misleads: why the minimum is the true measure of a function’s run time.

Yes, this is why I generally use nonparametric statistical tests for my end-to-end performance benchmarks – e.g., Kolmogorov–Smirnov, Wasserstein Distance, etc.

Hi Daniel – thanks for raising this issue and proposing a better metric. One way of thinking about this is that the normal distribution applies when the random errors are additive/subtractive, whereas for computer response times the errors are multiplicative. E.g., the system is slower, so everything takes 10% longer rather than taking a fixed extra 100ms (or whatever), and that is the log normal model. I recently blogged about this – https://adrianco.medium.com/percentiles-dont-work-analyzing-the-distribution-of-response-times-for-web-services-ace36a6a2a19
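The multiplicative-error model in this comment is easy to illustrate: if each stage of a request scales the running time by a small random factor, the log of the total is a sum of independent terms, so the total tends toward a log normal. A sketch with made-up factors:

```python
import math
import random
import statistics

random.seed(5)

def total_time(stages=20):
    t = 1.0
    for _ in range(stages):
        t *= random.uniform(1.0, 1.2)  # each stage adds 0-20% overhead
    return t

def skewness(xs):
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

times = [total_time() for _ in range(10000)]
print("skew of times:", skewness(times))                          # right-skewed
print("skew of logs :", skewness([math.log(t) for t in times]))   # near zero
```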

It’s odd that the min doesn’t seem to be commonly used as a summary statistic, e.g. for performance regression reports. It’s intuitively an asymptote for any real workload: on a given machine an operation can get arbitrarily slow for any number of reasons (scheduling, contention for resources, hot CPU, etc.), but it can only get so fast…

At Hasura we get a histogram view of latencies and also do something which I think is quite useful: we overlay a histogram built from just the first half of the samples taken (but with the counts doubled) on the histogram of the full set of samples. This lets you see whether the distribution is drifting over time, or suggests whether you should take more samples.
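A sketch of that overlay on synthetic latencies (the bucket width and distribution parameters are placeholders):

```python
import random
from collections import Counter

random.seed(9)
samples = [random.lognormvariate(0, 0.3) for _ in range(2000)]

def histogram(values, width=0.25):
    # Map each value to a bucket index and count occurrences.
    return Counter(int(v / width) for v in values)

full = histogram(samples)
half_doubled = Counter({bucket: 2 * count
                        for bucket, count in histogram(samples[:1000]).items()})

# If the two columns diverge, the distribution drifted mid-run.
for bucket in sorted(full):
    print(f"{bucket * 0.25:5.2f}  full={full[bucket]:4d}  half*2={half_doubled[bucket]:4d}")
```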

I once heard in a talk that you should use the median. (I think the talk was by Andrei Alexandrescu)

With enough samples the median approaches the minimum, but it’s more robust to outliers.

Why would the minimum be an outlier? Because CPUs are good at predicting things, so you want randomness to undo the prediction. But if you randomly get a fast run, or the CPU randomly predicts everything right, you get an unrealistic measurement. (See Emery Berger’s talk “performance matters” on why you want randomness)

Why not use the average? Because you never actually measured that number. The talk “How not to measure latency” by Gil Tene gives lots of reasons for why you only want numbers that actually happened.

> Because CPUs are good at predicting things, so you want randomness to undo the prediction.

Though I am sure it happens, I have never seen, in my work, the minimum be an outlier.

If you run something in a loop, and the branch predictor does well in one run, it is likely to do well in other runs.

I agree that the branch predictor is likely to do well in many runs if it does well in the minimum run. So the minimum won’t be an outlier because of the branch predictor.

But I’ve had microbenchmarks that run fast, and when I make innocent changes and recompile, they run measurably slower. Why? Because they had randomly gotten a good layout, just like Emery Berger mentions in “performance matters”. It’s not a theoretical concern, it really has happened to me and I have wasted lots of time trying to understand why seemingly similar things have measurable differences.

So at some point I turned on ASLR and changed my benchmarking code to run the executable from scratch for every N iterations of the benchmark. And immediately the differences from before disappeared. The code that I expected to have the same performance actually did have the same performance. But this only works if I use the median measured time (which I had already been doing). If I use the min instead, I’d find the one random run where ASLR gives an unusually good layout, and then I’d be confused again about why two similar things have different performance.
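The harness described here can be sketched like this; the workload is a stand-in Python one-liner, whereas in practice you would launch the actual benchmark binary:

```python
import statistics
import subprocess
import sys
import time

# Launch the workload from scratch each time so the address-space
# layout (and other per-process state) is re-randomized, then take
# the median across launches.
WORKLOAD = "print(sum(i * i for i in range(100000)))"

def one_launch():
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", WORKLOAD],
                   check=True, capture_output=True)
    return time.perf_counter() - start

launch_times = [one_launch() for _ in range(5)]
print("median:", statistics.median(launch_times))
```

Note that each launch time includes process startup overhead; a real harness would have the binary time itself and print the result.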

A distribution that has a similar shape is the Gumbel distribution https://en.wikipedia.org/wiki/Gumbel_distribution

which models the distribution of the maximum (or minimum) of a set of samples with some underlying distribution.

IMO this “extreme value distribution” may naturally arise in some benchmarks — e.g. when the “task” that is measured lasts the duration of the longest of a number of subtasks, the duration of the subtasks being randomly distributed [according to some underlying distribution].
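This scenario is easy to simulate. In the sketch below (arbitrary parameters), each task waits for the slowest of 50 subtasks whose durations are drawn from a symmetric normal distribution, yet the task durations come out right-skewed, Gumbel-style:

```python
import random
import statistics

random.seed(11)

def task_time(subtasks=50):
    # The task finishes only when its slowest subtask finishes.
    return max(random.gauss(1.0, 0.1) for _ in range(subtasks))

def skewness(xs):
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

durations = [task_time() for _ in range(5000)]
print("mean:", statistics.fmean(durations))
print("skew:", skewness(durations))  # positive: a long right tail
```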

Just wondering how you decided it’s a lognormal you have vs. a Gumbel, because if it’s by eyeing, they both look pretty similar.

> Just wondering how you decided it’s a lognormal

My blog post says that in some cases, the log normal is a better model than the normal distribution (at a high level). My argument and methods do not rely on the exact nature of the distribution.

The slidedeck is very interesting.

Question: your definition of a compute-bound benchmark is that the data fits in CPU cache. When designing a micro-benchmark, would you make a difference between fitting in L3 cache (which is typically shared between cores and may incur additional benchmark noise) and L1/L2 caches? Would you actively try to avoid the L3 (or last-level cache more generally)?

It is possible to design a routine that is bound by the speed of the L3 cache, while not memory-bound per se, but I don’t expect that to be a common occurrence.

Why not use the median?

I’ve also always learned that “the minimum is more sensitive to outliers”. However, I suppose there’s a difference between a microbenchmark and a larger benchmark: in a microbenchmark there’s naturally less variation, so the minimum is more “stable”, while in a larger benchmark you’re less likely to find the true minimum after a limited number of runs.

I’d also add that I’ve always considered the goal of statistics as a way to reduce many numbers (e.g. 1000s of individual measurements) to a few. However, trying to summarize performance results in just one or two numbers – whether it’s the minimum, median, or average & std dev – will always be too simplistic. Especially because distributions are often asymmetric. So an average and standard deviation doesn’t tell you much. Therefore I think it’s practically a necessity to use something like a box plot (or violin plot). Then you automatically get the minimum, median, etc, and an overview of the underlying distribution, which is much more informative.

Do you have an example in the context of a benchmark, where you are measuring the running time of a function and the minimum ends up being an outlier in the data?