Microbenchmarking calls for idealized conditions

Programmers use software benchmarking to measure the speed of software. We need to distinguish system benchmarking, where one seeks to measure the performance of a whole system, from microbenchmarking, where one seeks to measure the performance of a small, isolated operation.

For example, if you are building a browser, you might want to know how quickly the iPhone 7 can retrieve and display the Wikipedia home page. That’s what I call system benchmarking.

Microbenchmarking is much narrower in scope. Maybe you are interested in how quickly your processor can divide two numbers. Or maybe you want to know how quickly your processor can generate a random number.

Microbenchmarking is almost entirely distinct from system benchmarking, even if it looks superficially the same.

If you are trying to find out how quickly your processor can divide two integers, you want to know that, nothing else. You don’t want to measure how quickly the processor can divide two numbers while loading a website and cleaning out its registry.

Microbenchmarking is a form of science. Think about the physicist trying to measure the speed of light in some medium. It usually calls for idealized conditions. Why would you want to use idealized conditions when you know that they will not occur in the normal course of system operation?

  1. Idealized conditions make it easier to reason about the results. You are much more likely to understand the factors that influence your code if there are far fewer factors.
  2. Idealized conditions are more consistent and reproducible. Real systems are not stationary: their performance tends to fluctuate, and even reproducing your own results can be a challenge when working with real systems. A small idealized scenario is much easier to reproduce.

Though microbenchmarking can be used as part of an engineering project, the primary purpose is to gain novel insights.

You might object to idealized conditions… what if you want to know how fast your code would run “in practice”? The problem is that “in practice” is ill-defined. You need to tell us what the “practice” is. That is, you need to describe a system, and then you are back into system benchmarking.

If you are benchmarking one function, this function might have different performance characteristics in different contexts. Of course, if you know precisely how the function is used, with what data, in what server configuration… you can benchmark the system you care about and deduce the performance characteristics of that one function in that context.

System benchmarking is more typical of engineering. You seek to make a system that people care about better. You can and should be able to get reproducible results with system benchmarking, but it is much more costly. For one thing, describing a whole system is much harder than describing a tiny function.

In the type of microbenchmarking I like to do, CPU microbenchmarking, the idealized conditions often take the following form:

  • As much as possible, all the data is readily available to the processor. We often try to keep the data in cache.
  • We focus on single-core performance.
  • We avoid comparing functions that process vastly different memory layouts.
  • We make sure that the processor runs at a flat clock frequency. In real systems, processors can run slower or faster depending on the system load. As far as I know, on Intel processors it is flat out impossible to know how many CPU cycles were spent on a given task if the CPU frequency varies. Let me quote Wikipedia on time-stamp counters (TSC):

    Constant TSC behavior ensures that the duration of each clock tick is uniform and makes it possible to use the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior for all later Intel processors.

    Your processor can run faster or slower than the advertised frequency, unless you specifically forbid it.

  • If the code performance depends on branching, then we make sure that the branch predictor has had the chance to see enough data to make sound predictions. An even better option is to avoid branching as much as possible (long loops are ok though).
  • When using a programming language like Java, we make sure that the just-in-time compiler has done its work. Ideally, you’d want to avoid the just-in-time compiler entirely because it is typically not deterministic: you cannot count on it to compile the code the same way each time it runs.
  • You keep memory allocation (including garbage collection) to a minimum.
  • You keep system calls to a minimum.
  • You must benchmark something that lasts longer than ~30 cycles, but you can use loops.
  • You have to examine the assembly output from your compiler to make sure that your compiler is not defeating your benchmark. Optimizing compilers can do all sorts of surprising things. For example, your compiler could figure out that it does not need to compute the trigonometric functions you are calling very accurately, and so it could fall back on a cheaper approximation. If you can’t examine the assembly output, you should be very careful.
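
To make this concrete, here is a minimal sketch, in C, of what such an idealized CPU microbenchmark might look like. It is my own illustration, not a prescription: it assumes an x86-64 processor with a constant TSC and a compiler that provides the __rdtsc intrinsic via <x86intrin.h>. The data fits in cache, a warm-up pass precedes the timed trials, each timed region lasts well beyond 30 cycles, and we keep the minimum over many short trials:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h> // __rdtsc (GCC/Clang on x86-64)

    #define N 1000     // enough work per trial to last well beyond ~30 cycles
    #define TRIALS 500 // many short trials; we keep the minimum

    // toy function under test: a simple sum with a predictable access pattern
    static uint64_t sum(const uint64_t *data, size_t n) {
      uint64_t acc = 0;
      for (size_t i = 0; i < n; i++) acc += data[i];
      return acc;
    }

    int main(void) {
      static uint64_t data[N];
      for (size_t i = 0; i < N; i++) data[i] = i;

      volatile uint64_t sink; // the volatile store keeps the call from being optimized out
      sink = sum(data, N);    // warm-up: brings the data into cache

      uint64_t best = UINT64_MAX;
      for (int t = 0; t < TRIALS; t++) {
        uint64_t start = __rdtsc();
        sink = sum(data, N);
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < best) best = elapsed;
      }
      (void)sink;
      printf("best: %.2f cycles per element\n", (double)best / N);
      return 0;
    }

As noted above, you would still want to inspect the generated assembly to confirm that the compiler has not defeated the timed loop.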

Of course, these are not strict requirements for a good microbenchmark; they are simply conditions that I have often found useful.

If you are lucky and can meet the above requirements, then you can run reliable CPU microbenchmarks in microseconds.

What do I mean by reliable? Last week, I reported that standard C code can interleave two 32-bit integers into a 64-bit integer using about 3.6 cycles on a Skylake processor. In my original post, I ran only 5 trials, picked the minimum, and divided the total clock time by the number of pairs. This was criticized: according to some, my benchmark did not last long enough to “stabilize” the results; according to others, my test was too short to have good accuracy.

I returned to this Skylake processor and computed a histogram. I ran my short function (which computes 1000 interleaves) 2048 times, and each time I recorded the duration before dividing by the number of pairs (1000). Running the whole experiment takes less than a second and requires only a simple script. So let us look at the histogram:

The bulk of the results are in the 3.6 to 3.65 range (72% of all data points). The median is 3.636. As a first approximation, the noise follows a log-normal distribution. You might expect the measurement noise to follow a normal distribution (a bell curve), but normal distributions are uncommon when measuring computer performance.

From this histogram, one could argue that the “real” time is slightly above 3.6 cycles per pair, maybe somewhere between 3.6 and 3.7 cycles. But that is clearly a reasonable uncertainty.

Even though 72% of data points are between 3.6 and 3.65, the average is 3.6465… but that can be explained by the presence of outliers (for example, one measure was 17.2).
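
For reference, here is a sketch, in C, of the measurement protocol described above: 2048 trials, each timing 1000 interleaves and recording the cycles per pair. The interleave routine shown is the textbook scalar bit-spreading version, which is an assumption on my part, not necessarily the code from the original post:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h> // __rdtsc

    #define PAIRS 1000
    #define TRIALS 2048

    // spread the 32 bits of x over the even bit positions of a 64-bit word
    static uint64_t spread(uint64_t x) {
      x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
      x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
      x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
      x = (x | (x << 2))  & 0x3333333333333333ULL;
      x = (x | (x << 1))  & 0x5555555555555555ULL;
      return x;
    }

    // interleave two 32-bit integers into one 64-bit integer
    static uint64_t interleave(uint32_t a, uint32_t b) {
      return spread(a) | (spread(b) << 1);
    }

    int main(void) {
      static uint32_t a[PAIRS], b[PAIRS];
      static uint64_t out[PAIRS];
      static double cycles_per_pair[TRIALS];
      for (int i = 0; i < PAIRS; i++) { a[i] = 3 * i; b[i] = 7 * i; }

      for (int t = 0; t < TRIALS; t++) {
        uint64_t start = __rdtsc();
        for (int i = 0; i < PAIRS; i++) out[i] = interleave(a[i], b[i]);
        cycles_per_pair[t] = (double)(__rdtsc() - start) / PAIRS;
      }

      // a checksum keeps the compiler from discarding the computed values;
      // the cycles_per_pair[] array is what goes into the histogram
      uint64_t checksum = 0;
      for (int i = 0; i < PAIRS; i++) checksum += out[i];
      printf("first trial: %.2f cycles per pair (checksum %llu)\n",
             cycles_per_pair[0], (unsigned long long)checksum);
      return 0;
    }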

I like to report the minimum because it is easy to reason about: it is close to the best-case scenario for the processor. We know a lot about what processors can do in the best case… Yes, it is idealized but that’s fine.

My minimum is still the minimum of large averages (1000 pairs)… so it is more robust than you might expect. Also I can reasonably expect the noise to be strictly positive, so the minimum makes sense as an estimator.

If you want to know how fast a human being can run, you look at world records and pick the smallest time. When I run microbenchmarks, I want to know how fast my processor can run my code in ideal conditions.

If you had a normal distribution (bell curve), then taking the minimum would be a terrible idea because you would tend to track unlikely events (also called sigma events). But that’s not the case in performance: with a proper testing methodology, your minimum will be consistent from run to run. Some people argue for the median, but the median is both harder to compute (more overhead) and not as precise (in practice) as the minimum.

Is the minimum necessarily a good metric? I think it is in CPU-bound microbenchmarks. If you can check that the average (on a quiet machine) is close to the minimum, then the minimum is surely sound. Intel recommends looking at the variance of the minimum from run to run. If the variance is small (e.g., 0), then the minimum is likely a sound metric.

There is nothing magical about the average (your distributions are almost never normal); it is just easy to compute. If you are doing service benchmarks, percentiles (1%, 20%, 50%, 80%, 99%) are very useful, and the minimum and maximum are just instances of percentiles.
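
As a small illustration (mine, not from the original post), here is how one might summarize a set of timings in C by sorting them and reading off the minimum, the median and a few other percentiles; the sample values are made up for the example:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b) {
      double x = *(const double *)a, y = *(const double *)b;
      return (x > y) - (x < y);
    }

    // value at (approximately) the p-th percentile of a sorted array of n samples
    static double percentile(const double *sorted, size_t n, double p) {
      size_t idx = (size_t)(p / 100.0 * (double)(n - 1));
      return sorted[idx];
    }

    int main(void) {
      // made-up timings (cycles per pair), for illustration only
      double samples[] = {3.64, 3.61, 3.66, 17.2, 3.63, 3.62, 3.65, 3.60};
      size_t n = sizeof(samples) / sizeof(samples[0]);
      qsort(samples, n, sizeof(double), cmp_double);
      printf("min=%.2f median=%.2f p99=%.2f max=%.2f\n",
             percentile(samples, n, 0), percentile(samples, n, 50),
             percentile(samples, n, 99), percentile(samples, n, 100));
      return 0;
    }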

When in doubt regarding which metric to use, just plot your data. If you can’t script it, just throw the numbers into a spreadsheet (like Excel) and generate a graph.

Notice how I returned to this benchmark a week later, after the machine was updated and rebooted, and was able, without effort, to reproduce exactly my prior results.

Why don’t I include normality tests, standard errors and all that? Because I don’t need to. The best statistician is the one you never need. It is something of a myth that scientific results need fancy statistics: great scientific results require very little statistical sophistication. You just need to control the factors involved so that the numbers can speak for themselves.

A good analogy is how you weigh people. If you do things right, that is, you weigh the individual naked, at the same time at the beginning of each day, before any food has been eaten, then a single (short) measurement each day is more than good enough. The “error” is probably small and irrelevant. If you weigh people randomly during the day, after random meals, wearing random clothes, then you may need many measurements to accurately estimate someone’s weight. Of course, once you throw in the complexity of having to deal with a whole population, it becomes much harder to control all the factors involved and you may need fancy statistics again.

Wouldn’t I get more accurate results if I repeated my tests more often? Maybe not. If you pick up a textbook, you might learn that averaging repeated measures brings you closer to a true value… but there are hidden assumptions behind this result, and they typically do not hold in practice. Averaging many log-normal distributions does not lead to more precise results in theory, and averaging lots and lots of benchmark results often fails to provide higher precision in practice. Moreover, it is very hard to keep a modern system busy doing just one small thing for a long time. You can do it, but it requires a lot more care.

So long-running benchmarks are often not good idealized microbenchmarks because they also measure all sorts of irrelevant operations like unnecessary cache misses, context switches and so forth. The numbers you get often depend on how quiet your machine was at the time. You can run the same benchmark for three days in a row and get different numbers (with error margins far above 1%). It is certainly possible to get good long-running microbenchmarks, it is just harder.

Think about the surface of the Earth. If you move very little, you can safely assume that the ground is flat and that we live on a planar surface. If you move a lot, you will encounter mountains, seas, and so forth.

(Of course, system benchmarks can and maybe should be run over a long time… but that’s different!)

Can it hurt to run really exhaustive tests? I think it can. The computer might not tire of waiting for seconds and minutes, but the human mind fares a lot better when answers keep coming quickly. I favor many small and cheap microbenchmarks over a few long-running ones, even when the long-running benchmarks are perfectly clean.

The idea that if a benchmark takes longer to run, it ought to be better might seem intuitively self-evident, but I think we should be critical of this view. Fast is good.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

8 thoughts on “Microbenchmarking calls for idealized conditions”

  1. Great article.

    In large part, the “stability” of results is only an issue if your test is long enough to experience the various external things that affect the results, such as frequency transitions, interrupts, context switches, another process being scheduled on the sibling hyperthread, etc.

    Then the suggested “solution” is to run your test so many times that all this noise “averages out”. Much better is to remove these sources of influence, as much as possible. Very short tests are a start: each run is short enough that it doesn’t (usually) receive any interrupt (and if it does, you detect it and throw out the result).

    Here are the other things which reduce the noise:

    1) Disable hyperthreading: unless you are specifically testing how some algorithm works when running on both hypercores, disabling hyperthreading reduced noise by an order of magnitude or more for me (for longer tests): the OS will sometimes schedule something on the sibling thread of the one running the benchmark, which will start competing for all sorts of resources.

    2) Disable turboboost. This is the biggest one and reduced test noise by two orders of magnitude. CPUs have max turbo ratios that depend on the number of cores currently running. This means that any activity on *any* core during your benchmark will suddenly change the speed of your benchmark: both because the frequency changes and because there is a short “dead time” where all cores stop running entirely while the voltage/current stabilizes across the CPU. This one is really important because it is the main way that other activity on your system can interfere with what appears to be a “core local” benchmark.

    3) Disable DVFS, aka “frequency scaling”. Even if TB is off, you’ll usually get sub-nominal frequency scaling to save power with the default settings of most OSes and CPUs. This one doesn’t have the same effect as TB, since there is no cross-core “max turbo ratio” effect, and in fact you can ignore it with a sufficiently long warmup (since the CPU will reach max frequency and stay there), but disabling it removes another variable and avoids the need for a lengthy warmup period. If you ever see a weird performance graph where the performance of a function seems to steadily increase the more you run it, it’s often frequency scaling messing up the results!

    4) Use performance counters with the rdpmc instruction to measure true cycles rather than wall time. This has the advantage of mostly being frequency independent, so even if you have some kind of frequency scaling you’ll get about the same number. It also lets you record only cycles spent in user-space if you want, avoiding counting system interrupts (but interrupts still perturb your results in other ways). It lets you get results in cycles directly, rather than trying to convert based on what you guess is the CPU frequency, and it avoids the timing routines, which may be more complex than necessary and may not be constant time (and their overhead is highly OS dependent). With this approach you can get reproducible results down to a single cycle, without looping your test at all!

    5) Various other things to “isolate” your CPU from the scheduler, such as setting isolcpus or using realtime priority or tickless kernels. The idea here is to prevent the scheduler from interrupting your process. I did this, but haven’t actually found it necessary in practice for most benchmarks: as long as your test is very short, the scheduler usually won’t interrupt it anyway (e.g., if your scheduler tick frequency is 250Hz, a test that runs for 1 us is almost never interrupted).

    6) Use “delta” measurement. The simple way to do this, if your test has a measurement loop and times the entire loop, is to run it for N iterations and for 2N, and then use (run2 – run1)/N as the time (see the sketch after this list). This cancels out all the fixed overhead, such as the clock/rdpmc call, the call to your benchmark method, and any loop setup overhead. It doesn’t cancel out the actual loop overhead though (e.g., increment and end-of-loop check), even though that’s often small or zero. If you want to do that, you actually need two separate loops with different numbers of calls (or inlined copies) of the code under test, and then apply the delta. It’s kind of risky to do that due to alignment effects, so the loop approach is usually fine.

    7) Look at the “min” and “median” of repeated results. If you did everything right the min and median will usually be identical (or off by a few parts per million) +/- one cycle sometimes. Or just look at a histogram as Daniel suggests. It’s super obvious if your results are stable or not.
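
    To illustrate the delta measurement in item 6, here is a minimal C sketch (not code from uarch-bench; it assumes an x86-64 machine and the __rdtsc intrinsic, and the loop body is a placeholder for the code under test):

        #include <stdint.h>
        #include <stdio.h>
        #include <x86intrin.h> // __rdtsc

        // toy dependent chain standing in for the code under test
        static uint64_t work(uint64_t x, uint64_t iters) {
          for (uint64_t i = 0; i < iters; i++) x = x * 6364136223846793005ULL + 1;
          return x;
        }

        static uint64_t time_iters(uint64_t iters, volatile uint64_t *sink) {
          uint64_t start = __rdtsc();
          *sink = work(*sink, iters); // volatile read/write keeps the work alive
          return __rdtsc() - start;
        }

        int main(void) {
          volatile uint64_t sink = 1;
          const uint64_t n = 100000;
          time_iters(n, &sink);                     // warm-up
          uint64_t run1 = time_iters(n, &sink);     // N iterations: fixed overhead + N*cost
          uint64_t run2 = time_iters(2 * n, &sink); // 2N iterations: fixed overhead + 2N*cost
          // (run2 - run1) / N: per-iteration cost, fixed overhead cancelled
          printf("%.2f cycles per iteration\n", ((double)run2 - (double)run1) / (double)n);
          return 0;
        }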

    I use these approaches in uarch-bench:

    https://github.com/travisdowns/uarch-bench

    which is intended for short, very accurate tests of things that might reveal interesting architectural details (although I haven’t added a ton of tests yet) – but it can also just be used as a micro-benchmark framework to throw random tests into.

    1. Hello Travis,

      I agree with most of your suggestions — indeed Krun does 1–5 (inclusive). For very small benchmarks, which I think Daniel is also most interested in, I can well believe that this approach does give quite stable numbers. uarch-bench looks very interesting!

      That said, we’re looking at benchmarks which, whether we want them to be or not, are too big to hope that we can isolate them from things like syscalls. That opens us up to a new world of pain and instability. For example, we can’t, in my opinion, use performance counters for two reasons. First, because users don’t perceive performance in terms of such counters: they can only perceive wall clock time. Second, because performance counters only measure userspace costs, we might provide a perverse incentive to move benchmark overhead into the kernel by abusing various features.

      An open question in my mind is: what are all the potential sources of performance non-determinism in a system? I think we have something of a handle on the software sources (though I’m sure there are things I haven’t thought of), but, beyond a few obvious things (e.g. frequency scaling, caches), I have very little insight into what performance non-determinism modern processors might introduce.

      Laurie

      1. That said, we’re looking at benchmarks which, whether we want them to be or not, are too big to hope that we can isolate them from things like syscalls. That opens us up to a new world of pain and instability.

        I think we need proper terminology. There are many reasons to “benchmark” and somehow, we end up with a single word encompassing various different activities. In my view, a microbenchmark is something you design so as to better understand software behavior. A microbenchmark, to me, is a “pure” and ideal benchmark, with as much of the complexity taken out as possible. If whatever I describe does not sound like a “microbenchmark” to you, then I am more than willing to use another term.

        1. Hello Daniel,

          Unfortunately microbenchmark is a heavily overloaded term already :/ I once tried to get some colleagues to use “picobenchmark” for very small benchmarks (i.e. a for loop with one statement inside), but failed. In retrospect, I’m not sure that name would have been hugely helpful; but, there again, I still don’t have any better ideas…

          Laurie

      2. Yeah, it’s horses for courses. Performance counters are a tool for (a) repeatedly measuring performance very exactly and (b) diagnosing performance problems.

        The (b) use is the common one: you want to know what your bottlenecks are and performance counters can tell you. I don’t think that use is controversial, and it’s useful to have them in a benchmarking framework since people tend to stick stuff in there they want to make faster.

        I guess we are mostly talking about (a) though. You are right that users don’t perceive performance in terms of performance counters, but in terms of wall-clock time. Sometimes, however, you just use performance counters as a high-precision proxy for wall-clock time. E.g., when tuning a very small loop you might want to immediately get cycle-accurate feedback on a single change. Doing this with a heavy wall-clock timer might introduce enough noise to make it difficult, but a cycle-accurate performance counter makes it easy.

        Of course, you always need to go back and confirm that the performance counter was in fact a good proxy for wall-clock time! It would be unfortunate if you were busy optimizing your perf-counter based metric and it turned out wall-clock time was going in a different direction. Luckily, metrics like cycles have a direct relationship to wall-clock time, so there aren’t many gotchas there.

        Note that performance counters measure kernel time just fine (it’s configurable: you can measure U only, K only or both). Measuring only user time can be a “feature” here too: if you know your code is user-only you can remove some noise from interrupts by excluding K time, although getting rid of the interrupts would be even better.

        About your last question, there aren’t _too_ many sources of non-determinism inside the CPUs themselves, at least in my experience, although what you consider inside vs outside is up for debate (e.g., most interrupts are externally generated, but what about SMIs?). Many things seem non-deterministic only because they are incompletely modeled (e.g., you often see variance in caching performance from run to run, but this can just be caused by the varying virtual-to-physical mapping you get for your pages making L2 caching more or less effective – deterministic from the CPU’s point of view, but not from the point of view of the process running in the OS).

        So I think a practical list would probably have things that are truly non-deterministic, and also ones that may be deterministic if you could put your hardware into an exact initial state and avoid all external perturbation; but since you can’t do that (butterfly effect and all that), they end up looking non-deterministic.

        In my experience the list above still isn’t too bad. The big ones are solved if you have consistent code and data alignment (including physical addresses, not just virtual), avoid hyperthreading and any contention for the core, keep your benchmark core-local, don’t experience any throttling conditions, and don’t get close to the ragged edge of any uarch resource (e.g., physical regs, ROB entries, LSD size, store buffers, etc, etc). Things then mostly work according to a reasonable model of processor performance.

    2. Hi Professor Lemire.

      Your benchmarking strategy (6) is very useful. It successfully stabilized my benchmark test results and it is so easy to implement. Thanks for your brilliant idea and posting it online.
