Benchmarking is hard: processors learn to predict branches

A lot of software is an intricate collection of branches (if-then clauses). For performance reasons, modern processors predict the results of these branches.

In my previous post, I showed how the bulk of your running time could be due to mispredicted branches. My benchmark consisted of writing 64 million random integers to an array. When I tried to write only the odd random integers, the performance became much worse, due to the mispredictions.

while (howmany != 0) {
    val = random();
    if ((val & 1) == 1) { // keep only the odd integers
      out[index] = val;
      index += 1;
    }
    howmany--;
}

Why 64 million integers and not, say, 2000? If you run just one test, then it should not matter. However, what if you are running multiple trials? Then the number of mispredicted branches quickly falls toward zero. The numbers on an Intel Skylake processor speak for themselves:

trial mispredicted branches (Intel Skylake)
1 48%
2 38%
3 28%
4 22%
5 14%

The “learning” keeps on going, as the following plot shows. Eventually, the fraction of mispredicted branches falls to about 2%.

That is, as you keep measuring how long the same task takes, it gets faster and faster because the processor learns to better predict the outcome. The quality of the “learning” depends on your exact processor, but I would expect newer, better processors to learn faster.
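To make the setup concrete, here is a minimal sketch of the kind of trial loop involved; the input size, the seed and the use of rand() are illustrative assumptions, not necessarily what my benchmark does.

#include <cstdint>
#include <cstdlib>
#include <vector>

// Each trial replays the same pseudo-random sequence over a small input,
// so the branch predictor can gradually "memorize" the branch outcomes.
int main() {
  const int howmany = 2000;
  std::vector<uint64_t> out(howmany);
  for (int trial = 0; trial < 20; trial++) {
    std::srand(1234); // same sequence on every trial
    int index = 0;
    for (int i = 0; i < howmany; i++) {
      uint64_t val = std::rand();
      if ((val & 1) == 1) { // mispredicted often on early trials, rarely later
        out[index] = val;
        index += 1;
      }
    }
    // measure time or mispredicted branches here (e.g., with hardware counters)
  }
  return 0;
}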

The latest server processors from AMD learn to predict the branch perfectly (within 0.1%) in under 10 trials.

trial mispredicted branches (AMD Rome)
1 52%
2 18%
3 6%
4 2%
5 1%
6 0.3%
7 0.15%
8 0.15%
9 0.1%

This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.

You should probably avoid benchmarking branchy code over small problems.

My code is available.

Credit: The AMD Rome numbers were provided by Velu Erwan.

Further reading: A case for (partially) TAgged GEometric history length branch prediction (Seznec et al.)

Mispredicted branches can multiply your running times

Modern processors are superscalar, meaning that they can execute many instructions at once. For example, some processors can retire four or six instructions per cycle. Furthermore, many of these processors can initiate instructions out-of-order: they can start working on instructions that appear much later in the code.

Meanwhile most code contains branches (if-then clauses). These branches are often implemented as “jumps”: the processor either jumps to an instruction further away or continues on its current path.

It is hard to reconcile branches with out-of-order superscalar execution. To do so, processors have sophisticated branch predictors. That is, the processor tries to predict the future. When it sees a branch, and thus a possible jump, it tries to guess which way it will go.

This often works quite well. For example, most loops are actually implemented as branches. At the end of each iteration in the loop, the processor must predict whether there will be a next iteration. It is often safe for the processor to predict that the loop will continue (forever). The processor will only mispredict one branch per loop in this manner.

There are other common examples. If you are accessing the content of an array, many languages will add “bounds checking”: before accessing the array value, there is a hidden check that the index is valid. If the index is not valid, then an error is generated; otherwise the code proceeds normally. Bounds checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.
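As an illustration, here is roughly what such a hidden branch looks like, written out explicitly in C++; this is a sketch of the idea, not what any particular compiler or language actually generates.

#include <cstddef>
#include <stdexcept>
#include <vector>

// A checked read: the branch is almost never taken, so it is easy to predict.
int checked_read(const std::vector<int>& values, std::size_t index) {
  if (index >= values.size()) {           // hidden bounds check
    throw std::out_of_range("bad index"); // rarely (ideally never) executed
  }
  return values[index];
}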

What happens if the branch is hard to predict?

What happens inside the processor is that all of the instructions that were executed but that follow the mispredicted branch must be cancelled, and the computation must start anew. You should expect a penalty of more than 10 cycles for each mispredicted branch. This penalty can multiply your running times.

Let us look at some simple code where we write out random integers to an output array:

while (howmany != 0) {
    out[index] =  random();
    index += 1;
    howmany--;
}

We can generate a decent random number in about 3 cycles on average. That does not mean the latency is 3 cycles: the total latency of the random number generator might be 10 cycles. But our processor is superscalar, so it can work on several random-number computations at once. Thus we may generate a new random number every 3 cycles or so.
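The argument does not depend on which generator you pick. As an illustration (not necessarily the generator from my benchmark), a splitmix64-style function needs only a handful of cheap, branch-free instructions per call:

#include <cstdint>

// splitmix64: a few cheap arithmetic instructions per call, no branches,
// so several calls can be in flight at once on a superscalar processor.
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;

uint64_t random64() {
  uint64_t z = (rng_state += 0x9E3779B97F4A7C15ULL);
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  return z ^ (z >> 31);
}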

Let us modify this function slightly so that we only write out the odd integers:

while (howmany != 0) {
    val = random();
    if ((val & 1) == 1) { // keep only the odd integers
      out[index] = val;
      index += 1;
    }
    howmany--;
}

Naively we might think that this new function could be faster. Indeed, we might only write out one out of two integers. We have a branch, but checking whether an integer is odd requires checking a single bit.

I benchmarked these two functions in C++ on a Skylake processor:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer

The second function takes about five times longer!

Is there something you can do? Yes. You can just remove the branch. You can characterize an odd integer by the fact that its bitwise logical AND with the value 1 is one. The trick is to increment the array index by one only when the random value is odd.

while (howmany != 0) {
    val = random();
    out[index] = val;
    index += (val & 1); // advance the index only when the value is odd
    howmany--;
}

In this new version, we always write the random value to the output array, even when it is not needed. At a glance, it looks wasteful. However, it does away with the mispredicted branches. In practice the performance is nearly as good as the original code, and much better than the version with branches:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer
branch removed 3.8 cycles per integer

Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.
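To illustrate what a branchless rewrite can look like, here is a generic example; a compiler may or may not produce something equivalent on its own.

#include <cstdint>

// Compute both possible next indexes, then select one without branching.
// Such a ternary often compiles to a conditional move, but compilers are
// not obligated to avoid the branch.
uint64_t next_index(uint64_t val, uint64_t index) {
  uint64_t incremented = index + 1;
  return (val & 1) ? incremented : index;
}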

An important take-away is that mispredicted branches are not a minor concern, they can have a large effect.

My source code is available.

Further reading: Processors learn to predict branches (follow-up blog post).

Science and Technology links (October 12th 2019)

  1. In many countries, like Canada, there is relatively little private (business) research. Meanwhile, other research indicates that private research is precisely the kind of research that leads directly to productivity growth. Private research seems concentrated in hubs like Silicon Valley. To entice businesses into doing research, the Canadian government has pursued an aggressive tax credit strategy: roughly speaking, research becomes a tax avoidance strategy. It does not work well.
  2. Artificial ovaries work in mice, and we think that they may work in women too.
  3. Alzheimer’s research has failed us despite massive investments. It might be time for a reset.
  4. Drinking alcohol triggers a hormone (FGF21) that makes you more likely to drink water.
  5. Dog owners are less likely to die.
  6. The greatest inequality might be in dating and sex. The bottom 80% of men (in terms of attractiveness) are competing for the bottom 22% of women and the top 78% of women are competing for the top 20% of men. In other words, a small number of men have their pick of all of the women and most guys are unattractive to most women.
  7. The U.S. Navy looked at skin cancer rates among its staff. They found the highest rate of skin cancer in people working essentially indoors.
  8. Fruit flies exposed to a combo of three different drugs lived 48% longer, even though the individual effect of each drug is relatively small.
  9. Moderate alcohol consumption is associated with reduced inflammation and improved responses to vaccination.
  10. Choline might protect against Alzheimer’s. Choline is found in meat among other places.
  11. You may have read that we find symmetric faces more attractive. A new study challenges this claim.
  12. Iron might be an aging factor.
  13. Climate models routinely predict higher temperatures than what is actually observed. Where does the heat go? A paper in Nature claimed that the heat in question went into the ocean. Nature withdrew the paper as it contained too many data processing mistakes.
  14. Alpha-ketoglutarate can significantly extend lifespan and healthspan in mice. You can get it in the form of cheap supplements.

The price of a MacBook Air worldwide

Apple sells identical laptops worldwide. There might be small differences with respect to power adaptors and so forth, but the laptops are the same internally. Of course, there are differences in taxes, but Apple quotes prices with taxes included. So I went shopping for a basic MacBook Air, in different countries, using Apple’s web site.

country local price US dollar equivalent
USA US$1099 US$1099
Canada CA$1449 US$1089
Japan ¥119,800 US$1120
Australia A$1,699 US$1146
France 1 249,00 € US$1374

According to these numbers, the cheapest MacBooks are in Canada and the most expensive ones are in France.

In countries where taxes vary by province or state, the Apple tax estimate might be wrong for some locations.

Science and Technology links (September 28th 2019)

  1. Researchers have effectively rejuvenated the damaged skin of mice by using “exosomes”. These are packages that cells send into their environment, and it appears that other cells react positively to exosomes in some conditions. The therapy is non-invasive (no needle!) and it appears to be superior to existing skin rejuvenation treatments (e.g., retinol, stem cells). Skin thickness and collagen production were both improved.
  2. Researchers have effectively created new neurons in the brain of mice by converting existing glial cells (which are abundant) into neurons. This could be used one day to help patients recover from brain injuries or strokes.
  3. About one in six deaths in the USA is due to lead exposure. Lead accumulates in your body over time and significantly increases your risk of cardiovascular disease, among other bad things. There are chelation therapies to lower your lead levels, but they are not in wide use even though they are surprisingly effective at improving the health of some sick people.

Doubling the speed of std::uniform_int_distribution in the GNU C++ library

The standard way in C++ to generate a random integer in a range is to use std::uniform_int_distribution. The current implementation of std::uniform_int_distribution in the GNU C++ library (libstdc++), to generate an integer in the interval [0, range], looks as follows.

scaling = random_range / range;  // first division
limit = range * scaling;
do {
    answer = random();
} while (answer >= limit);
return answer / scaling;         // second division

Typically, it requires two divisions per invocation. Even on recent processors, integer division is relatively expensive.

We can achieve the same algorithmic result with at most one division, and typically no division at all, without requiring more calls to the random number generator. It speeds things up on various systems. It was recently added to the Swift language.

Travis Downs suggested that someone should try to fix the GNU C++ implementation. So I prepared and submitted a patch. It is currently under review.

The main challenge is that we need to be able to compute the “full” product. E.g., given two 64-bit integers, we want the 128-bit result; given two 32-bit integers we want the 64-bit result. This is fast on common processors.

The 128-bit product is not natively supported in C/C++ but can be achieved with the __int128 128-bit integer extension when it is available. Thankfully, we can check for __int128 support and, when the support is lacking, fall back on another approach. The 128-bit integer extension appears in many compilers (LLVM clang, GNU GCC, Intel icc), but even within a given compiler (say GNU GCC), you cannot rely on the extension always being present, since the compiler implementation might be system specific.
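For illustration, a minimal version of such a check might rely on the __SIZEOF_INT128__ macro; the actual patch may detect support differently.

#include <cstdint>

#if defined(__SIZEOF_INT128__)
// The compiler provides a 128-bit type: compute the full 64x64-bit product
// and keep its high 64 bits.
inline uint64_t multiply_high(uint64_t a, uint64_t b) {
  return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
#else
// No 128-bit type: fall back on another approach (for example, splitting the
// operands into 32-bit halves, or keeping the division-based algorithm).
#endif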

The code I ended up with is somewhat ugly, but it works:

      // full 128-bit product of a 64-bit random word with the range
      __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
      uint64_t __lsb = uint64_t(__product);
      if (__lsb < __uerange)
      {
        // rejection threshold: the few products below it would bias the result
        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
        while (__lsb < __threshold)
        {
          __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
          __lsb = uint64_t(__product);
        }
      }
      // the high 64 bits of the product are the answer
      __ret = __product >> 64;

It mostly avoids divisions. I designed a simple random shuffling benchmark that mostly just calls std::shuffle which, in turn, relies on std::uniform_int_distribution. Given an array of one million elements, I get the following timings on a Skylake processor with GNU GCC 8.3:

before patch 15 ns/value
after patch 8 ns/value
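A minimal sketch of this kind of benchmark follows; the array contents, the seed and the timing code are illustrative choices, not necessarily those of my benchmark.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
  std::vector<uint32_t> data(1000000);
  std::iota(data.begin(), data.end(), uint32_t(0)); // fill with 0, 1, 2, ...
  std::mt19937_64 engine(1234);
  auto start = std::chrono::steady_clock::now();
  // std::shuffle draws bounded random indexes, which is where
  // std::uniform_int_distribution (or an equivalent) comes into play
  std::shuffle(data.begin(), data.end(), engine);
  auto stop = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  std::cout << ns / data.size() << " ns/value" << std::endl;
  return 0;
}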

For fun, let us compare with the implementation that is present in the Swift standard library:

     var random: T = next()
     var m = random.multipliedFullWidth(by: upperBound)
     if m.low < upperBound {
       let t = (0 &- upperBound) % upperBound
       while m.low < t {
         random = next()
         m = random.multipliedFullWidth(by: upperBound)
       }
     }
     return m.high

I think it is more readable, in part because the Swift language has built-in support for the full multiplication. It is somewhat puzzling that the C++ language does not see fit to make it easy to compute the full product of two integers.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

Science and Technology links (September 21st 2019)

    1. Amputees suffer from lack of sensory feedback from the missing limbs. Researchers found it beneficial to provide artificial sensory feedback.
    2. More economically equal societies favor people with better genes, something referred to as the Gene-Gini interplay.
    3. As we age, we tend to accumulate senescent cells. Removing senescent cells is likely to improve the health of older people. We know how to kill senescent cells in mice. Researchers have now shown that the same simple therapies work in human beings.
    4. At Harvard, 43% of the admitted caucasian students are children of faculty and staff or members of other privileged categories. If they were treated fairly, without these preferences, 3/4 of them would be rejected. In other words, Harvard is, to a first approximation, a way to reproduce the social elite. Ironically, I would guess that the majority of Harvard professors view themselves as favoring egalitarian values. Yet when it comes to their kids, they demand special treatment. (Source: Tyler Cowen)
    5. People who say that they are environmentally conscious are not less likely to buy plastic bags.

My kindergarten story

Though I was a straight-A student for most of my high school and college years, I failed kindergarten. I have told this story many times but I realize that I have never dedicated a blog post to it.

I ended up with a PhD from a top Canadian university and some of my research as an academic has been impactful.

But, as a kid, I wasn’t a good fit for schooling.

I could not tie my shoe laces. You might think that I was particularly lacking in hand-eye coordination, but I don’t think that’s the case. To this day, if you meet me, there is a good chance that my shoe laces will be undone. Yet while I am not an athlete, I can beat some hard video games that are demanding as far as reflexes and hand-eye coordination go. I don’t know exactly why I am not good with shoe laces. I can do knots, as I was a boy scout. For some reason, I am just not good with shoe laces. Yet it was one of the tests you had to pass in kindergarten.

The other test you had to pass was to learn your phone number. To this day, if you ask me for my phone number, I just don’t know it. I have to look it up, copy and paste it. So I failed this test.

You had to learn to count up to 10. I decided that since I was 5, it was fine to only know how to count up to 5. So I failed this test as well.

I probably failed all sorts of other tests as well.

So I was put in a class for students with learning disabilities. It wasn’t a punishment; I actually enjoyed it.

In the end, I would get really good grades. But my troubles were not over. Like Mandelbrot, I never learned the multiplication tables; instead I learned tricks to do multiplications in my head. Teachers tried really hard to prevent such an approach, so you had to hide your schemes. I also never learned the quadratic formula; instead I figured out (probably on my own) how to complete the square quickly enough.

What I did learn, apparently, is independent thinking. Because I had a difficult relationship with teachers and what they wanted, I think I learned to figure out what I wanted to learn on my own.

How far can you scale interleaved binary searches?

The binary search is the standard, textbook approach when searching through sorted arrays. In a previous post, I showed how you can do multiple binary searches faster if you interleave them. The core idea is that while the processor is waiting for data from one memory access, you can squeeze in memory accesses from the other searches. So you effectively compute them “in parallel” even though the code itself is single-threaded.
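As an illustration, here is a minimal sketch of two interleaved lower-bound searches; my actual benchmark interleaves many more searches and relies on conditional moves rather than plain branches, so take this only as the gist of the idea.

#include <cstddef>
#include <cstdint>
#include <vector>

// Advance two binary searches in lockstep: while one search waits on a
// cache miss, the other can issue its own memory access.
void interleaved_lower_bound(const std::vector<uint32_t>& sorted,
                             uint32_t key0, uint32_t key1,
                             std::size_t& pos0, std::size_t& pos1) {
  std::size_t lo0 = 0, hi0 = sorted.size();
  std::size_t lo1 = 0, hi1 = sorted.size();
  while (lo0 < hi0 || lo1 < hi1) {
    if (lo0 < hi0) {
      std::size_t mid0 = lo0 + (hi0 - lo0) / 2;
      if (sorted[mid0] < key0) lo0 = mid0 + 1; else hi0 = mid0;
    }
    if (lo1 < hi1) {
      std::size_t mid1 = lo1 + (hi1 - lo1) / 2;
      if (sorted[mid1] < key1) lo1 = mid1 + 1; else hi1 = mid1;
    }
  }
  pos0 = lo0;
  pos1 = lo1;
}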

I hinted that it probably made no sense to scale this up beyond 32 interleaved searches. That is, you cannot do much better than a factor of 9 speedup using a recent Intel processor (Cannon Lake), with a lesser benefit on older architectures (Skylake). I have since written a benchmark to prove that it is so by trying all possible interleave counts between 2 and 64.

So how could we squeeze even more performance? Why can’t we reach a speedup factor of 15 or 20? A significant cost might be related to pages: computers organize their memory in blocks (pages) spanning a small number of bytes (e.g., 4kB). Yet they also can only keep thousands of pages available at any one time. If you ask for a page that has not been recently accessed, the processor needs to remap it, an expensive process. If you are doing lots of random access in a large array, it is likely that you will frequently hit a page that has not been recently accessed, and you might be limited speed-wise by the resulting miss. You can alleviate this problem by asking that the operating system use “huge pages”.
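On Linux, one way to ask for huge pages is to align the allocation and call madvise with MADV_HUGEPAGE; this is a sketch of the general mechanism, not necessarily what my benchmark does, and the kernel is free to ignore the hint.

#include <stdlib.h>
#include <sys/mman.h>

// Allocate a buffer aligned on a 2 MB boundary and hint to the Linux kernel
// that it should back it with (transparent) huge pages.
void* allocate_with_huge_pages(size_t bytes) {
  void* p = nullptr;
  if (posix_memalign(&p, 1 << 21, bytes) != 0) return nullptr;
  madvise(p, bytes, MADV_HUGEPAGE); // only a hint: the kernel may ignore it
  return p;
}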

I present the results as a figure which I hope will be intuitive. There is quite a bit of measurement noise, which I leave as a sort of error bar.

  • Huge pages help: we go from a maximum speedup of about 9 to 11. This suggests that page size is a concern. Using huge pages is not very practical, however. Thankfully, there are other ways to alleviate page issues, such as well-timed prefetching.
  • The plot shows that there is a very nice performance progression up to about 9 parallel searches, and that up to 9 parallel searches, there is little difference between the version with regular pages and the version with huge pages. I suspect that up to that point (9 searches), we are not fast enough for page size to be a performance concern.
  • Whether we use regular or huge pages, we reach a “noisy” plateau, some kind of hard speed limit. Furthermore, the performance does not scale linearly with the number of parallel searches; even where it still improves, the slope is much smaller than one. What might be the limiting factors? I am not sure. If you know, please share your thoughts!

My code is available; it should run everywhere C++ runs. Yet the benchmark assumes that you use the GNU GCC compiler, because I do not know how to deliberately generate the needed conditional move instructions with other compilers. Again, this is the result of joint work with Nathan Kurz and Travis Downs, but the mistakes are mine.

Science and Technology links (September 14th 2019)

  1. Streaming music makes up 80% of the revenue of the music industry. Revenue is up 18% for the first six months of 2019. This follows a record year in 2018 when the music industry reached its highest revenues in 10 years. Though it should be good news for musicians, it is also the case that record labels often take the lion’s share of the revenues.
  2. We have seen massive progress in the last five years in artificial intelligence. Yet we do not see obvious signs of economic gains from this progress.
  3. A widely cited study contradicting the existence of free will was fatally flawed.
  4. A common diabetes drug might spur neurogenesis and brain repair in (female) mice.
  5. Facebook is working on creating real-time models of your face for use in virtual reality.
  6. A simple and inexpensive eye test might be sufficient to detect early signs of Alzheimer’s.