Adding a (predictable) branch to existing code can increase branch mispredictions

Software is full of “branches”. They often take the form of if-then clauses in code. Modern processors try to predict the result of branches, often long before evaluating them. Hard-to-predict branches are a performance challenge: when a processor fails to correctly predict a branch, it does useless work that must be thrown away.

A convenient illustration is an algorithm that generates a random number and only appends it to a list if the random number is odd*. When the numbers are genuinely random, half of the branches will be mispredicted. However, if we generate the same 2000 numbers using a pseudo-random number generator, the processor might learn to predict more accurately which numbers are odd.

while (howmany != 0) {
    randomval = random();
    if (randomval is odd)
        append randomval to array
    howmany--;
}

What if we add a predictable branch? Let us say that we check whether the random 64-bit value is equal to some arbitrary number. This new branch will be easily predicted as false.

while (howmany != 0) {
    randomval = random();
    if (randomval is 12313132)
        generate error
    if (randomval is odd)
        append randomval to array
    howmany--;
}
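
In compilable C++, the two-branch loop might look as follows. This is only a sketch, not the exact benchmark code: random64() is a stand-in for a 64-bit pseudo-random generator.

#include <cstdint>
#include <cstdlib>

uint64_t random64(); // stand-in for your favourite 64-bit generator

// One easy-to-predict branch (the equality check, almost never taken)
// plus one hard-to-predict branch (the odd check).
size_t fill_odd(uint64_t *array, size_t howmany) {
    size_t pos = 0;
    while (howmany != 0) {
        uint64_t randomval = random64();
        if (randomval == 12313132)
            abort(); // "generate error"
        if (randomval % 2 == 1)
            array[pos++] = randomval; // append when odd
        howmany--;
    }
    return pos;
}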

Since the new branch is predictable, maybe it comes nearly for free?

Let us run 10 trials of the first algorithm, then 10 trials of the second, and so forth repeatedly, until the branch predictor is practically stable.

Let us count the number of mispredicted branches per loop iteration. We added an easy-to-predict branch, so it should not contribute directly to the number of mispredicted branches. I get the following numbers…

processor        one hard branch   one hard, one easy branch
Intel Skylake    4% to 9%          30% to 40%
ARM A72          24% to 26%        49% to 51%

So, at least in this particular test, the mere addition of an easy-to-predict branch substantially increased the number of mispredicted branches.

My source code is available.

Note: The loop itself is an easily-predicted branch since the processor must determine whether it continues for another iteration or not at the end of each iteration.

*- It is not a practical algorithm; it only serves to illustrate my point.

Parsing numbers in C++: streams, strtod, from_chars

When programming, we often want to convert strings (e.g., “1.0e2”) into numbers (e.g., 100). In C++, we have many options. In a previous post, I reported that it is an expensive process when using the standard approach (streams).

Many people pointed out to me that there are faster alternatives. In C++, we can use the C approach (e.g., strtod). We can also use the from_chars function. The net result is slightly more complicated code.

do {
    number = strtod(s, &end);
    if(end == s) break;
    sum += number;
    s = end; 
} while (s < theend);
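
For integers, the from_chars version might look as follows. This is a sketch rather than my benchmark code; note that, unlike strtod, from_chars does not skip whitespace, so we skip the separators ourselves.

#include <charconv>
#include <cstdint>

// Sum space-separated integers in [s, theend) using std::from_chars.
uint64_t sum_integers(const char *s, const char *theend) {
    uint64_t sum = 0;
    while (s < theend) {
        uint64_t number;
        auto [ptr, ec] = std::from_chars(s, theend, number);
        if (ec != std::errc()) break;        // stop at the first parse failure
        sum += number;
        s = ptr;                             // from_chars reports where it stopped
        while (s < theend && *s == ' ') s++; // skip the separator
    }
    return sum;
}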

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor.

integers (stream) 200 MB/s
integers (from_chars) 750 MB/s
floats (stream) 50 MB/s
floats (strtod) 90 MB/s

We see that for integers, the from_chars function is almost four times faster than the stream approach. Unfortunately, my compiler does not support the from_chars function when parsing floating-point numbers. However, I can rely on the similar C function (strtod). It is nearly twice as fast as the stream approach. Even so, it still costs nearly 38 cycles per byte to parse floating-point numbers.

For each floating-point number, there are almost 10 branch misses in my tests, even though I generate numbers using a fixed format. The number of branch misses is nearly the same whether we use a C++ stream or the C function.

My source code is available.

Science and Technology links (October 26th 2019)

  1. People who were the oldest in their school classes tend to be more confident and to take more risks.
  2. At the University of Montreal, about 32% of the students are male and 68% are female.
  3. Did blind orchestra auditions really benefit women?
  4. Wealthy people are happier, but the effect is small.
  5. Among Uber consumers, 60% never tip and only 1% always tip.
  6. How do you know how far away a given object is? A theory is that we rely on our familiarity with the object: you know how big a car can be, so if you see a tiny car, you know that it must be far away. However, it turns out that we are no better at estimating distances when we are familiar with the objects.
  7. What does it take to become a professor in a good university? In psychology, you need to have written about 16 papers, half of which as the first author; and most of the new hires have completed a postdoc or started from a job at a lesser institution.
  8. The ends of our chromosomes are capped by repeated sequences called telomeres. These sequences do not contribute directly to your genetic code. Instead, they are used when the cells divide to hold the chromosome while it is being copied. With each cell division, the telomeres get shorter. Eventually your cells can no longer divide, unless they use a trick to elongate the telomeres (e.g., telomerase). Thus telomeres act as a clock in aging. It is also sometimes believed that telomeres protect us against cancer because a single cell can’t reproduce endlessly, unless it manages somehow to elongate its telomeres. So what happens when you create mice with longer-than-usual telomeres? They live longer, they are fitter, and they do not have higher cancer rates:

    (…) we demonstrate here that it is possible to generate mice that have telomeres which are much longer than those of the natural species (…) These mice show a younger phenotype as indicated by improved mitochondrial function, improved metabolic parameters, decreased cancer, and increased longevity. These results also suggest that there is not a negative selection for individuals with longer telomeres than normal in species, and therefore, one can envision that natural selection processes which favor individuals with longer telomeres within a given species, could potentially increase species longevity.

    Source: Nature.

    It seems credible that we could engineer other longer-lived species by manipulating the telomere lengths.

  9. Daily vitamin D supplements improve cognition in Alzheimer’s patients.
  10. We have never seen so much ice coverage in Antarctica. (Source: NASA)

How expensive is it to parse numbers from a string in C++?

In C++, there are multiple ways to convert strings into numbers. E.g., to convert the string “1.0e2” into the number 100.

In software, we frequently have to parse numbers from strings. Numbers are typically represented in computers as 32-bit or 64-bit words whereas strings are variable-length sequences of bytes. We need to go from one representation to the other.

For example, given the 3-character string “123”, we want to recover the integer value 123, which is typically stored as a byte value 0x7b followed by some zero bytes.

Though parsing integers is not trivial, parsing floating-point numbers is trickier: e.g., the strings “1e10”, “10e9”, “100e+8” and “10000000000” all map to the same value. Typically, we store floating-point values as either 32-bit or 64-bit values using the IEEE 754 standard. Parsing may fail in various fun ways even if we set aside garbage inputs that have no reasonable number value (like “4a.14x.10.14”). For example, the string “1e500” cannot be represented, typically, in software (except as infinity). The string “18446744073709551616” cannot be represented as a 64-bit integer. In some instances, we need to “round” the value: the string “1.0000000000000005” cannot be represented exactly as a 64-bit floating-point value, so we need to round it down to “1.0000000000000004”, while we might round “1.0000000000000006” up to “1.0000000000000007”.
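
You can observe these edge cases with a few lines of C++ (a minimal sketch using strtod):

#include <cstdio>
#include <cstdlib>

int main() {
    // overflows the range of a 64-bit float: we get infinity
    std::printf("%g\n", std::strtod("1e500", nullptr));   // prints: inf
    // rounds to the nearest representable 64-bit float
    std::printf("%.17g\n",
        std::strtod("1.0000000000000005", nullptr));      // prints: 1.0000000000000004
}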

How expensive is it to parse numbers? To test it out, I wrote a long string containing space-separated numbers. In one case, I generated random 64-bit unsigned integers, and another I generated random 64-bit normal floating-point numbers. Then I benchmark standard C++ parsing code that simply sums up the numbers:

std::stringstream in(mystring);
while (in >> x) { // x and sum are numeric variables (e.g., uint64_t or double)
    sum += x;
}
return sum;

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor. I find that parsing numbers is relatively slow:

integers   360 cycles/integer   18 cycles/byte   200 MB/s
floats    1600 cycles/float     66 cycles/byte    50 MB/s

Most disks and most networks can do much better than 50 MB/s. And good disks and good networks can beat 200 MB/s by an order of magnitude.

My source code is available.

Benchmarking is hard: processors learn to predict branches

A lot of software is an intricate collection of branches (if-then clauses). For performance reasons, modern processors predict the results of these branches.

In my previous post, I showed how the bulk of your running time could be due to mispredicted branches. My benchmark consisted of writing 64 million random integers to an array. When I tried to write only the odd random integers, the performance became much worse, due to the mispredictions.

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Why 64 million integers and not, say, 2000? If you run just one test, then it should not matter. However, what if you are running multiple trials? Quickly, the number of mispredicted branches falls toward zero. The numbers on an Intel Skylake processor speak for themselves:

trial mispredicted branches (Intel Skylake)
1 48%
2 38%
3 28%
4 22%
5 14%

The “learning” keeps on going as the following plot shows… Eventually, the fraction of mispredicted branches falls to about 2%.

That is, as you keep measuring how long the same task takes, it gets faster and faster because the processor learns to better predict the outcome. The quality of the “learning” depends on your exact processor, but I would expect that newer processors should learn better. Importantly, from run to run, we generate the same random integers.

The latest server processors from AMD learn to predict the branch perfectly (within 0.1%) in under 10 trials.

trial mispredicted branches (AMD Rome)
1 52%
2 18%
3 6%
4 2%
5 1%
6 0.3%
7 0.15%
8 0.15%
9 0.1%

This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.

You should probably avoid benchmarking branchy code over small problems.

My code is available.

Credit: The AMD Rome numbers were provided by Velu Erwan.

Further reading: A case for (partially) TAgged GEometric history length branch prediction (Seznec et al.)

Mispredicted branches can multiply your running times

Modern processors are superscalar, meaning that they can execute many instructions at once. For example, some processors can retire four or six instructions per cycle. Furthermore, many of these processors can initiate instructions out-of-order: they can start working on instructions that appear much later in the code.

Meanwhile most code contains branches (if-then clauses). These branches are often implemented as “jumps”: the processor either jumps to an instruction further away or continues on its current path.

It is hard to reconcile branches with out-of-order superscalar execution. To do so, processors have sophisticated branch predictors. That is, the processor tries to predict the future. When it sees a branch, and thus a jump, it tries to guess which way it will go.

This often works quite well. For example, most loops are actually implemented as branches. At the end of each iteration in the loop, the processor must predict whether there will be a next iteration. It is often safe for the processor to predict that the loop will continue (forever). The processor will only mispredict one branch per loop in this manner.

There are other common examples. If you are accessing the content of an array, many languages will add “bound checking”: before accessing the array value, there will be a hidden check to see whether the index is valid. If the index is not valid, then an error is generated, otherwise the code proceeds normally. Bound checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.

What happens if the branch is hard to predict?

What happens inside the processor is that all of the instructions that were executed but follow the mispredicted branch must be cancelled, and the computation must start anew. You should expect a penalty of more than 10 cycles for each mispredicted branch. It can multiply the running times.

Let us look at some simple code where we write out random integers to an output array:

while (howmany != 0) {
    out[index] =  random();
    index += 1;
    howmany--;
}

We can generate a decent random number in about 3 cycles on average. The total latency of the random number generator might be 10 cycles, but our processor is superscalar, so we can work on several random-number computations at once. Thus we may generate a new random number every 3 cycles or so.

Let us modify this function slightly so that we only write out the odd integers:

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Naively we might think that this new function could be faster. Indeed, we might only write out one out of two integers. We have a branch, but checking whether an integer is odd requires checking a single bit.

I benchmarked these two functions in C++ on a Skylake processor:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer

The second function takes about five times longer!

Is there something you can do? Yes. You can just remove the branch. You can characterize an odd integer by the fact that its bitwise logical AND with the value 1 is one. The trick is to increment the array index by one only when the random value is odd.

while (howmany != 0) {
    val = random();
    out[index] = val;
    index += (val bitand 1);
    howmany--;
}

In this new version, we always write the random value to the output array, even when it is not needed. At a glance, it looks wasteful. However, it does away with the mispredicted branches. In practice the performance is nearly as good as the original code, and much better than the version with branches:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer
branch removed 3.8 cycles per integer
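
For reference, the branchless loop might be written in compilable C++ as follows (a sketch; random64() is again a stand-in for the generator):

#include <cstdint>
#include <cstdlib>

uint64_t random64(); // stand-in for the random number generator

// Branchless version: always store the value, but only advance the
// index when the value is odd.
void write_odd(uint64_t *out, size_t howmany) {
    size_t index = 0;
    while (howmany != 0) {
        uint64_t val = random64();
        out[index] = val;
        index += (val & 1); // advance only for odd values
        howmany--;
    }
}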

Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.
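
For instance, a simple ternary expression is the kind of pattern a compiler may turn into a conditional move; whether it actually does depends on the compiler and the optimization flags:

#include <cstdint>

// Often compiled to a conditional move (cmov) rather than a jump,
// so there is no branch to mispredict.
uint64_t select_value(bool condition, uint64_t a, uint64_t b) {
    return condition ? a : b;
}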

An important take-away is that mispredicted branches are not a minor concern; they can have a large effect.

My source code is available.

Further reading: Processors learn to predict branches (follow-up blog post).

Science and Technology links (October 12th 2019)

  1. In many countries, like Canada, there is relatively little private (business) research. Meanwhile, other research indicates that private research is precisely the kind of research that leads directly to productivity growth. Private research seems concentrated in hubs like Silicon Valley. To entice businesses to do research, the Canadian government has pursued an aggressive tax credit strategy: roughly speaking, research becomes a tax avoidance strategy. It does not work well.
  2. Artificial ovaries work in mice, and we think that they may work in women too.
  3. Alzheimer’s research has failed us despite massive investments. It might be time for a reset.
  4. Drinking alcohol triggers a hormone (FGF21) that makes you more likely to drink water.
  5. Dog owners are less likely to die.
  6. The greatest inequality might be in dating and sex. The bottom 80% of men (in terms of attractiveness) are competing for the bottom 22% of women and the top 78% of women are competing for the top 20% of men. In other words, a small number of men have their pick of all of the women and most guys are unattractive to most women.
  7. The U.S. Navy looked at skin cancer rates among its staff. They found the highest rate of skin cancer in people working essentially indoors.
  8. Fruit flies exposed to a combo of three different drugs lived 48% longer, even though the individual effect of each drug is relatively small.
  9. Moderate alcohol consumption is associated with reduced inflammation and improved responses to vaccination.
  10. Choline might protect against Alzheimer’s. Choline is found in meat among other places.
  11. You may have read that we find symmetric faces more attractive. A new study challenges this claim.
  12. Iron might be an aging factor.
  13. Climate models routinely predict higher temperatures than what is actually observed. Where does the heat go? A paper in Nature claimed that the heat in question went into the ocean. Nature withdrew the paper in question as it contained too many data processing mistakes.
  14. Alpha-ketoglutarate can significantly extend lifespan and healthspan in mice. You can get it in the form of cheap supplements.

The price of a MacBook Air worldwide

Apple sells identical laptops worldwide. There might be small differences with respect to power adaptors and so forth, but the laptops are the same internally. Of course, there are differences in taxes, but Apple quotes prices with taxes included. So I went shopping for a basic MacBook Air in different countries, using Apple’s web site.

country     local price    in US dollars
USA         US$1099        US$1099
Canada      CA$1449        US$1089
Japan       ¥119,800       US$1120
Australia   A$1,699        US$1146
France      1 249,00 €     US$1374

According to these numbers, the cheapest MacBooks are in Canada and the most expensive ones are in France.

In countries where taxes vary by province or state, the Apple tax estimate might be wrong for some locations.

Science and Technology links (September 28th 2019)

  1. Researchers have effectively rejuvenated the damaged skin of mice by using “exosomes”. These are packages that cells send into their environment, and it appears that other cells react positively to exosomes in some conditions. The therapy is non-invasive (no needle!) and it appears to be superior to existing skin rejuvenation treatments (e.g., retinol, stem cells). Skin thickness and collagen production were both improved.
  2. Researchers have effectively created new neurons in the brain of mice by converting existing glial cells (which are abundant) into neurons. This could be used one day to help patients recover from brain injuries or strokes.
  3. About one in six deaths in the USA is due to lead exposure. Lead accumulates in your body over time and significantly increases your risk of cardiovascular diseases, among other bad things. There are chelation therapies to lower your lead levels, but they are not in wide use even though they are surprisingly effective at improving the health of some sick people.

Doubling the speed of std::uniform_int_distribution in the GNU C++ library

The standard way in C++ to generate a random integer in a range is to call std::uniform_int_distribution. The current implementation in the GNU C++ library (libstdc++) generates an integer in the interval [0,range] as follows.

scaling = random_range / range;
limit = range * scaling;
do
    answer = random;
while (answer >= limit);
return answer / scaling;

Typically, it requires two divisions per invocation. Even on recent processors, integer division is relatively expensive.

We can achieve the same algorithmic result with at most one division, and typically no division at all, without requiring more calls to the random number generator. It speeds things up on various systems. It was recently added to the Swift language.

Travis Downs suggested that someone should try to fix the GNU C++ implementation. So I prepared and submitted a patch. It is currently under review.

The main challenge is that we need to be able to compute the “full” product. E.g., given two 64-bit integers, we want the 128-bit result; given two 32-bit integers we want the 64-bit result. This is fast on common processors.

The 128-bit product is not natively supported in C/C++ but can be achieved with the __int128 128-bit integer extension when it is available. Thankfully, we can check for __int128 support, and when the support is lacking, we can fall back on another approach. The 128-bit integer extension appears in many compilers (LLVM clang, GNU GCC, Intel icc), but even within a given compiler (say GNU GCC), you cannot rely on the extension being always present, since the compiler implementation might be system specific.
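
A sketch of the full multiplication using the __int128 extension follows; real library code needs a portable fallback when the extension is missing:

#include <cstdint>

// Full 64-bit by 64-bit product, keeping the high 64 bits of the
// 128-bit result; requires the __int128 extension.
inline uint64_t mul_high(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}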

The code I ended up with is somewhat ugly, but it works:

      // multiply a full-range random draw by the (inclusive) range size;
      // the top 64 bits of the 128-bit product fall in [0, __uerange)
      __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
      uint64_t __lsb = uint64_t(__product);
      if (__lsb < __uerange)
      {
        // rejection step: redraw the few values that land in the biased
        // region; __threshold is (2^64 - __uerange) mod __uerange
        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
        while (__lsb < __threshold)
        {
          __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
          __lsb = uint64_t(__product);
        }
      }
      __ret = __product >> 64;

It mostly avoids divisions. I designed a simple random shuffling benchmark that mostly just calls std::shuffle, which, in turn, relies on std::uniform_int_distribution. Given an array of one million elements, I get the following timings on a Skylake processor with GNU GCC 8.3:

before patch 15 ns/value
after patch 8 ns/value

For fun, let us compare with the implementation that is present in the Swift standard library:

     var random: T = next()
     var m = random.multipliedFullWidth(by: upperBound)
     if m.low < upperBound {
       let t = (0 &- upperBound) % upperBound
       while m.low < t {
         random = next()
         m = random.multipliedFullWidth(by: upperBound)
       }
     }
     return m.high

I think it is more readable, in part because the Swift language has built-in support for the full multiplication. It is somewhat puzzling that the C++ language does not see fit to make it easy to compute the full product of two integers.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

Science and Technology links (September 21st 2019)

    1. Amputees suffer from lack of sensory feedback from the missing limbs. Researchers found it beneficial to provide artificial sensory feedback.
    2. More economically equal societies favor people with better genes, something referred to as the Gene-Gini interplay.
    3. As we age, we tend to accumulate senescent cells. Removing senescent cells is likely to improve the health of older people. We know how to kill senescent cells in mice. Researchers have now shown that the same simple therapies work in human beings.
    4. At Harvard, 43% of the admitted caucasian students are children of faculty and staff or other privileged categories. If they were treated fairly, and not as children of faculty and staff, 3/4 of them would be rejected. In other words, Harvard is, to a first approximation, a way to reproduce the social elite. Ironically, I would guess that the majority of Harvard professors view themselves as favoring egalitarian values. Yet when it comes to their kids, they demand special treatment. (Source: Tyler Cowen)
    5. People who say that they are environmentally conscious are not less likely to buy plastic bags.

My kindergarten story

Though I was a straight-A student for most of my high school and college years, I failed kindergarten. I have told this story many times but I realize that I have never dedicated a blog post to it.

I ended up with a PhD from a top Canadian university and some of my research as an academic has been impactful.

But, as a kid, I wasn’t a good fit for schooling.

I could not tie my shoe laces. You might think that I was particularly lacking in hand-eye coordination, but I don’t think that’s the case. To this day, if you meet me, there is a good chance that my shoe laces will be undone. Yet while I am not an athlete, I can beat some hard video games that are demanding as far as reflexes and hand-eye coordination go. I don’t know exactly why I am not good with shoe laces. I can do knots, as I was a boy scout. For some reason, I am just not good with shoe laces. Yet it was one of the tests you had to pass in kindergarten.

The other test you had to pass was to learn your phone number. To this day, if you ask me for my phone number, I just don’t know it. I have to look it up, copy and paste it. So I failed this test.

You had to learn to count up to 10. I decided that since I was 5, it was fine to only know how to count up to 5. So I failed this test as well.

I probably failed all sorts of other tests as well.

So I was put in a class for students with learning disabilities. It wasn’t a punishment, I actually enjoyed it.

In the end, I would get really good grades. But my troubles were not over. Like Mandelbrot, I never learned the multiplication tables; instead, I learned tricks to do multiplications in my head. Teachers tried really hard to prevent such an approach, so you had to hide your schemes. I also never learned the quadratic formula; instead, I figured out (on my own, probably) how to complete the square quickly enough.

What I did learn, apparently, is independent thinking. Because I had a difficult relationship with teachers and what they wanted, I think I learned to figure out what I wanted to learn on my own.

How far can you scale interleaved binary searches?

The binary search is the standard, textbook approach when searching through sorted arrays. In a previous post, I showed how you can do multiple binary searches faster if you interleave them. The core idea is that while the processor is waiting for data from one memory access, you can squeeze in several more from another search. So you effectively compute them “in parallel” even though the code itself is single-threaded.

I hinted that it probably made no sense to scale this up beyond 32 interleaved searches. That is, you cannot do much better than a factor of 9 speedup using a recent Intel processor (Cannon Lake), with a lesser benefit on older architectures (Skylake). I have since written a benchmark to prove that it is so by trying all possible interleave counts between 2 and 64.

So how could we squeeze out even more performance? Why can’t we reach a speedup factor of 15 or 20? A significant cost might be related to pages: computers organize their memory in blocks (pages) spanning a small number of bytes (e.g., 4kB). Yet they can only keep thousands of pages quickly accessible at any one time. If you ask for a page that has not been recently accessed, the processor needs to remap it, an expensive process. If you are doing lots of random accesses in a large array, it is likely that you will frequently hit a page that has not been recently accessed, and you might be limited speed-wise by the resulting misses. You can alleviate this problem by asking that the operating system use “huge pages”.
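
On Linux, you can hint that a buffer should use huge pages with madvise; this is a sketch, and the kernel is free to ignore the hint:

#include <cstdlib>
#include <sys/mman.h>

// Allocate a buffer and hint that it should be backed by transparent
// huge pages. Linux-specific; bytes should be a multiple of the 2 MB
// alignment.
void *alloc_huge(size_t bytes) {
    void *ptr = aligned_alloc(1 << 21, bytes); // 2 MB boundary
    if (ptr != nullptr)
        madvise(ptr, bytes, MADV_HUGEPAGE);    // hint only; may be ignored
    return ptr;
}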

I present the results as a figure which I hope will be intuitive. There is quite a bit of measurement noise, which I leave as a sort of error bar.

  • Huge pages help: we go from a maximum speedup of about 9 to 11. This suggests that page size is a concern. Using huge pages is not very practical, however. Thankfully, there are other ways to alleviate page issues, such as well timed prefetching.
  • The plot shows that there is a very nice performance progression up to about 9 parallel searches, and that up to 9 parallel searches, there is little difference between the version with regular pages and the version with huge pages. I suspect that up to that point (9 searches), we are not fast enough for page size to be a performance concern.
  • Whether we use regular or huge pages, we reach a “noisy” plateau, some kind of hard speed limit. Furthermore, the performance does not scale linearly with the number of parallel searches, and even where it grows, the slope is much smaller than one. What might be the limiting factors? I am not sure. If you know, please share your thoughts!

My code is available, it should run everywhere C++ runs. Yet the benchmark assumes that you use the GNU GCC compiler, because I do not know how to deliberately generate the needed conditional move instructions with other compilers. Again, this is the result of joint work with Nathan Kurz and Travis Downs, but the mistakes are mine.

Science and Technology links (September 14th 2019)

  1. Streaming music makes up 80% of the revenue of the music industry. Revenue is up 18% for the first six months of 2019. This follows a record year in 2018 when the music industry reached its highest revenues in 10 years. Though it should be good news for musicians, it is also the case that record labels often take the lion’s share of the revenues.
  2. We have seen massive progress in the last five years in artificial intelligence. Yet we do not see obvious signs of economic gains from this progress.
  3. A widely cited study contradicting the existence of free will was fatally flawed.
  4. A common diabetes drug might spur neurogenesis and brain repair in the brains of (female) mice.
  5. Facebook is working on creating real-time models of your face for use in virtual reality.
  6. A simple and inexpensive eye test might be sufficient to detect early signs of Alzheimer’s.

Speeding up independent binary searches by interleaving them

Given a long list of sorted values, how do you find the location of a particular value? A simple strategy is to first look at the middle of the list. If your value is larger than the middle value, look at the last half of the list; otherwise, look at the first half. Then repeat with the selected half, looking again at its middle. This algorithm is called a binary search. It is one of the first algorithms that computer science students learn, and I suspect that many people figure it out as kids.

It is hard to drastically improve on the binary search if you only need to do one.

But what if you need to execute multiple binary searches, over distinct lists? Maybe surprisingly, in such cases, you can multiply the speed.

Let us first reason about how a binary search works. The processor needs to retrieve the middle value and compare it with your target value. Before the middle value is loaded, the processor cannot tell whether it will need to access the last or first half of the list. It might speculate. Most processors have branch predictors. In the case of a binary search, the branch is hard to predict so we might expect that the branch predictor will get it wrong half the time. Still: when it speculates properly, it is a net win. Importantly, we are using the fact that processors can do multiple memory requests at once.

What else could the processor do? If there are many binary searches to do, it might initiate the second one. And then maybe initiate the third one and so forth. This might be a lot more beneficial than speculating wastefully on a single binary search.

How can it start the second or third search before it finishes the current one? Again, thanks to speculation. If it is far enough along in the first search to predict its end, it might see the next search coming and start it even though it is still waiting for data regarding the current binary search. This should be especially easy if your sorted arrays have a predictable size.

There is a limit to how many instructions your processor might reorder in this manner. Let us say, for example, that the limit is 200 instructions. If each binary search takes 100 instructions, and this value is reliable (maybe because the arrays have fixed sizes), then your processor might be able to do up to two binary searches at once. So it might go twice as fast. But it probably cannot easily go much further (to 3 or 4).

But the programmer can help. We can manually tell the processor to initiate four binary searches right away:

given arrays a1, a2, a3, a4
given target t1, t2, t3, t4

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

cut a1 in two based on above comparison
cut a2 in two based on above comparison
cut a3 in two based on above comparison
cut a4 in two based on above comparison

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

...

You can go far higher than 4 interleaved searches. You can do 8, 16, 32… I suspect that there is no practical need to go beyond 32.

How well does it work? Let us take 1024 sorted arrays containing a total of 64 million integers. In each array, I want to do one and just one binary search. Between each test, I access all of the data in all of the arrays a couple of times to fool the cache.

By default, if you code a binary search, the resulting assembly will be made of comparisons and jumps. Hence your processor will execute this code in a speculative manner. At least with GNU GCC, we can write the C/C++ code in such a way that the branches are implemented as “conditional move” instructions which prevents the processor from speculating.
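
To give an idea, here is a sketch of such a branchless binary search; the ternary is the line that GNU GCC tends to compile to a conditional move:

#include <cstddef>
#include <cstdint>

// Branchless lower bound: each step halves the range with a conditional
// move instead of a jump. Assumes n >= 1 and a sorted array.
const int32_t *lower_bound_branchless(const int32_t *base, size_t n,
                                      int32_t target) {
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half] < target) ? base + half : base; // cmov, not a jump
        n -= half;
    }
    return base + (*base < target);
}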

My results on a recent Intel processor (Cannon Lake) with GNU GCC 8 are as follows:

algorithm              time per search       relative speed
1-wise (independent)   2100 cycles/search    1.0 ×
1-wise (speculative)    900 cycles/search    2.3 ×
1-wise (branchless)    1100 cycles/search    2.0 ×
2-wise (branchless)     800 cycles/search    2.5 ×
4-wise (branchless)     600 cycles/search    3.5 ×
8-wise (branchless)     350 cycles/search    6.0 ×
16-wise (branchless)    280 cycles/search    7.5 ×
32-wise (branchless)    230 cycles/search    9.0 ×

So we are able to go 3 to 4 times faster, on what is effectively a memory bound problem, by interleaving 32 binary searches together. Interleaving merely 16 searches might also be preferable on some systems.

But why is this not higher than 4 times faster? Surely the processor can issue more than 4 memory loads at once?

That is because the 1-wise search, even without speculation, already benefits from the fact that we are streaming multiple binary searches, so that more than one is ongoing at any one time. Indeed, I can prevent the processor from usefully executing more than one search either by inserting memory fences between each search, or by making the target of one search dependent on the index found in the previous search. When I do so, the time goes up to 2100 cycles/array, which is approximately 9 times longer than the 32-wise approach. The exact ratio (9) varies depending on the exact processor: I get a factor of 7 on the older Skylake architecture and a factor of 5 on an ARM Skylark processor.

My source code is available.

Implementation-wise, I code my algorithms in pure C/C++. There is no need for fancy, esoteric instructions. The conditional move instructions are pretty much standard and old at this point. Sadly, I only know how to convince one compiler (GNU GCC) to reliably produce conditional move instructions. And I have no clue how to control how Java, Swift, Rust or any other language deals with branches.

Could you do better than my implementation? Maybe but there are arguments suggesting that you can’t beat it by much in general. Each data access is done using fewer than 10 instructions in my implementation, which is far below the number of cycles and small compared to the size of the instruction buffers, so finding ways to reduce the instruction count should not help. Furthermore, it looks like I am already nearly maxing out the amount of memory-level parallelism at least on some hardware (9 on Cannon Lake, 7 on Skylake, 5 on Skylark). On Cannon Lake, however, you should be able to do better.

Credit: This work is the result of a collaboration with Travis Downs and Nathan Kurz, though all of the mistakes are mine.

Science and Technology links (September 7th 2019)

  1. In a small clinical trial, scientists administered some “anti-aging” therapies to people in their fifties and sixties. They used growth hormone to regenerate the thymus, a part of our immune system which is nearly all gone when people reach 50 years old. They complemented this therapy with anti-diabetes drugs (dehydroepiandrosterone and metformin) as an attempt to compensate for the fact that growth hormone is pro-diabetic. The researchers found that in a majority of the participants, the thymus was indeed regenerated: to my knowledge, this is a breakthrough in human beings (we only knew that it worked in animal models). Further, using an epigenetic clock, they found that participants’ biological age was reversed, by 2.5 years on average. The result persisted six months after stopping the trial. The surprising results should be viewed with caution given that it was a small trial with no control group. Nevertheless, to my knowledge, it is the first time that biological aging has been reversed in human beings following a supervised therapy. The article is short and readable. (Source: Nature)
  2. Researchers found out how to edit large pieces of DNA, making it potentially easier than before to construct synthetic genomes. (Source: Science)
  3. Mathematician Hanna Fry on statin drugs:

    Of a thousand people who take statins for five years, the drugs will help only eighteen to avoid a major heart attack or stroke.

  4. Chronological aging (how many years you lived) and biological aging (loss of fitness with time) differ. There is apparently a tendency for human beings to biologically age slower. The trend was confirmed between 1988 and 2010. Older men benefited most from this trend. I do not think that we fully understand this phenomenon.
  5. A woman received a cornea made from reprogrammed stem cells.
  6. Drones are used in China to spray insecticides more efficiently than farmers can.
  7. If you go to Harvard University, you will have access to the best teaching and the best professors, won’t you? Not so fast, write Hacker and Dreifus:

    At Harvard, even untenured assistant professors get a fully paid year to complete a promotion-worthy book. Thus in a recent year, of its history department’s six assistant professors, only two were on hand to teach classes. In Harvard’s department of philosophy that same year, almost half of its full-time faculty were away on sabbaticals. Of course it was the students who paid. Many of their undergraduate courses were canceled or given by one-year visitors unfamiliar with the byways of the university.

    Harvard undergraduate students give their professors C- on classroom performance.

  8. Regarding anxiety, no psychological intervention had greater effects compared with placebo.
  9. Men who are avid consumers of pornography are no more sexist or misogynistic than the general public. In fact, pornography users hold more gender egalitarian views.
  10. A new drug was developed quickly using deep learning.
  11. A software program can pass an 8th grade science test.
  12. Our cells can replicate only so many times before their telomeres are too short. To keep on dividing, cells must use an enzyme called telomerase. It was believed until now that, except for cancer, most of our cells (somatic cells) did not use telomerase. It appears that this belief is wrong: when our cells have short telomeres, they will sometimes use telomerase to protect themselves.

Passing integers by reference can be expensive…

In languages like C++, you can pass values to functions in two ways. You can pass by value: the value is semantically “copied” before being passed to the function. Any change you make to the value within the function will be gone after the function terminates. The value is ephemeral. You can also pass by pointer or reference: semantically, we point the new function at the value we already have. Changes made within the function to the value will persist after the function returns.

As a rule of thumb, whenever a data element is large (e.g., it cannot fit in registers), it is probably faster to pass by reference or pointer. For tiny data elements that fit in registers, like integers, passing them by value ought to be best.

However, sometimes it is convenient to pass integer values by reference because you want to capture how the value changed during the scope of the function. C++ makes this trivially easy as you just need to add the ampersand (&) in front of the parameter. It looks “free”.

Let us consider the silly example of some function that passes a counter (the variable ‘i’) which gets incremented and added to multiple locations in an array:

void incrementr(uint64_t &i, uint64_t *x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
}

We can write an almost entirely equivalent function without using a reference. We just take the counter by value, modify it, and then return it at the end of the function:

uint64_t increment(uint64_t i, uint64_t *x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
  return i;
}

I expect these two types of functions to be largely equivalent semantically, as long as the counter is assumed not to reside in the array (x). What about performance? The function call itself is different, so in the best of cases there might be a couple of extra move instructions in the pass-by-reference version due to calling conventions. However, compilers, like GNU GCC, produce vastly different code that goes far beyond a few extra move instructions at the start and end of the function.

GNU GCC 8.3 keeps the value of the passed-by-reference value in memory throughout instead of using a register, due to an aliasing issue. Thus each time you access the counter, you need to reload it. If you modify it, you need to store the new value again. The net result is far worse performance on my Skylake processor…

            cycles per value   instructions per value
reference   5.9                7.0
value       1.3                3.5

The effect is quite large as you can see. My source code is available.

You can avoid the penalty by copying the passed-by-reference variable to a local variable at the start, and copying it back at the end of the function. Or, if your compiler supports it, you can add the __restrict qualifier to your reference.
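
Concretely, the workaround might look like the following sketch (with SIZE as in the earlier snippets):

#include <cstdint>

// Work on a local copy so that the counter can live in a register
// inside the loop; write it back once at the end.
void incrementr_fixed(uint64_t &i, uint64_t *x) {
  uint64_t counter = i;
  for (int k = 0; k < SIZE; k++) {
    x[k] += counter++;
  }
  i = counter;
}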

Other languages like Swift are likely affected as well. In Swift, you can pass an integer as an inout variable which is semantically equivalent to a reference…

func fx(_ allints: inout [Int],  _ j : inout Int) {
      for k in allints.indices {
        allints[k] = j
        j &+= 1
      }
 }

You should avoid such code if you care for performance.

Of course, these considerations are likely to be irrelevant if the function gets inlined or if the function is very inexpensive.

Science and Technology links (August 31st 2019)

  1. The Conboy laboratory in Berkeley is responsible for some of the best work in aging research. In 2005, they showed that by connecting the blood vessels of an old mouse and a young one, they could rejuvenate the old mouse (via a process called parabiosis). This earlier work showed that there are probably “aging factors” in one’s blood. Of course, this suggests that old people could benefit from blood transfusions coming from younger people, but that is not expected to be effective. In their latest paper, the Conboy lab shows that by inhibiting something called ALK5 and by increasing oxytocin, they could rejuvenate old mice.

    Alk5i plus OT quickly and robustly enhanced neurogenesis, reduced neuro-inflammation, improved cognitive performance, and rejuvenated livers and muscle in old mice. Summarily, simultaneously re-normalizing two pathways that change with age in opposite ways (up vs. down) synergistically reverses multiple symptoms of aging.

    This looks like a major breakthrough. The authors find that the treatment (which is relatively mild in dosage) reduces the number of senescent cells which might reduce inflammation. Senescent cells are old cells that will not die and they cause all sorts of problems as we age.

  2. If you want to motivate teachers to improve student scores with bonuses, you might think to give the bonuses at the end of the year. Yet it is more effective to give the bonus at the start of the year and collect it back if the students did not meet the required targets. Also it is better to reward students for completing tasks (e.g., homeworks) rather than for achieving good results.
  3. The pharmaceutical industry has a surprisingly high carbon footprint, larger than that of the automotive industry.

How fast can scancount be?

In an earlier post, I described the following problem. Suppose that you have tens of arrays of integers. You wish to find all integers that appear in more than 3 (say) of the arrays. This is neither a union nor an intersection. A simple algorithm to solve this problem is to use a counter for each possible value, incrementing the counter each time the value is seen. Finally, we check the counters and output the answer.

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
        counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
    if (counter[i] > threshold)
        output i;
}

You can do much better if you break down the space of the problem, so that all of your counters fit in the CPU cache. You solve the problem using multiple passes. It also helps if you use small counters, say 8-bit counters.

With this early optimization, I was happy to report a performance of about 16 cycles per input value. It is pretty good… but Nathan Kurz and Travis Downs pointed out that you can do much better.

What are some of the issues?

  1. I was using textbook C++ with iterators and STL algorithms. Converting the problem into lower-level code (using pointers or indexes) seems better for performance.
  2. Using 8-bit counters triggers all sorts of nasty “aliasing” problems: effectively, the compiler becomes paranoid and thinks that you might be writing all over the memory. Travis wrote a blog post on the topic and on how you can mitigate the issue.
  3. There are many low-level issues one might want to handle. For example, if you have 8-bit counter values and a 32-bit threshold value, an extra instruction might be needed to zero-extend your counter value.

We now have two much faster versions to offer. They run at between 3 and 4 cycles per input element. If you are keeping track, that is more than four times faster than my original “optimized” version. I make the code available on GitHub as a header-only C++ library.

    1. The first version is derived from a comment by Nathan Kurz; I sketch it after this list. It works identically to my original version, but instead of using a second pass through the counters, we check the counter values as we modify them. Other than that, I just carefully rewrote my original C++ in “plain C”. I achieve 4 cycles per input element, with 3.1 instructions per cycle, on a Skylake processor with GNU GCC, and around 3.5 cycles per input element under LLVM clang. The implementation should be portable.
    2. Travis was nice enough to share his implementation which follows the same original algorithm, but has been vectorized: it uses fast AVX2 instructions. It is slightly faster under the GNU GCC 8 compiler, sometimes approaching 3 cycles per input element.
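
Here is a sketch of the single-pass idea (the first version) in plain C++. It assumes that all values are smaller than range and that no value occurs more than 255 times, so the 8-bit counters cannot overflow:

#include <cstdint>
#include <vector>

// Single-pass scancount sketch: bump the counter for each value and
// record the value the moment its counter first exceeds the threshold.
std::vector<uint32_t> scancount(const std::vector<std::vector<uint32_t>> &data,
                                uint32_t range, uint8_t threshold) {
    std::vector<uint8_t> counters(range, 0);
    std::vector<uint32_t> hits;
    for (const auto &array : data) {
        for (uint32_t value : array) {
            counters[value]++;
            if (counters[value] == threshold + 1)
                hits.push_back(value); // first crossing of the threshold
        }
    }
    return hits;
}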

So how fast can scancount be? I have not yet answered this question.

In an earlier comment, Travis gave a bound. If each input element must result in the increment of a counter that is not a register, then we have one store per input element, and thus we need at least one cycle per input element on processors that are limited to one store per cycle.

Of course, we also need to load data: say one load to get the element, and one load to get the corresponding counter. Assuming that our hardware can do either two loads per cycle or one store per cycle, then we get a limit of at least two cycles per input element. Based on this napkin mathematics, if we achieve 4 cycles, then we are within a factor of two of the best possible algorithm.

Still: we could go much faster without breaking the speed of light. We are almost solving an embarrassingly parallel problem: we break the problem into subspaces (for the caching benefits), but this means that we should be able to dispatch independent components of the problem to distinct cores.

Science and Technology links (August 24th 2019)

  1. The net contribution of the Amazon ecosystem to the world’s oxygen is effectively zero. Furthermore, there is a lot of oxygen in the air and it would be excessively difficult to lower or increase the ratio. In effect, we are not going to run out of oxygen on Earth any time soon, even if we tried to.
  2. According to Nature, the Earth is getting greener. The main driver is climate change and CO2 which acts as a fertilizer. A large fraction of this greening is happening in China and India. These countries have increased their food production by 35% since 2000.
  3. Vitamin D supplements appear to reduce cancer risks. Vitamin D supplements are typically available as either vitamin D2 or D3. We know that D2 supplements do not lower mortality risks. We do not have enough data to assess vitamin D3. (Zhang et al., 2019)
  4. There is no evidence that plastic particles in water are harmful to human beings.
  5. CNBC reports that Mackey, a vegan who runs the successful Whole Foods chain, is critical of the fashionable Beyond Meat “plant-based meat”:

    I don’t think eating highly processed foods is healthy. I think people thrive on eating whole foods. As for health, I will not endorse that, and that is about as big of criticism that I will do in public.

    Meanwhile the stock price of Beyond Meat went up 800% in a few months.