Science and Technology links (November 16th 2019)

    1. We have new technology to do genetic engineering on human beings (CRISPR). In a small clinical trial, the researchers tested it on live human subjects and found it to be safe.
    2. Girls and boys have similar intelligence according to psychometric tests, but girls do much better in school, and this is not due to their better verbal ability.
    3. Being illiterate is associated with dementia and cognitive decline in old age.
    4. Dumber people tend to smoke more cannabis, but cannabis itself does not appear to make people dumber.
    5. We transplanted pig skin onto human beings.
    6. A large fraction of Ph.D. students seek psychological help for anxiety and depression. The students often blame the Ph.D. supervisors.
    7. Researchers have apparently been able to reverse aging in the eyes of old mice. The treated eyes can then recover from a crushed nerve and from glaucoma.

Unrolling your loops can improve branch prediction

Modern processors predict branches (e.g., if-then clauses), often many cycles ahead of time. When predictions are incorrect, the processor has to start again, an expensive process. Adding a hard-to-predict branch can multiply your running time.

Does this mean that you should only care about hard-to-predict branches? Unfortunately, no.

In a prior blog post, I showed how adding a predictable branch can degrade the branch prediction of other branches.

Conversely, removing a predictable branch might improve branch predictions. Wilco Dijkstra suggested a way to test it out. A common form of predictable branch is a loop. Though a loop does not look like an if-then clause, it still compiles to a branch. A loop is predictable if it iterates long enough. You can “unroll” loops, that is, reduce the number of iterations by doing more work with each iteration. Unrolling does more than remove a predictable branch, but it does remove one, among other things.

Let us consider an example. I generate random numbers. I check whether they are odd, and when they are, I write the result to an array. It is a silly example meant to illustrate the effect of branch prediction. Because my random numbers are generated by a pseudo-random number generator, I can run many trials with the same random numbers. The more often I repeat the loop, the better the branch prediction gets.

  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

We can unroll this loop: instead of generating one random number per loop iteration, I generate two. You can also generate four or eight per iteration, and so forth. The unrolled loop executes fewer branches because each pass through the loop involves an implicit branch (the loop condition), and unrolling reduces the number of passes.

  uint64_t randomval;
  while (howmany >= 2) {
    randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    randomval = rng(howmany - 1 + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany-=2;
  }  
  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }
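
For reference, a four-times-unrolled version might look as follows; this is a sketch following the same pattern, not necessarily the exact code used in the benchmark.

  while (howmany >= 4) {
    uint64_t r0 = rng(howmany + seed);
    if ((r0 & 1) == 1)
      *out++ = r0;
    uint64_t r1 = rng(howmany - 1 + seed);
    if ((r1 & 1) == 1)
      *out++ = r1;
    uint64_t r2 = rng(howmany - 2 + seed);
    if ((r2 & 1) == 1)
      *out++ = r2;
    uint64_t r3 = rng(howmany - 3 + seed);
    if ((r3 & 1) == 1)
      *out++ = r3;
    howmany -= 4;
  }
  // ... followed by the same cleanup loop as above for the remaining values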

I implemented this benchmark using 1000 random numbers per trial. I record the number of mispredicted branches per random number generated, on different processors.
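
The post does not reproduce the measurement harness. On Linux, one way to count mispredicted branches is through hardware performance counters via the perf_event_open system call; the following is a minimal sketch, not the author's actual code.

  #include <linux/perf_event.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <cstdint>
  #include <cstring>

  // open a hardware counter for mispredicted branches on the current thread
  int open_branch_miss_counter() {
    perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }

  // usage: reset and enable around one trial, then read the count and
  // divide by the number of random values processed
  //   ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  //   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  //   ... run one trial ...
  //   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  //   uint64_t misses; read(fd, &misses, sizeof(misses));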

ARM 64-bit (Skylark):

trial   basic loop   unrolled twice   unrolled four times
1       58%          57%              56%
2       48%          33%              26%
3       48%          28%              21%
15      45%          18%              8%

Intel x64 (Cannon Lake):

trial   basic loop   unrolled twice   unrolled four times
1       53%          51%              49%
2       50%          35%              30%
3       46%          20%              17%
15      40%          0.5%             0.4%

Adding a (predictable) branch to existing code can increase branch mispredictions

Software is full of “branches”. They often take the form of if-then clauses in code. Modern processors try to predict the result of branches, often long before evaluating them. Hard-to-predict branches are a challenge performance-wise because when a processor fails to predict a branch correctly, it does useless work that must be thrown away.

A convenient illustration is an algorithm that generates a random number and then only appends it to a list if the random number is odd*. When the numbers are genuinely random, half of the branches will be mispredicted. However, if we generate the same 2000 numbers using a pseudo-random number generator, the processor might learn to predict more accurately which numbers are odd.

while (howmany != 0) {
    randomval = random();
    if (randomval is odd)
      append randomval to array
    howmany--;
 }

What if we add a predictable branch? Let us say that we check whether the random 64-bit value is some arbitrary number. This new branch will be easily predicted as false.

while (howmany != 0) {
    randomval = random();
    if (randomval is 12313132)
       generate error
    if (randomval is odd)
      append randomval to array
    howmany--;
 }
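
In concrete C++, the version with the extra branch might look like the following sketch, where rng is a placeholder for a pseudo-random generator, seed its seed, and out the output pointer; this is not necessarily the benchmark's exact code.

while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if (randomval == 12313132)   // easy to predict: essentially never true
        abort();                 // stands in for "generate error"
    if ((randomval & 1) == 1)    // hard to predict when values are random
        *out++ = randomval;
    howmany--;
}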

Since the new branch is predictable, maybe it comes nearly for free?

Let us run 10 trials of the first algorithm, then 10 trials of the second, and so forth repeatedly, until the branch predictor is practically stable.

Let us count the number of mispredicted branches per loop iteration. We added an easy-to-predict branch, so it should not contribute directly to the number of mispredicted branches. I get the following numbers…

processor                 one hard branch   one hard, one easy branch
Intel Skylake processor   4% to 9%          30% to 40%
ARM A72                   24% to 26%        49% to 51%

So, at least in this particular test, the mere addition of an easy-to-predict branch substantially increased the number of mispredicted branches.

My source code is available.

Note: The loop itself involves an easily predicted branch, since at the end of each iteration the processor must determine whether it continues for another iteration.

* It is not a practical algorithm; it only serves to illustrate my point.

Parsing numbers in C++: streams, strtod, from_chars

When programming, we often want to convert strings (e.g., “1.0e2”) into numbers (e.g., 100). In C++, we have many options. In a previous post, I reported that it is an expensive process when using the standard approach (streams).

Many people pointed out to me that there are faster alternatives. In C++, we can use the C approach (e.g., strtod). We can also use the from_chars function. The net result is slightly more complicated code.

do {
    number = strtod(s, &end);
    if(end == s) break;
    sum += number;
    s = end; 
} while (s < theend);
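
For integers, std::from_chars from the C++17 <charconv> header can be used in a similar loop. Here is a sketch reusing the same s, theend and sum variables as above; it is not necessarily the code used in the benchmark.

do {
    uint64_t number;
    // requires #include <charconv> (C++17)
    auto [ptr, ec] = std::from_chars(s, theend, number);
    if (ec != std::errc()) break;          // stop at the first parse failure
    sum += number;
    s = ptr;
    while (s < theend && *s == ' ') s++;   // from_chars does not skip whitespace
} while (s < theend);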

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor.

integers (stream) 200 MB/s
integers (from_chars) 750 MB/s
floats (stream) 50 MB/s
floats (strtod) 90 MB/s

We see that for integers, the from_chars function is almost four times faster than the stream approach. Unfortunately, my compiler does not support the from_chars function when parsing floating-point numbers. However, I can rely on the similar C function (strtod). It is nearly twice as fast as the stream approach for floats. Even so, it still costs nearly 38 cycles per byte to parse floating-point numbers.

For each floating-point number, there are almost 10 branch misses in my tests, even though I generate numbers using a fixed format. The number of branch misses is nearly the same whether we use a C++ stream or the C function.

My source code is available.

Science and Technology links (October 26th 2019)

  1. People who were the oldest in their classes in school tend to be more confident and to take more risks.
  2. At the University of Montreal, about 32% of the students are male and 68% are female.
  3. Did blind orchestra auditions really benefit women?
  4. Wealthy people are happier, but the effect is small.
  5. Among Uber consumers, 60% never tip and only 1% always tip.
  6. How do you know how far away a given object is? A theory is that we rely on our familiarity with the object: you know how big a car can be, so if you see a tiny car, you know that it must be far away. However, it turns out that we are not better at estimating distances when we are familiar with the objects.
  7. What does it take to become a professor at a good university? In psychology, you need to have written about 16 papers, about half of them as the first author; and most new hires have completed a postdoc or started from a job at a lesser institution.
  8. The ends of our chromosomes are capped by repeated sequences called telomeres. These sequences do not contribute directly to your genetic code. Instead, they are used when cells divide to hold the chromosome while it is being copied. With each cell division, the telomeres get shorter. Eventually your cells can no longer divide, unless they use a trick to elongate the telomeres (e.g., telomerase). Thus telomeres act as a clock in aging. It is also sometimes believed that telomeres protect us against cancer because a single cell cannot reproduce endlessly, unless it manages somehow to elongate its telomeres. So what happens when you create mice with longer-than-usual telomeres? They live longer, they are fitter and they do not have higher cancer rates:

    (…) we demonstrate here that it is possible to generate mice that have telomeres which are much longer than those of the natural species (…) These mice show a younger phenotype as indicated by improved mitochondrial function, improved metabolic parameters, decreased cancer, and increased longevity. These results also suggest that there is not a negative selection for individuals with longer telomeres than normal in species, and therefore, one can envision that natural selection processes which favor individuals with longer telomeres within a given species, could potentially increase species longevity.

    Source: Nature.

    It seems credible that we could engineer other longer-lived species by manipulating the telomere lengths.

  9. Daily vitamin D supplements improve cognition in Alzheimer’s patients.
  10. We have never seen so much ice coverage in Antarctica. (Source: NASA)

How expensive is it to parse numbers from a string in C++?

In C++, there are multiple ways to convert strings into numbers; e.g., to convert the string “1.0e2” into the number 100.

In software, we frequently have to parse numbers from strings. Numbers are typically represented in computers as 32-bit or 64-bit words whereas strings are variable-length sequences of bytes. We need to go from one representation to the other.

For example, given the 3-character string “123”, we want to recover the integer value 123, which is typically stored in binary as the byte value 0x7b followed by zero bytes (assuming a little-endian machine).

Though parsing integers is not trivial, parsing floating-point numbers is trickier: e.g., the strings “1e10”, “10e9”, “100e+8” and “10000000000” all map to the same value. Typically, we store floating-point values as either 32-bit or 64-bit values using the IEEE 754 standard. Parsing may fail in various fun ways even if we set aside garbage inputs that have no reasonable number value (like “4a.14x.10.14”). For example, the string “1e500” cannot be represented, typically, in software (except as infinity). The string “18446744073709551616” is too large for a 64-bit integer, and a string like “9007199254740993” cannot be represented exactly as a 64-bit floating-point value. In some instances, we need to “round” the value: the string “1.0000000000000005” cannot be represented exactly as a 64-bit floating-point value, so we need to round it down to “1.0000000000000004” while we might round “1.0000000000000006” up to “1.0000000000000007”.
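
As a small, self-contained illustration of the overflow case mentioned above, the C function strtod signals the problem through errno; a minimal example:

#include <cerrno>
#include <cstdio>
#include <cstdlib>

int main() {
  errno = 0;
  double v = strtod("1e500", nullptr);  // too large for a 64-bit float
  // the result is infinity and errno is set to ERANGE
  printf("%f %d\n", v, errno == ERANGE);
  return 0;
}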

How expensive is it to parse numbers? To test it out, I wrote a long string containing space-separated numbers. In one case, I generated random 64-bit unsigned integers, and another I generated random 64-bit normal floating-point numbers. Then I benchmark standard C++ parsing code that simply sums up the numbers:

std::stringstream in(mystring);
while(in >> x) {
   sum += x;
}
return sum;

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor. I find that parsing numbers is relatively slow:

integers   360 cycles/integer    18 cycles/byte   200 MB/s
floats     1600 cycles/float     66 cycles/byte   50 MB/s

Most disks and most networks can do much better than 50 MB/s. And good disks and good networks can beat 200 MB/s by an order of magnitude.

My source code is available.

Benchmarking is hard: processors learn to predict branches

A lot of software is an intricate collection of branches (if-then clauses). For performance reasons, modern processors predict the results of these branches.

In my previous post, I showed how the bulk of your running time could be due to mispredicted branches. My benchmark consisted of writing 64 million random integers to an array. When I tried to write only the odd random integers, the performance became much worse, due to the mispredictions.

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Why 64 million integers and not, say, 2000? If you run just one test, then it should not matter. However, what if you are running multiple trials? Quickly, the number of mispredicted branches falls toward zero. The numbers on an Intel Skylake processor speak for themselves:

trial mispredicted branches (Intel Skylake)
1 48%
2 38%
3 28%
4 22%
5 14%

The “learning” keeps on going as the trials accumulate: eventually, the fraction of mispredicted branches falls to about 2%.

That is, as you keep measuring how long the same task takes, it gets faster and faster because the processor learns to better predict the outcome. The quality of the “learning” depends on your exact processor, but I would expect that newer processors should learn better. Importantly, from run to run, we generate the same random integers.
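
To make this concrete, here is a small self-contained sketch that measures time (rather than branch misses) across trials over the same 2000 values; the generator (a splitmix64-style function) is only an example, not necessarily the one used in the benchmark.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// splitmix64-style generator: a pure function of the index
static uint64_t rng(uint64_t i) {
  uint64_t z = i + UINT64_C(0x9E3779B97F4A7C15);
  z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
  z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
  return z ^ (z >> 31);
}

int main() {
  const size_t howmany = 2000;
  std::vector<uint64_t> out(howmany);
  for (int trial = 0; trial < 20; trial++) {
    auto start = std::chrono::steady_clock::now();
    size_t index = 0;
    for (size_t i = 0; i < howmany; i++) {
      uint64_t val = rng(i);   // the same values in every trial
      if ((val & 1) == 1) {    // hard to predict at first, learned over trials
        out[index++] = val;
      }
    }
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    printf("trial %d: %.2f ns/value (wrote %zu values)\n", trial, double(ns) / howmany, index);
  }
  return 0;
}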

The latest server processors from AMD learn to predict the branch perfectly (within 0.1%) in under 10 trials.

trial mispredicted branches (AMD Rome)
1 52%
2 18%
3 6%
4 2%
5 1%
6 0.3%
7 0.15%
8 0.15%
9 0.1%

This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.

You should probably avoid benchmarking branchy code over small problems.

My code is available.

Credit: The AMD Rome numbers were provided by Velu Erwan.

Further reading: A case for (partially) TAgged GEometric history length branch prediction (Seznec et al.)

Mispredicted branches can multiply your running times

Modern processors are superscalar, meaning that they can execute many instructions at once. For example, some processors can retire four or six instructions per cycle. Furthermore, many of these processors can initiate instructions out-of-order: they can start working on instructions that appear much later in the code.

Meanwhile most code contains branches (if-then clauses). These branches are often implemented as “jumps”: the processor either starts executing instructions further away, or it continues on its current path.

It is hard to reconcile branches with out-of-order superscalar execution. To do so, processors have sophisticated branch predictors. That is, the processor tries to predict the future. When it sees a branch, and thus a jump, it tries to guess which way it will go.

This often works quite well. For example, most loops are actually implemented as branches. At the end of each iteration in the loop, the processor must predict whether there will be a next iteration. It is often safe for the processor to predict that the loop will continue (forever). The processor will only mispredict one branch per loop in this manner.

There are other common examples. If you are accessing the content of an array, many languages will add “bound checking”: before accessing the array value, there will be a hidden check to see whether the index is valid. If the index is not valid, then an error is generated, otherwise the code proceeds normally. Bound checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.
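
In C++, for instance, std::vector::at performs such a bound check, throwing an exception when the index is invalid:

#include <vector>

int element(const std::vector<int> &v, size_t i) {
  // the hidden check inside at() is easy to predict: it (almost) never throws
  return v.at(i); // throws std::out_of_range when i >= v.size()
}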

What happens if the branch is hard to predict?

What happens inside the processor is that all of the instructions that were executed after the mispredicted branch must be cancelled, and the computation must start anew. You should expect a penalty of more than 10 cycles for each mispredicted branch. It can multiply the running times.

Let us look at some simple code where we write out random integers to an output array:

while (howmany != 0) {
    out[index] =  random();
    index += 1;
    howmany--;
}

We can generate a decent random number in about 3 cycles on average. To be clear, the total latency of the random number generator might be 10 cycles, but our processor is superscalar, so we can work on several random-number computations at once. Thus we may produce a new random number every 3 cycles or so.
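
The post does not say which generator is used. As an illustration, a splitmix64-style finalizer needs only a handful of additions, multiplications, shifts and XORs, and since it is a pure function of its input, calls for different indices are independent and can overlap on an out-of-order processor. A sketch, not necessarily the benchmark's generator:

#include <cstdint>

// splitmix64-style generator: a pure function of the index
uint64_t rng(uint64_t index) {
  uint64_t z = index + UINT64_C(0x9E3779B97F4A7C15);
  z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
  z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
  return z ^ (z >> 31);
}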

Let us modify slightly this function so that we only write out the odd integers:

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Naively we might think that this new function could be faster. Indeed, we might only write out one out of two integers. We have a branch, but checking whether an integer is odd requires checking a single bit.

I benchmarked these two functions in C++ on a Skylake processor:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer

The second function takes about five times longer!

Is there something you can do? Yes. You can just remove the branch. You can characterize an odd integer by the fact that its bitwise logical AND with the value 1 is one. The trick is to increment the array index by one only when the random value is odd.

while (howmany != 0) {
    val = random();
    out[index] = val;
    index += (val bitand 1);
    howmany--;
}

In this new version, we always write the random value to the output array, even when it is not needed. At a glance, it looks wasteful. However, it does away with the mispredicted branches. In practice the performance is nearly as good as the original code, and much better than the version with branches:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer
branch removed 3.8 cycles per integer

Could the compiler have solved this problem on its own? In general, the answer is negative. Compilers sometimes have ways to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely. In our example, the branchless version writes to the array even when the value is discarded; a compiler cannot, in general, introduce such an extra store on its own because it cannot prove that doing so is safe.
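
For instance, a simple selection like the following is often compiled to a conditional move (e.g., cmov on x64) rather than a branch, though the compiler is not obliged to do so:

// frequently compiled without any branch, using a conditional move
int smaller(int x, int y) { return (x < y) ? x : y; }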

An important take-away is that mispredicted branches are not a minor concern, they can have a large effect.

My source code is available.

Further reading: Processors learn to predict branches (follow-up blog post).

Science and Technology links (October 12th 2019)

  1. In many countries, like Canada, there is relatively little private (business) research. Meanwhile, other research indicates that private research is precisely the kind of research that leads directly to productivity growth. Private research seems concentrated in hubs like Silicon Valley. To entice businesses to do research, the Canadian government has pursued an aggressive tax credit strategy: roughly speaking, research becomes a tax avoidance strategy. It does not work well.
  2. Artificial ovaries work in mice, and we think that they may work in women too.
  3. Alzheimer’s research has failed us despite massive investments. It might be time for a reset.
  4. Drinking alcohol triggers a hormone (FGF21) that makes you more likely to drink water.
  5. Dog owners are less likely to die.
  6. The greatest inequality might be in dating and sex. The bottom 80% of men (in terms of attractiveness) are competing for the bottom 22% of women and the top 78% of women are competing for the top 20% of men. In other words, a small number of men have their pick of all of the women and most guys are unattractive to most women.
  7. The U.S. Navy looked at skin cancer rates among its staff. They found the highest rate of skin cancer in people working essentially indoors.
  8. Fruit flies exposed to a combo of three different drugs lived 48% longer, even though the individual effect of each drug is relatively small.
  9. Moderate alcohol consumption is associated with reduced inflammation and improved responses to vaccination.
  10. Choline might protect against Alzheimer’s. Choline is found in meat among other places.
  11. You may have read that we find symmetric faces more attractive. A new study challenges this claim.
  12. Iron might be an aging factor.
  13. Climate models routinely predict higher temperatures than what is actually observed. Where does the missing heat go? A paper in Nature claimed that the heat went into the ocean. Nature withdrew the paper because it contained too many data-processing mistakes.
  14. Alpha-ketoglutarate can significantly extend lifespan and healthspan in mice. You can get it in the form of cheap supplements.

The price of a MacBook Air worldwide

Apple sells identical laptops worldwide. There might be small differences with respect to power adaptors and so forth, but the laptops are the same internally. Of course, there are differences in taxes, but Apple quotes prices with taxes included. So I went shopping for a basic MacBook Air in different countries, using Apple’s web site.

country     local price    price in US dollars
USA         US$1099        US$1099
Canada      CA$1449        US$1089
Japan       ¥119,800       US$1120
Australia   A$1,699        US$1146
France      1 249,00 €     US$1374

According to these numbers, the cheapest MacBooks are in Canada and the most expensive ones are in France.

In countries where taxes vary by province or state, the Apple tax estimate might be wrong for some locations.