Are 64-bit random identifiers free from collision?

It is common in software systems to map objects to unique identifiers. For example, you might map every web page on the Internet to a unique identifier.

Often, these identifiers are integers. For example, many people like to use 64-bit integers. If you assign two 64-bit integers at random to distinct objects, the probability of a collision is very, very small. You can be confident that they will not collide.

However, what about the case where you have 300 million objects? Or maybe 7 billion objects? What is the probability that at least two of them collide?

This is just the birthday paradox. Wikipedia gives us an approximation to the collision probability, assuming that the number of objects r is much smaller than the number of possible values N: 1 - exp(-r^2/(2N)). Because there are so many distinct 64-bit integers, it should be a good approximation.

Number of objects Collision probability
500M 0.7%
1B 3%
1.5B 6%
2B 10%
4B 35%
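
If you want to reproduce these numbers, here is a minimal C++ sketch of the approximation; the object counts are the ones from the table above.

#include <cmath>
#include <cstdio>

int main() {
  const double N = 18446744073709551616.0; // 2^64 possible identifiers
  const double counts[] = {500e6, 1e9, 1.5e9, 2e9, 4e9};
  for (double r : counts) {
    // birthday-paradox approximation: 1 - exp(-r^2 / (2N))
    double p = 1.0 - std::exp(-r * r / (2.0 * N));
    std::printf("%11.0f objects -> collision probability %.1f%%\n", r, 100.0 * p);
  }
  return 0;
}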

Thus if you have a large system with many objects, it is quite conceivable that your randomly assigned 64-bit identifiers might collide. If a collision is a critical flaw, you probably should not use only 64 bits.

Amazon’s new ARM servers: Graviton 2

Most servers on the Internet run on x64 processors, mostly made by Intel. Meanwhile, most smartphones run ARM processors.

From a business perspective, these are different technologies. The x64 processors are mostly designed by only two companies (Intel and AMD), with one very large dominant player (Intel). In contrast, ARM processors come in many different flavours. Apple, Qualcomm, Samsung and others design their own ARM processors. This diversity can be either a blessing or a curse. The blessing is that you get a lot of direct competition and many custom solutions. The curse is that it is harder to support ARM systems because there are so many variations.

Amazon is the largest public cloud provider and is large enough to design its own servers and even its own silicon. Some time ago, Amazon launched a service (Graviton) based on its own ARM-based servers. I tested them out, but the performance just was not there. Amazon just announced a second iteration of these servers, called Graviton 2, and claims a 7-fold performance increase over its previous ARM servers. They are based on processor designs made for servers called Neoverse. I do not yet have access to these servers, but Matthew Wilson from Amazon was kind enough to run my standard JSON parsing benchmark (using simdjson and the twitter.json data file).

Compiling the results in a table suggests that these new Amazon servers have processors that are nearly as good as those in a flagship smartphone.

iPhone XR A12 2.5 GHz 1.3 GB/s
Graviton 2 Neoverse N1 2.5 GHz 1.1 GB/s
Ampere (first generation) Skylark 3.2 GHz 0.37 GB/s
Rockpro64 Cortex-A72 1.8 GHz 0.32 GB/s

My fast JSON parsing benchmark is just something I happen to care about. It is probably not representative of whatever you care about. In particular, it is CPU intensive whereas servers have many other bottlenecks.

Nevertheless, I find these results quite encouraging. If I normalize the speed by the frequency, I get that the new Neoverse N1 processor is about 2.5 times faster than the Cortex-A72: roughly 0.44 bytes parsed per cycle (1.1 GB/s at 2.5 GHz) versus roughly 0.18 bytes per cycle (0.32 GB/s at 1.8 GHz). When they come out, they may be the fastest publicly accessible ARM servers.

Amazon is claiming that these Graviton 2 servers offer much better performance than Intel-based servers (EC2 M5) and that they will be slightly cheaper. My expectation is that the better performance will be due in large part to a higher number of cores.

Update: Matthew Wilson reports 1.3 GB/s following some optimization work.

Science and Technology links (December 7th 2019)

  1. Incredibly, there is a new, simpler way to solve quadratic equations. I used to rely on completing the square, but this is better! There is a video report on the finding.
  2. Surgeons are putting patients in suspended animation (biostasis) during surgery, replacing their blood with a cold saline solution, for time-sensitive interventions.
  3. Statins are a widely prescribed class of medications. They may adversely affect cognition.

AMD Zen 2 and branch mispredictions

Intel makes some of the very best processors money can buy. For a long time, its main rival (AMD) failed to compete. However, AMD's latest generation of processors (Zen 2) appears to roughly match Intel, at a lower price point.

In several benchmarks that I care about, my good old Intel Skylake (2015) processor beats my brand-new Zen 2 processor.

To try to narrow it down, I created a fun benchmark. I run the following algorithm: I generate random integers quickly, and then check the two least significant bits. By design, no matter how good the processor is, there should be (on average) one mispredicted branch per loop iteration.

while (true) {
   r = random_integer()
   if (( r AND 1) == 1) {
     write r to output
   }
   if (( r AND 2) == 2) {
     write r to output
   }
}
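
Here is a minimal C++ rendition of this loop; it is only a sketch, using std::mt19937_64 as a stand-in for the fast random number generator and a fixed number of iterations, and it does not measure cycles (the code I actually benchmark is linked below).

#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937_64 rng(1234); // stand-in for a fast random number generator
  std::vector<uint64_t> out;
  const size_t iterations = 1000 * 1000;
  out.reserve(2 * iterations);
  for (size_t i = 0; i < iterations; i++) {
    uint64_t r = rng();
    // each branch is taken with probability 1/2, so we expect about
    // one mispredicted branch per iteration
    if ((r & 1) == 1) { out.push_back(r); }
    if ((r & 2) == 2) { out.push_back(r); }
  }
  std::printf("wrote %zu values\n", out.size());
  return 0;
}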

I record the number of CPU cycles per loop iteration. This number is largely independent from processor frequency, memory access and so forth. The main bottleneck in this case is branch misprediction.

Intel Skylake 29.7 cycles
AMD Zen 2 31.7 cycles

Thus it appears that in this instance the AMD Zen 2 has two extra cycles of penalty per mispredicted branch. If you run the same benchmark without the branching, the difference in execution time is about 0.5 cycles in favour of the Skylake processor. This suggests that AMD Zen 2 might waste between one and two extra cycles per mispredicted branch.

My code is available. I define a docker container so my results are easy to reproduce.

Instructions per cycle: AMD Zen 2 versus Intel

The performance of a processor is determined by several factors. For example, processors with a higher frequency tend to do more work per unit of time. Physics makes it difficult to produce processors that have higher frequency.

Modern processors can execute many instructions per cycle. Thus a 3.4GHz processor has 3.4 billion cycles per second, but it might easily execute 7 billion instructions per second on a single core.

Up until recently, Intel produced the best commodity processors: its processors had the highest frequencies, the most instructions per cycle, the most powerful instructions and so forth. However, Intel is increasingly being challenged. One smaller company that is making Intel nervous is AMD.

It has been reported that the most recent AMD processors surpass Intel in terms of instructions per cycle. However, it is not clear whether these reports are genuinely based on measures of instructions per cycle. Rather, it appears that they are measures of the amount of work done per unit of time normalized by processor frequency.

In theory, a processor limited to one instruction per cycle could beat a modern Intel processor on many tasks if it had powerful instructions and faster data access. Thus “work per unit of time normalized per CPU frequency” and “instructions per cycle” are distinct notions.

I have only had access to a recent AMD processor (Zen 2) for a short time, but in this short time, I have routinely found that it has a lower number of instructions per cycle than even older Intel processors.

Let us consider a piece of software that has a high number of instructions per cycle, the fast JSON parser simdjson. I use GNU GCC 8 under Linux and I process a test file called twitter.json using the benchmark command-line tool called parse. I record the number of instructions per cycle, as measured by CPU counters, in the two stages of processing. The two stages together effectively parse a JSON document. This is an instruction-heavy benchmark: the numbers of mispredicted branches and cache misses are small. The Skylake processor has the highest frequency. I use an AMD Rome (server) processor.

I find that AMD is about 10% to 15% behind Intel.

processor IPC (stage 1) IPC (stage 2)
Intel Skylake (2015) 3.5 3.0
Intel Cannon Lake (2018) 3.6 3.1
Zen 2 (2019) 3.2 2.8

Another problem that I like is bitset decoding. That is, given an array of bits (0s and 1s), I want to find the locations of the ones. See my blog post Really fast bitset decoding for “average” densities. I benchmark just the “basic” decoder.

#include <cstdint>
#include <x86intrin.h> // for _tzcnt_u64 on x64

// Write the position of each set bit in 'bits' (offset by idx) to base_ptr,
// advancing the running counter 'base'.
void basic_decoder(uint32_t *base_ptr, uint32_t &base, 
  uint32_t idx, uint64_t bits) {
  while (bits != 0) {
    base_ptr[base] = idx + _tzcnt_u64(bits);
    bits = bits & (bits - 1); // clear the least significant set bit
    base++;
  }
}
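
To make the setting concrete, here is a hypothetical usage example: it decodes a single 64-bit word, assuming the basic_decoder function above is in scope.

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t positions[64];
  uint32_t count = 0;
  uint64_t bits = 0b1011001ULL; // bits 0, 3, 4 and 6 are set
  basic_decoder(positions, count, 0, bits);
  for (uint32_t i = 0; i < count; i++) {
    std::printf("bit %u is set\n", positions[i]);
  }
  return 0;
}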

My source code is available.

processor IPC
Intel Skylake (2015) 2.1
Zen 2 (2019) 1.4

So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!

Of course, your results will vary. And I am quite willing to believe that in many, maybe even most, real-world cases, AMD Zen 2 can do more work per unit of time than the best Intel processors. However, I feel that we should qualify these claims. I do not think it is entirely reasonable for AMD customers to expect better instructions-per-cycle numbers on the tasks that they care about, and they may even find lower numbers. AMD Zen 2 does not dominate Intel Skylake; it is more reasonable to expect rough parity.

Further reading: AMD Zen 2 and branch mispredictions

Better computational complexity does not imply better speed

A recent magazine article presents a theoretical result: Harvey and van der Hoeven have shown that you can multiply two n-bit integers using O(n log n) complexity. Roughly speaking, this means that as n grows large, there is a mathematical model of how long the resulting algorithm might run that grows like n log n in the worst case. That is, it does not get much worse as n gets larger, mathematically speaking. It is likely that this result will go into textbooks as an important contribution to theoretical computer science.

So far so good. But the magazine article then states that the authors have presented a faster way to compute multiplications. That is incorrect. At this point in time, they have not made this demonstration, or even this claim, to my knowledge. Possibly they have an implementation somewhere and might be able to answer my questions, but I don’t expect so.

So what is the problem with claiming that they have a faster and more efficient way to do multiplications?

  • Computational complexities like O(n log n) are not a speed or even a time in the real world.
  • They are not even a model of the running time or speed. They are a mathematical asymptotic model for n large. How large must n be for a favorable model to theoretically beat a less favorable model mathematically? Possibly as large as the number of atoms in the universe. Or maybe even larger than that. I am not joking: as far as I can tell, the Harvey and van der Hoeven algorithm would require more digits per integer than there are atoms in the universe to have a chance at being practical. Please pause to consider: if you are a scientist, you need to make predictions that apply to our universe, otherwise you are doing mathematics. Mathematics is a fine activity, but it is not to be confused with science or engineering. I can hear people clamouring that, in the end, big n always wins out. Let me quote a famous theoretical computer scientist on this topic:

    Asymptotics will eventually win out, as long as everything else stays fixed. But that’s the precise problem. Everything else doesn’t stay fixed.

  • But it gets worse: these are not scientific models. A scientific model would predict the running time of the algorithm given some implementation, within some error margin. However, these models do nothing of the sort. They are purely mathematical. They are not falsifiable. As long as they are mathematically correct, then they are always true. To be fair, some researchers like Knuth came up with models that closely mimic reasonable computers, but that’s not what people pursuing computational complexity bounds do, typically.
  • Importantly, this means that these asymptotic models make no prediction about the running time of actual code. If I ask Harvey and van der Hoeven how long (in seconds) their algorithm takes to compute the multiplication of 1024-byte integers on my MacBook, they cannot tell me from their paper alone. Thus they don’t know that it is actually faster than the big-integer library I am using (gmplib).
  • Maybe you would argue that, eventually, their algorithm would beat whatever gmplib has implemented, given large enough integers. But that’s not a scientific (i.e., falsifiable) statement. I could implement their algorithm and apply it to 100000-byte integers and get that their approach is no faster. You could then argue that I need to go even larger. If I do and fail again, you might argue that I need to go even larger. That is not how science or engineering should work. If you have a scientific model, it should make predictions that can be correct or false about the real world. And maybe you end up telling me that if I use 2^1024 bits per integer, then surely the implementation of the new algorithm is faster: but it is again an unscientific statement because there is no such thing as 2^1024-bit integers and there will never be in this universe.
  • Possibly Harvey and van der Hoeven have more precise bounds than just a big-O result. But even if they make falsifiable predictions about actual use cases, it does not follow that they are correct. They are working from a mathematical model. It is an idealized version of a computer. Without checking, you cannot tell whether your model is correct. You have to test it out. And, importantly, if you make flawed predictions, then your model is incorrect from a scientific point of view. And what are you modelling exactly? If you are assuming that a computer that manages petabytes of memory works the same as a computer that has gigabytes of memory, something is fishy.

I am not just nitpicking. This is of critical importance. Medical researchers cure Alzheimer’s or cancer in marvelous ways almost weekly using “animal models” or in vitro (in the laboratory). These breakthroughs almost never translate to the real world. It is ridiculously hard to go from models to applications.

If you do theoretical work and arrive at a model that suggests that you have a better, faster algorithm, then you are not nearly done. The map is not the territory. If you are good, then your model should closely match reality. But, no matter how smart you are, and no matter how clever your mathematical work is, you can only claim to be solving a problem faster after you have built it into a computer and recorded the elapsed time.

I am not diminishing Harvey and van der Hoeven’s accomplishments. If my understanding is correct, their names will be in textbooks for a very, very long time. It is well deserved based on the mathematical construction.

But we are not able to multiply integers faster thanks to their paper. This paper of theirs is not an engineering contribution. To be fair, it may lead to one such contribution. Working with models is often a useful way to do better engineering. However, you have to keep in mind that the model is not reality and that reality is the only thing that matters in the end. If you lose track of this important fact, then you are falling back into prescientific thinking.

What if you disagree with my post? What if you think that Harvey and van der Hoeven’s contribution is indeed a step forward engineering-wise? Then you have a clear path forward: provide an implementation, benchmark it against well established software (e.g., gmplib) and show the benefits. No amount of white board arguments can make this requirement go away: show me the code.

Memory parallelism: AMD Rome versus Intel

When thinking about “parallelism”, most programmers think about having multiple processors. However, even a single core in a modern processor has plenty of parallelism. It can execute many instructions per cycle and, importantly, it can issue multiple memory requests concurrently.

Our processors are becoming “more parallel” over time, as is evident by the recent increases in the number of cores. AMD sells 64-core processors. But each individual core is also becoming “more parallel”.

To demonstrate, let me use a memory access test.

The starting point is a shuffled array. You access one entry in the array, read the result, and it points to the next location in the array. You get a random walk through an array. The test terminates when you have visited every location in the array.

Then you can “parallelize” this problem. Divide the random walk into two equal-size paths. Or divide it into 10 equal-size paths.

If you can issue only one memory request at any one time, parallelizing the problem won’t help. However, processors can issue more than one memory request and so as you parallelize the problem, your running times get smaller and your effective bandwidth higher.
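
To make this concrete, here is a simplified C++ sketch of the idea, not the testingmlp code itself: it builds a random cyclic permutation and walks it with several independent lanes, so that the loads from different lanes can be in flight at the same time.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const size_t n = size_t(1) << 24; // array much larger than the CPU caches
  const size_t lanes = 8;           // number of independent walks

  // Random cyclic permutation: order[] is the visit order, next[] encodes the walk.
  std::vector<uint32_t> order(n), next(n);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin() + 1, order.end(), std::mt19937(1234));
  for (size_t i = 0; i + 1 < n; i++) next[order[i]] = order[i + 1];
  next[order[n - 1]] = order[0];

  // Split the walk into 'lanes' equal paths starting at evenly spaced points.
  // The loads in different lanes are independent of one another, so the
  // processor can keep several memory requests in flight at once.
  std::vector<uint32_t> pos(lanes);
  for (size_t l = 0; l < lanes; l++) pos[l] = order[l * (n / lanes)];
  uint64_t sum = 0;
  for (size_t step = 0; step < n / lanes; step++)
    for (size_t l = 0; l < lanes; l++) { pos[l] = next[pos[l]]; sum += pos[l]; }

  std::printf("checksum: %llu\n", (unsigned long long)sum);
  return 0;
}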

How has this evolved over time? The very latest Intel processors (e.g., Cannon Lake) can sustain more than 20 memory requests at any one time. It is about twice what the prior generation (Skylake) could do. How do the latest AMD processors fare? About the same. They can sustain nearly 20 concurrent memory requests at any one time. AMD does not quite scale as well as Intel, but it is close. In these tests, I am hitting RAM: the array is larger than the CPU cache. I am also using huge pages.

The important lesson is that if you are thinking about your computer as a sequential machine, you can be dramatically wrong, even if you are just using one core.

And there are direct consequences. It appears that many technical interviews for engineering positions have to do with linked lists. In a linked list, you access the elements of a list one by one, as the location of the next entry is always coded in the current entry and nowhere else. Obviously, it is a potential problem performance-wise because it makes it hard to exploit memory-level parallelism. And the larger your list and the more recent your processor, the worse it gets.

I make the raw results available. I use the testingmlp software package.

Cloud computing: a story of incentives

Many businesses today run “in the cloud”. What this often means is that they have abstracted out the hardware entirely. Large corporations like Amazon, Google, Microsoft or IBM operate the servers. The business only needs to access the software, remotely.

In theory, this means that you can adjust your capacity for just your needs. If you need only twelve servers most of the year, then you pay for only twelve servers. And on the specific days when you need 100 servers, you pay for the 100 servers on these days only. You may even use “serverless” computing and pay just for what you use, saving even more money.

Is this the whole story?

I am not so sure.

A tremendous benefit of cloud computing for the developers and operation people is that it cuts through the red tape. If you are using the cloud, then a developer can get ten more servers at the click of a button. I have met credible people from well-known businesses who told me that their developers have practically an unlimited ability to allocate new cloud resources.

If we make it easy for developers to quickly use lots of computing resources, these developers might effectively think of computing and storage as infinite. It also wipes away all incentives to produce efficient systems.

You may end up warming up the planet. It is not a joke: training a single machine-learning model can have over four times the footprint of a car over its entire lifetime.

Your ten servers end up needing a much more complicated architecture than your single laptop. But that is not obviously a negative for some developers who get to try out fancier-than-needed software, always useful on a resume.

Developers are not alone. When I take pictures these days, I never delete them. They get uploaded to the cloud and I forget about them. When I managed the scarce storage on my PC where digital pictures could be stored, two decades ago, I would spend a lot of time deleting bad pictures. I no longer care.

Further reading: The computing power needed to train AI is now rising seven times faster than ever before and Facebook AI research’s latest breakthrough in natural language understanding is running up against the limits of existing computing power. (credit: Peter Turney)

Science and Technology links (November 16th 2019)

    1. We have new technology to do genetic engineering on human beings (CRISPR). In a small clinical trial, the researchers tested it on live human subjects and found it to be safe.
    2. Girls and boys have a similar intelligence according to psychometric tests, but girls do much better in school, and this is not due to their better verbal ability.
    3. Being illiterate is associated with dementia and cognitive decline in old age.
    4. Dumber people tend to smoke more cannabis, but cannabis itself does not appear to make people dumber.
    5. We transplanted pig skin onto human beings.
    6. A large fraction of Ph.D. students seek psychological help for anxiety and depression. The students often blame the Ph.D. supervisors.
    7. Researchers have apparently been able to reverse aging in the eyes of old mice. The mice’s eyes can then recover from a crushed nerve and from glaucoma.

Unrolling your loops can improve branch prediction

Modern processors predict branches (e.g., if-then clauses), often many cycles ahead of time. When predictions are incorrect, the processor has to start again, an expensive process. Adding a hard-to-predict branch can multiply your running time.

Does it mean that you should only care about hard-to-predict branches? Unfortunately no.

In a prior blog post, I showed how adding a predictable branch can degrade the branch prediction of other branches.

Conversely, removing a predictable branch might improve branch predictions. Wilco Dijkstra suggested a way to test it out. A common form of predictable branch is a loop. Though a loop does not look like an if-then clause, it still compiles to a branch. A loop is predictable if it iterates long enough. You can “unroll” loops, that is, reduce the number of iterations by doing more work with each iteration. Unrolling does more than just remove a predictable branch, but it does do that, among other things.

Let us consider an example. I generate random numbers. I check whether they are odd, and when they are, I write the result to an array. It is a silly example meant to illustrate the effect of branch prediction. Because my random numbers are generated by a pseudo-random number generator, I can run many trials with the same random numbers. The more often I repeat the loop, the better the branch prediction gets.

  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

We can unroll this loop: instead of generating one random number per loop iteration, I generate two. You can also generate four or eight, and so forth. The unrolled loop has fewer branches because each iteration through the loop is an implicit branch.

  uint64_t randomval;
  while (howmany >= 2) {
    randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    randomval = rng(howmany - 1 + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany-=2;
  }  
  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

I implemented this benchmark using 1000 random numbers per trial. I record the number of mispredicted branches per random number generated on different processors.

ARM 64-bit (Skylark):

trial basic loop unrolled twice unrolled four times
1 58% 57% 56%
2 48% 33% 26%
3 48% 28% 21%
15 45% 18% 8%

Intel x64 (Cannon Lake):

trial basic loop unrolled twice unrolled four times
1 53% 51% 49%
2 50% 35% 30%
3 46% 20% 17%
15 40% 0.5% 0.4%

Adding a (predictable) branch to existing code can increase branch mispredictions

Software is full of “branches”. They often take the form of if-then clauses in code. Modern processors try to predict the result of branches often long before evaluating them. Hard-to-predict branches are a challenge performance-wise because when a processor fails to correctly predict a branch, it does useless work that must be thrown away.

A convenient illustration is an algorithm that generates a random number and then only appends it to a list if the random number is odd*. When the numbers are genuinely random, half of the branches will be mispredicted. However, if we generate the same 2000 numbers using a pseudo-random number generator, the processor might learn to predict more accurately which number is odd.

while (howmany != 0) {
    randomval = random();
    if (randomval is odd)
      append randomval to array
    howmany--;
 }

What if we add a predictable branch? Let us say that we check whether the random 64-bit value is some arbitrary number. This new branch will be easily predicted as false.

while (howmany != 0) {
    randomval = random();
    if (randomval is 12313132)
       generate error
    if (randomval is odd)
      append randomval to array
    howmany--;
 }
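
Here is a minimal C++ sketch of this second loop, with std::mt19937_64 standing in for the pseudo-random number generator; the benchmarked source code is linked at the end of this post.

#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  const size_t howmany = 2000;
  std::vector<uint64_t> out;
  out.reserve(howmany);
  std::mt19937_64 rng(42); // same seed on every trial, so the sequence repeats
  for (size_t i = 0; i < howmany; i++) {
    uint64_t randomval = rng();
    // easy-to-predict branch: essentially never taken
    if (randomval == 12313132) { std::printf("unexpected value\n"); }
    // hard-to-predict branch: taken about half the time
    if ((randomval & 1) == 1) { out.push_back(randomval); }
  }
  std::printf("%zu odd values\n", out.size());
  return 0;
}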

Since the new branch is predictable, maybe it comes nearly for free?

Let us run 10 trials of the first algorithm, then 10 trials of the second, and so forth repeatedly, until the branch predictor is practically stable.

Let us count the number of mispredicted branches per loop iteration. We added an easy-to-predict branch, so it should not contribute directly to the number of mispredicted branches. I get the following numbers…

processor one hard branch one hard, one easy branch
Intel Skylake processor 4% to 9% 30% to 40%
ARM A72 24% to 26% 49% to 51%

So at least in this particular test, the mere addition of an easy-to-predict branch substantially increased the number of mispredicted branches.

My source code is available.

Note: The loop itself is an easily-predicted branch since the processor must determine whether it continues for another iteration or not at the end of each iteration.

* It is not a practical algorithm; it only serves to illustrate my point.

Parsing numbers in C++: streams, strtod, from_chars

When programming, we often want to convert strings (e.g., “1.0e2”) into numbers (e.g., 100). In C++, we have many options. In a previous post, I reported that it is an expensive process when using the standard approach (streams).

Many people pointed out to me that there are faster alternatives. In C++, we can use the C approach (e.g., strtod). We can also use the from_chars function. The net result is slightly more complicated code.

// parse and sum numbers with the C function strtod; 's' walks through the string
do {
    number = strtod(s, &end);
    if(end == s) break; // no number was parsed: stop
    sum += number;
    s = end;
} while (s < theend);
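
For integers, a from_chars-based loop might look as follows; this is only a sketch, assuming a C++17 compiler and values separated by single spaces, not the exact benchmarked code.

#include <charconv>
#include <cstdint>
#include <cstdio>
#include <string>
#include <system_error>

uint64_t sum_integers(const std::string &str) {
  uint64_t sum = 0;
  const char *s = str.data();
  const char *theend = str.data() + str.size();
  while (s < theend) {
    uint64_t number;
    auto [next, ec] = std::from_chars(s, theend, number);
    if (ec != std::errc()) break; // stop at the first parsing failure
    sum += number;
    s = next;
    if (s < theend && *s == ' ') s++; // skip the separator
  }
  return sum;
}

int main() {
  std::printf("%llu\n", (unsigned long long)sum_integers("1 22 333"));
  return 0;
}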

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor.

integers (stream) 200 MB/s
integers (from_chars) 750 MB/s
floats (stream) 50 MB/s
floats (strtod) 90 MB/s

We see that for integers, the from_chars function is almost four times faster than the stream approach. Unfortunately my compiler does not support the from_chars function when parsing floating-point numbers. However, I can rely on the similar C function (strtod). It is nearly twice as fast as the stream approach for floating-point numbers. Even so, it still costs nearly 38 cycles per byte to parse floating-point numbers.

For each floating-point number, there are almost 10 branch misses in my tests, even though I generate numbers using a fixed format. The number of branch misses is nearly the same whether we use a C++ stream or the C function.

My source code is available.

Science and Technology links (October 26th 2019)

  1. People who were the oldest in the classes in school tend to be more confident and to take more risks.
  2. At the University of Montreal, about 32% of the students are male and 68% are female.
  3. Did blind orchestra auditions really benefit women?
  4. Wealthy people are happier, but the effect is small.
  5. Among Uber consumers, 60% never tip and only 1% always tip.
  6. How do you know how far away a given object is? One theory is that we rely on our familiarity with the object: you know how big a car can be, so if you see a tiny car, you know that it must be far away. However, it turns out that we are not better at estimating distances when we are familiar with the objects.
  7. What does it take to become a professor in a good university? In psychology, you need to have written about 16 papers, half of which as the first author; and most of the new hires have completed a postdoc or started from a job at a lesser institution.
  8. The ends of our chromosomes are capped by repeated sequences called telomeres. These sequences do not contribute directly to your genetic code. Instead, they are used when the cells divide to hold the chromosome while it is being copied. With each cell division, the telomeres get shorter. Eventually your cells can no longer divide, unless they use a trick to elongate the telomeres (e.g., telomerase). Thus telomeres act as a clock in aging. It is also sometimes believed that telomeres act to protect us against cancer because a single cell can’t reproduce endlessly, unless it manages somehow to elongate its telomeres. So what happens when you create mice with longer-than-usual telomeres? They live longer, they are fitter and they do not have higher cancer rates:

    (…) we demonstrate here that it is possible to generate mice that have telomeres which are much longer than those of the natural species (…) These mice show a younger phenotype as indicated by improved mitochondrial function, improved metabolic parameters, decreased cancer, and increased longevity. These results also suggest that there is not a negative selection for individuals with longer telomeres than normal in species, and therefore, one can envision that natural selection processes which favor individuals with longer telomeres within a given species, could potentially increase species longevity.

    Source: Nature.

    It seems credible that we could engineer other longer-lived species by manipulating the telomere lengths.

  9. Daily vitamin D supplements improve cognition in Alzheimer’s patients.
  10. We have never seen so much ice coverage in Antarctica. (Source: NASA)

How expensive is it to parse numbers from a string in C++?

In C++, there are multiple ways to convert strings into numbers. E.g., to convert the string “1.0e2” into the number 100.

In software, we frequently have to parse numbers from strings. Numbers are typically represented in computers as 32-bit or 64-bit words whereas strings are variable-length sequences of bytes. We need to go from one representation to the other.

For example, given the 3-character string “123”, we want to recover the integer value 123, which is typically stored as a byte value 0x7b followed by some zero bytes.

Though parsing integers is not trivial, parsing floating-point numbers is trickier: e.g., the strings “1e10”, “10e9”, “100e+8” and “10000000000” all map to the same value. Typically, we store floating-point values as either 32-bit or 64-bit values using the IEEE 754 standard. Parsing may fail in various fun ways even if we set aside garbage inputs that have no reasonable number value (like “4a.14x.10.14”). For example, the string “1e500” cannot be represented, typically, in software (except as the infinity). The string “18446744073709551616” cannot be represented as either a 64-bit floating-point value or a 64-bit integer. In some instances, we need to “round” the value: the string “1.0000000000000005” cannot be represented exactly as a 64-bit floating-point value, so we need to round it down to “1.0000000000000004” while we might round “1.0000000000000006” up to 1.0000000000000007.
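
As a small illustration of these edge cases, here is what the C function strtod does with two of the strings above (a sketch, not part of the benchmark):

#include <cerrno>
#include <cstdio>
#include <cstdlib>

int main() {
  // Overflow: "1e500" is not representable as a 64-bit float, so strtod
  // returns infinity and sets errno to ERANGE.
  errno = 0;
  double huge = std::strtod("1e500", nullptr);
  std::printf("1e500 -> %f (range error: %d)\n", huge, errno == ERANGE);

  // Rounding: this string is not exactly representable, so we get the
  // nearest 64-bit floating-point value.
  double rounded = std::strtod("1.0000000000000005", nullptr);
  std::printf("1.0000000000000005 -> %.17g\n", rounded);
  return 0;
}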

How expensive is it to parse numbers? To test it out, I wrote a long string containing space-separated numbers. In one case, I generated random 64-bit unsigned integers, and in another I generated random 64-bit normal floating-point numbers. Then I benchmark standard C++ parsing code that simply sums up the numbers:

std::stringstream in(mystring); // mystring holds the space-separated numbers
while(in >> x) {                // x and sum are numeric (e.g., uint64_t or double)
   sum += x;
}
return sum;
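
If you want to try it at home, here is a self-contained sketch that builds the input string and times the stream-based parsing with wall-clock time; my actual benchmark relies on hardware performance counters instead.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <sstream>
#include <string>

int main() {
  // build a long string of space-separated random 64-bit integers
  std::mt19937_64 rng(1234);
  std::string mystring;
  for (size_t i = 0; i < 100000; i++) {
    mystring += std::to_string(rng());
    mystring += ' ';
  }

  auto start = std::chrono::steady_clock::now();
  std::stringstream in(mystring);
  uint64_t x, sum = 0;
  while (in >> x) { sum += x; }
  auto finish = std::chrono::steady_clock::now();

  double seconds = std::chrono::duration<double>(finish - start).count();
  std::printf("sum = %llu, speed = %.0f MB/s\n", (unsigned long long)sum,
              mystring.size() / seconds / 1e6);
  return 0;
}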

I use long strings (100,000 numbers), and the GNU GCC 8.3 compiler on an Intel Skylake processor. I find that parsing numbers is relatively slow:

integers 360 cycles/integer 18 cycles/byte 200 MB/s
floats 1600 cycles/float 66 cycles/byte 50 MB/s

Most disks and most networks can do much better than 50 MB/s. And good disks and good networks can beat 200 MB/s by an order of magnitude.

My source code is available.

Benchmarking is hard: processors learn to predict branches

A lot of software is an intricate web of branches (if-then clauses). For performance reasons, modern processors predict the results of these branches.

In my previous post, I showed how the bulk of your running time could be due to mispredicted branches. My benchmark consisted in writing 64 million random integers to an array. When I tried to only write odd random integers, the performance became much worse, due to the mispredictions.

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Why 64 million integers and not, say, 2000? If you run just one test, then it should not matter. However, what if you are running multiple trials? Quickly, the number of mispredicted branches falls to zero. The numbers on an Intel Skylake processor speak for themselves:

trial mispredicted branches (Intel Skylake)
1 48%
2 38%
3 28%
4 22%
5 14%

The “learning” keeps on going as you run more and more trials: eventually, the fraction of mispredicted branches falls to about 2%.

That is, as you keep measuring how long the same task takes, it gets faster and faster because the processor learns to better predict the outcome. The quality of the “learning” depends on your exact processor, but I would expect that better, newer processors should learn better. Importantly, from run to run, we generate the same random integers.

The latest server processors from AMD learn to predict the branch perfectly (within 0.1%) in under 10 trials.

trial mispredicted branches (AMD Rome)
1 52%
2 18%
3 6%
4 2%
5 1%
6 0.3%
7 0.15%
8 0.15%
9 0.1%

This perfect prediction on the AMD Rome falls apart if you grow the problem from 2000 to 10,000 values: the best prediction goes from a 0.1% error rate to a 33% error rate.

You should probably avoid benchmarking branchy code over small problems.

My code is available.

Credit: The AMD Rome numbers were provided by Velu Erwan.

Further reading: A case for (partially) TAgged GEometric history length branch prediction (Seznec et al.)

Mispredicted branches can multiply your running times

Modern processors are superscalar, meaning that they can execute many instructions at once. For example, some processors can retire four or six instructions per cycle. Furthermore, many of these processors can initiate instructions out-of-order: they can start working on instructions that appear much later in the code.

Meanwhile most code contains branches (if-then clauses). These branches are often implemented as “jumps” where the processor either starts running instructions further away, or continues on its current path.

It is hard to reconcile branches with out-of-order superscalar execution. To do so, processors have sophisticated branch predictors. That is, the processor tries to predict the future. When it sees a branch, and thus a jump, it tries to guess which way it will go.

This often works quite well. For example, most loops are actually implemented as branches. At the end of each iteration in the loop, the processor must predict whether there will be a next iteration. It is often safe for the processor to predict that the loop will continue (forever). The processor will only mispredict one branch per loop in this manner.

There are other common examples. If you are accessing the content of an array, many languages will add “bound checking”: before accessing the array value, there will be a hidden check to see whether the index is valid. If the index is not valid, then an error is generated, otherwise the code proceeds normally. Bound checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.

What happens if the branch is hard to predict?

What happens inside the processor is that all of the instructions that were executed but that follow the mispredicted branch must be cancelled and the computation must start anew. You should expect a penalty of more than 10 cycles for each mispredicted branch. It can multiply the running times.

Let us look at some simple code where we write out random integers to an output array:

while (howmany != 0) {
    out[index] =  random();
    index += 1;
    howmany--;
}

We can generate a decent random number in about 3 cycles on average. That is, even though the total latency of the random number generator might be 10 cycles, our processor is superscalar, so it can work on several random-number computations at once. Thus we may generate a new random number every 3 cycles or so.

Let us modify slightly this function so that we only write out the odd integers:

while (howmany != 0) {
    val = random();
    if( val is an odd integer ) {
      out[index] =  val;
      index += 1;
    }
    howmany--;
}

Naively we might think that this new function could be faster. Indeed, we might only write out one out of two integers. We have a branch, but checking whether an integer is odd requires checking a single bit.

I benchmarked these two functions in C++ using a Skylake processor:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer

The second function takes about five times longer!

Is there something you can do? Yes. You can just remove the branch. You can characterize an odd integer by the fact that its bitwise logical AND with the value 1 is one. The trick is to increment the array index by one only when the random value is odd.

while (howmany != 0) {
    val = random();
    out[index] = val;
    index += (val bitand 1);
    howmany--;
}

In this new version, we always write the random value to the output array, even when it is not needed. At a glance, it looks wasteful. However, it does away with the mispredicted branches. In practice the performance is nearly as good as the original code, and much better than the version with branches:

write all random numbers 3.3 cycles per integer
write only odd random numbers 15 cycles per integer
branch removed 3.8 cycles per integer
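
For reference, here is a minimal C++ rendition of the branchy and branchless loops; it is a sketch, with std::mt19937_64 standing in for the random number generator used in my benchmark.

#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// branchy: only write odd values, one hard-to-predict branch per iteration
size_t write_odd_branchy(std::mt19937_64 &rng, uint64_t *out, size_t howmany) {
  size_t index = 0;
  while (howmany != 0) {
    uint64_t val = rng();
    if ((val & 1) == 1) { out[index] = val; index += 1; }
    howmany--;
  }
  return index;
}

// branchless: always write, but only advance the index when the value is odd
size_t write_odd_branchless(std::mt19937_64 &rng, uint64_t *out, size_t howmany) {
  size_t index = 0;
  while (howmany != 0) {
    uint64_t val = rng();
    out[index] = val;
    index += (val & 1);
    howmany--;
  }
  return index;
}

int main() {
  const size_t howmany = 1000000;
  std::vector<uint64_t> out(howmany);
  std::mt19937_64 rng1(42), rng2(42); // same seed: both versions see the same values
  std::printf("%zu %zu\n", write_odd_branchy(rng1, out.data(), howmany),
              write_odd_branchless(rng2, out.data(), howmany));
  return 0;
}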

Could the compiler have solved this problem on its own? In general, the answer is negative. Sometimes compilers have some options to avoid branches entirely even if there is an if-then clause in the original code. For example, branches can sometimes be replaced by “conditional moves” or other arithmetic tricks. However, there are tricks that compilers cannot use safely.

An important take-away is that mispredicted branches are not a minor concern, they can have a large effect.

My source code is available.

Further reading: Processors learn to predict branches (follow-up blog post).

Science and Technology (October 12th 2019)

  1. In many countries, like Canada, there is relatively little private (business) research. Meanwhile, other research indicates that private research is precisely the kind of research that leads directly to productivity growth. Private research seems concentrated in hubs like the Silicon Valley. To entice businesses into doing research, the Canadian government has pursued an aggressive tax credit strategy: roughly speaking, research becomes a tax avoidance strategy. It does not work well.
  2. Artificial ovaries work in mice, and we think that they may work in women too.
  3. Alzheimer’s research has failed us despite massive investments. It might be time for a reset.
  4. Drinking alcohol triggers a hormone (FGF21) that makes you more likely to drink water.
  5. Dog owners are less likely to die.
  6. The greatest inequality might be in dating and sex. The bottom 80% of men (in terms of attractiveness) are competing for the bottom 22% of women and the top 78% of women are competing for the top 20% of men. In other words, a small number of men have their pick of all of the women and most guys are unattractive to most women.
  7. The U.S. Navy looked at skin cancer rates among its staff. They found the highest rate of skin cancer in people working essentially indoors.
  8. Fruit flies exposed to a combo of three different drugs lived 48% longer, even though the individual effect of each drug is relatively small.
  9. Moderate alcohol consumption is associated with reduced inflammation and improved responses to vaccination.
  10. Choline might protect against Alzheimer’s. Choline is found in meat among other places.
  11. You may have read that we find symmetric faces more attractive. A new study challenges this claim.
  12. Iron might be an aging factor.
  13. Climate models routinely predict higher temperatures than what is actually observed. Where does the heat go? A paper in Nature claimed that the heat in question went into the ocean. Nature withdrew the paper in question as it contained too many data processing mistakes.
  14. Alpha-ketoglutarate can significantly extend lifespan and healthspan in mice. You can get it in the form of cheap supplements.

The price of a MacBook Air worldwide

Apple sells identical laptops worldwide. There might be small differences with respect to power adaptors and so forth, but the laptops are the same internally. Of course, there are differences in taxes, but Apple quotes prices with taxes included. So I went shopping for a basic MacBook Air, in different countries, using Apple’s web site.

country local price price in US dollars
USA US$1099 US$1099
Canada CA$1449 US$1089
Japan ¥119,800 US$1120
Australia A$1,699 US$1146
France 1 249,00 € US$1374

According to these numbers, the cheapest MacBooks are in Canada and the most expensive ones are in France.

In countries where taxes vary by province or state, the Apple tax estimate might be wrong for some locations.

Science and Technology links (September 28th 2019)

  1. Researchers have effectively rejuvenated the damaged skin of mice by using “exosomes”. These are packages that cells send in their environment, and it appears that other cells react positively to exosomes in some conditions. The therapy is non-invasive (no needle!) and it appears to be superior to existing skin rejuvenation treatments (e.g., retinol, stem cells). Skin thickness and collagen production were both improved.
  2. Researchers have effectively created new neurons in the brain of mice by converting existing glial cells (which are abundant) into neurons. This could be used one day to help patients recover from brain injuries or strokes.
  3. About one in six deaths in the USA is due to lead exposure. Lead accumulates in your body over time and significantly increases your risks of cardiovascular diseases, among other bad things. There are chelation therapies to lower your lead levels, but they are not in wide use even though chelation is surprisingly effective at improving the health of some sick people.

Doubling the speed of std::uniform_int_distribution in the GNU C++ library

The standard way in C++ to generate a random integer in a range is to use std::uniform_int_distribution. The current implementation of std::uniform_int_distribution in the GNU C++ library (libstdc++) to generate an integer in the interval [0,range] looks as follows.

scaling = random_range / range;   // first division
limit = range * scaling;
do                                // rejection loop: discard values beyond 'limit'
   answer = random;
while ( answer >= limit);
return  answer / scaling;         // second division

Typically, it requires two divisions by invocation. Even on recent processors, integer division is relatively expensive.

We can achieve the same algorithmic result with at most one division, and typically no division at all, without requiring more calls to the random number generator. It speeds things up on various systems. It was recently added to the Swift language.
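
In a standalone form, outside of libstdc++, the technique might look like the following sketch; it assumes a compiler with the unsigned __int128 extension and draws its random words from std::mt19937_64 (see the reference at the end of this post for the full analysis).

#include <cstdint>
#include <cstdio>
#include <random>

// Return a uniform integer in [0, range), using a full 64x64->128-bit product.
// A division is only needed on the rare rejection path.
uint64_t random_bounded(std::mt19937_64 &rng, uint64_t range) {
  uint64_t x = rng();
  unsigned __int128 m = (unsigned __int128)x * range;
  uint64_t l = (uint64_t)m;            // low 64 bits
  if (l < range) {                     // rare: the result could be biased
    uint64_t t = (0 - range) % range;  // 2^64 mod range
    while (l < t) {                    // reject and draw again
      x = rng();
      m = (unsigned __int128)x * range;
      l = (uint64_t)m;
    }
  }
  return (uint64_t)(m >> 64);          // high 64 bits: uniform in [0, range)
}

int main() {
  std::mt19937_64 rng(1234);
  std::printf("%llu\n", (unsigned long long)random_bounded(rng, 52));
  return 0;
}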

Travis Downs suggested that someone should try to fix the GNU C++ implementation. So I prepared and submitted a patch. It is currently under review.

The main challenge is that we need to be able to compute the “full” product. E.g., given two 64-bit integers, we want the 128-bit result; given two 32-bit integers we want the 64-bit result. This is fast on common processors.

The 128-bit product is not natively supported in C/C++ but can be achieved with the __int128 128-bit integer extension when it is available. Thankfully, we can check for __int128 support, and when the support is lacking, we can fall back on another approach. The 128-bit integer extension appears in many compilers (LLVM Clang, GNU GCC, Intel icc), but even within a given compiler (say GNU GCC), you cannot rely on the extension being always present, since the compiler implementation might be system specific.

The code I ended up with is somewhat ugly, but it works:

      // compute the full 128-bit product of a random 64-bit word and the range
      __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
      uint64_t __lsb = uint64_t(__product);
      if(__lsb < __uerange)
      {
        // rare slow path: reject words that would bias the result
        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
        while (__lsb < __threshold)
        {
          __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
          __lsb = uint64_t(__product);
        }
      }
      // the high 64 bits of the product are uniform over the target interval
      __ret = __product >> 64;

It mostly avoids divisions. I designed a simple random shuffling benchmark that mostly just calls std::shuffle which, in turn, relies on std::uniform_int_distribution. Given an array of one million elements, I get the following timings on a Skylake processor with GNU GCC 8.3:

before patch 15 ns/value
after patch 8 ns/value

For fun, let us compare with the implementation that is present in the Swift standard library:

     var random: T = next()
     var m = random.multipliedFullWidth(by: upperBound)
     if m.low < upperBound {
       let t = (0 &- upperBound) % upperBound
       while m.low < t {
         random = next()
         m = random.multipliedFullWidth(by: upperBound)
       }
     }
     return m.high

I think it is more readable in part because the Swift language has built-in support for the full multiplication. It is somewhat puzzling that the C++ language does not see fit to make it easy to compute the full product between two integers.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019