GNU GCC on x86 does not round floating-point divisions to the nearest value

I know that floating-point arithmetic is a bit crazy on modern computers. For example, floating-point addition is not associative:

0.1+(0.2+0.3) == 0.599999999999999978
(0.1+0.2)+0.3 == 0.600000000000000089

But, at least, this is fairly consistent in my experience. You should simply not assume that fancy properties like associativity hold in the real world.

Today I stumbled on a fun puzzle. Consider the following:

double x = 50178230318.0;
double y = 100000000000.0;
double ratio = x/y;

If God did exist, the variable ratio would be 0.50178230318 and the story would end there. Unfortunately, there is no floating-point number that is exactly 0.50178230318. Instead it falls between the floating-point number 0.501782303179999944 and the floating-point number 0.501782303180000055.

It is important to be a bit more precise. The 64-bit floating-point standard represents numbers using a 53-bit mantissa multiplied by a power of two. So let us spell it out the way the computer sees it:

0.501782303179999944 == 4519653187245114  * 2 ** -53
0.501782303180000055 == 4519653187245115  * 2 ** -53

We have to pick the mantissa 4519653187245114 or the mantissa 4519653187245115. There is no way to represent exactly anything that falls in-between using 64-bit floating-point numbers. So where does 0.50178230318 fall exactly? We have approximately…

0.50178230318 = 4519653187245114.50011795456 * 2 ** -53

So the number is best approximated by the larger of the two values (0.501782303180000055).
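Here is one way to check these numbers, using Python's fractions module for exact rational arithmetic:

```python
import math
from fractions import Fraction

# The exact ratio 50178230318 / 100000000000, as a rational number.
exact = Fraction(50178230318, 100000000000)

# Scale by 2**53 to locate it between the two candidate mantissas.
scaled = exact * 2**53
lower = math.floor(scaled)    # 4519653187245114
print(float(scaled - lower))  # about 0.50011795456: closer to the upper mantissa

# Python parses the literal with round-to-nearest, so it picks the upper value:
print(0.50178230318 == 4519653187245115 * 2.0**-53)  # True
```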

Goldberg in What every computer scientist should know about floating-point arithmetic tells us that rounding must be to the nearest value:

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). (…) One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds.

Python gets it right:

>>> "%18.18f" % (50178230318.0/100000000000.0)
'0.501782303180000055'

JavaScript gets it right:

> 0.50178230318 == 0.501782303180000055
true
> 0.50178230318 == 0.501782303179999944
false
So the story would end there, right?

Let us spin up the GNU GCC 7 compiler for x86 systems and run the following C/C++ program:

#include <stdio.h>
int main() {
  double x = 50178230318.0;
  double y = 100000000000.0;
  double ratio = x/y;
  printf("x/y  = %18.18f\n", ratio);
}

Can you predict the result?

$ g++ -o round round.cpp
$ ./round
x/y  = 0.501782303179999944

So GNU GCC actually picks the smaller, and thus furthest, value instead of the nearest value.

You may doubt me so I have created a docker-based test.

You might think that it is a bug that should be reported, right?

There are dozens if not hundreds of similar reports to the GNU GCC team. They are being flagged as invalid.

Let me recap: the GNU GCC compiler may round the result of a division between two floating-point numbers to a value that is not the nearest. And it is not considered a bug.

The explanation is that the compiler first rounds to nearest using 80 bits and then rounds again (this is called double rounding). This is what fancy numerical folks call FLT_EVAL_METHOD = 2.
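We can simulate this double rounding with exact rational arithmetic. The sketch below (in Python, with a hand-rolled round-to-nearest, ties-to-even) rounds the exact quotient first to a 64-bit significand, as the x87 unit does, and then to 53 bits; it recovers the smaller value, matching what GCC produces:

```python
from fractions import Fraction

def round_to_bits(x, bits):
    # Round x (assumed to lie in [1/2, 1)) to `bits` significant bits,
    # using round-to-nearest, ties-to-even.
    scaled = x * 2**bits
    lower = scaled.numerator // scaled.denominator
    frac = scaled - lower
    if frac > Fraction(1, 2) or (frac == Fraction(1, 2) and lower % 2 == 1):
        lower += 1
    return Fraction(lower, 2**bits)

exact = Fraction(50178230318, 100000000000)

# Direct rounding to 53 bits, as IEEE 754 requires: the larger candidate.
direct = round_to_bits(exact, 53)
print(direct == Fraction(4519653187245115, 2**53))   # True

# Double rounding: first to the x87's 64-bit significand, then to 53 bits.
# The first rounding lands exactly halfway between two 53-bit values,
# and ties-to-even then rounds down to the smaller candidate.
twice = round_to_bits(round_to_bits(exact, 64), 53)
print(twice == Fraction(4519653187245114, 2**53))    # True
```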

However, the value of FLT_EVAL_METHOD remains at 2 even if you add optimization flags such as -O2, and yet the result will change. The explanation is that the optimizer figures out the solution at compile-time and does so ignoring the FLT_EVAL_METHOD value. Why it is allowed to do so is beyond me.

Maybe you think it does not matter. After all, the value is going to be close, right? However, if you are an experienced programmer, you know the value of having deterministic code. You run the code and you always get the same results. If the results depend, some of the time, on your exact compiler flags, it makes your life much more difficult.

You can also try to pass GNU GCC the flags -msse -mfpmath=sse, as experts recommend, but as my script demonstrates, it does not solve the issue (and then you get FLT_EVAL_METHOD = -1). You need to also add an appropriate target (e.g., -msse -mfpmath=sse -march=pentium4). In effect, when using GNU GCC, you cannot get away from specifying a target. The flags -msse -mfpmath=sse alone will silently fail to help you.

Some people have recommended using other flags to switch the compiler into pc64 mode (-pc64). While it would fix this particular example, it does not fix the general problem of floating-point accuracy. It will just create new problems.

If you are confused as to why all of this could be possible without any of it being a bug, welcome to the club. My conclusion is that you should probably never compile C/C++ using GNU GCC for a generic x86 target. It is broken.

You can examine the assembly on godbolt.

Note: Please run my tests in the specified docker images so that you get the exact same configuration as I do.

Science and Technology links (June 20th 2020)

  1. UCLA researchers have achieved widespread rejuvenation in old mice through blood plasma dilution, a relatively simple process.

    (…) these results establish broad tissues rejuvenation by a single replacement of old blood plasma with physiologic fluid: muscle repair was improved, fibrosis was attenuated, and inhibition of myogenic proliferation was switched to enhancement; liver adiposity and fibrosis were reduced; and hippocampal neurogenesis was increased.(…) These findings are most consistent with the conclusion that the age-altered systemic milieu inhibits the health and repair of multiple tissues in the old mice, and also exerts a dominant progeric effect on the young partners in parabiosis or blood exchange.

    They plan to conduct clinical trials in human beings “soon”.

  2. It used to be that universities would happily pay large sums to private publishers like Elsevier for access to the research articles. The prestigious MIT joins the ranks of the universities who are challenging Elsevier.
  3. Medical doctors follow clinical practice guidelines. In turn, the producers of these guidelines are often funded by the industry, and they fail to disclose it.
  4. The upcoming Sony PlayStation 5 will have a disk with a bandwidth of over 5 GB/s. For comparison, good Apple laptops typically achieve only about 2 GB/s, and older conventional disks are 10 to 20 times slower.

    Our already-low tolerance for slow and unresponsive applications and web sites will fall. Loading screens, loading bars, and similar “make the user wait” strategies will become more and more annoying. We will come to expect application updates to occur in the blink of an eye.

    Programmers used to blame disk and network performance, but these excuses will not hold in the near future. More and more, poor performance will be due to poor software engineering. I gave a talk recently on the topic: data engineering at the speed of your disk (slides).

Update: Someone objected that disks with 6Gb/s bandwidth are already commonplace and have been inexpensive for many years. That is true, but 6Gb/s is 10 times slower than 5 GB/s. Notice that ‘b’ stands for bit whereas ‘B’ stands for byte.

Computational overhead due to Docker under macOS

For my programming work, I tend to assume that I have a Linux environment. That is true whether I am under Windows, under macOS or under a genuine Linux.

How do I emulate Linux wherever I go? I use docker containers. Effectively, the docker container gives me a small subsystem where everything is “as if” I was under Linux.

Containers are a trade-off. They offer a nice sandbox where your code can run, isolated from the rest of your system. However, they also have lower performance. Disk and network access is slower. I expect that it is true wherever you run your containers.

However, part of your computing workload might be entirely computational. If you are training a model or filtering data, you may not be allocating memory, writing to disk or accessing the network. In such cases, my tests suggest that you have pretty much the same performance whether you are running your tasks inside a container, or outside of the container… as long as your host is Linux.

When running docker under Windows or macOS, docker must rely on a virtual machine. Under Windows, it may use VirtualBox or other solutions, depending on your configuration, whereas it appears to use Hyperkit under macOS. These virtual machines are highly efficient, but they still carry an overhead.

Let me benchmark a simple Go program that just repeatedly computes random numbers and compares them with the value 0. It prints out the result at the end.

package main

import (
        "fmt"
        "math/rand"
)

func main() {
        counter := 0
        for n := 0; n < 1000000000; n++ {
                if rand.Int63() == 0 {
                        counter += 1
                }
        }
        fmt.Printf("number of zeros: %d \n", counter)
}

It is deliberately simple. I am going to use Go 1.14 (always).

Under macOS, I get that my program takes 11.7 s to run.

$ go build -o myprogram
$ time ./myprogram
number of zeros: 0

real	0m11.911s
user	0m11.660s
sys	0m0.034s

I am ignoring the “sys” time since I only want the computational time (“user”).

Let me run the same program after starting a docker container (from an ubuntu 20.10 image):

$ go build -o mydockerprogram
$ time ./mydockerprogram
number of zeros: 0

real	0m12.038s
user	0m12.026s
sys	0m0.025s

So my program now takes 12 s, so 3% longer. Observe that my methodology is not fool-proof: I do not know that this 3% slowdown is due to the overhead incurred by docker. However, it bounds the overhead.

Let me do something fun. I am going to start the container and run my program in the container, and then shut it down.

$ time run 'bash -c "time ./mydockerprogram"'
number of zeros: 0

real	0m12.012s
user	0m12.003s
sys	0m0.008s

real	0m12.545s
user	0m0.086s
sys	0m0.041s

It now takes 0.5 s longer. That is the time it takes for me to start a container, do nothing, and then shut it down. Doing it in this manner takes 8% longer than running it natively in macOS.

Of course, if you run many small jobs, the 0.5 s is going to hurt you. It may come to dominate the running time.

If you want to squeeze every ounce of computational performance out of your machine, it is likely that you should avoid the docker overhead under macOS. A 3% overhead may prove to be unacceptable. However, for developing and benchmarking your code, it may well be an acceptable trade-off.

Reusing a thread in C++ for better performance

In a previous post, I measured the time necessary to start a thread, execute a small job and return.

auto mythread = std::thread([] { counter++; });

The answer is thousands of nanoseconds. Importantly, that is the time as measured by the main thread. That is, sending the query, and getting back the result, takes thousands of nanoseconds and thousands of cycles. The work in my case is just incrementing a counter: any task more involved will increase the overall cost. The C++ standard API also provides an async function to call one function and return: it is practically equivalent to starting a new thread and joining it, as I just did.

Creating a new thread each time is fine if you have a large task that needs to run for milliseconds. However, if you have tiny tasks, it won’t do.
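To get a feel for the order of magnitude, here is a rough analogue in Python (not the C++ benchmark itself; Python threads are more expensive still): we time the creation, execution and joining of a thread whose only work is to increment a counter:

```python
import threading
import time

# Measure the round-trip cost of spawning a thread for a tiny task:
# create, start, run (increment a counter), join.
counter = 0

def work():
    global counter
    counter += 1

n = 200
start = time.perf_counter()
for _ in range(n):
    t = threading.Thread(target=work)
    t.start()
    t.join()
elapsed = (time.perf_counter() - start) / n

print(f"{elapsed * 1e9:.0f} ns per start/join")
```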

What else could you do? Instead of creating a thread each time, you could create a single thread. This thread loops and periodically sleeps, waiting to be notified that there is work to be done. I am using the C++11 standard approach.

  std::thread thread = std::thread([this] {
    while (!exiting) {
      std::unique_lock<std::mutex> lock(locking_mutex);
      cond_var.wait(lock, [this]{return has_work||exiting;});
      if (exiting) {
        return;
      }
      // ... do the work ...
      has_work = false;
    }
  });

It should be faster and overall more efficient. You should expect gains ranging from 2x to 5x. If you use a C++ library with thread pools and/or workers, it is likely to adopt such an approach, albeit with more functionality and generality. However, the operating system is in charge of waking up the thread and may not do so immediately so it is not likely to be the fastest approach.
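For illustration, here is the same pattern sketched in Python rather than C++: one long-lived worker thread sleeps on a condition variable until work is submitted. The class and method names are mine, and the "work" is again just incrementing a counter:

```python
import threading

# A long-lived worker thread that sleeps on a condition variable until
# work is submitted, mirroring the C++ mutex/condition-variable approach.
class Worker:
    def __init__(self):
        self.cond = threading.Condition()
        self.has_work = False
        self.exiting = False
        self.counter = 0
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            with self.cond:
                self.cond.wait_for(lambda: self.has_work or self.exiting)
                if self.exiting:
                    return
                self.counter += 1          # do the work
                self.has_work = False
                self.cond.notify_all()     # tell the caller we are done

    def submit_and_wait(self):
        with self.cond:
            self.has_work = True
            self.cond.notify_all()
            self.cond.wait_for(lambda: not self.has_work)

    def stop(self):
        with self.cond:
            self.exiting = True
            self.cond.notify_all()
        self.thread.join()

w = Worker()
for _ in range(3):
    w.submit_and_wait()
w.stop()
print(w.counter)  # 3
```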

What else could you do? You could simply avoid system dependencies as much as possible and just loop on an atomic variable. The downside of the tight loop (spinlock) approach is that your thread might fully use the processor while it waits. However, you should expect it to get to work much quicker.

  std::thread thread = std::thread([this] {
    while (true) {
      while (!has_work.load()) {
        if (exiting.load()) { return; }
      }
      // ... do the work ...
      has_work.store(false);
    }
  });

The results will depend crucially on your processor and on your operating system. Let me report the rough numbers I get with an Intel-based Linux box and GNU GCC 8.

new thread each time 9,000 ns
async call 9,000 ns
worker with mutexes 5,000 ns
worker with spinlock 100 ns

My source code is available.

Science and Technology links (June 6th 2020)

  1. A small number of people are responsible for a disproportionate number of inventions and innovations. Why are these people different? Using neuroimaging techniques, scientists find that superior individuals may not have distinct processes per se, but rather they use common processes differently, in a more economical fashion:

    (…) extraordinary creative ability is not the outcome of a unique set of neurocognitive processes; rather, it is associated with the same neural mechanisms that support ordinary creativity, but to a different degree (…). Indeed, our findings would support the argument that similar creative outcomes (…) come about with a less extensive recruitment of brain networks shown to contribute to creative thought (…), which we speculate may allow eminent creators to pursue concurrently, for example, multiple lines of creative thought.

    This suggests that most of us with a healthy brain could potentially become highly creative thinkers. It would seem important to determine whether that is true.

  2. The average female mammal lives about 20% longer than the corresponding male. In human beings, women have only an 8% advantage (men have significantly shorter lives). This difference is not attributed to the riskier behavior of males. Rather, a credible explanation is that males have a weaker immune system.
  3. Sea urchins can regenerate appendages throughout their lives. They come in different subspecies with long and short lives. The end of our chromosomes contains repeated sequences called telomeres. These telomeres get shorter with every cell division, unless they are replenished (e.g., via telomerase). It is often theorized that aging is characterized or explained by telomere shortening. However, in sea urchins, the telomeres do not get shorter with age because of constant telomerase activity. And yet, there are short-lived (i.e. aging) sea urchins.
  4. Scientists have created hair-bearing human skin from stem cells. Though the authors do not stress this possibility, it seems obvious that it could lead to therapies for treating hair loss in human beings:

    Moreover, we show that skin organoids form planar hair-bearing skin when grafted onto nude mice. Together, our results demonstrate that nearly complete skin can self-assemble in vitro and be used to reconstitute skin in vivo.

    This was not tested in human beings. Yet some doctors are optimistic:

    This achievement places us closer to generating a limitless supply of hair follicles that can be transplanted to the scalps of people who have thinning or no hair.

  5. The sun creates skin damage over time and contributes to a particular form of skin aging that is quite visible in some older people. The damage goes deep in the skin and is therefore challenging. Scientists have used stem cells to attempt to reverse the damage. In at least some (human) patients, the results can be characterized as an extensive reversal of the damage deep in the skin.
  6. Our cells need a compound called NAD+ to produce energy. As we age, we tend to have less and less NAD+. A particular disease called mitochondrial myopathy leads to NAD+ deficiency. Scientists found that niacin (an inexpensive supplement) was an efficient NAD+ booster in these patients. (It does not mean that you should take niacin.)

How Innovation Works (book review)

I read How Innovation Works by Matt Ridley in a few hours. It is a delicious book.

Ridley distinguishes invention from innovation. The inventor creates something new, the innovator applies the novelty to change the world. Jeff Bezos innovated with his Amazon site, but he may not be much of an inventor. Scientists often invent new “things” but they often fail to innovate.

Innovation is often quite positive and Ridley attributes to it much of the large gains in wealth and well-being that we have known. So what makes innovation possible, or what causes innovation?

Ridley does not know exactly. However, he identifies some conditions that favour innovation:

  • You need some freedom, the freedom to try new enterprises, the freedom to apply new ideas.
  • You need some wealth. Necessity does not reliably drive innovation according to Ridley.
  • You need the ability to make mistakes and learn quickly from them. Innovation does not happen overnight from the brain of a genius. It is a deeply iterative process.

Thus we get little innovation in the nuclear industry because it is difficult to get approval for a new nuclear power plant. And it is difficult to experiment. However, we get a lot of innovation on the Web where anyone can try to offer a new service and where it is easy to iterate quickly.

What is the role of government in innovation? It is often believed that government drives innovation. However, it does not match the historical record. Until sometime in the XXth century, governments did not have any idea that they could promote innovation. Yet innovation did occur. Furthermore, governments decided to adopt a laissez-faire policy with respect to the Web, which enabled massive innovation.

When you look at how much difficulty the state has even embracing innovation, you cannot believe that it is also the source of our innovation. Ridley, like myself, is pessimistic regarding government interventions like patent protections. We get more innovation when people are free to iterate and reuse ideas. Furthermore, focusing the innovators on patents takes time and money away from innovation.

I am also skeptical of the ability of research grants to spur innovation, even when said grants are tied to industry collaboration. I do think that governments can play a positive role, besides protecting free enterprise and free markets: governments can issue prizes and contracts. For example, NASA may not be able to innovate and get us in space cheaply. However, NASA can offer contracts to space companies. Governments could similarly encourage progress in medicine by giving prizes to the first innovators to reach certain milestones. For example, we could give 10 billion dollars to the first team to stop cognitive decline in Alzheimer’s patients, at a reasonable cost. I get the feeling that Ridley would agree.

Research grants tend to favour the incumbents. Research prizes are more likely to reward innovators.

It is true that the innovators are not rewarded for nearly all of the wealth that their innovations generate… but innovation is frequently a team sport. We may think that only the Wright brothers invented the aeroplane and made it work, leading to all the marvellous innovations… but many people had a hand in their work. They corresponded with enthusiasts who were experimenting with gliders. They had an employee design a very light and powerful engine. It is not clear that, by trying to ensure that innovators get a larger share of the benefits, we get more innovation; and this assumes that the innovators are always fairly identified. Is the first person to patent an idea the sole innovator?

Where will innovation happen in the near future? The answer that seems obvious today, and the one that Ridley goes to, is China. China seems uniquely able to try new things as far as technology and industry go. Ridley is not pleased by this likely outcome given that China lacks democracy. Instead, Ridley believes that we could renew our habit of creating innovation elsewhere if only we choose to.

The Go compiler needs to be smarter

One of my favorite languages is the Go language. I love its simplicity. It is popular and useful in a cloud setting. Many popular tools are written in Go, and for good reasons.

I gave a talk on Go last year and I was asked for a criticism of Go. I do not mind that Go lacks exceptions or generics. These features are often overrated.

However, as charming as Go might be, I find that its compiler is not on par with what I expect from other programming languages. It could be excused when Go was young and immature. But I think that the Go folks now need to tackle these issues.

My first beef with the compiler is that it is shy about inlining. Inlining is the process by which you bring  a function into another function, bypassing the need for a function call. Even when the function call is inexpensive, inlining often brings many benefits.

Go improved its inlining strategies over time, but I would still describe it as “overly shy”.

Let me consider the following example where I first define a function which sums the elements of an array, and then I call this function on an array I just defined:

func sum(x []uint64) uint64 {
    var sum = uint64(0)
    for _, v := range x {
        sum += v
    }
    return sum
}

func fun() uint64 {
    x := []uint64{10, 20, 30}
    return sum(x)
}

Whether you use Rust, Swift, C, C++… you expect a good optimizing compiler to basically inline the call to the ‘sum’ function and then to figure out that the answer can be determined at compile time and to optimize the ‘fun’ function to something trivial.

Not so in Go. It will construct the array and then call the ‘sum’ function.

In practice, it means that if you want good performance in Go, you often have to manually inline your functions. And I mean: fully inline. You have to write a really explicit function if you want the Go compiler to optimize the computation away, like so:

func fun3() uint64 {
    x := [3]uint64{10001, 21, 31}
    return x[0] + x[1] + x[2]
}

My second concern with the Go language is that it has no real concept of a runtime constant. That is, you have compile-time constants, but if you have a variable that is set once in the life of your program and never changes, Go will still treat it as if it could change. The compiler does not help you.

Let us take an example. Go has added nice functions that give you access to fast processor instructions. For example, most x64 processors have a popcnt instruction that gives you the number of 1-bits in a 64-bit word. It used to be that the only way to access this instruction in Go was by writing assembly. That has been resolved. So let us put this code into action:

import "math/bits"

func silly() int {
    return bits.OnesCount64(1) + bits.OnesCount64(2)
}
This function should return 2 since both values provided (1 and 2) have exactly one bit set. I bet that most C/C++ compilers can figure that one out. But we may excuse Go for not getting there.

Go needs to check, before using the popcnt instruction, that the processor supports it. When you start Go, it queries the processor and fills a variable with this knowledge. This could be done at compile-time but then your binary would crash or worse when run on a processor that does not support popcnt.

In a language with just-in-time compilation like Java or C#, the processor is detected at compile-time so no check is needed. In less fanciful languages like C or C++, programmers need to check for themselves what the processor supports.

I can excuse Go for checking that popcnt is supported each and every time that the ‘silly’ function is called. But that is not what Go does. Go checks it twice:

        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc115
        movl    $1, AX
        popcntq AX, AX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc85
        movl    $2, CX
        popcntq CX, CX

That is because the compiler does not trust, or cannot determine, that the variable ‘runtime.x86HasPOPCNT’ is a runtime constant.

Some people will object that such checks are inexpensive. I think that this view should be challenged:

  • As is apparent in the assembly code I provide, you might be doubling or at least increasing by 50% the number of instructions required. A comparison and a jump is cheap, but so is popcnt (some processors can retire two popcnt per cycle!). Increasing the number of instructions makes code slower.
  • It is true that the branch/jump is likely to be correctly predicted by the processor. This makes the guarding code much cheaper than a branch that could sometimes be mispredicted. But that does not mean that you are not getting hurt:

    Even when it is impossible to remove all branches, reducing the number of branches “almost always taken” or “almost never taken” may help the processor better predict the remaining branches.  (…) A possible simplified explanation for this phenomenon is that processors use the history of recent branches to predict future branches. Uninformative branches may reduce the ability of the processors to make good predictions.

Go’s saving grace is that it makes it easy to integrate assembly code into your code base. So you can write your performance-critical code in C, compile it, and use the result in your Go project. That is how we do it in roaring, for example. People have ported the really fast Stream VByte encoding and the very fast simdjson parser to Go, again by using assembly. It works.

However, it leaves the bulk of the Go software running at a fraction of the performance it could reach with a great optimizing compiler.

Appendix: Compiling Go with gccgo solves these particular problems. However, reportedly, the overall performance of gccgo is worse.

Science and Technology links (May 30th 2020)

  1. We might soon be able to buy memory cards with speeds nearing 4 GB/s. For comparison, an expensive and recent MacBook currently has a disk with a 2 GB/s bandwidth. The PlayStation 5 should have a 5 GB/s bandwidth.
  2. Human industry has boosted the amount of CO2 in the atmosphere. This has two predictable outcomes: slightly higher global temperature (CO2 has a mild greenhouse effect) and higher plant productivity (CO2 acts as a fertilizer). The CO2 fertilization effect is strong: a 30% gain from 1900 in photosynthesis efficiency. Moreover, higher plant productivity translates into more CO2 capture and thus it tends to reduce the quantity of CO2. Haverd et al. report that we may have underestimated the carbon sink effect of CO2 fertilization.
  3. The successful e-commerce firm, Shopify, will allow most of its employees to work remotely in the future.
  4. Human beings may have hunted mammoths by chasing them into predetermined traps.
  5. There is a theory that sending your kids to a more selective school helps them because being exposed to high-achieving peers raises their level. But it seems that this peer effect is a myth. In other words, paying a lot of money to send your kids to an exclusive school is probably a waste. (It does not imply that sending your kids to a dysfunctional school is harmless.)
  6. We should apply with care the principle that extraordinary claims require extraordinary evidence. Indeed, this principle can be used to reject results that violate the human consensus and slow the progress of science. Yet scientific progress is often characterized by a change in the consensus as we go from one realization to another.
  7. We can prevent age-related bone losses in mice by tweaking the content of their blood plasma.
  8. Many recent advances in artificial intelligence do not hold up to scrutiny. This is why you will often hear me dismiss novelty with the phrase “originality is overrated”.

    Kolter says researchers are more motivated to produce a new algorithm and tweak it until it’s state-of-the-art than to tune an existing one. The latter can appear less novel, he notes, making it “much harder to get a paper from.”

    The net result is that researchers tend to overrate novelty and originality. In practice, you often get better results by selecting time-tested approaches and ignoring the hype.

    So, how should one read research, knowing that much of it won’t stand up to scrutiny?

    1. Do not dismiss older research merely because it is older. Do the opposite: focus your energies on older work still in use.
    2. Instead of picking up papers one by one, try to find the underlying themes. In effect, dismiss each individual paper, and instead focus on the recurring themes and effects. If an idea only appears in one paper, it probably can be discarded. If it appears again and again and proves useful, it might be worth knowing.

Mapping an interval of integers to the whole 64-bit range, fairly?

In my blog post A fast alternative to the modulo reduction, I described how one might map 64-bit values to an interval of integers (say from 0 to N) with minimal bias and without using an expensive division. All one needs to do is to compute x * N ÷ 2^64 where ‘÷’ is the integer division. A division by a power of two is just a shift. Such an approach is simple and efficient.
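As a reminder, the forward map is a one-liner. In Python, with its arbitrary-precision integers, a sketch looks like this (in C you would compute the full 128-bit product and keep the high 64 bits):

```python
def fast_reduce(x, N):
    # Map a 64-bit value x to the interval [0, N) without a division:
    # take the high 64 bits of the 128-bit product x * N.
    return (x * N) >> 64

# A value halfway through the 64-bit range lands halfway through [0, N):
print(fast_reduce(2**63, 52))   # 26
assert all(0 <= fast_reduce(x, 52) < 52 for x in (0, 123456789, 2**64 - 1))
```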

Let us consider the opposite problem.

Suppose that I give you integers in the range from 0 (inclusively) to some value N (exclusively). We want to map these values to integers spanning the whole 64-bit range. Obviously, since we only have N distinct values to start with, we cannot expect our function to cover all possible 64-bit values. Nevertheless, we would like to be “as fair as possible”. Let us use only integers.

Let 2^64 ÷ N be the integer division of 2^64 by N. Let 2^64 % N be the remainder of the integer division. Because N is a constant, these are constants as well.

As a first attempt, I might try to map integer x using the function (2^64 ÷ N) * x. It works beautifully when N is a power of two, but, in general, it is a tad crude. In particular, the result of this multiplication can never exceed 2^64 – (2^64 % N), whereas it starts at value 0 when x is 0, so it is biased toward smaller values when (2^64 % N) is non-zero (which is always true when N is not a power of two).

To compensate, we need to add a value that can be as large as 2^64 % N. Mapping integers in the interval from 0 to N to integers in the interval from 0 to 2^64 % N cannot be done with a simple multiplication. We do not want to use divisions in general because they are expensive. However, we can use a shift, that is, a division by a power of two. So let us look for a map of the form (2^64 ÷ N) * x + (u * x) ÷ 2^64 for some unknown value u. We know that x can never exceed N, but that when x reaches N, the value of (u * x) ÷ 2^64 should be close to 2^64 % N. So we set (u * N) ÷ 2^64 to 2^64 % N and solve for u, which gives us that u must be at least (2^64 % N) * 2^64 ÷ N.

The values (2^64 ÷ N) and (2^64 % N) * 2^64 ÷ N need to be computed just once: let us call them m and n respectively. We finally get the function m * x + (n * x) ÷ 2^64. In Python, the code might look as follows…

def high(x, y):
    return (x * y >> 64) % 2**64

def inv(x):
    n = ((2**64) % N) * 2**64 // N
    m = 2**64 // N
    return m * (x + 1) + high(x, n)

How good is this formula? To test it out, I can check how well it inverts the approach that goes in the other direction. That is, if I plug in x * N ÷ 2^64, I would hope to get back x, or something very close, for suitable values of N. That is, I want m * (x * N ÷ 2^64) + (n * (x * N ÷ 2^64))÷2^64 to be almost x. In general, I cannot hope to find x exactly because some information was lost in the process. Yet I expect a lower bound on the error of ceil(2^64 / N). I benchmark my little routine using a probabilistic test and the results are promising. In the worst case, the error can be orders of magnitude larger than the lower bound, but most of the time, my estimated error is close to the lower bound. This suggests that though my approach is not quite suitable to get back x, with a little bit of extra mathematics and a few more instructions, I could probably make it work exactly within a strict error bound.
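As a quick sanity check, here is a self-contained Python sketch that composes the map with its inverse; the constant N is an arbitrary choice of mine for illustration:

```python
N = 1000003  # arbitrary test modulus, not a power of two

m = 2**64 // N
n = ((2**64) % N) * 2**64 // N

def high(x, y):
    # most significant 64 bits of the 128-bit product x*y
    return (x * y) >> 64

def forward(x):
    # map x in [0, 2**64) to [0, N): x * N // 2**64
    return (x * N) >> 64

def inv(x):
    # map x in [0, N) back into the 64-bit range
    return m * (x + 1) + high(x, n)

# for this N, the composition recovers x exactly
for x in range(0, N, 101):
    assert forward(inv(x)) == x
assert inv(N - 1) < 2**64
```

For this particular N the round trip is exact; for other values of N I would expect it to be exact or off by at most a small amount.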

Programming inside a container

I have a small collection of servers, laptops and desktops. My servers were purchased and configured at different times. By design, they have different hardware and software configurations. I have processors from AMD, Intel, Ampere and Rockchip. I have a wide range of Linux distributions, both old and new. I also mostly manage everything myself, with some help from our lab technician for the initial setup.

The net result is that I sometimes end up with very interesting systems that are saddled with old Linux distributions. Reinstalling Linux and keeping it safe and secure is hard. Furthermore, even if I update my Linux distributions carefully, I may end up with a different Linux distribution than my collaborators, with a different compiler and so forth. Installing multiple different compilers on the same Linux distribution is time consuming.

So what can you do instead?

I could run virtual machines. With something like VirtualBox, you can run Linux inside Windows or inside macOS. It is beautiful. But it is also slow and computationally expensive.

You can switch to containers, and Docker specifically, which have much less overhead. Docker is a ubiquitous tool in cloud computing. As a gross characterization, Docker allows you to run Linux inside Linux. It is a sandbox, but a sandbox that runs almost directly on the host. Unlike virtual machines, Docker containers run, in my tests, at “native speed” (bare metal) on computationally intensive tasks. There are reports that system interactions are slower: network connections and disk access are slower. For my purposes, it is fine.

If you must, you can also run Docker under macOS and Windows, though there will then be more overhead, I expect.

The idea of a container approach is to always start from a pristine state. So you define the configuration that your database server needs to have, and you launch it, in this precise state each time. This makes your infrastructure predictable.

It is not as perfect as it sounds. You still critically depend on the quality of the container you start from. Various hacks can be necessary if you need two applications with different requirements to run together in the same image.

Still, containers work well enough that they are basically sustaining our civilization: many cloud-based applications run in containers one way or another.

Containers were built to deploy software into production. Programming inside containers is not directly supported: you will not find much documentation about it and there is simply no business model around it. What do I mean by “programming inside containers”? I mean that I’d like to start a C programming project, decide that I will use the Linux Ubuntu 16.10 distribution, and that I will compile and run my code under Linux Ubuntu 16.10, even though my server might be running a totally different Linux distribution (or might be under macOS).

The first problem is that your disk and the disk of the container built from the image are distinct. A running image does not have free access to the underlying server (the host). Remember that it is a sandbox.

So you can do all of your work inside the image. However, remember that the point of container technology is to always start from a pristine state. If you load up an image, do some work, and leave… your work is gone. Images are immutable by design. It is a great thing: you cannot easily mess up an image by tinkering with it accidentally.

You can, after doing some work inside an image, take a snapshot of the new state, commit it and create a new image from which you would start again. It is complicated and not practical.

What else could you do? What you can do instead is keep the image stateless, as images are meant to be. The image will only contain the compiler and the build tools. There is no reason to change any of these tools. You will have all of your code in a directory, as you would normally do. To run and compile code, you will enter the image and run your commands. You can bind the repository from the host disk to the image just as you enter it.

This works much better, but there are glitches if you issue your Docker command lines directly:

  1. Depending on how Docker is configured on your machine, you may find that you are unable to read or write the disk bound to the image from within the image. A quick fix is to run the image with privileged access, but that is normally frowned upon (and unnecessary).
  2. The files that you create or modify from within the Docker image will appear on the host disk, often with strange file permissions. For example, maybe all of the files are owned by the root user. I had a research assistant who had a good workaround: he ran Linux as root all the time. I do not recommend such a strategy.

These glitches come from the strange way in which Docker deals with permissions and security. Contrary to what you may have read, it is not a simple matter of setting user and group identifiers: it may be sufficient on some systems but not on systems supporting Security-Enhanced Linux which require additional care.

And finally, you need to remember lots of complicated commands. If you are anything like me, you would rather not have to think about Docker. You want to focus all of your attention on the code.

So the solution is to use a little script. In my case I use a bash script. You can find it on GitHub. It handles messy commands and file permissions for you.

For years, I tried to avoid having to rely on a script, but it is simply unavoidable if you want to work productively.

Basically, I copy two files at the root of the directory where I want to work (Dockerfile and run), and then I type:

./run bash

And that is all. I am now in a subshell, inside the host directory. I can run programs, compile them. I have complete access to a recent Ubuntu distribution. This works even under the ARM-based servers that I have.
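The Dockerfile itself can stay tiny. Here is a hypothetical example; the distribution tag and the package list are illustrative, not the contents of my actual files:

```dockerfile
# Illustrative only: pin the distribution, install the build tools, nothing else.
FROM ubuntu:16.10
RUN apt-get update && apt-get install -y build-essential git
```

Because the image only holds tools, it never needs to change as your code evolves; your source files stay on the host.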

The run script can take other commands as well, so I can use it as part of other scripts.

Science and Technology links (May 16th 2020)

  1. Most of our processors, whether in our PCs or mobile phones, are 64-bit processors. In the case of your PC, it has been so for a couple of decades. Unfortunately, we have been stuck with 32-bit operating systems for a long time. Apple has stopped supporting 32-bit applications in the most recent version of its macOS operating system and 32-bit applications have stopped running on the iPhone quite some time ago. Microsoft will now stop providing 32-bit versions of its operating system.
  2. Moths transport pollen at night and play an important role as pollinators.
  3. If you transplant fecal microbes from old to young rats, you age the young rats. We have solid evidence that blood transfusions from old mammals to young mammals will age the young recipient (Nature, 2016). The other direction, evidently more useful, is achievable in vitro, but harder to achieve in actual organisms. In what might be a breakthrough, Horvath et al. report system-wide rejuvenation in rats using plasma (blood) transfusion: the treatment more than halved the epigenetic ages of blood, heart, and liver tissue. Details are missing and we should reserve judgement. Horvath is a well regarded scientist from UCLA.

Encoding binary in ASCII very fast

In software, we typically work with binary values. That is, we have arbitrary streams of bytes. To encode these arbitrary streams of bytes in standard formats like email, HTML, XML, JSON, we often need to convert them to a standard encoding like base64. You can encode and decode base64 very quickly.

But what if you do not care for standards and just want to go fast and have simple code? Maybe all you care about is that the string is ASCII. That is, it must be a stream of bytes with the most significant bit of each byte set to zero. In such a case, you want to convert any 7-byte value into an 8-byte ASCII value, and back.

I can do it in about five instructions (not counting stores and moves) both ways in standard C: five instructions to encode, and five instructions to decode. It is less than one instruction per byte. I could not convince myself that my solution is optimal.

// converts any value in [0, 2**56) into a value that
// could pass as ASCII (every 8th bit is zero);
// can be inverted with convert_from_ascii
uint64_t convert_to_ascii(uint64_t x) {
  return ((0x2040810204081 * (x & 0x80808080808080)) 
         & 0xff00000000000000) +
         (x & 0x7f7f7f7f7f7f7f);
}

// converts any 8 ASCII chars into an integer in [0, 2**56);
// this inverts convert_to_ascii
uint64_t convert_from_ascii(uint64_t x) {
  return ((0x102040810204080 * (x >> 56)) 
         & 0x8080808080808080) +
         (x & 0xffffffffffffff);
}

Under recent Intel processors, the pdep/pext instructions may serve the same purpose. However, they are slow under AMD processors and there is no counterpart under ARM.
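The two routines transcribe directly to Python, which makes it easy to check the round trip (a sketch for illustration; the C functions above are the intended implementation):

```python
def convert_to_ascii(x):
    # gather the top bit of each of the 7 bytes into the most significant byte
    return ((0x2040810204081 * (x & 0x80808080808080))
            & 0xff00000000000000) + (x & 0x7f7f7f7f7f7f7f)

def convert_from_ascii(x):
    # scatter the bits of the most significant byte back into the 7 bytes
    return ((0x102040810204080 * (x >> 56))
            & 0x8080808080808080) + (x & 0xffffffffffffff)

for x in [0, 1, 0x80, 0xdeadbeefcafe, 2**56 - 1]:
    encoded = convert_to_ascii(x)
    # every byte of the encoded value passes as ASCII (top bit clear)
    assert all((encoded >> (8 * i)) & 0x80 == 0 for i in range(8))
    assert convert_from_ascii(encoded) == x
```

The multiplications act as parallel shifts: the first gathers the seven high bits into the top byte, the second scatters them back.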

Science and Technology links (May 2nd 2020)

  1. As we age, we tend to produce less of NAD+, an essential chemical compound for our bodies. We can restore youthful levels of NAD+ by using freely available supplements such as NMN. It is believed that such supplements have an anti-aging or rejuvenating effect. Some scientists believe that restoring NAD+ might make the powerplants (mitochondria) of your cells youthful again. A recent paper shows that we can reverse cognitive aging in mice using NMN. The treated mice were more agile, comparable to young mice. The paper suggests that it might be beneficial for some neurodegenerative diseases. (Note: I caution you that many things that work in mice may not work in human beings.)
  2. While we reject at a visceral level discrimination based on ethnicity, sexual orientation or gender, we are collectively happy to discriminate against older people. Nobody would think it is acceptable to fire women when they become pregnant, but it is acceptable to terminate someone’s employment because they are too old. Professor Paul Ewart from Oxford University fought his employer in court because he was forced to leave his post. He is now 73 years old. I do not know Ewart, but he was a co-author of over 10 papers in 2019 and he still held research grants, so I take it that he is still active in research. Ewart wrote:

    It cannot be right to dismiss active physicists, or indeed any productive academic, at some arbitrary age. It is traumatic to be forcibly retired, especially when one’s work is still in full swing and there are new ideas to be explored. The “emeritus” status offered by Oxford, instead of full employment, is of no use to experimental scientists who need a research team and principal-investigator status to apply for their own research funding. It is simply ageism that underlies many of the arguments used to justify mandatory retirement. Age is often used as a proxy for competence and this lazy stereotype feeds off the myth that scientists have their best ideas when they are young. Indeed, studies have shown that a scientist’s most impactful work can occur at any stage in their career. (…) It is ageism that sees a 40-year-old as “filling” a post whereas a 65-year-old is “blocking” a post. Apart from providing the dignity and fulfilment of employment, there are general imperatives for people to work longer, including the growing pension burden and increases in life expectancy. Recent studies by the World Health Organization also highlight the physical and mental health benefits of working longer. (…) Ageism is endemic in our society and attitudes persist that would be recognized as shameful if they related to race, religion or sexual orientation. The University of Oxford’s claim that dismissing older academics is necessary to maintain its high standards is simply ageism, implying that academic performance deteriorates with age. If retirement policies are to be truly evidence-based, they need to be justified by reasoned estimates of proportionality that are consistent with the available data.

    Ewart showed, both with a model and hard data, that forcing older people to go only had marginal benefits for younger people and for gender diversity. It can be fair for an employer to review the performance of its staff and to let people go when they fall below a productivity threshold. You may object that it is warranted to let 73-year-olds go since they are, maybe, less productive than younger folks. But on the same account, would we let young female professors who choose to have young children go if we could establish, statistically, that female professors with young children are less productive? Of course, we would not. Our policies against older people are rooted in prejudice. On the basis of reason, we should encourage older people who are productive to keep working as long as possible. So what happened? Ewart went to court and won. How did Oxford respond? They are considering an appeal. No doubt, they will have to argue that forcing older productive people to leave is good…

  3. Economists tell us that our economies do not grow as fast as they once did and thus, we are getting richer (or less poor) at a slower rate than what might have been expected a few decades ago. There is debate regarding these numbers since it is hard to tell how much Facebook, Zoom and the iPhone XR would be worth to someone in 1980. What might be causing this stagnation? If we believe that the problem lies with economics, then financial and fiscal tools might be the solution. Ramey argues for a different cause: she suggests that we might be in a technological lull. If she is correct, then the fix might be to encourage more technological innovation rather than jump at fiscal or financial interventions.
  4. Most of the cells in our bodies can only divide so many times. This is known as the Hayflick limit and was once argued to fundamentally limit the lifespan of animals. However, we now know that stem cells, the cells that produce our specialized cells, do not face such a limit normally. The Hayflick limit comes from the fact that with each division, the tail end of our chromosome (the telomeres) get shorter. However, stem cells benefit from telomerase and thus may replenish their telomeres. Yet there are diseases where stem cells lose this ability to restore their telomeres. Researchers have shown that we can restore telomere lengths in such cases, in mice.
  5. As we age, we accumulate “senescent cells”. They are large and dysfunctional cells that have often reached the Hayflick limit. They should normally die, but some of them stick around. Removing them is believed to form the basis for a rejuvenation therapy. The difficulty is to find effective compounds that kill senescent cells but nothing else. A paper in Nature presents a new approach that is apparently safe and selective, in both human beings and mice. In old mice, the therapy restored physical function and reduced age-related inflammation.

For case-insensitive string comparisons, avoid char-by-char functions

Sometimes we need to compare strings in a case-insensitive manner. For example, you might want ‘abc’ and ‘ABC’ to be considered equal. It is a well-defined problem for ASCII strings. In C/C++, there are basically two common approaches. You can do whole string comparisons:

bool isequal = (strncasecmp(string1, string2, N) == 0);

Or you can do character-by-character comparisons, mapping each and every character to a lower-case version and comparing that:

bool isequal{true};
for (size_t i = 0; i < N; i++) {
  if (tolower(string1[i]) != tolower(string2[i])) {
    isequal = false;
    break;
  }
}

Intuitively, the second version is worse because it requires more code. We might also expect it to be slower. How much slower? I wrote a quick benchmark to test it out:

            Linux/GNU GCC   macOS/LLVM
strncasecmp 0.15 ns/byte    1 ns/byte
tolower     4.5 ns/byte     4.0 ns/byte

I got the first column of results with GNU GCC under Linux, and the second column on a different machine running macOS.

So for sizeable strings, the character-by-character approach might be 4 to 40 times slower! Results will vary depending on your standard library and on the time of day. However, in all my tests, strncasecmp is always substantially faster. (Note: Microsoft provides similar functions under different names, see _strnicmp for example.)

Could you go faster? I tried rolling my own and it runs at about 0.3 ns/byte. So it is faster than the competition under macOS, but slower under Linux. I suspect that the standard library under Linux must rely on a vectorized implementation which might explain how it beats me by a factor of two.

I bet that if we use vectorization, we can beat the standard libraries.


My code is available.

Sampling efficiently from groups

Suppose that you have to sample a student at random in a school. However, you cannot go into a classroom and just pick a student. All you are allowed to do is to pick a classroom, and then let the teacher pick a student at random. The student is then removed from the classroom. You may then have to pick a student at random, again, but with a classroom that is now missing a student. Importantly, I care about the order in which the students were picked. But I also have a limited memory: there are too many students, and I can’t remember too many. I can barely remember how many students I have per classroom.

If the classrooms contain a variable number of students, you cannot just pick classrooms at random. If you do so, students in classrooms with lots of students are going to be less likely to be picked.

So instead, you can do it as follows. If there are N students in total, pick a number x between 1 and N. If the number x is smaller or equal to the number of students in the first classroom, pick the first classroom and stop. If the number is smaller or equal to the number of students in the first two classrooms, pick the second classroom, and so forth. If you remove the student, you have to account for the fact that the sum (N) is reduced and the number of students in the classroom is being updated.

    while(N > 0) {
        int y = r.nextInt(sum); // random integer in [0,sum)
        int runningsum = 0;
        int z = 0;
        for(; z < runninghisto.length; z++) {
            runningsum += runninghisto[z];
            if(y < runningsum) { break; }
        }
        // found z
        runninghisto[z] -= 1;
        N -= 1;
    }

If you have many classrooms, this naive algorithm can be expensive: each time you must pick a classroom, you may need to iterate over all classroom counts. If you have K classrooms, you may need on the order of N K operations to pick all N students.

Importantly, the problem is dynamic: the number of students in each classroom is not fixed. You cannot simply precompute some kind of map once and for all.

Can you do better than the naive algorithm? I can craft a “logarithmic” algorithm, one where the number of operations is proportional to the logarithm of the number of classrooms. I believe that my algorithm is efficient in practice.

So divide your classrooms into two sets holding R and S students respectively, so that R + S = N. Pick an integer in [0,N). If it is smaller than R, decrement R and pick a classroom in the corresponding set. If it is larger than or equal to R, go with the other set of classrooms. So we have cut the size of the problem in two. We can repeat the same trick recursively: divide the two subsets of classrooms, and so forth. Instead of merely maintaining the total number of students over all of the classrooms, you keep track of the counts in a tree-like manner.

We can operationalize this strategy efficiently in software:

  • Build an array containing the number students in each classroom.
  • Replace every two counts by the sum of two counts: instead of storing the number of students in second classroom, store the number of students in the first two classrooms.
  • Replace every four counts by the sum of the four counts: where the count of the fourth class was, store the sum of the number of students in the first four classrooms.
  • And so forth.

In Java, it might look like this:

int level = 1;
for(; (1<<level) < runninghisto.length; level++) {
  for(int z = 0; 
      z + (1<<level) < runninghisto.length; 
      z += 2*(1<<level)) {
    runninghisto[z + (1<<level)] += runninghisto[z];
  }
}

You effectively have a tree of prefixes which you can then use to find the right classroom without visiting all of the classroom counts. You visit a tree with a height that is the logarithm of the number of classrooms. At each level, you may have to decrement a counter.

        while(N > 0) {
            int y = r.nextInt(sum);
            // select in logarithmic time
            level = maxlevel;
            int offset = 0;
            for(; level >= 0; level -= 1) {
                if(y >= runninghisto[offset + (1<<level)]) {
                    runninghisto[offset + (1<<level)] -= 1;
                    offset += (1<<level);
                }
            }
            // found offset
            N -= 1;
        }

I have written a Java demonstration of this idea. For simplicity, I assume that the number of classrooms is a power of two, but this limitation could be lifted by doing a bit more work. Instead of N K operations, you can now do only N log K operations.

Importantly, the second approach uses no more memory than the first one. All you need is enough memory to maintain the histogram.

My Java code includes a benchmark.

Update. If you are willing to use batch processing and a little bit of extra memory, you might be able to do even better in some cases. Go back to the original algorithm. Instead of picking one integer in [0,N), pick C of them for some number C. Then find out in which room each of these C integers fits, as before, decrementing the counts. If you pick the C integers in sorted order, a simple sequential scan should be enough. For C small compared to N, you can pick the C integers in linear time using efficient reservoir sampling. Then you can shuffle the result using Knuth’s random shuffle, in linear time. So you get to do about N K / C operations. (Credit: Travis Downs and Richard Startin.)
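The batched idea can be sketched in Python; this is my own illustration (the counts array plays the role of the classrooms), not code from the credited authors:

```python
import random

def sample_batch(counts, C):
    # Draw C distinct student indices, sort them, resolve them all in one
    # sequential scan over the classroom counts, then shuffle to restore
    # a random order.
    N = sum(counts)
    picks = sorted(random.sample(range(N), C))  # C distinct indices in [0, N)
    rooms = []
    prefix, z = 0, 0
    for p in picks:
        while prefix + counts[z] <= p:  # advance to the classroom holding p
            prefix += counts[z]
            z += 1
        rooms.append(z)
    for z in rooms:
        counts[z] -= 1  # the sampled students leave their classrooms
    random.shuffle(rooms)  # linear-time random shuffle for a random order
    return rooms

counts = [3, 1, 4, 2]
sampled = sample_batch(counts, 5)
assert len(sampled) == 5 and sum(counts) == 5
```

Because the C indices are distinct, this is equivalent to removing C distinct students, and the final shuffle recovers a random order.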

Further reading: Hierarchical bin buffering: Online local moments for dynamic external memory arrays, ACM Transactions on Algorithms.

Follow-up: Sampling Whisky by Richard Startin

Science and Technology links (April 25th 2020)

  1. People’s muscles tend to become weaker with age, a process called sarcopenia. It appears that eating more fruits and vegetables is associated with far lower risks of sarcopenia. Unfortunately, it is merely an association: it does not mean that if you eat more fruits and vegetables, you are at lower risk for sarcopenia. We just do not know.
  2. In human beings, males are more diverse than females with respect to many attributes. Just look at height: it is common that, in a classroom, both the shortest and tallest individual are boys. In chimpanzees, researchers report that males have greater brain structure diversity.
  3. Scientists have created “healing patches” for hearts. They greatly accelerate heart repair following a heart attack, in mice and pigs. It is made out of heart pig tissue, with the pig cells removed. The patches are believed to be safe, they can be frozen for later use.
  4. Many people lose some of their hair over time. About half of all men go bald over time. We do not quite understand the process. New research suggests that hair loss may result from the loss of stem cells in the skin. Stem cells are “progenitor cells” able to produce new specialized cells. The research suggests that stem cell therapies could be beneficial to fight hair loss.
  5. It is commonly believed that evolution made us as fit as possible. In this model, aging is the inevitable result of entropy. I do not believe this entropic model. In simulations, researchers have shown that a population of worms might be fitter if its individuals age and die. This should not surprise us: annual plants have evolved out of perennial plants. Your garden flowers can often easily be kept alive indefinitely even though they are annuals. Your plants age and die on schedule because it is advantageous for the species. Animals of a similar size have frequently vastly different lifespans… naked mole rats are not subject to Gompertz curves (they are effectively ageless).

Rounding integers to even, efficiently

When dividing a numerator n by a divisor d, most programming languages round “down”. It means that 1/2 is 0. Mathematicians will insist that 1/2 is 0.5 and claim that you are really computing floor(1/2). But let me think like a programmer. So 3/2 is 1.

If you always want to round up, you can instead compute (n + d – 1)/d. If I apply it to n is 3 and d is 2, I do (3 + 2 – 1)/2 = 2.

We sometimes want to round the value to the nearest integer. So 1.4 becomes 1, 1.6 becomes 2. But what if you have 1.5 (the result of 3/2)? You can either round up or round down.

  • You can round “up” when hitting the midpoint with (n + d/2)/d.
  • You can round “down” when hitting the midpoint with (n + (d – 1)/2)/d.

But there is a third common way to round. You want to round to even. It means that when you hit the midpoint, you round to the nearest even integer. So 1.5 becomes 2, 2.5 becomes 2, 3.5 becomes 4 and so forth. I am sure that it has been worked out before, but I could not find an example so I rolled my own:

off = (n + d / 2);
roundup = off / d;
ismultiple = ((off % d) == 0);
iseven = (d & 1) == 0;
return (ismultiple && iseven) ? roundup - (roundup & 1) : roundup;

Though there is a comparison and what appears like a branch, I expect most compilers to produce essentially branchless code. The result should be about five instructions on most hardware. I expect that it will be faster than converting the result to a floating-point number, rounding it up and converting the resulting floating-point number back.

I have posted working source code.

Why does it work?

Firstly, you never have to worry about hitting the midpoint if the divisor is odd. The remainder of a division by an odd number cannot be equal to d/2. Thus, starting from the round-up approach (n + d/2)/d, we only need to correct the result. When n is equal to d + d/2, we need to round up, when it is equal to 2d + d/2 we need to round down and so forth. So we only need to remove 1 from (n + d/2)/d when n is equal to 2kd + d/2 for some integer k. But when n is equal to 2kd + d/2, I have that n + d/2 is equal to (2k+1)d. That is, the quotient is odd.

Even better: You can make it explicitly branchless (credit: Falk Hüffner):

off = (n + d / 2);
roundup = off / d;
ismultiple = ((off % d) == 0);
return (d | (roundup ^ ismultiple)) & roundup;

Nitpicking: You may be concerned with overflows. Indeed, if both n and d are large, it is possible for n + d/2 to exceed the allowable range. However, the above functions are used in the Linux kernel. And you can make them more robust at the expense of a little bit more work.
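As a sanity check, here is a Python transcription of both routines, compared against Python’s own round-half-to-even on exact fractions (my illustration; the C snippets above are the real thing):

```python
from fractions import Fraction

def round_even(n, d):
    off = n + d // 2
    roundup = off // d
    ismultiple = (off % d) == 0
    iseven = (d & 1) == 0
    return roundup - (roundup & 1) if (ismultiple and iseven) else roundup

def round_even_branchless(n, d):
    off = n + d // 2
    roundup = off // d
    ismultiple = int((off % d) == 0)
    return (d | (roundup ^ ismultiple)) & roundup

# round() on a Fraction rounds halves to the nearest even integer
for n in range(0, 200):
    for d in range(1, 50):
        expected = round(Fraction(n, d))
        assert round_even(n, d) == expected
        assert round_even_branchless(n, d) == expected
```

The exhaustive check over small values covers both odd and even divisors, including all the midpoint cases.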

Credit: This blog post was motivated by an email exchange with Colin Bartlett.

Science and Technology links (April 11th 2020)

  1. Greenland sharks reach their sexual maturity when they are 150 years old and they live hundreds of years. Some sharks alive today were born in the 16th century.
  2. Only 5% of the general population rates themselves as below average in intelligence. And they overestimate the intelligence of their mate even more.
  3. Fresh soil has a distinctive smell which may be caused by bacteria that are trying to attract specific animals.
  4. Earth’s mantle is kept warm due to radioactivity. So we are basically living on a warm radioactive rock.
  5. Employees who work from home can be noticeably more productive. My own estimate is that I am about 20% more productive when working from home. However, social interactions are easier in person.
  6. We had the coldest Arctic winter since 1979. It caused the first genuine ozone hole in the Arctic. The ozone layer protects the Earth from radiation.
  7. According to a recent research paper in Nature, fermented foods (cheese, butter, alcoholic beverages, yogurt) as well as fibers may have anti-obesity properties due to the production of short-chain fatty acids (acetate, propionate, and butyrate). Propionate is also used as a food preservative, as well as being produced by our own gut bacteria when digesting fibers, or by bacteria in the process of fermentation. There is much evidence that propionate may be helpful against obesity: Lin et al. (2012), Chambers et al. (2019), Arora et al. (2011). In contrast, a widely reported previous work by Tirosh et al. suggests that propionate leads to weight gain in mice and might promote obesity and metabolic abnormalities. The press reported the idea that propionate could make us fat by presenting it as a food preservative.

    My take? The science is probably “complicated” and we do not have all the answers. This being said, while it is true that there may be propionate in your bread and pizza, to prevent it from moulding, and while it is also true that pizza and bread may make you fat, I’d worry a lot more about the calories in bread and pizza rather than the effect of propionate itself.

  8. The current pandemics might be the start of a revolution in peer review.

Multiplying backward for profit

Most programming languages have integer types with arithmetic operations like multiplications, additions and so forth. Our main processors support 64-bit integers, which means that you can deal with rather large integers. However, you cannot represent everything with a 64-bit integer. What if you need to multiply 5^100 by 37? Programming languages like Python or JavaScript will handle it gracefully. In programming languages like C/C++, you will need to use a library. In Java, you will have to use special classes from the standard library.

You will almost never need to deal with integers that do not fit in 64 bits, unless you are doing cryptography or some esoteric numerical analysis. Nevertheless, it does happen that you need to work with very large integers.

How does the product of 5^100 by 37 get computed if processors are limited to 64-bit words? Underneath, your software represents such large integers using multiple 64-bit words. The standard approach computes the product starting from the least significant bits. Thus if you are multiplying an integer that fits in a single machine word (w) with an integer that requires n machine words, you will use n multiplications, starting with a multiplication between the word w and the least significant word of the other integer, going up to the most significant words. The code in C/C++ might look like so:

// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[0]); // 64-bit x 64-bit => 128-bit
output[0] = p.low; // least significant 64 bits
uint64_t r = p.high; // most significant 64 bits
for (i = 1; i < n; i++) {
  p = full_multiplication(w, b[i]);
  p.low += r; // add r, with carry
  if(p.low < r) p.high++;
  output[i] = p.low; // least significant 64 bits
  r = p.high;
}
output[n] = r;

It is akin to how we are taught to multiply in school: start with the least significant digits, and go up. I am assuming that all integers are positive: it is not difficult to generalize the problem to include signed integers.

So far, it is quite boring. So let me add a twist.

You can also multiply backward, starting from the most significant words. Though you might think it would be less efficient, you can still compute the same product using the same n multiplications. The code is going to be a bit more complicated because you have to carry the overflow that you may encounter in the less significant words upward. Nevertheless, you can implement it in C/C++ using only a few extra lines of code. According to my hasty tests, it is only marginally slower (by about 20%).

// multiply w * b where b contains n words, n > 0
// the output array must have room for n + 1 words
p = full_multiplication(w, b[n - 1]); // 64-bit x 64-bit => 128-bit
output[n - 1] = p.low; // least significant 64 bits of this product
output[n] = p.high;    // most significant 64 bits
for (i = n - 2; i >= 0; i--) { // i must be a signed integer
  p = full_multiplication(w, b[i]);
  output[i] = p.low; // least significant 64 bits
  // check for overflow
  bool overflow = (output[i + 1] + p.high < output[i + 1]);
  output[i + 1] += p.high;
  for (size_t j = i + 2; overflow; j++) {
    output[j]++; // propagate the carry
    overflow = (output[j] == 0);
  }
}

Why would computing the multiplication backward ever be useful?

Suppose that you are not interested in the full product. Maybe you just need to check whether the result is within some bounds, or maybe you just need a good approximation of the product. Then starting from the most significant bits could be helpful if you can stop the computation after you have enough significant bits.

It turns out that you can do so, efficiently. You can compute the most significant k words using no more than an expected k + 0.5 multiplications. Furthermore, if you are careful, you can later resume the computation and complete it.

At each step, after doing k multiplications and computing k + 1 words, these k + 1 most significant words possibly underestimate the true k + 1 most significant words because we have not yet added the carry from the less significant multiplications. However, we can bound the value of the carry: it is strictly less than w. To see that it must be so, let r be the number of remaining words in the multiword integer that we have not yet multiplied by w. The maximal value of these words is 2^(64r) – 1. So we are going to add, at most, (2^(64r) – 1)w to our partially computed integer: the overflow (carry) above these r words is at most ((2^(64r) – 1)w)/2^(64r), a value strictly less than w.

This means that if we add w – 1 to the least significant of the k + 1 words and it does not overflow, then we know that the k most significant words are exact, and we can stop. This might happen, roughly, half the time, assuming that we are dealing with random inputs. When it does overflow, you can just continue and compute one more product. If, again, adding w – 1 to the least significant word you computed does not create an overflow, you can stop, confident that the k + 1 most significant words are exact.

However, you then gain another, more powerful stopping condition: if the second least significant word is not exactly 2^64 – 1, then you can also stop, confident that the k most significant words are exact, because adding at most w – 1 to the least significant word can, at worst, translate into a carry of +1 to the second least significant word. Because it is quite unlikely that you will end up with exactly the value 2^64 – 1, we know that, probabilistically, you will not need more than k + 1 multiplications to compute exactly the k most significant words. And quite often, you will be able to stop after k multiplications.

My code implementing these ideas is a bit too long for a blog post, but I have published it as its own GitHub repository, so you can check it out for yourself. It comes complete with tests and benchmarks.

I have restricted my presentation to the case where at least one integer fits in a single word. Generalizing the problem to the case where you have two multiword integers is a fun problem. If you have ideas about it, I’d love to chat.

Implementation notes: The code is mostly standard C++. The main difficulty is computing the full multiplication, which takes two 64-bit words and produces the 128-bit product as two 64-bit words. To my knowledge, there is no standard way to do it in C/C++, but most compilers offer you some way to do it. At the CPU level, computing the full product is always supported (64-bit ARM and x64) with efficient instructions.

Credit: This work is inspired by notes and code from Michael Eisel.