Most programming languages have integer types with arithmetic operations like multiplication and addition. Mainstream processors support 64-bit integers, which means that you can deal with rather large integers. However, you cannot represent everything with a 64-bit integer. What if you need to multiply 5^{100} by 37? Programming languages like Python or JavaScript will handle it gracefully. In programming languages like C/C++, you will need to use a library. In Java, you will have to use special classes that are part of the standard library (e.g., BigInteger).

You will almost never need to deal with integers that do not fit in 64 bits, unless you are doing cryptography or some esoteric numerical analysis. Nevertheless, it does happen that you need to work with very large integers.

How does the product of 5^{100} by 37 get computed if processors are limited to 64-bit words? Underneath, your software represents such large integers using multiple 64-bit words. The standard approach computes the product starting from the least significant bits. Thus, if you are multiplying an integer that fits in a single machine word (*w*) by an integer that requires *n* machine words, you will use *n* multiplications, starting with a multiplication between the word *w* and the least significant word of the other integer, and going up to the most significant word. The code in C/C++ might look like so:

```c
// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[0]); // 64-bit x 64-bit => 128-bit
output[0] = p.low;   // least significant 64 bits
uint64_t r = p.high; // most significant 64 bits
for (i = 1; i < n; i++) {
  p = full_multiplication(w, b[i]);
  p.low += r; // add r with carry
  if (p.low < r) p.high++;
  output[i] = p.low; // least significant 64 bits
  r = p.high;
}
output[n] = r; // most significant word of the product
```

It is akin to how we are taught to multiply in school: start with the least significant digits, and go up. I am assuming that all integers are positive: it is not difficult to generalize the problem to include signed integers.

So far, it is quite boring. So let me add a twist.

You can also multiply backward, starting from the most significant words. Though you might think it would be less efficient, you can still do the same product using the same *n* multiplications. The code is going to be a bit more complicated because you have to carry the overflow that you may encounter in the less significant words upward. Nevertheless, you can implement it in C/C++ using only a few extra lines of code. According to my hasty tests, it is only marginally slower (by about 20%).

```c
// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[n - 1]); // 64-bit x 64-bit => 128-bit
output[n - 1] = p.low; // least significant 64 bits of this product
output[n] = p.high;    // most significant 64 bits
for (i = n - 2; i >= 0; i--) {
  p = full_multiplication(w, b[i]);
  output[i] = p.low;
  // check for overflow before adding the high part
  bool overflow = (output[i + 1] + p.high < output[i + 1]);
  output[i + 1] += p.high;
  for (size_t j = i + 2; overflow; j++) {
    output[j]++; // propagate the carry
    overflow = (output[j] == 0);
  }
}
```

Why would computing the multiplication backward ever be useful?

Suppose that you are not interested in the full product. Maybe you just need to check whether the result is within some bounds, or maybe you just need a good approximation of the product. Then starting from the most significant bits can be helpful if you can stop the computation once you have enough significant bits.

It turns out that you can do so efficiently. You can compute the most significant *k* words using no more than an expected *k* + 0.5 multiplications. Furthermore, if you are careful, you can later resume the computation and complete it.

At each step, after doing *k* multiplications and computing *k* + 1 words, these *k* + 1 most significant words possibly underestimate the true *k* + 1 most significant words because we have not yet added the carry from the less significant multiplications. However, we can bound the value of the carry: it is strictly less than *w*. To see that it must be so, let *r* be the number of remaining words in the multiword integer that we have not yet multiplied by *w*. The maximal value of these words is 2^{64r} – 1. So we are going to add, at most, (2^{64r} – 1)*w* to our partially computed integer: the carry above these *r* words is at most ((2^{64r} – 1)*w*)/2^{64r}, a value strictly less than *w*.
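To make the bound concrete, here is a toy check of my own (not from the post), using 8-bit "words" so the numbers stay small. With a single remaining low word in base B = 256, the carry pushed into the already-computed words is floor((B – 1)*w*/B), which is strictly less than *w*:

```cpp
#include <cstdint>

// Toy model of the carry bound with 8-bit "words" (base B = 256).
// The single remaining low word is at most B - 1, so the carry it can
// push into the already-computed words is floor((B - 1) * w / B).
// That quantity is strictly less than w (in fact exactly w - 1 here).
inline uint32_t max_carry_base256(uint32_t w) {
  const uint32_t B = 256;
  return ((B - 1) * w) / B;
}
```

The same computation with B = 2^{64} and *r* words gives the bound used in the post.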

This means that if we add *w* – 1 to the least significant of the *k* + 1 words and it does not overflow, then we know that the *k* most significant words are exact, and we can stop. This happens, roughly, half the time, assuming that we are dealing with random inputs. When it does overflow, you can just continue and compute one more product. If, again, adding *w* – 1 to the least significant word you computed does not create an overflow, you can stop, confident that the *k* + 1 most significant words are exact.

However, you then gain another, more powerful stopping condition: if the second least significant computed word is not exactly 2^{64} – 1, then you can also stop, confident that the *k* most significant words are exact, because adding the carry (which is less than *w*) to the least significant word can, at most, translate into a carry of +1 to the second least significant word. Because it is quite unlikely that you will end up with exactly the value 2^{64} – 1, we know that, probabilistically, you will not need more than *k* + 1 multiplications to compute exactly the *k* most significant words. And quite often, you will be able to stop after *k* multiplications.
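As a rough illustration, here is a minimal sketch of my own (not the code from the repository) of the backward multiplication with the first stopping rule only, assuming GCC/Clang's `__uint128_t` extension for the 64×64→128-bit product:

```cpp
#include <cstdint>
#include <cstddef>

struct uint128_result { uint64_t low, high; };

// 64x64 -> 128-bit product; __uint128_t is a GCC/Clang extension.
inline uint128_result full_multiplication(uint64_t a, uint64_t b) {
  __uint128_t p = (__uint128_t)a * b;
  return { (uint64_t)p, (uint64_t)(p >> 64) };
}

// Multiply w * b (b has n words, least significant first, w > 0) from the
// most significant word down, stopping once the top k words are provably
// exact. Returns the lowest index i whose product was computed: the words
// output[i + 1] ... output[n] then match the exact product; if i == 0,
// the whole product output[0] ... output[n] is exact.
size_t multiply_backward_truncated(uint64_t w, const uint64_t *b, size_t n,
                                   size_t k, uint64_t *output) {
  uint128_result p = full_multiplication(w, b[n - 1]);
  output[n - 1] = p.low;
  output[n] = p.high;
  size_t i = n - 1;
  while (i > 0) {
    // The carry from the words not yet multiplied is at most w - 1. If
    // adding w - 1 to the lowest computed word cannot overflow, the words
    // above it are exact; stop once we have k of them.
    if (n - i >= k && output[i] + (w - 1) >= output[i]) break;
    i--;
    p = full_multiplication(w, b[i]);
    output[i] = p.low;
    bool overflow = (output[i + 1] + p.high < output[i + 1]);
    output[i + 1] += p.high;
    for (size_t j = i + 2; overflow; j++) {
      output[j]++; // propagate the carry upward
      overflow = (output[j] == 0);
    }
  }
  return i;
}
```

The second stopping rule (checking whether the second least significant computed word equals 2^{64} – 1) would slot into the same loop; the repository's code handles both.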

My code implementing these ideas is a bit too long for a blog post, but I have published it as its own GitHub repository, so you can check it out for yourself. It comes complete with tests and benchmarks.

I have restricted my presentation to the case where at least one integer fits in a single word. Generalizing the problem to the case where you have two multiword integers is a fun problem. If you have ideas about it, I’d love to chat.

**Implementation notes**: The code is mostly standard C++. The main difficulty is computing the full multiplication, which takes two 64-bit words and generates the 128-bit product as two 64-bit words. To my knowledge, there is no standard way to do it in C/C++, but most compilers offer some way to do it. At the CPU level, computing the full product is supported with efficient instructions on both 64-bit ARM and x64.
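For instance (a sketch, not the repository's exact code), GCC and Clang expose the `__uint128_t` extension, while on MSVC one would use the `_umul128` intrinsic from `<intrin.h>` instead:

```cpp
#include <cstdint>

// 64-bit x 64-bit -> 128-bit product, returned as two 64-bit words.
// __uint128_t is a GCC/Clang extension; MSVC users would call the
// _umul128 intrinsic from <intrin.h> to get the same two words.
struct full_product { uint64_t low, high; };

inline full_product full_mul64(uint64_t a, uint64_t b) {
  __uint128_t p = (__uint128_t)a * b;
  full_product r;
  r.low = (uint64_t)p;        // bottom 64 bits of the product
  r.high = (uint64_t)(p >> 64); // top 64 bits of the product
  return r;
}
```

Compilers turn this into a single `mul` (x64) or a `mul`/`umulh` pair (64-bit ARM).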

**Credit**: This work is inspired by notes and code from Michael Eisel.

Daniel Lemire, "Multiplying backward for profit," in *Daniel Lemire's blog*, April 5, 2020.

Very interesting, thank you for the article! I wonder if any of the bignum libraries ever thought about this…

I’d be interested in knowing the answer to this question… 🙂 I have looked at the usual suspects and found nothing.

This is the same thing that I was doing last summer here when I was fiddling with your range generator. The multiplication stops when a carry into the bits of interest is no longer possible; I ended up writing it as a tail-recursive carry() function.

I haven’t read your code fully, but it looks like your gate on the carry flag is the same; I golfed it down to (x + w < x) to cover the possibility of [-1,w-1] in the continued multiplication.

At the time I wrote this I didn’t notice any algorithms that do the same, but I don’t know much about numerics.

Interesting. I was aware of your code, but I don’t think I ever noticed that it is what you were doing.

That’s…understandable.

It started as a finite extension of your modulo-and-discard to see if I could put off the modulo more; then I realized that extending the multiplication indefinitely was simpler in some ways.

Mysteriously, the range generation solutions all seemed to center on checking the [0,w) range, even your discard method. Backwards multiplication also requires looking for -1 which can carry from further right, but one addition and a CF check cover both cases.

This problem comes up in multiprecision floating-point multiplication, where it is called the “short product”. MPFR uses a method similar to yours for small sizes (called “naive” in their docs) and a divide-and-conquer method due to Mulders for larger sizes. See the MPFR docs for the references.

https://hal.inria.fr/inria-00070266/document

“Naive” method

https://www.researchgate.net/publication/2786248_Efficient_Multiprecision_Floating_Point_Multiplication_with_Optimal_Directional_Rounding

Mulders algorithm

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.285&rep=rep1&type=pdf

Thank you for the references.

It’s also studied for integer arithmetic:

Faster truncated integer multiplication – David Harvey – https://arxiv.org/abs/1703.00640