# Computing the inverse of odd integers

Given x, its (multiplicative) inverse is another value y such that x y = y x = 1. We all know that the multiplicative inverse of x is 1/x and it exists as long as x is non-zero. That’s for real numbers, or at least, rational numbers.

But the idea of a multiplicative inverse is more general.

It certainly fails integers in general. I.e., there is no integer x such that 2 x is 1. But, maybe surprisingly, all odd integers have an inverse if you use normal computer arithmetic. Indeed, when your processor computes x y, then it actually outputs x y modulo 232 or x y modulo 264 depending on whether you use 32-bit or 64-bit instructions. (The value x y modulo 264 can be defined as the remainder of the division of x y by 264.)

Let me briefly explain why there must be an inverse. You can skip this part if you want. Take any odd integer x. Because x is odd, then it is not divisible by 2. In fact, that’s what it means to be odd. But this also means that powers of x will also be odd. E.g., xk is also odd for any integer k. Ok. So xk modulo 264 is never going to be zero. The only way it could be zero is if xk were divisible by 264, but that’s impossible because it is an odd value. At the same time, we only have a finite number of distinct odd values smaller than 264. So it must be the case that xk modulo 264 = xk' modulo 264 for some pair of powers k and k'. Assume without loss of generality that k is larger than k'. Then we have that xk-k' modulo 264 = 1 modulo 264 (I am not proving this last step, but you can figure it out from the previous one). And thus it follows that xk-k'-1 is the inverse of x. If you did not follow this sketch of a proof, don’t worry.

So how do you find this inverse? You can brute force it, which works well for 32-bit values, but not so well for 64-bit values.

Wikipedia has a page on this, entitled modular multiplicative inverses. It points to an Euclidean algorithm that appears to rely on repeated divisions.

Thankfully, you can solve for the inverse efficiently using very little code.

One approach is based on “Newton’s method”. That is, we start with a guess and from the guess, we get a better one, and so forth, until we naturally converge to the right value. So we need some formula f(y), so that we can repeatedly call y = f(y) until y converges.

A useful recurrence formula is f(y) = y (2 - y x ) modulo 264. You can verify that if y is the 64-bit inverse of x, then this will output y. So the formula passes a basic sanity test. But would calling y = f(y) repeatedly converge to the inverse?

Suppose that y is not quite the inverse, suppose that x y = 1 + z 2N for some z and some N that is smaller than 64. So y is the inverse “for the first N bits” (where “first” means “least significant”). That is, x y modulo 2N = 1.

It is easy to find such a y for N greater than zero. Indeed, let y = 1, then x y = 1 + z 21.

Ok, so substituting x y = 1 + z 2N in
y (2 - y x ) modulo 264, we get
y (2 - ( 1 + z 2N) ) modulo 264
or
y ( 1 - z 2N ) modulo 264.
So I set y' = f(y) = y (1 - z 2N ) modulo 264.
What is x y'? It is ( 1 + z 2N ) (1 - z 2N ) modulo 264
or ( 1 - z2 22 N ) modulo 264.

That is, if y was the inverse “for the first N bits”, then y' = f(y) is the inverse “for the first 2 N bits”. In some sense, I double my precision each time I call the recurrence formula. This is great! This means that I will quickly converge to the inverse.

Can we do better, as an initial guess, than y = 1? Yes. We can start with a very interesting observation: if we use 3-bit words, instead of 32-bit or 64-bit words, then every number is its own inverse. E.g., you can check that 3*3 modulo 8 = 1.

(Marc Reynolds pointed out to me that you can get 4 bits of accuracy by starting out with x * x + x - 1 . Brian Kessler points out that you can do even better: ( 3 * x ) XOR 2 provides 5 bits of accuracy.)

So a good initial guess is y = x, and that already buys us 3 bits. The first call to the recurrence formula gives me 6 bits, then 12 bits for the second call, then 24 bits, then 48 bits, then 96 bits, and so forth. So, we need to call our recurrence formula 4 times for 32-bit values and 5 times for 64-bit values. I could actually go to 128-bit values by calling the recurrence formula 6 times.

Here is the code to compute the inverse of a 64-bit integer:

```uint64_t f64(uint64_t x, uint64_t y) {
return  y * ( 2 - y * x );
}

static uint64_t findInverse64(uint64_t x) {
uint64_t y = x;
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
return y;
}
```

Each call to the recurrence formula should consume about 5 CPU cycles so that the whole function should take no more than 25 cycles or no more than the cost of a single integer division. Actually, it might be cheaper than a single integer division.

Because of the way we construct the inverse, if you somehow knew the 32-bit inverse, you could call the recurrence formula just once to get the 64-bit inverse.

How did we arrive at this formula ( y ( 2 - y x ) )? It is actually a straight-forward application of Newton’s method as one would apply it to finding the zero of g(y) = 1/y - x. So there is no magic involved.

My code seems to assume that I am working with unsigned integers, but the same algorithm works with signed integers, and in binary form, it will provide the same results.

Reference and further reading: Granlund and Montgomery, SIGPLAN Not. (1994). Some people point me at On Newton-Raphson iteration for multiplicative inverses modulo prime powers by Dumas (2012).

Credit: Marc Reynolds asked on Twitter for an informal reference on computing the multiplicative inverse modulo a power of two. It motivated me to write this blog post. He finally wrote a decent article on the subject with many interesting remarks. ### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 8 thoughts on “Computing the inverse of odd integers”

1. Sam Kramer says:

Interesting article!
Those using C++14 or newer can use the constexpr specifier to evaluate the functions at compile time:

constexpr uint64_t f64(uint64_t x, uint64_t y) {
return y * ( 2 – y * x );
}

constexpr uint64_t findInverse64(uint64_t x) {
uint64_t y = x;
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
y = f64(x,y);
return y;
}

1. Daniel Lemire says:

Great!

Does this work with C++11?

1. Sam Kramer says:

C++11 Solution:

constexpr uint64_t rec_f64(uint64_t x, uint64_t y, int n) {
return (n > 0) ? rec_f64(x, f64(x, y), n-1) : f64(x, y);
}

constexpr uint64_t findInverse64(uint64_t x) {
return rec_f64(x, x, 5);
}

See the following blog post to learn why this is necessary:
http://fendrich.se/blog/2012/11/22/compile-time-loops-in-c-plus-plus-11-with-trampolines-and-exponential-recursion/

2. Brian Kessler says:

In terms of applications, this came up for me recently when looking at an algorithm for exact division (remainder known to be zero).

You can actually do a little better and get 5 correct starting bits with y = (3 * x) ^ 2 and in that case you also only need 4 iterations for the 64 bit inverse.

That’s according to “personal communication from Montgomery” from this paper: https://arxiv.org/pdf/1303.0328.pdf

3. Albert Chan says:

For older machine, this may be faster to calculate 64-bits inverse:

```static uint32_t inverse32(uint32_t a) { uint32_t x = (3*a) ^ 2; x *= 2 - a * x; x *= 2 - a * x; x *= 2 - a * x; return x; }```

``` ```

```static uint64_t inverse64(uint64_t a) // (a x) mod 2^64 = 1 { uint32_t terms, a0 = a; // a = a1<<32 | a0 uint32_t x0 = inverse32(a0); // a0 x0 = 1 terms = (uint32_t) (a >> 32) * x0; // term a1 x0 terms += ((uint64_t) a0 * x0) >> 32; // term a0 x0 / 2^32 uint32_t x1 = x0 * -terms; // a0 x1 + terms = 0 return (uint64_t) x1 << 32 | x0; // x = a^-1 mod 2^64 } ```

4. Albert Chan says:

Found an old post with formula that trebled bits accuracy per step.
For inverse32(), only 2 steps are needed.

```uint32_t inverse32(uint32_t a) { uint32_t t, x = (3*a) ^ 2; t = a * x - 1; x *= t * t - t + 1; t = a * x - 1; x *= t * t - t + 1; return x; } ```

5. LCR says:

I might be late to the party, but here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table (especially efficient if you have the bextr instruction).

```uint64_t inverse64(uint64_t a) { uint8_t t[] = { 2, 174, 210, 190, 66, 174, 210, 254, 2, 46, 82, 190, 66, 46, 82, 254, }; uint64_t x; x = t[(a>>1)&15] - a; // 8bit precision x *= 2 - a * x; // 16 x *= 2 - a * x; // 32 x *= 2 - a * x; // 64 return x; } ```

1. Daniel Lemire says:

Here’s a solution to remove some more cycles by hacking an 8bit approximation with a small table

It is interesting, but I doubt that it will be much faster.