A fast alternative to the modulo reduction

Suppose you want to pick an integer at random in a set of N elements. Your computer has functions to generate random 32-bit integers, how do you transform such numbers into indexes no larger than N? Suppose you have a hash table with a capacity N. Again, you need to transform your hash values (typically 32-bit or 64-bit integers) down to an index no larger than N. Programmers often get around this problem by making sure that N is a power of two, but that is not always ideal.

We want a map that as fair as possible for an arbitrary integer N. That is, ideally, we would want that there are exactly 232/N values mapped to each value in the range {0, 1 ,…, N – 1} when starting from all 232 32-bit integers.

Sadly, we cannot have a perfectly fair map if 232 is not divisible by N. But we can have the next best thing: we can require that there be either floor(232/N) or ceil(232/N) values mapped to each value in the range.

If N is small compared to 232, then this map could be considered as good as perfect.

The common solution is to do a modulo reduction: x mod N. (Since we are computer scientists, we define the modulo reduction to be the remainder of the division, unless otherwise stated.)

```uint32_t reduce(uint32_t x, uint32_t N) {
return x % N;
}
```

How can I tell that it is fair? Well. Let us just run through the values of x starting with 0. You should be able to see that the modulo reduction takes on the values 0, 1, …, N – 1, 0, 1, … as you increment x. Eventually, x arrives at its last value (232 – 1), at which point the cycle stops, leaving the values 0, 1, …, (232 – 1) mod N with ceil(232/N) occurrences, and the remaining values with floor(232/N) occurrences. It is a fair map with a bias for smaller values.

It works, but a modulo reduction involves a division, and divisions are expensive. Much more expensive than multiplications. A single 32-bit division on a recent x64 processor has a throughput of one instruction every six cycles with a latency of 26 cycles. In contrast, a multiplication has a throughput of one instruction every cycle and a latency of 3 cycles.

There are fancy tricks to “precompute” a modulo reduction so that it can be transformed into a couple of multiplications as well as a few other operations, as long as N is known ahead of time. Your compiler will make use of them if N is known at compile time. Otherwise, you can use a software library or work out your own formula.

But it turns out that you can do even better! That is, there is an approach that is easy to implement, and provides just as good a map, without the same performance concerns.

Assume that x and N are 32-bit integers, consider the 64-bit product x * N. You have that (x * N) div 232 is in the range, and it is a fair map.

```uint32_t reduce(uint32_t x, uint32_t N) {
return ((uint64_t) x * (uint64_t) N) >> 32 ;
}
```

Computing (x * N) div 232 is very fast on a 64-bit processor. It is a multiplication followed by a shift. On a recent Intel processor, I expect that it has a latency of about 4 cycles and a throughput of at least on call every 2 cycles.

So how fast is our map compared to a 32-bit modulo reduction?

To test it out, I have implemented a benchmark where you repeatedly access random indexes in an array of size N. The indexes are obtained either with a modulo reduction or our approach. On a recent Intel processor (Skylake), I get the following number of CPU cycles per accesses:

 modulo reduction fast range 8.1 2.2

So it is four times faster! No bad.

As usual, my code is freely available.

What can this be good for? Well… if you have been forcing your arrays and hash tables to have power-of-two capacities to avoid expensive divisions, you may be able to use the fast range map to support arbitrary capacities without too much of a performance penalty. You can also generate random numbers in a range faster, which matters if you have a very fast random number generator.

So how can I tell that the map is fair?

By multiplying by N, we take integer values in the range [0, 232) and map them to multiples of N in [0, N * 232). By dividing by 232, we map all multiples of N in [0, 232) to 0, all multiples of N in [232, 2 * 232) to one, and so forth. To check that this is fair, we just need to count the number of multiples of N in intervals of length 232. This count must be either ceil(232/N) or floor(232/N).

Suppose that the first value in the interval is a multiple of N: that is clearly the scenario that maximizes the number of multiples in the interval. How many will we find? Exactly ceil(232/N). Indeed, if you draw sub-intervals of length N, then every complete interval begins with a multiple of N and if there is any remainder, then there will be one extra multiple of N. In the worst case scenario, the first multiple of N appears at position N – 1 in the interval. In that case, we get floor(232/N) multiples. To see why, again, draw sub-intervals of length N. Every complete sub-interval ends with a multiple of N.

This completes the proof that the map is fair.

For fun, we can be slightly more precise. We have argued that the number of multiples was maximized when a multiple of N appears at the very beginning of the interval of length 232. At the end, we get an incomplete interval of length 232 mod N. If instead of having the first multiple of N appear at the very beginning of the interval, it appeared at index 232 mod N, then there would not be room for the incomplete subinterval at the end. This means that whenever a multiple of N occurs before 232 mod N, then we shall have ceil(232/N) multiples, and otherwise we shall have floor(232/N) multiples.

Can we tell which outcomes occur with frequency floor(232/N) and which occurs with frequency ceil(232/N)? Yes. Suppose we have an output value k. We need to find the location of the first multiple of N no smaller than k 232. This location is ceil(k 232 / N) Nk 232 which we just need to compare with 232 mod N. If it is smaller, then we have a count of ceil(232/N), otherwise we have a count of floor(232/N).

You can correct the bias with a rejection, see my post on fast shuffle functions.

Useful code: I published a C/C++ header on GitHub that you can use in your projects.

(Update: I have made the proof more intuitive following a comment by Kendall Willets.)

30 thoughts on “A fast alternative to the modulo reduction”

1. I’m missing something… are you saying that x % N and (x * N) div 2^32 are equivalent? This is trivially false for, e.g., 1 % 7 vs (1 * 7) div 2^32.

1. I am not saying that x % N is equal to (x * N) div 2^32 . This is obviously false, as you indicate. I am saying that they are both fair maps from the set of all 32-bit integers down to integers in [0,N).

1. It is not, unless you have numbers uniformly distributed from 0 to numeric_limits::max()

1. If you have, say, 31-bit integers, instead of 32-bit integers, as might happen with a call to rand, you can adapt the map by shifting by 31 instead of 32.

2. PS: Especially, if you use it for hashing and you numbers tend to be small then all your hash values will be horribly biased down. Perhaps, reducing div 2^32 to div 2^(some smaller degree of tw) may help.

1. A good hash function should be regular. That is, it should be so that all integers are “equally likely” (given a random input). If your hash values do not cover the whole 32-bit range, you need to adapt the map, probably as you describe.

1. What is the cost of computing a good hash value? š Perhaps, it should be included in the overall computation time. This may (drastically?) reduce the difference between two approaches.

I also suspect that in many cases, given a list of IDs you can get reasonable results by doing just ID % . In this case, you can get away without applying a hashing transformation. With your trick, you do need this.

For example, if you numbers are smaller than 2^20 (which is quite likely) and N < 2^16, then it looks like all your hash values will be zero.

1. by doing just ID * some perhaps prime number. Angle brackets were removed by HTML š

2. What is the cost of computing a good hash value?

Of course, you are correct that there are many cases where the modulo reduction will not be a bottleneck. In such cases, this fast map is useless.

However, there are real-world examples where people set their capacity to a power of two to improve performance. A latency of a couple of dozens of cycles (for a division) is not great.

I also suspect that in many cases, given a list of IDs you can get reasonable results by doing just ID % N.

Yes. Moreover, as alluded to in my post, if N is known ahead of time, you can avoid the modulo reduction entirely and replace it with a faster function. See Hacker’s Delight.

2. Maxim Egorushkin says:

Would byte-swapping x help to preserve lower-order bits, e.g.:

uint32_t reduce2(uint32_t x, uint32_t N) {
return static_cast<uint64_t>(bswap_32(x)) * N >> 32;
}

2. On a second thought, if modulo reduction slows you down, you may want to try SIMD and bulk conversion. You can something like:
__m128 recip = __mm_set1_ps(1/float(N));
__m128 mult = __mm_set1_ps(N);
__m128 tmp = _mm_mul_ps(mult, _mm_floor_ps(_mm_mul_ps(to_convert_,recip)));
__m128i res = _mm_cvtps_epi32(_mm_floor_ps(_mm_sub_ps(to_convert, tmp)));
Peraphs, an additional SIMD min is required to ensure the value is < N. Anyways, seems like it is worth trying such as solution.

1. Yes, Leonid, I suspect that you are quite right. There are cases where using floating-point numbers, especially in conjunction with SIMD instructions, could lead to great results.

3. To re-phase Daniel:

We want to generate an index (I) chosen uniformly from the range 0..N-1.

We have a 32-bit random value (R), and assume a uniform distribution.

When N is a power-of-two, then the good/easy/fair answer: mask off the higher bits of R. (I have used this more than a few times, as a programmer.)

When N is not a power-of-two, the easy answer is:
I = R mod N
Two problems: the modulus operation is somewhat expensive, and the distribution is not *quite* even (that last wrap-around) – though likely good enough.

Daniel’s solution is (in two steps):
R = R * N
We now have a value in the range to N * 2^32.
I = R >> 32
We now have a value in the range to N. The two operations (multiply and shift) are almost certainly cheaper than the modulus operation.

I suspect the values are not *quite* uniform, but likely good enough. (How uniformly distributed are the multiplication products in each of the N buckets?)

Might write a test program to measure. š

1. If N is small compared to 2^32, then I submit to you that you won’t be able to measure any bias.

4. KWillets says:

It’s a fairly simple proof. You stretch the [0, 2^32) random variable range out to multiples of N in [0, N*2^32] and then map [0,2^32) to 0, [2^32, 2*2^32) to 1, etc. by integer division (>>32). The number of multiples of N in each range is floor or ceil(2^32/N).

1. Thanks. I had a relatively short proof that used elementary number theory (a few lines), but it slightly got out of hand when I tried to make it accessible to people who may not master these concepts. A more direct approach like you suggest would have been better! Kudos!

1. KWillets says:

Thanks — I actually thought it through hoping to find a different method, and ended up with a proof of the same one. Oh well.

5. Maxim Egorushkin says:

Note that expression `((uint64_t) x * (uint64_t) N) >> 32` on x86-64 compiles into a 64-bit multiplication that yields a 128-bit result, followed by the shift.

With a bit of inline assembly it can use a shorter 32-bit mul instruction that yields a 64-bit result in EDX:EAX. The required result is in EDX, no shift instruction is required.

1. That’s a very good point. I am somewhat disappointed that the compiler fails to exploit this optimization.

1. For the record, something like this could do the job…

``` uint32_t asm_highmult32to32(uint32_t u, uint32_t v) { uint32_t answer; __asm__ ("imull %[v]\n" "movl %%edx,%[answer]\n" : [answer] "+r" (answer) : [u] "a" (u), [v] "r" (v) :"eax","edx" ); return answer; } ```

But it is unlikely to help performance as is. You’d probably need to rewrite a larger chunk of code using assembly.

1. For some reason, gcc does the correct thing with `uint128_t` operands.

2. Maxim Egorushkin says:

mulx instruction would be ideal here: takes 2 assembly instructions to produce the value in eax, no flags affected:

uint32_t reduce3(uint32_t x, uint32_t N) {
uint32_t r;
asm(“movl %[N], %%edx\n\t”
“mulxl %[x], %[r], %[r]”
: [r]”=r”(r)
: [x]”r”(x), [N]”r”(N)
: “edx”);
return r;
}

6. KWillets says:

Doing this for floats (random within [0,x)) is also fairly easy since it just requires subtracting 32 from the exponent, which ldexp can do, eg ldexp(rand(), -32) gives a float in [0,1) (assuming rand() is 32-bit here).

There’s some loss of precision as floats discard lower-order bits automatically, unless we cast to a longer mantissa first.

7. Pablo says:

Since I don’t read math proofs daily, this took me a bit of thinking to parse. At the end I realized all this was a variation of:

rand(1) * N

In order:

( rand(2^32) * N ) / 2^32

is equivalent to

( rand(2^32) / 2^32 ) * N

which is equivalent to

floor( (float)[0,1) * N )

1. Yes, for some fuzzy notion of “equivalent”. For example, rand(2^32) / 2^32 is not equivalent to picking a number in [0,1) fairly.