More fun with fast remainders when the divisor is a constant

In software, compilers can often optimize away integer divisions, and replace them with cheaper instructions, especially when the divisor is a constant. I recently wrote about some work on faster remainders when the divisor is a constant. I reported that it can be fruitful to compute the remainder directly, instead of first computing the quotient (as compilers are doing when the divisor is a constant).

To get good results, we can use an important insight that is not documented anywhere at any length: we can use 64-bit processor instructions to do 32-bit arithmetic. This is fair game and compilers could use this insight, but they do not do it systematically. Using this trick alone is enough to get substantial gains in some instances, if the algorithmic issues are just right.

So it is a bit complicated. Using 64-bit processor instructions for 32-bit arithmetic is sometimes useful. In addition, computing the remainder directly without first computing the quotient is sometimes useful. Let us collect a data point for fun and to motivate further work.

First let us consider how you might compute the remainder by leaving it up to the compiler to do the heavy lifting (D is a constant known to the compiler). I expect that the compiler will turn this code into a sequence of instructions over 32-bit registers:

```uint32_t compilermod32(uint32_t a) {
return a % D;
}

```

Then we can compute the remainder directly, using some magical mathematics and 64-bit instructions:

```#define M ((uint64_t)(UINT64_C(0xFFFFFFFFFFFFFFFF) / (D) + 1))

uint32_t directmod64(uint32_t a) {
uint64_t lowbits = M * a;
return ((__uint128_t)lowbits * D) >> 64;
}
```

Finally, you can compute the remainder “indirectly” (by first computing the quotient) but using 64-bit processor instructions.

```uint32_t indirectmod64(uint32_t a) {
uint64_t quotient = ( (__uint128_t) M * a ) >> 64;
return a - quotient * D;
}
```

As a benchmark, I am going to compute a linear congruential generator (basically a recursive linear function with a remainder thrown in), using these three approaches, plus the naive one. I use as a divisor the constant number 22, a skylake processor and the GNU GCC 8.1 compiler. For each generated number I measure the following number of CPU cycles (on average):

 slow (division instruction) 29 cycles compiler (32-bit) 12 cycles direct (64-bit) 10 cycles indirect (64-bit) 11 cycles

Depending on your exact platform, all three approaches (compiler, direct, indirect) could be a contender for best results. In fact, it is even possible that the division instruction could win out in some cases. For example, on ARM and POWER processors, the division instruction does beat some compilers.

Where does this leave us? There is no silver bullet but a simple C function can beat a state-of-the-art optimizing compiler. In many cases, we found that a direct computation of the 32-bit remainder using 64-bit instructions was best.

Faster remainders when the divisor is a constant: beating compilers and libdivide

Not all instructions on modern processors cost the same. Additions and subtractions are cheaper than multiplications which are themselves cheaper than divisions. For this reason, compilers frequently replace division instructions by multiplications. Roughly speaking, it works in this manner. Suppose that you want to divide a variable n by a constant d. You have that n/d = n * (2N/d) / (2N). The division by a power of two (/ (2N)) can be implemented as a right shift if we are working with unsigned integers, which compiles to single instruction: that is possible because the underlying hardware uses a base 2. Thus if 2N/d has been precomputed, you can compute the division n/d as a multiplication and a shift. Of course, if d is not a power of two, 2N/d cannot be represented as an integer. Yet for N large enoughfootnote, we can approximate 2N/d by an integer and have the exact computation of the remainder for all possible n within a range. I believe that all optimizing C/C++ compilers know how to pull this trick and it is generally beneficial irrespective of the processor’s architecture.

The idea is not novel and goes back to at least 1973 (Jacobsohn). However, engineering matters because computer registers have finite number of bits, and multiplications can overflow. I believe that, historically, this was first introduced into a major compiler (the GNU GCC compiler) by Granlund and Montgomery (1994). While GNU GCC and the Go compiler still rely on the approach developed by Granlund and Montgomery, other compilers like LLVM’s clang use a slightly improved version described by Warren in his book Hacker’s Delight.

What if d is a constant, but not known to the compiler? Then you can use a library like libdivide. In some instances, libdivide can even be more efficient than compilers because it uses an approach introduced by Robison (2005) where we not only use multiplications and shifts, but also an addition to avoid arithmetic overflows.

Can we do better? It turns out that in some instances, we can beat both the compilers and a library like libdivide.

Everything I have described so far has to do with the computation of the quotient (n/d) but quite often, we are looking for the remainder (noted n % d). How do compilers compute the remainder? They first compute the quotient n/d and then they multiply it by the divisor, and subtract all of that from the original value (using the identity n % d = n - (n/d) * d).

Can we take a more direct route? We can.

Let us go back to the intuitive formula n/d = n * (2N/d) / (2N). Notice how we compute the multiplication and then drop the least significant N bits? It turns out that if, instead, we keep these least significant bits, and multiply them by the divisor, we get the remainder, directly without first computing the quotient.

The intuition is as follows. To divide by four, you might choose to multiply by 0.25 instead. Take 5 * 0.25, you get 1.25. The integer part (1) gives you the quotient, and the decimal part (0.25) is indicative of the remainder: multiply 0.25 by 4 and you get 1, which is the remainder. Not only is this more direct and potential useful in itself, it also gives us a way to check quickly whether the remainder is zero. That is, it gives us a way to check that we have an integer that is divisible by another: do x * 0.25, the decimal part is less than 0.25 if and only if x is a multiple of 4.

This approach was known to Jacobsohn in 1973, but as far as I can tell, he did not derive the mathematics. Vowels in 1994 worked it out for the case where the divisor is 10, but (to my knowledge), nobody worked out the general case. It has now been worked out in a paper to appear in Software: Practice and Experience called Faster Remainder by Direct Computation.

In concrete terms, here is the C code to compute the remainder of the division by some fixed divisor d:

```uint32_t d = ...;// your divisor > 0

uint64_t c = UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;

// fastmod computes (n mod d) given precomputed c
uint32_t fastmod(uint32_t n ) {
uint64_t lowbits = c * n;
return ((__uint128_t)lowbits * d) >> 64;
}
```

The divisibility test is similar…

```uint64_t c = 1 + UINT64_C(0xffffffffffffffff) / d;

// given precomputed c, checks whether n % d == 0
bool is_divisible(uint32_t n) {
return n * c <= c - 1;
}
```

To test it out, we did many things, but in one particular tests, we used a hashing function that depends on the computation of the remainder. We vary the divisor and compute many random values. In one instance, we make sure that the compiler cannot assume that the divisor is known (so that the division instruction is used), in another case we let the compiler do its work, and finally we plug in our function. On a recent Intel processor (Skylake), we beat state-of-the-art compilers (e.g., LLVM’s clang, GNU GCC).

The computation of the remainder is nice, but I really like better the divisibility test. Compilers generally don’t optimize divisibility tests very well. A line of code like (n % d) = 0 is typically compiled to the computation of the remainder ((n % d)) and a test to see whether it is zero. Granlund and Montgomery have a better approach if d is known ahead of time and it involves computing the inverse of an odd integer using Newton’s method. Our approach is simpler and faster (on all tested platforms) in our tests. It is a multiplication by a constant followed by a comparison of the result with said constant: it does not get much cheaper than that. It seems that compilers could easily apply such an approach.

We packaged the functions as part of a header-only library which works with all major C/C++ compilers (GNU GCC, LLVM’s clang, Visual Studio). We also published our benchmarks for research purposes.

I feel that the paper is short and to the point. There is some mathematics, but we worked hard so that it is as easy to understand as possible. And don’t skip the introduction! It tells a nice story.

The paper contains carefully crafted benchmarks, but I came up with a fun one for this blog post which I call “fizzbuzz”. Let us go through all integers in sequence and count how many are divisible by 3 and how many are divisible by 5. There are far more efficient ways to do that, but here is the programming 101 approach in C:

```  for (uint32_t i = 0; i < N; i++) {
if ((i % 3) == 0)
count3 += 1;
if ((i % 5) == 0)
count5 += 1;
}
```

Here is the version with our approach:

```static inline bool is_divisible(uint32_t n, uint64_t M) {
return n * M <= M - 1;
}

...

uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;
for (uint32_t i = 0; i < N; i++) {
if (is_divisible(i, M3))
count3 += 1;
if (is_divisible(i, M5))
count5 += 1;
}
```

Here is the number of CPU cycles spent on each integer checked (average):

 Compiler 4.5 cycles per integer Fast approach 1.9 cycles per integer

I make my benchmarking code available. For this test, I am using an Intel (skylake) processing and GCC 8.1.

Your results will vary. Our proposed approach may not always be faster. However, we can claim that some of time, it is advantageous.

Update: There is a Go library implementing this technique.

Further reading: Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries, Software: Practice and Experience 49 (6), 2019.

Footnote: What is N? If both the numerator n and the divisor d are 32-bit unsigned integers, then you can pick N=64. This is not the smallest possible value. The smallest possible value is given by Algorithm 2 in our paper and it involves a bit of mathematics (note: the notation in my blog post differs from the paper, N becomes F).

Follow-up blog post: More fun with fast remainders when the divisor is a constant (where I discuss finer points)

Quickly sampling from two arrays (C++ edition)

Suppose that you are given two arrays. Maybe you have a list of cities from the USA and a list of cities from Europe. You want to generate a new list which mixes the two lists, taking a sample from one array (say 50%), and a sample from the other array (say 50%). So if you have 50 cities from the USA and 50 cities from Europe, you want a new array that contains, in random order, 25 cities from the USA and 25 cities from Europe.

We need this kind of mixed sampling all the time in machine learning or data science. This summer, I was running simulations and the bulk of the time was spent mixing arrays. I need to pick, say, 25% of all elements from one array and combine them with, say, 75% of all elements from another array.

There are many bad ways to solve this problem. But here is a reasonable one. First you pick a sample from the first array using reservoir sampling; then you pick a sample from the other array (again using reservoir sampling), and you finally apply a random shuffle to the result.

Reservoir sampling is an efficient way to sample N values from an array:

```  for (i = 0; i < N; i++) {
output[i] = source[i];
}
for (; i < ...; i++) {
r = random_bounded(i);// value in [0,i)
if (r < howmany) {
output[r] = source[i];
}
}
```

Knuth shuffling is an efficient way to randomly shuffle the elements in an array:

```  for (i = size; i > 1; i--) {
r = random_bounded(i);// value in [0,i)
swap(array[i-1], array[r]);
}
```

With these two algorithms in place, I can sample from two source arrays using three function calls:

```reservoirsampling(output, N1, source1, length1);
reservoirsampling(output + N1, N2, source2, length2);
shuffle(output, N1 + N2);
```

So how efficient is it? Suppose that I have two arrays made of a million elements each and I want to sample half a million elements from each. On my iMac, I use a bit over 12 CPU cycles per input element (so about 24 million cycles in total). You probably can go even faster, but this approach has the benefit of being both simple and efficient.