Faster remainders when the divisor is a constant: beating compilers and libdivide

Not all instructions on modern processors cost the same. Additions and subtractions are cheaper than multiplications which are themselves cheaper than divisions. For this reason, compilers frequently replace division instructions by multiplications. Roughly speaking, it works in this manner. Suppose that you want to divide a variable n by a constant d. You have that n/d = n * (2N/d) / (2N). The division by a power of two (/ (2N)) can be implemented as a right shift if we are working with unsigned integers, which compiles to single instruction: that is possible because the underlying hardware uses a base 2. Thus if 2N/d has been precomputed, you can compute the division n/d as a multiplication and a shift. Of course, if d is not a power of two, 2N/d cannot be represented as an integer. Yet for N large enoughfootnote, we can approximate 2N/d by an integer and have the exact computation of the remainder for all possible n within a range. I believe that all optimizing C/C++ compilers know how to pull this trick and it is generally beneficial irrespective of the processor’s architecture.

The idea is not novel and goes back to at least 1973 (Jacobsohn). However, engineering matters because computer registers have finite number of bits, and multiplications can overflow. I believe that, historically, this was first introduced into a major compiler (the GNU GCC compiler) by Granlund and Montgomery (1994). While GNU GCC and the Go compiler still rely on the approach developed by Granlund and Montgomery, other compilers like LLVM’s clang use a slightly improved version described by Warren in his book Hacker’s Delight.

What if d is a constant, but not known to the compiler? Then you can use a library like libdivide. In some instances, libdivide can even be more efficient than compilers because it uses an approach introduced by Robison (2005) where we not only use multiplications and shifts, but also an addition to avoid arithmetic overflows.

Can we do better? It turns out that in some instances, we can beat both the compilers and a library like libdivide.

Everything I have described so far has to do with the computation of the quotient (n/d) but quite often, we are looking for the remainder (noted n % d). How do compilers compute the remainder? They first compute the quotient n/d and then they multiply it by the divisor, and subtract all of that from the original value (using the identity n % d = n - (n/d) * d).

Can we take a more direct route? We can.

Let us go back to the intuitive formula n/d = n * (2N/d) / (2N). Notice how we compute the multiplication and then drop the least significant N bits? It turns out that if, instead, we keep these least significant bits, and multiply them by the divisor, we get the remainder, directly without first computing the quotient.

The intuition is as follows. To divide by four, you might choose to multiply by 0.25 instead. Take 5 * 0.25, you get 1.25. The integer part (1) gives you the quotient, and the decimal part (0.25) is indicative of the remainder: multiply 0.25 by 4 and you get 1, which is the remainder. Not only is this more direct and potential useful in itself, it also gives us a way to check quickly whether the remainder is zero. That is, it gives us a way to check that we have an integer that is divisible by another: do x * 0.25, the decimal part is less than 0.25 if and only if x is a multiple of 4.

This approach was known to Jacobsohn in 1973, but as far as I can tell, he did not derive the mathematics. Vowels in 1994 worked it out for the case where the divisor is 10, but (to my knowledge), nobody worked out the general case. It has now been worked out in a paper to appear in Software: Practice and Experience called Faster Remainder by Direct Computation.

In concrete terms, here is the C code to compute the remainder of the division by some fixed divisor d:

uint32_t d = ...;// your divisor > 0

uint64_t c = UINT64_C(0xFFFFFFFFFFFFFFFF) / d + 1;

// fastmod computes (n mod d) given precomputed c
uint32_t fastmod(uint32_t n ) {
  uint64_t lowbits = c * n;
  return ((__uint128_t)lowbits * d) >> 64; 
}

The divisibility test is similar…

uint64_t c = 1 + UINT64_C(0xffffffffffffffff) / d;


// given precomputed c, checks whether n % d == 0
bool is_divisible(uint32_t n) {
  return n * c <= c - 1; 
}

To test it out, we did many things, but in one particular tests, we used a hashing function that depends on the computation of the remainder. We vary the divisor and compute many random values. In one instance, we make sure that the compiler cannot assume that the divisor is known (so that the division instruction is used), in another case we let the compiler do its work, and finally we plug in our function. On a recent Intel processor (Skylake), we beat state-of-the-art compilers (e.g., LLVM’s clang, GNU GCC).

The computation of the remainder is nice, but I really like better the divisibility test. Compilers generally don’t optimize divisibility tests very well. A line of code like (n % d) = 0 is typically compiled to the computation of the remainder ((n % d)) and a test to see whether it is zero. Granlund and Montgomery have a better approach if d is known ahead of time and it involves computing the inverse of an odd integer using Newton’s method. Our approach is simpler and faster (on all tested platforms) in our tests. It is a multiplication by a constant followed by a comparison of the result with said constant: it does not get much cheaper than that. It seems that compilers could easily apply such an approach.

We packaged the functions as part of a header-only library which works with all major C/C++ compilers (GNU GCC, LLVM’s clang, Visual Studio). We also published our benchmarks for research purposes.

I feel that the paper is short and to the point. There is some mathematics, but we worked hard so that it is as easy to understand as possible. And don’t skip the introduction! It tells a nice story.

The paper contains carefully crafted benchmarks, but I came up with a fun one for this blog post which I call “fizzbuzz”. Let us go through all integers in sequence and count how many are divisible by 3 and how many are divisible by 5. There are far more efficient ways to do that, but here is the programming 101 approach in C:

  for (uint32_t i = 0; i < N; i++) {
    if ((i % 3) == 0)
      count3 += 1;
    if ((i % 5) == 0)
      count5 += 1;
  }

Here is the version with our approach:

static inline bool is_divisible(uint32_t n, uint64_t M) {
  return n * M <= M - 1;
}

...


  uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
  uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;
  for (uint32_t i = 0; i < N; i++) {
    if (is_divisible(i, M3))
      count3 += 1;
    if (is_divisible(i, M5))
      count5 += 1;
  }

Here is the number of CPU cycles spent on each integer checked (average):

Compiler 4.5 cycles per integer
Fast approach 1.9 cycles per integer

I make my benchmarking code available. For this test, I am using an Intel (skylake) processing and GCC 8.1.

Your results will vary. Our proposed approach may not always be faster. However, we can claim that some of time, it is advantageous.

Update: There is a Go library implementing this technique.

Further reading: Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries, Software: Practice and Experience 49 (6), 2019.

Footnote: What is N? If both the numerator n and the divisor d are 32-bit unsigned integers, then you can pick N=64. This is not the smallest possible value. The smallest possible value is given by Algorithm 2 in our paper and it involves a bit of mathematics (note: the notation in my blog post differs from the paper, N becomes F).

Follow-up blog post: More fun with fast remainders when the divisor is a constant (where I discuss finer points)

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

37 thoughts on “Faster remainders when the divisor is a constant: beating compilers and libdivide”

  1. Your first paragraph states “multiplications which are themselves slower than divisions”, but surely you meant the opposite.

    1. I ageee. This sounds like a great patch which would be integrated into llvm and/or gcc, rather than a library that would need to be included in. Its great to test to the approach, but realistically most people writing code will not use your library only because they dont know it exists…. It would be better if just a normal part of the compile process (in the compiler itself)

    1. This is applicable to JavaScript. Integer arithmetic in JavaScript is tricky so you would need to focus on a specific problem.

      Evidently, this trick could be implemented in the JavaScript compiler.

      1. Hi Daniel,i have try below javascript code,but get the wrong result ,can you explain why,Thanks.

        function fmod(divisor,num){
        let c=(0xFFFFFFFFFFFFFFFF)/divisor+1;
        let lowbits=cnum;
        return ((lowbits
        divisor)>>64)
        }

        console.log(fmod(5/18))

        1. All numbers in JS are floats under the hood so the raw bits are not going to act the same as ints. That’s probably why the function isn’t working as is in pure JS.

        2. Numbers in JavaScript are stored as doubles, and thus cannot exactly represent an integer over 53 bits. Additionally, the bitwise operators all convert the operands to 32-bit signed integers before performing the operation. You are doubly unable to do 64-bit shifts in JavaScript.

        3. The most likely answer is that JavaScript does not use 64 bit integer numbers. It uses 64 bit floating point numbers which happen to look like integers for most development purposes.

  2. Great article and paper! Small typo: “This approach was known Jacobsohn in 1973” should be “This approach was known TO Jacobsohn in 1973”.

  3. Coincidentally, in September 2018 a slightly extended version of divisibility check shown in Hacker’s Delight was implemented in GCC (so it doesn’t appear in gcc-8, but will be available in gcc-9). Some discussion can be found in GCC PR 82853, and you can see the optimization in action on Compiler Explorer.

  4. Please, have you also tried it with 64-bits “n”, as it would need 192-bits intermediate computation (if I understood correctly)? Can yo perhaps share some timings as well, in addition to the 32-bits case?

    PS: Intel’s Cannon Lake is supposed to have 10-18 cycles latency for integer division, could make for interesting comparison!

    1. Please, have you also tried it with 64-bits “n”, as it would need 192-bits intermediate computation (if I understood correctly)? Can yo perhaps share some timings as well, in addition to the 32-bits case?

      You do not need 192-bit intermediate computations, 128-bit is enough. However, the engineering gets more complicated if you are to deal with overflows properly. We are inviting people to pursue the research. We do not claim to have written the last word on this. If you are interested in pursuing the research and want to chat, do get in touch with us.

      Intel’s Cannon Lake is supposed to have 10-18 cycles latency for integer division, could make for interesting comparison!

      I have a CannonLake processor right here, and here are the results of the fizzbuzz benchmark:

      $ ./fizzbuzz
      count35(N, &count3, &count5)                :  3.690 cycles per input word (best)  3.698 cycles per input word (avg)
      1666666700 1000000000
      fastcount35(N, &count3, &count5)            :  2.082 cycles per input word (best)  2.085 cycles per input word (avg)
      1666666700 1000000000
      

      (This is GCC 8.1.)

      Of course, these fizzbuzz tests do not use the division instruction. The division instruction would be slower. However, it would be less terrible than on skylake processors.

      1. I’ve added support for 64-bit unsigned modulo in a fork here (https://github.com/dnbaker/fastmod/tree/fastmod64). You’re right that 192- (or 256-) bit integers aren’t needed; all you need to do is make sure you handle overflows correctly.

        I’m not saying that this is optimal code, but it does appear to pass the expanded unit tests.

        Of course, division and support for signed integers would be desirable as well, but for my purposes, a fast 64-bit mod is all I need.

        1. Daniel Baker: Thanks for adapting this!

          By the way, your function to compute M:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
          __uint128_t M = UINT64_C(0xFFFFFFFFFFFFFFFF);
          M <<= 64;
          M |= UINT64_C(0xFFFFFFFFFFFFFFFF);
          M /= d;
          M += 1;
          return M;
          }

          could, I believe, simply be written like this instead:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
          return (((__uint128_t)0 - 1) / d) + 1;
          }

          or equivalently:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
          return (-(__uint128_t)1 / d) + 1;
          }

          or even:

          FASTMOD_API __uint128_t computeM_u64(uint64_t d) {
          return (~(__uint128_t)0 / d) + 1;
          }

          That is to say, it’s not necessary to construct and composite the two 64-bit portions in the case of all 1-bits.

          1. I did not write these functions (not in the current form). I think that the contributor who rewrote them meant to make them more transparent.

            Your point is well taken, however.

  5. It’s lookd like this could be quite useful in the cases that applies! For example this libdivide issue could be finally updated (looks like someone got to it since I first ran across it).

    They paper could be clearer though that it only applies (for real ALU sizes) to an N/2-bit division on an N-bit machine. Usually you want an N-bit division on an N-bit machine, and that’s what the competing approaches offer.

    This means that on a (typical) 32-bit machine only 16-divisors are supported, and on 64-bit only 32-bit divisors are supported. On machines like most x86 where 32-bit and 64-bit multiplications have the same cost, this can still be useful when you are dealing with narrower-than-machine-word things as your uint32_t benchmarks show.

    This need for double-width multiplications isn’t just indidental: the main speedup is by avoiding the rounding corrections which are required in the quotient approaches: by doubling the width of the operations you avoid the need for rounding.

  6. Thanks for the good words.

    The mathematics can be presented as an extension of GM’s approach, and it is just as general. A commenter on this blog post has reported an application of the mathematical result to 64-bit unsigned division on 64-bit processors.

    An additional insight is that you can avoid the extra work related to avoiding overflows by just using wider registers.

    I have more ideas regarding other applications of the mathematical results (that have nothing to do with 32-bit/64-bit registers)… but I don’t have time right now to pursue it so it will have to wait.

    1. The mathematics can be presented as an extension of GM’s approach, and
      it is just as general.

      The basic mathematics is general, yes — I think everyone agrees that the remainder can be calculated as frac(n / d) * d and in the abstract that’s just as general as n - trunc(n / d) * d.

      However, the earlier work (and this one) primarily focus on the efficient implementation of these methods on real hardware. That’s the interesting part, if I’m not mistaken?

      In that domain, the general target is to implement W-bit division on W-bit machines, and a large amount of the complexity in analysis and much of the runtime cost comes as a result of the difficulty of doing that.

      So if you are going to implement only W-bit operations assuming that your machine can do 2W-bit arithmetic, I would certainly consider that less general – and comparing W-bit approaches against 2W-bit approaches doesn’t shed much light on the performance of the underlying techniques since it mixes the two effects together. In fact, I claim that the majority of the improvement in the assembly you show is due to the use of the wider operations and more precise reciprocal, rather than the more “direct” calculation method.

      A commenter on this blog post has reported an application of the
      mathematical result to 64-bit unsigned division on 64-bit processors.

      I didn’t look at it, but if it were true – and it was as efficient as the narrow-input case you illustrate – then my point is basically moot (although it of course still applies to the paper itself, which doesn’t mention this).

      1. That’s the interesting part, if I’m not mistaken?

        I cannot dictate what is important or interesting to you. But maybe you agree that more than one thing can be interesting at once.

        As far as we know, nobody worked out the mathematics for the direct computation of the remainder. Authors simply ignored the issue. Completing the mathematical framework was important, I think. A direct application of this “new math” is the divisibility test which I think is really nice.

        In fact, I claim that the majority of the improvement in the assembly you show is due to the use of the wider operations and more precise reciprocal, rather than the more “direct” calculation method.

        It might be. It is certainly worth exploring. We published our benchmarking code (see link above) and we include in these benchmarks an approach that computes the remainder with a fast “wide” division to get the quotient first. It was often faster than other alternatives, but also generally slower than the direct approach. That does not contradict your “majority” claim… but few people are interested in the second-best approach when the best approach is no more expensive. Because it was unexciting, it is not reported in the published paper even if it appears in our logs and public code.

        1. I cannot dictate what is important or interesting to you. But maybe
          you agree that more than one thing can be interesting at once.

          As far as we know, nobody worked out the mathematics for the direct
          computation of the remainder. Authors simply ignored the issue.
          Completing the mathematical framework was important, I think.

          Yes, I did not mean to imply that the mathematics weren’t important. What I mean is that at a high level what makes this whole thing interesting (to me, to you, to almost everyone, I think) in the first place is that it is potentially more efficient than any existing approach. Once that is established, then the math becomes very important indeed to establish the correctness and to understand the limits.

          A direct
          application of this “new math” is the divisibility test which I think
          is really nice.

          It is a good point. I haven’t considered the divisibility test aspect of this or evaluated competing approaches.

          but few people are interested in the second-best approach when the best approach is no more expensive.

          That seems strange to me, I would expect the second best approach (i.e., the best existing approach) to be the most important baseline so you can report an improvement over the best existing approach.

          Or do you mean that this second-best approach is also new? Assuming it’s just a 32-bit division using a 64-bit reciprocal, I don’t think that is right: it is well-known that using a large enough reciprocal removes all rounding errors and hence speeds things up – IIRC the needed number of bits is W + log(d) just like the new direct remainder calculation. Almost every treatment of the topic explains that if you have an X-bit reciprocal, the result is exact, and then go on to handle the harder case where W < X and so you need extra steps to get the exact result, because people do really want W-bit math on W-bit machines.

          Yeah, compilers (and libdivide?) don’t seem to take advantage of this special case, but that’s not news: there are a lot of “narrow word” tricks they don’t implement (and some they do, they seem to have been added over time as special cases).

          So for me, the observation with the biggest impact is that compilers and libdivide should special case 32-bit division on 64-bit platforms (where 64-bit multiplication is as fast as 32-bit at least, and the handling of 64-bit constants isn’t too onerous) by using a finer-precision reciprocal approximation that produces an exact result. This applies to division as well, not just remainder, so it has wide applicability. The most extensive empirical justification I know of for this improvement is hidden in a paper talking about “faster remainders via direct computation” though!

          Compilers and libdivide should of course also consider using this new faster remainder algo when only remainder is needed and the inputs are narrower than the word size (that said, I think you can extend your algorithm to work generally for W-bit divisions on W-bit hardware, using the same types of rounding tricks as division uses – that would certainly give a boost to likelihood of integration, I think). They should also use the divisibility check (although again I didn’t really evaluate this part).

          1. The most extensive empirical justification I know of for this
            improvement is hidden in a paper talking about “faster remainders via
            direct computation” though!

            That’s fair criticism.

  7. This is really great! Do you happen to know of similar research for “mathematical modulo”? That is, where the modulo function is always a unsigned integer, so instead of

    (−𝑛) mod 𝑑 = −(𝑛 mod 𝑑)

    we have

    (−𝑛) mod 𝑑 = d − (𝑛 mod 𝑑)

    I’m also curious if there are any newer results on floating point remainder (for my application specifically, a floating point numerator and integer divisor).

  8. Great. If I ever get time, I’ll take a look at that with your paper as inspiration. Current implementations seem very slow, either using branching or two modulo operations.

  9. For comparison with the Granlund and Montgomery divisibility check, I just put in a couple of changes to implement that check in the Go compiler following the treatment in Hacker’s Delight section 10-17.

    The general method can compute unsigned divisibility by one multiplication and a compare for odd divisors and an additional rotate for even divisors. That matches your method when the divisor is odd, but obviously has the extra rotate instruction when the divisor is even. It also has the benefit of using register-sized operations so can work for 64-bit divisibility checks on 64-bit platforms as well. For signed divisibility, the method is similar but requires an additional add.

    Unsigned divisibility checks: https://github.com/golang/go/commit/a28a9427688846f971017924bcf6e8cb623ffba4

    Signed divisibility checks: https://github.com/golang/go/commit/4d9dd3580624df413d65d83e467fcd6ad4a0168b

  10. As a follow-up to the Granlund-Montgomery-Warren divisibility check, it looks like recently released gcc9.1 now includes the Hacker’s Delight section 10-17 version of the check at -O2 and will vectorize it at -O3.

    Would be interesting to re-run the benchmark above on gcc9.1. Also, looking at the generated code for gcc8.1 it looks like the big advantage of the LKK method is that gcc is able to combine both the 3 and 5 checks into a single multiplication and the compiler can’t vectorize either version because count3 and count5 are stored on every loop iteration rather than only at the end of the loop:

    https://godbolt.org/z/DivJrk

    vs.

    https://godbolt.org/z/nb6hMg

    It also looks like LLVM has an open change from Aug 2, 2018 to include the Granlund-Montgomery-Warren divisibility check optimization:

    https://reviews.llvm.org/D50222

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax