In software, compilers can often optimize away integer divisions, and replace them with cheaper instructions, especially when the divisor is a constant. I recently wrote about some work on faster remainders when the divisor is a constant. I reported that it can be fruitful to compute the remainder directly, instead of first computing the quotient (as compilers are doing when the divisor is a constant).

To get good results, we can use an important insight that is not documented anywhere at any length: we can use 64-bit processor instructions to do 32-bit arithmetic. This is fair game and compilers could use this insight, but they do not do it systematically. Using this trick alone is enough to get substantial gains in some instances, if the algorithmic issues are just right.

So it is a bit complicated. Using 64-bit processor instructions for 32-bit arithmetic is sometimes useful. In addition, computing the remainder directly without first computing the quotient is sometimes useful. Let us collect a data point for fun and to motivate further work.

First let us consider how you might compute the remainder by leaving it up to the compiler to do the heavy lifting (D is a constant known to the compiler). I expect that the compiler will turn this code into a sequence of instructions over 32-bit registers:

uint32_t compilermod32(uint32_t a) { return a % D; }

Then we can compute the remainder directly, using some magical mathematics and 64-bit instructions:

#define M ((uint64_t)(UINT64_C(0xFFFFFFFFFFFFFFFF) / (D) + 1)) uint32_t directmod64(uint32_t a) { uint64_t lowbits = M * a; return ((__uint128_t)lowbits * D) >> 64; }

Finally, you can compute the remainder “indirectly” (by first computing the quotient) but using 64-bit processor instructions.

uint32_t indirectmod64(uint32_t a) { uint64_t quotient = ( (__uint128_t) M * a ) >> 64; return a - quotient * D; }

As a benchmark, I am going to compute a linear congruential generator (basically a recursive linear function with a remainder thrown in), using these three approaches, plus the naive one. I use as a divisor the constant number 22, a skylake processor and the GNU GCC 8.1 compiler. For each generated number I measure the following number of CPU cycles (on average):

slow (division instruction) | 29 cycles |

compiler (32-bit) | 12 cycles |

direct (64-bit) | 10 cycles |

indirect (64-bit) | 11 cycles |

Depending on your exact platform, all three approaches (compiler, direct, indirect) could be a contender for best results. In fact, it is even possible that the division instruction could win out in some cases. For example, on ARM and POWER processors, the division instruction does beat some compilers.

Where does this leave us? There is no silver bullet but a simple C function can beat a state-of-the-art optimizing compiler. In many cases, we found that a direct computation of the 32-bit remainder using 64-bit instructions was best.

Interesting!

Btw., GCC 8.1 seems to generate:

`compilermod32(unsigned int):`

mov eax, edi

mov edx, -1171354717

mul edx

mov eax, edx

shr eax, 4

imul eax, eax, 22

sub edi, eax

mov eax, edi

ret

whereas GCC trunk generates one less instruction:

`compilermod32(unsigned int):`

mov eax, edi

mov edx, 3123612579

imul rax, rdx

shr rax, 36

imul eax, eax, 22

sub edi, eax

mov eax, edi

ret

PS: And please, I certainly don’t mean to tell you what to do, but since last time you mentioned you have Cannon Lake nearby and I don’t, I think it would be interesting to compare this on it, too.

PPS: One another common case which would benefit from fast remainders is modular multiplication, i.e.

`a*b%c`

(all 64-bit unsigned ints,`c`

is “runtime” constant) and the way to currently approach it is to use Montgomery multiplication. Just saying…Yes further investigations are warranted.

On Cannonlake the results are similar (broadly speaking) but the division instruction is much faster.

Thanks! Do you happen to have any numbers?

Just noting that up to four simple register-register moves per clock cycle can be “free” on Core microarchitectures in the sense they don’t use an execution unit. I’m not certain how much difference these variants make in practice…

Although it doesn’t use an execution unit, it still uses one of the 4 renamer “slots” which are what bottlenecks all recent Intel chips to at most 4 fused-domain ops per cycle. Since recent CPUs are rich in execution units, it is not uncommon that this limit is a bottleneck rather than EUs. Eliminated moves also take up a slot in the ROB, but not in the scheduler.

So I really hesitate to call them free. Very unscientifically I would say that they are something like half the cost of the cheapest real instructions (things like basic integer math).

Good to know that, I haven’t thought of renamer slots can be such a bottleneck.

BTW, the better GCC generated compilermod32 is still pretty bad code. It could be trivially rewritten as:

`compilermod32(unsigned int):`

mov rax, 3123612579

imul rax, rdi

shr rax, 36

imul eax, eax, -22

add eax, edi

ret

Argh, I forgot that plain 32-bit move on amd64 zeroes upper bits of the 64-bit register. Thus the first move can’t be eliminated this way while maintaining correct semantics. The second, though, can.

Yeah this limit of 4 is pretty fundamental and why people still say that Intel chips are 4-wide (since all the other parts of the pipeline are wider these days: 8-wide retirement, 8 EUs, 6 uops from uop cache, or 5 from legacy decoder, etc).

It’s even baked into Intel’s top-down analysis stuff, which is based on the idea of 4 slots per cycle, and assigning each slot to “effectively used” (they call “retiring”) or empty for some reason in a large tree of possible reasons. I.e., if you solve every possible bottleneck in a top-down analysis, you’ll end up at 4 IPC (4 FUPC, fused-uops-per-cycle, but grant me a bit of looseness in the terminology here).

Even nops count against this limit, even though they too do not execute.

I believe the original code can do uint64_t mod uint32_t, without any changes, except for fastmod_u32’s signature:)

Wrong of course. I thought that because uint128_t can hold the result of uint64_t x uint64_t, but what matters here is the precision.

The most I could get out of testing was you can do a reliable uintA % uintB as long as A + B <= 64. E.g., uint48 % uint16, or uint52 % uint12.

(In uint52 I mean a uint64 with top 12 bits being 0)