On current processors, integer division is slow. If you need to compute many quotients or remainders, you can be in trouble. You potentially need divisions when programming a circular buffer, a hash table, generating random numbers, shuffling data randomly, sampling from a set, and so forth.

There are many tricks to avoid performance penalties:

- You can avoid dividing by an arbitrary integer and, instead, divide by a known power of two.
- You can use a divisor that is known to your compiler at compile-time. In these cases, most optimizing compilers will “optimize away” the division using magical algorithms that precompute a fast division routine.
- If you have a divisor that is not known at a compile time, but that you reuse often, you can make use of a library like liddivide to precompute a fast division routine.
- You can reengineer your code to avoid needing a division in the first place, see my post A fast alternative to the modulo reduction.

But sometimes, you are really stuck and need those divisions. The divisor is not frequently reused, and you have lots of divisions to do.

If you have 64-bit integers, and you need those 64 bits, then you might be in a bit of trouble. Those long 64-bit integers have a terribly slow division on most processors, and there may not be a trivial way to avoid the price.

However, if you have a 32-bit integers, you might have a way out. Modern 64-bit processors have 64-bit floating-pointer numbers using IEEE standards. These 64-bit floating-point numbers can be used to represent exactly all integers in the interval [0,2^{53}). That means that you can safely cast your 32-bit unsigned integers as 64-bit floating-point numbers.

Furthermore, common x64 processors have fast floating-point divisions. And the division operation over floating-point numbers is certain to result in the closest number that the standard can represent. The division of an integer in [0,2^{32}) by an integer in [1,2^{32}) is sure to be in [0,2^{32}). This means that you can almost replace the 32-bit integer division by a 64-bit floating point division:

uint32_t divide(uint32_t a, uint32_t b) { double da = (double) a; double db = (double) b; double q = da/db; return (uint32_t) q; }

Sadly, if you try to divide by zero, you will not get a runtime error, but rather some nonsensical result. Still, if you can be trusted to not divide by zero, this provides a fast and exact integer division routine.

How much faster is it? I wrote a small program to measure the throughput:

64-bit integer division | 25 cycles |

32-bit integer division (compile-time constant) | 2+ cycles |

32-bit integer division | 8 cycles |

32-bit integer division via 64-bit float | 4 cycles |

These numbers are rough, but we can estimate that we double the throughput.

I am not entirely sure why compilers fail to exploit this trick. Of course, they would need to handle the division by zero, but that does not seem like a significant barrier. There is also another downside to the floating-point approach: it generates many more instructions.

Regarding signed integers, they work much the same, but you need extra care. For example, most processors rely on two’s complement notation which implies that you have one negative number that cannot be represented as a positive number. Thus implementing “x / (-1)” can cause some headaches. You probably do not want to divide signed integers anyhow.

I plan to come back to the scenario where you have lots of 64-bit integer divisions with a dynamic divisor.

This result is for current Intel x64 processors, what happens on ARM processors is quite different.

The awk programming language is defined to represent all numbers in (double precision) floating point. This may have been a more practical decision that I originally expected.

It would be interesting to see how the awka compiler performs versus a native C integer and/or floating point application that is heavy with division. From what I’ve read here, awka could likely beat an a.out in some circumstances.

While looking into this, I found there is an FIDIV instruction that auto-converts the divisor from int32 to double, but I don’t see Godbolt choosing it for any compiler. It seems to have the same latency as FDIV, but instead I see compilers picking CVTSI2SD followed by FDIV.

Maybe I’m missing something, but FIDIV seems like it should have lower reciprocal latency than the latter two instructions.

Some old note that Agner Fog wrote:

FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more.I thought I saw some architectures in Agner’s tables where FDIV and FIDIV have the same latency, but I didn’t have time to dig into it.

Occurs to me that the SSE2 instruction DIVPS performs two 64-bit float divides in parallel, and its modern variants would do four or eight in parallel, respectively. There exists no SSE instruction for integer divides.

It’s not always possible to do so many divides in parallel, but it should be possible to adapt your test to use SSE intrinsics and see a corresponding speedup when doing uint32_t division as float64 using these SIMD instructions.

whoops, DIVPD not DIVPS š

Daniel, thank you for the great article!

Do you have maybe a good links how integer division is implemented in HW nowadays? I’m rather interested in algorithm and high-level HW design.

I have some idea about how additions and multiplications are implemented in hardware with circuits, but I have no idea whatsoever how the division is implemented at the hardware level…

Have we looked at the assembly generated from the test program? I am starting to suspect that the compiler auto-vectorized the loop using SSE instructions. If the compiler used SSE2 to auto-vectorize, then it would have emitted the DIVPD instruction, as I mentioned earlier. This instruction has 2x parallelism, and is expected to have roughly double the performance of a non-parallel DIV instruction.

I posted my code online:

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2017/11/16/divide.c

So one can verify whether I made a mistake in my analysis.