Fast exact integer divisions using floating-point operations (ARM edition)

In my previous post, I explained how you can accelerate 32-bit integer divisions by transforming them into 64-bit floating-point divisions. Indeed, 64-bit floating-point numbers can represent all 32-bit integers exactly on most processors.

It is a strange result: Intel processors seem to do a lot better with floating-point divisions than with integer divisions.

Recall the numbers that I got for the throughput of division operations:

64-bit integer division: 25 cycles
32-bit integer division (compile-time constant): 2+ cycles
32-bit integer division: 8 cycles
32-bit integer division via 64-bit float: 4 cycles

I decided to run the same test on a 64-bit ARM processor (AMD A1100):

64-bit integer division: 7 ns
32-bit integer division (compile-time constant): 2 ns
32-bit integer division: 6 ns
32-bit integer division via 64-bit float: 18 ns

These numbers are rough, and my benchmark is naive (see the code). Still, on this particular ARM processor, 64-bit floating-point divisions are not faster (in throughput) than 32-bit integer divisions. So ARM processors differ quite a bit from Intel x64 processors in this respect.

3 thoughts on “Fast exact integer divisions using floating-point operations (ARM edition)”

  1. One important note about the UDIV/SDIV instructions in the arm64 (ARMv8) ISA: "The divide instructions do not generate a trap upon division by zero, but write zero to the destination register."

  2. Can you check the same with 16-bit integers and 32-bit floats? Maybe the ARM processor's divider is not fast, say it goes through a lot of uops to get the result, but the 32-bit float division is more likely to be fast.
    Another caveat is that on SKX you are pushed harder toward a division-less algorithm, as you have only a double-pumped 256-bit divider for a 512-bit vector. There is still no integer divider, so it is much faster than scalar integer division.
