# Fast exact integer divisions using floating-point operations (ARM edition)

In my latest post, I explained how you could accelerate 32-bit integer divisions by transforming them into 64-bit floating-point divisions. Indeed, 64-bit floating-point numbers can represent all 32-bit integers exactly on most processors.
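As a reminder of what the trick looks like, here is a minimal sketch in C. The function name `divide32_via_double` is made up for illustration and is not taken from my benchmark code; it assumes unsigned inputs and a nonzero divisor.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper (name chosen for this sketch): divide two unsigned
// 32-bit integers using a single 64-bit floating-point division.
static inline uint32_t divide32_via_double(uint32_t n, uint32_t d) {
    assert(d != 0); // division by zero is not handled in this sketch
    // Both conversions are exact: a double has a 53-bit significand,
    // which is more than enough to hold any 32-bit integer.
    double q = (double)n / (double)d;
    // Converting back to an integer type truncates toward zero,
    // which matches unsigned integer division.
    return (uint32_t)q;
}
```

For example, `divide32_via_double(7, 2)` yields 3, just like `7 / 2` does with integer operands.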

It is a strange result: Intel processors seem to do a lot better with floating-point divisions than integer divisions.

Recall the numbers that I got for the throughput of division operations:

| Operation | Cost |
| --- | --- |
| 64-bit integer division | 25 cycles |
| 32-bit integer division (compile-time constant) | 2+ cycles |
| 32-bit integer division | 8 cycles |
| 32-bit integer division via 64-bit float | 4 cycles |

I decided to run the same test on a 64-bit ARM processor (AMD A1100):

| Operation | Cost |
| --- | --- |
| 64-bit integer division | 7 ns |
| 32-bit integer division (compile-time constant) | 2 ns |
| 32-bit integer division | 6 ns |
| 32-bit integer division via 64-bit float | 18 ns |

These numbers are rough: my benchmark is naive (see code). Still, on this particular ARM processor, 64-bit floating-point divisions are not faster (in throughput) than 32-bit integer divisions. So ARM processors differ from Intel x64 processors quite a bit in this respect.
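To give an idea of what such a naive measurement can look like, here is a rough sketch in C. It is not my actual benchmark code: the names `sum_div_int`, `sum_div_double` and `ns_per_division` are invented for this example, and it relies on the coarse `clock()` timer, so it only makes sense on large arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

// A kernel divides n[i] by d[i] for every i and returns the sum of the
// quotients, so the compiler cannot discard the divisions.
typedef uint32_t (*div_kernel)(const uint32_t *, const uint32_t *, size_t);

// Plain 32-bit integer divisions (divisors must be nonzero).
uint32_t sum_div_int(const uint32_t *n, const uint32_t *d, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += n[i] / d[i];
    return sum;
}

// The same divisions routed through 64-bit floating-point numbers.
uint32_t sum_div_double(const uint32_t *n, const uint32_t *d, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (uint32_t)((double)n[i] / (double)d[i]);
    return sum;
}

// Hypothetical timing helper: average nanoseconds of CPU time per division,
// as reported by clock(). The checksum is returned through a pointer so the
// caller can verify that both kernels agree.
double ns_per_division(div_kernel f, const uint32_t *n, const uint32_t *d,
                       size_t len, uint32_t *checksum) {
    assert(len > 0);
    clock_t start = clock();
    *checksum = f(n, d, len);
    clock_t end = clock();
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)len;
}
```

You would call `ns_per_division` once with each kernel on the same large arrays, check that the two checksums match, and compare the reported times. A more careful benchmark would repeat the measurement and guard against the compiler vectorizing or reordering the loops.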

Daniel Lemire, "Fast exact integer divisions using floating-point operations (ARM edition)," in Daniel Lemire's blog, November 17, 2017.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 4 thoughts on “Fast exact integer divisions using floating-point operations (ARM edition)”

1. Cyril says:

One important note about the UDIV/SDIV instructions on arm64, from the ARMv8 ISA: “The divide instructions do not generate a trap upon division by zero, but write zero to the destination register.”

2. eden segal says:

Can you check the same with 16-bit integers and 32-bit floats? Maybe the ARM processor's divider is not fast, say it goes through a lot of uops to get the result, but 32-bit float division is more likely to be fast.
Another caveat is that on SKX you are pushed even more toward a division-free algorithm, as you have only a double-pumped 256b divider for a 512b vector. There is still no integer divider, so it is still much faster than scalar int.

1. You can pull the same trick with 16-bit integers, yes. It is a good observation.

3. Timothy Herchen says:

This is nice. Note that if you need signed (floor) integer division this way, you can set the FP control register to round toward -inf (`_mm_setcsr(_MM_ROUND_DOWN)`, or `fesetround(FE_DOWNWARD)` for portability).
