In my latest post, I explained how you can accelerate 32-bit integer divisions by transforming them into 64-bit floating-point divisions. Indeed, on most processors, 64-bit floating-point numbers can represent all 32-bit integers exactly.
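The trick can be sketched in a few lines of C (my own illustration, with a made-up function name, not the code from the earlier post). Both operands convert to doubles exactly, and for operands below 2^32 the correctly rounded quotient truncates to the same value as the exact integer quotient:

```c
#include <stdint.h>

// Divide 32-bit unsigned integers through a 64-bit float division.
// Every uint32_t fits exactly in a double's 53-bit significand, and
// for operands this small the rounded quotient never crosses an
// integer boundary, so truncation yields the exact integer quotient.
uint32_t divide_via_double(uint32_t a, uint32_t b) {
  return (uint32_t)((double)a / (double)b);
}
```

The same reasoning covers signed 32-bit integers, since both C integer division and float-to-int conversion truncate toward zero.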

It is a strange result: Intel processors seem to do a lot better with floating-point divisions than integer divisions.

Recall the numbers that I got for the throughput of division operations:

| Operation | Throughput |
|---|---|
| 64-bit integer division | 25 cycles |
| 32-bit integer division (compile-time constant) | 2+ cycles |
| 32-bit integer division | 8 cycles |
| 32-bit integer division via 64-bit float | 4 cycles |

I decided to run the same test on a 64-bit ARM processor (AMD A1100):

| Operation | Time per operation |
|---|---|
| 64-bit integer division | 7 ns |
| 32-bit integer division (compile-time constant) | 2 ns |
| 32-bit integer division | 6 ns |
| 32-bit integer division via 64-bit float | 18 ns |

These numbers are rough, and my benchmark is naive (see code). Still, on this particular ARM processor, 64-bit floating-point divisions are no faster (in throughput) than 32-bit integer divisions. So ARM processors differ from Intel x64 processors quite a bit in this respect.

One important note about the UDIV/SDIV instructions in the arm64 (ARMv8) ISA: "The divide instructions do not generate a trap upon division by zero, but write zero to the destination register."

Can you check the same with 16-bit integers and 32-bit floats? Maybe the ARM processor's integer divider is simply not fast, say it goes through a lot of µops to get the result, while the 32-bit float divider is more likely to be fast.

Another caveat is that on SKX you are pushed even more toward division-free algorithms, as you only have a double-pumped 256-bit floating-point divider for a 512-bit vector. Still, there is no vector integer divider at all, so it is much faster than scalar integer division.

You can pull the same trick with 16-bit integers, yes. It is a good observation.
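A sketch of the 16-bit variant (my illustration, hypothetical function name): every 16-bit integer fits exactly in a 32-bit float's 24-bit significand, so the same truncation argument applies.

```c
#include <stdint.h>

// Divide 16-bit unsigned integers through a 32-bit float division.
// Every uint16_t fits exactly in a float's 24-bit significand, so the
// truncated float quotient matches the exact integer quotient.
uint16_t divide_via_float(uint16_t a, uint16_t b) {
  return (uint16_t)((float)a / (float)b);
}
```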