A fast function to check your floating-point rounding mode

For speed, we use finite-precision number types in software. When doing floating-point computations in software, the results are usually not exact. For example, you can have the number 1.0 and the number 2-1000. They can both be represented exactly using standard double-precision IEEE floating-point numbers. However, you cannot represent their sum or difference exactly. So what does a computer do when you ask it to compute (1.0 + 2-1000) or (1.0 – 2-1000) ? It usually outputs the nearest approximation, that 1.0. In other words, it looks at all the numbers that it can represent and picks one that is nearest to the exact value. This works for most basic operations like addition, multiplication, division, subtraction.

However, it is possible to ask the processor to change the rounding mode. In some cases, you can request a specific rounding mode per instruction (e.g., with some AVX-512 instructions), but most times there is one setting valid for all instructions within the current thread. In C/C++, you can change the rounding mode to round upward (fesetround(FE_UPWARD)), downward (fesetround(FE_DOWNWARD)), toward zero (fesetround(FE_TOWARDZERO)). The default is to round to the nearest value (fesetround(FE_TONEAREST)).

Some of your code might be assuming that the rounding is to the nearest value. Maybe you want to check that it is so if you are providing a function for others to use. All you need is the following…

#include <fenv.h>

bool are_we_ok() {
  return fegetround() == FE_TONEAREST;

Unfortunately, this function is not free. It might be a function call to a non-trivial instruction such as stmxcsr on x64 processors. It is fine to call it from time to time, but if you have tiny functions that only cost a few hundred instructions, it may be too expensive.

The C functions strtod()/strtof()  available on Linux contain the following switch/case code:

	bc.rounding = 1;
	switch(fegetround()) {
	  case FE_TOWARDZERO:	bc.rounding = 0; break;
	  case FE_UPWARD:	bc.rounding = 2; break;
	  case FE_DOWNWARD:	bc.rounding = 3;

It means that your software might routinely call the fegetround() function. Thankfully, there is a fast approach requiring just an addition, a subtraction and a comparison. Let x be a small value, much smaller than 1, so that 1+x is very close to 1 and should be rounded to 1. Then it suffices to check that (1+x == 1-x). Indeed, when rounding up, 1+x will differ from 1, it will round to a small number just slightly larger than 1. Similarly, when rounding toward zero or rounding down, 1-x will get rounded to a value just under 1. Only when you round to the nearest value do you get that 1+x == 1-x.

You might think that we simply can do 1.0 + 1e-100 == 1.0 - 1e-100, using a function such as this one…

bool rounds_to_nearest() {
  return (1.0f +  1e-38f == 1.0f - 1e-38f);

Sadly, your optimizing compiler might turn this function into a function which returns true unconditionally, irrespective of the processor’s rounding mode.

As a workaround, the following function should do:

bool rounds_to_nearest() {
  static volatile float fmin = 1e-38;
  return (fmin + 1.0f == 1.0f - fmin);

We declare the small value to be volatile: this asks the C/C++ compiler to load the variable each time it needs to access it. It prevents the compiler from optimizing away the function at compile time. The compiler is forced to recompute the function each time.

You can do slightly better performance-wise because this function loads the value 1e-38 twice. To load it just once, do…

bool rounds_to_nearest() {
  static volatile float fmin = 1e-38;
  float fmini = fmin;
  return (fmini + 1.0f == 1.0f - fmini);

In my tests, on both an Apple ARM platform and on a recent Intel processor with GCC, this fast function is several times faster than a call to fegetround(). It is cheap enough that you can probably call it repeatedly without much care.

Credit: I took the idea from GitHub user @mwalcott3.

Note: Some processors have modes where all operations are done using a higher precision (e.g., 80 bits) so a float may use more than 32 bits internally. This means that you should choose you small factor (e.g. 1e-38) to be really small.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “A fast function to check your floating-point rounding mode”

  1. Something like this: https://godbolt.org/z/svo8hjq9K may be preferable to the volatile.


    /* Make the compiler believe that something it cannot see is reading
    this value either from a register or directly from its location in
    memory. */

    #define extern_write(p) asm volatile("" : "+rm"(p) : : )

    bool rounds_to_nearest() noexcept {
    float a = 1.0f + 1e-38f;
    float b = 1.0f - 1e-38f;

    // fake a write to a and b so compiler
    // cannot elide the compare


    return a == b;

    1. This will not work, as the compiler will happily precalculate the sum and difference and load those constants at run time. This is exactly what your godbolt link shows clang doing.

      There is also something funny going on with storing and loading the constants multiple times, so it seems the compiler will generate better code with volatile.

      1. Hey! Good catch, my code is not correct at all.

        Here’s a better version in which I actually take a look at the output and I’m pretty sure it’s doing the right thing!


        This one is using a less clever and more correct macro to pretend that some instruction read and wrote to a general purpose register, breaking the compiler’s ability to fold constants. The code gen in this version is much better and much more correct than the previous broken one..

        #define munge(p) asm (“# munge(” #p “[%0])@” : “+r”(p) : : )

        this version results in bouncing our values off of register instead of the L1 (as the (volatile has to do, kinda), and performance does seem to be better on both my M1 arm machine and a rather ancient Ivy Bridge machine I have.

        On x68:

        $ ./a.out
        v 932007
        f 341029
        v 698101
        f 340962
        v 698105
        f 340962
        v 705098
        f 326151

        but this is all a bit dubiously useful. I only mention it because I’ve switched from using volatiles to hide info from compilers to using macros like this, but I usually inspect the ASM much more carefully!

  2. A couple comments, in no particular order:
    – Loading fmin to a temporary saves a load, at least on GCC.
    – With flags such as -ffast-math / -funsafe-math-optimizations, the volatile trick still doesn’t guarantee that the compiler won’t invalidate your code because it sees 1.0f on both sides of the comparison. See the following failed implementation of a rounds_down variant for an example (though not in your code): https://godbolt.org/z/ef7W7b4c7

    Just for fun, here are quick implementations of the other 3 “official” rounding modes in the C standard (that is, not “implementation-defined”):
    I’m interested to see a faster variant of fegetround(), supporting all 4 modes, and comparisons against the current standard library version.

    1. Yes, I would imagine using a temporary should improve things for most compilers. Because otherwise the compiler is forced to load it twice to preserve the access of the volatile variable.

      I get why you want to use “std::numeric_limits::epsilon() / 2” and that is certainly a number that makes more sense in certain respects. But it needs to be slightly smaller than that since with nearest rounding (1.0f – (std::numeric_limits::epsilon() / 2)) does not equal 1.0f. This is due to the fact epsilon is the difference between the 1.0 and the next representable number GREATER than 1.0. The difference going toward zero happens to be smaller than that. Example of your rounds_to_nearest failing https://godbolt.org/z/86e6778xe because of this.

      1. That is an excellent point; as an alternative to adjusting epsilon (which would need to vary based on sign), selecting a base value precisely representable in the range (1,2) suffices.

        As an example, selecting 1.5f instead of 1.0f gives the desired behavior. See corrected implementations here: https://godbolt.org/z/Mr8enbMWo

        Thank you for highlighting the value of code review.

  3. It might be worth noting that 1e-100 can *not* be represented exactly, just as the sum of 1 and 1e-100 (this doesn’t matter for the mechanics of the rounding mode estimation, but it confused me).

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.