# Converting floating-point numbers to integers while preserving order

Many programming languages have a number type corresponding to the IEEE binary64. In many languages such as Java or C++, it is called a double. A double value uses 64 bits and it represents a significand (or mantissa) multiplied by a power of two: m * 2p. There is also a sign bit.

A simpler data type is the 64-bit unsigned integer. It is a simple binary representation of all numbers between 0 to 264.

In a low-level programming language like C++, you can access a double value as if it were an unsigned integer. After all, bits are bits. For some applications, it can be convenient to regard floating-point numbers as if they were simple 64-bit integers.

In C++, you can do the conversion as follows:

```uint64_t to_uint64(double x) {
uint64_t a;
::memcpy(&a,&x,sizeof(x));
return a;
}
```

Though it looks expensive, an optimizing compiler might turn such code into something that is almost free.

In such an integer representation, a double value looks as follows:

• The most significant bit is the sign bit. It has value 1 when the number is negative, and it has value 0 otherwise.
• The next 11 bits are usually the exponent code (which determines p).
• The other bits (52 of them) are the significand.

If you omit infinite values and not-a-number code, a comparison between two floating-point numbers is almost trivially the same as a comparison two integer values.

If you know that all of your numbers are positive and finite, then you are done. They are already in sorted order. The following comparison function should suffice:

```bool is_smaller(double x1, double x2) {
uint64_t i1 = to_uint64(x1);
uint64_t i2 = to_uint64(x2);
return i1 < i2;
}
```

If your values can be negative, then you minimally need to reverse the sign bit, since it is wrong: we want large values to have their most significant bits set, and small values to have it unset. But just flipping one bit is not enough, you want negative values having a large absolute value to become small. To do so, you need to negate all bits, but only when the sign bit is set. It turns out that some clever programmer has worked up an efficient solution:

```uint64_t sign_flip(uint64_t x) {
// when the most significant bit is set, we need to
// flip all bits
uint64_t mask = uint64_t(int64_t(x) >> 63);
// in all case, we need to flip the most significant bit
}
```

You have now an efficient comparator between two floating-point values using integer arithmetic:

```bool generic_comparator(double x1, double x2) {
uint64_t i1 = sign_flip(to_uint64(x1));
uint64_t i2 = sign_flip(to_uint64(x2));
return i1 < i2;
}
```

For finite numbers, we have shown how to map floating-point numbers to integer values while preserving order. The map is also invertible.

Sometimes you are working with floating-point numbers but would rather process integers. If you only need to preserve order, you can use such a map.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 11 thoughts on “Converting floating-point numbers to integers while preserving order”

1. George Spelvin says:

It might be worth mentioning the similar code for signed integers, which seem a more natural target for signed floating-point values:

```int64_t sign_flip(int64_t x) { uint64_t mask = x >> 62; // Or 63, your choice return x ^ (mask >> 1); } ```

Here, you never flip the sign bit, so the mask needs to cover only the low 63 bits.

On most modern machines, shift instructions are smaller and faster than large immediate constants, but there are some processors (particularly synthesizable cores for FPGA implementation) which have slow shifts. In such cases, an alternative is:

``` uint64_t mask = x & ((uint64_t)1 << 63); return x ^ (mask - 1); ```

2. aqrit says:

Is it possible to compare signed integers with a floating-point compare, or do some bit-patterns not work?

Context: SSE2 has a 64-bit floating-point compare `_mm_cmpgt_pd()` but has only 32-bit integer compares.

Current work-around:

```__m128i pcmpgtq_sse2 (__m128i a, __m128i b) { __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a)); r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b)); return _mm_shuffle_epi32(r, _MM_SHUFFLE(3,3,1,1)); } ```

1. Per Vognsen says:

There’s a simple combinatorial reason you can’t express a 64-bit integer comparison, signed or unsigned, with a 64-bit floating-point comparison: some floating-point numbers “are the same” as far as float comparisons are concerned, notably all the NaN bit patterns, and hence by the pigeonhole principle any mapping from 64-bit ints to 64-bit floats must map some ints to values that won’t order correctly.

But if you have 64-bit uints where the two most significant bits are 0 (i.e. values in the range 0 to 2^62), so the float sign is positive and the exponent can’t be the all 1s pattern which signifies a NaN, you should be able to load it as a 64-bit float and get isomorphic comparisons. If you want 2’s complement signed comparisons this way, I think you’d have to do a pre-transform.

1. aqrit says:

Thank You.

3. Falk Hüffner says:

This can be done with two instructions less:

```bool generic_comparator(double x1, double x2) { uint64_t i0 = to_uint64(x1); uint64_t i1 = to_uint64(x2); uint64_t mask = uint64_t(int64_t(i0) >> 63); mask |= 0x8000000000000000; return (i0 ^ mask) < (i1 ^ mask); } ```

Thanks for the great post – this is going to come in very handy!

I assumed the sign_flip function would be reversible by changing the direction of the bit shift. This seems to work for some values, but not quite. Negative values seem to be incorrect by a small amount. It feels like the most insignificant bit is incorrect in the result. Can you see what I’m doing wrong here?

```uint64_t inverse_sign_flip(uint64_t x) { // credit http://stereopsis.com/radix.html // when the most significant bit is set, we need to // flip all bits uint64_t mask = uint64_t(int64_t(x) << 63); // in all case, we need to flip the most significant bit mask |= 0x8000000000000000; return x ^ mask; } ```

1. Can you elaborate on how you arrive at this inverse function? What is your expectation with respect to what is in the `mask` variable?

My goal is to implement a function that when passed the return value of sign_flip as its argument it returns the original value.

The following must hold true: i.e. `b = sign_flip(a)` then `a = inverse_sign_flip(b)`.

My bit twiddling skills are not my string point, however the above code that I pasted seems to almost work – the result looks incorrect on the least significant bit.

I wouldn’t be able to explain how it worked backwards, but I am surprised the result was so close to being correct for all values I’ve thrown at it (-ve, +ve, large and small numbers).

1. I would recommend against doing it with trial and error…

I have not tested it, but it should be something like that…

``````   uint64_t mask = uint64_t(int64_t(x ^ 0x8000000000000000) >> 63);