A practical trick in software is to “bit-interleave” your data. Suppose that I have two 4-bit integers like 0b1011 (11 in decimal) and 0b1100 (12 in decimal). I can interleave the two numbers to get one 8-bit number 0b11011010: I simply pick the most significant bit from the first number, then the most significant bit from the second number, then the second most significant bit from the first number, and so forth. This is a useful trick because if you sort the interleaved numbers, you can quickly filter on the most significant bits of all components at once. For example, if you are looking for coordinates where both the first and the second component are greater than or equal to 0b1000, then all values you are looking for have the form 0b11****** and thus they are sorted toward the end. This is called a z-order (or Morton order).
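To make the worked example concrete, here is a naive reference implementation (a sketch; the function name `interleave4` is mine) that builds the interleaved byte one bit at a time, most significant bits first:

```c
#include <stdint.h>

// Naive reference: interleave two 4-bit values bit by bit,
// most significant bits first. The first number lands in the
// odd (higher) bit positions, the second in the even positions.
uint8_t interleave4(uint8_t first, uint8_t second) {
    uint8_t result = 0;
    for (int i = 3; i >= 0; i--) {
        result = (uint8_t)((result << 1) | ((first >> i) & 1));
        result = (uint8_t)((result << 1) | ((second >> i) & 1));
    }
    return result;
}
```

With the values above, `interleave4(0b1011, 0b1100)` produces 0b11011010, as claimed.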

Of course, you still have to interleave and de-interleave bits. How expensive is that?

The problem is rather symmetrical, so I will only present the code for interleaving. A useful function is one that takes a 32-bit integer, and spreads its bits over a 64-bit integer, effectively interleaving it with zeros:

uint64_t interleave_uint32_with_zeros(uint32_t input) {
    uint64_t word = input;
    word = (word ^ (word << 16)) & 0x0000ffff0000ffff;
    word = (word ^ (word << 8 )) & 0x00ff00ff00ff00ff;
    word = (word ^ (word << 4 )) & 0x0f0f0f0f0f0f0f0f;
    word = (word ^ (word << 2 )) & 0x3333333333333333;
    word = (word ^ (word << 1 )) & 0x5555555555555555;
    return word;
}

Given this function, you can interleave in the following manner:

interleave_uint32_with_zeros(x) | (interleave_uint32_with_zeros(y) << 1);

I believe that this is the standard approach. It seems efficient enough.
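For completeness, the reverse operation runs the same shift-and-mask pattern backwards (a sketch; the function name `deinterleave_lowuint32` is mine). It gathers the 32 even-position bits of a 64-bit word:

```c
#include <stdint.h>

// Gather the bits at even positions (0, 2, 4, ...) of 'word' into
// a 32-bit integer, undoing interleave_uint32_with_zeros.
uint32_t deinterleave_lowuint32(uint64_t word) {
    word &= 0x5555555555555555;
    word = (word ^ (word >> 1))  & 0x3333333333333333;
    word = (word ^ (word >> 2))  & 0x0f0f0f0f0f0f0f0f;
    word = (word ^ (word >> 4))  & 0x00ff00ff00ff00ff;
    word = (word ^ (word >> 8))  & 0x0000ffff0000ffff;
    word = (word ^ (word >> 16)) & 0x00000000ffffffff;
    return (uint32_t)word;
}
```

Given an interleaved word `w`, you recover the components as `deinterleave_lowuint32(w)` and `deinterleave_lowuint32(w >> 1)`.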

Can you go faster? You might. On recent x64 processors, there are bit manipulation instructions ideally suited to the problem (`pdep` and `pext`). The interleaving code looks like this:

uint64_t interleave_pdep(uint32_2 input) {
    return _pdep_u64(input.x, 0x5555555555555555)
         | _pdep_u64(input.y, 0xaaaaaaaaaaaaaaaa);
}

The decoding is similar but uses the `pext` instruction instead.

Suppose that you have a bunch of data points: is it worth using the fancy x64 instructions?

Let us record how many cycles are needed to interleave a pair of 32-bit values:

| shifts | 3.6 cycles |
| pdep   | 2.1 cycles |
So, roughly speaking, using the specialized instructions nearly doubles the speed (2.1 versus 3.6 cycles). In some cases, it might be worth the effort.

The `pdep` function is probably optimal in the sense that `pdep` has a throughput of one instruction per cycle, and I need two `pdep` instructions to interleave a pair of values.

Deinterleaving takes about as long when using my implementation and the clang compiler. The GCC compiler seems to hate my deinterleaving code and produces very slow binaries.

Is this the best we can do? I suspect not. My guess is that with more careful engineering, we can go down to 1 cycle per interleave.

Your post makes it sound like PDEP and PEXT are available on all AMD64 CPUs, but that’s not the case; they’re in BMI2, which is an optional extension to AMD64. IIRC they appear in ≥ Haswell (2013) on the Intel side, and Excavator (2015) on AMD, so you need a pretty new CPU.

Reading this post, people may think to use the `__amd64` macro (or `_M_AMD64` for VS), but `__BMI2__` is the right one on GCC-like compilers (I don’t know whether VS has one). Run-time detection is a bit more complicated if you want to do it portably, but if all you care about is GCC 5.1+: `__builtin_cpu_init()` and `__builtin_cpu_supports("bmi2")`.

Everything you wrote is correct, I believe.

IIRC they’re also excruciatingly slow on Ryzen.

I see this. 18 cycles of latency?

Probably because AMD has figured out that these instructions only rarely occur in live binaries.