How fast can you bit-interleave 32-bit integers? (SIMD edition)

In a previous post, I asked how fast one could interleave the bits between two 32-bit integers. That is, given 0b1011 (11 in decimal) and 0b1100 (12 in decimal), I want the number 0b11011010.

On recent (2013) Intel processors, the answer is that you process one pair of integers every two cycles using the pdep instruction. Deinterleaving can be done just as quickly using the pext instruction.

There are a few downsides to relying on the pdep instruction:

  • It is only available on recent x64 processors.
  • It is very slow on even the most recent AMD processors.

What else can we do? We can vectorize the processing. Recent x64 processors support the AVX2 instruction set. Vector instructions tend to get better over time. For example, we can expect future processors to include wider instructions (AVX-512 on x64 processors and ARM SVE). So vectorization is a good long-term investment.

For the most part, the approach I have taken is derived from the scalar approach. That is, we start by interleaving each element of the pair with zeros…

uint64_t interleave_uint32_with_zeros(uint32_t input)  {
    uint64_t word = input;
    word = (word ^ (word << 16)) & 0x0000ffff0000ffff;
    word = (word ^ (word << 8 )) & 0x00ff00ff00ff00ff;
    word = (word ^ (word << 4 )) & 0x0f0f0f0f0f0f0f0f;
    word = (word ^ (word << 2 )) & 0x3333333333333333;
    word = (word ^ (word << 1 )) & 0x5555555555555555;
    return word;

And then we put the result back together, shifting one of the by one bit…

  | (interleave_uint32_with_zeros(y) << 1);

With vector instructions, we can skip part of the processing because it is easy to shuffle bytes (with the _mm256_shuffle_epi8 intrinsic). I need to credit Geoff Langdale because I was initially, stupidly, trying to shuffle the bytes with shifts as in the scalar code. Otherwise, the code is very similar, except that it relies on intrinsics.

__m256i interleave_uint8_with_zeros_avx(__m256i word) {
  const __m256i m3 = _mm256_set1_epi64x(0x0f0f0f0f0f0f0f0f);
  const __m256i m4 = _mm256_set1_epi64x(0x3333333333333333);
  const __m256i m5 = _mm256_set1_epi64x(0x5555555555555555);
  word = _mm256_xor_si256(word, _mm256_slli_epi16(word, 4));
  word = _mm256_and_si256(word, m3);
  word = _mm256_xor_si256(word, _mm256_slli_epi16(word, 2));
  word = _mm256_and_si256(word, m4);
  word = _mm256_xor_si256(word, _mm256_slli_epi16(word, 1));
  word = _mm256_and_si256(word, m5);
  return word;

void interleave_avx2(uint32_2 *input, uint64_t *out) {
  __m256i xy = _mm256_lddqu_si256((const __m256i *)input);
  __m256i justx = _mm256_shuffle_epi8(
      xy, _mm256_set_epi8(-1, 11, -1, 10, -1, 9, -1, 8,
            -1, 3, -1, 2, -1, 1, -1, 0, -1, 11, -1, 10,
            -1, 9, -1, 8, -1, 3, -1, 2, -1, 1,-1, 0));
  __m256i justy = _mm256_shuffle_epi8(
      xy, _mm256_set_epi8(-1 15, -1, 14, -1, 13, -1, 12, 
             -1, 7, -1, 6, -1, 5, -1, 4, -1, 15, -1, 14, -1, 
             -1, 13, -1, 12, -1, 7, -1, 6, -1, 5, -1, 4));
  justx = interleave_uint8_with_zeros_avx(justx);
  justy = interleave_uint8_with_zeros_avx(justy);
  __m256i answer = _mm256_or_si256(justx, 
            _mm256_slli_epi16(justy, 1));
  _mm256_storeu_si256((__m256i *)out, answer);

Is this the best you can do? Kendall Willets commented that you can use a look-up for the interleave. Let us put this to good use:

__m256i interleave_uint8_with_zeros_avx_lut(__m256i word) {
  const __m256i m = _mm256_set_epi8(85, 84, 81, 80, 69, 68,
               65, 64, 21, 20, 17, 16, 5, 4, 1, 0, 85, 84, 
               81, 80, 69, 68, 65, 64, 21, 20, 17, 16, 5, 
               4, 1, 0);
  __m256i lownibbles =
      _mm256_shuffle_epi8(m, _mm256_and_si256(word,
  __m256i highnibbles = _mm256_and_si256(word, 
   highnibbles = _mm256_srli_epi16(highnibbles,4);
   highnibbles = _mm256_shuffle_epi8(m, highnibbles);
   highnibbles =  _mm256_slli_epi16(highnibbles, 8);
  return _mm256_or_si256(lownibbles,highnibbles);

So how well does it do?

shifts 3.6 cycles
pdep 2.1 cycles
SIMD 2.2 cycles
SIMD (with lookup) 1.6 cycles

So, roughly speaking, the vector code is just as fast as the code using the pdep instruction on my Skylake processor. I have not tested it on an AMD processor, but it should run fine (although maybe not as efficiently).

You can improve slightly the performance (down to 2.1 cycles) at the expense of using about twice as much code if you use an idea Geoff offered: save the final shift by using two-interleaving-with-zeros functions.

When you use table lookups, the interleave the nibbles, then the vectorized code is clearly faster than the pdep instruction.

It could probably be ported to ARM NEON, but I am not sure it would be efficient on current ARM processors because they only have 128-bit vector registers. Moreover, ARM processors can do the XOR and the shift operation in a single instruction.

My source code is available.

(This post was updated with better results.)

Further reading: Geohash in Golang Assembly

Daniel Lemire, "How fast can you bit-interleave 32-bit integers? (SIMD edition)," in Daniel Lemire's blog, January 9, 2018.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

8 thoughts on “How fast can you bit-interleave 32-bit integers? (SIMD edition)”

  1. From here: *I am disappointed as I expected vector code to be about twice as fast.*

    From previous: *The GCC compiler seems to hate my deinterleaving code and produces very slow binaries.*

    I haven’t gone through to figure out the expected speeds, but if you think it should be twice as fast, it seems likely that it could be made twice as fast. Is there reason to believe that this isn’t another case of the compiler(s) creating suboptimal assembly for your otherwise fast algorithm?

    My quick guess would be that the compiler is reloading all the 256-bit constants on every call. Normally, this wouldn’t slow things down, but in a tightly tuned approach like yours it might be the limiting factor. Would 8 loads (and a possibly complex addressed store competing for the same address units) account for the observed speed?

    1. Kendall’s comment pushed me to update the post with a table lookup approach, based on nibbles. My vector code is now clearly faster than pdep.

      Possibly someone (maybe Kendall) can do even better.

  2. I think the zero-interleaving can be around 6 AVX instructions per four 32-bit int’s. The key is to shift/mask them into even and odd nybbles and look up their 8-bit expansions with vpshufb. After that it’s just matter of getting the 32 bytes into the right order.

  3. I’m late to the party, but I just want to point out, that on architectures where no pdep instruction is available, you can often abuse the galois field arithmetic instructions which made it into the instruction sets to speed up cryptography.

    on ARM NEON for example if you square a number using vmul_p8 you’re bit interleaving with zeroes. On Intel PCLMULQDQ will do the same trick.

  4. Very nice! I ended up here while I was trying to compile my code (which uses pdep) on ARM, unfortunately yep, that’s what RISC means I guess…
    Just a note on the tests: please please use struct timespec/clock_gettime(). If you are just as picky as me about these little things, it will make a “big” difference on precision and WHAT you are going to actually measure (e.g. thread-time only clocks) 😀

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.