Dynamic bit shuffle using AVX-512

Suppose that you want to reorder, arbitrarily, the bits in a 64-bit word. This question was raised on Twitter by @experquisite. Formally, you might want to provide, for each of the 64 bit position, an original bit position you want to copy.

Hence, the following code would reverse the bit order in your 64-bit word:

uint64_t w = some value;
uint8_t indexes[64] = {63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51,
                       50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
                       37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25,
                       24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12,
                       11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
bit_shuffle(w, indexes); // returns a reversed version 

A naive way to do it in C might be as follows:

uint64_t slow_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  uint64_t out{};
  for (size_t i = 0; i < 64; i++) {
    bool bit_set = w & (uint64_t(1) << indexes[i]);
    out |= (uint64_t(bit_set) << i);
  }
  return out;
}

This might be an acceptable implementation, but what if you want do it using few instructions? You can do it on recent Intel and AMD processors with support for AVX-512 instructions. You go from the general-purpose register to a mask register, to a 512-bit AVX-512 register, you apply a shuffle (vpermb), you go back to a mask register and finally back to a general-purpose register.

The code with Intel intrinsic functions looks as follows:

uint64_t bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  __mmask64 as_mask = _cvtu64_mask64(w);
  __m512i as_vec_register =
  _mm512_maskz_mov_epi8(as_mask, _mm512_set1_epi8(0xFF));
  __m512i as_vec_register_shuf =
  _mm512_permutexvar_epi8(_mm512_loadu_si512(indexes), as_vec_register);
  return _cvtmask64_u64(_mm512_movepi8_mask(as_vec_register_shuf));
}

It might compile to about six instructions:

kmovq k0, rdi
vpmovm2b zmm0, k0
vmovdqu8 zmm1, ZMMWORD PTR [rsi]
vpermb zmm0, zmm1, zmm0
vpmovb2m k1, zmm0
kmovq rax, k1

As one reader points out, you can do better because AVX-512 has a dedicated instruction for bit shuffling which directly returns a mask and works directly from the 64-bit word as long as it is loaded in a vector register:

uint64_t faster_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  __m512i as_vec_register = _mm512_set1_epi64(w);
  __mmask64 as_mask = _mm512_bitshuffle_epi64_mask(as_vec_register,
     _mm512_loadu_si512(indexes));
  return _cvtmask64_u64(as_mask);
}

The resulting assembly is quite short:

vpbroadcastq zmm0, rdi
vpshufbitqmb k0, zmm0, ZMMWORD PTR [rsi]
kmovq rax, k0

Loading your indexes is likely to have a long latency, so if you can buffer the load (_mm512_loadu_si512(indexes)), you will reduce significantly the latency.

I have an implementation in C++.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

15 thoughts on “Dynamic bit shuffle using AVX-512”

      1. Have you profiled bitshuffle? I remember trying to implement bit unpacking with it and it was MUCH slower than the AVX2 version of fastunpack.

        1. Keep in mind that this article is about arbitrarily shuffling one 64-bit integer.

          If you’re handling more than one integer, and the permutation isn’t so arbitrary, there may be faster approaches.

    1. rbit only reverses bits. It doesn’t do an arbitrary bit shuffle.

      AArch64’s NEON would actually do a decent job (better than SSE4), but the instruction sequence would be much longer than what is achievable with AVX-512.

      Having said that, an AVX2/NEON implementation would be interesting. Should be possible to rshift the indexes by 3, shuffle bytes into the right location, then use a TEST to amplify the relevant bits, then extract them.

      1. Speaking of AArch64—how does SVE(2) fare at this? Might you be able to cleverly exploit BGRP/BDEP/BEXT for this, or just do something similar to AVX-512 using the predicate registers?

      2. Ooo, wait, idea! Totally untested asm follows; should fit the same C signature as the rest:

        mov x2, #0
        mov d0, #0
        loop:
        whilelo p0.d, x2, #64
        b.none loop_end
        mov z1.d, p0/z, x0
        ld1d z2.d, p0/z, [x1, x2]
        lsr z1.d, p0/m, z2.d
        and z1.d, p0/m, #1
        lsl z1.d, p0/m, z2.d
        orv d1, p0, z1
        orr d0, d0, d1
        incd x2
        b loop
        loop_end:
        mov x0, d0
        ret

        1. Untested here as well, but my gut says that won’t work, because you’re shifting the bits back to their original position after singling them out. You could probably make it work if the lsl shifted the bits based on index (which needs to be incremented each loop cycle).

          More problematic is that you’re only processing one bit per 64, which doesn’t exactly scream fast (not to mention that horizontal reductions like orv typically aren’t highly performant either).

          The challenge with SVE2 is that the vector width isn’t fixed, whilst this is a fixed width problem (single 64-bit value).
          You could take a similar approach to an SSE/NEON implementation, using a TBL+shift then extracting bits via the predicate. For vector width >=512-bit, you could also use a similar approach to what’s described in the first post (skipping the need to shift), though you’d need to implement two code paths with this technique.

  1. I had a go at a rvv implementation, I’m not able to test it rn, but I think it’s roughly correct. Probably not optimal though:

    vsetivli x0, 1, e64, m1, ma, ta
    vle8.v v0, (a0) # load uint64_t
    vsetivli x0, 64, e16, m8, ma, ta
    vle8.v v8, (a1) # load uint16_t[64]
    vsetivli x0, 64, e8, m4, ma, ta
    vmv.v.i v4, 1
    vmv.v.i v4, 0, v0 # mask to 0/1 bytes
    vrgathere16.vv v4, v4, v8 # gather does the shuffle
    vmseq.vi v0, v4, 0 # 0/1 bytes to mask
    vsetivli x0, 1, e64, m1, ma, ta
    vse8.v v0, (a1) # store mask

      1. Edit: It should’ve been vmerge.vim instead of the second vmv.v.i, because that one isn’t maskable.

  2. I don’t know when, or why, but you just saved me hours upon hours of research in the future. Thanks in advance!

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.