Iterating over set bits quickly (SIMD edition)

Suppose that you have a long sequence of bits 10101011100000… you want to visit all the bits set to 1. That is, given 10101011100000, you would like to get the indexes of all the bits set to one: 0,2,4,6,7,8,9.

In a recent blog post, I reviewed fast techniques to iterate over the position of the bits set to 1 in such a bit stream. A fast function in C to solve the problem makes use of the trailing-zero instruction found in recent x64 processors and generated by the __builtin_ctzll intrinsic in several compilers such as LLVM’s clang and GNU gcc. The trailing-zero instruction gives us the number of consecutive zeroes present in a word starting from the least significant bit (so that odd integers have a number of trailing zeros equal to 0). The code is relatively elegant:

size_t bitmap_decode_ctz(uint64_t *bitmap, 
  size_t bitmapsize, 
  uint32_t *out) {
  size_t pos = 0;
  uint64_t bitset;
  for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    while (bitset != 0) {
      uint64_t t = bitset & -bitset;
      int r = __builtin_ctzll(bitset);
      out[pos++] = k * 64 + r;
      bitset ^= t;
  return pos;

In the comments, powturbo (it is a pseudonym) remarked that you could solve the same problem using SIMD instructions. SIMD instructions operate on vectors of words and are often faster. I have not looked at powturbo’s code, but I happen to have an implementation of my own, inherited from my friend Nathan Kurz. It basically processes bytes one by one, looking up the indexes in a table (8kB). Omitting the table, the code is just a bit longer than the trailing-zero version:

int bitmap_decode_avx2(uint64_t * array, 
  size_t sizeinwords, uint32_t *out) {
 uint32_t *initout = out;
 __m256i baseVec = _mm256_set1_epi32(-1);
 __m256i incVec = _mm256_set1_epi32(64);
 __m256i add8 = _mm256_set1_epi32(8);

 for (int i = 0; i < sizeinwords; ++i) {
  uint64_t w = array[i];
  if (w == 0) {
   baseVec = _mm256_add_epi32(baseVec, incVec);
  } else {
   for (int k = 0; k < 4; ++k) {
    uint8_t byteA = (uint8_t) w;
    uint8_t byteB = (uint8_t)(w >> 8);
    w >>= 16;
    __m256i vecA = _mm256_load_si256(vt[byteA]);
    __m256i vecB = _mm256_load_si256(vt[byteB]);
    uint8_t advanceA = lengthTable[byteA];
    uint8_t advanceB = lengthTable[byteB];
    vecA = _mm256_add_epi32(baseVec, vecA);
    baseVec = _mm256_add_epi32(baseVec, add8);
    vecB = _mm256_add_epi32(baseVec, vecB);
    baseVec = _mm256_add_epi32(baseVec, add8);
    _mm256_storeu_si256((__m256i *) out, vecA);
    out += advanceA;
    _mm256_storeu_si256((__m256i *) out, vecB);
    out += advanceB;
 return out - initout;

(At the expense of some performance, you can use a smaller 2kB table.)

As usual, it looks like gibberish if you do not know about Intel intrinsics, but trust me: this code does nothing complicated. It is basically just memoization.

So how does the vectorized version (using SIMD instructions) does against the trailing-zero version? Here are approximate timings on a Skylake processor:

0.0625~6.5 cycles per int~6 cycles per int
0.125~5 cycles per int~3 cycles per int
0.25~4 cycles per int~2 cycles per int
0.5~3 cycles per int~1.25 cycles per int
0.9~3 cycles per int~0.4 cycles per int

For dense bitstreams, my timings are accurate, but they grow more and more inaccurate for sparser bitstreams. Nevertheless, the pattern is clear: if you have dense bitstreams, the SIMD code is about twice as fast. The gains are higher when the bitmaps are very dense. However, the gains disappear when the bit stream is sparse…

I should point out that these timings are not directly comparable with those from my earlier blog post because I write the indexes to an array instead of calling a function for each index. Thus I am currently solving a slightly easier problem.

My code is available.

Note: The “compress” instructions in AVX-512 (e.g., vpcompressd) could make this problem easier. Sadly, AVX-512 instructions are not yet widely available on commodity processors.

9 thoughts on “Iterating over set bits quickly (SIMD edition)”

  1. congratulation! your lookup tables are quite large for AVX2.
    In TurboPFor:Integer Compression I’m using a using a 2k table instead of a 8k table and converting the mask with: _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(*(uint64_t *)(vecDecodeTable[byteA]))).

    I’m also using popcount instead of a length table for advanceA/B.

  2. Thanks for sharing I love these types of articles.

    I’m assuming the number of bitmap passes is large enough justfy to pulling the 8kB lookup table in cache for the use case?

    What would this be used for?

  3. An application.

    Assume you want to find the union of some index sets. Also assume the union should be sorted. This relevant when doing a socalled symbolic Cholesky factorization.

    Most likely you will have a bit vector that marks whether an element is in the union. If you have the bit vector it might be more efficient to generated the sorted union from that than doing a quicksort.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    Markdown is turned off in code blocks:
     [This is not a link](

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see