# Iterating over set bits quickly

A common problem in my line of work is to iterate over the set bits (bits having value 1) in a large array.

My standard approach involves a “count trailing zeroes” function. Given an integer, this function counts how many consecutive bits are zero, starting from the least significant bit. Any odd integer has no “trailing zero”; any even integer has at least one, and so forth. Many compilers, such as LLVM’s clang and GNU GCC, have an intrinsic called __builtin_ctzl for this purpose. There are equivalent standard functions in Java (numberOfTrailingZeros), Go and so forth. There is a whole Wikipedia page dedicated to these functions, but recent x64 processors have a fast dedicated instruction, so the implementation is taken care of at the processor level.

The following function will call the function “callback” with the index of each set bit:

```
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    while (bitset != 0) {
        uint64_t t = bitset & -bitset;
        int r = __builtin_ctzl(bitset);
        callback(k * 64 + r);
        bitset ^= t;
    }
}
```

The trick is that bitset & -bitset returns an integer having just the least significant bit of bitset turned on; all other bits are off. With this observation, you should be able to figure out why the routine works.

Note that your compiler can probably optimize bitset & -bitset to a single instruction on x64 processors. Java has an equivalent function called lowestOneBit.

If you are in a rush, that’s probably not how you’d program it. You would probably iterate through all bits, in this manner:

```
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    size_t p = k * 64;
    while (bitset != 0) {
        if (bitset & 0x1) {
            callback(p);
        }
        bitset >>= 1;
        p += 1;
    }
}
```

Which is faster?

Obviously, you have to make sure that your code compiles down to the fast x64 trailing-zero instruction. If you do, then the trailing-zero approach is much faster.

I designed a benchmark where the callback function just adds the indexes together. The speed per decoded index depends on the density (the fraction of set bits). I ran my benchmark on a Skylake processor:

| density | trailing-zero | naive |
|---------|---------------|-------|
| 0.125 | ~5 cycles per int | 40 cycles per int |
| 0.25 | ~3.5 cycles per int | 30 cycles per int |
| 0.5 | ~2.6 cycles per int | 23 cycles per int |

Thus using a fast trailing-zero function is about ten times faster.

Credit: This post was inspired by Wojciech Muła.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 16 thoughts on “Iterating over set bits quickly”

1. D.I. says:

```
uint64_t t = bitset & -bitset;
bitset ^= t;
```

could also be written as

```
bitset -= ((uint64_t)1) << r;
```

I should admit that while there are only two operations here, the performance on a particular platform/compiler may depend and is yet to be tested… 🙂

1. I think bitset ^= ((uint64_t)1) << r is better than -=, but otherwise, that’s how it is implemented in tantivy.

Just benched it against the x & -x trick. For a density of 1% over a bitset of 1M capacity, they are very close (around 15 ns per element on my laptop).

(bitset ^= 1u64 << r) was maybe a notch faster.

1. Travis Downs says:

Using the variable shift probably generates worse code in all cases on x86, although you might get a tie in benchmarking due to the overhead of callback() dominating. Before BMI (Haswell), variable shifts are somewhat slow (needing 3 uops), although they are better with BMI (shlx and friends) – but with BMI, the whole two lines:

```
uint64_t t = bitset & -bitset;
bitset ^= t;
```

end up as a single BLSR instruction on compilers that recognize it. It would have been simpler to write the two lines above as the single line:

```
bitset = bitset & (bitset - 1);
```

which is the usual idiom for “clearing the lowest bit” – although gcc was smart enough to compile both to the same form (and you don’t need the temporary and all that).

1. Paul Masurel says:

Thanks for the explanation!

I implemented both in Rust, and I am unsure whether the generated code ended up using a BLSR instruction. I’ll investigate what happened there.

The work done in my bench is small and inlinable so I am not too worried about the callback cost.

1. Paul Masurel says:

I had a look at the assembly code. Both versions compile to the same assembly. BLSR is indeed used.

https://godbolt.org/g/sT24rm

2. Depends on your use case. Some data sets tend to group whatever the bits represent, so long runs of bits are all the same.

So you may find that there are situations where comparing to 0 or 0xFFFFFFFF tells you right away that the next 64 bits are all the same.

Can’t wait for 128 bit processors, this will work even faster.

There are optimizations to be had everywhere.

3. Vitaly says:

Did you start blocking Russian IPs by any chance, or is it me moving to Russia from the Netherlands? I know for a fact LinkedIn is blocked in Russia, but I wonder why would anyone want to block this blog :/

1. I am not blocking anyone, but it is possible that my blog is blocked in some countries. My blog was blocked in China and South Korea some years ago.

I am unsure why anyone would block my blog. I am just a computer scientist.

1. Vitaly says:

My exact thoughts, just wanted to confirm. Thank you for your reply.

4. Roman Leventov says:

The difference is so big that it’s hard to believe these results. Did you test those routines functionally?

There is one other way to write this loop:

```
int r = 0;
while (bitset != 0) {
    int zeros = __builtin_ctzl(bitset);
    r += zeros;
    callback(k * 64 + r);
    r += 1;
    bitset >>= zeros + 1;
}
```

1. What happens in your code when zeros is 63? You could add a branch to check if bitset is equal to 1<<63. Since this will almost never happen, the branch is going to be free (and you only need one branch per word).

> The difference is so big that it’s hard to believe these results. Did you test those routines functionally?

The fast version (ctz) has been benchmarked thoroughly and is in production code.

The other version was proposed by Wojciech. It is my first time benchmarking it.

I offer my source code, it is quite short, so you should be able to check whether I made a mistake or whether my results are robust. I encourage re-examination of my results. Please prove me wrong!

This code is sensitive to branch prediction so your results will vary depending on the data. I used random data with uniform density. You should expect slower processing with realistic data having non-uniform density.

1. Roman Leventov says:

> What happens in your code when zeros is 63? You could add a branch to check if bitset is equal to 1<<63. Since this will almost never happen, the branch is going to be free (and you only need one branch per word).

Or, it could be split into two instructions: bitset >>= zeros; bitset >>= 1. I think the previous three operations, especially callback (if it’s something non-trivial), have enough data parallelism to perform this extra operation with the same overall throughput.

> The fast version (ctz) has been benchmarked thoroughly and is in production code.

Another option is that the inferior alternative works worse than it could.

2. Travis Downs says:

These results are not at all hard to believe and in fact you can ballpark from the results from first principles: consider that the naive approach will take a branch misprediction on at least every set bit, which bounds its performance from above at about 20 cycles per set bit. When the density gets lower it also wastes a lot of time examining every unset bit.

The ctz approach never examines unset bits and just jumps immediately to the next set bit, and only involves about 4 instructions per bit (plus the callback overhead). For really dense bitmaps you can get close to 1 cycle per int!

The main cost as density gets lower is branch prediction on the inner (bitmap != 0) loop: you’ll almost always mispredict the exit iteration and you can pretty much model the performance exactly as 1 mispredict divided by the average number of set bits in a 64-bit bitmap (until density gets so low that most bitmaps are zero: then you approach 1 mispredict per set bit).

You can play some tricks to avoid the mispredicts, such as conditional moves rather than nested loops – but it’s all highly dependent on the bit frequency and typical patterns. I settled on using various strategies, branching and not, depending on the frequency.

5. The fastest way is to use the expand instructions (e.g. _mm256_maskz_expand_epi32) on AVX-512, or a shuffle/permute table for SSE/AVX2, and process multiple (4/8/16) integers at a time with SIMD. This technique is used in the TurboPFor (https://github.com/powturbo/TurboPFor) package.

1. Thanks for your comment. I got so-so results in the past from similar vectorized code (some of which was written by Nathan Kurz), but I think I will make a blog post out of it. Crediting you of course, as needed.

6. Christopher Chang says:

I have hundreds of loops of the following form in my code:

```
uint32_t variant_uidx = 0;
// Iterate over all variant_ct set bits in variant_include,
// which is a bitmap of type const uintptr_t*.
for (uint32_t variant_idx = 0; variant_idx < variant_ct; ++variant_idx, ++variant_uidx) {
    if (!IsSet(variant_include, variant_uidx)) {
        variant_uidx = FindFirst1BitFrom(variant_include, variant_uidx);
    }
    // ...do stuff...
}
```

IsSet() and FindFirst1BitFrom() are currently defined on Linux and OS X as follows:

```
inline uintptr_t IsSet(const uintptr_t* bitarr, uintptr_t idx) {
    return (bitarr[idx / kBitsPerWord] >> (idx % kBitsPerWord)) & 1;
}
```


```
uintptr_t FindFirst1BitFrom(const uintptr_t* bitarr, uintptr_t loc) {
    const uintptr_t* bitarr_iter = &(bitarr[loc / kBitsPerWord]);
    uintptr_t ulii = (*bitarr_iter) >> (loc % kBitsPerWord);
    if (ulii) {
        return loc + __builtin_ctzl(ulii);
    }
    do {
        ulii = *(++bitarr_iter);
    } while (!ulii);
    return static_cast<uintptr_t>(bitarr_iter - bitarr) * kBitsPerWord + __builtin_ctzl(ulii);
}
```

I’ve been pretty happy with the performance of this idiom in both the common all-bits-are-1 case and the sparse case, but if you can find a better approach I’m all ears.