Even faster bitmap decoding

Bitmaps are a simple data structure used to represent sets of integers. For example, you can represent all sets of integers in [0,64) using a single 64-bit integer. When they are applicable, bitmaps are very efficient compared to the alternatives (e.g., a hash set).

Unfortunately, extracting the set bits from a bitmap can be expensive. Suppose I give you the integer 27 (written as 0b011011 in binary notation): you would want to recover the integers 0, 1, 3 and 4. Of course, you can check the value of each bit (by computing v & (1<<bit)), but this can be atrociously slow. To do it faster, you can use the fact that the least significant set bit can be located quickly:

```
int pos = 0;
for (int k = 0; k < bitmaps.length; ++k) {
    long data = bitmaps[k];
    while (data != 0) {
        int ntz = Long.numberOfTrailingZeros(data);
        output[pos++] = k * 64 + ntz;
        data ^= (1L << ntz);
    }
}
```
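For comparison, the naive per-bit check described above (the `Naive` column in the benchmark table below) can be sketched like so, assuming the same `bitmaps` and `output` arrays:

```java
public class NaiveDecode {
    // Check every bit position of every word: simple, but branchy and slow.
    public static int decode(long[] bitmaps, int[] output) {
        int pos = 0;
        for (int k = 0; k < bitmaps.length; ++k) {
            for (int bit = 0; bit < 64; ++bit) {
                if ((bitmaps[k] & (1L << bit)) != 0) {
                    output[pos++] = k * 64 + bit;
                }
            }
        }
        return pos; // number of integers decoded
    }
}
```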

In C or C++, the call to numberOfTrailingZeros can be mapped directly to assembly instructions (e.g., bit scan forward, or BSF). Though numberOfTrailingZeros looks like it is implemented using several branches in Java, the compiler is smart enough to compile it down to similar machine instructions. (Note: thanks to Erich Schubert for pointing out that Java is this smart.)

Up until recently, I did not think we could do better than relying on Long.numberOfTrailingZeros, but I stumbled on a blog post by Erling Ellingsen where he remarked that the function Long.bitCount (which reports the number of set bits in a word) is essentially branch-free. (As it turns out, Java also converts bitCount to efficient machine instructions, just as it does for numberOfTrailingZeros.) This suggests an alternative way to decode the set bits of a bitmap:

```
int pos = 0;
for (int k = 0; k < bitmaps.length; ++k) {
    long bitset = bitmaps[k];
    while (bitset != 0) {
        long t = bitset & -bitset;
        output[pos++] = k * 64 + Long.bitCount(t - 1);
        bitset ^= t;
    }
}
```
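To see why this works: `bitset & -bitset` isolates the lowest set bit as a power of two t, and `Long.bitCount(t - 1)` counts the ones in the mask of all bits below it, which is exactly that bit's index. Tracing the loop on the example value 27 from the introduction:

```java
public class BitCountDemo {
    public static void main(String[] args) {
        long bitset = 27L; // 0b011011
        while (bitset != 0) {
            long t = bitset & -bitset;                // lowest set bit: 1, 2, 8, 16
            System.out.println(Long.bitCount(t - 1)); // prints 0, 1, 3, 4
            bitset ^= t;                              // clear that bit
        }
    }
}
```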

Experimentally, I find that the approach based on Long.bitCount can be 10% slower when there are few bits set. However, it can also be significantly faster (e.g., 30% or even 70% faster) in some instances. Here are the decoding speeds in millions of integers per second on a recent Intel Core i7 processor (in Java):

| Density | Naive | Long.numberOfTrailingZeros | Long.bitCount |
|--------:|------:|---------------------------:|--------------:|
| 1/64    | 15    | 143                         | 131           |
| 2/64    | 27    | 196                         | 202           |
| 4/64    | 45    | 213                         | 252           |
| 8/64    | 68    | 261                         | 350           |
| 16/64   | 90    | 281                         | 431           |
| 32/64   | 115   | 285                         | 480           |

On the whole, the approach based on Long.bitCount is probably better unless you expect very sparse bitmaps.

Update: I have created a C++ version and in this case, both techniques have the same speed. So Java particularly loves bitCount for this problem.

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

13 thoughts on “Even faster bitmap decoding”

1. KWillets says:

For the lab, why not just use a table for the rightmost 8 bits or so? Something like

```
while (val) {
    lsb = table[val & 255];
    if (lsb < 9) {
        output lsb-1;
        val >>= lsb;
    } else
        val >>= 8;
}
```

Basically the table is 1+the position of the 1 in the byte, 9 for the 0x00 edge case (maybe some other test would work better for that). You only waste a loop 1/256th of the time.
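Filled in (and ported to Java to match the post's snippets), this byte-table idea might look like the sketch below. The names and the shift bookkeeping are mine; the table is initialized with numberOfTrailingZeros purely for convenience:

```java
public class ByteTableDecode {
    // table[b] = 1 + position of the lowest set bit in byte b; 9 for b == 0
    static final int[] TABLE = new int[256];
    static {
        TABLE[0] = 9;
        for (int b = 1; b < 256; ++b) {
            TABLE[b] = 1 + Integer.numberOfTrailingZeros(b);
        }
    }

    // Decode the set bits of one 64-bit word, looking at 8 bits at a time.
    static int decode(long val, int[] out) {
        int pos = 0, shift = 0;
        while (val != 0) {
            int lsb = TABLE[(int) (val & 255)];
            if (lsb < 9) {
                out[pos++] = shift + lsb - 1;
                val >>>= lsb;
                shift += lsb;
            } else { // the whole low byte is zero: skip it
                val >>>= 8;
                shift += 8;
            }
        }
        return pos;
    }
}
```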

2. @KWillets

I understand that you are trying to improve the identification of the least significant bit. This is equivalent to improving the implementation of Java’s Long.numberOfTrailingZeros. I think if you can do this, you could become semi-famous.

I am suspicious about your memoization approach because a table made of 256 integers is quite large. Accessing such a table is likely slow compared to basic integer operations.

3. KWillets says:

Most library implementations stay away from table-based approaches for obvious reasons: the tables are large, and the first call will stall while the cache lines are read in. However if you’ve determined that this function is going to be called heavily then some judicious use of tables or enumerative switch statements is called for.

The actual table size is tiny since the values are 0-7 or so (on second thought I would handle 0x00 in code and not as an extra table value). A 16-way table would fit into 32 bits.

Another (long known) method I’ve used in C is simply to convert to float and mask/shift the exponent to get the high bit. I’m not sure how far you’re willing to go with this (SSE?), but hacky solutions are many.
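For what it is worth, the float-exponent trick can be sketched as follows (in Java rather than C, to match the post; the method name is mine). It relies on the fact that a power of two converts to float exactly, so it is safe for an isolated bit such as `bitset & -bitset`; here it is restricted to positive int inputs:

```java
public class FloatExponentTrick {
    // Index of an isolated bit t (a power of two, 1 <= t <= 1<<30):
    // the biased exponent sits in bits 23..30 of the float encoding,
    // so extracting it and subtracting the bias (127) yields the bit position.
    static int bitIndexViaFloat(int t) {
        return (Float.floatToRawIntBits((float) t) >>> 23) - 127;
    }

    public static void main(String[] args) {
        System.out.println(bitIndexViaFloat(1));       // 0
        System.out.println(bitIndexViaFloat(1 << 4));  // 4
        System.out.println(bitIndexViaFloat(1 << 30)); // 30
    }
}
```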

4. @KWillets

Can you spell out your approach again for me? I just don’t understand what you have in mind. Note that we work with 64-bit words. I suppose you can take your 64-bit words and iterate over them as 8 x 8-bit words but I fear this will introduce quite a bit of expensive branching.

If you have a look at my code (see https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2013/12/23), you will find a table-based approach that was reported to be faster than the bitCount approach but, at least on my core i7, it is slower. It is a tiny table and the only catch is a 64-bit multiplication. Even so, it is slower than calling bitCount.

The smarter our processors get, the less useful memoization is. You are often better off computing cheap things from scratch rather than looking them up… at least as long as there is no expensive branching involved.

5. KWillets says:

This is the 4-bit-suffix version in C. The basic idea is to lookup the rightmost 4 bits in a table for the location of the least significant bit, then shift past it and iterate. At lower densities it will definitely suffer.

Also, forgot crucial variable in my earlier sketch :(.

```
void bits( int v, int *out, int *nout ) {
    // lowest-order 1 positions: 16 values, 2 bits each
    int lsbs = (0<<0) + (0<<2) + (1<<4) + (0<<6) + (2<<8) + (0<<10) + (1<<12) + (0<<14)
             + (3<<16) + (0<<18) + (1<<20) + (0<<22) + (2<<24) + (0<<26) + (1<<28) + (0<<30);

    int shift = 0;

    while (v) {
        int mask = v & 15;
        int lsb;
        if (mask != 0) {
            lsb = (lsbs >> (mask << 1)) & 3;
            out[*nout] = shift + lsb;
            *nout += 1;
        }
        else
            lsb = 3;

        v >>= (lsb + 1);
        shift += (lsb + 1);
    }
}
```

6. I totally agree with Daniel. In fact, reading even from the L1 cache takes a couple of CPU cycles. At the same time, you can probably execute a couple of bit-counting intrinsics in a single CPU cycle.

I posted another well-known memoization approach online, and its performance is even more pathetic: it is up to 8 times slower than a naive approach in C++.

BTW, with -march=native flag (again on core i7) the naive approach apparently outperforms everything else by a good margin (up to 30%):

https://github.com/searchivarius/BlogCode/tree/master/2013/12/StupidMemoryBitscan

7. PS: Sorry, it wasn’t the naive approach, it was bitscan1; I looked at the wrong column in the output.

8. KWillets says:

I agree that the intrinsic is definitely faster; it’s just that the last time I dealt with this I don’t think there were any consistent instructions to do this. So mine is just a suggestion for bare-bones processors these days.

9. @KWillets

I think that instructions such as x86’s BSR have been around for a long time (as far back as the Intel 386). Still, if you go that far back, BSR was probably slow, and the lack of smart superscalar execution and branch prediction might have made memoization worthwhile.

10. Jeff Plaisance says:

What jvm version did you use for your benchmark? I’m seeing that numberOfTrailingZeros is consistently faster than bitCount on java 6u33 which is surprising to me since bitCount is branchless.

11. @Plaisance

The benchmarks were with Java 7 on a recent Intel core i7 processor.

Here is my exact output:

```1 15.721772394850259 130.7076604166934 108.14306849412094 142.57966656658982 47.07466682521651 109.05387639509533
1 15.719865221400806 130.75023263556812 108.23983388274465 142.48395929541155 47.055448709307086 109.01093059260431
2 27.581771325981062 203.46921947232224 144.26321503767772 196.22725379809336 54.640605791918645 174.86125467743133
2 27.580898096812284 203.41295240307974 144.517990570322 196.2934226766008 54.61655047032696 174.79773330962325
4 46.18731765151476 278.1934945975009 178.2828904319236 210.16702578352476 62.038651431607924 264.8420813395451
4 46.18743343030088 278.1961657021847 178.40839167364427 210.06200874486856 62.01685890854171 265.122615285776
8 69.45215572862662 369.4220746581799 199.97836654758095 276.3127587215588 76.64864861923547 343.31817552020885
8 69.47323783413718 369.0569650016884 199.97733460115012 276.35278123012506 76.73822955437699 343.1901170217992
16 91.62453922232 455.65586979005445 212.64403795214915 341.16565104787577 98.03713889055986 408.3932414941178
16 91.6183671298949 455.4029909131834 212.6857679140175 340.7542524764352 98.00478438937044 408.22104217767463
```
12. Jeff Plaisance says:

Thanks, the problem was that I wrote my own benchmark and was only testing the lowest bit in each long. After fixing it to test all the bits, I’m getting consistent results on both Java 6 and Java 7 that agree with yours.

13. Claus Larsen says:

I realize that this post is a few years old, but I stumbled upon it in my quest for a faster nextSetBit(int fromIndex) on the Java BitSet.

Actually, my quest was for a faster way to do:

```for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) { ```

And I wanted to share my new way of doing this, which on a dense set is about 4-5 times faster than the for loop above and on sets with about 50% density about 3 times faster. Results may vary, I have only done some simple testing.

```
interface BitSetCallback {
    void nextSetBit(int i);
}

public void nextSetBitCallBack(BitSetCallback cb) {
    long[] words = this.words;
    int wordsInUse = words.length;
    int u = 0;
    while (u < wordsInUse) {
        long word = words[u];
        int indexCounter = u * 64; // restart the index at this word's base
        while (word != 0) {
            long idx = word & 1;
            while (idx == 0) { // skip zero bits
                word >>>= 1;
                ++indexCounter;
                idx = word & 1;
            }
            cb.nextSetBit(indexCounter);
            word >>>= 1;
            ++indexCounter;
        }
        ++u;
    }
}
```