Even faster bitmap decoding

Bitmaps are a simple data structure used to represent sets of integers. For example, you can represent all sets of integers in [0,64) using a single 64-bit integer. When they are applicable, bitmaps are very efficient compared to the alternatives (e.g., a hash set).

Unfortunately, extracting the bit sets in a bitmap can be expensive. Suppose I give you the integer 27 (written as 0b011011 in binary notation), you would want to recover the integers 0, 1, 3 and 4. Of course, you can check the value of each bit (by computing v & (1<<bit)) but this can be atrociously slow. To do it faster, you can use the fact that you can quickly find the least significant bit set:

int pos = 0;
for(int k = 0; k < bitmaps.length; ++k) {  
  long data = bitmaps[k];
  while (data != 0) {
    int ntz = Long.numberOfTrailingZeros(data);
    output[pos++] = k * 64 + ntz;
    data ^= (1l << ntz);

In C or C++, the call to numberOfTrailingZeros can be mapped directly some assembly instructions (e.g., bit scan forward or BSF). Though it looks like numberOfTrailingZeros is implemented using several branches in Java, the compiler is smart enough to compile it down similar machine instructions. (Note: thanks to Erich Schubert for pointing out that Java is so smart.)

Up until recently, I did not think we could do better than relying on Long.numberOfTrailingZeros, but I stumbled on a blog post by Erling Ellingsen where he remarked that the function Long.bitCount (reporting the number of bit sets in a word) was essentially branch-free. (As it turns out, Java also converts bitCount to efficient machine instructions like it does for numberOfTrailingZeros.) This suggests an alternative to decode the set bits from a bitmap:

int pos = 0;
for(int k = 0; k < bitmaps.length; ++k) {
   long bitset = bitmaps[k];
   while (bitset != 0) {
     long t = bitset & -bitset;
     output[pos++] = k * 64 +  Long.bitCount(t-1);
     bitset ^= t;

Experimentally, I find that the approach based on Long.bitCount can be 10% slower when there are few bit sets. However, it can also be significantly faster (e.g., 30% or even 70% faster) in some instances. Here are the decoding speeds in millions of integers per second on a recent Intel core i7 processor (in Java);

 Density  Naive  Long.numberOfTrailingZeros  Long.bitCount 

My code is available.

On the whole, the approach based on Long.bitCount is probably better unless you expect very sparse bitmaps.

Update: I have created a C++ version and in this case, both techniques have the same speed. So Java particularly loves bitCount for this problem.

12 thoughts on “Even faster bitmap decoding”

  1. For the lab, why not just use a table for the rightmost 8 bits or so? Something like

    while(val) {
    lsb = table[val&255];
    if( lsb < 9 ) {
    output lsb-1;
    val >>= lsb;
    } else val >>= 8;

    Basically the table is 1+the position of the 1 in the byte, 9 for the 0x00 edge case (maybe some other test would work better for that). You only waste a loop 1/256th of the time.

  2. @KWillets

    I understand that you are trying to improve the identification of the least significant bit. This is equivalent to improving the implementation of Java’s Long.numberOfTrailingZeros. I think if you can do this, you could become semi-famous.

    I am suspicious about your memoization approach because a table made of 256 integers is quite large. Accessing such a table is likely slow compared to basic integer operations.

  3. Most library implementations stay away from table-based approaches for obvious reasons: the tables are large, and the first call will stall while the cache lines are read in. However if you’ve determined that this function is going to be called heavily then some judicious use of tables or enumerative switch statements is called for.

    The actual table size is tiny since the values are 0-7 or so (on second thought I would handle 0x00 in code and not as an extra table value). A 16-way table would fit into 32 bits.

    Another (long known) method I’ve used in C is simply to convert to float and mask/shift the exponent to get the high bit. I’m not sure how far you’re willing to go with this (SSE?), but hacky solutions are many.

  4. @KWillets

    Can you spell out your approach again for me? I just don’t understand what you have in mind. Note that we work with 64-bit words. I suppose you can take your 64-bit words and iterate over them as 8 x 8-bit words but I fear this will introduce quite a bit of expensive branching.

    If you have a look at my code (see https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2013/12/23), you will find a table-based approach that was reported to be faster than the bitCount approach but, at least on my core i7, it is slower. It is a tiny table and the only catch is a 64-bit multiplication. Even so, it is slower than calling bitCount.

    The smarter our processors get, the less memoization is useful. You are often better off computing cheap things from scratch rather than look them up… at least as long as there is no expensive branching involved.

  5. This is the 4-bit-suffix version in C. The basic idea is to lookup the rightmost 4 bits in a table for the location of the least significant bit, then shift past it and iterate. At lower densities it will definitely suffer.

    Also, forgot crucial variable in my earlier sketch :(.

    void bits( int v, int *out, int *nout) {

    // lowest-order 1 positions: 16-values, 2 bits each
    int lsbs = (0<<0) + (0<<2) + (1<<4) + (0<<6) + (2<<8) + (0<<10) + (1<<12) + (0<<14)
    + (3<<16) + (0<<18) + (1<<20) + (0<<22) + (2<<24) + (0<<26) + (1<<28) + (0<<30)

    int shift = 0;

    while(v) {
    int mask = v & 15;
    int lsb;
    if( mask ) {
    lsb = (lsbs>>( mask<<1 )) & 3;
    out[*nout] = shift + lsb;
    *nout += 1;
    lsb = 3;

    v >>= (lsb + 1);
    shift += (lsb + 1);


  6. I totally agree with Daniel. In fact, reading even from L1 takes a couple of CPU cycles. At the same time, you can probably execute a couple of bit-counting intrinsics in a single CPU cycle.

    It posted another well-known memoization approach online. And its performance is even more pathetic. It is up to 8 times slower than a naive approach in C++.

    BTW, with -march=native flag (again on core i7) the naive approach apparently outperforms everything else by a good margin (up to 30%):


  7. I agree that the intrinsic is definitely faster; it’s just that the last time I dealt with this I don’t think there were any consistent instructions to do this. So mine is just a suggestion for bare-bones processors these days.

  8. @KWillets

    I think that instructions such as x86’s BSR have been around for a long time (as far back as the Intel 386). Still, it is likely that if you go that far back, BSR was probably slow and the lack of smart superscalar execution and branch prediction might have made memoization worth it.

  9. What jvm version did you use for your benchmark? I’m seeing that numberOfTrailingZeros is consistently faster than bitCount on java 6u33 which is surprising to me since bitCount is branchless.

  10. @Plaisance

    The benchmarks were with Java 7 on a recent Intel core i7 processor.

    Here is my exact output:

    1 15.721772394850259 130.7076604166934 108.14306849412094 142.57966656658982 47.07466682521651 109.05387639509533
    1 15.719865221400806 130.75023263556812 108.23983388274465 142.48395929541155 47.055448709307086 109.01093059260431
    2 27.581771325981062 203.46921947232224 144.26321503767772 196.22725379809336 54.640605791918645 174.86125467743133
    2 27.580898096812284 203.41295240307974 144.517990570322 196.2934226766008 54.61655047032696 174.79773330962325
    4 46.18731765151476 278.1934945975009 178.2828904319236 210.16702578352476 62.038651431607924 264.8420813395451
    4 46.18743343030088 278.1961657021847 178.40839167364427 210.06200874486856 62.01685890854171 265.122615285776
    8 69.45215572862662 369.4220746581799 199.97836654758095 276.3127587215588 76.64864861923547 343.31817552020885
    8 69.47323783413718 369.0569650016884 199.97733460115012 276.35278123012506 76.73822955437699 343.1901170217992
    16 91.62453922232 455.65586979005445 212.64403795214915 341.16565104787577 98.03713889055986 408.3932414941178
    16 91.6183671298949 455.4029909131834 212.6857679140175 340.7542524764352 98.00478438937044 408.22104217767463
  11. Thanks, problem was that I wrote my own benchmark and was only testing the lowest bit in each long, after fixing to test all the bits i’m getting consistent results on both java 6 and java 7 that agree with yours.

Leave a Reply

Your email address will not be published. Required fields are marked *