An old programming trick is to represent sets of integers as bitmaps. For example, the set of integers {1, 4, 6} becomes the number 0b1010010 in binary (or 82 in decimal): bits 1, 4 and 6 are set. Bitmaps are efficient data structures: they are compact and support fast set operations such as unions and intersections.

It is rather obvious how to encode integers in a bitmap: enumerate the integers and set the corresponding bit for each one. Here is a Java sample:

for (int k = 0; k < array.length; ++k) {
  bitmap[array[k] / 64] |= 1L << (array[k] % 64);
}

How would you get the integers back?

It is tempting to visit every bit, one by one, and create an integer whenever a set bit is found:

for (int k = 0; k < bitmap.length; ++k) {
  for (int c = 0; c < 64; ++c) {
    if (((1L << c) & bitmap[k]) != 0) {
      ans[counter++] = k * 64 + c;
    }
  }
}

Maybe for relatively dense bitmaps, that is, when there are many bits set to 1, this would be reasonably efficient.

However, we can do better. Most languages (including Java) have fast functions to compute the number of trailing zeros in an integer. For example, the number 82 (0b1010010 in binary) has a single trailing zero. Similarly, the number 16 has four trailing zeros because we can write it in binary as 0b10000.
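In Java, this function is `Long.numberOfTrailingZeros`, which compiles down to a single instruction on most processors. A quick demonstration of the examples above:

```java
public class TrailingZeroDemo {
    public static void main(String[] args) {
        // 82 is 0b1010010: a single trailing zero
        System.out.println(Long.numberOfTrailingZeros(82)); // prints 1
        // 16 is 0b10000: four trailing zeros
        System.out.println(Long.numberOfTrailingZeros(16)); // prints 4
        // by convention, an all-zero word has 64 trailing zeros
        System.out.println(Long.numberOfTrailingZeros(0)); // prints 64
    }
}
```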

Instead of checking every bit, we can repeatedly call a trailing-zero function:

for (int k = 0; k < bitmap.length; ++k) {
  long v = bitmap[k];
  while (v != 0) {
    int ntz = Long.numberOfTrailingZeros(v);
    ans[counter++] = k * 64 + ntz;
    v &= v - 1; // clear the lowest set bit
  }
}

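Putting the encoding and decoding snippets together, here is a minimal self-contained round trip (the sample values and array sizes are arbitrary, chosen only for illustration):

```java
import java.util.Arrays;

public class BitmapRoundTrip {
    public static void main(String[] args) {
        int[] input = {1, 4, 6, 63, 64, 200}; // sorted sample values
        long[] bitmap = new long[4];          // room for values in [0, 256)

        // encode: set one bit per integer
        for (int k = 0; k < input.length; ++k) {
            bitmap[input[k] / 64] |= 1L << (input[k] % 64);
        }

        // decode: repeatedly extract the lowest set bit of each word
        int[] ans = new int[input.length];
        int counter = 0;
        for (int k = 0; k < bitmap.length; ++k) {
            long v = bitmap[k];
            while (v != 0) {
                ans[counter++] = k * 64 + Long.numberOfTrailingZeros(v);
                v &= v - 1; // clear the lowest set bit
            }
        }
        // decoding recovers the integers in sorted order
        System.out.println(Arrays.equals(input, ans)); // prints true
    }
}
```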
How much faster is it? I wrote some Java code to determine the difference. On my MacBook Air (1.8 GHz Intel Core i7), the approach with trailing zeros beats the naive code by a factor of two or more, for all reasonable densities. Only when all bits are set to 1 is the naive approach better.

You can see the speed (in millions of integers per second) in this figure:

The x-axis (labelled as sparsity) represents the average number of zero-bits between any two one-bits. I used a non-uniform distribution to get a more realistic benchmark. See my code for details.

Conclusion: Fast functions to compute trailing zeros are your friends when working with bitmaps.

Update: You can do even better.


  1. Hi Daniel,

    This is very interesting. Have you tried to use an array with precomputed integer offsets? One can precompute 256 or even 65536 combinations. The disadvantage, of course, is a necessity to fetch offsets from memory. This might be slow.

    Comment by Itman — 22/5/2012 @ 12:15

  2. @Itman

    I thought about memoization but I am not sure how to implement it efficiently.

    If I had coded it in C++, I could have tried a template approach, generating 256 different functions, and using a switch case to select the right one. This feels like a lot of jumping around in a large pool of RAM. I’m not optimistic.

    I suppose you can use memoization to try to accelerate the computation of the trailing zeros, but the gains might not be large (if there are gains at all).

    Comment by Daniel Lemire — 22/5/2012 @ 13:17

3. No, a switch case is out of the question. You can use a byte (or a two-byte number) as an offset in the array that contains a list of IDs. For the two-byte case, this is, indeed, a large memory pool (hence, caching issues are possible), but for single-byte offsets, the amount of extra memory is tiny.

    This might not be much faster, but we would still be interested to know what are possible gains/losses.

    Comment by Itman — 22/5/2012 @ 13:25

4. PS: A switch may be used anyway as part of the calculation process, but you don’t need 256 different functions, just 256 arrays.

    Comment by Itman — 22/5/2012 @ 13:27

  5. @Itman

    I’ve implemented memoization. The result is poor. Please see my implementation:

    Comment by Daniel Lemire — 22/5/2012 @ 16:22

  6. Interesting. On my laptop, BTW, naive encodings are the fastest, not fast32, fast64 versions. I will have to double check this (can’t do now). In addition, I see that the memoization is implemented through a 2-level addressing scheme. I would have put all the data in a single contiguous array (maybe better from a perspective of caching). I also wonder if results obtained for Java also hold for C++.

    Comment by Itman — 23/5/2012 @ 10:39

  7. @Itman

    On my laptop, BTW, naive encodings are the fastest, not fast32, fast64 versions.

    Really? Can you share the output you get?

    Please use the -server flag with java, otherwise numbers tend to be all over the place.

    I would have put all the data in a single contiguous array (maybe better from a perspective of caching).

    I’d be curious to see how you implement it.

    I also wonder if results obtained for Java also hold for C++.

    I agree. I might hack something together.

    Comment by Daniel Lemire — 23/5/2012 @ 11:31

8. I will re-run it with the -server option. Maybe that was causing a problem.

    >I’d be curious to see how you implement it.

In Java, it is a bit hard. In C++, you just use an array of structures.

    Comment by Itman — 23/5/2012 @ 11:42

  9. I have added extra info. Otherwise, it was hard to parse. Naive algorithms are the best, next goes memoization. I also suspect that memoization might be optimized:

0 to32: 16.384 32: 10.923 32Fast: 13.107 32Memo: 9.362 to64: 10.923 64: 10.923 64Fast: 8.192 64Memo: 9.362
    1 to32: 16.384 32: 4.369 32Fast: 10.923 32Memo: 5.958 to64: 9.362 64: 0.809 64Fast: 8.192 64Memo: 5.958
    2 to32: 16.384 32: 2.849 32Fast: 9.362 32Memo: 4.096 to64: 8.192 64: 2.114 64Fast: 7.282 64Memo: 3.855
    3 to32: 5.461 32: 1.82 32Fast: 8.192 32Memo: 2.849 to64: 7.282 64: 1.337 64Fast: 5.958 64Memo: 2.731
    4 to32: 13.107 32: 1.092 32Fast: 6.554 32Memo: 2.185 to64: 6.554 64: 0.728 64Fast: 5.041 64Memo: 1.928
    5 to32: 13.107 32: 0.607 32Fast: 4.681 32Memo: 1.598 to64: 5.461 64: 0.428 64Fast: 4.369 64Memo: 1.394
    6 to32: 7.282 32: 0.324 32Fast: 3.449 32Memo: 1.04 to64: 4.369 64: 0.224 64Fast: 3.121 64Memo: 0.923
    7 to32: 3.277 32: 0.168 32Fast: 2.427 32Memo: 0.655 to64: 2.849 64: 0.115 64Fast: 2.427 64Memo: 0.516
    8 to32: 1.725 32: 0.085 32Fast: 1.68 32Memo: 0.383 to64: 1.598 64: 0.058 64Fast: 1.82 64Memo: 0.298
    9 to32: 0.851 32: 0.043 32Fast: 1.024 32Memo: 0.21 to64: 0.95 64: 0.029 64Fast: 1.285 64Memo: 0.172

    Comment by Itman — 23/5/2012 @ 23:06

  10. @Itman

    My little program reports speed in millions of integers per second. So higher values are better.

    Do you still think our results are inconsistent?

    Comment by Daniel Lemire — 23/5/2012 @ 23:33

11. Uuuuups. Sorry. Yes, you are right. Memoization is faster than a naive approach (we know it from experience with our system). It also makes sense that the memory-less BitmapFast version is faster than memoization, because memory access is expensive, even in the case of an L1 hit.

    I still would not exclude that your implementation of memoization can be improved somewhat (will check someday).

    Comment by Itman — 23/5/2012 @ 23:42

  12. Misspelling
    without -> with our

    Comment by Itman — 23/5/2012 @ 23:42

  13. @Itman

    All the implementations I report on my blog can be improved. It is part of the fun:

    1. Come up with semi-surprising statement backed with actual code.

    2. Wait for people to fix your code, potentially proving that you are an idiot.

    I wish all of science would work this way. ;-)

    Comment by Daniel Lemire — 24/5/2012 @ 8:18

  14. @itman

    Regarding the question as to whether this result holds true for C++, the answer is yes.

    I’ve implemented a very similar benchmark using EWAH-compressed bitmaps with a flag allowing me to enable and disable the trailing-zero approach. Again, I see a factor of two gain with the trailing-zero approach.

    You can find the code on github:

    Comment by Daniel Lemire — 24/5/2012 @ 12:08

15. Cool, I will certainly have a look at it. Is it a two-fold gain over the naive counterpart or over the memoization approach? In Java, this difference is up to an order of magnitude.

    Comment by Itman — 24/5/2012 @ 13:00

  16. @Itman

    In C++, I did not try memoization.

    When I say a factor of two, that’s just a way of saying that it makes a big difference. I have not yet made a clean analysis, I just made a few quick tests to reassure myself that the C++ results were consistent with the Java results.

    I ought to revisit the problem later.

    Comment by Daniel Lemire — 24/5/2012 @ 15:55

  17. Would this be quicker…?

while (value) {
    LowestBit = LowestBitArray[value];
    printf("Value %d\n", LowestBit);
    value = value - LowestBit;
}

    LowestBitArray holds the value of the lowest bit e.g. LBA[1]=1 LBA[2]=2 LBA[3]=1 LBA[4]=4 LBA[5]=1 LBA[6]=2 etc.

    Once you have this value you subtract it from the initial value and repeat until you’ve eaten all the bits.

    Comment by SimonP — 5/7/2012 @ 11:46

  18. @SimonP

    Your array (LBA) has how many entries?

    If “value” is a 32-bit word, you need 1<<32 entries? That is a lot of memory. It is unlikely to reside on the CPU cache which means that your CPU will spend a great deal of time retrieving data from RAM… that's slow.

    Comment by Daniel Lemire — 5/7/2012 @ 12:39
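For readers following the table-lookup discussion above (comments 3, 4 and 9), the byte-at-a-time memoization idea might look roughly like the sketch below. The table layout and names are illustrative assumptions, not the exact code that was benchmarked:

```java
public class ByteTableDecode {
    // TABLE[b] lists the positions of the set bits in the byte value b
    static final int[][] TABLE = new int[256][];
    static {
        for (int b = 0; b < 256; ++b) {
            int[] pos = new int[Integer.bitCount(b)];
            int n = 0;
            for (int c = 0; c < 8; ++c) {
                if ((b & (1 << c)) != 0) pos[n++] = c;
            }
            TABLE[b] = pos;
        }
    }

    // decode a bitmap one byte at a time via the lookup table
    static int decode(long[] bitmap, int[] ans) {
        int counter = 0;
        for (int k = 0; k < bitmap.length; ++k) {
            long w = bitmap[k];
            for (int byteIndex = 0; byteIndex < 8; ++byteIndex) {
                int b = (int) ((w >>> (8 * byteIndex)) & 0xFF);
                for (int p : TABLE[b]) {
                    ans[counter++] = k * 64 + 8 * byteIndex + p;
                }
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        long[] bitmap = new long[4];
        int[] input = {1, 4, 6, 63, 64, 200};
        for (int v : input) bitmap[v / 64] |= 1L << (v % 64);
        int[] ans = new int[input.length];
        System.out.println(decode(bitmap, ans)); // prints 6
    }
}
```

The trade-off raised in the thread applies: the 256-entry table fits comfortably in L1 cache, but every byte processed costs a dependent memory load, which is why the branch-free trailing-zero approach can still win.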
