Bitmaps are a simple data structure used to represent sets of integers. For example, you can represent any set of integers in [0,64) using a single 64-bit integer. When they are applicable, bitmaps are very efficient compared to the alternatives (e.g., a hash set).
Unfortunately, extracting the set bits from a bitmap can be expensive. Suppose I give you the integer 27 (written as 0b011011 in binary notation): you would want to recover the integers 0, 1, 3 and 4. Of course, you can check the value of each bit in turn (by computing v & (1<<bit)), but this can be atrociously slow. To do it faster, you can use the fact that the least significant set bit can be found quickly:
```java
int pos = 0;
for (int k = 0; k < bitmaps.length; ++k) {
    long data = bitmaps[k];
    while (data != 0) {
        int ntz = Long.numberOfTrailingZeros(data);
        output[pos++] = k * 64 + ntz;
        data ^= (1L << ntz);
    }
}
```
In C or C++, the call to numberOfTrailingZeros maps directly to an assembly instruction (e.g., bit scan forward, or BSF). Though numberOfTrailingZeros looks like it is implemented using several branches in Java, the compiler is smart enough to compile it down to similar machine instructions. (Note: thanks to Erich Schubert for pointing out that Java is this smart.)
Up until recently, I did not think we could do better than relying on Long.numberOfTrailingZeros, but I stumbled on a blog post by Erling Ellingsen where he remarked that the function Long.bitCount (which reports the number of set bits in a word) is essentially branch-free. (As it turns out, Java also converts bitCount to efficient machine instructions, as it does for numberOfTrailingZeros.) This suggests an alternative way to decode the set bits from a bitmap:
```java
int pos = 0;
for (int k = 0; k < bitmaps.length; ++k) {
    long bitset = bitmaps[k];
    while (bitset != 0) {
        long t = bitset & -bitset;
        output[pos++] = k * 64 + Long.bitCount(t - 1);
        bitset ^= t;
    }
}
```
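To see why this works: bitset & -bitset isolates the lowest set bit as t, and t - 1 has ones in exactly the positions below it, so bitCount(t - 1) gives the bit's position. Here is a minimal, self-contained sketch of the single-word case (the class and method names are mine):

```java
import java.util.Arrays;

public class LowestBitDemo {
    // decode the set bits of a single 64-bit word using Long.bitCount
    static int[] decode(long bitset) {
        int[] out = new int[Long.bitCount(bitset)];
        int pos = 0;
        while (bitset != 0) {
            long t = bitset & -bitset;          // isolate the lowest set bit
            out[pos++] = Long.bitCount(t - 1);  // count the ones below it
            bitset ^= t;                        // clear that bit
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode(27L))); // [0, 1, 3, 4]
    }
}
```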
Experimentally, I find that the approach based on Long.bitCount can be 10% slower when there are few set bits. However, it can also be significantly faster (e.g., 30% or even 70% faster) in some instances. Here are the decoding speeds, in millions of integers per second, on a recent Intel Core i7 processor (in Java):
Density | Naive | Long.numberOfTrailingZeros | Long.bitCount
---|---|---|---
1/64 | 15 | 143 | 131
2/64 | 27 | 196 | 202
4/64 | 45 | 213 | 252
8/64 | 68 | 261 | 350
16/64 | 90 | 281 | 431
32/64 | 115 | 285 | 480
On the whole, the approach based on Long.bitCount is probably better unless you expect very sparse bitmaps.
Update: I have created a C++ version and, in this case, both techniques have the same speed. So it is Java in particular that loves bitCount for this problem.
For the lab, why not just use a table for the rightmost 8 bits or so? Something like
```c
while (val) {
    lsb = table[val & 255];
    if (lsb < 9) {
        output lsb - 1;   /* pseudocode: emit the bit position */
        val >>= lsb;
    } else {
        val >>= 8;
    }
}
```
Basically, each table entry is 1 + the position of the 1 in the byte, with 9 for the 0x00 edge case (maybe some other test would work better for that). You only waste a loop iteration 1/256th of the time.
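For concreteness, here is one way that byte-table idea might look in Java (my own sketch, not the commenter's code; it adds a shift offset to track positions across bytes, and builds the 256-entry table using numberOfTrailingZeros):

```java
public class ByteTableDecoder {
    // TABLE[b] = 1 + position of the lowest set bit in byte value b; 9 when b == 0
    static final int[] TABLE = new int[256];
    static {
        TABLE[0] = 9;
        for (int b = 1; b < 256; ++b) {
            TABLE[b] = 1 + Integer.numberOfTrailingZeros(b);
        }
    }

    // writes the positions of the set bits of val into output, returns the count
    static int decode(long val, int[] output) {
        int pos = 0, shift = 0;
        while (val != 0) {
            int lsb = TABLE[(int) (val & 255)];
            if (lsb < 9) {
                output[pos++] = shift + lsb - 1;
                val >>>= lsb;
                shift += lsb;
            } else {
                val >>>= 8; // the whole byte is zero, skip it
                shift += 8;
            }
        }
        return pos;
    }
}
```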
@KWillets
I understand that you are trying to improve the identification of the least significant bit. This is equivalent to improving the implementation of Java’s Long.numberOfTrailingZeros. I think if you can do this, you could become semi-famous.
I am suspicious about your memoization approach because a table made of 256 integers is quite large. Accessing such a table is likely slow compared to basic integer operations.
Most library implementations stay away from table-based approaches for obvious reasons: the tables are large, and the first call will stall while the cache lines are read in. However if you’ve determined that this function is going to be called heavily then some judicious use of tables or enumerative switch statements is called for.
The actual table size is tiny since the values are 0-7 or so (on second thought I would handle 0x00 in code and not as an extra table value). A 16-way table would fit into 32 bits.
Another (long-known) method I’ve used in C is simply to convert to float and mask/shift the exponent to get the high bit. I’m not sure how far you’re willing to go with this (SSE?), but hacky solutions are many.
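As a rough illustration of that float trick (my own sketch, written in Java for consistency with the post): it reads the exponent field of the float representation, and is only exact for values below 2^24, where int-to-float conversion does not round.

```java
public class FloatHighBit {
    // position of the highest set bit of v (requires v > 0 and v < 2^24,
    // so that the int-to-float conversion is exact)
    static int highBit(int v) {
        int bits = Float.floatToRawIntBits((float) v);
        return ((bits >>> 23) & 255) - 127; // unbiased exponent
    }

    public static void main(String[] args) {
        System.out.println(highBit(1));  // 0
        System.out.println(highBit(40)); // 5
    }
}
```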
@KWillets
Can you spell out your approach again for me? I just don’t understand what you have in mind. Note that we work with 64-bit words. I suppose you can take your 64-bit words and iterate over them as 8 x 8-bit words but I fear this will introduce quite a bit of expensive branching.
If you have a look at my code (see https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2013/12/23), you will find a table-based approach that was reported to be faster than the bitCount approach but, at least on my core i7, it is slower. It is a tiny table and the only catch is a 64-bit multiplication. Even so, it is slower than calling bitCount.
The smarter our processors get, the less useful memoization is. You are often better off computing cheap things from scratch than looking them up… at least as long as there is no expensive branching involved.
This is the 4-bit-suffix version in C. The basic idea is to look up the rightmost 4 bits in a table to get the location of the least significant bit, then shift past it and iterate. At lower densities it will definitely suffer.
Also, I forgot a crucial variable in my earlier sketch :(.
```c
void bits(int v, int *out, int *nout) {
    /* lowest set-bit positions for the 16 nibble values, 2 bits each */
    int lsbs = (0<<0)  + (0<<2)  + (1<<4)  + (0<<6)
             + (2<<8)  + (0<<10) + (1<<12) + (0<<14)
             + (3<<16) + (0<<18) + (1<<20) + (0<<22)
             + (2<<24) + (0<<26) + (1<<28) + (0<<30);
    int shift = 0;
    while (v) {
        int mask = v & 15;
        int lsb;
        if (mask) {
            lsb = (lsbs >> (mask << 1)) & 3;
            out[*nout] = shift + lsb;
            *nout += 1;
        } else {
            lsb = 3; /* whole nibble is zero: skip 4 bits */
        }
        v >>= (lsb + 1);
        shift += (lsb + 1);
    }
}
```
I totally agree with Daniel. In fact, reading even from L1 cache takes a couple of CPU cycles. At the same time, you can probably execute a couple of bit-counting intrinsics in a single CPU cycle.
I posted another well-known memoization approach online, and its performance is even more pathetic: it is up to 8 times slower than a naive approach in C++.
BTW, with -march=native flag (again on core i7) the naive approach apparently outperforms everything else by a good margin (up to 30%):
https://github.com/searchivarius/BlogCode/tree/master/2013/12/StupidMemoryBitscan
PS: Sorry, it wasn’t naive, it was bitscan1; I looked at the wrong column in the output.
I agree that the intrinsic is definitely faster; it’s just that, the last time I dealt with this, I don’t think there were any consistent instructions for it across processors. So mine is just a suggestion for bare-bones processors these days.
@KWillets
I think that instructions such as x86’s BSR have been around for a long time (as far back as the Intel 386). Still, it is likely that if you go that far back, BSR was probably slow and the lack of smart superscalar execution and branch prediction might have made memoization worth it.
What jvm version did you use for your benchmark? I’m seeing that numberOfTrailingZeros is consistently faster than bitCount on java 6u33 which is surprising to me since bitCount is branchless.
@Plaisance
The benchmarks were with Java 7 on a recent Intel core i7 processor.
Here is my exact output:
Thanks, the problem was that I wrote my own benchmark and was only testing the lowest bit in each long. After fixing it to test all the bits, I’m getting consistent results on both Java 6 and Java 7 that agree with yours.
I realize that this post is a few years old, but I stumbled upon it in my quest for a faster nextSetBit(int fromIndex) on the Java BitSet.
Actually, my quest was for a faster way to do:
for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
And I wanted to share my new way of doing this, which on a dense set is about 4-5 times faster than the for loop above, and on sets with about 50% density about 3 times faster. Results may vary; I have only done some simple testing.
```java
interface BitSetCallback {
    void nextSetBit(int i);
}

public void nextSetBitCallBack(BitSetCallback cb) {
    long[] words = this.words;
    int wordsInUse = words.length;
    int u = 0;
    while (u < wordsInUse) {
        long word = words[u];
        int indexCounter = u * 64; // bit index, rebased at each word so that
                                   // words ending in zeros do not skew the count
        while (word != 0) {
            while ((word & 1) == 0) {
                word >>>= 1;
                ++indexCounter;
            }
            cb.nextSetBit(indexCounter);
            word >>>= 1;
            ++indexCounter;
        }
        ++u;
    }
}
```
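For comparison, here is a self-contained variant of the same callback pattern (my own sketch) that uses Long.numberOfTrailingZeros from the post instead of shifting one bit at a time:

```java
import java.util.ArrayList;
import java.util.List;

public class CallbackDecode {
    interface BitSetCallback {
        void nextSetBit(int i);
    }

    // invoke the callback once per set bit, in increasing index order
    static void forEachSetBit(long[] words, BitSetCallback cb) {
        for (int u = 0; u < words.length; ++u) {
            long word = words[u];
            while (word != 0) {
                cb.nextSetBit(u * 64 + Long.numberOfTrailingZeros(word));
                word &= word - 1; // clear the lowest set bit
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> bits = new ArrayList<>();
        forEachSetBit(new long[]{27L, 1L}, bits::add);
        System.out.println(bits); // [0, 1, 3, 4, 64]
    }
}
```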