You can represent a list of distinct integers smaller than N using exactly N bits: if the integer i appears in your list, you set the ith bit to true. Bits for which there is no corresponding integer are set to false. For example, the integers 3, 4, 7 can be represented as 00011001. As another example, the integers 1, 2, 7 can be represented as 01100001.
Bitmaps are great for computing intersections and unions quickly. For example, to compute the union between 3, 4, 7 and 1, 2, 7, all I need to do is compute the bitwise OR between 00011001 and 01100001 (=01111001), which a computer can do in one CPU cycle. Similarly, the intersection can be computed as the bitwise AND between 00011001 and 01100001 (=00000001).
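Here is a minimal sketch of the idea in Java, using a single 64-bit word rather than the 8-bit examples above (the class and method names are mine, not benchmark code):

// Minimal sketch: sets of integers in [0, 64) stored as the bits of one long.
public class BitmapExample {
    static long toBitmap(int... values) {
        long bits = 0L;
        for (int v : values) {
            bits |= 1L << v; // set bit v
        }
        return bits;
    }

    public static void main(String[] args) {
        long a = toBitmap(3, 4, 7);
        long b = toBitmap(1, 2, 7);
        long union = a | b;        // {1, 2, 3, 4, 7}
        long intersection = a & b; // {7}
        System.out.println(Long.toBinaryString(union));
        System.out.println(Long.toBinaryString(intersection));
    }
}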
Though they do not necessarily make use of fancy SSE instructions on your desktop, bitmaps are nevertheless an example of vectorization. That is, we use the fact that the processor can process several bits with one instruction.
There are some downsides to the bitmap approach: you first have to construct the bitmaps and then you have to extract the set bits. Thankfully, there are fast algorithms to decode bitmaps.
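One common decoding strategy (not necessarily what any particular library does internally) is to peel off the lowest set bit of each word repeatedly; a rough sketch:

// Sketch: list the positions of the set bits of a 64-bit word by
// repeatedly peeling off the lowest set bit.
static java.util.List<Integer> setBits(long word) {
    java.util.List<Integer> positions = new java.util.ArrayList<>();
    while (word != 0) {
        positions.add(Long.numberOfTrailingZeros(word)); // position of lowest set bit
        word &= word - 1;                                // clear the lowest set bit
    }
    return positions;
}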
Nevertheless, we cannot expect bitmaps to always be faster. If most bits are set to false, then you are better off working over sets of sorted integers. So where is the threshold?
I decided to use the JavaEWAH library to test it out. This library is used, among other things, by Apache Hive to index queries over Hadoop. JavaEWAH uses compressed bitmaps (see Lemire et al. 2010 for details) instead of the simple bitmaps I just described, but the core idea remains the same. I have also added a simpler sparse bitmap implementation to this test.
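If memory serves, basic usage of JavaEWAH looks roughly like this (check the library's documentation for the exact API):

import com.googlecode.javaewah.EWAHCompressedBitmap;

// Rough usage sketch of JavaEWAH (not the benchmark code).
public class EwahExample {
    public static void main(String[] args) {
        EWAHCompressedBitmap a = EWAHCompressedBitmap.bitmapOf(3, 4, 7);
        EWAHCompressedBitmap b = EWAHCompressedBitmap.bitmapOf(1, 2, 7);
        EWAHCompressedBitmap intersection = a.and(b); // {7}
        EWAHCompressedBitmap union = a.or(b);         // {1, 2, 3, 4, 7}
        System.out.println(intersection);
        System.out.println(union);
    }
}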
I generated random numbers using the ClusterData model proposed by Vo Ngoc Anh and Alistair Moffat. It is a decent model for “real-world data”.
Consider the computation of the intersection between any two random sets of integers. The next figure gives the speed (in millions of integers per second) versus the density, measured as the number of integers divided by the range of values. The “naive” scheme refers to an intersection computed over sorted lists of integers (without bitmaps).
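For reference, the naive scheme is essentially a merge-style scan over two sorted arrays; a simplified sketch (not the exact benchmark code):

// Sketch: intersection of two sorted arrays of distinct integers
// by scanning both in lockstep.
static int[] intersect(int[] a, int[] b) {
    int[] buffer = new int[Math.min(a.length, b.length)];
    int i = 0, j = 0, k = 0;
    while (i < a.length && j < b.length) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            buffer[k++] = a[i];
            i++;
            j++;
        }
    }
    return java.util.Arrays.copyOf(buffer, k);
}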
I ran the test on a desktop Core i7 computer.
Conclusion: Unsurprisingly, the break-even density for JavaEWAH is about 1/32: if you have more than 1000 integers in the range [0,32000), then bitmaps might be faster than working over lists of integers. Of course, better speed is possible with some optimization, and your data may differ from my synthetic data, but we have a ballpark estimate. A simpler sparse bitmap implementation can be useful over sparser data, though it comes at a cost: its best speed is lower than EWAH's.
Source code: As usual, I provide full source code so that you can reproduce my results.
Update: This post was updated on Oct. 26th 2012.
There are some downsides to the bitmap approach: you first have to construct the bitmaps and then you have to extract the set bits.
If you do a bunch of intersections, you end up with few non-zero bits/bytes. Hence, extraction may be quite cheap.
PS: Is JavaEWAH a sparse bitmap?
@Itman
1) Bitmap decoding should be cheap compared to the cost of computing an intersection, but it is still a significant overhead. Thankfully, you can reduce this overhead with fast bitmap decoding algorithms (see http://lemire.me/blog/archives/2012/05/21/fast-bitmap-decoding/).
2) Yes, JavaEWAH supports sparse bitmaps. There is also a C++ counterpart (https://github.com/lemire/EWAHBoolArray) but JavaEWAH is more popular.
A good question to ask is how few bits we have to examine to construct the intersection of 2 or more vectors, given any choice of encoding for the vectors (without pre-storing intersections, etc.).
I looked at this problem some years ago and decided that (delta and) Golomb-Rice coding could be rearranged into a sort of bucketed bitmap arrangement, where the range is broken into buckets of M consecutive ints, represented in a bitmap (corresponding to the unary parts of the coding), and an array of offsets within each occupied bucket, encoded in truncated binary as in the original.
For example, a sorted vector of ints in the range 0-127, with a Golomb-Rice bucket size of 8, would be broken into a 16-bit bitmap (buckets 0-7, 8-15, …, 120-127), and an array of offsets r < 8 within those buckets, in truncated binary (since a bucket can contain more than one value, the offsets also need a bit to indicate "consecutive in the same bucket" or some such).
Calculating the intersection is a two-step process: first ANDing the bitmaps, and then fetching and comparing offsets in any bucket where a 1 remains. Since the offsets may be slow to fetch, the idea is to use the bitmaps as much as possible.
The optimality of Golomb-Rice means the average bitmap will be about 50% sparse, and n-way intersections can get 2^-n sparsity before any bucket contents need to be examined.
I never decided on the best way to store the offsets, but this seemed like a good way to size the summary bitmaps.
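A rough sketch of the two-step intersection, with buckets of 64 values and plain sorted offset arrays standing in for the truncated-binary encoding (all names and layout choices are mine), might look like this:

import java.util.ArrayList;
import java.util.List;

// Sketch of the two-step idea: AND a per-bucket presence bitmap first,
// then compare offsets only in buckets where a 1 survives.
class BucketedSet {
    long bucketBits;  // bit k set if bucket k (values in [64k, 64k+64)) is occupied
    int[][] offsets;  // offsets[k] = sorted offsets within bucket k (null if empty)

    static List<Integer> intersect(BucketedSet a, BucketedSet b) {
        List<Integer> result = new ArrayList<>();
        long common = a.bucketBits & b.bucketBits; // step 1: cheap bitmap AND
        while (common != 0) {
            int k = Long.numberOfTrailingZeros(common);
            common &= common - 1;                  // clear the lowest set bit
            int[] x = a.offsets[k], y = b.offsets[k];
            int i = 0, j = 0;                      // step 2: merge the offset lists
            while (i < x.length && j < y.length) {
                if (x[i] < y[j]) i++;
                else if (x[i] > y[j]) j++;
                else { result.add(64 * k + x[i]); i++; j++; }
            }
        }
        return result;
    }
}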
You have basically just described Elias-Fano coding, which is provably optimal. See these slides for a nice intro: http://shonan.nii.ac.jp/seminar/029/wp-content/uploads/sites/12/2013/07/Sebastiano_Shonan.pdf
@KWillets
It is a reasonable approach. Have you tried implementing it?
I just love the way you turn simple problems into fun analysis!
No, I did some fiddling with it, but my employer at the time wasn’t interested, and I didn’t have the time to flesh it out fully. It might be fun to code up a simple benchmark.
@KWillets
Get in touch with me if you do.
It might be worth a shot.
It’s also easier in Rice coding, because the offsets/remainders are fixed width.
Clearing the sparse bitmap can take a while and that’s a pain if it’s a frequent operation. Russ Cox gives a nice explanation of an old technique for using uninitialised memory for a sparse set that allows a fast clear with slightly higher cost to the set and test operations. http://research.swtch.com/sparse
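A minimal Java sketch of that sparse-set technique (Java zeroes its arrays, so the uninitialised-memory trick is lost, but the O(1) clear carries over; the class is my own illustration):

// Two arrays plus a counter; clear() is O(1) because stale entries
// in `sparse` are never validated against `dense`.
class SparseSet {
    private final int[] dense;   // dense[0..size) holds the members
    private final int[] sparse;  // sparse[v] = candidate index of v in dense
    private int size = 0;

    SparseSet(int universe) {
        dense = new int[universe];
        sparse = new int[universe];
    }

    boolean contains(int v) {
        int i = sparse[v];
        return i < size && dense[i] == v;
    }

    void add(int v) {
        if (!contains(v)) {
            dense[size] = v;
            sparse[v] = size;
            size++;
        }
    }

    void clear() {
        size = 0; // no need to touch the arrays
    }
}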
I read somewhere that the most efficient data structure to represent a deck of playing cards is actually a bitmap, perhaps organised as 4*16-bit numbers representing the suits. Then it becomes much simpler/faster to detect various hands using bitwise operations.
This is most useful in AI where a lot of eventualities need to be computed quickly.
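For instance (a toy sketch of my own, packing each suit into 16 bits of a long), a flush check reduces to a population count per suit:

// Toy sketch: a hand as four 13-bit rank masks packed into one long
// (16 bits reserved per suit); a flush is any suit holding five or more cards.
static boolean hasFlush(long hand) {
    for (int suit = 0; suit < 4; suit++) {
        long suitMask = (hand >>> (16 * suit)) & 0x1FFFL; // 13 ranks
        if (Long.bitCount(suitMask) >= 5) {
            return true;
        }
    }
    return false;
}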
Any idea how EWAH compares in speed or size to “COmpressed ‘N’ Composable Integer SET”?
see: http://ricerca.mat.uniroma3.it/users/colanton/concise.html
I am trying to find a heuristic that would let a database engine decide which implementation to use.
@Maxime
I have written a benchmark to compare the various bitmap libraries, including Concise.
Please go to
https://github.com/lemire/simplebitmapbenchmark
If you are interested in the results I get on my machine, you can browse:
https://github.com/lemire/simplebitmapbenchmark/blob/master/RESULTS
As usual, there are trade-offs.
If you want to run the test yourself, grab a copy and execute the bash script “run.sh” (or the Windows equivalent).
If you want to hack the code or add your own benchmarks, I’ll be very interested in any contribution.