The mythical bitmap index

A bitmap index is a popular data structure to speed up the retrieval of matching rows in a table. (It is also used in Information Retrieval, to retrieve matching words.)

Building a bitmap index is not hard. You match each possible value with a vector of bits. When the value occurs, you insert the value “true” in the vector; otherwise you insert the value “false”. Hence, you get a set of vectors containing binary values. To find out where a particular value occurs, you just load up the corresponding vector of bits. (You still need a B-tree or a hash table to map values to the corresponding vectors.) In practice, programming with (compressed) vectors of bits leads to much faster software than working with sets of row IDs.
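
For illustration, here is a minimal C++ sketch of this construction (a toy example of my own, not production code and not any particular library’s API):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // One vector of bits per distinct value: bit i of the vector for value v
    // is true exactly when row i of the column holds v.
    std::map<std::string, std::vector<bool>> build_bitmap_index(
        const std::vector<std::string>& column) {
      std::map<std::string, std::vector<bool>> index;
      for (std::size_t row = 0; row < column.size(); ++row) {
        std::vector<bool>& bits = index[column[row]];
        bits.resize(column.size(), false);  // one bit per row, initially false
        bits[row] = true;
      }
      return index;
    }

Here std::map plays the role of the B-tree or hash table mapping each value to its vector.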

The size of such a matrix is the number of distinct values times the number of rows. Hence, some people conclude that bitmap indexes are mostly good for indexing data where we have few distinct values. In fact, the first sentence of the bitmap index article on Wikipedia tells us that bitmap indexes are meant to index data such as “gender”, where only two values are possible.

Of course, this reasoning is wrong. Why?

Because bitmap indexes (the big matrix of zeroes and ones) are never stored uncompressed. You always manipulate them in compressed form.

How big is the compressed matrix? Let us see. With run-length encoding (RLE), the simplest compression algorithm I know, each vector of bits has a compressed size at most proportional to the number of occurrences of the corresponding value. If you sum it up, you find that the compressed size of a bitmap index is at most proportional to the size of your table, irrespective of the number of distinct values!
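
To see why, consider a minimal run-length encoder (a naive sketch for illustration; real bitmap indexes use word-aligned variants such as BBC, WAH or EWAH):

    #include <cstdint>
    #include <vector>

    // Encode a vector of bits as alternating run lengths, starting with a
    // (possibly empty) run of zeros. A vector with k bits set to true has at
    // most 2k + 1 runs, so its compressed size grows with the number of
    // occurrences of the value, not with the number of rows.
    std::vector<uint64_t> run_length_encode(const std::vector<bool>& bits) {
      std::vector<uint64_t> runs;
      bool current = false;  // runs alternate between zeros and ones
      uint64_t length = 0;
      for (bool bit : bits) {
        if (bit == current) {
          ++length;
        } else {
          runs.push_back(length);
          current = bit;
          length = 1;
        }
      }
      runs.push_back(length);  // final run
      return runs;
    }

Each row sets exactly one bit across the whole index, so summing 2k + 1 over all values gives at most 2n + L runs, and L is at most n: the total is proportional to the number of rows.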

For a wide range of decision-support applications, bitmap indexes can match or surpass the performance of most indexes, irrespective of the number of distinct values.

You don’t believe me? Read these benchmarks using an Oracle database: “In summary, bitmap indexes are best suited for DSS regardless of cardinality (…)”.

With my very own bitmap index library, I have scaled up to hundreds of millions of attribute values and tables of several gigabytes without a problem. And my budget is not comparable to Oracle’s (I have one developer: me).

I am not saying that bitmap indexes are always the best solution. Of course not! But there is no published evidence that the number of distinct values is a practical criterion.

Then why does the bitmap index article on Wikipedia suggest otherwise? Because the myth is all over the blogosphere and message boards… and when I tried to fix the Wikipedia article, my changes got reverted. So, I post my rebuttal here. If you have practical evidence that bitmap indexes mostly work well when you have 2–3 attribute values, let us see it. Otherwise, help me kill this myth!


Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

31 thoughts on “The mythical bitmap index”

  1. Daniel, I’m glad you’re pushing this. If you check out the talk page for the entry (http://en.wikipedia.org/wiki/Talk:Bitmap_index), you’ll see that a couple of us are supporting your efforts.

    Fixing Wikipedia is a thankless but worthwhile task, and I’m glad to see researcher-bloggers taking it on. I’ve had a few success stories myself:

    http://thenoisychannel.blogspot.com/2008/06/your-input-really-is-relevant.html

    http://thenoisychannel.blogspot.com/2008/06/exploratory-search-is-relevant-too.html

  2. Hi Daniel, I am not sure why your edit got reverted (I can’t find the diff at a glance), but Wikipedia is pretty much open to referenced information. So, if information is properly referenced from several reliable sources, it gets to stay.

    I have added the info and a reference to the Sharma benchmark article; feel free to provide further links/references about this on the article’s talk page.

  3. If all the values of a field (indexed with a bitmap index) were unique, would the bitmap indices be extremely short? Would there need to be a tree structure to store the indices (or their locations)? How is the correct bitmap, for a field, found when there are large numbers of distinct values?

  4. Daniel,

    I’m installing the software today and don’t have the GNU seq command installed, so /testequalityqueries.sh failed.

    Let me suggest you replace `seq 0 10` with `0 1 2 3 4 5 6 7 8 9 10` in the two places seq is called in the script. This was then successful.

  5. If all the values of a field (indexed with a bitmap index) were unique, would the bitmap indices be extremely short? Would there need to be a tree structure to store the indices (or their locations)? How is the correct bitmap, for a field, found when there are large numbers of distinct values?

    As I point out in my post, you typically use a B-tree to get the location of the bitmap. You need some sort of B-tree-like data structure because compressed bitmaps have different sizes, so you can’t just store them in a fixed-length-record flat file.

    It is not very exciting to index a field where all values are distinct because, yes, the bitmaps would all be very short. Ah! But you rarely have a single dimension. Bitmap indexes are typically used in a DSS context where you “always” have numerous dimensions.
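
    To make the lookup concrete, here is a toy C++ sketch with std::map standing in for the B-tree (an illustration only; a real system would use an on-disk B-tree):

        #include <cstdint>
        #include <map>
        #include <string>
        #include <vector>

        // The compressed bitmaps have variable length, so we keep them behind
        // an ordered dictionary (std::map here, a B-tree on disk in practice)
        // rather than in a fixed-length-record flat file.
        using CompressedBitmap = std::vector<uint64_t>;  // e.g., run lengths

        const CompressedBitmap* lookup(
            const std::map<std::string, CompressedBitmap>& index,
            const std::string& value) {
          auto it = index.find(value);
          return it == index.end() ? nullptr : &it->second;
        }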

  6. @Otis: Wu is an expert indeed. Please see his 2006 paper “Optimizing bitmap indices with efficient compression”

    I quote their abstract:

    “In this paper, we present a new compression scheme called Word-Aligned Hybrid (WAH) code that makes compressed bitmap indices efficient even for high cardinality attributes.”

    (WAH is based on BBC which is used by Oracle.)

    1. Though I am a student, I can confirm that I have tested this assumption in Oracle 11g. Bitmap indexes are efficient for high cardinality and very fast for evaluating aggregate queries.

  7. Can we say that WAH is the best of all kinds of bitmap indices? How does Chan’s multi-component interval encoding method work compared to WAH?
    Will WAH be the overall winner? According to Wu’s paper, WAH compression works well for sparse bitmaps, i.e., large cardinality, and skewed data. Maybe some other schemes will beat WAH at lower cardinality.

  8. The general idea behind WAH is sound and that’s what people should be using. Not necessarily WAH itself, but some variant. For example, we decided to use EWAH. (Note that WAH itself is patented.)

    Chan’s interval encoding is a bit of a problem because it will generate an incompressible index. If you have L distinct attribute values in your column and n rows, you will end up with nL bits that are incompressible. This can be quite a pain: it will be slow to index, it will use a lot of storage, it will be hard to keep things in memory, and so on.

    I would not recommend interval coding. Rather, look at hierarchical coding with maybe 2-3 levels.

  9. Where can I find the description of EWAH? What is hierarchical coding? Where can I find it?

    Yes, the results of interval encoding and the like will be hard to compress further, but they are not supposed to be compressed. The n*L bits with L distinct values and n rows are not interval encoding; that is just the simple (basic) bitmap index. For interval encoding with m components, let m = log_b(L); it will produce m*(b/2)*n bits, far less than n*L bits.
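
    Spelling out that arithmetic (an illustrative calculation, with the numbers chosen by me):

        $$ m = \lceil \log_b L \rceil, \qquad \text{total size} \approx m \cdot \frac{b}{2} \cdot n \text{ bits.} $$

    For example, L = 10^6 distinct values with base b = 100 gives m = 3 components and about 3 * 50 * n = 150n bits.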

  10. Where can I find the description of EWAH?

    EWAH is not an important contribution, nevertheless… see this paper:

    Sorting improves word-aligned bitmap indexes

    See also my slides:

    http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them

    What is hierarchical coding? Where can I find it?

    R. R. Sinha, M. Winslett, Multi-resolution bitmap indexes for scientific data, ACM Trans. Database Syst. 32 (3) (2007).

    Yes, the results of interval encoding and the like will be hard to compress further, but they are not supposed to be compressed. The n*L bits with L distinct values and n rows are not interval encoding; that is just the simple (basic) bitmap index. For interval encoding with m components, let m = log_b(L); it will produce m*(b/2)*n bits, far less than n*L bits.

    With RLE-based compression (including BBC, WAH, EWAH), compression accelerates queries as well as reducing storage. The total storage will be O(n) with a small constant.

    Note that Chan and Ioannidis did not invent multicomponent indexing. It goes back to the seventies. What they showed is that if you don’t compress your bitmaps using RLE compression, then one level of multicomponent indexing is ideal. The problem with this analysis is that there is no reason not to compress the bitmap indexes with RLE.

    Then, they invented interval coding. The problem is that interval coding precludes compression!

    Yes, you can use multicomponent indexes… but you can use multicomponents with anything, not just interval coding. Is your factor m*b/2 small? It is actually m L^(1/m)/2. It is pretty hard to make it into a small constant.
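
    Plugging concrete numbers into this factor (my own arithmetic, for illustration), take L = 10^6:

        $$ \frac{m\,L^{1/m}}{2} = \begin{cases} 500000 & m = 1,\\ 1000 & m = 2,\\ 150 & m = 3. \end{cases} $$

    Even at m = 3 the index costs roughly 150n bits, while the raw column needs only about log_2(L) ≈ 20 bits per row.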

    Think about the fact that if the index is ten times the size of the original table, this may be prohibitive. Indexes are not supposed to be much larger than the original data set… Very large indexes tend to be slow in practice due to buffering problems.

    But read what Sinha and Winslett have written on this topic. Wu has also written on why interval coding is not a good idea.

  11. According to Wu’s paper, the WAH space complexity is 2N words for most datasets, and 4N words in the worst case. Note that it is 2N words, that is, 64N bits. My m*b/2*n is in bits! m and b are both very small, aren’t they? I can’t imagine m L^(m/2)/2, since L can be a very big number. Chan would not be so foolish as to present such a solution. 🙂

    Chan is far from foolish. That is not my point. Interval coding is a nice idea, but it fails for non-trivial reasons. You only realize it is a bad idea after playing with real-world implementations of the idea.

    The formula is something like m L^(1/m)/2, and not m L^(m/2)/2. To make m L^(1/m)/2 into a small constant, you need m to be rather large (say m=5).

    Eventually, you can make interval coding just as small as a regular BBC, WAH or EWAH bitmap index, but what will happen to your query performance? Remember that every time you use the multicomponent trick, your performance degrades… and it does so faster than you might think! To see why, try to implement it! You will see that it gets nasty, and you end up with rather complicated boolean functions.

    So you will need to load many incompressible bitmaps. Hence, for any range query, the amount of data you load is in Omega(n) because your bitmaps are incompressible. The only way this is acceptable is if your selectivity is always very low. But then, at that point, why not use a bit-sliced or projection index?

    There is absolutely no way you can get good practical performance, and nice fundamental properties, with incompressible bitmaps. Multicomponent indexing is not a good substitute for compression because it trades space for speed. You want better speed and less storage, both together! RLE compression does exactly that!

    BTW, what is your opinion of Hakan’s approximate bitmap index? I have developed this scheme into a 100% accurate one with zero false positives. Do you think this is promising?

    Hard to tell. Send me a draft and I’ll give you an opinion. But a good sanity test is to implement your scheme and see how fast it is!

    I’m guessing you have a scheme that allows random access? Is it random access in constant time? Then what is the constant?

  12. Thank you for your links to those materials…

    According to Wu’s paper, the WAH space complexity is 2N words for most datasets, and 4N words in the worst case. Note that it is 2N words, that is, 64N bits. My m*b/2*n is in bits! m and b are both very small, aren’t they? I can’t imagine m L^(m/2)/2, since L can be a very big number. Chan would not be so foolish as to present such a solution. 🙂

    BTW, what is your opinion of Hakan’s approximate bitmap index? I have developed this scheme into a 100% accurate one with zero false positives. Do you think this is promising?

  13. I didn’t mean that interval encoding outperforms WAH; I agree that WAH or EWAH could be the best scheme for bitmap indexes. I have read Wu’s ACM paper (38 pages) and it really makes sense.

    If you have a high-selectivity query, your effort should be small. But with interval coding, your effort is always proportional to the number of rows in the entire table. That is not good for high-selectivity queries. And interval coding does not even fare all that well for low-selectivity queries, where I prefer projection and bit-sliced indexes. For one thing, it is far simpler to implement a projection index than interval coding!

    As for the approximate bitmap (AB) index, I implemented it and improved it into an accurate scheme with a small space and time overhead. But experiments show that the AB scheme is hard to make practical because it only suits small query regions (narrow row and column selectivity). This technique essentially doesn’t work for bitmap indexes. The problem is the heavy iterative cost of building (row+col) keys for hashing, which is far more time-consuming than the CPU bitwise operations that WAH exploits.

    It is unfortunately easy to underestimate the CPU bottleneck when working on paper. Researchers often write that since we have 4 or more cores per CPU, CPU cycles are cheap. But it is not that simple, is it?

  14. Thanks, Daniel. 🙂
    I didn’t mean that interval encoding outperforms WAH; I agree that WAH or EWAH could be the best scheme for bitmap indexes. I have read Wu’s ACM paper (38 pages) and it really makes sense.

    As for the approximate bitmap (AB) index, I implemented it and improved it into an accurate scheme with a small space and time overhead. But experiments show that the AB scheme is hard to make practical because it only suits small query regions (narrow row and column selectivity). This technique essentially doesn’t work for bitmap indexes. The problem is the heavy iterative cost of building (row+col) keys for hashing, which is far more time-consuming than the CPU bitwise operations that WAH exploits.

  15. Maybe I’m not understanding bitmap indexes, but it seems to me that if you need 1 bit per possible value per record, it requires n^2 bits for n records when each record has a distinct value. The table itself requires only n*log(n) space to store the same values, so the bitmap index is bigger.

    Even if you only have two possible values, male and female, your table requires one bit per record, but your index requires two bits per record!

  16. Hi Daniel,

    Thank you very much for sharing your work on EWAH implementations. As far as I understand, bitmap indexes are expected to be ANDed, ORed, etc., many times over. However, your implementation creates a new bitmap as the result of each operation. That makes it necessary to scan the indexes at least as many times as there are logical predicates in a query. I was wondering whether this could be changed: instead of returning a new bitmap, you could return an EWAHIterator, and also accept an EWAHIterator as input. This would allow a single scan to be done for the whole chain of logical operations (a sketch of the idea follows below). Could you please comment on this?

    Regards,
    Kemal
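
    A sketch of the iterator-based evaluation Kemal describes (hypothetical code over plain sorted position lists for clarity; this is not the library’s actual API, where an iterator would walk compressed EWAH runs instead):

        #include <cstddef>
        #include <cstdint>
        #include <optional>
        #include <vector>

        // Lazily yields the positions where both inputs have a set bit, so a
        // chain of ANDs can be evaluated in one pass, without materializing
        // intermediate bitmaps.
        class AndIterator {
         public:
          AndIterator(const std::vector<uint64_t>& a,
                      const std::vector<uint64_t>& b)
              : a_(a), b_(b) {}

          // Next common position, or std::nullopt when the intersection is
          // exhausted.
          std::optional<uint64_t> next() {
            while (i_ < a_.size() && j_ < b_.size()) {
              if (a_[i_] < b_[j_]) {
                ++i_;
              } else if (b_[j_] < a_[i_]) {
                ++j_;
              } else {
                ++j_;
                return a_[i_++];  // common position
              }
            }
            return std::nullopt;
          }

         private:
          const std::vector<uint64_t>& a_;
          const std::vector<uint64_t>& b_;
          std::size_t i_ = 0, j_ = 0;
        };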

  17. Hi, I was trying to use your C++ implementation, but when I run make I get the following error:

    headers/ewahutil.h:21:23: error: conflicting declaration ‘typedef long unsigned int uint32_t’
    /usr/include/stdint.h:52:23: error: ‘uint32_t’ has a previous declaration as ‘typedef unsigned int uint32_t’
    make: *** [unit] Error 1

    Can you tell me how to fix it?

  18. Hi again. I am working with the C++ library, and I want to store the compressed bitmap’s hexadecimal representation, but I still haven’t figured out how to do it. I was trying to use the compressed word iterator, but since I am not used to working in C++, it is not very clear to me how to do it. I also tried the writeBuffer method, but it does not write anything to the file; I am not sure if I am using it correctly. Can you help me?
