Which should you pick: a bitmap index or a B-tree?

Morteza Zaker sent me a pointer to their work comparing bitmap indexes and B-trees in the Oracle database. They examine the folklore surrounding bitmap indexes—which are often thought to be useful mostly over low cardinality columns (columns having few distinct values, such as gender). Their conclusion is that the bitmap index is the clear choice for data warehouse design, whether the columns have high or low cardinality. Their claim is backed by experiments using columns with millions of distinct values. This confirms the observation made in my earlier post: The mythical bitmap index.

They make an interesting distinction between full cardinality columns, where each value appears only once, and high cardinality columns, where at least a few values are repeated (think of storing last names). A bitmap index is inadequate for a full cardinality column because no compression is possible, and this is probably how the folklore around bitmap indexes came about. Yet, unlike transaction processing, data warehousing is usually not concerned with full cardinality columns.
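To make the cardinality distinction concrete, here is a minimal Python sketch (my own illustration, not how Oracle implements it) of a bitmap index with run-length compression. One bitmap per distinct value; a full cardinality column produces one bitmap per row, each with a single set bit, so run-length encoding buys nothing.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap (list of 0/1), one bit per row."""
    index = {}
    for row, value in enumerate(column):
        index.setdefault(value, [0] * len(column))[row] = 1
    return index

def run_length_encode(bits):
    """Compress a bitmap as [bit, run_length] pairs; long runs compress well."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs

# Low cardinality: two bitmaps cover all eight rows.
gender = ["M", "F", "M", "M", "F", "F", "M", "F"]
idx = build_bitmap_index(gender)

# Full cardinality: eight bitmaps, each with exactly one set bit,
# so there are no long runs to exploit.
full_idx = build_bitmap_index(list(range(8)))

# Sorting the table first turns each bitmap into two runs
# (all zeros, then all ones), which is why sorting helps so much.
sorted_idx = build_bitmap_index(sorted(gender))
```

The last line hints at the connection to table sorting discussed in the papers linked below: a sorted low cardinality column yields bitmaps with very few runs.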

Reference: Zaker et al., An Adequate Design for Large Data Warehouse Systems: Bitmap Index Versus B-Tree Index, IJCC 2 (2), 2008.

Further reading: Trading compression for speed with vectorization, To improve your indexes: sort your tables! and my research paper Sorting improves word-aligned bitmap indexes.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Which should you pick: a bitmap index or a B-tree?”

  1. @Pagh I agree that your paper, which I read when it first came out on arxiv, is very interesting.

    Of course, we often use fixed-length counters in actual databases, since delta codes, such as the one you use, may require many CPU operations to read a single number. Trading space for fewer CPU operations is often worth it. So it is not entirely clear how your data structure would hold up on our current breed of CPUs.

    (This is not to be taken as a criticism of your work, which I found to be excellent.)

  2. This is an interesting question, both in practice and theory. I cannot resist pointing out a related paper by Srinivasa Rao and myself that appeared at PODS 2009:

    Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes

    It is an extension of previous work by Sinha and Winslett, and suggests a data structure that combines the best properties of bitmap indexes and B-trees, with particular focus on range queries. I would be very interested in seeing an engineering of these ideas, with experiments – currently we lack the resources to do this kind of thing ourselves…

  3. @Callaghan

    You are totally right. WAH is an update to BBC. Conceptually, it is the same idea. (But you could object that BBC is just an update on the classic run-length encoding techniques.) My own work is also merely an update, in this sense, except that I considered very carefully table sorting as a factor (http://arxiv.org/abs/0901.3751 ).

  4. @Pagh Oh! And if you are thinking about the paper by Zukowski et al. (with Peter Boncz) on Super-Scalar RAM-CPU Cache Compression, then no, their technique does not involve any form of run-length encoding. It is mostly dictionary coding, so it is probably not applicable to bitmaps.

  5. @Daniel I agree completely that the compression/decompression of bitmaps is likely to be the practical bottleneck of what we propose. We crucially rely on compression that is close to the best information-theoretic bounds. I know that Peter Boncz and co-authors have some extremely fast compression methods, but I never got around to seeing whether they compress 0-1 sequences optimally. Do you know whether people have tried to use multiple cores to do better compression?

  6. The folklore even exists in Oracle docs, although the Data Warehousing guide gets it right — http://download.oracle.com/docs/cd/E11882_01/server.112/e10810/indexes.htm#DWHSG006. By ‘right’ I mean that they state the important factor is the ratio of distinct keys to the number of rows.

    I worked on the bitmap code at Oracle and did my best to fix the folklore internally. But change is hard.

    Gennady Antoshenkov, the person who created BBC, did a remarkable job. I read a few of the WAH papers and my impression is that they were more of an update of BBC for modern computer architecture than something new. But maybe I am biased given my appreciation for the work of Antoshenkov.
