Understanding what makes database indexes work

Why do database indexes work?

In a previous post, I explained that only two factors make indexing possible:

  • your index expects specific queries
  • or you make specific assumptions about the data sets.

In other cases, you are better off just scanning the entire data set.

What makes database indexes work?

As far as I know, there are only 6 strategies that make indexes work. By combining them in different ways, we get all of the various existing schemes. (I would love to hear your feedback on this claim!)

1. You expect specific queries: restructure your data!

Suppose you know ahead of time that you will only need to select some of the elements in your data set. Then you can tailor an index to such queries and thus avoid scanning much of the content. For example, an inverted index in full-text search tells you which documents contain the various keywords. Instead of working with all documents, you only worry about the ones matching at least one keyword. Indexing a column with a B-tree or a hash table is another scenario where you jump directly to the relevant rows in a table.
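As an illustration, here is a minimal inverted-index sketch in Python. The function names are my own, not from any particular library; the point is that a query only touches the posting sets of its keywords rather than every document.

```python
# A minimal inverted-index sketch; names are illustrative, not from any library.
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def search_any(index, keywords):
    """Return the ids of documents matching at least one keyword."""
    result = set()
    for word in keywords:
        result |= index.get(word.lower(), set())
    return result

docs = ["the quick brown fox", "a lazy dog", "the dog barks"]
index = build_inverted_index(docs)
print(search_any(index, ["dog"]))  # {1, 2}: only two documents are ever considered
```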

Of course, if I look for all documents containing the words “the” and “will”, and want to know how many there are and what is their average length, such a form of indexing will not help.

2. You expect specific queries: materialize them!

Another commonly used strategy is view materialization. If 10% of all visitors on Google type in the word “sex”, Google might as well precompute the result of that query. In Business Intelligence, if you can expect your users to mostly care about results aggregated over weeks, months or years, it makes sense to precompute these values instead of always working from the raw data. Alternatively, you can materialize intermediate elements that are needed to compute your results. For example, even if people do not need data aggregated per day, precomputing daily aggregates might make the weekly numbers faster to compute.
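Here is a sketch, in Python, of the intermediate-materialization idea: daily totals are computed once (the data layout and names are my own, for illustration), and weekly queries are answered from this much smaller summary rather than from the raw rows.

```python
# A sketch of materializing intermediate aggregates; data and names are illustrative.
from collections import defaultdict
from datetime import date

sales = [(date(2008, 1, 1), 120.0), (date(2008, 1, 1), 80.0),
         (date(2008, 1, 2), 200.0), (date(2008, 1, 8), 50.0)]

# Materialize once: total sales per day.
daily_totals = defaultdict(float)
for day, amount in sales:
    daily_totals[day] += amount

def weekly_total(year, week):
    """Answer a weekly query from the (much smaller) daily materialization."""
    return sum(total for day, total in daily_totals.items()
               if day.isocalendar()[:2] == (year, week))

print(weekly_total(2008, 1))  # 400.0, computed from daily totals, not raw rows
```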

This form of indexing tends to work well to address the most popular queries, but it fails when people have more specific needs.

3. You expect specific queries: redundancy is (sometimes) your friend

When you do not know exactly which queries to expect, you can try to index the data in different ways, for different queries. For example, you could use both a B-tree and a hash table, and determine at query time which is the best evaluation strategy. You might even determine that the best way is to forgo the indexes and scan the raw data!
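A toy cost-based chooser might look like the following sketch; the cost constants are entirely made up for illustration, but the point stands: with redundant indexes, the decision is deferred to query time.

```python
# A toy cost model for picking among redundant access paths; the constants are invented.
def choose_access_path(predicate, selectivity, n_rows):
    """Pick the cheapest of: hash lookup, B-tree range scan, full scan."""
    full_scan = ("full scan", n_rows)
    if predicate == "equality":
        candidate = ("hash index", 1 + selectivity * n_rows)
    else:  # range predicate; penalize the random access of a B-tree traversal
        candidate = ("b-tree index", 50 + 4 * selectivity * n_rows)
    return min([candidate, full_scan], key=lambda c: c[1])[0]

print(choose_access_path("range", selectivity=0.6, n_rows=1_000_000))      # full scan
print(choose_access_path("equality", selectivity=1e-6, n_rows=1_000_000))  # hash index
```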

4. Use multiresolution!

Suppose that you look for specific images, but you may still need to scan 50% of them. An index that merely points you to the relevant images might not be effective. Instead, you should try to quickly discard the irrelevant candidates. What you could do is create thumbnails (low-resolution images). Then you can quickly dismiss the images that are obviously not a good match. Naturally, you can have progressively finer resolutions.

Database indexes often bin values together. For example, you could bin together all workers earning between $10,000 and $30,000, then all workers earning between $30,000 and $50,000, and so on. If you are looking for workers earning between $40,000 and $45,000, you can first find all workers that are in the $30,000–$50,000 bin, and then look up their actual salaries, one by one. You can adapt the bins either to the data distribution or to the types of queries you expect.
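A quick sketch of the binning idea, assuming rows of the form (worker id, salary); a full implementation would inspect every bin overlapping the query range, but here the range falls within a single bin.

```python
# A coarse "binned" index over salaries; the data and bin edges are illustrative.
import bisect

workers = [(1, 12_000), (2, 41_000), (3, 44_500), (4, 67_000), (5, 31_000)]
bin_edges = [10_000, 30_000, 50_000, 70_000]  # bins: [10k,30k), [30k,50k), [50k,70k)

bins = {i: [] for i in range(len(bin_edges) - 1)}
for row in workers:
    b = bisect.bisect_right(bin_edges, row[1]) - 1  # which bin the salary falls in
    bins[b].append(row)

# Query: salaries between $40,000 and $45,000; only the [30k,50k) bin is inspected.
candidates = bins[1]
print([w for w in candidates if 40_000 <= w[1] <= 45_000])  # [(2, 41000), (3, 44500)]
```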

For more examples, see my post How to speed up retrieval without any index?.

5. Your data is not random: compress it!

Most real-world data is highly compressible. By compressing the data, you make sure that your CPU and I/O subsystem process less data. However, you have to worry about bottlenecks. Too much compression may overload your CPU. Too little compression and most of the time will be spent loading the data from disk. Two techniques are often combined to get good results out of compression: sorting and run-length encoding.
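The following sketch shows why the two techniques go together: run-length encoding barely helps on an unsorted column, but collapses a sorted one to a handful of runs. The column data is made up for illustration.

```python
# Run-length encoding applied to a sorted versus an unsorted column (illustrative data).
from itertools import groupby

def rle_encode(values):
    """Turn a sequence into (value, run length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

column = ["QC", "ON", "QC", "ON", "BC", "QC", "ON", "BC"] * 1000
print(len(rle_encode(column)))          # 8000 runs: the unsorted column does not compress at all
print(len(rle_encode(sorted(column))))  # 3 runs: sorting makes RLE extremely effective
```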

6. In any case: optimize your code

You should be using cache-aware and CPU-aware indexes. Be aware that comparing two bits may take nearly as long as comparing two integers. Be aware that jumping all over the place (as in a B-tree) takes longer than streaming through the data in small sequential chunks.

JOLAP is dead, OLAP4J lives?

There is an equivalent to SQL for multidimensional databases (OLAP) called MDX. Alas, just as XML has DOM, OLAP needs a standard open API. The closest thing we ever got was JOLAP. However, I reported in 2005 that the Java OLAP Interface (JOLAP) was dead.

Julian Hyde then proposed olap4j. However, it seems that adoption of olap4j is difficult even among open source Business Intelligence projects like Eclipse BIRT.

Note: Yes, I am aware of the irony that my previous post was DUAT: Do not use acronyms in titles.

A no free lunch theorem for database indexes?

As a trained mathematician, I like to pull back and ask what are the fundamental limitations I face.

A common misconception in our Google era is that database performance is a technical matter easily fixed by throwing enough hardware at the problem. We apparently face no fundamental limitation. To a large extent, this statement is correct. Thanks to the B-tree and related data structures, we can search for most things very quickly. Roughly speaking, the database problems are solved, as long as you consider only a specific type of query: queries targeting only a small subset of your data.

What if you want to consider more general queries? Can you hope to find a magical database index that solves all your problems?

I spent part of my morning looking for a No Free Lunch (NFL) theorem for database indexes. I would like to propose one:

Any two database indexes are equivalent when their performance is averaged across all possible content and queries. (Naturally, it is ill-posed.)
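In symbols, writing T_X(D, q) for the time scheme X needs to answer query q over data set D, and glossing over the fact that the sums are not well defined, the claim is that for any two indexing schemes A and B:

```latex
\sum_{D} \sum_{q} T_A(D, q) \;=\; \sum_{D} \sum_{q} T_B(D, q)
```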

I draw your attention to how limited your interaction with tools that produce aggregates, such as Google Analytics, is. Basically, you navigate through precomputed data. You may not realize it, but the tool does not let you craft your own queries. Jim Gray taught us to push these techniques further with structures such as the data cube (which Wikipedia insists on calling an OLAP cube). However, precomputation is just a particular form of indexing. It helps tremendously when the queries take a particular form, but when you average over all possible queries, it is unhelpful.

What if you want to consider general queries, and still get fast results? Then you have to assume something about your data. I suggest you assume that your data is highly compressible. Run-length encoding has been shown to help database queries tremendously.

Short of these two types of assumptions (specific queries or specific data sets), the only way you can improve the current indexes is by constant factors—you can double the performance of existing B-tree indexes, maybe. Or else, you can throw in more CPUs, more disks, and more memory.

For now, I will stick with my puny computers, and I will assume that the data is highly compressible. It seems to work well in real life.

The problem with unidimensional research

Yesterday, I listened to some of the BDA’08 talks. One common issue I noticed is that most work is unidimensional. That is, researchers tell us “my technique is 4 times faster”. The trade-off, if any, is not presented. Here is what I would like to hear: “I use twice the RAM, but I am 4 times faster”.

This is a side-effect of our quest for positive results. People hide the trade-offs because the negative points make their papers less likely to be accepted. I believe this is the main reason why practitioners ignore researchers. Researchers paint a severely distorted view of the world.

Reviewers need to wise up and give extra points to researchers who describe the negative and positive points of a method. We need to stop believing that the one ideal technique is just around the corner.

In my experience as a researcher, there is no silver bullet. Impressive techniques often fall flat when viewed from a different angle. What appears deep and thoughtful may sound trivial a few years later once it is well understood. What appears childish may be the best option in most cases.

Paint a rosy picture of your work and I will distrust you. Paint the shadows and I start believing in you.

The mythical bitmap index

A bitmap index is a popular data structure to speed up the retrieval of matching rows in a table. (It is also used in Information Retrieval, to retrieve matching words.)

Building a bitmap index is not hard. You match each possible value with a vector of bits. When the value occurs in a row, you insert the value “true” into the vector; otherwise, you insert the value “false”. Hence, you get a set of vectors containing binary values. To find out where a particular value occurs, you just load up the corresponding vector of bits. (You still need a B-tree or a hash table to map values to their corresponding vectors.) In practice, programming with (compressed) vectors of bits leads to much faster software than working with sets of row IDs.
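Here is a bare-bones sketch of the construction in Python (uncompressed, with illustrative data); a real implementation would store the bit vectors in compressed form.

```python
# An uncompressed bitmap index over one column; real libraries compress the vectors.
from collections import defaultdict

def build_bitmap_index(column):
    n = len(column)
    bitmaps = defaultdict(lambda: [False] * n)
    for row_id, value in enumerate(column):
        bitmaps[value][row_id] = True  # one bit vector per distinct value
    return dict(bitmaps)

city = ["Montreal", "Toronto", "Montreal", "Vancouver", "Toronto"]
index = build_bitmap_index(city)
print(index["Montreal"])                                      # [True, False, True, False, False]
print([i for i, bit in enumerate(index["Montreal"]) if bit])  # matching rows: [0, 2]
```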

The size of such a matrix is the number of distinct values times the number of rows. Hence, some people conclude that bitmap indexes are mostly good for indexing data where we have few distinct values. In fact, the first sentence of the bitmap index article on Wikipedia tells us that bitmap indexes are meant to index data such as “gender”, where only two values are possible.

Of course, this reasoning is wrong. Why?

Because bitmap indexes (the big matrix of zeroes and ones) are never stored uncompressed. You always manipulate them in compressed form.

How big is the compressed matrix? Let us see. With run-length encoding (RLE), the simplest compression algorithm I know, each vector of bits has a compressed size at most proportional to the number of occurrences of the corresponding value. Summing over all values, the compressed size of a bitmap index is at most proportional to the size of your table! Irrespective of the number of distinct values!
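You can check the claim numerically with a small script like the one below (random data, illustrative only): counting runs per bit vector, the total across all distinct values stays within a small constant factor of the number of rows, whatever the number of distinct values.

```python
# Counting RLE runs across all bit vectors of a bitmap index (illustrative check).
import random
from itertools import groupby

def total_runs(column):
    """Sum the number of runs over the bit vector of every distinct value."""
    return sum(sum(1 for _ in groupby(v == x for x in column))
               for v in set(column))

n = 10_000
for distinct in (2, 100, 1_000):
    column = [random.randrange(distinct) for _ in range(n)]
    print(distinct, total_runs(column))
# The totals stay at roughly 2n or less, irrespective of the number of distinct values.
```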

For a wide range of decision-support applications, bitmap indexes can match or surpass the performance of most indexes, irrespective of the number of distinct values.

You don’t believe me? Read these benchmarks using an Oracle database: “In summary, bitmap indexes are best suited for DSS regardless of cardinality (…)”.

With my very own bitmap index library, I have scaled up to hundreds of millions of attribute values without a problem, and to tables holding several gigabytes of data. And my budget is not comparable to Oracle’s (I have one developer: me).

I am not saying that bitmap indexes are always the best solution. Of course not! But there is no published evidence that the number of distinct values is a practical criterion.

Then why does the bitmap index article on Wikipedia suggest otherwise? Because the myth is all over the blogosphere and message boards… and because when I tried to fix the Wikipedia article, my changes got reverted. So, I post my rebuttal here. If you have practical evidence that bitmap indexes mostly work well when you have 2–3 attribute values, let us see it. Otherwise, help me kill this myth!

Are you descriptive or predictive?

As Peter points out, nobody really knows what science is. Generally speaking, however, I like to distinguish two forms of science.

  • Predictive science aims to predict future events based on past observations. It relies on induction. Machine Learning is the embodiment of predictive science.
  • Descriptive science aims to describe the universe concisely. Astronomy and biology are descriptive sciences.

The difference between the two is probably a matter of philosophical debate. For example, I can say “the Earth is round” (a description) or “sailing across the sea, you will eventually come back to your starting point” (a prediction). However, the intent is quite different. Gardening and having kids have taught me that the real world is treacherous. I find it very interesting to describe my kids or my plants, but I am usually quite pessimistic when making predictions about them.

I believe this difference in intent is a fundamental issue in Computer Science. Descriptive people factor in the limitations of their own brain when doing science. They are not after the best system, but rather the best system that they can understand.

Let us play a game. A wizard comes to you and gives you a choice. You can be handed the laws of the universe as an algorithm, but in such a form that your brain will never be able to understand them. Or else, you can be given imperfect laws that you can hope to assimilate within your lifetime. Which do you pick? If you are a predictive person, you will prefer the perfect laws, at the cost of not understanding them; if you are a descriptive person, you will prefer the laws you can understand, even if they are imperfect.

Who should be buying expensive commercial database systems?

According to Curt Monash, few people should be buying high-end database management systems:

There are relatively few applications that wouldn’t run perfectly well on PostgreSQL or EnterpriseDB today. (…)

What’s more, these mid-range database management systems can have significant advantages over their high-end brethren. The biggest is often price, for licenses and maintenance alike. Beyond that, they can be much easier to administer than their more complex counterparts. (…)

And what these mid-range DBMS don’t do today, they likely will do soon. (…)

EnterpriseDB is equal or superior in every way I can think of to Oracle7, a few security certifications perhaps excepted.

If you work for an organization that has expensive DBMS contracts with Oracle or Microsoft, chances are that the money is being spent in vain.

Meanwhile, the world of open source Business Intelligence is getting more interesting every day. We now have Pentaho Mondrian, Jedox, Birt, Enhydra Octopus, and so on. In 2005, I asked whether open source was ready for Business Intelligence. The question seems less controversial in 2008, doesn’t it?

Most of the database industry has been commoditized. If you stick with the old schemes, you lose.

UC Berkeley holding tribute for Jim Gray

I wish I could realistically attend this. They are holding a tribute to Jim Gray, the famous database researcher. Jim has been lost at sea. We cannot conclude that he is dead, though it is becoming increasingly difficult to find an explanation for his disappearance. Mike Stonebraker, of PostgreSQL fame, will give a talk on “Why Jim Got the Turing Award.” Should be interesting.

I have written about Jim quite a bit here: Jim Gray missing at sea, What is infinite storage? , Science in an exponential world, That’s why I tinker, A “Measure of Transaction Processing” 20 Years Later, and so on.

Of all database researchers, Jim is the one who has had the biggest impact on my research and my teaching. Indeed, the cool thing about Jim is that he did not work on abstract nonsense. You can actually take his papers and give the gist of them to your students, and you will have helped them a lot.

IBM is buying Cognos

IBM buying Cognos? (See the press release.) I did not see this one coming. A few weeks ago, I heard rumors of a pan-Canadian research initiative supported by Cognos. It seemed to me that Cognos was here to stay for a very long time. Being bought by IBM, which has a lot of database technology of its own, will probably change Cognos drastically at the R&D level.

It seems that Business Intelligence (BI) is now quite mature and well integrated into the technology of the large vendors (IBM, Microsoft, Oracle).

Will there be a second wave of BI technology carried by new start-ups? I think so. And I think it will have something to do with infinite storage.

See my other posts Oracle buys Hyperion and Mondrian to partner up with Pentaho for Open Source Business Intelligence.

OLAP experts sought in Montreal

Some recruiters are looking for OLAP experts in Montreal. Knowledge of MDX and other Microsoft OLAP technologies required. Knowledge of French probably a must. Get in touch with me.

(No, I do not get a commission or any benefit whatsoever. I am just glad when there are jobs in an area where I do research.)

(Yes, my email address is somewhere on this site. No, it is not hard to find it.)

One More Step Toward Infinite Storage

According to a new IDC study reported in Wired, the world had 185 exabytes of storage available last year and will have 601 exabytes in 2010. Meanwhile, the amount of “digital information” generated will grow from 161 exabytes last year to 988 exabytes in 2010.

Their point is that we lack the storage capacity to store everything. This seems to go against the theory that we nearly have infinite storage. But I do not think it does. How can they tell how much storage is available? Do they include the NSA? Do they include Echelon? Do they include all of the secret agencies in the world storing massive quantities of data?

What might be a more interesting observation is that few people store everything as of now. And I do not expect that people will start storing everything soon. Storage costs must still come down a bit, and software must adapt. But in a few short years, everyone will store copies of everything. And managing all this data, whatever managing means, will become a big deal. And it will not be a nice database problem either because this data will not follow nice database schemas.

Oracle buys Hyperion

This might be a very big deal: Oracle just bought Hyperion for US$3.3 billion. The most obvious effect is that the Business Intelligence market now has fewer players than ever, and Oracle is now a very serious player indeed.

I wonder what this will mean for the defunct JOLAP. Could Oracle bring it back? It would be really nice to have one standard API for OLAP, even if it is limited to the Java programming language.

(Source: Owen)

Jim Gray missing at sea

According to MercuryNews.com, Jim Gray is missing at sea.

Well-known in Silicon Valley database software circles, Gray was awarded the prestigious A.M. Turing Award — the so-called “Nobel Prize of computer science” — in 1998 for his body of work, which helped pave the way for automatic-teller machines, computerized airline reservations and e-commerce.

Over the years, he moved from jobs at IBM, Tandem Computers and DEC, and, eventually, to Microsoft in 1995. And he became an avid sailor, often heading out alone.

(Source: Steven Keith)

IBM’s Many Eyes

IBM has an impressive Web 2.0 project called Many Eyes where you come in, upload your data and choose a visualization. The visualizations are based on Java applets and are pretty dynamic: you can drill down and search the data visually. What is impressive is that it is totally open. You can upload your data, and people can then work on it and discover new ways to look at it.

Many Eyes is a bet on the power of human visual intelligence to find patterns. Our goal is to “democratize” visualization and to enable a new social kind of data analysis.

This is human-driven collaborative data mining. I like it. I think this type of research has a bright future.

Imagine researchers making their data available and asking people to play with it.

Looking for an all-recording plug-in for Firefox

Dear reader,

I’m looking to record everything I ever browse on the Web using Firefox. That is right. I want a copy of every single document, Web page, query, and so on, that I ever encounter. I also want to record the content of every single form I submit. The result should be adequately protected so that it is not possible to access the data without access to my machine. It would be akin to the Wayback Machine, except that it would cover only the stuff I have seen. I have not yet decided whether it should record all the YouTube videos I watch, but I guess it should.

Ideally, it should be able to write the data to a portable disk.

Why do I want to do this? Why not? Actually, I would then want to use this data to build a fancy database that would support things like drill-down or roll-up queries (à la OLAP).

If you know how to build this, want to help, or know of such a plug-in, please drop me a line.

I’m also running a competition to decide what to call such a thing. I initially thought about calling it a webex, but it turns out that’s a trademark. Then I decided that Hammerspace might be a better name, or maybe Magic Satchel? What do you think?

What is infinite storage?

A colleague of mine, a Ph.D. in Physics, objected to my use of the term “infinite storage” in some lecture notes I posted on the Web.

I think that infinite storage is something that might be possible in my lifetime. What does “infinite storage” mean? Let us consider how much storage is required to achieve the mythical (digital) memex.

  • To record everything you read in a year requires 25MB.
  • To record everything you hear in a year requires 100GB.
  • To record everything you see in a year requires 10TB.

Hence, I argue that once I can buy a cheap 10TB disk at my local electronics store, I will have infinite storage. Currently, a portable 1TB drive can be had for $569.
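For a rough sense of scale, here is the back-of-the-envelope arithmetic; the 80-year lifespan and the price per terabyte are my own assumptions, purely for illustration.

```python
# Back-of-the-envelope arithmetic for the figures above (assumptions, not measured data).
read_per_year = 25e6    # 25 MB of reading per year
hear_per_year = 100e9   # 100 GB of audio per year
see_per_year  = 10e12   # 10 TB of video per year
years         = 80      # assumed lifespan
price_per_tb  = 569     # assumed, based on the quoted 1 TB portable drive

lifetime_bytes = years * (read_per_year + hear_per_year + see_per_year)
print(lifetime_bytes / 1e12)                 # roughly 808 TB for a lifetime
print(lifetime_bytes / 1e12 * price_per_tb)  # roughly $460,000 at today's prices
```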

What about recording everything I hear? Right now, I can buy a slick 100GB portable drive for $160.

Interesting question: how would I ever search through all these sounds?

And what about all these people who will get upset that I am recording them?

(Reference: Jim Gray, What Next? A Dozen Information-Technology Research Goals, Journal of the ACM, Vol. 50, No. 1, January 2003.)

DOLAP 2006 Preliminary Technical Program

The DOLAP 2006 preliminary technical program is out. Rokia Missaoui and Omar Boussaid have a paper in it called “Enhanced Mining of Association Rules from Data Cubes”. There is one paper on wavelets and range sum queries. At this point, we do not even have the abstracts, so I will wait before commenting further on the papers.

VLDB 2007 (14 March 2007 / 25-28 September 2007)

VLDB 2007 will be held in Vienna.

VLDB 2007 is a premier international forum for database researchers, vendors, practitioners, application developers, and users. We invite submissions reporting original results on all aspects of data management as well as proposals for panels, tutorials, and demonstrations that will present the most critical issues and views on practical leading-edge database technology, applications, and techniques. We also invite proposals for events and workshops that may take place at the conference site between September 23rd and 25th, before the VLDB 2007 conference.

Technique without theory or theory from technique? An examination of practical, philosophical, and foundational issues in data mining

Korukonda wrote a paper in the Journal of Human-Centred Systems (AI & Society) calling into question the (philosophical) foundations of Data Mining as a science. He puts it quite bluntly:

the current status of Data Mining as an intellectual discipline is tenuous.

This is of particular interest to me since I consider myself a Data Mining researcher. Unlike Korukonda, however, I consider that Data Mining can be equally user-driven (as in OLAP) or data-driven. I do not think that there is a well-established definition of Data Mining, except that we all agree it has to do with analyzing lots of data. The lack of a theoretical foundation for Data Mining is a well-known problem that was explicitly identified at IEEE Data Mining 2005 as one of the top 10 challenges facing the community.

Korukonda makes some good points. First of all, he reminds us that “Knowledge from Data” can be misleading. He gives the example of “idiot correlation”, where the wrong hypothesis is tested. He takes an example from the New York Times, where a reporter wrote that they noticed a 4.5% increase in sales at Red Lobster in March despite the war in Iraq. Why would this be interesting? It is well known that when working over large data sets, you will invariably find surprising relations between different variables, but these relations are not necessarily meaningful. The fact that they are well supported by the historical data is not sufficient to make them useful. Statistically, because the number of possible relations is exponentially large, some of them are always supported by the historical data. A review of the Data Mining literature would prove him right: many researchers design systems to blindly find relationships. This is one reason why I prefer user-driven Data Mining, as in OLAP, where meaningless relations are automatically discarded by the user.
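The statistical point is easy to demonstrate with a small simulation (illustrative, with made-up random data): given enough candidate variables, some will correlate strongly with a purely random target.

```python
# With many candidate variables, some correlate with pure noise by chance alone.
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(2008)
n_samples, n_variables = 20, 2000
target = [random.gauss(0, 1) for _ in range(n_samples)]

best = max(abs(pearson([random.gauss(0, 1) for _ in range(n_samples)], target))
           for _ in range(n_variables))
print(round(best, 2))  # almost always above 0.6: strong-looking, yet meaningless
```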

Nevertheless, he correctly points out that even though the process is flawed, it could still be a useful paradigm. When facing large data sets, you can either give up or try something. Data Mining is the best paradigm we have for these types of problems.

Being a nice guy, Korukonda gives us the way out:

This focus needs to shift to the “why” questions, if DM is to establish itself as a scientific investigative tool or as a long-term solution to business problems. In other words, the outcome of DM should extend beyond discovery of patterns to finding causal explanations for the observed patterns and relationships.

Meanwhile, he cites Taschek, who tells us that Data Mining is dead:

One reason data mining as a term failed was because data mining products did not work. Sure, the technology theoretically allowed companies to dig through historical data but it never lived up to its promise. (Taschek, eWeek, 2001)

I think that the Taschek quote must refer to data-driven Data Mining, because user-driven Data Mining is doing quite well, with a worldwide market of $6 billion in software products alone.

Maybe the real question is whether it is sensible to take the human out of the loop. I always think that as long as strong AI escapes us, we have to keep the human in there because that’s our only chance for intelligence. Yes, paying an analyst to look at your data is expensive, but you can lower the costs by giving him the best tools money can buy.