General versus domain intelligence

Our brains come with hard-wired algorithms. Cats can catch birds or mice without thinking about it. I can grab and eat a strawberry without thinking. The Savanna-IQ Interaction Hypothesis says that general intelligence may originally have evolved as a domain-specific adaptation to deal with evolutionarily novel, nonrecurrent problems. We can derive from this hypothesis that people with better general intelligence won’t be better at routine tasks. In fact, they may fare worse at them! They may only have an edge for novel tasks. Thus, general and domain intelligence may be somewhat separate entities.

How do you recognize people with better general intelligence? They are better at adapting to new settings. They are the first to adopt new strategies. But they may not be very good at baseball or boxing, and they may be socially inept.

Modern Artificial Intelligence (and Machine Learning) is typically domain-specific. My spam filter can detect spam, but it won’t ever do anything else. Our software has evolved to cope with specific problems. Yet, we still lack software with general intelligence. Trying to build better spam filters may be orthogonal to achieving general intelligence in software. In fact, software with good general intelligence may not do so well at spam filtering.

Reference: Satoshi Kanazawa, Kaja Perina, Why night owls are more intelligent, Personality and Individual Differences 47 (2009) 685–690

Further reading: Language, Cognition, and Evolution: Modularity versus Unity by Peter Turney

Summer reading: my recommendations (2010)

Containment by Christian Cantrell is an excellent sci-fi novel. And you can grab it nearly for free from the author’s page. The premise of the book is that humanity built a colony on Venus. Children are told that Earth cannot be reached. Massive research into economical oxygen production is required for long term survival. Indeed, plants cannot survive on the surface of Venus. Or can they? Couldn’t we design special plants that could survive? One of the young researchers sets out to answer the question. Unfortunately, he won’t like the answer. The plot may not be extraordinary, but there are many things to like for computer nerds. For example, the book is set in a future where we appear to have cheap quantum computing. Or, at least, some very fast computers. One of the consequences is that any sufficiently smart kid can break any encryption. Moreover, it is cheaper to simulate most physical experiments than to actually execute them.

The Atrocity Archives by Charles Stross is the first in an ongoing series of books. Stross was a software engineer, and it shows. His book reveals many secrets all Computer Scientists should know. For example, do you know why Knuth will never finish The Art of Computer Programming, no matter what he tells us? Here’s a quote:

The [Turing] theorem is a hack on discrete number theory that simultaneously disproves the Church-Turing hypothesis (wave if you understood that) and worse, permits NP-complete problems to be converted into P-complete ones. This has several consequences, starting with screwing over most cryptography algorithms—translation: all your bank account are belong to us—and ending with the ability to computationally generate a Dho-Nha geometry curve in real time.

This latter item is just slightly less dangerous than allowing nerds with laptops to wave a magic wand and turn them into hydrogen bombs at will. Because, you see, everything you know about the way this universe works is correct—except for the little problem that this isn’t the only universe we have to worry about. Information can leak between one universe and another. And in a vanishingly small number of other universes there are things that listen, and talk back—see Al-Hazred, Nietzsche, Lovecraft, Poe, et cetera. The many-angled ones, as they say, live at the bottom of the Mandelbrot set, except when a suitable incantation in the platonic realm of mathematics—computerised or otherwise—draws them forth. (And you thought running that fractal screensaver was good for your computer?)

The five most important algorithms?

Bernhard Koutschan posted a compilation of the most important algorithms. The goal is to determine the 5 most important algorithms. Out of his list, I would select the following five algorithms:

  • Binary search is the first non-trivial algorithm I remember learning.
  • The Fast Fourier transform (FFT) is an amazing algorithm. Combined with the Convolution theorem, it lets you do magic.
  • While hashing is not an algorithm, it is one of the most powerful and useful ideas in Computer Science. It takes minutes to explain, but years to master.
  • Merge sort is the most elegant sorting algorithm. You can explain it in three sentences to anyone (see the sketch after this list).
  • While not an algorithm per se, the Singular Value Decomposition (SVD) is the most important Linear Algebra concept I don’t remember learning as an undergraduate. (And yes, I went to a good school. And yes, I was an A student.) It can help you invert singular matrices and do other similar magic.
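As promised, here is merge sort as a minimal Java sketch (a throwaway illustration, not a tuned implementation): sort each half recursively, then merge the two sorted halves.

import java.util.Arrays;

public class MergeSortSketch {
    // Sort by recursively sorting each half, then merging the two sorted halves.
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;
        int[] left = mergeSort(Arrays.copyOfRange(a, 0, a.length / 2));
        int[] right = mergeSort(Arrays.copyOfRange(a, a.length / 2, a.length));
        return merge(left, right);
    }

    // Merge two sorted arrays by repeatedly taking the smaller head element.
    static int[] merge(int[] left, int[] right) {
        int[] out = new int[left.length + right.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length) out[k++] = left[i++];
        while (j < right.length) out[k++] = right[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mergeSort(new int[]{5, 3, 8, 1, 2}))); // [1, 2, 3, 5, 8]
    }
}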

NoSQL or NoJoin?

Several major players built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo, and Facebook initiated Cassandra. There are many other comparable open source initiatives such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because they are not centered around the SQL language. While there have always been database systems that are not based on SQL, the rising popularity of these alternatives in industry is drawing attention.

In The “NoSQL” Discussion has Nothing to Do With SQL, Stonebraker opposes the NoSQL trend in those terms:

(…) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.

In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language. Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.

However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to normalize their data. Normalization fragments databases into smaller tables which is great for data integrity and beneficial for some transactional systems. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.

In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to scale horizontally. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the join operation.

How do we compute joins? There are two main techniques:

  • When dealing with large tables, you may prefer the sort merge algorithm. Because it requires sorting tables, it runs in O(n log n). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)
  • For in-memory tables, hash joins are preferable because they run in linear time O(n). However, the characteristics of modern hardware are increasingly detrimental to the hash join alternative (see C. Kim et al., Sort vs. Hash Revisited, 2009).

(It is also possible to use bitmap indexes to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.
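To make the trade-off concrete, here is what an in-memory hash join looks like in Java (a toy sketch over hypothetical two-column tables of integers, not how any particular engine implements it):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class HashJoinSketch {
    // Join two tables on an integer key (column 0 of each row): build a hash table
    // on one table, then probe it while scanning the other. Expected time is linear.
    static List<int[]> hashJoin(int[][] build, int[][] probe) {
        HashMap<Integer, List<int[]>> index = new HashMap<>();
        for (int[] row : build) {
            List<int[]> bucket = index.get(row[0]);
            if (bucket == null) { bucket = new ArrayList<>(); index.put(row[0], bucket); }
            bucket.add(row);
        }
        List<int[]> result = new ArrayList<>();
        for (int[] row : probe) {
            List<int[]> matches = index.get(row[0]);
            if (matches == null) continue; // inner join: unmatched rows are dropped
            for (int[] match : matches)
                result.add(new int[]{row[0], match[1], row[1]}); // key, build payload, probe payload
        }
        return result;
    }
}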

Conclusion: SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system, may fail you in other instances.

Indexing XML

It is often important to index XPath queries. Not only is XPath useful on its own, but it is also the basis for the FLWOR expressions in XQuery.

A typical XPath expression will select only a small fraction of any XML document (such as the value of a particular attribute). Thus, a sensible strategy is to represent the XML documents as tables. There are several possible maps from XML documents to tables. One of the most common is ORDPATH.

In the ORDPATH model, the root node receives the identifier 1, the first node contained in the root node receives the identifier 1.1, the second one receives the identifier 1.2, and so on. Given the ORDPATH identifiers, we can easily determine whether two nodes are neighbors, or whether they have a child-parent relationship.

As an example, here’s an XML document and its (simplified) ORDPATH representation:


<liste temps="janvier" >
<bateau />
<bateau >
<canard />
</bateau>
</liste>

ORDPATH  name    type       value
1        liste   element
1.1      temps   attribute  janvier
1.2      bateau  element
1.3      bateau  element
1.3.1    canard  element
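Given such labels, testing whether two nodes are in a parent-child or sibling relationship reduces to simple label manipulations. Here is a toy sketch in Java, with ORDPATH labels stored as dot-separated strings purely for illustration:

public class OrdpathSketch {
    // "1.3" is the parent of "1.3.1": the child label is the parent label plus one component.
    static boolean isParentOf(String parent, String child) {
        return child.startsWith(parent + ".")
            && child.substring(parent.length() + 1).indexOf('.') == -1;
    }

    // "1.2" and "1.3" are siblings: same prefix, they differ only in the last component.
    static boolean areSiblings(String a, String b) {
        int i = a.lastIndexOf('.'), j = b.lastIndexOf('.');
        return i != -1 && j != -1 && a.substring(0, i).equals(b.substring(0, j)) && !a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(isParentOf("1.3", "1.3.1")); // true
        System.out.println(areSiblings("1.2", "1.3"));  // true
    }
}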

Given a table, we can easily index it using standard indexes such as B-trees or hash tables. For example, if we index the value column, we can quickly process the XPath expression @temps="janvier".

Effectively, we can map XPath and XQuery queries into SQL. This leaves relatively little room for XML-specific indexes. I am certain that XML database designers have even smarter strategies, but do they work significantly better?

Reference: P. O’Neil et al., ORDPATHs: insert-friendly XML node labels, 2004.

Further reading: Native XML databases: have they taken the world over yet?

Sorting is fast and useful

I like to sort things. If there is one thing you should learn about Computer Science, it is that sorting is fast and useful.

Here’s a little example. You want to check quickly whether an integer belongs to a set. Maybe you want to determine whether a userID is valid. Candidate solutions include a sorted array searched with binary search and a HashSet.

I wrote a Java benchmark to compare the solutions:

Binary search over a sorted array is only about 10% slower than the HashSet. Yet, the sorted array uses half the memory. Hence, the sorted array is the clear winner for this problem.
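If you want to reproduce the experiment, a bare-bones version of such a benchmark might look as follows (a sketch only; the numbers above come from my own code, not from this snippet):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;

public class MembershipBenchmark {
    public static void main(String[] args) {
        int n = 1_000_000;
        Random r = new Random(42);
        int[] sorted = new int[n];
        HashSet<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) { sorted[i] = r.nextInt(); set.add(sorted[i]); }
        Arrays.sort(sorted);

        // Probe both structures with the same queries and time each pass.
        int[] queries = new int[n];
        for (int i = 0; i < n; i++) queries[i] = r.nextInt();

        long t0 = System.nanoTime();
        int hits1 = 0;
        for (int q : queries) if (Arrays.binarySearch(sorted, q) >= 0) hits1++;
        long t1 = System.nanoTime();
        int hits2 = 0;
        for (int q : queries) if (set.contains(q)) hits2++;
        long t2 = System.nanoTime();

        System.out.println("binary search: " + (t1 - t0) / 1_000_000 + " ms, hits = " + hits1);
        System.out.println("HashSet:       " + (t2 - t1) / 1_000_000 + " ms, hits = " + hits2);
    }
}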

If you think that’s a little bit silly, consider that column-oriented DBMSes like Vertica use binary search over sorted columns as an indexing technique.

Code: Source code posted on my blog is available from a github repository.

External-Memory Sorting in Java : the First Release

In my previous post, you were invited to help with a reference implementation of external sorting in Java. Several people tested and improved the code. I like the result.

  • I posted the code on Google code. All contributors are owners of the project. The source code is under Subversion.
  • I have added a link to it from the wikipedia page.

What is left to do?

  • The code remains largely untested. Please run your benchmarks! Find bugs!
  • Please contribute unit tests.
  • Can you write a tutorial on how to use the code?
  • Can you simplify the code further while making it faster and more robust?

Caveat: My intent was for the code to be in the public domain—nobody should own reference implementations—but Google code would not allow it. I selected the Lesser GPL (LGPL) license instead, for now.

Reference: There is a fast external sorting implementation in Java by the Yahoo! people. (Thanks to Thierry Faure for pointing it out.) I have not looked at it.

External-Memory Sorting in Java

Update: this code is now obsolete. Please see the corresponding Github project.

Sometimes, you want to sort large files without first loading them into memory. The solution is to use External Sorting. Typically, you divide the file into small blocks, sort each block in RAM, and then merge the results.

Many database engines and the Unix sort command support external sorting. But what if you want to avoid a database? Or what if you want to sort in a non-lexicographic order? Or maybe you just want a simple external sorting example?

When I could not find such a simple program, I wrote one.
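The overall structure is roughly as follows (a simplified sketch of the technique, not the actual code):

import java.io.*;
import java.util.*;

public class ExternalSortSketch {
    // One entry per temporary file: the current line plus the reader it came from.
    static class Cursor {
        String line; BufferedReader reader;
        Cursor(BufferedReader r) throws IOException { reader = r; line = r.readLine(); }
        boolean advance() throws IOException { line = reader.readLine(); return line != null; }
    }

    // Phase 1: read blocks that fit in RAM, sort each block, write it to a temporary file.
    static List<File> sortInBatches(BufferedReader in, int maxLines) throws IOException {
        List<File> batches = new ArrayList<>();
        List<String> block = new ArrayList<>();
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            block.add(line);
            if (block.size() == maxLines) { batches.add(writeSorted(block)); block = new ArrayList<>(); }
        }
        if (!block.isEmpty()) batches.add(writeSorted(block));
        return batches;
    }

    static File writeSorted(List<String> block) throws IOException {
        Collections.sort(block);
        File f = File.createTempFile("batch", ".tmp");
        f.deleteOnExit();
        try (BufferedWriter out = new BufferedWriter(new FileWriter(f))) {
            for (String s : block) { out.write(s); out.newLine(); }
        }
        return f;
    }

    // Phase 2: k-way merge of the sorted temporary files using a priority queue.
    static void mergeSortedFiles(List<File> batches, BufferedWriter out) throws IOException {
        PriorityQueue<Cursor> pq = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (File f : batches) {
            Cursor c = new Cursor(new BufferedReader(new FileReader(f)));
            if (c.line != null) pq.add(c);
        }
        while (!pq.isEmpty()) {
            Cursor c = pq.poll();
            out.write(c.line); out.newLine();
            if (c.advance()) pq.add(c); else c.reader.close();
        }
    }
}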

Please help me improve it. It needs:

  • To be easy to modify: the code must remain simple.
  • It must scale to very large files.
  • While scalability and simplicity are most important, it must also be sensibly efficient.

Once we have a good solution, I plan to post it on Google code and add a link in the External Sorting wikipedia entry. If you contribute to the code, please add your name so you can get credit.

Reference: ExternalSort.java, http://pastebin.com/H57TZF7e

Which is faster: integer addition or XOR?

The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?

Which line runs faster? (The symbol “^” is the XOR.)

for(int k = 0; k < N; ++k) sum+= k;

for(int k = 0; k < N; ++k) sum^= k;

My result: In C++ and Java, both run at the same speed (within 1%).
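If you want to try it at home, a bare-bones harness looks like this (a quick sketch, not the exact benchmark behind the numbers above; on current processors both operations are simple one-cycle ALU instructions):

public class XorVersusAdd {
    public static void main(String[] args) {
        final int N = 100_000_000;
        long sum = 0, xor = 0;
        // Run several trials so the JIT compiler warms up, then time each loop separately.
        for (int trial = 0; trial < 5; trial++) {
            long t0 = System.nanoTime();
            for (int k = 0; k < N; ++k) sum += k;
            long t1 = System.nanoTime();
            for (int k = 0; k < N; ++k) xor ^= k;
            long t2 = System.nanoTime();
            System.out.println("add: " + (t1 - t0) / 1_000_000 + " ms, xor: " + (t2 - t1) / 1_000_000 + " ms");
        }
        System.out.println(sum + " " + xor); // keep the results live so the JIT cannot drop the loops
    }
}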

Code: My source code is on github.

Is programming “technical”?

According to student evaluations, most of my students appreciate short programming assignments. Yet, every year, some students think that programming is below them or unimportant.

Maybe I should start my courses with this theorem:

Theorem. If you understand an idea, you can implement it in software.

There is no denying that programming requires a lot of technical knowledge. Most programmers do technical jobs, involving testing, building or refactoring code. But programming is ultimately a form of communication. And it is as noble as Mathematics or English. Let us compare:

  • Writers are considered sexy and non-technical people. Yet, grammar and spelling are technical. Moreover, most writers earn a living by writing ads for boring products. Some of them make a living with grand novels, but fewer than you think.
  • Physicists are great thinkers. Yet, their mathematical derivations are often mind-numbing and technical. Many physicists spend years running extremely technical experiments. And when they don’t, they program extremely complex (and technical) simulations.

For some reason, being a writer is considered more prestigious than being a programmer. If you ask me, Linus Torvalds is every bit as cool as J. K. Rowling. And I’d rather have a lunch date with Linus.

Most common questions about recommender systems…

I get ten to fifteen questions a week on recommender systems from entrepreneurs and engineers. Sometimes, I help people find their way in the literature. On occasion—for a consulting fee—I get my hands dirty and evaluate, design or code specific algorithms.  But mostly, I answer the same questions again and again:

1. How much data do I need?

Given your data, you can use cross-validation or A/B testing to measure objectively the effectiveness of a recommender system.

2. We have this system in place, how do we know whether it is sane?

See previous question.

3. My online recommender system is slow!

Laziness is your friend: don’t recompute the recommendations each time you have new data.

4. My customers don’t like the recommendations!

  • Keep expectations in check: recommending products is difficult and even human beings have trouble doing it,
  • Explain the recommendations: nobody trusts a black box,
  • Allow your users to freely explore your data and products in convenient and exciting ways.

5. Which algorithm is best?

You should start with simple algorithms: it worked well enough for Amazon. To do better, a mix of different algorithms is probably best. You can combine them using ensemble methods.
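As an illustration of the last point, blending can be as simple as a weighted average of the scores produced by several recommenders. The Recommender interface below is hypothetical, a stand-in for whatever algorithms you already have:

import java.util.HashMap;
import java.util.Map;

public class BlendedRecommender {
    // Hypothetical interface: a recommender scores an item for a user.
    interface Recommender { double score(int userId, int itemId); }

    // A simple ensemble: the blended score is a weighted average of the members' scores.
    static double blend(Map<Recommender, Double> weighted, int userId, int itemId) {
        double total = 0, weightSum = 0;
        for (Map.Entry<Recommender, Double> e : weighted.entrySet()) {
            total += e.getValue() * e.getKey().score(userId, itemId);
            weightSum += e.getValue();
        }
        return weightSum == 0 ? 0 : total / weightSum;
    }

    public static void main(String[] args) {
        Map<Recommender, Double> members = new HashMap<>();
        members.put((u, i) -> 3.5, 0.7); // e.g., an item-based collaborative filter
        members.put((u, i) -> 4.0, 0.3); // e.g., a popularity baseline
        System.out.println(blend(members, 1, 42)); // 3.65
    }
}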

Where to get your ebooks?

If you read my blog, you probably like to read in general. Thus, if you don’t own an ebook device, you will soon. The choice is growing: the Amazon Kindle, the Sony Reader, the Apple iPad,… I bought a kindle because my wife won’t let me fill the house with books. And I hate to throw away perfectly good paper books.

Amazon has most of the market for now. Yet, using the Kindle store—on the Kindle—is painful. Moreover, Amazon ebooks are protected by Digital Rights Management (DRM). Amazon sells you crippled ebooks that can stop working if you copy them too often. There are often better alternatives elsewhere.

And, in Canada, there is a two-dollar surcharge for every wireless download using the Kindle. Since most ebooks are 0.5MB or less, the wireless delivery costs $4 per megabyte! This is insulting! Moreover, if you buy a book by mistake—which is annoyingly common—Amazon will reimburse the cost of the book itself, but not the fee for the wireless download.

Thankfully, you can grab books compatible with the kindle (in Mobipocket format) elsewhere. Then you can drop the file on the kindle using the USB port.

  • You can get nearly 2000 great French classics for free on ebookgratuits. This includes a large fraction of the works of Honoré de Balzac.
  • Project Gutenberg offers 30,000 free e-books in various languages (mostly English).
  • WebScription sells DRM-free ebooks in various formats. Most books fall into the scifi, young adult and fantasy genres.

I am currently reading You’re Not Fooling Anyone When You Take Your Laptop to a Coffee Shop by Scalzi. I bought it at WebScription for six dollars. It is a compilation of Scalzi’s blog posts on his life as a writer. I am fascinated by how much it resembles my own life. Well… Except for the fact that I don’t get paid when I publish a paper. Maybe I should put together a compilation of posts about my silly work life. Would anyone buy it for six dollars?

I am also reading Halting State by Stross which I bought on Amazon for ten dollars. I haven’t yet gotten into the mood of the novel.

Further reading:

  • According to a Computer Scientist, the iPad could make Computer Science obsolete.
  • While I don’t think academic journals will be available on the Kindle any time soon, I think that has mostly to do with how insane academic publishing is.

Actual programming with HTML and CSS (without javascript)

I usually stick with academic or research issues, but today, I wanted to have some fun. Geek fun.

While W3C describes Cascading Style Sheets (CSS) as a mechanism for adding style (e.g. fonts, colors, spacing) to Web documents, it is also a bona fide programming language. In fact, it is one of the most widespread declarative languages.

But how powerful is CSS? It is not Turing complete, so there is no hope of programming full applications.  However, CSS may be surprisingly powerful. (Update: It turns out that HTML+CSS is Turing complete!)

Suppose you are given a plain HTML table and you want to add counters and colors to it. Starting with this table:

<table>
<tr><th>City</th><th>Color</th></tr>
<tr><td>Montreal</td><td>Red</td></tr>
<tr><td>Toronto</td><td>Blue</td></tr>
<tr><td>Vancouver</td><td>Yellow</td></tr>
</table>

We want to color the rows in alternating colors. We need counters and a way to style every other row differently.

Solution: Add the following lines to your CSS style sheet:

tr {counter-increment: mycounter}
/* start the counter at -1 so that the header row brings it to 0 */
table {counter-reset: mycounter -1}
/* prefix the first cell of each row with the row counter */
td:first-child:before {content: counter(mycounter) ". " }
/* alternate the background color of the data rows */
tr:nth-child(2n+2) td {background-color: #ccc;}
tr:nth-child(2n+3) td {background-color: #ddd;}

My best blog posts (2009)

As year 2009 comes to an end, I selected a few of my best blog posts.

Database, compression and column stores:

Hashing (contrarian blog posts):

Academia and Research:

Entropy-efficient Computing

Microprocessors and storage devices are subject to the second law of thermodynamics: using them turns usable energy (oil, hydrogen) into unusable energy (heat). Data centers are already limited by their power usage and heat production. Moreover, many new devices need to operate for a long time with little power: (1) smart phones, (2) powerful computing devices inserted into our bodies, (3) robots sent into space.

Our approach to entropic efficiency remains crude. We improve power supplies. We shut down disks and CPUs when they are idle. Deeper questions arise, however:

  • Except maybe for embarrassingly parallel problems (such as serving web pages), parallel computing trades entropic efficiency for short running times. If entropic efficiency is your goal, will you stay away from non-trivial parallelism?
  • We analyze algorithms by considering their running time. For example, shuffling an array takes time O(n) whereas sorting it takes time O(n log n). Yet, unlike sorting, I expect that shuffling an array should be possible without any entropy cost (heat generation)!

Suppose I give you a problem, and I ask you to solve it using as little entropy as possible. How would you go about it?

Further reading:

Run-length encoding (part 3)

In Run-length encoding (part 1), I presented the various run-length encoding formats. In part 2, I discussed the coding of the counters. In this third part, I want to discuss the ordering of the elements.

Indeed, the compression efficiency of run-length encoding depends on the ordering. For example, the sequence aaabbb is far more compressible than the sequence ababab.

Reordering sequences

Text, images, and sound are stored on disk as strings. Intuitively, we may think that strings cannot be reordered without losing information. But this intuition is wrong!

You can reorder strings without losing any information, as long as the reordering is invertible. The Burrows–Wheeler transform—invented in 1994—is an invertible transform which tends to generate streams of repeated characters. It is used by the open source bzip2 software. This invertible reordering trick works so well that bzip2 is “within 10% to 15% of the best available techniques, whilst being around twice as fast at compression and six times faster at decompression.”

Ordering sets

Sometimes we want to compress sets of elements. For example, the ordering of the rows in a relational database is semantically irrelevant. Similarly, in a phone directory, we are not concerned about the order of the entries. Of course, we are trained to expect entries to be sorted lexicographically starting from the last names:

Last name  First name
Abeck      John
Becket     Patricia
Smith      John
Tucker     Patricia

Such sequences of tuples can be compressed column-wise: compress each column with run-length encoding. Reordering the rows so as to maximize compression is NP-hard by reduction from the Hamiltonian path problem. Moreover, it can be reduced to an instance of the traveling salesman problem.

Fortunately, a simple and efficient heuristic is available. Reorder the columns in increasing cardinality: put the columns having the fewest number of distinct values first. Next, sort the table lexicographically.

Applying this heuristic, our previous table becomes:

First name  Last name
John        Abeck
John        Smith
Patricia    Becket
Patricia    Tucker
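In code, the heuristic takes only a few lines. Here is a Java sketch over a toy table stored as an array of String rows (an assumption for illustration; real systems work on compressed columns directly):

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

public class ReorderForRle {
    // Heuristic: order the columns by increasing number of distinct values,
    // then sort the rows lexicographically to create long runs in the first columns.
    static String[][] reorder(String[][] table) {
        int cols = table[0].length;
        final int[] cardinality = new int[cols];
        for (int c = 0; c < cols; c++) {
            Set<String> distinct = new HashSet<>();
            for (String[] row : table) distinct.add(row[c]);
            cardinality[c] = distinct.size();
        }
        Integer[] order = new Integer[cols];
        for (int c = 0; c < cols; c++) order[c] = c;
        Arrays.sort(order, Comparator.comparingInt(c -> cardinality[c]));
        String[][] out = new String[table.length][cols];
        for (int r = 0; r < table.length; r++)
            for (int c = 0; c < cols; c++) out[r][c] = table[r][order[c]];
        // Lexicographic sort over the reordered rows.
        Arrays.sort(out, (a, b) -> {
            for (int c = 0; c < a.length; c++) {
                int cmp = a[c].compareTo(b[c]);
                if (cmp != 0) return cmp;
            }
            return 0;
        });
        return out;
    }

    public static void main(String[] args) {
        String[][] table = {{"Abeck", "John"}, {"Becket", "Patricia"},
                            {"Smith", "John"}, {"Tucker", "Patricia"}};
        for (String[] row : reorder(table)) System.out.println(row[0] + " " + row[1]);
    }
}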

Further reading:

Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011.

Daniel Lemire, Owen Kaser, Eduardo Gutarra, Reordering Rows for Better Compression: Beyond the Lexicographic Order, ACM Transactions on Database Systems 37 (3), 2012.

Run-length encoding (part 2)

(This is a follow-up to my previous blog post, there is also a follow-up: part 3.)

Any run-length encoding requires you to store the number of repetitions. In my example, where AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K, we must store 5 counters (3, 5, 1, 2, 1) and 5 characters.

Storing counters using a fixed number of bits

Most programming languages represent integers using a fixed number of bits in binary format. For example, Java represents integers (int) using 32 bits. However, in my example (3A-5B-1Z-2W-1K), storing the counters using 32 bits and the characters using 8 bits means that I use 25 bytes, which is more than twice the length of the original 12-byte string (AAABBBBBZWWK).

Thus, we have a simple optimization problem: determine the best number of bits.  In practice, it might be better to store the data in a byte-aligned way. That is, you should be using 8, 16, 32 or 64 bits. Indeed, reading numbers represented using an arbitrary number of bits may involve a CPU processing overhead.

If you use too few bits, some long runs will have to count as several small runs. If you use too many bits, you are wasting storage. Unfortunately, determining on a case-by-case basis the best number of bits requires multiple scans of the data. It also implies added software complexity.

But you don’t have to use the binary format!

You can still use a fixed number of bits for your counters, but with quantized codes instead of the binary format. For example, using 3 bits, you could only allow the counter values 1, 2, 16, 24, 32, 128, 256, 1024. In this example, the sequence of bits 000 is interpreted as the value 1, the sequence of bits 001 as the value 2, the sequence 010 as 16, and so on. Determining the best codes implies that you must scan the data, compute the histogram of the counters, and then apply some optimization algorithm (such as dynamic programming). The decoding speed might be slightly slower as you need to look up the codes from a table.
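For instance, decoding such 3-bit quantized codes is just a table lookup, while a greedy encoder picks the largest allowed value that fits the run, splitting long runs into several smaller ones. A toy sketch, using the set of allowed counter values from the example above:

public class QuantizedCounters {
    // The 3-bit code is an index into a small table of allowed counter values.
    static final int[] ALLOWED = {1, 2, 16, 24, 32, 128, 256, 1024};

    static int decode(int threeBitCode) { return ALLOWED[threeBitCode]; }

    // Greedily pick the largest allowed value that fits; long runs become several smaller runs.
    static int encodeLargestFit(int runLength) {
        for (int code = ALLOWED.length - 1; code >= 0; code--)
            if (ALLOWED[code] <= runLength) return code;
        throw new IllegalArgumentException("run length must be at least 1");
    }
}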

Using variable-length counters for optimal compression

If you are willing to sacrifice coding and decoding speed, then you can represent the counters using universal codes. Thus, instead of using a fixed number of bits and optimizing the representation (as in the quantized coding idea), you seek an optimal variable-length representation of the counters. With this added freedom, you can be much more efficient.

To illustrate the idea behind variable-length codes, we consider unary codes: the code 1 represents 1, the code 01 represents 2, the code 001 represents 3, the code 0001 represents 4, and so on. Thus, we use x bits to represent the number x.

Fortunately, we can do much better and represent the number x using roughly 2 log x bits with Elias gamma codes. First, write x as x = 2^N + (x modulo 2^N), where N is the integer part of the base-2 logarithm of x. We write N zeros followed by a one (N+1 bits), and then store (x modulo 2^N) in binary format using N bits. Thus, we represent x using 2N+1 bits, which is roughly 2 log2(x). (Elias delta codes do even better, encoding x in about log2(x) + 2 log2(log2(x)) bits.)
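Here is a minimal Java sketch of this exponent-plus-mantissa construction (bits are appended to a StringBuilder for readability; a real coder would pack them into machine words):

public class GammaCodeSketch {
    // Encode x >= 1: write N = floor(log2 x) as N zeros followed by a one,
    // then write x - 2^N (the mantissa) in binary using exactly N bits.
    static String encode(int x) {
        int n = 31 - Integer.numberOfLeadingZeros(x); // floor(log2 x)
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < n; i++) bits.append('0');
        bits.append('1');
        for (int i = n - 1; i >= 0; i--) bits.append((x >> i) & 1);
        return bits.toString();
    }

    public static void main(String[] args) {
        // 1 -> "1", 2 -> "010", 3 -> "011", 5 -> "00101": roughly 2 log2(x) + 1 bits.
        for (int x : new int[]{1, 2, 3, 5, 1000}) System.out.println(x + " -> " + encode(x));
    }
}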

Unfortunately, the current breed of microprocessors is not kind to variable-length representations, so the added compression comes at the expense of decoding speed.

Continue with part 3.

References and further reading: Holloway et al., How to Barter Bits for Chronons, 2007. See also the slides of my recent talk Compressing column-oriented indexes.

Run-length encoding (part 1)

(This is part 1, there is also a part 2 and a part 3.)

Run-length encoding (RLE) is probably the most important and fundamental string compression technique. Countless multimedia formats and protocols use one form of RLE compression or another.

RLE is also deceptively simple. It represents repeated values as a counter and a character. Thus, the string AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K.
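Here is the basic scheme in Java, a toy sketch that produces the 3A-5B-1Z-2W-1K form as a string:

public class RunLengthEncoding {
    // Replace each run of identical characters by its length followed by the character.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int j = i;
            while (j < s.length() && s.charAt(j) == s.charAt(i)) j++;
            if (out.length() > 0) out.append('-');
            out.append(j - i).append(s.charAt(i));
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("AAABBBBBZWWK")); // 3A-5B-1Z-2W-1K
    }
}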

If that were all there was to RLE, then the wikipedia page on run-length encoding would be just fine. Yet, I think it needs help.

Why do we use RLE?

  • You can read and write RLE data in one pass, using almost no memory.
  • Given a vector V compressed with RLE, you can apply any scalar function f to its components in time O(|V|) where |V| is the compressed size of the vector.
  • Given two vectors V and V' compressed with RLE, you can do arithmetic (e.g. V+V') in time O(|V|+|V'|).

(Some RLE formats have worse complexity bounds.)

Any downsides to RLE?

  • Random access is slower. Sometimes, only sequential read (from the beginning) is possible. Updating an RLE-compressed array can be difficult.
  • You need long runs of identical values.
  • Some RLE formats negatively affect CPU vectorization. Thus, if the compression rates are modest, it could actually take longer to process an RLE-compressed array.

What is the RLE format?

There is no unique RLE format. How you use the RLE idea depends on your goals such as (1) maximize the compression rate (2) maximize the processing speed.

Here are some common variations:

  • Instead of using a counter for each run of characters, you only add a counter after a value has been repeated twice; the counter gives the number of additional repetitions. For example, the string AAABBBBBZWWK becomes AA1-BB3-Z-WW0-K. Thus, if many characters are not repeated, you will rarely use an unnecessary counter.
  • You can use a single bit to decide whether a counter is used. For example, the string AAABBBBBZWWK becomes A-True-3, B-True-5, Z-False, W-True-2, K-False. Again, this may avoid many unnecessary counters if values are often not repeated.
  • Instead of a counter, you may store the location of the run in the array. For example, the string AAABBBBBZWWK becomes 1A-4B-9Z-10W-12K. The benefit of this approach is to allow random access in logarithmic time using binary search. However, it is also incompatible with some techniques to avoid unnecessary counters. So, it is a compression-speed trade-off. For even more speed, you can store both the location of the run and its length (thus avoiding a subtraction).
  • To help vectorization, you can group the characters into blocks of k characters. For example, using blocks of two characters, the string AAABBBBBZWWK becomes 1AA-1AB-2BB-1ZW-1WK. Again, this is a compression-speed trade-off because there will be fewer runs to compress after grouping the characters into long blocks.

Continue reading to part 2.

Some References (to my own work):

More database compression means more speed? Right?

Current practical database compression techniques stress speed over compression:

  • Vectorwise is using Super-scalar RAM-CPU cache compression which includes a carefully implemented dictionary coder.
  • C-store—and presumably Vertica—is using similar compression techniques as well as simple run-length encoding of projection indexes.
  • Oracle is compressing its bitmaps using Byte-aligned bitmap compression—a variation on run-length encoding.
  • Wu et al.’s Fastbit as well as my very own Lemur Bitmap Index C++ Library use less aggressive compression techniques, for faster results. In fact, one of my recent empirical results is that on a two-CPU dual-core machine, using 64-bit words instead of 32-bit words in word-aligned compression—which nearly halves the compression—can make the processing faster.
  • LucidDB similarly compresses its bitmap indexes with a simple variation on run-length encoding.

In a comment to my previous blog post, Rasmus Pagh asks more or less this question:

Given that we have more and more CPU cores—and generally more powerful CPUs—shouldn’t we compress the data more aggressively?

As the CPUs get more powerful, shouldn’t all database become I/O bound? That is, the main difficulty is to bring enough data into the CPUs?

Apparently not.

  • As we have more CPU cores, we also have more bandwidth to bring data to the cores. Otherwise, CPU cores would be constantly data-starved in most multimedia and business applications.
  • Multicore CPUs are not the only game in town: RAM and storage have also been revolutionized. In 2009, it is not impractical to run entire database applications from RAM. After all, many business databases fit in 16 GB of RAM. And even when we do not have enough RAM, we have SSDs.
  • Lightweight compression techniques often favor vectorization which is also getting more and more important and powerful.

Hence, for most database applications fast decompression remains preferable to aggressive compression.

Which should you pick: a bitmap index or a B-tree?

Morteza Zaker sent me a pointer to their work comparing bitmap indexes and B-trees in the Oracle database. They examine the folklore surrounding bitmap indexes—which are often thought to be mostly useful over low cardinality columns (columns having few distinct values, such as gender). Their conclusion is that the bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality. Their claim is backed by experiments using columns with millions of distinct values. This confirms the observation made in my earlier post: The mythical bitmap index.

They make an interesting distinction between full cardinality columns, columns where each value appears only once, and high cardinality columns where at least a few values are often repeated (think about storing last names). A bitmap index is inadequate for a full cardinality column because there is no compression possible, and this is probably how the folklore around bitmap indexes came about. Yet, unlike transaction processing, data warehousing is usually not concerned with full cardinality columns.

Reference: Zaker et al., An Adequate Design for Large Data Warehouse Systems: Bitmap Index Versus B-Tree Index, IJCC 2 (2), 2008.

Further reading: Trading compression for speed with vectorization, To improve your indexes: sort your tables! and my research paper Sorting improves word-aligned bitmap indexes.