More database compression means more speed? Right?

Current practical database compression techniques stress speed over compression ratio:

  • Vectorwise is using Super-scalar RAM-CPU cache compression, which includes a carefully implemented dictionary coder.
  • C-store—and presumably Vertica—is using similar compression techniques as well as simple run-length encoding of projection indexes.
  • Oracle is compressing its bitmaps using Byte-aligned bitmap compression—a variation on run-length encoding.
  • Wu et al.’s Fastbit as well as my very own Lemur Bitmap Index C++ Library use less aggressive compression techniques, for faster results. In fact, one of my recent empirical results is that on a two-CPU dual-core machine, using 64-bit words instead of 32-bit words in word-aligned compression—which nearly halves the compression ratio—can make the processing faster. (See the sketch after this list.)
  • LucidDB similarly compresses its bitmap indexes with a simple variation on run-length encoding.
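
To make the word-aligned idea concrete, here is a minimal C++ sketch of an EWAH-style scheme over 64-bit words. It only illustrates the general approach: the function name and the marker-word layout are mine, not the actual format used by any of the libraries above.

    #include <cstdint>
    #include <vector>

    // Word-aligned run-length compression of a bitmap, EWAH-style (sketch).
    // Each marker word records a run of identical "clean" words (all zeros
    // or all ones) followed by the number of verbatim "dirty" words.
    std::vector<uint64_t> compress(const std::vector<uint64_t>& bitmap) {
        const uint64_t maxcount = (uint64_t(1) << 31) - 1; // 31-bit counters
        std::vector<uint64_t> out;
        size_t i = 0;
        while (i < bitmap.size()) {
            // count a run of identical clean words
            uint64_t fillbit = (bitmap[i] == ~uint64_t(0)) ? 1 : 0;
            uint64_t clean = 0;
            while (i < bitmap.size()
                   && (bitmap[i] == 0 || bitmap[i] == ~uint64_t(0))
                   && (bitmap[i] & 1) == fillbit
                   && clean < maxcount) {
                ++clean;
                ++i;
            }
            // count the dirty words that follow and must be stored verbatim
            size_t start = i;
            while (i < bitmap.size()
                   && bitmap[i] != 0 && bitmap[i] != ~uint64_t(0)
                   && uint64_t(i - start) < maxcount) {
                ++i;
            }
            uint64_t dirty = uint64_t(i - start);
            // marker: bit 0 = fill bit, bits 1-31 = clean count, bits 32-62 = dirty count
            out.push_back(fillbit | (clean << 1) | (dirty << 32));
            out.insert(out.end(), bitmap.begin() + start, bitmap.begin() + i);
        }
        return out;
    }

Because everything stays aligned on machine words, Boolean operations such as AND and OR can process whole runs at a time during query evaluation, without ever materializing the uncompressed bitmap.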

In a comment to my previous blog post, Rasmus Pagh asks more or less this question:

Given that we have more and more CPU cores—and generally more powerful CPUs—shouldn’t we compress the data more aggressively?

As CPUs get more powerful, shouldn’t all databases become I/O bound? That is, shouldn’t the main difficulty be bringing enough data to the CPUs?

Apparently not.

  • As we have more CPU cores, we also have more bandwidth to bring data to the cores. Otherwise, CPU cores would be constantly data-starved in most multimedia and business applications.
  • Multicore CPUs are not the only game in town: RAM and storage have also been revolutionized. In 2009, it is not impractical to run database applications entirely from RAM. After all, many business databases fit in 16 GB of RAM. And even when we do not have enough RAM, we have SSDs.
  • Lightweight compression techniques often lend themselves to vectorization, which is itself becoming more important and more powerful. (See the sketch after this list.)
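
To illustrate the last point, here is a tiny C++ sketch (the function name and layout are mine, purely for illustration) of frame-of-reference decoding, a typical lightweight scheme: the loop is branch-free and every element is independent, so a compiler can auto-vectorize it, whereas a bit-serial, data-dependent Huffman or arithmetic decoder cannot be unrolled this way.

    #include <cstdint>
    #include <cstddef>

    // Frame-of-reference decoding: each value is stored as a small delta
    // from a common base. The loop has no branches and no cross-element
    // dependencies, so it maps directly onto SIMD instructions.
    void for_decode(const uint32_t* deltas, uint32_t base,
                    uint32_t* out, size_t n) {
        for (size_t i = 0; i < n; ++i)
            out[i] = base + deltas[i];
    }

Heavier codecs buy a better compression ratio, but they give up exactly this kind of regular, data-independent inner loop.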

Hence, for most database applications fast decompression remains preferable to aggressive compression.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

9 thoughts on “More database compression means more speed? Right?”

  1. Google’s compression is probably worth a mention here too. They tend to use very fast, lightweight compression as well (e.g. Zippy).

    Some of it is described in their papers (a small section in the Bigtable paper) and talks (small mentions of it in Jeff Dean’s talks, such as his Bigtable talk or his recent LADIS 2009 talk).

  2. @Greg

    Thanks Greg. This is the kind of comment that makes blogging so profitable.

    Shame on me: I did not even think of BigTable, and I don’t know anything about their compression techniques… more reading for me… I hope to blog about it in the future.

  3. As we have more CPU cores, we also have more bandwidth to bring data to the cores.

    There is no reason for that to be true: advances in processor technology often follow a different curve than advances in memory architectures. It may well be the case that many-core architectures in the future are increasingly bandwidth-constrained: once data reaches a core, computation cycles are cheap, but data movement into / out of cores might be relatively expensive.

  4. @Conway

    The reason for this to be true is stated in my post: otherwise, CPU cores would be constantly data-starved in most multimedia and business applications. Intel will not mass-produce CPUs unless the technology to keep them busy with mainstream applications is out there.

    Disclaimer: you can always find special cases, and nobody can predict the future.

  5. Well, it may well be the case that “CPU cores will be constantly data-starved” in the future, certainly for many run-of-the-mill business applications.

    In the recent past, most superscalar chips had only a very limited ability to extract instruction-level parallelism from most programs — that didn’t stop Intel from mass-producing those chips, even though they were utilized relatively inefficiently. Those chips (e.g. Pentium IV) were still useful, despite the inefficiency. Similarly, manycore chips may still be useful, even if they are relatively bandwidth-starved — which would make communication-intelligent designs increasingly important.

  6. @Conway

    Sure. It could happen that in the future the cores will be constantly data-starved. Nobody can predict the future. But it is not the case right now. Lightweight compression has outperformed heavyweight compression within databases for years, if not decades.

    (Some papers have claimed that databases are I/O bound. They have just not convinced me, nor the database industry.)

  7. Ready for a radical idea? Too bad, here it is anyway.

    Having to categorize one’s compression method as being lightweight or heavyweight suggests to me that the method just isn’t very good or appropriate for the data. Good methods do a good job of data modeling, and with really good data modeling the otherwise-competing performance goals of speed and size can go hand in hand. To me, in the case of structured databases, good data modeling means multidimensional data modeling; unfortunately, nearly all the methods now being used are inherently one-dimensional and, predictably, wind up requiring compromise.

    I say that after having developed compression software that achieves almost 8-to-1 compression of the TPC-H lineitem table, a common benchmark in the DBMS world. That is far beyond published results from Oracle and IBM, and it demonstrates how much better, and more appropriate, multidimensional data modeling is when one is dealing with multidimensional data.

  8. @Glenn

    The type of compression ratio you are referring to is already possible using publicly available algorithms:

    V. Raman and G. Swart. Entropy compression of relations and querying of compressed relations. In VLDB, 2006.

  9. That’s a super paper and a good start in the right direction! Those results, although measured on entropy-reduced data, illustrate the power of multidimensional approaches and the goal alignment (speed with size) one can get from good models.

    Their paper contains a footnote that I found not only intriguing but suggestive of a direction to go to reach the next level of performance:

    “By modeling the tuple sources as i.i.d., we lose the ability to exploit inter-tuple correlations. To our knowledge, no one has studied such correlations in databases – all the work on correlations has been among fields within a tuple. If inter-tuple correlations are significant, the information theory literature on compression of non zero-order sources might be applicable.”

    Funny they should say that! The answer is both yes and no. Yes, inter-tuple correlations can and should be exploited to compress structured data; I led a team that did that with great success some 20 years ago. And no, we found the information theory literature to be irrelevant. Information theory concerns encoding modeled data, not the design of the data models themselves. That is where the challenges and benefits lie.
