Column stores and row stores: should you care?

Most database users know row-oriented databases such as Oracle or MySQL. In such engines, the data is organized by rows. Database researcher and guru Michael Stonebraker has been advocating column-oriented databases. The idea is quite simple: by organizing the data into columns, we can compress it more efficiently (using simple ideas like run-length encoding). He even founded a company, Vertica, to sell this idea.

Daniel Tunkelang is back from SIGMOD: he reports that column-oriented databases have grabbed much mindshare. While I did not attend SIGMOD, I am not surprised. Daniel Abadi was awarded the 2008 SIGMOD Jim Gray Doctoral Dissertation Award for his excellent thesis on Column-Oriented Database Systems. Such great work supported by influential people such as Stonebraker is likely to get people talking.

But are column-oriented databases the next big thing? No.

  • Column stores have been around for a long time in the form of bitmap and projection indexes. Conceptually, there is little difference. (See my own work on bitmap indexes.)
  • While it is trivial to change or delete a row in a row-oriented database, it is harder in column-oriented databases. Hence, applications are limited to data warehousing.
  • Column-oriented databases are faster for some applications. Sometimes faster by two orders of magnitude, especially on low selectivity queries. Yet, part of these gains are due to the recent evolution in our hardware. Hardware configurations where reading data sequentially is very cheap favor sequential organization of the data such as column stores. What might happen in the world of storage and microprocessors in the next ten years?

I believe Nicolas Bruno said it best in Teaching an Old Elephant New Tricks:

(…) some C-store proponents argue that C-stores are fundamentally different from traditional engines, and therefore their benefits cannot be incorporated into a relational engine short of a complete rewrite (…) we (…) show that many of the benefits of C-stores can indeed be simulated in traditional engines with no changes whatsoever.  Finally, we predict that traditional relational engines will eventually leverage most of the benefits of C-stores natively, as is currently happening in other domains such as XML data.

That is not to say that you should avoid Vertica’s products or do research on column-oriented databases. However, do not bet your career on them. The hype will not last.

(For a contrarian point of view, read Adabi and Madden’s blog post on why column stores are fundamentally superior.)

3 thoughts on “Column stores and row stores: should you care?”

  1. One claim that caught my attention was compressibility in column oriented databases: the column stores tend to compress very well, significantly increasing IO bandwidth (x number of bytes from disk translates to >> x number of bytes of actual data). Since most DBs are IO bound, this turns out to provide a big real-world performance advantage. What do you think?

  2. @Parand

    Yes, but “most DBs are IO bound” is not the entire explanation. Here are two finer points:

    A) It is not all that true.

    On this blog (search for it), I have run experiments showing that parsing CSV files was easily CPU bound. Of course, you have to define properly what “parsing” means… I mean here to find the strings, then copy them into some data structure.

    That is why databases use relatively “cheap” compression techniques. Going out of your way to squeeze the data down might be counterproductive. Compression is not everything.

    Thankfully, column-oriented designs allow “cheap” compression techniques to work well. Basically, you sort the data (a relatively cheap operation) and then you then you apply run-length encoding.

    B) Compression is not only about reducing IO costs.

    As an example, is it faster to compute the sum of:

    111122222
    or
    4×1, 5×2
    ?

    Clearly, it is faster to compute the sum of the “compressed” array. So compression can also save CPU cycles when *you are operating directly over the compressed data stream*. Whenever you need to load the data in RAM, then uncompress it, and then work over the uncompressed data, you have to worry that you will overload your memory bandwidth.

Leave a Reply

Your email address will not be published. Required fields are marked *