Most database users know row-oriented databases such as Oracle or MySQL. In such engines, the data is organized by rows.Â Database researcher and guru Michael Stonebraker has been advocating column-oriented databases. The idea is quite simple: by organizing the data into columns, we can compress it more efficiently (using simple ideas like run-length encoding). He even founded a company, Vertica, to sell this idea.
Daniel Tunkelang is back from SIGMOD:Â he reports that column-oriented databases have grabbed much mindshare. While I did not attend SIGMOD, I am not surprised. Daniel Abadi was awarded theÂ 2008 SIGMOD Jim Gray Doctoral Dissertation Award for his excellent thesis on Column-Oriented Database Systems. Such great work supported by influential people such as Stonebraker is likely to get people talking.
But are column-oriented databases the next big thing? No.
- Column stores have been around for a long time in the form of bitmap and projection indexes. Conceptually, there is little difference. (See my own work on bitmap indexes.)
- While it is trivial to change or delete a row in a row-oriented database, it is harder in column-oriented databases. Hence, applications are limited to data warehousing.
- Column-oriented databases are faster for some applications. Sometimes faster by two orders of magnitude, especially on low selectivity queries. Yet, part of these gains are due to the recent evolution in our hardware. Hardware configurations where reading data sequentially is very cheap favor sequential organization of the data such as column stores. What might happen in the world of storage and microprocessors in the next ten years?
I believeÂ Nicolas Bruno said it best inÂ Teaching an Old Elephant New Tricks:
(…) some C-store proponents argue that C-stores are fundamentally different from traditional engines, and therefore their beneï¬ts cannot be incorporated into a relational engine short of a complete rewrite (…) we (…) show that many of the beneï¬ts of C-stores can indeed be simulated in traditional engines with no changes whatsoever.Â Â Finally, we predict that traditional relational engines will eventually leverage most of the beneï¬ts of C-stores natively, as is currently happening in other domains such as XML data.
That is not to say that you should avoid Vertica’s products or do research on column-oriented databases. However, do not bet your career on them. The hype will not last.
(For a contrarian point of view, read Adabi and Madden’s blog post on why column stores are fundamentally superior.)
3 thoughts on “Column stores and row stores: should you care?”
One claim that caught my attention was compressibility in column oriented databases: the column stores tend to compress very well, significantly increasing IO bandwidth (x number of bytes from disk translates to >> x number of bytes of actual data). Since most DBs are IO bound, this turns out to provide a big real-world performance advantage. What do you think?
Yes, but “most DBs are IO bound” is not the entire explanation. Here are two finer points:
A) It is not all that true.
On this blog (search for it), I have run experiments showing that parsing CSV files was easily CPU bound. Of course, you have to define properly what “parsing” means… I mean here to find the strings, then copy them into some data structure.
That is why databases use relatively “cheap” compression techniques. Going out of your way to squeeze the data down might be counterproductive. Compression is not everything.
Thankfully, column-oriented designs allow “cheap” compression techniques to work well. Basically, you sort the data (a relatively cheap operation) and then you then you apply run-length encoding.
B) Compression is not only about reducing IO costs.
As an example, is it faster to compute the sum of:
Clearly, it is faster to compute the sum of the “compressed” array. So compression can also save CPU cycles when *you are operating directly over the compressed data stream*. Whenever you need to load the data in RAM, then uncompress it, and then work over the uncompressed data, you have to worry that you will overload your memory bandwidth.
Indeed, compressibility seemed to be a major advantage they observed in using a column store vs. in Hadoop. Wasn’t at all clear to me what that was / should be the case.
You may subscribe to this blog by email.