How should you build a high-performance column store for the 2020s?

Though most relational databases (like MySQL) are “row oriented”, in that they keep rows stored together… experience has taught us that pivoting the data arrow so that columns, and not rows, are stored together can be beneficial. This is an old observation that most experienced programmers know about as the array-of-struct versus struct-of-array choice. There is even a wikipedia page on the topic.

For example, in Java you might have a Point object containing three attributes (X, Y, and Z). You can store an array of Point objects. Or you could have three arrays (one for X, one for Y and one for Z). Which one is best depends on how you are using the data, you should consider both.

It is fairly common for databases to have tons of columns, and for analytical queries to bear only on a few columns. So it makes sense to keep the columns together.

There is actually a full spectrum of possibilities. For example, if you have a timestamp column, you might consider it as several separate columns that can be stored together (years, day, seconds). It is impossible in the abstract to tell how to best organize the data, you really need some domain knowledge about how you access the data.

In the data science world, I have been very interested in a new initiative called Apache Arrow. One way to view Apache Arrow is a common column-oriented database format that can be shared by many database engines.

Column-orientation is at its best when used with compression. Indeed, it is arguably easier and more efficient to compress columns stored together, than one they are interleaved with other columns.

Arrow relies on dictionary compression, a tried-and-true technique. It is a great technique that can enable really fast algorithm. The key idea is quite simple. Suppose that there are 256 distinct values in your column. Then you can represent them as numbers in [0,255], with a secondary lookup table to recover the actual values. That is, if you have N distinct values, you can store them using ceil(log(N)/log(2)) bits using binary packing (C code, Java code, Go code, C# code). Thus you might use just one byte per value instead of possibly much more. It is a great format that enables superb optimizations. For example you can accelerate dictionary decoding using SIMD instructions on most Intel and AMD processors released in the last 5 years.

In a widely read blog post, professor Abadi criticized Apache Arrow in these terms:

I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression).

I thought I would comment a bit on this objection because I spent a lot of time hacking these problems.

Let me first establish the terminology:

  1. Run-length encoding is the idea where you represent repeated values as the value being repeated followed by some count that tells you how often the value is repeated. So if you have 11111, you might store is as “value 1 repeated 5 times”.
  2. A bit-vector (or bitmap, or bit array, or bitset) is simply an array of bits. It is useful top represent sets of integer values. So to repeat the set {1,2,100}, you would create an array containing at least 100 bits (e.g., an array of two 64-bit words) and set only the bits at index 1, 2 and 100 to ‘true’, all other bits would be set to false. I gave a talk at the Spark Summit earlier this year on the topic.

    To make things complicated, many bit-vectors used in practice within indexes rely on some form of compression.

    You can build an entire data engine on top of bit-vectors: Pilosa is a great example.

Here are some considerations:

  1. To sort, or not to sort?

    Run-length encoding is very powerful when the data is sorted. But you cannot have all columns in sorted order. If you have 20 columns, you might lexicographically sort the data on column 2 and 3, and then column 2 and 3 will be very highly compressible by run-length encoding, but other columns might not benefit so much from your sorting efforts. That’s great if your queries focus on columns 2 and 3, but much less interesting if you have queries hitting all sorts of columns. You can duplicate your data, extracting small sets of overlapping columns, sorting the result, but that brings engineering overhead. You could try using sorting based on space-filling curves, but if you have lots of columns, that might be worse than useless. There are better alternatives such as Vortex or Multiple-List sort.

    Should you rely on sorting to improve compression? Certainly, many engines, like Druid, derive a lot of benefits from sorting. But it is more complicated than it seems. There are different ways to sort, and sorting is not always possible. Storing your data in distinct projection indexes like Vertica may or may not be a good option engineering-wise.

    The great thing about a simple dictionary encoding is that you do not need to ask these questions…

  2. There are more ways to compress than you know.

    I have been assuming (and I guess Adabi did the same) that dictionary coding was reliant on binary packing. But there are other techniques found in Oracle and SAP platforms. All of them assume that you divide your column data into blocks.

    • By using dictionary coding again within each small block, you can achieve great compression because most blocks will see very few distinct column values. That’s called indirect coding.
    • Another technique is called sparse coding, where you use a bit-vector to mark the occurrences of the most frequent value, followed by a list of the other values.
    • You also have prefix coding where you record the first value in the block and how often it is repeated consecutively, before storing the rest of the block the usual manner.

    (We review these techniques in Section 6.1.1 of one of our papers.)

    But that is not all! You can also use patched coding techniques, such as FastPFor (C++ code, Java code). You could even get a good mileage out of a super fast Stream VByte.

    There are many, many techniques to choose from, with different trade-offs between engineering complexity, compression rates, and decoding speeds.

  3. Know your hardware!

    And every single processor you might care about would support SIMD instructions that can process several values at once. And yes, this includes Java through the Panama project (there is a great talk by top Oracle engineers on this topic). So you absolutely want to make it easy to benefit from these instructions since they are only getting better over time. Intel and AMD will popularize AVX-512, ARM is coming with SVE.

    I think that if you are designing for the future, you want to take into account SIMD instructions in the same way you would take into account multi-core or cloud-based processing. But we know less about doing this well than we could hope for.

  4. Be open, or else…

    Almost every single major data-processing platform that has emerged in the last decade has been either open source, sometimes with proprietary layers (e.g., Apache Hive), or built substantially on open source software (e.g., Amazon Redshift). There are exceptions, of course… Google has its own things, Vertica has done well. On the whole, however, the business trade-offs involved when building a data processing platform make open source increasingly important. You are either part of the ecosystem, or your are not.

So how do we build a column store for the future? I don’t yet know, but my current bet is on Apache Arrow being our best foundation.

Advertisement. If you are a recent Ph.D. graduate and you are interested in coming to Montreal to work with me during post-doc to build a next-generation column store, please get in touch. The pay won’t be good, but the work will be fun.

7 thoughts on “How should you build a high-performance column store for the 2020s?”

  1. I agree with Andy that generating code at runtime and hotspot JIT compilation to machine code are making a big difference. We’re hitting the limits of what’s possible with interpretation.

    Another aspect is logical query optimization that is often overlooked. Improvements in the execution layer and storage can give you a 10x improvement, picking the wrong plan due to missing or bad logical optimization can impact your performance by 100x-1000x.

    One final aspect I want to mention is that there is a tendency to over-optimize for read/scan-only use cases. An increasingly important requirement is the ability to create the storage format quickly and to be able to maintain it efficiently (e.g., in-place updates).

Leave a Reply

Your email address will not be published. Required fields are marked *