Daniel Lemire's blog

Fast integer compression: decoding billions of integers per second

Databases and search engines often store arrays of integers. In search engines, we have inverted indexes that map a query term to a list of document identifiers. This list of document identifiers can be seen as a sorted array of integers. In databases, indexes often work similarly: they map a column value to row identifiers. You also get arrays of integers in databases through dictionary coding: you map all column values to an integer in a one-to-one manner.

Our modern processors are good at processing integers. However, you also want to keep much of the data close to the CPU for better speed. Hence, computer scientists have worked on fast integer compression techniques for the last 4 decades. One of the earliest clever techniques is Elias coding. Over the years, many new techniques have been developed: Golomb and Rice coding, Frame-of-Reference and PFOR-Delta, the Simple family, and so on.

The general story is that while people initially used bit-level codes (e.g., gamma codes), simpler byte-level codes like Google’s group varint are more practical. Byte-level codes like what Google uses do not compress as well, and there is less opportunity for fancy information theoretical mathematics. However, they can be much faster.

Yet we noticed that there was no trace in the literature of a sensible integer compression scheme running on desktop processor able to decompress data at a rate of billions of integers per second. The best schemes, such as Stepanov et al.’s varint-G8IU report top speeds of 1.5 billion integers per second.

As your may expect, we eventually found out that it was entirely feasible to decoding billions of integers per second. We designed a new scheme that typically compress better than Stepanov et al.’s varint-G8IU or Zukowski et al.’s PFOR-Delta, sometimes quite a bit better, while being twice as fast over real data residing in RAM (we call it SIMD-BP128). That is, we cleanly exceed a speed of 2 billions integers per second on a regular desktop processor.

We posted our paper online together with our software. Note that our scheme is not patented whereas many other schemes are.

So, how did we do it? Some insights:

After all this work, it is my belief that, to be competitive, integer compression schemes need to fully exploit the vectorized instructions available in modern processors. That is, it is not enough to just write clean C or C++, you need to design vectorized algorithms from the ground up. However, it is not a matter of blindly coming up with a vectorized algorithm: getting a good speed and a good compression rate is a challenge.

Credit: This work was done with Leonid Boytsov. We have also benefited from the help of several people. I am grateful to O. Kaser for our early work on this problem.

Software: A fully tested open source implementation is available from github. As caveat, we used C++11 so that a C++11 compiler is required (e.g., GNU GCC 4.7).

Limitations: To be able to compare various alternatives, we have uncoupled some decoding steps so that at least two passes are necessary over the data. In some cases, better speed could be possible if the processing was merged into one pass. We are working on further optimizations.

Further reading: More database compression means more speed? Right?, Effective compression using frame-of-reference and delta coding

Update: The article has been accepted for publication in Software Practice & Experience.

Quick links: The paper, the software.