# Fast integer compression: decoding billions of integers per second

Databases and search engines often store arrays of integers. In search engines, we have inverted indexes that map a query term to a list of document identifiers. This list of document identifiers can be seen as a sorted array of integers. In databases, indexes often work similarly: they map a column value to row identifiers. You also get arrays of integers in databases through dictionary coding: you map all column values to an integer in a one-to-one manner.

Our modern processors are good at processing integers. However, you also want to keep much of the data close to the CPU for better speed. Hence, computer scientists have worked on fast integer compression techniques for the last 4 decades. One of the earliest clever techniques is Elias coding. Over the years, many new techniques have been developed: Golomb and Rice coding, Frame-of-Reference and PFOR-Delta, the Simple family, and so on.

The general story is that while people initially used bit-level codes (e.g., gamma codes), simpler byte-level codes like Google’s group varint are more practical. Byte-level codes like what Google uses do not compress as well, and there is less opportunity for fancy information theoretical mathematics. However, they can be much faster.

Yet we noticed that there was no trace in the literature of a sensible integer compression scheme running on desktop processor able to decompress data at a rate of billions of integers per second. The best schemes, such as Stepanov et al.’s varint-G8IU report top speeds of 1.5 billion integers per second.

As your may expect, we eventually found out that it was entirely feasible to decoding billions of integers per second. We designed a new scheme that typically compress better than Stepanov et al.’s varint-G8IU or Zukowski et al.’s PFOR-Delta, sometimes quite a bit better, while being twice as fast over real data residing in RAM (we call it SIMD-BP128). That is, we cleanly exceed a speed of 2 billions integers per second on a regular desktop processor.

We posted our paper online together with our software. Note that our scheme is **not** patented whereas many other schemes are.

So, how did we do it? Some insights:

- We find that it is generally faster if we compress and decompress the integers in relatively large blocks (more than 4 integers). A common strategy is bit packing. We found that bit packing could be much faster (about twice as fast) if we used vectorization (e.g., SSE2 instructions). Vectorization is the simple idea that your processor can process several values (say 4 integers) in one operation: for example, you can add 4 integers to 4 other integers with one instruction. Because bit unpacking is the key step in several algorithms, we can effectively double the decompression speed if we double the bit unpacking speed.
- Most of the integer compression schemes rely on delta coding. That is, instead of storing the integers themselves, we store the difference between the integers. During decompression, we effectively need to compute the prefix sum. If you have taken calculus, think about it this way: we store the derivative and must compute the integral to recover the original function. We found out that this can use up quite a bit of time. And if you have doubled the speed of the rest of the processing, then it becomes a serious bottleneck. So we found that it is essential to also vectorize the computation of the prefix sum.

After all this work, it is my belief that, to be competitive, integer compression schemes need to fully exploit the vectorized instructions available in modern processors. That is, it is not enough to just write clean C or C++, you need to design vectorized algorithms from the ground up. However, it is not a matter of blindly coming up with a vectorized algorithm: getting a good speed and a good compression rate is a challenge.

**Credit**: This work was done with Leonid Boytsov. We have also benefited from the help of several people. I am grateful to O. Kaser for our early work on this problem.

**Software**: A fully tested open source implementation is available from github. As caveat, we used C++11 so that a C++11 compiler is required (e.g., GNU GCC 4.7).

**Limitations**: To be able to compare various alternatives, we have uncoupled some decoding steps so that at least two passes are necessary over the data. In some cases, better speed could be possible if the processing was merged into one pass. We are working on further optimizations.

**Further reading**: More database compression means more speed? Right?, Effective compression using frame-of-reference and delta coding

**Update**: The article has been accepted for publication in Software Practice & Experience.

**Quick links**: The paper, the software.

Hi,

pretty interesting.. altough I don’t know much about this topic I was thinking if this scheme is compatible with wider vector sets and perform well with new Intel AVX2 (256 bit wide integer instructions) and BMI2 instructions which would bring 2x wider vectorization.. and also what about Intel Xeon Phi with 512 bit wide vector instructions.. would new algorithms be thought?

Only desiring for you to share your thoughts on the effect of upcoming vector instructions for integer compression..

thanks..

Comment by oscarbg — 13/9/2012 @ 9:59

@oscarbg

Thanks for your excellent comment. We do discuss the issue of wider vectors in the paper, see section 7 (page 22).

Reference: http://arxiv.org/abs/1209.2137

Comment by Daniel Lemire — 13/9/2012 @ 10:34

@oscarbg

Unlike varint it uses only simple SSE2 instructions: 4-int additions, ORs, and ANDs. I believe these instructions should be efficiently supported by all CPUs that have SIMD operations.

Comment by Itman — 13/9/2012 @ 21:35

thanks for responses..

Comment by oscarbg — 13/9/2012 @ 23:09

Hi Daniel, very interesting work here from what I can understand. Question for you, does this finding have an effect on how data could be transmitted through cell towers to smart phones? Essentially I want to know if this finding implies it would be twice as easy to transmit let’s say a youtube video to a mobile device? Where do you see the upper bounds of compression going in the near future compared to where it is today? Thanks.

Comment by omar — 17/9/2012 @ 1:00

@omar

1) Our work is mostly applicable to in-memory databases and search engines. If you are sending data over a network, then pure CPU decoding speed is less important. The goal is our work is to decode the data very, very fast even if it means having a lesser compression. That’s because we assume that the data to be decoding is already in memory.

2) As for…

Where do you see the upper bounds of compression going in the near future compared to where it is today?We can still improve matters quite a bit. Despite 40 years of research, there is still plenty of room to compress better and faster.

Comment by Daniel Lemire — 17/9/2012 @ 8:18

Hi Daniel,

I have a question with regard to what you mention in your paper for the binary packing algorithm by Willhalm et al. You say, that they need at least 11 assembly instructions and then you don’t consider this at all anymore. However, my assumption would be that it’s not only the number of instructions but as well their latency etc.

If I look at the results in their paper their decoding speed is at least the same as yours.

For 7 bit/int you get 2800 mis and Willhalm et al get ~300ms for 1000 mis => ~3000 mis.

Is there a specific reason you left it out in your observations? Or did I misunderstand something completely and their approach is simply not viable in your benchmark.

Thanks,

Martin

Comment by Martin — 30/9/2012 @ 23:34

@Martin

If you want to deduce speed from their Figure 11, then you need to look at our Figure 8 for comparison. We reach 6000 mis on bit packing alone.

The numbers we report (outside our Fig. 8) *include* delta decoding which, as we stress in our paper, is a significant cost.

Comment by Daniel Lemire — 1/10/2012 @ 8:21

@Martin

Thank you for pointing it out. Yes, comparing just the number of assembly instructions is a bit naive. We might even need to implement this method in the future. Note, however, that

1) The latencies in Willhalm et al algorithms are same or higher. They are doing very similar things except that they have the shuffle operation and an additional memory access for the shuffle table. Essentially, their method is varint + a few extra masks and shifts. And we know that varint is slower.

2) As Daniel pointed out, the speeds are not directly comparable. I would elaborate on this. First, deltas can entail a significant cost. It is possible to incorporate them better, but this wasn’t done in the current work and is planned for the follow up article. Second, their bit case is not equivalent to ours. They only decode integers that fit into B bits exactly. In our case, it is B bits ON AVERAGE. They did not implement switching about different bitcases in a single integer array.

Comment by Itman — 1/10/2012 @ 8:35

@Martin

For bit widths less than 8, our unpacking speed is always higher than 6000 mis (see Fig. 8b) whereas they have half this speed (3300 mis) when the bit width is 6: indeed, they require 300 ms to unpack 1000 billion integers (see their Fig. 11). For a larger bit width (10), their speed falls to less than 2500 mis whereas our worse overall unpacking speed is 4000 mis.

Comment by Daniel Lemire — 1/10/2012 @ 8:47

@Daniel, thanks for the clarification, this explains a lot since you are focussing on vertical processing instead of horizontal.

Comment by Martin — 1/10/2012 @ 8:52

@Itman, what I was missing from this picture was that I mis-read the part about vertical and horizontal bit packing. I’m intrigued to try using vertical bit packing in our column store prototype and compare it to horizontal bit packing to see what happens.

Since I’m a traditional In-Memory RDBMS guy, code length never change until we recompress the whole data partition.

It looks like there is some interesting meat left that one could look at

Thanks for the great read.

Comment by Martin — 1/10/2012 @ 9:01

@Martin, no problem. Daniel made a good point: I suspect that Willhalm et al does not include storing and/or preprocessing costs into timings. Essentially, they should be decompressing into L1 and discard data at some point. We, in contrast, reported mostly very conservative estimates that included storing uncompressed integers.

Since memory is slow and you have to write 4-byte integers, this limits decoding speed. Data in Fig 8b represents a scenario that should be closer to that of Willhalm et al. We do read compressed integers from memory, but this is a small cost, because integers occupy one byte on average (you can read 20 billions per second on our machine). In this scenario, we are getting 5000-7000 mis.

Again, to remove all doubts we would need to implement this approach some day. Because, we are not doing it right away, we are curious to learn your results (should you decide and try vertical layout).

Comment by Itman — 1/10/2012 @ 11:43

Hi Daniel. Quick question: the README for the github source says the license is “APL 2.0″. Does this mean the “Apache License 2.0″?

Thanks

-Todd

Comment by Todd Lipcon — 4/10/2012 @ 19:31

@Todd

Yes, that’s what it means.

Comment by Daniel Lemire — 4/10/2012 @ 20:06

“Update: D. Lemire believes that this scheme was patented by Rose, Stepanov et al. (patent 20120221539).

We wrote this code before the patent was published (August 2012)”

does it make their patent useless?

Comment by JanPierres — 6/10/2012 @ 3:24

Hi JanPierres,

We found several teams that discovered this or a similar algorithm about the same time. This all happened before the patent obtained. You can check details in the paper.

Unfortunately, it is not clear how a patent would work in this case. Hopefully, it was defensive and they will not sue anybody for using this code. Otherwise, a judge will decide whether the patent is useless or not ))

Comment by Itman — 6/10/2012 @ 22:50

@JanPierres

Varint-G8IU is patented but not (as far as I can tell) the other schemes we use. Their paper appeared before the patent was granted. We implemented it without knowing about the patent for the purpose of comparing performance.

Regarding prior art, what matters is the filling date of the patent application. So anyone using varint-G8IU in a commercial application could be in for a lawsuit. That is why I included this warning.

With patents, you can play the following game: you apply for a patent, then entice people to use your work without telling them about the patent, and then once the patent is granted, you are free to require licensing and sue them.

Of course, I would not want anyone to get sued because I posted software online. But my warning should keep people safe.

I think that patents are evil. See

Do we need patents?

http://lemire.me/blog/archives/2012/01/06/do-we-need-patents/

Comment by Daniel Lemire — 6/10/2012 @ 22:54

How does this PFOR implementation compare with LZ4 in terms of compression ratio and performance?

http://code.google.com/p/lz4/

Comment by dg — 8/10/2012 @ 21:40

@dg

I believe you can’t really compare something generic like LZ4 with the integer compression schemes we review in this paper. However, because we expected this sort of question, we did benchmark Google Snappy and got that it was not usable for these purposes (see the paper for details). I expect the same conclusion to hold with LZ4.

Comment by Daniel Lemire — 8/10/2012 @ 22:45

This is amazing! I wish more papers were like this.

Comment by Alecco — 7/1/2013 @ 11:12