Daniel Lemire's blog

Bit hacking versus memoization: a Stream VByte example

In compression techniques like Stream VByte or Google’s varint-GB, we use control bytes to indicate how blocks of data are compressed. Without getting into the details (see the paper), it is important to map these control bytes to the corresponding number of compressed bytes very quickly. Each control byte is made of four 2-bit numbers, and we must add these four numbers as quickly as possible.

There is a related Stack Overflow question from which I am going to steal an example: given the four 2-bit numbers 11 10 01 00, we want to compute 3 + 2 + 1 + 0 = 6.

I could imagine quite a few more alternatives, including some that use SIMD instructions, but I have to stop somewhere.

So how fast are these techniques? I threw together a quick benchmark to measure the throughput. I am using a recent (Intel Skylake) processor.

memoization     1.7 cycles/byte
naive           2.6 cycles/byte
aqrit-Willets   3.1 cycles/byte
batch (32-bit)  1.4 cycles/byte

Sadly, the aqrit-Willets approach, despite its elegance, turns out to be slower than the naive approach here. The batch approach is fastest.

Because the batch approach could be made even faster by using 64-bit words, it would be my best choice right now to replace memoization if speed were my concern. It illustrates how there are potential benefits in a data layout that allows batch processing.

This microbenchmark reinforces the view that memoization is fast: despite its simplicity, it outruns the clever bit hacks that process one byte at a time. Unfortunately.

Update: On Twitter, Geoff Langdale described a fast vectorized approach using SIMD instructions. An approach similar to what he advocates is described in the paper Faster Population Counts Using AVX2 Instructions.