Bit hacking versus memoization: a Stream VByte example

In compression techniques like Stream VByte or Google’s varint-GB, we use control bytes to indicate how blocks of data are compressed. Without getting into the details (see the paper), it is important to map these control bytes to the corresponding number of compressed bytes very quickly. The control bytes are made of four 2-bit numbers and we must add these four 2-bit numbers as quickly as possible.

There is a related Stack Overflow question from which I am going to steal an example: given the four 2-bit values 11 10 01 00, we want to compute 3 + 2 + 1 + 0 = 6.

  • How do we solve this problem in our implementation? Using table look-ups. Basically, we precompute each of the 256 possible values and just look them up in a table. This is often called memoization. It works fine, and a lot of fast code relies on memoization, but I don’t find it elegant. It makes me sad that so much of the very fastest code ends up relying on memoization. (A minimal sketch of this approach appears in code right after this list.)
  • What is the simplest piece of code that would do it without table lookup? I think it might be
     (x & 0b11) + ((x >> 2) & 0b11) + ((x >> 4) & 0b11) + (x >> 6).
  • Can we get slightly more clever? Yes, aqrit and Kendall Willets came up with a fancier approach involving two multiplications:
    ((0x11011000 * ((x * 0x0401) & 0x00033033)) >> 28).

    The compiler might implement a product like x * 0x0401 as a shift and an addition. Nevertheless, it is not obvious that two multiplications (even with such optimizations) are faster than the naive approach, but it is really a nice piece of programming. I expect that most readers will struggle to work out why this expression works, and that’s not necessarily a good thing. (John Regehr points out that this code has undefined behavior as I have written it. One needs to ensure that all computations are done using unsigned values.) A commented, step-by-step version appears after this list.

  • In Stream VByte, the control bytes are organized sequentially, which means that you can use another fancy approach that processes four control bytes at once:
    v = ((v >> 2) & 0x33333333) + (v & 0x33333333);
    v = ((v >> 4) & 0x0F0F0F0F) + (v & 0x0F0F0F0F);
    

    where the variable v represents a 32-bit integer. You could generalize to 64-bit integers for even better speed. It might be slightly puzzling at first, but it is not very difficult to work out what the expression is doing.

    It has the benefit of being likely to be faster than memoization, but at the expense of some added code complexity since we need to process control bytes in batches. There is also some concern that it could suffer from uneven latency, with the first length in a batch of four being delayed if we are not careful.

    We could modify this approach slightly to compute the sums of the lengths directly, which could be put to good use in the actual code… but it is fancy enough as it stands. (A self-contained version also follows this list.)
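
To make the comparison concrete, here is a minimal C sketch of the first two approaches: the naive expression and a memoized table built from it. This is only an illustration under my own naming, not the code from the Stream VByte library.

    #include <stdint.h>

    // naive approach: extract and add the four 2-bit fields of a control byte
    static inline uint32_t length_naive(uint8_t x) {
      return (x & 0b11) + ((x >> 2) & 0b11) + ((x >> 4) & 0b11) + (x >> 6);
    }

    // memoization: precompute the answer for all 256 possible control bytes,
    // then answer each query with a single table lookup
    static uint8_t length_table[256];

    static void init_length_table(void) {
      for (int x = 0; x < 256; x++) {
        length_table[x] = (uint8_t)length_naive((uint8_t)x);
      }
    }

    static inline uint32_t length_memoized(uint8_t x) {
      return length_table[x]; // call init_length_table() once beforehand
    }

For the running example, both functions map the control byte 0b11100100 (the four fields 11 10 01 00) to 6.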
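
The aqrit–Willets expression is easier to digest once it is broken into steps. Here is a commented version, again my own writing; everything is kept in unsigned 32-bit arithmetic to sidestep the undefined behavior John Regehr pointed out.

    #include <stdint.h>

    // two-multiplication approach, written with unsigned arithmetic
    static inline uint32_t length_twomul(uint8_t byte) {
      uint32_t x = byte;
      // x * 0x0401 is (x << 10) | x: two copies of the control byte, 10 bits apart;
      // the mask then isolates the four 2-bit fields at bit offsets 0, 4, 12 and 16
      uint32_t spread = (x * 0x0401) & 0x00033033;
      // the multiplier 0x11011000 has bits set at positions 12, 16, 24 and 28, so each
      // of the four fields gets shifted onto bits 28-31, where they are added together;
      // the maximum sum (12) fits in those 4 bits, the partial sums below bit 28 never
      // carry into them, and contributions above bit 31 are simply discarded
      return (0x11011000u * spread) >> 28;
    }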
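
And here is a self-contained version of the batch approach, with the four consecutive control bytes loaded as a single 32-bit word; the memcpy load and the function name are my choices, not the library's.

    #include <stdint.h>
    #include <string.h>

    // batch approach: compute the length for four consecutive control bytes at once
    static inline void lengths_batch(const uint8_t *control, uint8_t out[4]) {
      uint32_t v;
      memcpy(&v, control, sizeof(v));                 // load four control bytes
      v = ((v >> 2) & 0x33333333) + (v & 0x33333333); // add 2-bit fields pairwise
      v = ((v >> 4) & 0x0F0F0F0F) + (v & 0x0F0F0F0F); // add the two pair sums per byte
      memcpy(out, &v, sizeof(v));                     // out[i] is the length for control[i]
    }

Because each per-byte sum is at most 12, none of the additions carries from one byte into the next, which is what makes the trick safe.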

I could imagine quite a few more alternatives, including some that use SIMD instructions, but I have to stop somewhere.

So how fast are these techniques? I threw together a quick benchmark to measure the throughput. I am using a recent (Intel Skylake) processor.

memoization       1.7 cycles/byte
naive             2.6 cycles/byte
aqrit-Willets     3.1 cycles/byte
batch (32-bit)    1.4 cycles/byte

Sadly, the aqrit-Willets approach, despite its elegance, turns out to be slower than the naive approach in this benchmark. The batch approach is fastest.

Because the batch approach could be made even faster by using 64-bit words, it would be my best choice right now to replace memoization if speed were my concern. It illustrates how there are potential benefits in a data layout that allows batch processing.

This microbenchmark reinforces the view that memoization is fast, as it does well despite its simplicity. Unfortunately.

Update: On Twitter, Geoff Langdale described a fast vectorized approach using SIMD instructions. An approach similar to what he advocates is described in the paper Faster Population Counts Using AVX2 Instructions.
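
I have not reproduced Geoff's code, but the flavor of the vectorized technique is easy to sketch. As in the population-count paper, pshufb can serve as a sixteen-entry nibble lookup table; the SSSE3 sketch below (my own names and my own adaptation, not necessarily what Geoff advocates) computes the lengths of sixteen control bytes at once.

    #include <tmmintrin.h> // SSSE3 intrinsics

    // for each of the 16 control bytes in v, return the sum of its four 2-bit fields
    static inline __m128i lengths_pshufb(__m128i v) {
      // table[i] = (i >> 2) + (i & 3): the sum of the two 2-bit fields of nibble i
      const __m128i table = _mm_setr_epi8(0, 1, 2, 3, 1, 2, 3, 4,
                                          2, 3, 4, 5, 3, 4, 5, 6);
      const __m128i low_mask = _mm_set1_epi8(0x0F);
      __m128i lo = _mm_and_si128(v, low_mask);                    // low nibbles
      __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask); // high nibbles
      // per-byte sums of all four 2-bit fields (values 0 to 12)
      return _mm_add_epi8(_mm_shuffle_epi8(table, lo), _mm_shuffle_epi8(table, hi));
    }

The same nibble-lookup idea extends to AVX2, processing 32 control bytes per instruction.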

6 thoughts on “Bit hacking versus memoization: a Stream VByte example”

  1. Mask and add is probably going to win in this situation, but aqrit’s comment (on github) made me realize that the codes (and their sums) are small enough to fit into nybbles so that a single shift, or, and mask (&) would get them into place for a multiply to add them all up. He uses an initial multiply to do that, but (x << 10 | x ) & 0x33033 would do the same.

    Multiplies are good when the number of desired shifts and bitwise or/adds is high, but individual shifts should be considered otherwise.

    The other fun thing about multiply is that it can give prefix sum for regularly-spaced small operands, so eg if we have numbers in byte spacing, like 0x04030201, multiplying by 0x01010101 gives 0x0A060301, which is the (little-endian) prefix sum of the bytes. Again, this all relies on zero-padding to keep one byte from carrying into the next, but for small values it's very handy.

  2. I’m missing the comparison with the various popcnt options, for example the 64-bit variant:

    v = Long.bitCount(v & 0x5555555555555555L) +
    (Long.bitCount(v & 0xaaaaaaaaaaaaaaaaL) << 1);

    or the 8-bit variant with one popcnt:

    v = Integer.bitCount(((v << 8) | v) & 0xaaff);

  3. I’m more positive to the “lookup” approach than you seem to be. You’re right that the “batch” approach is going to scale better in situations where there is a simple algorithmic transformation from key to value, but a table gives you the flexibility of arbitrary (or changing) mappings with surprisingly good performance.

    I think the limitations of the compiled code conceal just how fast the lookups can be. On Skylake, you can sustain 2 lookups per cycle, so if we are loading one byte of input and then doing one table lookup per byte, the “actual” speed for a table lookup should be closer to 1.0 cycles per byte.

    But since we can “batch” the reading of the input by reading more than one byte at a time and then extracting the individual keys, we can actually get well under 1 cycle per byte (2 table lookups per cycle, plus some smaller overhead for reading and extracting input).

    Here’s what I see on Skylake for a 4096 byte input:

    bitsum_c_loop(input, count): 1.342 (best) 1.346 (avg)
    bitsum_asm_loop(input, count): 1.007 (best) 1.011 (avg)
    bitsum_asm_rotate(input, count): 0.827 (best) 0.839 (avg)
    bitsum_asm_gather(input, count): 0.635 (best) 0.637 (avg)

    To be fair, for this case where easy transformations from bit pattern key to lookup value exist, an “in-register” approach is going to be even faster. If you add “-march=native” to your compilation options, you can see the start of the effect of vectorization. The 64-bit popcnt() variant should be able to get down to .25 cycles per byte, and the AVX2 popcnt() analogue can go even lower (although I haven’t tested these two).

    So yes, for this particular problem (and any problem where the transformation is easily vectorized), lookups are probably a dead-end (unless you can make the key substantially wider than 8-bits?) But I’m generally more excited by the universality of the gather-based lookup. It’s quite fast, easy to understand, and far more flexible.

    Here’s a thought exercise: in practice, is the L1 cache significantly different from an enormous vector register split into cacheline-sized lanes? Some of the slower AVX operations are already slower than the 4-cycle latency to L1, so I don’t think it’s just a matter of speed.

  4. minor improvements:

    // batch
    v = ((v >> 2) & 0x33333333) + (v & 0x33333333);
    v = (v + (v >> 4)) & 0x0F0F0F0F;

    // Thomas Mueller 8-bit variant
    x = Integer.bitCount(((x << 8) | (x & 0xAA)));
