# Fast unary decoding

Computers store numbers in binary form using a fixed number of bits. For example, Java stores an integer using 32 bits (when using the int type). This can be wasteful when you expect most integers to be small: for example, maybe most of your integers are smaller than 8. In such cases, a unary encoding can be preferable. In one form of unary encoding, to store the integer x, we first write x zeroes followed by a 1. The following table illustrates the idea:

| Number | 8-bit binary | unary |
|--------|--------------|-------|
| 0      | 00000000     | 1     |
| 1      | 00000001     | 01    |
| 2      | 00000010     | 001   |
Thus, to code the sequence 0, 0, 1, 2, 0, we might use 1-1-01-001-1, stored as the byte value 11010011. To recover the sequence from a unary-coded stream, it suffices to locate the bits with value 1. A naive decoder simply examines each bit in sequence. (See code sample.) We should not expect this approach to be fast.
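A minimal sketch of such a naive decoder, assuming (as in the comment code further down) that codes are packed into 64-bit words starting at the least significant bit, and with a hypothetical function name:

```c
#include <stdint.h>

/* Naive unary decoder (hypothetical sketch): examine each bit in
 * sequence, emitting the number of zeroes seen since the last 1-bit.
 * Returns the number of decoded integers written to out. */
int bitscan_naive(const uint64_t *bitmap, int bitmapsize, int *out) {
    int pos = 0;    /* integers decoded so far */
    int zeros = 0;  /* zeroes seen since the last 1-bit */
    for (int k = 0; k < bitmapsize; ++k) {
        for (int b = 0; b < 64; ++b) {
            if ((bitmap[k] >> b) & 1) {
                out[pos++] = zeros;  /* a 1 terminates the current code */
                zeros = 0;
            } else {
                ++zeros;
            }
        }
    }
    return pos;
}
```

Note that this executes a branch per bit, which is exactly why we should not expect it to be fast.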

A common optimization is to use a table look-up. Indeed, we can construct a table with 256 entries where, for each byte value, we store the number of 1s and their positions. (See code sample.) This can be several times faster.
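The table-based approach could be sketched as follows (function and table names are hypothetical, and bits are again assumed to be packed least-significant-bit first):

```c
#include <stdint.h>

/* Table-based sketch: for each of the 256 byte values, precompute how
 * many 1s it contains and at which bit positions they occur. */
static uint8_t table_count[256];
static uint8_t table_pos[256][8];
static int table_ready = 0;

static void build_table(void) {
    for (int v = 0; v < 256; ++v) {
        int c = 0;
        for (int b = 0; b < 8; ++b)
            if ((v >> b) & 1) table_pos[v][c++] = (uint8_t)b;
        table_count[v] = (uint8_t)c;
    }
    table_ready = 1;
}

/* Decode one byte at a time: each stored position yields one integer,
 * namely the number of zeroes before that 1-bit. */
int bitscan_table(const uint8_t *bytes, int n, int *out) {
    if (!table_ready) build_table();
    int pos = 0;
    int last = -1;  /* global bit index of the previous 1-bit */
    for (int k = 0; k < n; ++k) {
        uint8_t v = bytes[k];
        for (int i = 0; i < table_count[v]; ++i) {
            int bit = k * 8 + table_pos[v][i];
            out[pos++] = bit - last - 1;
            last = bit;
        }
    }
    return pos;
}
```

The inner loop now runs once per decoded integer rather than once per bit, which is where the speedup comes from.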

A possibly faster alternative is to use the fact that all modern processors have Hamming weight instructions: fast instructions telling you how many bits with value 1 are present in a 64-bit word (popcnt on Intel processors supporting SSE4.2). Similarly, one can use an instruction that counts the number of trailing zeroes (tzcnt or bsf on Intel processors). We can put such instructions to good use for unary decoding. (See the popcnt and tzcnt/bsf code samples.)
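A sketch of the trailing-zero approach, using the GCC/Clang intrinsic __builtin_ctzll (which compiles to tzcnt or bsf on x64); the function name is hypothetical and the same LSB-first bit packing is assumed:

```c
#include <stdint.h>

/* Trailing-zero sketch: repeatedly locate the lowest 1-bit with a
 * count-trailing-zeroes instruction, emit the gap since the previous
 * 1-bit, then clear the bit and continue. */
int bitscan_ctz(const uint64_t *bitmap, int bitmapsize, int *out) {
    int pos = 0;
    int last = -1;  /* global bit index of the previous 1-bit */
    for (int k = 0; k < bitmapsize; ++k) {
        uint64_t bitset = bitmap[k];
        while (bitset != 0) {
            int bit = k * 64 + __builtin_ctzll(bitset);
            out[pos++] = bit - last - 1;  /* zeroes before this 1 */
            last = bit;
            bitset &= bitset - 1;  /* clear the lowest set bit */
        }
    }
    return pos;
}
```

A popcnt instruction (e.g., via __builtin_popcountll) can additionally tell you, in one step, how many integers a whole 64-bit word will decode to.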

According to my tests, on recent Intel processors, the latter approach is much faster, decoding over 800 million integers per second, roughly twice as fast as a table-based approach. See the next figure.

As usual, my source code is freely available.

## 7 thoughts on “Fast unary decoding”

1. JS says:

2. KWillets says:

This looks like some fun bit-twiddling.

For the table-based method, would it work if for each byte you just store the lsb position and the byte without the lsb?

Then you just look up the lsb position, replace the looked-up operand with its lsb-less version, and iterate until 0.
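One possible reading of this suggestion, as a hedged sketch with hypothetical names, using one table for the lowest-set-bit position and another for the byte with that bit cleared:

```c
#include <stdint.h>

/* Sketch of the suggestion above: lsb_pos[v] gives the position of the
 * lowest set bit of byte v, without_lsb[v] gives v with that bit
 * cleared; iterate until the byte reaches zero. */
static uint8_t lsb_pos[256];
static uint8_t without_lsb[256];
static int lsb_ready = 0;

static void build_lsb_tables(void) {
    for (int v = 1; v < 256; ++v) {
        int b = 0;
        while (!((v >> b) & 1)) ++b;
        lsb_pos[v] = (uint8_t)b;
        without_lsb[v] = (uint8_t)(v & (v - 1));
    }
    lsb_ready = 1;
}

int bitscan_lsb_table(const uint8_t *bytes, int n, int *out) {
    if (!lsb_ready) build_lsb_tables();
    int pos = 0;
    int last = -1;  /* global bit index of the previous 1-bit */
    for (int k = 0; k < n; ++k) {
        uint8_t v = bytes[k];
        while (v != 0) {  /* one table look-up per decoded integer */
            int bit = k * 8 + lsb_pos[v];
            out[pos++] = bit - last - 1;
            last = bit;
            v = without_lsb[v];
        }
    }
    return pos;
}
```

This uses two 256-byte tables instead of a 256-entry table of position lists, at the cost of a dependent look-up per decoded integer.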

3. @KWillets

Part of my motivation for posting this here is to encourage others to try their own designs. There are many other strategies.

I think that your approach would use less memory, but I think it would be slower.

4. @JS

I think that the instruction you want is popcnt with the corresponding _mm_popcnt_u64 intrinsic. I believe that other instructions are likely to give you slower code.

Update: This conjecture was false.

5. Anonymous says:

@Daniel,

Actually, you may be wrong. I tried the suggested function and it was always faster. On my laptop the difference is small, but it is more dramatic on FastPFor for some reason.

Try this implementation:

    int bitscanunary_ctzl(long *bitmap, int bitmapsize, int *out) {
        int pos = 0;
        int val = 0, newval = 0;
        for (int k = 0; k < bitmapsize; ++k) {
            unsigned long bitset = bitmap[k];
            while (bitset != 0) {
                long t = bitset & -bitset;
                int r = __builtin_ctzl(bitset);
                newval = k * 64 + r;
                out[pos++] = newval - val;
                val = newval;
                bitset ^= t;
            }
        }
        return pos;
    }

6. @Anonymous

Indeed, the __builtin_ctzl approach can be faster. I was wrong. I have updated my blog post.

7. Bulat Ziganshin says:

I think that any possible implementation (lzcnt/table) would stall a lot due to instruction dependencies. By running multiple decoders in parallel (in the same OS thread) you can make it much faster. The same may apply to your turbopfor library.