(This is a follow-up to my previous blog post, there is also a follow-up: part 3.)
Any run-length encoding requires you to store the number of repetitions. In my example, AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K, we must store 5 counters (3,5,1,2,1) and 5 characters.
Storing counters using a fixed number of bits
Most programming languages represent integers using a fixed number of bits in binary format. For example, Java represents integers (int) using 32 bits. However, in my example (3A-5B-1Z-2W-1K) storing the counters using 32 bits and the characters using 8 bits means that I use 25 bytes which is more than twice the length of the original string (AAABBBBBZWWK).
Thus, we have a simple optimization problem: determine the best number of bits. Â In practice, it might be better to store the data in a byte-aligned way. That is, you should be using 8, 16, 32 or 64 bits. Indeed, reading numbers represented using an arbitrary number of bits may involve a CPU processing overhead.
If you use too few bits, some long runs will have to count as several small runs. If you use too many bits, you are wasting storage. Unfortunately, determining on a case-by-case basis the best number of bits requires multiple scans of the data. It also implies added software complexity.
But you don’t have to use the binary format!
You can still use Â a fixed number of bits for your counters, but with quantized codes instead of the binary format. For example, using 3 bits, you could only allow the counter values 1,2,16, 24, 32,128,256,1024. In this example, the sequence of bits 000 is interpreted as the value 1, the sequence of bits 001 as the value 2, the sequence 010 as 16, and so on. Determining the best codes implies that you must scan the data, compute the histogram of the counters, and then apply some optimization algorithm (such as dynamic programming). The decoding speed might be slight slower as you need to look-up the codes from a table.
Using variable-length counters for optimal compression
If you are willing to sacrifice coding and decoding speed, then you can represent the counters using universal codes. Thus, instead of using a fixed number of bits and optimizing the representation (as in the quantized coding idea), you seek an optimal variable-length representation of the counters. With this added freedom, you can be much more efficient.
The illustrate the idea behind variable-length codes, we consider Gamma codes: the code 1 is 1, the code 01 is 2, the code 001 is 3, the code 0001 is 4, and so on. Thus, we use x bits to represent the number x.
Fortunately, we can do much better than Gamma codes and represent the number x using roughly 2 log x bits with delta codes. Firstly, write x as x=2N +(x modulo 2N) where N is the logarithm. Then we can represent the number N-1 as a Gamma code using N-1 bits, and then store (x modulo 2N) in binary format (using N-1 bits). Thus, we can represent all numbers up to 2N-1 using 2N-2 bits.
Unfortunately, the current breed of microprocessors are not kind to variable-length representations so the added compression is at the expense decoding speed.
Continue with part 3.
References and further reading: Holloway et al., How to Barter Bits for Chronons, 2007. See also the slides of my recent talk Compressing column-oriented indexes.