Run-length encoding (part 2)

(This is a follow-up to my previous blog post; there is also a follow-up: part 3.)

Any run-length encoding requires you to store the number of repetitions. In my example, AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K: we must store 5 counters (3, 5, 1, 2, 1) and 5 characters.
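
For concreteness, here is a minimal run-length encoder sketched in Java (the class and record names are my own, not part of any particular library): it turns AAABBBBBZWWK into the five (counter, character) pairs above.

```java
import java.util.ArrayList;
import java.util.List;

public class RunLengthEncoder {
    // Each run is a (counter, character) pair.
    record Run(int count, char value) {}

    static List<Run> encode(String s) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int count = 0;
            while (i < s.length() && s.charAt(i) == c) { // count the repetitions
                count++;
                i++;
            }
            runs.add(new Run(count, c));
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(encode("AAABBBBBZWWK"));
        // prints [Run[count=3, value=A], Run[count=5, value=B], Run[count=1, value=Z],
        //         Run[count=2, value=W], Run[count=1, value=K]]
    }
}
```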

Storing counters using a fixed number of bits

Most programming languages represent integers using a fixed number of bits in binary format. For example, Java represents integers (int) using 32 bits. However, in my example (3A-5B-1Z-2W-1K), storing the counters using 32 bits and the characters using 8 bits means that I use 25 bytes, which is more than twice the length of the original string (AAABBBBBZWWK).
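
As a sanity check, here is a short Java sketch (my own illustration) that serializes the five runs with 32-bit counters and 8-bit characters and confirms the 25-byte figure.

```java
import java.nio.ByteBuffer;

public class FixedWidthExample {
    public static void main(String[] args) {
        int[] counters = {3, 5, 1, 2, 1};
        char[] chars   = {'A', 'B', 'Z', 'W', 'K'};
        // one 32-bit counter plus one 8-bit character per run
        ByteBuffer buf = ByteBuffer.allocate(counters.length * (Integer.BYTES + Byte.BYTES));
        for (int i = 0; i < counters.length; i++) {
            buf.putInt(counters[i]);   // 4 bytes
            buf.put((byte) chars[i]);  // 1 byte
        }
        System.out.println(buf.capacity() + " bytes encoded vs. "
                + "AAABBBBBZWWK".length() + " bytes originally");
        // prints "25 bytes encoded vs. 12 bytes originally"
    }
}
```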

Thus, we have a simple optimization problem: determine the best number of bits. In practice, it might be better to store the data in a byte-aligned way; that is, you should use 8, 16, 32 or 64 bits. Indeed, reading numbers represented using an arbitrary number of bits may involve CPU processing overhead.

If you use too few bits, some long runs will have to count as several small runs. If you use too many bits, you are wasting storage. Unfortunately, determining on a case-by-case basis the best number of bits requires multiple scans of the data. It also implies added software complexity.
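
For example, with 4-bit counters the largest storable count is 15, so a run of 40 identical characters must be broken into three runs (15, 15, 10). A small sketch of this splitting (my own illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class RunSplitter {
    // With b-bit counters, the largest storable count is 2^b - 1,
    // so a longer run must be split into several smaller runs.
    static List<Integer> split(int runLength, int counterBits) {
        int max = (1 << counterBits) - 1;
        List<Integer> pieces = new ArrayList<>();
        while (runLength > max) {
            pieces.add(max);
            runLength -= max;
        }
        pieces.add(runLength);
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split(40, 4)); // [15, 15, 10] -- three runs instead of one
    }
}
```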

But you don’t have to use the binary format!

You can still use a fixed number of bits for your counters, but with quantized codes instead of the binary format. For example, using 3 bits, you could allow only the counter values 1, 2, 16, 24, 32, 128, 256, 1024. In this example, the sequence of bits 000 is interpreted as the value 1, the sequence 001 as the value 2, the sequence 010 as 16, and so on. Determining the best codes implies that you must scan the data, compute the histogram of the counters, and then apply some optimization algorithm (such as dynamic programming). The decoding speed might be slightly slower since you need to look up the codes from a table.
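
Here is a sketch of the decoding table and of one possible encoder for this 3-bit example. The greedy decomposition of a run length into allowed values is my own simplification, not necessarily the optimal choice.

```java
import java.util.ArrayList;
import java.util.List;

public class QuantizedCounters {
    // The 3-bit code i stands for VALUES[i]; the table is the one from the post.
    static final int[] VALUES = {1, 2, 16, 24, 32, 128, 256, 1024};

    static int decode(int threeBitCode) {
        return VALUES[threeBitCode]; // decoding is a single table lookup
    }

    // Greedily decompose an arbitrary run length into allowed values
    // (always terminates because the value 1 is in the table).
    static List<Integer> encode(int runLength) {
        List<Integer> codes = new ArrayList<>();
        while (runLength > 0) {
            int best = 0;
            for (int i = VALUES.length - 1; i >= 0; i--) {
                if (VALUES[i] <= runLength) { best = i; break; }
            }
            codes.add(best);
            runLength -= VALUES[best];
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(encode(50));    // [4, 2, 1] -> 32 + 16 + 2 = 50
        System.out.println(decode(0b010)); // 16
    }
}
```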

Using variable-length counters for optimal compression

If you are willing to sacrifice coding and decoding speed, then you can represent the counters using universal codes. Thus, instead of using a fixed number of bits and optimizing the representation (as in the quantized coding idea), you seek an optimal variable-length representation of the counters. With this added freedom, you can be much more efficient.

To illustrate the idea behind variable-length codes, we consider Gamma codes: the code 1 means 1, the code 01 means 2, the code 001 means 3, the code 0001 means 4, and so on. Thus, we use x bits to represent the number x.
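
A sketch of this code in Java (my own illustration, with bits written out as strings for readability):

```java
public class SimpleCode {
    // The value x is written as x-1 zero bits followed by a single one bit,
    // so it occupies exactly x bits.
    static String encode(int x) {
        return "0".repeat(x - 1) + "1";   // 1 -> "1", 2 -> "01", 3 -> "001", ...
    }

    static int decode(String bits) {
        return bits.indexOf('1') + 1;     // count the leading zeros, add one
    }

    public static void main(String[] args) {
        for (int x = 1; x <= 4; x++) {
            System.out.println(x + " -> " + encode(x));
        }
        System.out.println(decode("0001")); // 4
    }
}
```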

Fortunately, we can do much better than Gamma codes and represent the number x using roughly 2 log x bits with delta codes. Let N be the number of bits needed to write x in binary, and write x as x = 2^(N-1) + (x modulo 2^(N-1)). We can then represent the number N-1 as a Gamma code using N-1 bits, followed by (x modulo 2^(N-1)) in binary format (using N-1 bits). Thus, we can represent all numbers up to 2^N - 1 using 2N - 2 bits.
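
Here is a sketch of this two-part code in Java (my own illustration; it assumes x is at least 2, since x = 1 would need a special case):

```java
public class TwoPartCode {
    // For x >= 2 with N bits: write N-1 in the unary-style code from the
    // previous example (N-1 bits), then the low N-1 bits of x in binary,
    // for a total of 2N-2 bits.
    static String encode(int x) {
        int n = 32 - Integer.numberOfLeadingZeros(x);               // number of bits of x
        String prefix = "0".repeat(n - 2) + "1";                    // code for the value N-1
        String remainder = Integer.toBinaryString(x).substring(1);  // x modulo 2^(N-1), on N-1 bits
        return prefix + remainder;
    }

    public static void main(String[] args) {
        System.out.println(encode(5));    // "01" + "01": prefix encodes N-1 = 2, remainder is 5 mod 4
        System.out.println(encode(1000)); // 18 bits instead of 32
    }
}
```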

Unfortunately, the current breed of microprocessors is not kind to variable-length representations, so the added compression comes at the expense of decoding speed.

Continue with part 3.

References and further reading: Holloway et al., How to Barter Bits for Chronons, 2007. See also the slides of my recent talk Compressing column-oriented indexes.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Run-length encoding (part 2)”

  1. Thank you for the interesting post.
    I would note, however, that not all universal codes are born equal. Specifically, byte-aligned codes and word-aligned codes are at least twice as fast as bit-oriented codes such as Elias delta/gamma or Golomb codes.
    They are used in search engines and allow very fast decompression. In many cases decompression takes the same time as reading the data from disk sequentially.
    See, e.g., Vo Ngoc Anh and Alistair Moffat, “Index Compression using Fixed Binary Codewords”.
    PS: I am also going to read “How to Barter Bits for Chronons”. It seems very interesting. Thanks!

  2. I’m compelled to take issue with your last sentence…

    “Unfortunately, the current breed of microprocessors is not kind to variable-length representations, so the added compression comes at the expense of decoding speed.”

    My recent experience has taught me that compression and speed are no longer related that way, and largely for that reason. Today’s hierarchies of caches work in concert with out-of-order execution and other features to provide new avenues for the designer to exploit. These modern architectural features can be made to reward, with execution speed, high density in data. It is up to the designer to get the density, as well as the logic, that lets the hardware deliver the speed. To me, that’s the new reality.

    I say that with some conviction having just finished optimizing a fast decompressor for structured data. It uses canonical Huffman codes for the data, and a compressed variation of J. Brian Connell’s classic structures for decoding. During software optimization, time and time again, I was able to get further speed improvements by increasing the compression not only of the data, but also of the decoding data structures and their pointers. It was the variable-length coding, as much as any other design factor, that got me the information density I needed from the data to get the speed I needed from the system. In the end, that happened, I believe, primarily because the use of variable-length codes reduced the demand on a relatively slow path component, the system bus.

    Software optimization is not what it was years ago, and for me at least, neither are the relationships between compression and speed. But that won’t be everyone’s experience, so I would like to hear others’ opinions.

  3. @Daniel
    I have read the reference on Chronons and Bits. It looks like one should use the terms variable-bit and variable-byte very cautiously. It is also interesting that Huffman coding can be sped up considerably by using special lookup tables.
    @Glenn,
    My experience and a variety of experimental evaluations (just check the reference given by the author) say that in many cases more sophisticated compression methods introduce a speed penalty. In particular, variable-bit methods are usually (but not always) slower than variable-byte methods.
    The difference, however, is subtle. In many cases, obviously, better compression rates allow one to avoid expensive cache misses and even more expensive disk reads. In those cases, better compression is obviously a priority.

  4. I agree. It could go either way; so much depends on the specifics. But the strangest thing is that so often I find myself increasing compression in order to increase speed, and winning!

    BTW, there is an old paper (and a good one) by Debra Lelewer and Dan Hirschberg (CACM 4/90) that explores a lot of the Huffman code decoding issues.

  5. @Glenn Davis

    It was the variable-length coding, as much as any other design factor, that got me the information density I needed

    When confronted with a “problem”, sometimes the best approach isn’t to solve it but to avoid it.
    Like, why use counters to spot peculiar points within an address range when you can use flags (bits), interleaved in the data or not (sparse bit maps). 🙂
