(This is part 1, there is also a part 2 and a part 3.)
Run-length encoding (RLE) is probably the most important and fundamental string compression technique. Countless multimedia formats and protocols use one form or RLE compression or another.
RLE is also deceptively simple. It represents repeated values as a counter and a character. Thus, the string AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K.
If that is all there was to RLE, then the wikipedia page on run-length encoding would be just fine. Yet, I think it needs help.
Why do we use RLE?
- You can read and write RLE data in one pass, using almost no memory.
- Given a vector V compressed with RLE, you can apply any scalar function f to its component in time O(|V |) where |V | is the compressed size of the vector.
- Given two vectors V and V‘ compressed with RLE, you can do arithmetic (e.g. V+V‘) in time O(|V |+|V’|).
(Some RLE formats have worse complexity bounds.)
Any downsides to RLE?
- Random access is slower. Sometimes, only sequential read (from the beginning) is possible. Updating an RLE-compressed array can be difficult.
- You need long runs of identical values.
- Some RLE formats negatively affect CPU vectorization. Thus, if the compression rates are modest, it could actually take longer to process an RLE-compressed array.
What is the RLE format?
There is no unique RLE format. How you use the RLE idea depends on your goals such as (1) maximize the compression rate (2) maximize the processing speed.
Here are some common variations:
- Instead of using a counter for each run of characters, you only add a counter after a value has been repeated twice. For example, the string AAABBBBBZWWK becomes AA1-BB3-Z-WW-K. Thus, if many characters are not repeated, you will rarely use an unnecessary counter.
- You can use a single bit to decide whether a counter is used. For example, the string AAABBBBZWWK becomes A-True-3, B-True-5, Z-False, W-True-2, K-False. Again, this may avoid many unnecessary counters if values are often not repeated.
- Instead of a counter, you may store the location of the run in the array. For example, the string AAABBBBBZWWK becomes 1A-4B-9Z-10W-11K. The benefit of this approach is to allow random access in logarithmic time using binary search. However, it is also incompatible with some techniques to avoid unnecessary counters. So, it is a compression-speed trade-off. For even more speed, you can store both the location of the run and its length (thus avoiding a subtraction).
- To help vectorization, you can group the characters into blocks of k characters. For example, using blocks of two characters, the string AAABBBBBZWWK becomes 1AA-1AB-2BB-1ZW-1WK. Again, this is a compression-speed trade-off because there will be fewer runs to compress after grouping the characters into long blocks.
Continue reading to part 2.
Some References (to my own work):
- Daniel Lemire, Compressing column-oriented indexes (slides)
- Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.
- Daniel Lemire, Owen Kaser, Reordering Columns for Smaller Indexes, working paper.
With this “obsessive” interest in RLE may be you should think again about the compression scheme we talked about in emails on 19 Sep 2009 and which you didn’t seem to grok:
It’s actually iterated hierarchical RLE with a fancy encoding of (what serves as) “counters”.
Thanks Daniel for the beautifully concise explanation of RLE. Makes me wonder how it can seem complex in other explanations.
@Kevembuangga I’m interested in RLE because it is a fundamental idea. I’ll gladly study any form of RLE if I have proper documentation.
I could try to guess what iterated hierarchical RLE with a fancy encoding means, but that is not a very interesting game.
I’ll gladly study any form of RLE if I have proper documentation.
I am afraid there isn’t any documentation.
As I told you this is something I stumbled upon when peeking at reverse engineering of a compression utility.
This wasn’t academic, only engineering, sorry 🙂
P.S. To explain my (lack of) motivation, since I do not personally enjoy the hairsplitting about nitty gritty details and umphteenth decimals which show up in “academic” research I don’t feel it’s worth the effort to elaborate on this.
I am not even sure it is worth your efforts, it may lack “nice properties” to write about, yet it was effective in a commercial product.
@Kevembuangga Let me put it differently. Could you implement the a scheme like the one you described?