Run-length encoding (part 3)

In Run-length encoding (part 1), I presented the various run-length encoding formats. In part 2, I discussed the coding of the counters. In this third part, I want to discuss the ordering of the elements.

Indeed, the compression efficiency of run-length encoding depends on the ordering. For example, the sequence aaabbb is far more compressible than the sequence ababab.
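To make this concrete, here is a minimal Python sketch of a run-length encoder (a generic value-and-count encoding, not any specific format from part 1):

    # Minimal run-length encoder: collapse each run into a (value, count) pair.
    def rle(s):
        runs = []
        for c in s:
            if runs and runs[-1][0] == c:
                runs[-1][1] += 1          # extend the current run
            else:
                runs.append([c, 1])       # start a new run
        return [(c, n) for c, n in runs]

    print(rle("aaabbb"))   # [('a', 3), ('b', 3)] -- two runs, compresses well
    print(rle("ababab"))   # six runs of length one -- no compression at all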

Reordering sequences

Text, images, and sound are stored on disk as strings. Intuitively, we may think that strings cannot be reordered without losing information. But this intuition is wrong!

You can reorder strings without losing any information, as long as the reordering is invertible. The Burrows–Wheeler transform—invented in 1994—is an invertible transform which tends to generate streams of repeated characters. It is used by the open source bzip2 software. This invertible reordering trick works so well that bzip2 is “within 10% to 15% of the best available techniques, whilst being around twice as fast at compression and six times faster at decompression.”
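For intuition, here is a naive, quadratic Python sketch of the transform and its inverse (real implementations, such as the one in bzip2, are far more efficient, but the idea is the same):

    # Naive Burrows-Wheeler transform: append a sentinel '$' (assumed absent
    # from the text), sort all rotations, and keep the last column. Characters
    # that share a context end up next to each other, which favors run-length
    # encoding.
    def bwt(s):
        s = s + "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    # Invert by repeatedly prepending the transformed string and re-sorting.
    def inverse_bwt(last):
        table = [""] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        return next(row for row in table if row.endswith("$"))[:-1]

    text = "banana banana banana"
    print(bwt(text))                        # repeated characters cluster together
    print(inverse_bwt(bwt(text)) == text)   # True: the transform is invertible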

Ordering sets

Sometimes we want to compress sets of elements. For example, the ordering of the rows in a relational database is semantically irrelevant. Similarly, in a phone directory, we are not concerned about the order of the entries. Of course, we are trained to expect entries to be sorted lexicographically starting from the last names:

Last name First name
Abeck John
Becket Patricia
Smith John
Tucker Patricia

Such sequences of tuples can be compressed column-wise: compress each column with run-length encoding. Reordering the rows so as to maximize compression is NP-hard by reduction from the Hamiltonian path problem. Moreover, it can be reduced to an instance of the traveling salesman problem.

Fortunately, a simple and efficient heuristic is available. Reorder the columns in increasing order of cardinality: put the columns with the fewest distinct values first. Next, sort the table lexicographically.

Applying this heuristic, our previous table becomes:

First name Last name
John Abeck
John Smith
Patricia Becket
Patricia Tucker
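As a rough illustration, here is a small Python sketch of the heuristic applied to the toy table above; counting the runs in each column serves as a crude proxy for the run-length-encoded size:

    # Heuristic: put low-cardinality columns first, then sort rows lexicographically.
    def reorder_and_sort(rows):
        ncols = len(rows[0])
        order = sorted(range(ncols), key=lambda j: len({r[j] for r in rows}))
        return sorted(tuple(r[j] for j in order) for r in rows)

    # Count the runs in every column (fewer runs = better RLE compression).
    def total_runs(rows):
        runs = 0
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            runs += 1 + sum(col[i] != col[i - 1] for i in range(1, len(col)))
        return runs

    table = [("Abeck", "John"), ("Becket", "Patricia"),
             ("Smith", "John"), ("Tucker", "Patricia")]
    print(reorder_and_sort(table))
    # [('John', 'Abeck'), ('John', 'Smith'), ('Patricia', 'Becket'), ('Patricia', 'Tucker')]
    print(total_runs(table), total_runs(reorder_and_sort(table)))   # 8 versus 6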

Further reading:

Daniel Lemire and Owen Kaser, Reordering Columns for Smaller Indexes, Information Sciences 181 (12), 2011.

Daniel Lemire, Owen Kaser, Eduardo Gutarra, Reordering Rows for Better Compression: Beyond the Lexicographic Order, ACM Transactions on Database Systems 37 (3), 2012.

13 thoughts on “Run-length encoding (part 3)”

  1. makes reconstructing single rows a messy adventure

    Besides having to “guess” the value of the given row items from BW-reordered columns, how is that any more messy than having to unfold the RLE-compressed columns up to the row rank you are interested in?

  2. @Kevembuangga Suppose you want to get all row IDs where the first column has value X and the second column has value Y. How do you extract these row IDs quickly, if the columns were reordered independently?

  3. How do you extract these row IDs quickly

    You would have to run the inverse BWT on each whole column, this is just another stage of decompression, unless there is a very clever hack (on a level of this kind!) to peek the ID values from the encoded BWT.
    Do you mean that with simple RLE you can pick the row IDs corresponding to the Xs and Ys somehow by “direct access” from the compressed columns without actually running through the whole columns?
    And don’t you use boolean operations on bit maps to compute such sets of IDs?

    Anyway, maybe you should forget about Burrows-Wheeler, which is overrated: it performs well on text but poorly on Markov sources, which column data probably are most often.
    See “Most Burrows-Wheeler based compressors are not optimal,” H. Kaplan and E. Verbin, CPM’07.

  4. @Kevembuangga

    Thank you for the pointers.

    You would have to run the inverse BWT on each whole column, this is just another stage of decompression (…)

    Who said anything about decompressing the data?

    I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them. So, the running time is a linear function of the compressed size of my input. (Times some factor which depends on the number of columns.) (A small sketch of such a scan follows this comment thread.)

    If I first decompress the data, then my processing time will be a function of the real (uncompressed) data size. I save on storage but not on processing time. In fact, the processing time will be longer than if I had worked with uncompressed data (not accounting for possible IO savings).

    And I am not even discussing the processing overhead of computing several BWTs.

  5. I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them. So, the running time is a linear function of the compressed size of my input.

    Mmmmmm… I see, but you still run through the whole column. If you were really clever, you could make the running time a linear function of ONLY the size of the compressed output (the number of rows matching the sought-after value(s)):
    Instead of storing some kind of map of the row values, store for each possible row value (cardinality) a compressed bitmap of the corresponding row IDs.
    So whenever you know which values you are looking for, you only pick and process the corresponding bitmaps.
    Compressed sparse bitmaps can be cheap…

  6. @Kevembuangga

    If you were really clever, you could make the running time a linear function of ONLY the size of the compressed output

    The curse of dimensionality makes this cleverness extremely difficult unless you expect specific queries.

    Consider these queries for example:


    select X*Y*Z where (X*Y-Z<W);
    select * where (X*X+Y*Y<Z*Z+W*W+V*V);

    Certainly, you can find ways to index them. But if I am free to come up with others, you will eventually get in serious trouble.

    The lesson here is that there is not one indexing strategy that will always work.

    You may enjoy this earlier blog post:

    Understanding what makes database indexes work
    http://www.daniel-lemire.com/blog/archives/2008/11/07/understanding-what-makes-database-indexes-really-work/

  7. @Kevembuangga Yes, I think we are talking about the same thing. If you only focus on one column at a time, you can indeed index things up very nicely using a (compressed) B-tree or hash table. But we are dealing with a multidimensional list… a table with several columns… that’s much harder to index.

  8. I think we are talking about the same thing.

    Then you missed my point!
    I am not proposing a scheme that will resolve the multicolumn sorting conundrum any better than your (or anyone else’s) RLE compression.
    What I am saying is that, when working within any one column during a complex search, it is possible to reduce the computation load so that it is proportional only to the compressed size of the intermediate result (the set of row IDs) for this column, NOT to the compressed size of the whole column.
    Though that is of practical import only if there isn’t a constant multiplicative factor hidden somewhere in the algorithm I envision which would spoil the better asymptotic performance; many a “theoretically best” algorithm is crippled by this.
    All this, of course, assumes that we are still speaking about columns for which a bitmap is to be preferred.
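
For readers following the exchange above, here is a small Python sketch (my own illustration, not the implementation from the papers cited earlier) of the kind of scan described in comment 4: the row IDs matching a value in each of two run-length-encoded columns are recovered directly from the runs, so the work is proportional to the number of runs rather than the number of rows.

    # Each column is stored as a list of (value, run_length) pairs.
    def matching_rows(col_a_runs, col_b_runs, x, y):
        def intervals(runs, target):
            # Yield [start, end) ranges of row IDs holding `target`.
            pos = 0
            for value, length in runs:
                if value == target:
                    yield (pos, pos + length)
                pos += length
        a = list(intervals(col_a_runs, x))
        b = list(intervals(col_b_runs, y))
        # Intersect the two sorted interval lists.
        result, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
            if lo < hi:
                result.extend(range(lo, hi))
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
        return result

    first = [("John", 2), ("Patricia", 2)]                            # RLE column
    last = [("Abeck", 1), ("Smith", 1), ("Becket", 1), ("Tucker", 1)]  # RLE column
    print(matching_rows(first, last, "Patricia", "Tucker"))            # [3]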
