As year 2009 comes to an end, I selected a few of my best blog posts.

Database, compression and column stores:

Hashing (contrarian blog posts):

Academia and Research:

Microprocessors and storage devices are subject to the second law of thermodynamics: using them turn usable energy (oil, hydrogen) into unusable energy (heat). Data centers are already limited by their power usage and heat production. Moreover, many new devices need to operate for a long time with little power: (1) smart phones (2) powerful computing devices inserted into our bodies (3) robots shipped in space.

Our approach to entropic efficiency remains crude. We improve power supply. We shut down disks and CPUs when they are idle. Deeper questions arise however:

  • Except maybe for embarrassingly parallel problems (such as serving web pages), parallel computing trades entropic efficiency for short running times. If entropic efficiency is your goal, will you stay away from non-trivial parallelism?
  • We analyze algorithms by considering their running time. For example, shuffling an array takes time O(n)  whereas sorting it takes time O(n log n). Yet, unlike sorting, I expect that shuffling an array should be possible without any entropy cost (heat generation)!

Suppose I give you a problem, and I ask you to solve it using as little entropy as possible. How would you go about it?

Further reading:

In Run-length encoding (part 1), I presented the various run-length encoding formats. In part 2, I discussed the coding of the counters. In this third part, I want to discuss the ordering of the elements.

Indeed, the compression efficiency of run-length encoding depends on the ordering. For example, the sequence aaabbb is far more compressible than the sequence ababab.

Reordering sequences

Text, images, and sound are stored on disk as strings. Intuitively, we may think that strings cannot be reordered without losing information. But this intuition is wrong!

You can reorder strings without losing any information, as long as the reordering is invertible. The Burrows–Wheeler transform—invented in 1994—is an invertible transform which tends to generate streams of repeated characters. It is used by the open source bzip2 software. This invertible reordering trick works so well that bzip2 is “within 10% to 15% of the best available techniques, whilst being around twice as fast at compression and six times faster at decompression.”

Ordering sets

Sometimes we want to compress sets of elements. For example, the ordering of the rows in a relational database is semantically irrelevant. Similarly, in a phone directory, we are not concerned about the order of the entries. Of course, we are trained to expect entries to be sorted lexicographically starting from the last names:

Last name First name
Abeck John
Becket Patricia
Smith John
Tucker Patricia

Such sequences of tuples can be compressed column-wise: compress each column with run-length encoding. Reordering the rows so as to maximize compression is NP-hard by reduction from the Hamiltonian path problem. Moreover, it can be reduced to an instance of the traveling salesman problem.

Fortunately, a simple and efficient heuristic is available. Reorder the columns in increasing cardinality: put the columns having the fewest number of distinct values first. Next, sort the table lexicographically.

Applying this heuristic, our previous table becomes:

First name Last name
John Abeck
John Smith
Patricia Becket
Patricia Tucker

Further reading: Daniel Lemire, Owen Kaser, Reordering Columns for Smaller Indexes (working paper).

The debacle of the leaked emails, data and code from the University of East Anglia showed that reputed global warming scientists were petty and cheaters. As always, the pursuit of excellence is often at the expense of rigor.

To put a stop to growing skepticism, Scientific American published Seven Answers to Climate Contrarian Nonsense. They rehash what most scientists—me included—accept :

  • Climate warming is real;
  • Climate warming is caused by CO2;
  • Human beings are responsible.

Boring!

But what are they hiding? What is the conspiracy?

Most likely, there is no conspiracy. Large communities of scientists do not lie. They may be wrong, but never on purpose. However, they are often misguided.

The telling tale is often in the question that are not being asked. That is the most important lesson any new scientist must learn about research papers and science in general. What points are omitted from the discussion?

Here are some questions I have asked myself about Global Warming. Feel free to help me find answers:

  • Can global warming be stopped or significantly reduced if we change by a small amount our CO2 production? With India and China ramping up their standard of living, surely nobody expects more than a stabilization of our CO2 production on the short term? We won’t reduce our oil production until we run out. That is for certain.
  • Is global warming necessarily harmful for everyone? Intuitively, Montreal could benefit from a little warming. (Update: If farming conditions improve in Canada and Siberia, we may produce more food globally.)
  • If global warming is unavoidable, shouldn’t we try to adapt our farming technology? What about fruits and vegetables able to grow with little water and under intense heat? How do we preserve the productivity of the soil under a difficult climate?
  • What is the best way to protect vulnerable countries, such as most of Africa? Will a small reduction of our production of CO2 save them from chaos?

Disclaimer: I am a Computer Scientist. My research has nothing to do with climate change.

For the record: We own a single car, which we only use during the week-end or to drive the kids to the doctor’s office. I use public transportation whenever I go on campus. We recycle. We make our own compost. We own a small house (on purpose). I avoid air travel (unlike almost all other scientists).

The University of Quebec at Montreal (UQAM) is looking for candidates to fill a level 2 (junior) Canada Research Chair in Software and Knowledge Engineering. You must be fluent in French.

Powered by WordPress