Given these CSS instructions,


z[x] > a[i] {color: blue;}
z z[x] a {text-decoration: underline;}
z > z a , z z z + a { color: red ;}

what will be the color of the text in the following XML file?

<?xml version="1.0" encoding="ISO-8859-1" ?>
<?xml-stylesheet type="text/css"
href="test.css"?>
<z><z x="x">
<z />
<a i="x">my text</a>
</z>
</z>

It would take too long to expose all of the flaws of peer review, here are some:

  • some work is just flat wrong because the reviewers cannot analyze all of the mathematical results, and because they cannot redo the experiments;
  • numerous researchers cheat, sometimes in small ways (“2 out of 3 experiments agree with by theory, let us drop the third one”), sometimes in big ways (“I don’t have time to run these experiments, so let me make up some data”);
  • peer review may perpetuate some biases and prevent researchers from putting into question some fundamental questions (“we decided that this is the right way, if you question it, you are a loony”).

However, for all its faults, peer review remains essential in science. I want other researchers to read and criticize my work. I enjoy it very much when people try to find flaws in my work. I think that my work is serious enough that when people point out flaws, I am usually aware of them at some level and I can respond easily (and enjoy the process).

The type of peer review I do not enjoy is the country-club approach: 1) does the paper agrees with the goals and views of the reviewers 2) is the paper written by someone we can respect? Fortunately, you can navigate the system and stay away (mostly) from country-club peer review.

But why do I still like peer review despite its obvious flaws? Because I see it as an honor-based system. In such a system, you have to accept that there will be cheaters. A lot of them. And there will lots of mistakes. All we have to do is be open about it. That is, you cannot say “but my work was peer reviewed so you cannot question it!” or “I am very good, look at these prestigious publications!”. The peer review is there to help the authors. It is not, however, an insurance against fraud or mistakes. I like peer review because it helps me become better, but I do not use the system to determine how good someone else is.

So, what do we do if we want to know how good someone is? You read his work. You reproduce his experiments. You redo his math. Of course, this scales poorly. If you have to hire someone, you cannot read the work of 50 or 500 candidates. So? I think we have to be realistic. It is hard to know how good someone is. You can get to know 10 or 20 researchers in your life. That is about all.

Hiring processes are flawed. You will hire cheaters. Get over it.

A bitmap index is a popular data structure to speed up the retrieval of matching rows in a table. (It is also used in Information Retrieval, to retrieve matching words.)

Building a bitmap index is not hard. You match each possible value with a vector of bits. When the value occur, you insert the value “true” in the vector, otherwise you insert the value “false”. Hence, you get a set of vectors containing binary values. To find out when a particular value occur, you just load up the corresponding vector of bits. (You still need a B-tree or a hash table to map values of the corresponding vector.) In practice, programming with (compressed) vector of bits leads to much faster software than working of sets of row IDs.

The size of such a matrix is the number of distinct values times the number of rows. Hence, some people conclude that bitmap indexes are mostly good to index data were we have few distinct values. In fact, the first sentence of the bitmap index article on wikipedia tells us that bitmap indexes are meant to index data such as “gender” were only two values are possible.

Of course, this reasoning is wrong. Why?

Because bitmap indexes (the big matrix of zeroes and ones) are never stored uncompressed. You always manipulate them in compressed form.

How big is the compressed matrix? Let us see. By run-length encoding (RLE), the simplest compression algorithm I know, each vector of bits has a compressed size at most proportional to the number of occurrences of the corresponding value. If you sum it up, you have that the compressed size of a bitmap index is at most proportional to the size of your table! Irrespective of the number of distinct values!

For a wide range of decision-support applications, bitmap indexes can match or surpass the performance of most indexes, irrespective of the number of distinct values.

You don’t believe me? Read these benchmarks using an Oracle database: “In summary, bitmap indexes are best suited for DSS regardless of cardinality (…)”.

With my very own bitmap index library, I have scaled up to hundreds of millions of attribute values without a problem and tables having several gigabytes of data. And my budget is not comparable to Oracle (I have one developer, me).

I am not saying that bitmap indexes are always the best solution. Of course not! But there is no published evidence that the number of distinct values is a practical criterion.

Then why does the bitmap index article on wikipedia suggests otherwise? Because it is all over the blogosphere and posting boards… because when I tried to fix the wikipedia article, my changes got reverted. So, I post my rebuttal here. If you have practical evidence that bitmap indexes mostly work well when you have 2-3 attribute values, let us see it. Otherwise, help me kill this myth!

Further reading: Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.

I have written a lot about productivity in research and academia on this blog. As recent examples, see my posts Scientific productivity tips from Hartley and Branthwaite Rigor or relevance: choose one, Improving your intellectual productivity by accepting chaos, and Research productivity: what matters?.

However, there is a simple secret that anyone can apply right now. It can help you get better grades if you are students, it can help you finish that book or report you must write if you are a professional.

I should really charge you for this secret, but I am not good at capitalism. The secret is this: break down your task into small and easy chunks of work. And just do them!

Of course, there is more to it:

  • Sometimes you do not know how to break down the task. Do not worry so much about it. Break it down in some way and get working. Let me take an example… suppose that you want to prove some very difficult mathematical conjecture. Suppose that thousands of scientists have failed in making any progress at proving this conjecture. If you must work on it, then stop worrying and just pick a strategy, any strategy, and work.
  • Tasks you assign yourself can range from drawing pictures to going to sending an email to someone who has worked on the problem. As long as you work seriously and systematically, you cannot go wrong.
  • Periodically reevaluate your strategy. Always assume that your plans are quite possibly wrong. Just constantly refresh them.

Why does it work?

  • Feeling overwhelmed makes you stupid. You are better off working using a silly approach than to stand there and wait for divine inspiration. The gods help those who help themselves! The gods even help the idiots.
  • Enumerating all possible solutions to a problem, might actually be the best possible strategy! Sometimes, the best approach sounds silly a priori. All intellectuals work by trials and errors.
  • Getting your hands dirty on a problem trains your brain. After a week working hard (but stupidly) on a problem, you will have a different view of the problem because your brain will be different.
  • If you picked a very difficult problem, you may never succeed in solving it, but by working hard and regularly, you are very likely to find some results you can be proud of.

I took some time off this year. No, we did not go anywhere specific. I just took two weeks off with my two sons. We had fun.

It has been years since I took time totally off. For irrational reasons, I would always keep my research going at a reduced rate during my time off. Not this year! I was 100% with my family… most of the time.

There still were a few exceptions. For example, DOLAP 2008 accepted our bitmap-index paper and we had to revise the paper within a few days. I really like this paper.

The problem we work on is simple: improve the compression and speed of bitmap indexes by row reordering. A bitmap index is a like a binary matrix with millions or billions of rows. Each column is compressed using a variant of run-length encoding. At query-time, the you load a few columns and apply logical operations (AND/OR) between them. Better compression translates into a faster index because the time required to perform logical operations over bitmaps is proportional to their compressed size. Finding the optimal row order is NP-hard.

A simple heuristic is to lexicographically sort the rows. It works surprisingly well. Fancier heuristics work slightly better. There are limits to how sophisticated your heuristic can get, however, since you need to scale to billions of rows. Fancy graph algorithms are out (as far as I can tell). (Processing the data in small blocks without merging does not work well.) The gist of our paper is that it is very difficult to predict which heuristic will work best, even with simple synthetic data. It was a difficult paper to write, and I am happy that the reviewers did not dismiss it since it breaks the traditional “A is better than B” research model. What our paper establishes is that it is a deeper problem than you might think at first. Fortunately, there are still a few elegant conclusions we came to.

Ok. Enough about research. Here is my youngest son on a poney:

« Previous PageNext Page »

Powered by WordPress