What comes first, theory or experiments?

In “my research process“, I explain how I proceed to produce research papers. As a comment to my most recent post, Peter Turney wrote:

I don’t usually start writing until all the research is done. It sounds like you write and research in parallel.

From what I understand, Peter proceeds like this:

  • Find a problem;
  • Find some way to evaluate your solution (data+metric);
  • Test several solutions experimentally;
  • Do some theory if needed;
  • If the results are interesting, write the paper knowing who you write it for.

His approach is efficient if you can implement and test algorithms very fast. (He is probably using high level languages.) However, he can waste time on implementation and testing. He wastes no time writing uninteresting papers.

My own process is slightly different:

  • Find a problem;
  • Consider a priori what we can tell about the problem and its solutions;
  • Work out a theoretical framework, derive results, start crafting a paper;
  • Think of experiments to run to validate/invalidate the theory or implement some solutions;
  • Spend a long time revising the paper. Iterate several times between theory and experiments.
  • Figure out where to submit it. Or throw away the paper.

Of course, my approach is almost totally inadequate for fields like Machine Learning where a priori work is often barren or irrelevant. Also, I end up writing uninteresting papers—and throw them away (hopefully). I also cannot tell, when I start writing it, what the paper will say and to whom exactly it might be interesting. Hence, I often change the abstract and the title several times. Many sections that I write are thrown away. I have entire, very long papers, that nobody will ever read.

Hopefully, my process produces papers with a sane balance between theory and practice. At least, that is what I tell myself.

Here are a few of my papers that are typical of my approach:

  • Faster Retrieval with a Two-Pass Dynamic-Time-Warping Lower Bound : I worked on this paper for over two years, crafting about one hundred pages of theoretical results. Most of my results were too weak for publication and were thrown away. At some point, I set the paper aside for about 6 months because I could not get a decisive theoretical results. I ran serious experiments only a few months before submission and I got surprisingly good results (C++ code). At least, I was surprised by the experiment! There are several small theoretical improvements hidden in the paper. Plus, I collected a few nice observations.
  • Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes: It took me about a year to get at this paper. The full extend of the work has been submitted as a journal article a few weeks ago. With Owen Kaser, we wrote about four times as many pages as we published. Initially, we spent a lot of time on theoretical concerns, mapping the problem to graph theory. We implemented some prototypes using fancy theoretical computer science, but they were outgunned by approaches based on mere sorting. I assume we got slightly depressed at some point. I ended writing a bitmap index library from scratch in C++. We then spent a lot of time both on further theory and increasingly serious experiments. All year long, both the theory and the experiments improved. Both experiments and theory suggested several new problems to me, and I will work on them in 2009.

Hence, I only write down the code after I have some interesting theoretical results. But I do not require the theoretical results to be worth a paper. Hence, I always start experiments after I have started a paper. Then I iterate, working on the paper (the theory) and on the experiments. The idea is that one is supposed to help the other. Experiments are meant to suggest new theory, and theory is supposed to suggest new experiments.

How do you work?

Are you really running out of time?

A common feeling among creative workers is the lack of time. Yet, most people will run out of energy before they run out of time. A single task that takes you 5 minutes (asking a Business Development Officer for Intellectual Property rights) can drain you out for a week. Another task, like lecturing for 3 hours, can energize you for the rest of the week. Highly productive people do not have more time, but they may have more energy, more method and better feedback on their progress.

I believe that three problems lead us to conclude we lack time:

  • You are spending too much time on boring tasks. To be productive, you need to work on projects you love. For this reason, creative people should pick their projects.
  • You fail to manage your projects. Without help, you can only keep track of our 7 projects or tasks at any one time. If you want to do more, a method is needed. Myself, I use GTD. But some method is needed to scale up to a large number of projects. Without method, you will drift to unessential tasks and then blame the lack of time to explain why important tasks went unattended.
  • You do not measure your progress. You need to get feedback about the quality and quantity of your work. Myself, I put my work under subversion and get daily emails of what files changed. It is a crude by effective measure of my work. Also, tracking your project carefully, at the task level, helps. Finally, having coworkers who react to your work is a blessing. Without measure of your progress, you may realize too late that your projects did not progress and then blame the lack of time.

Recommender systems: where are we headed?

Daniel Tunkelang comments on the recent progress in collaborative filtering:

(…) the machine learning community, much like the information retrieval community, generally prefers black box approaches, (…) If the goal is to optimize one-shot recommendations, they are probably right. But I maintain that the process of picking a movie, like most information seeking tasks, is inherently interactive, (…)

I disagree with him. Even for non-interactive recommendations, the Machine Learning community is off-track for two reasons:

  • They fail to take into account diversity. In Information Retrieval, we know that if precision is high (all documents are relevant) but recall is low (few of the relevant documents are presented), then the system is poor. There is no such balance in collaborative filtering. Precision above all else is the goal. This is wrong. Diversity metrics must be used.
  • They work over static data sets. A system like Netflix is not static and so, accuracy on a static data set might be a good predictor for real-world performance. The problem is intrinsically nonlinear. People will rate different items, and they will rate differently, if you change the recommender system. The feedback loop may work against you or in your favour. The effect might be large or small. As far as I can tell, I am the only one who keep pointing out this fundamental, but never addressed limitation of working over static data sets. Update: This has absolutely nothing to do with online versus batch algorithms.

See also my post Netflix: an interesting Machine Learning game, but is it good science?

Note: I organized the ACM KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition along with people like Yehuda Koren. Yahuda is among the candidates to win the Netflix prize. I do not oppose the Netflix competition. I just do not think that it will solve our big problems.

Understanding what makes database indexes work

Why do database indexes work?

In a previous post, I explained that only two factors make indexing possible:

  • your index expects specific queries
  • or you make specific assumptions about the data sets.

In other cases, you are better off just scanning the entire data set.

What makes database indexes work?

As far as I know, there are only 6 strategies that make indexes work. By combining them in different ways, we get all of the various existing schemes. (I would love to hear your feedback on this claim!)

1. You expect specific queries: restructure your data!

Suppose you know ahead of time that you will only need to select some of the elements in your data set. Then you can taylor an index for such queries and thus avoid scanning much of the content. For example, an inverted index in full-text search will select which documents contain the various keywords. Instead of working with all documents, you will only worry about the ones matching at least one keyword. Indexing a column with a B-tree or a hash table is another scenario where you try to immediately go to the relevant rows in a table.

Of course, if I look for all documents containing the words “the” and “will”, and want to know how many there are and what is their average length, such a form of indexing will not help.

2. You expect specific queries: materialize them!

Another commonly used strategy is view materialization. If 10% of all visitors on Google type in the word “sex”, they might as well precompute the result of the query. In Business Intelligence, if you can expect your users to mostly care about results aggregated over weeks, months or years, it makes sense to precompute these values instead of always working from the raw data. Alternatively, you can materialize intermediate elements that are needed to compute your results. For example, even if people do not need data aggregated per day, precomputing it might be useful for computing weekly numbers faster.

This form of indexing tends to work well to address the most popular queries, but it fails when people have more specific needs.

3. You expect specific queries: redundancy is (sometimes) your friend

When you do not know exactly which queries to expect, you can try to index the data in different ways, for different queries. For example, you could both use a B-tree and a hash table, and determine at query time which is the best evaluation strategy. You might even determine that the best way is to forgo the indexes and scan the raw data!

4. Use multiresolution!

Suppose that you look for specific images, but you may still need to scan 50% of them. An index that would point you to only the relevant images might not be effective. Instead, you should try to quickly discard the irrelevant candidates. What you could do is create thumbnails (low resolution images). Then you can dismiss quickly the images that are obviously not a good match. Naturally, you can have progressively finer resolutions.

Database indexes often bin values together. For example, if you could bin all workers earning between $10,000 and $30,000, then all workers earning between $30,000 and $50,000, and so on. If you are looking for workers earning between $40,000 and $45,000, you can first find all works that are in the $30,000-$50,000 bin, and then look up their actual salaries, one by one. You can adapt the bins either to the data distribution or to the types of queries you expect.

For more examples, see my post How to speed up retrieval without any index?.

5. Your data is not random: compress it!

Most real-world data is highly compressible. By compressing the data, you can make it so that your CPU and IO subsystem process less data. However, you have to worry about bottlenecks. Too much compression may overload your CPU. Too little compression and most time will be spent in loading the data from disk. Two techniques are often combined to get good results out of compression: sorting and run-length encoding.

6. In any case: optimize your code

You should be using cache-aware and CPU-aware indexes. Be aware that comparing two bits together may take nearly as long as comparing two integers. Be aware that jumping all over the place (as in a B-tree) takes longer than processing the data by tiny chunks.

How to become smarter

picture by tatianes
  • Work on projects you love doing, even if only part of the time. You can only be as smart as you are motivated. I will never be a smart electrician.
  • Reading and learning are important, but people learn by doing, by tinkering.
  • Carry a notebook or an iPad, and use it to record ideas as they come to you. Periodically sort through your ideas.
  • Get in the habit of doubting everything. Especially things that appear obvious. Is an umbrella the best way to stay dry under the rain? Do you really die if you stop eating altogether?
  • Teach others what you want to learn. For example, if you want to understand advanced physics, start a blog on quantum mechanics.
  • This is probably the most important point: hang around with smart people. If you live among monkeys, you might have a good life, but you won’t be smarter than a monkey. Happily, you can easily hang around with smart people wherever you live thanks to the Internet. This is important because if you hang around with people who do great work, you will be motivated by emulation: nobody likes to feel like a loser among his peers.
  • Push yourself: try daring projects and learn to fail. Be ambitious! Do not waste your time with things you know how to do well. Go beyond. Aim as high as you can, while trying to stay on track.
  • Context is important when solving problems. I found that offices are nearly the worst place to work for me. I have done some of my best work at home. Sometimes, a coffee place can be a decent alternative office (presumably because of the white noise effect). Sometimes, using a pen is better than a keyboard. Sometimes, working with a laptop in your bed is better than working on a desk. Change, try new contexts!
  • Set time aside to think, write, read in a quiet place.
  • Come back to important projects regularly. Do not get lost in the small stuff.
  • Urgency is an important factor. Somehow, being too happy about what you achieved can slow you down. This suggests that you should be critical of your own work and that you should not underestimate your competitors. Of course, you need to stay motivated, so do not overestimate your competitors or underestimate your own work either!
  • You will not cure cancer in one day. You will not become a pro golfer in a week. You can only solve big problems by dividing them up into small chunks. Always stay focus on the next small step. Do not stare mindlessly at the big picture.

Be physically smart:

  • Omega-3 is good for you and might make you smarter. Eating fish seems like a good idea.
  • When you are tensed, eat carbs (bread, cookies). Do not make things worse by drinking coffee.
  • Too much coffee tends to get your mind to speed up and you lose focus easily. You end up getting many things done, but you no longer have time for thinking about the hard problems.
  • When you need energy, eat proteins (cheese, meat, beans). Coffee alone will only help you temporarily, it does not get you through a lot of hard work.
  • Drink a lot of water: after all, your brain is mostly water.
  • Sleep a decent amount. Some people claim sleep-deprivation allows them to get more done, and it might be true, and I do not know of any evidence that sleep deprivation hurts your brain, but being sleepy does slow you down and tends to get you to work on routine problems.
  • Taking long walks (at least 20 minutes) out in a quiet park, thinking about some deep issues, tend to set me up for good work for the rest of the day.

For further reading and scientific evidence, read my posts Physical factors making your smarter: white noise, carbohydrates, music, alcohol, and coffee? and Thinking intelligence is innate makes you stupid.


My research process

One thing you never read about is how people do research in their mind. People do describe how to write papers, how to get an academic job, but somehow, I cannot recall anyone describing their thought process.

Mine is simple enough. It includes both theoretical and experimental work. So here it is…

  • I usually start with a specific problem. This problem must be about something significant: a few people worldwide might want to know about the solution. It must be sufficiently narrow that I can address it in a few months. I try to apply the Turney’s principle: be ambitious. In other words, it should not be obvious when I begin that I will succeed. Yes, this means that I do not know I will be able to write a paper at the end! And yes, this means that I sometimes fail. Ideally, I pick a problem so original that I am the only one working on it, worldwide. Almost invariably, the nicest problems take one of the following forms: 1) I want to explain theoretically something I observe experimentally 2) I want to improve on an existing method by at least an order of magnitude (in accuracy, simplicity, speed). Merely aiming to improve an existing approach by a small amount is something I avoid, if only because I know that given enough time, I can always hope to improve any technique by a tiny amount. There is no challenge, no surprise, no risk of failure!
  • A good problem is such that I can then it process down to at least one simple conjecture. A simple conjecture is one that I can realistically hope to make progress on within a few days or a few hours. Sometimes I verify the conjecture experimentally, sometimes theoretically, it does not matter. I avoid working on several small conjectures at the same time: I try to handle them one at a time. Sometimes, the result of my work on a conjecture will be another conjecture. Sometimes these conjectures turn out to be silly, in retrospect.
  • Once I have processed the first simple conjecture, I try to come up with other ones that will bring me closer to a solution to my problem. Always picking the next most promising one.
  • Very often, I will give up on a problem or the problem will change drastically over time. Or the problem will generate worthwhile subproblems. At any given time, I have about a dozen different problems on my radar, but only about 2 or 3 active ones, and only about 2 or 3 conjectures I am working on.

Collaboration messes up this process because I no longer control the overall problem. But I will still decompose the problem into conjectures that I take one at a time. One benefit of working with someone else is that you have someone who will read and check your conjectures. You can also check someone’s else conjecture which is refreshing. You are also much less likely to make crucial mistakes in the process if you work with others (especially if your collaborators are any good).

To a large extend, my process does not rely on brilliant insights nor luck. I merely grind the problem slowly, each time approaching closer and closer to the solution (hopefully). I do not care about making mistakes. I am very, very often wrong. In the past, I have wasted months working on useless problems, generating useless conjectures: this tends to happen more frequently if I work alone.

What makes me more productive, mostly, are nice problems. Often, picking the small conjectures is rather simple: after all, I do not need to be right, I just need to grind at the problem. If there is any talent involved at all in my process, it has to do on how I pick the overall problem. But even then, I think that passion matters more than talent. The more I care about the problem, to faster I make progress. And more importantly, the happier I am as I work.

Funding opportunities, networking, fame and fortune play no role in the above process. At no point do I worry about what others will think except maybe when I pick the overall problem. And even then, I only check, in my mind, that a few people will care, enough that some journal will publish it, eventually. This egocentric process is probably suboptimal. However, my overarching goal is not to be famous, but rather to enjoy myself and get paid in the process. This is not to say I do care about my peers: I want to earn their respect.

I can sometimes offload some of the conjectures to people working on my projects. However, my process does not scale up very well. I can work in small teams (2 or 3 people), but I could not run a large laboratory (10 people or more) with the above process. I am more of a craftsman than a tycoon.