In Are your research papers telling original stories?, I claimed that the main benefits of the typical research paper were that:

  • the contribution to the state-of-the-art is clear (what did you invent?);
  • we can quickly quantify the value of the contribution (how well does it work?).

Basically, research papers are fitted to the needs of the current peer review system.

The current breed of research papers are also convenient. There are millions of ways of improving any given process. Each improvement can become a research paper. You can even proceed systematically. Pick any given solution to a problem and add a twist to it. Can you solve the problem faster? Can you solve the problem by using less memory? Can you solve the problem incrementally? And so on. You can manufacture countless research papers without ever learning anything new. And because you measured and categorized all of your contributions, you are even likely to get much recognition! Moreover, because you invented many new things, you may even get your name on a framework, algorithm or problem! If any of what you did is useful for industry, you may even get rich!

But I may not find your work interesting. I would like to propose an alternative recipe that should produce more interesting research papers:

  • Pick any process followed by practitioners or by nature. How do human beings or ants solve a given problem? What heuristics do successful engineers follow?
  • Explain, model or reproduce the process in question.

There are endless puzzles out there. For example, I have no satisfactory explanation of why wikipedia worked. Had I been asked about a project like wikipedia in the nineties, I would have predicted failure. Admit it: you would have done the same. Yet, it worked. Why?

Look at how the best programmers work. They have many clever tricks (algorithms, processes, strategies) that you will never find in any textbook. Sometimes these tricks work unreasonably well. But we have no explanation.

Remember: Nature is the best coauthor.

Further reading: Write good papers.

Science is not a matter of pure logic. Some of the best scientists lived through intense emotions—it shaped their lifes. Here are a few quotes:

  • Ludwig Boltzman (invented entropy): When he could not reach the standards he set for himself, he would be overcome by feelings of fear, suffering and depression. He killed himself on 5 September 1906.
  • Wallace Carothers: Despite his success with Nylon, he felt that he had not accomplished much, he checked into a Philadelphia hotel room and died after drinking a cocktail of lemon juice laced with potassium cyanide.
  • Rudolf Diesel: His family says that Diesel committed suicide because his invention was stolen. A cross in his journal on the date he died was an indicator of suicide.

Related posts: Research productivity: some paths less travelled and Emotions killing your intellectual productivity.

Update: Don’t worry about my own state of mind. I’m not depressed right now, but I am interested in how our emotions shape our research and our faculty to innovate.

Theory in Computer Science—as in any other field—is based on models. These models make many hidden assumptions. This is one of the fundamental reason why pure theory is wasteful. We must constantly revisit our old assumptions and run experiments to determine whether our models are useful.


Hash tables are a fundamental data structure. They are part of every elementary introduction to Computer Science. From a programmer’s point of view, they associate keys with values. Hash tables go back to the beginnings of Computer Science even though new variations keep being invented (e.g., Cuckoo hashing). Incredibly, hash tables were just recently integrated in the C++ language, but less conservative languages like Java, Python, Ruby, Perl, PHP, all have hash tables built-in.

The secret sauce is that the key is first transformed (or hashed) to a number between 1 and m corresponding to a cell in an array. As long as this number of generated with sufficient randomness, hash tables are almost equivalent to looking up values in an array—except for the overhead of hashing the key.

Thus, everyone knows that hash table queries run in amortized constant time. That is, as the number of keys increases, the average time necessary to recover a key-value pair does not increase.

But wait! How do we compute hash values? Consider the standard universal hashing technique described in Corman et al. Introduction to Algorithms. We pick any prime number m. Pick randomly a number a in [0,m). Then h(x)= x a modulo m is a universal hash function. Multiplications of numbers in [0,m) can almost be done in time O(log m).  For the purposes of hash tables, the number m must be close to the number of different keys n. Thus—formally—hash tables often run in time O(log n). (Thanks to Colin Percival for fixing the complexity analysis for me.)

Am I being pedantic? Does the time required to multiply integers on modern machine depend on the size of the integers? It certainly does if you are using vectorization. And vectorization is used in commercial databases!

Thus, it is possible for hash tables to run in time O(log n) in practice, despite what the textbooks say.

What other hidden assumptions are you making right now?

Disclaimer 1: Fortunately, universal hashing can be computed in linear time with respect to the number of bits, by avoiding multiplications.

Disclaimer 2: The problems that result from picking the wrong model of computation are well known and addressed by most textbooks. I have not discovered anything new.

Update: See my more recent post Sensible hashing of variable-length strings is impossible.

To do statistics over the Web, we need samples. Thus, I want to know how to pick a Web page at random, without making much effort. If you are Google or Microsoft, it is easy. But what about the rest of us? And what if I want to pick users at random on Facebook?

In effect, I want to sample a virtually infinite data set: I consider the Web to be infinite because I cannot enumerate all elements in it. Given finite resources against an infinite data set, can I sample fairly?

Thankfully, the objects in this set are linked and indexed. Hence, I came up with two sampling strategies:

  • Using the hyperlinks: Start with any Web page. Recursively follow hyperlinks until you have many Web pages. Pick a page at random in this set. (Mathematically, I am sampling nodes in an infinite graph.)
  • Using the index: Sampling items Pick a set of common words, enter them into a search engine. Pick a page at random in the result set.

Are there any other strategies? What can I say (if anything) about the biases of my sampling?

Further reading: Benoit, D. and Slauenwhite, D. and Schofield, N. and Trudel, A., World’s First Class C Web Census: The First Step in a Complete Census of the Web, Journal of Networks 2 (2), 2007.

Update: An anonymous reader points me to Ziv Bar-Yossef and Maxim Gurevich, Random sampling from a search engine’s index, JACM 55 (5), 2008.

Update 2: Chris Brew points me to Henzinger et al., On Near-Uniform URL Sampling, WWW 2000.

Update 3: Regarding Facebook, I found Gjoka et al., Unbiased Sampling of Facebook, 2009.

Hello World (book)

I learned programming on my own when I was twelve years old with a TRS-80 and Microsoft Basic. The documentation that came with the TRS-80 was fantastic. Alas, today, no vendor would ever think of including an introduction to programming with a computer. If your are a dad (or a mom) and you regret this, then Hello World: Computer Programming for Kids and Other Beginners by Warren and Carter Sande is for you. The book is cheap: while the list price is $34, Amazon sells it for a little over $23.

As a kid, my initial goal was to create my own video games. And I did! I learned (entirely on my own) that:

  • Even though computers are fast, it is hard to create fast software. Being clever is hard work!
  • While it is easy to program a small game, programming a slightly more complex game can be orders of magnitude more difficult.

This book should allow your kids to learn as much and more.

Instead of using Microsoft Basic, the authors use Python. An excellent choice. They cover all the same material as in my venerable TRS-80 documentation: random numbers, variables, loops, graphics, functions and sound. However, there are a few more advanced concepts: dynamic arrays, arrays of arrays, objects, modules, file input and output, and event-based programming.

Note that many of the examples of the book would not run on the latest Python release (3.1). That is a minor concern since I would recommend you stay away from Python 3 in any case.

Disclaimer: I was given a free e-book in exchange for my review.

« Previous PageNext Page »

Powered by WordPress