Today I spent money from a research grant. Here is the process:

  1. I grab the form in Excel format.
  2. I fill it out.
  3. I print the form.
  4. I sign it.
  5. I give it to a secretary.
  6. The secretary gets the chair of my research center to sign it.
  7. The form is then sent to accounting, by internal mail (on paper).
  8. They review the form.
  9. They enter the data in a computerized database.

Suggesting that it could be improved using technology from this century is crazy talk.

I agreed to give a talk to graduate students on how to write good research papers. I have posted the slides of my talk online. They are mostly taken out of my web page on this topic.

What annoys you about research papers? How do you recognize a good research paper? Do you have any advice to share?

Most researchers are convinced that their current work is important. Otherwise, they wouldn’t do it. Yet, few of them work on obviously important things like curing cancer or solving world hunger. Rather, they do silly things like prove the Poincaré conjecture. A century to figure out some theoretical nonsense? Please!

So, why won’t researchers work on the important problems of our era?

The conventional explanation is that working directly on the major problems is like staring at the Sun. Instead, researchers must do routine work until an opening toward greatness opens up. So real researchers…

  • survey existing work,
  • comment on special cases,
  • provide theoretical justifications for empirical observations,
  • validate new theory experimentally, and so on.

That is, researchers are not architects. They use greedy algorithms:

  • Look at problems and results you can grasp with your current expertise,
  • Select an interesting problem which is a good fit for your expertise,
  • rinse and repeat.
  • Wait! You are close to solve a major problem? Jump on it!

Should scientists feel guilty that they can’t prove the importance of each increment? I think not. I think scientists are inefficient, but there is no better way known to man. Indeed, consider how real innovation is typically unpredictable:

  • The greatest difference between my Honda Civic and the car I drove as a teenager is that I can lock and unlock all doors with a remote. This single function made all the difference in the world for me. I drive my wife nuts as I keep playing with the remote: lock, unlock, lock, unlock… And people thought we would have flying cars!
  • I am sure that Google offered better search results that altavista. Yet, the real reason I switched to Google and never looked back is that they did away with the big annoying ads. Understanding that you didn’t need to annoy your users to make a profit was Google’s greatest innovation. (Don’t let them fool you into thinking PageRank had something to do with it.)
  • Amazon.com is by far the best e-commerce site on the planet. But what is different? In fact, a lot of small little things. On the surface, Amazon.com is just an HTML view of a database. But they have collected many small innovations that when put together make a huge difference.

My point is that innovation in the little things adds up to important and practical results. That is why academic researchers spend so much time writing surveys or studying to death a detail. They don’t think their own work will change the world, but they count on others doing the same thing. They hope that when they put it back together, the result is great. For the last two hundred years or so, they have fared extremely well.

To put it another way: greedy algorithms can be pretty good. They can certainly beat 5-year plans.

Further reading: Innovative ideas are indistinguishable from crackpot ones

You can sort large files while using little memory. The Unix sort tool is a widely available implementation of this idea. Files are written to disk sequentially, without random access. Thus, you can also sort variable-length records, such as lines of text.

What about shuffling? Using the Fisher-Yates algorithm also known as Knuth algorithm, you can shuffle large files while using almost no memory. But you need random access to your files. Thus it is not applicable to variable-length records. And indeed, the Unix sort command cannot shuffle. (It has a random-sort option, but it is not a shuffle. Meanwhile, the shuf command runs in RAM.)

A solution: Tag each record with a random number. Pick random numbers from a very large set so that the probability that any two lines have the same random number is small. Then use external-memory sorting. You can implement something similar as a single line in Unix.

A better solution? Shuffling is possible in linear time O(n). Sorting is a harder problem (in O(n log n)). Thus, using a sort algorithm for shuffling—as we just did—is inelegant. Can we shuffle in linear time without random access with variable-length records?

Maybe we could try something concrete? Consider this algorithm:

  • Read the original file in small blocks. Shuffle each block in RAM. Write them to temporary files. View each shuffled block as a stack of records.
  • Select a non-empty block at random. Pick and remove the record on top of the stack. Append it to the result set. Repeat. (The correct probability assignment for each block is the number of records left in the block divided by the total number of records left.)

(As a variation on this algorithm, you can merge the blocks two-by-two.)

Unfortunately, I doubt this algorithm can run in linear time.

Your challenge: Consider variable-length records. Prove or disprove that we can implement an external-memory shuffle in linear time. Alternatively, come up with an algorithm faster than the sorting-based one.

Update: Preston L. Bannister proposed an algorithm which solves the problem to my satisfaction. The same algorithm was described by P. Sanders in Random Permutations on Distributed, External and Hierarchical Memory (Information Processing Letters, 1998).

Reference: This is a follow-up to my blog post External-Memory Shuffles?

The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?

Which line runs faster? (The symbol “^” is the XOR.)

for(int k = 0; k < N; ++k) sum+= k;

for(int k = 0; k < N; ++k) sum^= k;

My result: In C++ and Java, both run at the same speed (within 1%).

Disclaimer: I’d be delighted if you could prove me wrong. Please provide Java or C++ source code.

Next Page »

Powered by WordPress