Would you accept work designing mass destruction weapons? Back when a was in college, one my most memorable philosophy assignment was a rebuttal to the claim that scientists working on weapons of mass destruction were responsible for the creation of the weapons. As intellectuals, and scientists, do we have to bear the full responsibility of our actions?

Somehow, we like to imagine researchers working on new weapons in lavish conditions. They are individuals with prestigious degrees who could have any job. Yet, they are enticed by greed to work on evil projects. Or are they?

I spent weeks in the library—the Web didn’t even exist!—documenting how difficult the job market for scientists was. As far back as twenty years ago, statistics showed that there were far more qualified scientists than corresponding jobs. (Ironically, years later, I became a scientist only to be surprised by how competitive the job market really is.)

Scientists may not be physically starving, but they have invested 10 or 20 years working toward a single goal: become a bona fide researcher. And jobs are scarce. Faced with reality, people compromise.

Yet, remember, this scarcity of Ph.D.-related jobs combined with a glut of Ph.D.s is not an accident:

By the fall of 1972, there are likely to be more Ph.D.’s looking for positions than there are (adequately salaried) positions with duties commensurate with Ph.D. training level in mathematics. (Anderson,  Are there too many Ph.D.’s, American Mathematical Monthly, 1970)

Source: Sébastien Paquet.

In my previous post, you were invited to help with a reference implementation of external sorting in Java. Several people tested and improved the code. I like the result.

  • I posted the code on Google code. All contributors are  owners of the project. The source code is under subversion.
  • I have added a link to it from the wikipedia page.

What is left to do?

  • The code remains untested. Please run your benchmarks! Find bugs!
  • Please contribute unit tests.
  • Can you write a tutorial on how to use the code?
  • Can you simplify the code further while making it faster and more robust?

Caveat: My intent was for the code to be in the public domain—nobody should own reference implementations—but Google code would not allow it. I selected the lesser GPL license instead, for now.

Reference: There is a fast external sorting implementation in Java by the Yahoo! people. (Thanks to Thierry Faure for pointing it out.) I have not looked at it.

Update: this code is now obsolete. Please see the corresponding Google code project.

Sometimes, you want to sort large file without first loading them into memory. The solution is to use External Sorting. Typically, you divide the files into small blocks, sort each block in RAM, and then merge the result.

Many database engines and the Unix sort command support external sorting. But what if you want to avoid a database? Or what if you want to sort in a non-lexicographic order? Or maybe you just want a simple external sorting example?

When I could not find such a simple program, I wrote one.

Please help me improve it. It needs:

  • To be easy to modify: the code must remain simple.
  • It must scale to very large files.
  • While scalability and simplicity are most important, it must also be sensibly efficient.

Once we have a good solution, I plan to post it on Google code and add a link in the External Sorting wikipedia entry. If you contribute to the code, please add your name so you can get credit.

Reference: ExternalSort.java, http://pastebin.com/H57TZF7e

« Previous Page

Powered by WordPress