Sorting 1 terabyte in 209 seconds

Yahoo! managed to sort 10 billion 100-byte records, one terabyte in all, in 209 seconds. This was done in Java using Hadoop.

As a basis for comparison, on a fast and recent Mac Pro, it takes 6000 seconds to sort a 2 GB text file using Unix file utilities. Yahoo!'s problem is 500 times larger, and they solve it 30 times faster: they are 4 orders of magnitude faster! Of course, they have fixed-length records, which helps tremendously.
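As a quick sanity check of that arithmetic, here is a back-of-the-envelope sketch in Python using only the figures reported above; the resulting ~14,000x throughput ratio is what "4 orders of magnitude" refers to:

```python
# Figures from the post: 1 TB in 209 s (Yahoo!) vs. 2 GB in 6000 s (Mac Pro).
yahoo_bytes, yahoo_seconds = 10**12, 209
mac_bytes, mac_seconds = 2 * 10**9, 6000

yahoo_rate = yahoo_bytes / yahoo_seconds   # ~4.8 GB/s
mac_rate = mac_bytes / mac_seconds         # ~0.33 MB/s

print(f"Yahoo!:  {yahoo_rate:.2e} bytes/s")
print(f"Mac Pro: {mac_rate:.2e} bytes/s")
print(f"Ratio:   {yahoo_rate / mac_rate:.0f}x")  # ~14,000x, i.e. ~4 orders of magnitude
```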

However, I wonder how much energy (power usage) was spent on the sort operation.

2 thoughts on “Sorting 1 terabyte in 209 seconds”

  1. The Terabyte sort seems pretty silly; of course, throwing a shitload of resources at a problem is bound to give "impressive results," but where is the benefit for the average user?
    i.e., your 6000-second sort.
    This looks like Formula 1 racing, which is supposed to further technological progress, and does once in a while, but at what cost?
    The Penny sort on the same page seems more sensible.
    BTW, from experience with my linear sort, the 6000 seconds you report for 2 GB falls within the plausible range of elapsed time due to disk access latency when sorted records are shuffled around; it is not a compute-bound limit. You might check it.

  2. The 6000 seconds is definitely not "internal memory" since the whole machine has 2 GiB of RAM and it tries to sort 2 GiB of data. So there is quite a bit of I/O overhead, sure.
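To make the I/O point in comment 2 concrete: when the data is about as large as RAM, the standard technique is an external merge sort (the Unix sort utility works along these lines): sort memory-sized runs, spill them to disk, then stream a k-way merge. Here is a minimal sketch in Python; the function name and run size are illustrative, not taken from any of the tools discussed above:

```python
import heapq
import tempfile
from itertools import islice

def external_sort(input_path, output_path, lines_per_run=1_000_000):
    # Phase 1: read runs that fit in memory, sort each, spill to a temp file.
    runs = []
    with open(input_path) as src:
        while True:
            chunk = list(islice(src, lines_per_run))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: k-way merge of the sorted runs. heapq.merge streams lazily,
    # so memory use stays bounded; the time goes mostly into disk I/O,
    # which is why such a sort is latency-bound rather than compute-bound.
    with open(output_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
```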
