External-Memory Sorting in Java

Update: this code is now obsolete. Please see the corresponding Github project.

Sometimes, you want to sort large file without first loading them into memory. The solution is to use External Sorting. Typically, you divide the files into small blocks, sort each block in RAM, and then merge the result.

Many database engines and the Unix sort command support external sorting. But what if you want to avoid a database? Or what if you want to sort in a non-lexicographic order? Or maybe you just want a simple external sorting example?

When I could not find such a simple program, I wrote one.

Please help me improve it. It needs:

  • To be easy to modify: the code must remain simple.
  • It must scale to very large files.
  • While scalability and simplicity are most important, it must also be sensibly efficient.

Once we have a good solution, I plan to post it on Google code and add a link in the External Sorting wikipedia entry. If you contribute to the code, please add your name so you can get credit.

Reference: ExternalSort.java, http://pastebin.com/H57TZF7e

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “External-Memory Sorting in Java”

  1. Why not put it on Google Code right away? They have a nice code review tool that works better than pastebin to comment code or ask a question on a specific line.

  2. I like your approach! Great use of PriorityQueue and the Comparable interface on your BinaryFileBuffer. Short and simple.

    I made a couple of changes on pastebin to get rid of a potential infinite loop. But beside that, it looks good to me. (Although I think 8 spaces is way too much for tabs. I go with 2. 😉 )

  3. My command of Java is quite poor, but that’s beside the point: I did not originally understand the algorithm. If I have a list of, say, 100 numbers, split that list into 10 lists of 10 numbers each, sort each of them individually and then merge the result, I will still need to sort the new 100-number list. However, there are advantages to doing things that way – which are explained here: http://en.wikipedia.org/wiki/Mergesort

    “1- A small list will take fewer steps to sort than a large list.
    2- Fewer steps are required to construct a sorted list from two sorted lists than two unsorted lists. For example, you only have to traverse each list once if they’re already sorted […].”

    This old pony learnt a new trick today…

  4. I think the key is that you don’t have to load the smaller files entirely: just read the first number in each of the 10 files and you’ll know the smallest (remaining) entry. Pop the smallest entry out out of the file where you found it and send it directly to the output file. As a result, you never had to keep the 100 numbers in memory.

  5. Reading through the code I think found one unneeded conditional:
    while(line != null) {

    When I created a big file to test, I ran out of memory. Increasing the
    heap size, java -Xmx128m or more, it only ever created one temp file.
    I believe that’s because you check free memory rather than what has
    been used so far:
    while((Runtime.getRuntime().freeMemory() > 2097152)
    &&( (line = fbr.readLine()) != null) ){ // as long as you have 2MB
    tmplist.add(line);

    If more than one file was created, I do not know the behaviour if this
    would overwrite:
    File newtmpfile = File.createTempFile(“sortInBatch”, “flatfile”);

    As it is, for a large file, the Java version is faster, however
    sometimes the result is not what I get on the command-line with sort
    (I get a smaller file!). I haven’t been able to properly isolate the bug. It
    only happens with fairly large files though.

    $ time java -Xmx512m ExternalSort to_sort10.txt to_sort10.txt.java.out

    real 0m24.436s
    user 0m24.937s
    sys 0m1.669s

    $ time sort to_sort10.txt > to_sort10.txt.sort.out

    real 2m41.615s
    user 2m35.640s
    sys 0m1.876s

    -rw-r–r– 1 danielharan danielharan 54288300 4 Apr 00:55 to_sort10.txt
    -rw-r–r– 1 danielharan danielharan 39720054 4 Apr 01:21
    to_sort10.txt.java.out #EEEK
    -rw-r–r– 1 danielharan danielharan 54288300 4 Apr 01:26
    to_sort10.txt.sort.out

  6. I suppose there are some issues with your code:
    – The check of remaining memory (Runtime.getRuntime().freeMemory() > 2097152) is not valid. The Vector will douple its size if it ran out of space. This may allocate more than 2Mb.
    – All files to merge are kept open. The OS could run out of file handles if there are to many files to merge.
    – The files to merge are not closed savely. If an exception is thrown they may be still opened.

    Alternative approaches merge files in multiple runs until there are no files to merge left. This is probably only needed here if too many files have to be merged at once.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.