External-Memory Shuffles?

We need to shuffle the lines in very large variable-length-record flat files.

We can load the files in MySQL and do “select * from mytable order by rand().”

However, loading the data in a DBMS and dumping it out is cumbersome. So, we do an in-memory shuffle block by block. It comes close to a full random shuffle, but I am worried it might not be good enough.

Anyone knows of a fast way to do a full external-memory shuffle using only Ruby, Perl, Python, and other Unix utilities?

A crazy idea: Let n be the number of lines. Shuffle the numbers between 1 and n, in-memory. Then preprend these numbers at the beginning of each line, do external-memory sorting with the Unix command sort, and remove the random numbers from the final file. This will scale up to about 100 million lines on a PC.

It might be possible to generate random numbers from a large set, thus making collisions very unlikely, thereby avoiding a shuffle of numbers altogether.

Definitive solution? There is a GNU command called shuf, but it works only for small files. However, the new sort command has a –random-sort flag. To install a recent sort on on MacOS, do the following:

wget http://ftp.gnu.org/gnu/coreutils/coreutils-6.9.tar.bz2
tar xvjf coreutils-6.9.tar.bz2
cd coreutils-6.9
mkdir osxbuild && cd osxbuild
../configure --prefix=/usr/local/stow/coreutils-6.9
jm_cv_func_svid_putenv=yes && make
sudo cp src/sort /usr/local/bin/
sudo cp ../man/sort .1 /usr/local/share/man/man1/

However, sort –random-sort will only shuffle properly if no line is repeated. You need to first pass your file through cat -n to ensure that all lines are unique, then conclude with cut to remove the counter added by cat. Here is a command that does the trick:

cat -n myfile.csv | sort --random-sort | cut -f 2-

It may not generate a perfect shuffle, but it should be close.

Update:  I have written a follow-up blog post.

5 thoughts on “External-Memory Shuffles?”

  1. Thanks Suresh. To implement a “Knuth shuffle”, don’t you need random access to the items to swap them? For my problem, I specify a “variable-length-record flat file”. If I am allowed to turn it into a fixed-length-record flat file with random access to the lines, I might as well throw the data in a DBMS. Hence, there is the idea of shuffling the numbers from 1 to n, which can be done in O(n) time using the Knuth shuffle [which languages like Java or Python conveniently provide], prepend them to the lines, and then sorting the lines…

    Or are you thinking about some other algorithm? Here is my reference:

    Or maybe you mean that I can scale up the shuffling of the numbers from 1 to n to values of n much greater than 100,000,000? Well, maybe I was being a pessimist…

  2. What’s the problem with a shuffle ? using the standard Knuth trick, you can generate a shuffle as easily as choosing random numbers.

  3. True, but you’re only shuffling the range [1..n], not the actual numbers, no ? I’m guessing that it won’t scale beyond what you can fit in memory though.

  4. I like the CRC64 solution, although, why not go even crazier and use a SHA algorithm? The disadvantage of these approaches is that you need one pass to generate the random number (prepending it to the line) and another pass to sort the lines.
    Unf. my sha1sum command doesn’t seem to accept input from the command line, making it difficult to answer with a nice one liner.

Leave a Reply

Your email address will not be published. Required fields are marked *