We need to shuffle the lines in very large variable-length-record flat files.
We could load the files into MySQL and run “select * from mytable order by rand().”
However, loading the data into a DBMS and dumping it back out is cumbersome. So, for now, we shuffle in memory, block by block (see the sketch below). This comes close to a full random shuffle, but I am worried it might not be good enough.
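For reference, here is roughly what the block-by-block shuffle looks like in Python (file names and block size are illustrative):

import itertools, random

BLOCK = 1000000  # lines per in-memory block (pick what fits in RAM)
with open("myfile.csv") as src, open("blockshuffled.csv", "w") as dst:
    while True:
        block = list(itertools.islice(src, BLOCK))  # read one block of lines
        if not block:
            break
        random.shuffle(block)  # a true shuffle, but only within the block
        dst.writelines(block)

Lines never move between blocks, which is why this only approximates a full shuffle.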
Does anyone know of a fast way to do a full external-memory shuffle using only Ruby, Perl, Python, and other Unix utilities?
A crazy idea: let n be the number of lines. Shuffle the numbers between 1 and n in memory. Then prepend these numbers to the lines, do an external-memory sort with the Unix command sort, and strip the random numbers from the result. This should scale up to about 100 million lines on a PC; a sketch follows.
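A minimal sketch in Python, assuming the file is named myfile.csv and leaving the external-memory sort to the Unix sort command:

import random

n = sum(1 for _ in open("myfile.csv"))  # count the lines
keys = list(range(1, n + 1))
random.shuffle(keys)  # n integers fit in RAM up to ~100 million lines
with open("myfile.csv") as src, open("decorated.csv", "w") as dst:
    for key, line in zip(keys, src):
        dst.write("%d\t%s" % (key, line))  # prepend the random rank

Then sort on the numeric prefix and strip it off:

sort -n decorated.csv | cut -f 2- > shuffled.csv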
Alternatively, it might be possible to draw the numbers at random from a large set, making collisions very unlikely and avoiding the in-memory shuffle of numbers altogether (see the sketch below).
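Sketched the same way, assuming a 64-bit key space (the script name decorate.py is made up). The keys are zero-padded to a fixed width so a plain lexicographic sort orders them correctly, and a rare collision merely leaves two lines in an arbitrary relative order, which is harmless for shuffling:

import random, sys

for line in sys.stdin:
    sys.stdout.write("%020d\t%s" % (random.getrandbits(64), line))

Usage:

python decorate.py < myfile.csv | sort | cut -f 2- > shuffled.csv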
Definitive solution? There is a GNU command called shuf, but it works only for small files. However, the new sort command has a --random-sort flag. To install a recent sort on MacOS, do the following:
tar xvjf coreutils-6.9.tar.bz2
cd coreutils-6.9 && mkdir osxbuild && cd osxbuild
../configure jm_cv_func_svid_putenv=yes && make
sudo cp src/sort /usr/local/bin/
sudo cp ../man/sort.1 /usr/local/share/man/man1/
However, sort --random-sort shuffles by sorting on a hash of each line, so repeated lines end up next to one another: it only shuffles properly if no line is repeated. You need to first pass your file through cat -n to ensure that all lines are unique, then conclude with cut to remove the counter added by cat. Here is a command that does the trick:
cat -n myfile.csv | sort --random-sort | cut -f 2-
It may not generate a perfect shuffle, but it should be close.
Update: I have written a follow-up blog post.