To test my algorithms, I like to generate synthetic data. To do so, I often need to generate distinct randomly chosen numbers from a range of values. For example, maybe I want to pick 2 distinct integers in the interval [0,10]. For my purposes, I need these numbers to appear in order, but we can just generate them in any order and sort them later.

Picking the first number at random is easy: most programming languages come with fast pseudo-random number generators. However, when you try to pick the second number, there is a small probability that you pick the first one again. If this happens, you need to start again. To check quickly whether a number has already been picked, we might use a hash table. This suggests the first algorithm one might try:

    HashSet<Integer> s = new HashSet<Integer>();
    while (s.size() < N)
        s.add(rand.nextInt(Max));

(This code generates N distinct integers in the interval [0,Max).)

Intuitively, this algorithm is hard to beat when you need to pick few integers from a large range. In this case, the probability that you will pick an already picked number is small. But, in fact, even if you need to pick one out of every two values from a range (say pick 10 integers in the interval [0, 20)), this algorithm is still reasonably efficient. Indeed, the probability that a given number is already picked is no larger than 50%. How many times (on average) do you need to generate new random numbers if you have a 50% probability of rejecting them? You can check that the answer is 2. This means that as long as you don’t need to pick more than half the values (N is no more than Max/2), you can expect to generate no more than 2N random numbers, that is, no more than Max.
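Here is a quick check of that claim, assuming each draw is independent and uniform over [0,Max). If a draw is rejected with probability p, the number of draws T needed to get one accepted value follows a geometric distribution:

$$E[T] = \sum_{k \ge 1} k \, p^{k-1} (1 - p) = \frac{1}{1 - p},$$

which gives 2 for p = 1/2. For the whole algorithm, when i values have already been picked, the rejection probability is i/Max, so the expected total number of draws is

$$\sum_{i=0}^{N-1} \frac{\mathit{Max}}{\mathit{Max} - i} \;\le\; \frac{N \cdot \mathit{Max}}{\mathit{Max} - N + 1} \;<\; 2N \quad \text{when } N \le \mathit{Max}/2.$$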

What if you need to pick more than Max/2 integers in [0,Max)? This can become a problem if you are not careful. Thankfully, there is a nice fix: picking N integers in [0,Max) for N close to Max is equivalent to picking Max-N integers in [0,Max) and then keeping the numbers you did not pick. Computing this complement can be done efficiently if you first sort the numbers you picked. This means that you can always assume that N is no larger than Max/2.
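The complement step can be sketched as a single pass over [0,Max). This is only an illustration: the class name `Complement` and the method `of` are made up for this example, and the input is assumed to be already sorted and free of duplicates.

```java
import java.util.Arrays;

final class Complement {
    // Given distinct values from [0, max) in increasing order, returns the
    // values of [0, max) that are NOT in the input, also in increasing order.
    static int[] of(int[] sortedPicked, int max) {
        int[] out = new int[max - sortedPicked.length];
        int j = 0; // next position to examine in sortedPicked
        int o = 0; // next free position in out
        for (int v = 0; v < max; v++) {
            if (j < sortedPicked.length && sortedPicked[j] == v) {
                j++;          // v was picked: leave it out
            } else {
                out[o++] = v; // v was not picked: keep it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Picking 8 of 10 values is the same as picking 2 and keeping the other 8.
        int[] excluded = {3, 7};
        System.out.println(Arrays.toString(of(excluded, 10)));
        // prints [0, 1, 2, 4, 5, 6, 8, 9]
    }
}
```

The pass costs time proportional to Max, which is acceptable here because the output itself has more than Max/2 entries.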

Still, it is reasonable to think that the performance of the hash-based algorithm degrades as N becomes closer to Max/2.

One possibly better alternative in this case… one that your typical Computer Science professor might propose… is Reservoir Sampling. Though it sounds fancy, Reservoir Sampling is actually easily implemented:

    int[] ans = new int[N];
    for (int k = 0; k < N; ++k)
        ans[k] = k;
    for (int k = N; k < Max; ++k) {
        int v = rand.nextInt(k + 1);
        if (v < N) {
            ans[v] = k;
        }
    }

It is not immediately obvious why this algorithm works. However, it is correct. The nice thing about Reservoir Sampling is that we know exactly how many random numbers we need to generate: we need Max-N of them (one per iteration of the second loop), no matter what. This means that Reservoir Sampling has a running time that depends mostly on Max, and only weakly on N.
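One way to convince yourself is by induction. Value k enters the reservoir with probability N/(k+1), and a value that is already there survives step k with probability k/(k+1), so after step k each of the k+1 values seen so far is present with probability N/(k+1). At the end, that is N/Max for every value. The sketch below checks this empirically; the class and method names are mine, and the trial count and seed are arbitrary.

```java
import java.util.Random;

final class ReservoirCheck {
    // Same reservoir sampling loop as above, wrapped in a method.
    // The result holds n distinct values from [0, max), in no particular order.
    static int[] sample(int n, int max, Random rand) {
        int[] ans = new int[n];
        for (int k = 0; k < n; ++k) ans[k] = k;
        for (int k = n; k < max; ++k) {
            int v = rand.nextInt(k + 1); // uniform in [0, k]
            if (v < n) ans[v] = k;       // happens with probability n / (k + 1)
        }
        return ans;
    }

    // Counts how often each value of [0, max) is selected over many trials.
    static int[] frequencies(int n, int max, int trials, Random rand) {
        int[] counts = new int[max];
        for (int t = 0; t < trials; ++t) {
            for (int v : sample(n, max, rand)) counts[v]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 3, max = 10, trials = 1_000_000;
        int[] counts = frequencies(n, max, trials, new Random(1));
        for (int v = 0; v < max; ++v) {
            // Each ratio should be close to n / max = 0.3.
            System.out.printf("%d: %.4f%n", v, counts[v] / (double) trials);
        }
    }
}
```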

However, it turns out that an even better alternative might be to replace the hash table by a bitmap. A bitmap is just an array of bits. We need Max bits. If the value has already been picked, we set the bit to 1, otherwise the bit is set to 0. The algorithm is otherwise identical to the first hash-based algorithm:

    BitSet bs = new BitSet(Max);
    int cardinality = 0;
    while (cardinality < N) {
        int v = rand.nextInt(Max);
        if (!bs.get(v)) {
            bs.set(v);
            cardinality++;
        }
    }
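A side benefit of the bitmap, given that I want the numbers in order, is that reading the set bits from low to high yields sorted output with no separate sort. Here is a minimal sketch of that, using `java.util.BitSet` and its `nextSetBit` method; the class name `BitmapSampler` and the argument check are my additions.

```java
import java.util.BitSet;
import java.util.Random;

final class BitmapSampler {
    // Draws n distinct integers from [0, max) and returns them in increasing order.
    static int[] sample(int n, int max, Random rand) {
        if (n < 0 || n > max) {
            throw new IllegalArgumentException("need 0 <= n <= max");
        }
        BitSet bs = new BitSet(max);
        int cardinality = 0;
        while (cardinality < n) {
            int v = rand.nextInt(max);
            if (!bs.get(v)) { // only count values not picked before
                bs.set(v);
                cardinality++;
            }
        }
        // Walking the set bits from low to high gives the values already sorted.
        int[] out = new int[n];
        int i = 0;
        for (int v = bs.nextSetBit(0); v >= 0; v = bs.nextSetBit(v + 1)) {
            out[i++] = v;
        }
        return out;
    }
}
```

For example, `BitmapSampler.sample(10, 20, new Random())` returns 10 increasing values from [0,20).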

It turns out that a good heuristic is to use the bitmap algorithm when N is larger than Max / 1024. Otherwise, the hash-based algorithm appears better. Reservoir Sampling is not a good choice for this problem.

The following table shows the speed (in millions of integers picked per second) of the various techniques on a recent i7 processor using C++. Note how much faster the bitmap approach is once N is not tiny compared to Max.

Max/N | Hash | Bitmap | Reservoir Sampling |
---|---|---|---|
16384 | 2.0 | 1.0 | 0.0 |
1024 | 7.5 | 28 | 0.1 |
2 | 1.3 | 64 | 14 |
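Going by these numbers (the hash set wins at Max/N = 16384, the bitmap wins at 1024 and at 2), a dispatch rule could look like the sketch below. The names `SamplerChoice`, `Strategy` and `choose` are invented for the illustration, and treating the boundary case Max/N = 1024 as a bitmap case is my reading of the table rather than a carefully tuned threshold.

```java
final class SamplerChoice {
    enum Strategy { HASH, BITMAP }

    // Picks a strategy for drawing n distinct integers from [0, max).
    // Assumption: the crossover sits near max / n = 1024, as the table suggests;
    // the right value surely depends on the machine and the implementation.
    static Strategy choose(int n, int max) {
        if (n < 0 || n > max) {
            throw new IllegalArgumentException("need 0 <= n <= max");
        }
        // Compare n * 1024 with max in 64-bit arithmetic to avoid overflow.
        return ((long) n * 1024 >= max) ? Strategy.BITMAP : Strategy.HASH;
    }

    public static void main(String[] args) {
        System.out.println(choose(1, 16384));  // prints HASH
        System.out.println(choose(16, 16384)); // prints BITMAP
        System.out.println(choose(10, 20));    // prints BITMAP
    }
}
```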

For good measure, I coded up these algorithms in both Java and C++. The results are consistent. My code is available for review.

**Credit**: I thank Nathan Kurz for challenging me on this problem.

I wonder how this compares?

(1) fill an array, indexed from 0 to Max-1, with random floating point numbers

(2) apply an index sort to the array (sort the array and return an index of elements in ascending order)

(3) output the first N values in the index

In Perl Data Language, this algorithm takes two lines:

$result = qsorti(random($max));

p $result(0:($n-1));
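For readers who do not use PDL, the same three steps can be sketched in Java. This is only a rough equivalent under my own naming (`SortSampler`, `sample`); it relies on `Arrays.sort` with a comparator, which is a stable sort for object arrays, so ties between keys keep the original index order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

final class SortSampler {
    // Returns n distinct integers from [0, max): the indexes of the n smallest
    // of max random keys. The values come out in random order, not sorted.
    static int[] sample(int n, int max, Random rand) {
        double[] keys = new double[max];
        Integer[] index = new Integer[max];
        for (int i = 0; i < max; i++) {
            keys[i] = rand.nextDouble(); // step (1): one random key per position
            index[i] = i;
        }
        // Step (2): index sort, ordering the positions by their random key.
        Arrays.sort(index, Comparator.comparingDouble(i -> keys[i]));
        // Step (3): keep the first n positions.
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = index[i];
        return out;
    }
}
```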

@Peter Turney

I don’t think your numbers are going to be distinct. In theory, it is possible that your approach would pick just one distinct value (repeated many times).

Update: I misread Peter’s description.

You’re wrong. I’ve been using exactly this code for years.

@Peter Turney

I misread your algorithm. Sorry.

Still, there is a probability (very small indeed) that all your floating point numbers are going to be identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.

I’ll add it to my benchmark later. Thanks.

An example might help:

pdl> $ran=random(4)

pdl> p $ran

[0.9474 0.8675 0.7389 0.4402]

pdl> $sort=qsorti($ran)

pdl> p $sort

[3 2 1 0]

(p = print; floats are truncated for display purposes; sort from smallest to largest; qsorti = quick sort and return index)

When you’re using a high-level language (Perl Data Language, Matlab, etc.), you learn to avoid explicit loops by calling built-in functions that implicitly loop over vectors and matrices. In a high-level language, an algorithm without explicit loops is almost always much faster than an algorithm with explicit loops.

“Still, there is a probability (very small indeed) that all your floating point numbers are going to be identical. In this sense, your algorithm is probabilistic… with good probability, it will solve the problem.”

If all the floating point numbers are identical and the sort preserves the original order when there is a tie, then the output will be [0 1 2 3 …]. It is not a problem if this output happens from time to time. It is a valid output, as long as it doesn’t happen too often.

@Peter Turney

That’s right, but even a tie between two floating point numbers, if ties are always resolved in the same deterministic manner, will introduce a bias.

I assume that Perl will use 64-bit floating point numbers… in such a case, your algorithm to generate distinct 32-bit integers has a negligible bias.

(I should note that even my algorithms have biases in practice. They are only free of biases if I assume that I have a perfect random number generator.)

Of course, the interesting question is speed. The way you describe your algorithm, it runs in time O(Max), so that we might expect that when N is much smaller than Max, then your algorithm is slow compared to the hash set approach. My instinct is that even when N is close to Max, your algorithm is slower than the bitmap approach. Of course, I’ll need to verify this more seriously.

“The way you describe your algorithm, it runs in time O(Max)”

Actually O(Max log(Max)) due to sorting. But if you implement all the algorithms in a high-level language, I’m guessing mine will run the fastest, due to the lack of explicit loops. For me, the time and effort I save by writing programs in a high-level language is more important than the speed I can get from the computer by working in a lower-level language (in most cases, with some rare exceptions).

Also, what if you don’t want the final list in sorted order? What if the task is to shuffle the numbers in the list [0,Max]? (This is a typical task for my work.) Then your N/2 select/drop inversion trick is not applicable.

@Peter Turney

As for focusing on the performance one gets using Perl… see my post “The language interpreters are the new machines”: http://lemire.me/blog/archives/2011/06/14/the-language-interpreters-are-the-new-machines/

Much of traditional computer science is focused on designing algorithms from basic operations (and this is what I do here with C++ and Java), but this is increasingly less relevant.

But not everything is black and white. For some problems, it is definitely worth it to get a 10x speed-up. Google’s backend could not be written in pure Perl. 😉

Back to the problem at hand…

If you want to shuffle a list, then a Fisher–Yates shuffle is probably best. Such shuffling is part of Java, C++(STL) and Python. I don’t know about Perl but I have read online that you can find a shuffle function in List::Util. So I would argue that in many instances, you shouldn’t code a list shuffling algorithm by hand.
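In Java, for instance, this can be as short as the sketch below, which relies on `Collections.shuffle` from the standard library. The wrapper class `ShuffleSampler` is just for illustration; keeping only the first n values of the shuffled list gives n distinct numbers in random order, at a cost proportional to Max.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ShuffleSampler {
    // Shuffles [0, max) with the library shuffle and keeps the first n values.
    static List<Integer> sample(int n, int max, Random rand) {
        List<Integer> values = new ArrayList<>(max);
        for (int i = 0; i < max; i++) values.add(i);
        Collections.shuffle(values, rand);
        // Copy the prefix so the result does not hold on to the full list.
        return new ArrayList<>(values.subList(0, n));
    }
}
```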

You should check out format preserving encryption.

The basic idea is this: hashing has collisions, but encryption does not, so you can encrypt an int as you increment it to get distinct random numbers. A (solvable) challenge is finding an encryption algorithm that tightly fits the range of numbers you are generating.

http://blog.demofox.org/2013/07/06/fast-lightweight-random-shuffle-functionality-fixed/
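To make the idea concrete, here is a toy sketch. It uses a small Feistel network (which is a bijection whatever the round function is) and "cycle walking" to fit the range: if the encrypted value falls outside [0,Max), it is encrypted again until it falls inside. Everything here is an assumption made for illustration: the class name, the four rounds, and especially the round function, which is a simple integer mixer and not real encryption. The outputs are distinct, but I make no claim about their statistical quality.

```java
import java.util.Random;

final class FeistelPermutation {
    private final int max;      // the permuted range is [0, max)
    private final int halfBits; // number of bits in each Feistel half
    private final int halfMask;
    private final int[] roundKeys = new int[4];

    FeistelPermutation(int max, long seed) {
        if (max < 1 || max > (1 << 30)) {
            throw new IllegalArgumentException("need 1 <= max <= 2^30");
        }
        this.max = max;
        int bits = 32 - Integer.numberOfLeadingZeros(max - 1); // bits needed for max - 1
        this.halfBits = Math.max(1, (bits + 1) / 2);
        this.halfMask = (1 << halfBits) - 1;
        Random rand = new Random(seed);
        for (int i = 0; i < roundKeys.length; i++) roundKeys[i] = rand.nextInt();
    }

    // Toy round function (hypothetical, NOT cryptographically secure).
    private int round(int half, int key) {
        int x = (half ^ key) * 0x9E3779B1;
        x ^= x >>> 15;
        return x & halfMask;
    }

    // One pass of the Feistel network: a bijection on [0, 2^(2 * halfBits)).
    private int encryptOnce(int x) {
        int left = x >>> halfBits;
        int right = x & halfMask;
        for (int key : roundKeys) {
            int next = left ^ round(right, key);
            left = right;
            right = next;
        }
        return (left << halfBits) | right;
    }

    // Cycle walking: re-encrypt until the value is back inside [0, max).
    // Distinct inputs in [0, max) always give distinct outputs in [0, max).
    int permute(int index) {
        if (index < 0 || index >= max) {
            throw new IllegalArgumentException("index out of range");
        }
        int x = index;
        do {
            x = encryptOnce(x);
        } while (x >= max);
        return x;
    }
}
```

Calling `permute(0)`, `permute(1)`, …, `permute(N-1)` then yields N distinct values from [0,Max) without storing which values were already produced.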