In Picking N distinct numbers at random: how to do it fast?, I describe how to quickly pick N distinct integer values at random from a range of integer values. It comes down to using either a bitset/bitmap or a hash set.

The bitset approach is very fast when you need to pick many integer values out of a small range. The hash set approach is very fast if you need to pick very few values, irrespective of the range of values.

What about the middle ground? What if you need to pick lots of integer values from an even greater range?

Because N is large, you may not care to get exactly N values. That is, if you need to pick 100 million integer values at random, it might be fine to pick 99,999,999 integer values.

What can you do?

- Fill an array with N randomly generated integer values (using a uniform distribution).
- Sort the array.
- Remove duplicates.
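These three steps can be sketched as follows (my own illustrative code, assuming C++ and a 64-bit Mersenne Twister; it is not the benchmarked implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Pick roughly n distinct values in [0, range): generate, sort, deduplicate.
std::vector<uint64_t> pick_sorted(size_t n, uint64_t range, uint64_t seed) {
    std::mt19937_64 gen(seed);
    std::uniform_int_distribution<uint64_t> dist(0, range - 1);
    std::vector<uint64_t> out(n);
    for (auto &v : out) v = dist(gen);                         // fill with random values
    std::sort(out.begin(), out.end());                         // O(n log n) sort
    out.erase(std::unique(out.begin(), out.end()), out.end()); // drop duplicates
    return out;
}
```

Note that after duplicate removal the result may hold slightly fewer than n values, which is exactly the "approximately N" relaxation above.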

That is pretty good, but the sort function could be expensive if N is large: it is O(N log N), after all.

Assuming that there are no duplicates, can we model this using probabilities? What is the distribution corresponding to the first value? We have N values picked out of a large range. So the probability that any value has been picked is N over the range. We recognize the geometric distribution. Once you have found the first value, you can repeat this same reasoning except that we now have N-1 values over a somewhat restricted range (because we generate them in sorted order).

- Generate a value over the range R using a geometric distribution with probability N/R.
- Append the value to my output array.
- Reduce the range R with the constraint that all future values must be larger than the last value appended to the output array.
- Decrement N by one.
- Repeat until N is 0.
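The loop above can be sketched like so (my own naming and structure, not the post's benchmark code; the early exit handles the fact that a geometric distribution is unbounded):

```cpp
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Approximate sampler: emit ~n sorted distinct values in [0, range)
// by generating the gaps between successive values geometrically.
std::vector<uint64_t> pick_geometric(uint64_t n, uint64_t range, uint64_t seed) {
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<uint64_t> out;
    out.reserve(n);
    uint64_t current = 0; // smallest value still eligible
    while (n > 0 && current < range) {
        double p = (double)n / (double)(range - current);
        if (p >= 1.0) { // as many values needed as remain: take them all
            out.push_back(current++);
            --n;
            continue;
        }
        // geometric gap: floor(log(u) / log(1 - p)), with u uniform in (0, 1]
        double u = 1.0 - unit(gen);
        uint64_t gap = (uint64_t)std::floor(std::log(u) / std::log1p(-p));
        current += gap;
        if (current >= range) break; // unbounded distribution: stop at the edge
        out.push_back(current++);
        --n;
    }
    return out;
}
```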

You can use the fact that we can cheaply generate numbers according to a geometric distribution:

`floor(log(random_number_in_0_1()) / log(1 - p))`

All you need is a way to generate random numbers in the unit interval [0,1], but that is easy. In C++ and many other programming languages, you have built-in support for geometric distributions.
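For instance, C++ ships std::geometric_distribution (the number of failures before the first success). A quick sanity check of its mean, which should be (1 − p)/p (the helper below is mine, for illustration):

```cpp
#include <cstdint>
#include <random>

// Average many samples of std::geometric_distribution;
// the expected value is (1 - p) / p.
double geometric_sample_mean(double p, int trials, uint64_t seed) {
    std::mt19937_64 gen(seed);
    std::geometric_distribution<uint64_t> geom(p);
    double sum = 0;
    for (int i = 0; i < trials; ++i) sum += (double)geom(gen);
    return sum / trials;
}
```

With p = 0.25, the sample mean should settle near (1 − 0.25)/0.25 = 3.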

The net result is an O(N) algorithm to pick N values at random over a range.

There is a catch, however. My model is not quite correct. For one thing, we do not quite have a geometric distribution: it is only valid if the range is very, very large. This manifests itself by the fact that the values I generate may exceed the range (a geometric distribution is unbounded). We can patch things up by stopping the algorithm once a value exceeds the range or some other anomaly occurs.

So I ran a benchmark where I have to pick 100,000,000 values among all integers smaller than 40,000,000,000. I get that the time per value generated is about half using the geometric-distribution approach:

| approach | time per value |
|------------|----------------|
| sort-based | 170 ns |
| geometric | 80 ns |

For larger arrays, I can achieve 3x to 4x gains but then my software runs out of memory.

What else could you do? Another practical approach would be to divide up the range into many small subranges and to use the fact that the number of values within each subrange follows a binomial distribution (which can be approximated by a normal distribution) to get a divide-and-conquer approach: instead of one big “pick many values in a large range” problem, we would have several small “pick few values in a small range” problems. For each small problem, you can afford to use a sort-based approach since sorting small arrays is fast.
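A sketch of that divide-and-conquer idea (an illustrative helper of my own, not code from the post; it uses std::binomial_distribution directly rather than the normal approximation, and it returns approximately n values because the base case removes duplicates):

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Recursively split [lo, hi) until the subproblem is small, then solve it
// with the sort-based approach. The count falling in each half of a split
// follows a binomial distribution. Assumes n <= hi - lo.
void pick_recursive(uint64_t n, uint64_t lo, uint64_t hi,
                    std::mt19937_64 &gen, std::vector<uint64_t> &out) {
    if (n == 0) return;
    if (hi - lo <= 1024) {
        // Base case: small range, generate + sort + deduplicate.
        std::uniform_int_distribution<uint64_t> dist(lo, hi - 1);
        std::vector<uint64_t> tmp(n);
        for (auto &v : tmp) v = dist(gen);
        std::sort(tmp.begin(), tmp.end());
        tmp.erase(std::unique(tmp.begin(), tmp.end()), tmp.end());
        out.insert(out.end(), tmp.begin(), tmp.end());
        return;
    }
    uint64_t mid = lo + (hi - lo) / 2;
    // Each of the n values lands below mid with probability (mid-lo)/(hi-lo).
    std::binomial_distribution<uint64_t> left_count(
        n, (double)(mid - lo) / (double)(hi - lo));
    uint64_t n_left = left_count(gen);
    pick_recursive(n_left, lo, mid, gen, out);
    pick_recursive(n - n_left, mid, hi, gen, out);
}
```

Because the recursion emits the left half before the right half and each base case is sorted, the output comes out sorted without a global sort.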

I think it’s close, but it’s not the geometric distribution exactly. Consider the case of picking 1 value out of 2 for example.

You are right that each value is chosen with probability N/R (with R the size of the range), but the probabilities when considering what value is to be picked next are not independent.

I think we agree. From my blog post: “My model is not quite correct. For one thing, we do not quite have a geometric distribution”.

Also there is a concern with the failure mode: if you end up picking 999,999 values, you stop early with fewer values, but these values are at least somewhat “well distributed”. What about the other case, where the fates direct you to pick 1,000,001 values? The algorithm cuts off early and values near the end of the range won’t be chosen. So I think there is a significant bias against values near the end of the range.

Yes. There is going to be a bias. Of course, we could stop well short of the end and finish off the generation using some other means.

Note that the reason there is a bias at the end follows directly from the fact that the geometric distribution is an imperfect approximation.

Yes and no. Drawing from the wrong distribution in this way leads to a subtle bias throughout the range, and it also leads to not getting exactly the “expected” size most of the time.

However, just clipping off the array when you get to N is another thing and leads to a different, quite strong type of bias for the end elements.

I think you should choose the same strategy you do for fewer elements: you can return a 999,999-element array, why not a 1,000,001-element one? There is a symmetry there and it eliminates one big source of bias. Instead of stopping when

`i <= N || range_size <= (N - i)`

just stop when `out[i] > range_max` (of course, don’t actually assign that last element).

Yes, from a C perspective, returning more than N elements is more problematic :).

Finally, once you get your result, you can stretch or shrink it to exactly N by choosing random locations to insert elements and choosing them randomly from the range implied by the insertion position (yes, another source of bias, I think, but small and evenly distributed in the sense that every number should have equal likelihood to appear).

I agree with your concerns regarding the tail.

Part of the problem is that the C qsort starts at a significant disadvantage: it takes its comparator by function pointer and has to make an indirect function call to compare each pair of elements. For fast comparisons, like between `uint64_t` values, that’s a lot of overhead for what boils down to one assembly instruction.

I sent a PR to add C++’s sort, everything else unchanged, and I find it to be almost 2x as fast as C’s sort (it’s the second reading here):

```
Generating 1000000 values in [0, 40000000000), density: 0.002500 %
timings: 0.171996 s 0.094119 s 0.063858 s
timings per value: 171.996000 ns 94.119000 ns 63.858000 ns
actual counts: 999993 999985 999998

Generating 10000000 values in [0, 40000000000), density: 0.025000 %
timings: 2.039629 s 1.103942 s 0.666784 s
timings per value: 203.962900 ns 110.394200 ns 66.678400 ns
actual counts: 9998779 9998723 9999999

Generating 100000000 values in [0, 40000000000), density: 0.250000 %
timings: 22.494848 s 12.375101 s 6.343137 s
timings per value: 224.948480 ns 123.751010 ns 63.431370 ns
actual counts: 99875477 99875035 100000000
```

On this machine geometric is 3x to 4x as fast as C sort, which is quite a bit faster (relatively) than your results. Maybe a different compiler…

Thanks Travis.

(As you may suspect, I was aware that qsort was slower than the C++ sort, but I was writing a pure C benchmark.)

You can write fast sorts in C too; it’s just that qsort isn’t one of them (unless the comparator time is large). Otherwise you are just showing “why writing a generic function-pointer-taking sort in C is relatively slow”.
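To make the comparator overhead concrete, here is what the two calls look like side by side (an illustrative sketch of my own, not the benchmark’s code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// qsort: the comparator is invoked through a function pointer
// for every comparison, which generally prevents inlining.
int compare_uint64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y); // avoid overflow from plain subtraction
}

void sort_c(std::vector<uint64_t> &v) {
    qsort(v.data(), v.size(), sizeof(uint64_t), compare_uint64);
}

// std::sort: the comparison (operator<) is a template argument,
// so the compiler can inline it down to a single instruction.
void sort_cpp(std::vector<uint64_t> &v) {
    std::sort(v.begin(), v.end());
}
```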

My belief is that I used the “textbook” or standard way to sort an array in C. If my belief is wrong, I’d love to be corrected.

Yes, as far as I know qsort is the “textbook” way to sort in C and a good example of why higher level languages can be faster than “textbook C”.

I am not sure I understand why we can’t easily fix the C qsort. I realize that qsort is compiled, but, surely, it is not impossibly hard engineering to fix this? Can’t the compiler recognize the qsort call and do some inlining? Of course, it would only solve qsort, but so what?

The compiler already handles something like memcpy to optimize away the function call when needed. How hard could it be to do the same with qsort?

I don’t think it’s that hard, but it just doesn’t matter a lot.

If libc was shipped with some LTO payload (i.e., intermediate representation code), then you could do it with LTO: the comparison function could be inlined in the final compile step.

That’s probably never going to happen because compiler vendors want to keep their LTO info version specific and be allowed to change it at any point.

The real thing is that it just doesn’t matter: you can easily write your own sort method, or copy/paste the source of qsort or any other good sort you find on the internet, and have your comparator inlined.

So in that sense it is not the same as memcpy: that’s about compiling memcpy into the calling code with information only the compiler knows about alignment and copy length: you have no real practical alternative to the compiler doing a good job here in many cases.

You have a practical alternative to qsort: just don’t use the copy of the sort that has been compiled ahead of time and stuck in libc. C++ doesn’t have this problem because sort is in a header, and JIT’d languages don’t have this problem because inlining happens at runtime.

Another solution for the primitive-valued-array case, implementable in the library only, would be if the standard library writers provided standard “compare int” type methods, and then qsort could check at runtime if a standard compare method was being used and then dispatch to the specialized version for that primitive.

Or it could inspect the machine code of the provided compare method and decide whether it was equivalent to the usual comparison – but now we’re just getting crazy.

I had a similar problem and implemented the following in my minperf project, RandomSetGenerator.java. It only needs about 4-5 ns/entry.

Instead of a random number generator, I used a seeded hashing / mixing algorithm, similar to mix64, applied to a counter. It is guaranteed to return unique numbers without having to check for uniqueness.

If no sorting at all is needed, then you can just use the hashing algorithm. This is also used in fastfilter_cpp, random.h, GenerateRandom64Fast, for 64 bit numbers. But mix functions can be written for any other number of bits (I wrote hash48, hash44, and hash32). If the range is not a power of 2, one might want to discard numbers outside of the range, and/or recursively split the range.
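As an illustration of that idea, the splitmix64 finalizer is one such mix function: each step (xor-shift, multiply by an odd constant) is invertible, so distinct counter values are guaranteed to map to distinct outputs. This is a sketch of the general technique, not the code from minperf or fastfilter_cpp:

```cpp
#include <cstdint>

// splitmix64 finalizer: a bijective 64-bit mixer. Applying it to
// seed + 0, seed + 1, seed + 2, ... yields unique, random-looking
// 64-bit values with no uniqueness check needed.
uint64_t mix64(uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}
```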

In my case, “some” sorting is needed (blocks with ascending ranges). For this, I recursively split the range. That will guarantee you get the right number of entries in total. Within each range, the “no sorting” approach from above is used.

Splitting a range: I have used the normal distribution to calculate the number of expected entries below the middle (RandomSetGenerator.randomHalf). It is somewhat slow, but if the ranges are large this isn’t a problem.

This is a great comment Thomas.