Almost picking N distinct numbers at random

In Picking N distinct numbers at random: how to do it fast?, I describe how to quickly pick N distinct integer values at random from a range of integers. It comes down to using either a bitset/bitmap or a hash set.

The bitset approach is very fast when you need to pick many integer values out of a small range. The hash set approach is very fast if you need to pick very few values, irrespective of the range of values.

What about the middle ground? What if you need to pick lots of integer values from an even greater range?

Because N is large, you may not care to get exactly N values. That is, if you need to pick 100 million integer values at random, it might be fine to pick 99,999,999 integer values.

What can you do?

  1. Fill an array with N randomly generated integer values (using a uniform distribution).
  2. Sort the array.
  3. Remove duplicates.

That is pretty good, but the sort function could be expensive if N is large: it is O(N log N), after all.
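
For concreteness, here is a minimal sketch of the sort-based approach in C++ (an illustration of the idea, not the post's benchmark code; the function name pick_sorted is mine):

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Sketch: fill an array with N random values in [0, range), sort, deduplicate.
    std::vector<uint64_t> pick_sorted(size_t N, uint64_t range, std::mt19937_64 &gen) {
      std::uniform_int_distribution<uint64_t> dist(0, range - 1);
      std::vector<uint64_t> out(N);
      for (auto &v : out) v = dist(gen);
      std::sort(out.begin(), out.end());                          // O(N log N)
      out.erase(std::unique(out.begin(), out.end()), out.end());  // remove duplicates
      return out;  // may hold slightly fewer than N values
    }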

Assuming that there are no duplicates, can we model this using probabilities? What is the distribution corresponding to the first (smallest) value? We have N values picked out of a large range, so the probability that any given value has been picked is N divided by the size of the range. We recognize the geometric distribution. Once you have found the first value, you can repeat the same reasoning, except that we now have N-1 values over a somewhat reduced range (because we generate the values in sorted order).

  1. Generate a value over the range R using a geometric distribution with probability N/R.
  2. Append the value to my output array.
  3. Reduce the range R with the constraint that all future values must be larger than the last value appended to the output array.
  4. Decrement N by one.
  5. Repeat until N is 0.

You can use the fact that we can cheaply generate numbers according to a geometric distribution:

floor(log(random_number_in_0_1()) / log(1 - p));

All you need is a way to generate random numbers in the unit interval [0,1], but that is easy. In C++ and many other programming languages, you have built-in support for geometric distributions.
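
Putting the pieces together, a rough sketch of the geometric approach might look as follows in C++, using std::geometric_distribution (again an illustration, not the exact benchmark code; it assumes N is much smaller than the range):

    #include <cstdint>
    #include <random>
    #include <vector>

    // Sketch: (almost) N distinct sorted values in [0, range), via geometric gaps.
    // Assumes N is much smaller than range, so that p stays well below 1.
    std::vector<uint64_t> pick_geometric(size_t N, uint64_t range, std::mt19937_64 &gen) {
      std::vector<uint64_t> out;
      out.reserve(N);
      uint64_t previous = 0;  // smallest value still allowed
      while (out.size() < N && previous < range) {
        double p = (double)(N - out.size()) / (double)(range - previous);
        std::geometric_distribution<uint64_t> gap(p);  // how many values we skip
        uint64_t value = previous + gap(gen);
        if (value >= range) break;  // the geometric distribution is unbounded
        out.push_back(value);
        previous = value + 1;
      }
      return out;  // may hold slightly fewer than N values
    }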

The net result is an O(N) algorithm to pick N values at random over a range.

There is a catch, however. My model is not quite correct. For one thing, we do not quite have a geometric distribution: the approximation is only valid if the range is very, very large. This manifests itself in the fact that the values I generate may exceed the range (a geometric distribution is unbounded). We can patch things up by stopping the algorithm once a value exceeds the range or some other anomaly occurs.

So I ran a benchmark where I have to pick 100,000,000 values among all integers smaller than 40,000,000,000. I find that the time per value generated is roughly halved with the geometric-distribution approach:

sort-based 170 ns
geometric 80 ns

For larger arrays, I can achieve 3x to 4x gains but then my software runs out of memory.

My code is available.

What else could you do? Another practical approach would be to divide the range into many small subranges and to use the fact that the number of picked values falling within each subrange follows a binomial distribution (which can be approximated by a normal distribution). This gives a divide-and-conquer approach: instead of one big “pick many values from a large range” problem, we get several small “pick a few values from a small range” problems. For each small problem, you can afford the sort-based approach, since sorting small arrays is fast.
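
As a rough illustration of this idea (my own sketch, not benchmarked code), one could split the range in half recursively, draw the number of picks falling in the left half from a binomial distribution, and fall back to the sort-based method on small subranges:

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Recursively split [lo, hi): the number of picks landing in the left half
    // follows a binomial distribution; small subranges use the sort-based method.
    void pick_split(std::mt19937_64 &gen, uint64_t lo, uint64_t hi, size_t n,
                    std::vector<uint64_t> &out) {
      if (n == 0) return;
      if (hi - lo <= 65536) {  // small problem: fill, sort, deduplicate
        std::uniform_int_distribution<uint64_t> dist(lo, hi - 1);
        std::vector<uint64_t> buf(n);
        for (auto &v : buf) v = dist(gen);
        std::sort(buf.begin(), buf.end());
        buf.erase(std::unique(buf.begin(), buf.end()), buf.end());
        out.insert(out.end(), buf.begin(), buf.end());
        return;
      }
      uint64_t mid = lo + (hi - lo) / 2;
      double left_fraction = (double)(mid - lo) / (double)(hi - lo);
      std::binomial_distribution<size_t> left_count(n, left_fraction);
      size_t n_left = left_count(gen);
      pick_split(gen, lo, mid, n_left, out);
      pick_split(gen, mid, hi, n - n_left, out);
    }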

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

24 thoughts on “Almost picking N distinct numbers at random”

  1. I think it’s close, but it’s not the geometric distribution exactly. Consider the case of picking 1 value out of 2 for example.

    You are right that each value is chosen with probability N/R (with R the size of the range), but the probabilities when considering which value is picked next are not independent.

    1. I think it’s close, but it’s not the geometric distribution exactly. Consider the case of picking 1 value out of 2 for example.

      I think we agree. From my blog post: “My model is not quite correct. For one thing, we do not quite have a geometric distribution”.

  2. Also there is a concern with the failure mode: if you end up picking 999,999 values, you stop early with fewer values, but these values are at least somehow “well distributed”. What about the other case, where the fates direct you to pick 1,000,001 values? The algorithm cuts off early and values near the end of the range won’t be chosen. So I think there is a significant bias against values near the end of the range.

    1. So I think there is a significant bias against values near the end of the range.

      Yes. There is going to be a bias. Of course, we could stop well short of the end and finish off the generation using some other means.

        1. Yes and no. Drawing from the wrong distribution in this case leads to a subtle bias throughout the range, and it also leads to not getting exactly the “expected” size most of the time.

          However, just clipping off the array when you get to N is another thing and leads to a different, quite strong type of bias for the end elements.

          I think you should choose the same strategy you do for fewer elements: you can return a 999,999-element array, why not a 1,000,001-element one? There is a symmetry there and it eliminates one big source of bias. Instead of stopping when i <= N || range_size <= (N - i), just stop when out[i] > range_max (of course, don’t actually assign that last element).

          Yes, from a C perspective, returning more than N elements is more problematic :).

          Finally once you get your result, you can stretch or shrink it to exactly N by choosing random locations to insert elements and choosing them randomly from the range implied by the insertion position (yes another source of bias, I think, but small and evenly distributed in the sense that every number should have equal likelihood to appear).

  3. Part of the problem is that the C qsort starts at a significant disadvantage: it takes its comparator by function pointer and has to make an indirect function call to compare each pair of elements. For fast comparisons like between uint64_t that’s a lot of overhead for what boils down to one assembly instruction.
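
    For illustration, the two call styles look roughly like this (a sketch, not the code from the PR):

        #include <algorithm>
        #include <cstdint>
        #include <cstdlib>

        // qsort must call back through a function pointer for every comparison...
        int compare_u64(const void *a, const void *b) {
          uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
          return (x > y) - (x < y);
        }

        void sort_c(uint64_t *data, size_t n) {
          qsort(data, n, sizeof(uint64_t), compare_u64);
        }

        // ...while std::sort can inline the comparison into the sorting loop.
        void sort_cpp(uint64_t *data, size_t n) {
          std::sort(data, data + n);
        }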

    I sent a PR to add C++’s sort, everything else unchanged, and I find it to be almost 2x as fast as C’s sort (it’s the second reading here):

    Generating 1000000 values in [0, 40000000000), density: 0.002500 %
    timings: 0.171996 s 0.094119 s 0.063858 s
    timings per value: 171.996000 ns 94.119000 ns 63.858000 ns
    actual counts: 999993 999985 999998

    Generating 10000000 values in [0, 40000000000), density: 0.025000 %
    timings: 2.039629 s 1.103942 s 0.666784 s
    timings per value: 203.962900 ns 110.394200 ns 66.678400 ns
    actual counts: 9998779 9998723 9999999

    Generating 100000000 values in [0, 40000000000), density: 0.250000 %
    timings: 22.494848 s 12.375101 s 6.343137 s
    timings per value: 224.948480 ns 123.751010 ns 63.431370 ns
    actual counts: 99875477 99875035 100000000

    On this machine geometric is 3x to 4x as fast as C sort, which is quite a bit faster (relatively) than your results. Maybe different compiler…

      1. You can write fast sorts in C too, it’s just that qsort isn’t one of them (unless the comparator time is large). Otherwise you just show “why writing a generic function-pointer-taking sort in C is relatively slow”.

          1. Yes, as far as I know qsort is the “textbook” way to sort in C and a good example of why higher level languages can be faster than “textbook C”.

            1. I am not sure I understand why we can’t easily fix the C qsort. I realize that qsort is compiled, but, surely, it is not impossibly hard engineering to fix this? Can’t the compiler recognize the qsort call and do some inlining? Of course, it would only solve qsort, but so what?

              The compiler already handles something like memcpy to optimize away the function call when needed. How hard could it be to do the same with qsort?

              1. I don’t think it’s that hard, but it just doesn’t matter a lot.

                If libc was shipped with some LTO payload (i.e., intermediate representation code), then you could do it with LTO: the comparison function could be inlined in the final compile step.

                That’s probably never going to happen because compiler vendors want to keep their LTO info version specific and be allowed to change it at any point.

                The real thing is that it just doesn’t matter: you can easily write your own sort method, or copy/paste the source of qsort or any other good sort you find on the internet and have your comparator inlined.

                So in that sense it is not the same as memcpy: that’s about compiling memcpy into the calling code with information only the compiler knows about alignment and copy length: you have no real practical alternative to the compiler doing a good job here in many cases.

                You have a practical alternative to qsort: just don’t use the copy of the sort that has been compiled ahead of time and stuck in libc. C++ doesn’t have this problem because sort is in a header, and JIT’d languages don’t have this problem because inlining happens at runtime.

                1. Another solution for the primitive-valued-array case, implementable in the library only, would be if the standard library writers provided standard “compare int” type methods, and then qsort could check at runtime if a standard compare method was being used and then dispatch to the specialized version for that primitive.

                  Or it could inspect the machine code of the provided compare method and decide whether it was equivalent to the usual comparison – but now we’re just getting crazy.

    1. As it turns out, you can do much better than std::sort also. I used a radix sort which was several times faster than std::sort and led to the sort-based method for generating unique numbers being significantly faster than the geometric distribution method (although of course there may be a lot of optimization to be done on that one).

      I wrote more about radix sorts specifically over here.
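
      For reference, a minimal least-significant-digit radix sort on 64-bit keys looks something like this (a sketch of the general technique, not the code from that post):

          #include <array>
          #include <cstdint>
          #include <vector>

          // LSD radix sort on 8-bit digits: eight stable counting-sort passes.
          void radix_sort(std::vector<uint64_t> &v) {
            std::vector<uint64_t> buf(v.size());
            for (int shift = 0; shift < 64; shift += 8) {
              std::array<size_t, 257> count{};  // histogram, then prefix sums
              for (uint64_t x : v) count[((x >> shift) & 0xFF) + 1]++;
              for (size_t i = 0; i < 256; i++) count[i + 1] += count[i];
              for (uint64_t x : v) buf[count[(x >> shift) & 0xFF]++] = x;
              v.swap(buf);
            }
          }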

  4. I had a similar problem and implemented the following in my minperf project, RandomSetGenerator.java. It only needs about 4-5 ns per entry.

    Instead of a random number generator, I used a seeded hashing / mixing algorithm, similar to mix64, for a counter. It is guaranteed to return unique numbers without having to check for uniqueness.

    If no sorting at all is needed, then you can just use the hashing algorithm. This is also used in fastfilter_cpp, random.h, GenerateRandom64Fast, for 64 bit numbers. But mix functions can be written for any other number of bits (I wrote hash48, hash44, and hash32). If the range is not a power of 2, one might want to discard numbers outside of the range, and/or recursively split the range.
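
    A sketch of that counter-plus-mixer idea, using the well-known splitmix64 finalizer (an illustration, not the actual minperf or fastfilter_cpp code):

        #include <cstdint>

        // splitmix64-style mixer: every step is invertible, so distinct
        // (seeded) counter inputs are guaranteed to map to distinct outputs.
        uint64_t mix64(uint64_t x) {
          x += 0x9E3779B97F4A7C15ULL;
          x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
          x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
          return x ^ (x >> 31);
        }

        // The i-th value of a stream is then mix64(seed + i): no duplicate
        // check is needed as long as the counter values are distinct.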

    In my case, “some” sorting is needed (blocks with ascending ranges). For this, I recursively split the range. That will guarantee you get the right number of entries in total. Within each range, the “no sorting” approach from above is used.

    Splitting a range: I have used the normal distribution to calculate the number of expected entries below the middle (RandomSetGenerator.randomHalf). It is somewhat slow, but if the ranges are large this isn’t a problem.

  5. I wanted to do a thing similar to this at one point, stumbled across algorithms for generating permutations without having to generate the whole set, and realized that the first N values from a random permutation of M values are very much like a set of N distinct values from 0 to M… And if M is a power of 2, there are established methods for using a block cipher like AES-128 to generate a block cipher for fewer than 128 bits. Then you just encode values 0, 1, 2, … and get out distinct values in that range. There’s some overhead, but the advantage is that you don’t have to store them or sort them or anything to avoid duplicates.

  6. Others have already noted the format-preserving encryption method. I built a table of mix function parameters for bits 8..128 so that it’s possible to create a lightweight mix function with a range less than double the required range so the overhead isn’t so bad. However, the real challenge is getting a function with decent coverage of the possible set of output permutations.

    I recently used the sort method in a component that did need monotonic output; but then I added an optional post-shuffle for completeness. In that case I supported an arbitrary minimum distance, and to overcome the overflow problem I reduced the random range to R - (N - 1) * d, then sorted, then added i * d to each result[i]. I believe this is correct because any acceptable result cannot have more than one entry of value R - 1, etc., so the nth-to-last entry cannot be higher than R - n.
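
    A sketch of that minimum-distance trick (my illustration, without the optional post-shuffle; the name pick_min_distance is made up):

        #include <algorithm>
        #include <cstdint>
        #include <random>
        #include <vector>

        // To get N sorted values in [0, R) that are at least d apart, draw N
        // values in the reduced range [0, R - (N-1)*d), sort them, and add i*d
        // to the i-th one.
        std::vector<uint64_t> pick_min_distance(size_t N, uint64_t R, uint64_t d,
                                                std::mt19937_64 &gen) {
          std::uniform_int_distribution<uint64_t> dist(0, R - (N - 1) * d - 1);
          std::vector<uint64_t> out(N);
          for (auto &v : out) v = dist(gen);
          std::sort(out.begin(), out.end());
          for (size_t i = 0; i < N; i++) out[i] += (uint64_t)i * d;
          return out;
        }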

    I would have to think harder to be confident, but you could probably use a rejection loop when your geometric generator went out of bounds on a per-element basis (rather than rejecting a complete set) for correct results, and you could apply the same arbitrary-minimum-distance tweak at that point.

    Also, “99,999” doesn’t really approximate “100 million” as implied in the original post.

    1. Sorry, I should clarify. That table I linked lists parameters that would plug in to @Thomas’ hash44() function, above, to generalise it to the fewest number of bits for whatever range you need so the rejection loop doesn’t spend too much time rejecting out-of-range values.

      That hash construction relates back to Murmur3 and appears all over the place, these days. Somebody described the process of searching for parameters here: http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html

  7. Hi,
    Very interesting and insightful. Thanks for your blog entries!

    There is a related blog post here:
    http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
    I have added his method to your code and it is even faster:

    size_t veryfast_pick_N(uint64_t range_min, uint64_t range_max, size_t N, uint64_t *out) {
      uint64_t range_size = range_max - range_min;
      if (N == 0) return 0;
      if (range_size < N) return 0;
      double p = (double)N / (double)range_size;
      uint64_t previous = random_geometric(p);  // gap before the first value
      if (previous >= range_max) return 0;
      out[0] = previous + range_min;
      size_t i = 1;
      for (; i < N; i++) {
        uint64_t newoffset = random_geometric(p);
        previous += newoffset + 1;
        if (previous >= range_max) {
          break;
        }
        out[i] = previous + range_min;
      }
      return i;
    }

    With that, I have an extra question/request: it would be great to be able to get a random sample of bits from a compressed bitmap. It seems to me that this code would be somewhat easy to use in Roaring Bitmap, am I right?
    Something like sample = bm.random_sample(N) that returns a bitmap sample with all of its bits in bm (that is, bm & sample == sample) and with a cardinality of N.
    Would you consider adding it?
