It is common in software to need random integers within a range of values. For example, you may need to pick an object at random in an array. Random shuffling algorithms require such random integers.

Typically, you generate integers uniformly at random over the full range (say a 64-bit integer in [0, 2^{64}) or a 32-bit integer in [0, 2^{32})). From these integers, you find a way to get integers without a range. O’Neill showed that this apparently unimportant operation can be more expensive than generating the original random integers.

Naively, you could take the random integer and compute the remainder of the division by the size of the interval. It works because the remainder of the division by D is always smaller than D. Yet it introduces a statistical bias. You can do better without being slower. The conventional techniques supported in Java and Go require at least one or two integer division per integer generated. Unfortunately, division instructions are relatively expensive.

If you are willing to suffer a slight statistical bias, you can generate a floating-point values in [0,1) and multiply the result by the size of the interval. It avoids the division but introduces other overhead.

There is a better approach that requires no division most of the time:

uint64_t nearlydivisionless ( uint64_t s ) { uint64_t x = random64 () ; __uint128_t m = ( __uint128_t ) x * ( __uint128_t ) s; uint64_t l = ( uint64_t ) m; if (l < s) { uint64_t t = -s % s; while (l < t) { x = random64 () ; m = ( __uint128_t ) x * ( __uint128_t ) s; l = ( uint64_t ) m; } } return m >> 64; }

Let us suppose that I shuffle an array of size 1000. I generate 64-bit integers (or floating-point numbers) which I convert to indexes. I use C++ but I reimplement the algorithm used in Java. Let me look at the number of nanoseconds per input key being shuffled on a Skylake processor (Intel):

Java’s approach | 7.30 ns |

Floating point approach (biased) | 6.23 ns |

Nearly division-free | 1.91 ns |

So the division-free approach is much better.

Is this result robust to the hardware? Let us try on Cannon Lake processor where the division instruction is faster…

Java’s approach | 3.75 ns |

Floating point approach (biased) | 8.24 ns |

Nearly division-free | 2.53 ns |

The division-free approach is less beneficial because of the faster division, but the gains are still there.

What about an ARM processor? Let us try on an Ampere Skylark.

Java’s approach | 20.67 ns |

Floating point approach (biased) | 14.73 ns |

Nearly division-free | 8.24 ns |

Again, the division-free approach wins.

Practically speaking, avoiding integer divisions is a good way to generate faster code.

I make my code available. To ease reproducibility, I have used docker containers: you should be able to reproduce my results.

**Further reading**: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

Daniel Lemire, "Nearly Divisionless Random Integer Generation On Various Systems," in *Daniel Lemire's blog*, June 6, 2019.

It looks like you may have incorrectly transcribed a value for the ARM table “Floating point approach (biased)”.

Correct. Thanks for pointing it out.

Hi Daniel,

I don’t understand,

uint64_t t = -s % s;

I interpret that as the remainder when -s is divided by s.

Tnx.

Think: (2^64-s) mod s

For -s % s to compile in VC one has to write (0-s) % s.

The code sample I have provided in my blog post is unlikely to compile under Visual Studio for other reasons as well.

This compiles with MSVC:

`uint64_t nearlydivisionless ( uint64_t s ) {`

uint64_t x = random64 () ;

uint64_t m;

uint64_t l = _umul128(x, s, &m);

if (l < s) {

for (uint64_t t = (0ULL - s) % s; l < t; l = _umul128(x, s, &m))

x = random64 ();

}

return m;

}

The _umul128() intrinsic is available for AMD64 only; it can but trivially implemented for I386: see https://skanthak,homepage.t-online.de/nomsvcrt.html

I have a question on the array-shuffling algorithm. Your approach seems to be

for (int i = len – 1; i > 0; i–)

swap(a+i, a+random_range(i));

A more naïve approach would be

for (int i = len – 1; i >= 0; i–)

swap(a+i, a+random_range(len));

Is there a reason why the second algorithm shouldn’t be used? Or is it simply slower because (for large arrays) it spills from L1 to L2 or L3 or RAM more often than the first?

@Jörn

Do you have a proof that your proposed algorithm is correct? That is, assuming that random_range is a perfect generator, you generate all permutations with equal probability?

The algorithm I use is from Knuth’s. It is correct. There might be other, faster algorithms, but to be considered, they must be known to be correct.

The method is biased, although it is not entirely obvious.

If you work though the math by hand for the case of 3 elements, you’ll see that element 0 is less more likely to end up in position 1 than it is to end up in position 2. You can work out the path by hand, maybe by drawing a tree of all possible 27 combinations of swaps: you’ll find that 8 outcomes lead to element 0 in position 2, while 10 outcomes place it position 3.

Or just simulate it.

Thank you Travis.

It’s actually interesting to look at the pattern for arrays of say 10 elements:

`naive:`

0 1 2 3 4 5 6 7 8 9

a[0] 10.0 10.1 9.9 10.0 9.9 10.0 10.0 10.0 10.0 10.1

a[1] 12.8 9.4 9.3 9.5 9.6 9.7 9.8 9.8 9.9 10.1

a[2] 12.0 12.4 8.9 9.2 9.3 9.3 9.5 9.8 9.8 9.9

a[3] 11.2 11.5 12.1 8.8 8.9 9.2 9.4 9.5 9.7 9.8

a[4] 10.5 10.9 11.4 11.9 8.6 8.8 9.1 9.3 9.7 9.9

a[5] 9.8 10.3 10.8 11.3 11.7 8.6 8.9 9.1 9.6 10.0

a[6] 9.0 9.7 10.1 10.5 11.3 12.0 8.6 9.2 9.6 10.0

a[7] 8.6 9.2 9.7 10.1 10.8 11.3 12.0 9.0 9.4 9.9

a[8] 8.2 8.5 9.0 9.7 10.1 10.8 11.7 12.3 9.6 10.1

a[9] 7.8 8.1 8.8 9.1 9.7 10.4 11.2 11.9 12.7 10.2

Notice the diagonal stripe of higher probabilities running from the top left the bottom right. It can be discribed as “position N is are much more likely to end up with the value that started in a position < N than >= N”. So for example, a[1] is much more likely to get the value that started in a[0] than any other value.

The intuition is this. Consider the second to last step in the loop, where are swapping a[1] with a randomly selected element. Since a[1] swaps with any element with equal probability, after this step a[1] has 0.1 probability of having any specific value, and in particular it has 0 with with p=0.1.

Now, we do the last swap for a[0]. At this point a[0] has not necessarily been swapped, unless it was selected randomly in an earlier step. You can show that this means it will still have zero ~38% of the time (limit of (1-1/n)^n as n->inf). Now you swap that value with another position. Position a[1] will be selected with p=0.1. So the final probability that a[1] ends up 0 is the 0.1 it was zero after the second to last step (and didn’t get swapped in the last step), plus the chance it was selected in the last step: 0.1×0.9+.38×.1 = 0.128 (12.8 has shown in % in the table).

Once that is clear the rest of the table makes sense: each position has a higher probability than uniform of ending up with values in earlier positions, and then the opposite must be true for later values for everything to add up.

This bias doens’t go away as you increase the size of the array (I had originally thought it would). In fact, it gets more extreme if you compare the most biased elements, i.e., for larger arrays some outcomes have a 2x higher probability than others.

Code blocks don’t preserve spacing, and don’t look the same as the preview 🙁

Another try with leading zeros to keep format:

`naive:`

0 1 2 3 4 5 6 7 8 9

a[0] 10.0 10.0 10.0 10.0 10.1 10.0 10.0 09.9 10.0 10.0

a[1] 12.8 09.4 09.4 09.5 09.6 09.6 09.8 09.9 09.9 10.0

a[2] 12.0 12.4 09.0 09.1 09.2 09.4 09.5 09.7 09.8 10.0

a[3] 11.2 11.6 12.1 08.7 08.9 09.1 09.2 09.5 09.7 10.0

a[4] 10.4 10.9 11.3 11.9 08.6 08.8 09.1 09.3 09.7 10.0

a[5] 09.8 10.2 10.7 11.2 11.8 08.6 08.9 09.2 09.5 10.0

a[6] 09.2 09.6 10.1 10.6 11.2 11.8 08.7 09.1 09.5 10.0

a[7] 08.7 09.1 09.6 10.1 10.7 11.4 12.1 09.0 09.5 10.0

a[8] 08.1 08.6 09.1 09.6 10.2 10.9 11.6 12.4 09.5 10.0

a[9] 07.7 08.2 08.7 09.2 09.7 10.5 11.1 12.0 12.8 10.0

Very nice. Thank you!

Wouldn’t it be sufficient, to drop the first 4 lines (incl. if) and setting l = 0 as start condition with a foot-controlled do-while? (first calculation block looks the same like the while-body).

Can you point to actual code?

So in

`nearlydivisionless`

, isl >= sthe necessary and sufficient condition to produce an unbiased value? Why do you even switch to another algorithm then, which requires division, if you can simply try again without division? I.e. would the following work?`uint64_t ratherdivisionless(uint64_t s)`

{

while (1) {

uint64_t x = random64();

__uint128_t m = (__uint128_t) x * (__uint128_t) s;

uint64_t l = (uint64_t) m;

if (l >= s)

return m >> 64;

}

}

There are many possible algorithms. If you follow the link to Melissa’s post, you’ll find several. You may find an unbiased algorithm that is faster. That’s entirely possible. But I submit to you that you need to provide a proof of correctness and benchmarks to demonstrate that it is faster.

I think you can prove that svpv’s point is a good and general one, without getting your hands dirty (i.e., testing it).

If you have a “rejection sampling” method (because that’s what this is), which uses a fast method taking F time that succeeds with probability p, and on failure falls back to a slow method that takes S time, isn’t the fastest fall back method just to do the same fast sampling method? If it isn’t, it contradicts the idea that the rejection sampling method was faster in the first place (you should just use the “slow” fallback in that case, since it’s actually faster). Probably you can show it formally just by expanding the infinite sequence F

p + S(1-p) substituing the whole thing again for S.The difference is microscopic in this case for most interesting

`s`

since the first attempt succeeds with overwhelming probability except for very large`s`

, however it has the advantage being simpler in implementation, smaller in code size and of not needing to justify the bias-free nature of two different methods.As a practical matter, however, you need the special fallback case because the l < s approach fast path will blow up as s approaches 2^64. Few people are going to pass that, but you don’t want to take a billion years or whatever when they do. It’s kind of too bad though, if you could just use 65 bits instead of 64 for that check, you wouldn’t need the fallback case.

This comment is wrong: it should be ignored.

I can delete this comment

https://lemire.me/blog/2019/06/06/nearly-divisionless-random-integer-generation-on-various-systems/#comment-411232

if you’d like.

Nah, maybe there is a useful tidbit in there.

Actually I stand by the overall point, but I misread the code so it definitely doesn’t apply here.

The problem of large s is quite valid. At a certain point a basic rejection method becomes cheaper, because you’re doing the % on a large proportion of values.

The divisionlessness is unfortunately a function of how many excess random bits we use.

You could do a check for s greater than some threshold and use a simple technique.

A check right at the start would work but would be in the fast path. Is it possible to defer the check until after the first (l < s) check? That would move it out of the fast path and make it promising.

I think for practical reasons small s dominate, especially with 64-bit values, but yeah it would be nice to elegantly handle the large s case too.

One way might be to add bits to the whole calculation when we hit l < s. Basically divide the bottom bucket into 2^b buckets (using b more random bits), reevaluate the product and l, and hope for l >= s in the wider calculation.

I don’t think this biases it; it just reevaluates at a higher resolution.

Yeah I mean for say 32-bit s on a 64-bit machine that approach is even easy to do up front, use say a 40-bit l instead of 32-bit and you don’t have to worry about big s any more.

With 64-bit muls and the result split across two registers it’s not easy though.

No, it needs to check t.

The range for any given output j contains either floor(2^64/s) or ceil(2^64/s) multiples of s. The check on l < s merely finds if l is the first multiple in the range; the check on t figures out if there’s room for one more value at the end.

Why do you need to check t? You can simply throw out any values that values fall into the range [s, t] without checking if it was possible to extract an unbiased value from it.

In fact, that’s exactly what Daniel’s code does on the first iteration: it only checks against s. Passing check is sufficient, although not necessary.

I don’t understand your wording, so I may misunderstand, but I see a bias remaining if the l < s rejection is the only one applied.

The goal is to get an equal number of accepted values within each interval, where we only use every s-th value, and the offset of the first value can vary.

Given an interval [0, L) we get the most values if the offset is 0:

0, s, 2

s, … ks < L == k*s + tWe get one fewer value by starting at t:

t, s+t, … (k-1)

s+t < L == ks+tSo in the former case (or at any offset < t) we have to reject one value, but not if t <= l < s.

In other words, if we chop off the first s positions we still have a varying number of values to the right.

You are right, I hadn’t actually considered the math and misunderstood what what was going on.

I thought the rejection condition was

`l < s`

and then everything after that was just a different algorithm for generating another value (which is why I was confused why a different algorithm would be used), but as you point out the true rejection criteria for both the initial and retry loops is`l < t`

.The

`l < s`

is just a fast path to see if we even need to bother computing`t`

. After we compute`t`

the second`l < t`

check is applied and the value may not be rejected after all.Once you’ve computed

`t`

you might as well stay in the inner loop with the`l < t`

condition, since there is no advantage anymore of the two tier check (indeed, it would be slower). So that answers the original query of svpv:No, it is not sufficient. The check

`l >= t`

is the sufficient condition (strictly speaking it is notnecessaryfor most s, since you could also choose some reduced interval and reject more values, but that would be dumb).Thanks for setting me straight.

Someone (not me) should submit a fix to std::uniform_int_distribution. At least in libstd++ it looks like a total dog in this respect, doing at least

twodivisions per random value generated:`if (__urngrange > __urange)`

{

// downscaling

const __uctype __uerange = __urange + 1; // __urange can be zero

const __uctype __scaling = __urngrange / __uerange;

const __uctype __past = __uerange * __scaling;

do

__ret = __uctype(__urng()) - __urngmin;

while (__ret >= __past);

__ret /= __scaling;

}

So something like 100+ cycles for generating a random int one many modern machines?

… and by “fix” I mean use this multiplicative technique.

Following some of my other comments on this thread, I put together a version of this that falls through to a multiply instead of a division if the initial check fails. It took me a bit of fumbling with basic arithmetic, but I eventually realized that extending to a 64-bit fraction only requires starting with the 32-bit version and a similar cascade through a couple of cheap-to-check conditions.

See the RangeGeneratorExtended here:

https://gist.github.com/KWillets/b6f76894c115e41339bcb8d34bf9ea49

Since Java does not have an 128 bit integer type:

Is it safe to “downgrade” this to output 32 bit random integers by using 64 bit where you used 128 bit, and 32 bit where you used 64 bit?

Is there any difference to https://lemire.me/blog/2016/06/30/fast-random-shuffling/ besides now being published?

The blog post you comment upon is entitled “Nearly Divisionless Random Integer Generation On Various Systems”. The post you refer to only benchmarked on one system (processor).

It is safe for 32-bit integers, yes.