Fast random shuffling

In a random shuffle, you want to take the elements of a list and reorder them randomly. In a “fair” random shuffle, all possible permutations must be equally likely. It is surprisingly hard to come up with a fair algorithm. Thankfully, there is a fast and easy-to-implement algorithm: the Fisher-Yates shuffle. It is a rather intuitive algorithm and there are YouTube videos about it… so, I will just point you at a piece of C code:

for (i=size; i>1; i--) {
   int p = random_bounded(i); // number in [0,i)
   swap(array+i-1, array+p); // swap the values at i-1 and p

What can we expect to limit the speed of this algorithm? Let me assume that we do not use fancy SIMD instructions or parallelization.

If the input array is not in the cache, and we cannot fetch it in time or it is just too large, then cache faults will dominate the running time. So let us assume that the array is in the CPU’s cache.

If we have N input words, we go through the loop N – 1 times. At each iteration of the loop, you need to read two values and write two other values. A recent x64 processor can only store one value to memory per cycle, so we cannot do better than two cycles per input word. In the very next iteration, you may need to read one of the recently written values. So, two cycles per input word is probably optimistic.

What else could be the problem? The generation of the random numbers could hurt us. Let us assume that we are given a random number generation routine that we cannot change. For this blog post, I will stick with PCG.

What remains? Notice how the Fisher-Yates shuffle requires numbers in a range. The typical techniques to generate random numbers in a range involve frequent divisions.

For example, you might want to look at how the Go language handles it:

func (r *Rand) Int31n(n int32) int32 {
	max := int32((1 << 31) - 1 - (1<<31)%uint32(n))
	v := r.Int31()
	for v > max {
		v = r.Int31()
	return v % n

This function always involves two divisions. Java, the PCG library… all involve at least one division per function call, often many more than one. Sadly, divisions are many times more expensive than any other operation, even on recent processors.

In an earlier blog post, I showed how to (mostly) get around divisions.

In general, no map from all 32-bit integers to a range can be perfectly fair. In practice, the effect is quite small unless your range is close to the maximal value of an integer. Thus you can simply use the following function:

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit =  random32(); //32-bit random number 
  multiresult = random32bit * range;
  return multiresult >> 32;

Maybe you feel bad about introducing a slight bias. You probably should not since the random-number generation itself is unlikely to be perfect.

Still, we can correct the bias. Recall that some of the values are mapped ceil(4294967296/range) times whereas others are mapped floor(4294967296/range) times. By sometimes redrawing a new random value, we can avoid entirely the bias (this technique is called rejection sampling):

uint32_t random_bounded(uint32_t range) {
  uint64_t random32bit =  random32(); //32-bit random number 
  multiresult = random32bit * range;
  leftover = (uint32_t) multiresult;
  if(leftover < range ) {
      threshold = -range % range ;
      while (leftover < threshold) {
            random32bit =  random32();
            multiresult = random32bit * range;
            leftover = (uint32_t) multiresult;
  return multiresult >> 32;

This looks quite a bit worse, but the “if” clause containing divisions is very rarely taken. Your processor is likely to mostly ignore it, so the overhead of this new function is smaller than it appears.

So how do we fare? I have implemented these functions in C, using them to compute a random shuffle. Before each shuffle, I ensure that the array is in the cache. I report the number of clock cycle used per input words, on a recent Intel processor (Skylake). As usual, my code is available.

Random shuffle timings, varying the range function
range functioncycles per input word
PCG library18.0
no division, no bias7
no division (with slight bias)6

Avoiding divisions makes the random shuffle runs twice as fast.

Could we go faster? Yes. If we use a cheaper/faster random number generator. However, keep in mind that without SIMD instructions or multi-core processing, we cannot realistically hope to reach the lower bound of 2 cycles per input words. That is, I claim that no function can be 3 times faster than the fastest function we considered.

You can save a little bit (half a cycle per input word) if you replace the 32-bit PCG calls by 64-bit calls, processing input words in pairs. Using SIMD instructions, we could go even faster, but I do not have access to a SIMD-accelerated PCG implementation… We could, of course, revisit the problem with different random-number generators.

26 thoughts on “Fast random shuffling”

    1. It is important to distinguish a random shuffle from a “fair” random shuffle as discussed. For a fair random shuffle, I do not know a good way to benefit from GPU execution.

  1. Appologies for double posting this, but it applies here too in a way that might not be obvious! On the “Picking N distinct numbers at random” post i mention you can use “format preserving encryption” to get those N distinct numbers. You can actually also use it to get distinct numbers [0,N) which means that the numbers you get out are the index for a shuffle. Since encryption requires a key, it’s seedable to generate multiple shuffles.

    In other words, you can use format preserving encryption to make a shuffle iterator that visits each item once and only once, and tells you when you have visited all items. You can do this in O(1), without actually shuffling the list or even storing the whole list in memory at once!

    If you care about cache friendliness, it (by itself?) isn’t the way to go, but it can be very good for a very quick, virtually storageless shuffle.

    1. Yes, that’s an interesting approach.

      As noted at the beginning of my blog post, it is remarkably difficult to come up with algorithms that produce a fair shuffle. You have to demonstrate that every possible permutation is equally likely. If you drop this fairness requirement, then various faster alternatives become possible.

  2. I don’t understand the:
    “no map from all 32-bit integers to a range can be perfectly fair”

    If the source is assumed to be uniform and the mapping to a smaller range is unbiased then what is not “fair” about the result?

    1. Ah! Nevermind you’re saying in general..mapping a single value to the smaller range. I find the wording off…I was assuming you were talking about a rejection method.

      1. Of course, one problem that one can encounter in such cases is that standard libraries may depend on really slow random number generators (for good or bad reasons). That can make the point irrelevant since the bottleneck is not in the shuffle function itself.

  3. Is there a similar approach for generating doubles in [0;1[?

    Java generates a 53 bit long, then does a division by (double)(1L << 53).

      1. Not that I know of – as far as I know, it can improve Java performance to write *0.5 instead of /2 if it is a highly critical path (it may not often make a difference because of cache wait times). In general 1/N is usually no longer exact representable by a double; so this transformation only works for powers of two without loss IIRC.

        I plan on doing some benchmarks with JMH; but if you have some other ideas, I’d be happy to know.

        1. What would prevent the compiler from turning “/ (double)(1L << 53)" into " * (1 /(double)(1L << 53))" where " (1 /(double)(1L << 53))" can be computed at compile time?

          1. The Java compiler is not meant to do any such optimization – I believe by design they want the compiler to only translate into bytecode, and have HotSpot to do such optimizations. But I don’t think it ever does division to mulitplication optimizations. I believe ProGuard may be able to do such though at the bytecode level.

            In general, it will not be possible: if I’m not mistaken /3 and *(1/3) may not produce the exact same results due to numerical issues. (powers of 2 are; so for above case it is possible – but I could not find such transformations in hotspot).

            1. I think you are correct that the Java compiler won’t do this optimization (though, admittedly, I did not check). But I do not think it is difficult as clang appears to do it automagically with C code when dividing by a power of two.

              This being said, Java does benefit from this optimization in the end because it is hand coded in ThreadLocalRandom (in the Java 8 version I checked).

              I suspect that you got “/ (double)(1L < < 53)" by looking at the Random class, but that's an hopelessly slow class, almost by design. Please see my other blog post on this topic:

              So I think it does answer your original query, does it not?

              1. Yes, in particular at compile time this should be rather easy to do. I can imagine that some of these easy things just add up too much when you do this at runtime.

                But from initial benchmarks, I may be wrong and HotSpot does optimize this to no measureable difference; or CPU pipelineing makes up for these differences.

                1. CPU pipelineing makes up for these differences

                  Your benchmark might have other bottlenecks… but a division is massively more expensive than a multiplication so it has to show in the end result if you don’t have much else going on. Pipelining does not help much with the division because it has both a low throughput and a low latency.

                  My own test seems to indicate that Java is smart enough to turn a division by a power of two into a multiplication:

                  Benchmark Mode Samples Score Error Units
                  m.l.m.r.Division.division thrpt 5 115439397.412 ± 65683.557 ops/s
                  m.l.m.r.Division.divisionBy3 thrpt 5 60568277.782 ± 13316.776 ops/s
                  m.l.m.r.Division.divisionBy3ThroughMultiplication thrpt 5 115443682.953 ± 91673.580 ops/s
                  m.l.m.r.Division.precompDivision thrpt 5 115446809.914 ± 75424.232 ops/s


                  In this particular instance, we see that the throughput is diminished by a factor of two. I would expect a larger difference, but my microbenchmark has limitations.

  4. “What can we expect to limit the speed of this algorithm? Let me assume that we do not use fancy SIMD instructions or parallelization.”

    The Fisher-Yates algorithm appears to me to be inherently sequential. How might parallelisation or SIMD be applicable here (apart from the random number generation)?

      1. Which makes me wonder: how many other shuffling algorithms which might be parallelisable (is that a word?) are as thorough, and anywhere near as simple, as Fisher-Yates?

        1. You can easily parallelize shuffling in the following manner. During a first pass, send each of the values to one of N buckets, by picking a random integer in [0,N). Then shuffle each bucket using Fisher-Yates. Finally, aggregate the shuffled buckets: take all values from the first bucket, then all values from the second bucket and so forth.

          One major defect of such an alternative is that it is not in-place.

  5. Parallel shuffle can be done inplace using partitions (equal size balanced “tree”, unbalanced remainder requires separate shuffling). Partitions are shuffled with Fisher-Yates in parallel. After that one can use random element swap merge to combine partitions with reducing number of threads. Merge can be optimized with simd with store mask or blend.
    Partitions improve cache utilization for data sets that are larger than cache. While merge with linear memory access pattern should use memory bandwidth more efficiently than random access shuffle. This will be slower for small data sets when cache provides enough bandwidth to make Fisher-Yates very fast.

    But after reading your numbers I’m pretty happy with my scalar shuffle because it can shuffle 52 elements (64 bit ints, aka a pack of cards) and reduce to 4 mask elements in 220-240 cycles. I’m currently using lcg generator and multiplicative bound reduction including bitwise-and optimization for power of two ranges. Removing division was the largest optimization (2200 cycles to 450 cycles).

    Second largest improvement was taking stack a temporary stack copy of random number generator (450->280). Without generator copy c++ compiler generated loads and stores for each random number. Loads and stores removed fairly few instructions but those were bottlenecks allowing average instructions per cycle raise from 2.8 to 3.4.

    I end up here looking for simd replacement for Fisher-Yates after my attempt to use vpermq to shuffle 4 element partitions resulted to slower code than scalar shuffle. sse/avx permutations have problem that permutation selecting must be immediate value. Immediate value force slow jump table conversion from random number to permutation operation. Result was about 25% reduction to instruction count but 60% reduction to IPC.

    1. @Pauli

      Thanks for sharing your numbers. Very interesting.

      Parallel shuffle can be done inplace using partitions (equal size balanced “tree”, unbalanced remainder requires separate shuffling).

      My blog post is concerned with a “fair” random shuffles where all possible permutations (N!) must be equally likely. If you divide up the inputs into fixed-sized blocks, shuffle them, and remerge, you will not get a fair shuffle. That might be fine, depending on your needs, but if you allow biased shuffles, it opens up many opportunities, in general, depending on the needed quality.

Leave a Reply

Your email address will not be published. Required fields are marked *