Benchmarking algorithms to visit all values in an array in random order

In an earlier post, I described how to visit all values in an array in a pseudo-random order quite fast. The idea is simple, given an array of size n, pick a value a that is coprime with n. Then for any value of b, you have that (a x + b ) modulo n will visit all values in [0,n) exactly once as x goes from 0 to n. The word “coprime” is a fancy way of saying that the two numbers do not have a non-trivial (different from 1) common divisor.

It is nice because you can visit all values while using very little memory and without changing the order of the values in your array.

After some trivial optimization, the resulting code is as simple as this application of the ternary operator

```value = (value + prime < n) ?
value + prime :
value + prime - n;
```

where we initialize value to a random number in [0,n), and we set prime to be an adequately chosen random number that is coprime with n and not too small.

(The ternary operator x ? y : z just means: if x is true, return y otherwise return z.)

Though the ternary operator looks like a branch, recent compilers are likely to compile this code, when targeting recent processors, as a conditional move. Though conditional moves used to be slow and generally a bad idea, they are now quite cheap. Very few CPU cycles are needed.

Many readers insisted that I was silly to design a method that fits values in [0,n). All one needs to do is to find the largest power of two larger or equal to n (let us call it 2L), and then visit all of these values, skipping over up to n/2 values in the process.

We have many nice algorithms to visit all values in an interval [0,n) exactly once. One approach is to use a linear congruential generator. (The term linear congruential generator or LCG, is a fancy way to say that you have recursive linear function involving a modulo reduction. There is a bit of non-trivial mathematics from the 1960s involved. Many widely used random number generators rely on an LCG.).

In the end, the code then looks something like this…

```lcg = (a * lcg + 1) & (size2 - 1);
while (lcg >= size) {
lcg = (a * lcg + 1) & (size2 - 1);
}
```

That is, we move to the next value within [0,2L), and must keep hoping to another value, as long as we are outside of [0,n). In the worst case, we may need to hop n/2 times, so you cannot guarantee that this code has a constant-time complexity.

More critically, this creates mispredicted branches. Mispredicted branches can cost tens of CPU cycles!

So while it looks like the clever and natural approach, padding your sizes to the nearest power of two has potential downsides.

So what is the performance?

I rewrote my code in C because I hate benchmarking in Java. My benchmark takes values from one source array and rewrite them, shuffled, into another array. I record the number of CPU cycles used per cycle. I think that the absolute minimum on an x64 processor would be one cycle, because you cannot store more than one word per cycle. It is an array of integers.

 n fast ([0,n)) LCG ([0,2L), [0,n)) 3500 4.4 6.9 24,500 4.3 12.9 171,500 4.1 17.6 1,200,500 4.3 23.1 8,403,500 4.1 32.1

Clearly, my weakly random approach is very fast compared to the power-of-two LCG. Interestingly, my approach is consistently fast as the arrays get larger.

What could be causing such a large difference in performance? Thankfully, Linux makes it very easy to measure the number of mispredicted branches. So we can check what happens in this respect with our fast code when the array size grows…

```\$ ./run-fast.sh
100
5,756      branch-misses
1000
5,550      branch-misses
10000
6,163      branch-misses
100000
6,417      branch-misses
1000000
7,448      branch-misses
10000000
8,154      branch-misses
100000000
13,444      branch-misses
```

So the number of mispredicted branches is tiny and grows very slowly. What about the approach that consists in iterating over values in a power-of-two array, using branches to fall back within [0,n)?

```\$ ./run.sh
100
6,584      branch-misses
1000
6,481      branch-misses
10000
315,945      branch-misses
100000
1,724,468      branch-misses
1000000
2,519,027      branch-misses
10000000
374,239,031      branch-misses
100000000
1,873,408,379      branch-misses
```

It is an entirely different matter. For some large values of n, the processor is unable to accurately predict branches. The performance collapses.

I should add that neither of these techniques is likely to generate unbiased random shuffles. If you work for an online casino, please don’t use these techniques to shuffle cards. They might be good enough, however, if you need to select randomly where some computing jobs should go.

I should add, also, that we could adapt the LCG so that it directly work in the interval [0,n). This could potentially generate better random numbers. Sadly, while there is well established mathematics to help you find the right parameters so that your recurrence formula visits all values exactly once, I am not aware of equally convenient mathematics that you can apply to get “nice randomness”.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

5 thoughts on “Benchmarking algorithms to visit all values in an array in random order”

1. Sean O'Connor says:
2. Max says:

“and we set prime to be an adequately chosen random number that is coprime with n and not too small.”
Fun fact: So how to count down?
Choose start value to be 0 and the coprime to be n-1.

3. CamelCoder says:

Isn’t it possible to create an LCG with the modulo of n?
That would only cost one more multiplication and have the same time complexity that the additive method, whiles having way higher quality randomness. The initialization times are worse, though that should be benchmarked separately, as one can reuse the a constant and only compute new values for c to visit if the array length stays the same.

I wanted to try this my self, but your code seems to be not quite right:

```static uint32_t productOfAllPrimeDivisors(uint32_t val) { uint32_t pot = 2; uint32_t answer = 1; while ((uint64_t)pot <= val / 2) { if (IsPrime(pot)) { // but pot is never modified answer *= pot; while ((val % pot) == 0) { val /= pot; } } } return answer; } ```

1. Isn’t it possible to create an LCG with the modulo of n?

Yes but I expect you will find that it is slower.

4. Simon Labrecque says:

That’s awesome. Thanks!

You may subscribe to this blog by email.