In software, we sometimes want to generate (pseudo-)random numbers. The general strategy is to have a state (e.g., a 64-bit integer) and modify it each time we want a new random number. From this state, we can derive a “random number”.

How do you that you have generated something that can pass as a random number? A gold standard in this game is L’Ecuyer’s Big Crush benchmark. It is a set of statistical tests. It is not sufficient to “pass” Big Crush to be a good random number generator, but if you can’t even pass Big Crush, you are in trouble.

When I need a super fast and super simple random number that qualifies, I go for Lehmer’s generator:

__uint128_t g_lehmer64_state; uint64_t lehmer64() { g_lehmer64_state *= 0xda942042e4dd58b5; return g_lehmer64_state >> 64; }

Once compiled for an x64 processor, the generator boils down to two 64-bit multiplication instructions and one addition. It is hard to beat! The catch is that there is non-trivial data dependency between the calls when using the same state with each call: you may need to complete the two multiplications before you can start work on the next function call. Because our processors are superscalar (meaning that they can do several instructions per cycle), it is genuine concern. You can break this data dependency by having effectively two generators, using one and then the other.

Lehmer’s generator passes Big Crush. There are many other fast generators that pass basic statistical tests, like PCG64, xorshift128+, but if you want raw speed, Lehmer’s generator is great.

Recently, a new fast contender has been brought to my attention: wyhash. It is closely related to a family of random number generators and hash functions called MUM and designed by Vladimir Makarov (there is a nearly identical generator by Makarov called mum-prng). The new contender works as follow:

uint64_t wyhash64_x; uint64_t wyhash64() { wyhash64_x += 0x60bee2bee120fc15; __uint128_t tmp; tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d; uint64_t m1 = (tmp >> 64) ^ tmp; tmp = (__uint128_t)m1 * 0x1b03738712fad5c9; uint64_t m2 = (tmp >> 64) ^ tmp; return m2; }

Wyhash (and presumably mum-prng) passes rigorous statistical tests.

On an x64 processor, the function generators two multiplications, one addition and two XOR. If you are counting, that’s only two instructions more than Lehmer’s generator. Like generators from the PCG family, wyhash updates its seed very simply, and so you can pipeline the generation of two or more random numbers with minimal data dependency between them: as soon as one addition is completed, you can start work on the second number.

Both of these generators might be relatively less performant on ARM processors due to the high cost of generating the full 128-bit product on ARM architectures. They are also both relatively harder to implement in a portable way.

This being said, which is faster on my x64 processor?

Let us run the experiments. I am going to work over sets of 524288 random numbers. I am using a skylake processor and GNU GCC 8. I make my source code available.

First, I just sum up the random numbers being generated.

wyhash | 0.51 ms |

Lehmer’s | 0.63 ms |

Lehmer’s (two gen.) | 0.48 ms |

Lehmer’s (three gen.) | 0.37 ms |

From run to run, my margin of error is about 0.02.

Next I am going to store the random numbers in an array.

wyhash | 0.6 ms |

Lehmer’s | 0.6 ms |

Lehmer’s (two gen.) | 0.6 ms |

Lehmer’s (three gen.) | 0.4 ms |

So using three Lehmer’s generators is best.

Of course, using parallel generators in this manner could be statistically unsafe. One would want to run further tests.

**Credit**: Monakov suggested going to three generators. The post was updated accordingly.

**Further Reading**: There were apparently some Hacker News comments on both a new hash function (XXH3) and wyhash.

**Credit**: Wyhash was invented by Wang Yi.

**Update**: I have implemented wyhash in Swift.

Armv8 has UMULH instruction for getting bits [127:64] of the 64×64 multiplication, so it should be the same amount of the instructions, as on x64.

On x64, you need a single instruction to generate the 128 bits of the product of two 64-bit values. As far as I know, on ARM, you will need to multiplication instructions (mul and umulh) to get the same result, and the umulh instruction is sometimes much slower than the mul instruction.

Total amount of instructions is same, but on the arm it is 3 multiplications: umulh, madd, mul. On x64 it is: mulx, imul, add.

Indeed umulh is slower than mul, but not dramatically. Execution throughputs are 1/4 vs 1/3 on Cortex A57.

From Faster Remainder by Direct Computation:

Quoting directly from the ARM documentation:

That’s correct, but from the same doc: latency for 64×64 MUL is 5 cycles and 3 cycles M pipeline stall. 5 cycles latency does not look much faster than 6.

Cyril… We go from two multiplications (on x64) to three multiplications, including one that is particularly expensive. The instruction count might be the same, but we know not to determine the speed of a piece of code to its instruction count when some instructions are much more expensive than others.

See my follow-up blog post: ARM and Intel have different performance characteristics: a case study in random number generation. I show that wyhash, which is so good on x64, is no longer too great on ARM.

I don’t report the Lehmer’s results in this following blog post… but maybe you will trust me, after reading that other blog post, if I tell you that neither Lehmer’s nor wyhash are competitive on my ARM server.

I’ll be glad to give you access to my ARM server if you want to prove me wrong…

BTW Cortex A75 is better example where umulh is slow. While on A57 it is more or less the same, as regular mul in terms of latency and cpu cycles, on A75 it is exactly 2 times slower, then mul. https://static.docs.arm.com/101398/0200/arm_cortex_a75_software_optimization_guide_v2.pdf (page 16)

Unfortunately I don’t have access to the real A75 hardware and can’t run your tests.

Yes: in Lehmer’s generator, dependency chain is 5 cycles (4 cycles for high-part result of low-parts multiplication, plus 1 cycle for adding the result of high×low multiplication), there’s only one multiply unit, and so with only two independent generators the multiplier is not fully utilized: we’re supplying 4 multiply ops each 5 cycles.

Thus, using 3 independent generators improves the situation (without running into undesirable effects like register spilling).

You are correct. I have updated my blog post and credited you.

For people reading this. The processor can issue a multiplication per cycle but it takes three cycles for it to complete. Hence the number three. We need three streams to keep it busy all the time.

And, of course, if the rest of your application needs the multiply unit, you are in trouble.

(This is specific to current x64 processors.)

Uhm. Your answers to Cyril and me are written as if Lehmer’s generator uses a 64×64 bit multiplication, but the code you show actually needs a 128×64 bit multiplication, and there’s no way to express that as one multiplication instruction, not on AMD64, not on AArch64. Check the assembly generated by GCC: you’ll see one ‘mul’ instruction and one ‘imul’ instruction.

My blog post states:

Once compiled for an x64 processor, the generator boils down to two 64-bit multiplication instructions and one addition.That’s for x64. Cyril stated that on ARM, you’d have the same number of multiplications. I don’t think that’s correct. For ARM, you are going to need something like three multiplications, one of which will be UMULH, which is slow on the ARM machines I know.

As for my answer to your comment, as long as you have enough registers, it should be obvious that you can issue one multiplication per cycle if you have three generators because of the three cycle latency (assuming you have a superscalar processor with sufficient width to fit in the loads and the adds). It is also obvious from your analysis that two generators are not enough.

My bad — misinterpreted what you said the first time around. Sorry. On AArch64 there should be one more mul (so three in total rather than two as on AMD64).

I still think your follow-up to my comment was somewhat confusing: with two generators, you have

fourindependent multiplications, which is enough to cover the basic three-cycle latency. The problem is that dependency cycle is not 3 ticks (and I suspect I miscalculated earlier and just from nominal latencies the critical cycle is 4 ticks, not 5).According to Agner Fog, we have the following on recent Intel processors… for register inputs, you have 3 cycles of latency for 64-bit multiplications (one extra cycle for 32-bit but it is irrelevant here).

Of course, if one of the inputs is in memory, then the latency is expected to be longer… but let us ignore this for the time being.

You have a dependency chain of at least 5 cycles as per your analysis. Maybe you want to argue that the dependency chain is longer, fine… but let us agree that it is

at least5 cycles. In 5 cycles, you do two multiplications. So that’s one multiplication every 5/2 = 2.5 cycles. Intel processors can do one multiplicationevery cycle.I think it follows that we need to have more than two generators to keep the multiplication unit fully occupied.

Now this is a bit rough because we are making the implicit assumption that we are strictly limited by the multiplication. We probably need to do a better job analyzing the flow of instructions to really understand what is going on.

https://www.agner.org/optimize/instruction_tables.pdf

I don’t see any reason to believe that 3 independently seeded generators would be unsafe compared to using a single generator. If anything, it could be slightly better since there is 3x the state and 3 uncorrelated streams rather than 1.

Of course, I guess you could just bundle up your 3x generator and put it through the available tests like big crush.

That aside, once you accept multiple generators and so effectively measure throughput rather than value-to-value latency, won’t the fastest approach be SIMD? That changes the playing field dramatically (e.g., there aren’t even any 64-bit input multiplications available up to and including AVX2).

I don’t see any reason to believe that 3 independently seeded generators would be unsafe (…)In this instance, I also cannot imagine what the problem might be, but I thought I’d be prudent and urge people to check before they adopt this!

won’t the fastest approach be SIMD?Yeah but I excluded SIMD deliberately here.

I also did not discuss latency.

Maybe it got lost in an edit, because another commenter quotes “Can you do better without any specialized instructions” but I don’t see that now.

One could argue that the rules of the game are a bit unclear: for example, when you use 2 or 3 Lehmer generators, you unroll the loop by 2 or 3 and call each in turn in the “user” code – so it’s not really implementing the RNG API, but rather an end-user optimization using separate generators which may be easy or hard to use in practice depending on the situation.

To keep the same API you could wrap the 2 or 3 generators up into one that gives the usual interface and alternates among the generators using an internal flag. In

principlea compiler could even undo this packaging as an optimization, if it unrolled the loop body by the right amount and compiled away the control flow – but I think it is unlikely with today’s compilers in most scenarios.You are correct that I wasn’t clear as to the rules of the game. But we both know that SIMD could kill these scalar functions, as long as the compiler can’t autovectorize well.

Auto-vectorization

isSIMD, no?Well I would say it isn’t totally obvious that SIMD will do

muchbetter here since both presented approaches use wide integer multiplications which is a weak spot for x86 SIMD, which doesn’t offer any 64-bit input multiplications at all.… but yeah I guess SIMD would win perhaps with something like xorshift variant.

When I refer to SIMD, I include auto-vectorization, always. In my comment above, I should qualify… with “programmer’s deliberate use of SIMD”.

Don’t guess, just test it!

On x86-32, 64-bit PRNGs which don’t need a 64*64-bit multiplication or division run a

BITfaster when implemented using MMX (or SSE).If you can run PE32/PE32+ executables: fetch MMX, 32-bit and 64-bit implementation; these run the PRNGs from NOMSVCRT in a tight loop. For a more realistic test, fetch MMX, 32-bit and 64-bit implementation: these run the same PRNGs with the MonteCarlo approximation I already mentioned here.

Right, but which can pass Big Crush from TestU01?

Testing a 64-bit PRNG with TestU01 is

NO GOOD IDEA, better use Practically Random instead.A properly implemented 64-bit PRNG produces the same sequence on x32 and x64 (or any other processor architecture), with or without the use of MMX/SSE/Neon/AVX.

Bob Jenkin’s Small Fast prng64(), the enhanced and

truemiddle-square msqr64(), the hashed arithmetic progression drbg64() (same as Daniel’s adaption of wyprng()), the scrambled MCG smcg64() pass full runs of PractRand up to 32TB.The square-with-carry I mentioned here passes at least up to 4TB.

Testing a 64-bit PRNG with TestU01 is NO GOOD IDEA, better use Practically Random instead.How can it possibly be a bad idea? I would agree that you should use Practically Random, but what harm may come from running Big Crush… other than climate change due to CPU heat?

Though I agree that PractRand seems both more modern and possibly all around better… it is certainly easier from a software engineering point of view… TestU01 is the one benchmark we have that is backed by years of peer-reviewed research… and its Big Crush is a “de facto” standard. It is also a static and determined target as long as you describe how you ran the tests (you need to be specific because Big Crush is essentially a 31-bit test).

It’s no good idea (I didn’t say bad idea) because TestU01 was designed for 32-bit numbers, so you need to take quite some precautions or apply spoon-feeding when testing 64-bit PRNGs.

While Lehmer’s multiplicative congruential generator is fast, it does

NOTpass PractRand 0.94 (but you know this already). In does it beat the minimal standard and too big to fail Melissa O’Neill tested these generators with PractRand 0.93, where they passed.Consider an idea from George Marsaglia instead: save and add the “carry” of the multiplication, giving a multiply-with-carry generator:

`uint64_t seed = 0, carry = 0x...;`

uint64_t square_with_carry() {

__uint128_t mwc = (__uint128_t) seed * 0x... + carry;

seed = mwc;

carry = square >> 64;

return seed;

}

Or a square-with-carry generator:

`uint64_t seed = 0, carry = 0xb5ad4eceda1ce2a9;`

uint64_t square_with_carry() {

__uint128_t square = (__uint128_t) seed * seed + carry;

seed = square;

carry = square >> 64;

return seed;

}

This 64-bit generator passes the PractRand test suite (version 0.94) at least up to 4 TB!

See source and source for the code generated by GCC 8.2 and clang 7.0 respectively, plus VC source for an implementation using Microsofts Visual C compiler (which does not support a 128 bit integer type).

I used the latter for the PractRand test.

I turned your comment into an issue on GitHub: https://github.com/lemire/testingRNG/issues/14

I was trying to understand why Lehmer’s with 3 generators is more performant on A53, while single generator perfectly loads pipeline with something about 9 cycles consumed and 12 cycles latency (this is calculated by A57 reference, but should be relevant for A53 too). 3 instructions are required:

`umulh x9, x22, x23`

mul x22, x22, x23

madd x19, x19, x23, x9

Then I realised that test uses different iterations count, so basically it tests branch prediction efficiency. After running the fixed version results became more reasonable:

`wyrng 0.009253 s`

splitmix64 0.007591 s

lehmer64 0.007093 s

lehmer64 (2) 0.00733 s

lehmer64 (3) 0.007269 s

`Next we do random number computations only, doing no work.`

wyrng 0.011513 s

splitmix64 0.008451 s

lehmer64 0.00664 s

lehmer64 (2) 0.006763 s

lehmer64 (3) 0.006762 s

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/pull/30

Excellent, thanks.

Running the PRNGs in a tight loop just summing their results is typically not seen in practice: real-world applications perform some (more) computation with the random numbers.

Consider to change your timing setup a little bit and perform some (silly, but yet sufficient) computation, like the MonteCarlo determination of pi:

`unsigned hits = 4000000;`

for (unsigned i = hits; i > 0; i--)

{ const unsigned x = prng(), y = prng();

const unsigned long long z = ~0ULL * ~0ULL;

hits -= z - (unsigned long long) x * x < (unsigned long long) y * y;

}

printf("Pi = %lu.%06lu\n", hits / 1000000, hits % 1000000);

It is not. Indeed. You are correct.

What is up with ‘z’ in your code? Why not just write ‘1’?

The (f)ull and true square of ~0UL alias UINT_MAX is not ‘1’, but ~0ULL * ~0ULL or (~1ULL << 32) | 1ULL: (x-1)

(x-1) = xx – 2*x + 1. While precision surely does not matter for this MonteCarlo approximation, I don’t want to give a bad or questionable example.wyhash64 is half-awesome, half-questionable.

It is similar to PCG in that it combines a weak base generator and an output function to create a strong generator. One problem with PCG is that the base generator’s critical path is using MUL+ADD, which takes 4 cycles on Intel. No matter how fast the output function, performance cannot exceed one output per 4 cycles.

wyhash64 uses ADD for the base generator, latency 4 cycles. That’s awesome. Base generator is weaker than an LCG, but we can compensate with a stronger output function. The carry bits still result in some amount of randomness that we can “amplify” by multiplications.

Questionable is the output function. XOR of the low and high multiplication result is not an invertible function, so some outputs are impossible to generate. See for yourself:

long total = 0;

for (int m = 1; m < 65536; m += 2) {

uint8_t hit[65536] = { 0, };

for (int i = 0; i < 65536; i++) {

uint32_t state = i;

state *= m;

state ^= state >> 16;

hit[state & 0xffff]++;

}

int hits = 0;

for (int i = 0; i < 65536; i++)

hits += !!hit[i];

printf(“%5d: %5d\n”, m, hits);

total += hits;

}

printf(“average: %5ld\n”, total / 32768);

Not sure how bad the problem is for 64bit multiplication, but I would expect about half of all 64bit numbers to be absent from the output. Arguably that is bad. Or good, depending on one’s view.

The bad side is pretty obvious. The good side is that collisions in the output functions means random numbers get generated multiple times before the 64bit counter loops. So this PRNG doesn’t fail a birthday paradox test, in spite of having 64bit state and 64bit output.

Assuming you consider the non-invertible output function to be bad, that should be relatively easy to fix. Use regular 64bit multiplication and xorshift.

state ^= state >> 32;

state *= M1;

state ^= state >> 32;

state *= M2;

state ^= state >> 32;

Something like the above should work. It should also be faster on Arm, 32bit x86 and some other architectures. We might get away with a single multiplication if we use a random xorshift like PCG does. I should try some variant and see if they survive practrand.

This is mathematically intuitive.

Based on fairly quick tests, the two fastest output functions that might work (up to several Gigabytes at least) are:

`state ^= state >> 32;`

state *= M1;

state ^= state >> 32;

state *= M2;

state ^= state >> 32;

and:

`old ^= old >> ((old >> 59) + 5);`

old *= M1;

old ^= old >> ((old >> 59) + 5);

I don’t think it makes a difference whether you use the same or two different multipliers. You can also make the multiplier and the increment of the base generator the same. In a way, the increment is just another multiplier anyway.

My usual rules for the multiplier are that it has to be odd (obviously) have a popcount of roughly 32 and should avoid long stretches of 0 or 1. Pick your favorite commit hash, skip any hex digits that are 0 or f and ensure the last bit is 1 and you have a good multiplier.

And because the precise multiplier doesn’t matter very much, it is easy to have multiple independent PRNG using different multipliers. Multipliers still have to be checked, but the check is much easier than a multi-dimensional spectral test for an LCG.

Better compare it with Fortuna-like designs, which apply a block cipher or hash function to a simple counter: the first part of wyhash64 is an arithmetic progression alias Weyl sequence.

It differs from other generators which use a simple multiplicative (non-cryptographic) hash function by ‘folding’ the high part of the product back into the output.

16 months since this article was published, so here is a small update on this subject. I posted a proposal for a PRNG (

tylo64) over at thenumpygithub in June:https://github.com/numpy/numpy/issues/16313#issuecomment-641897028

It is heavily inspired by

sfc64, but with improvements that makes it evenquite a bit faster than wyHash(see below), and at the same time robust for massive parallel usage, due to the added user specified Weyl increment. I.e. each k (odd number) guaratees a unique period of 2^64 (but average ~ 2^94). This feature practically removes the need for a jump function (which sfc64 and tylo64 is missing).Statistically, I have ran PractRand to 4 TB without issues – also the 32bit variant of it as recommened by Melissa O’Neill. Note also, this implementation does not rely on hardware support for fast multiplication or 128-bit arithmetic, like wyhash.

In numpy, they have now added my proposed Weyl sequence to the original sfc64(requires 320 bits state): https://bashtage.github.io/randomgen/bit_generators/sfc.htmlMy proposed PRNG has 256-bit state, and is faster probably because it updates only 196-bit state per number, vs 256-bit in sfc64, as they do essentially the same number and types of operations. Real world Monte Carlo test follows, as suggested by Stefan Kanthak:

`// Output on AMD Ryzen 7 2700X cpu:`

// tylo64: Pi = 3.14153238: 0.719000 secs

// wyhash64: Pi = 3.14167794: 1.003000 secs

#include <stdint.h>

#include <time.h>

#include <stdio.h>

static uint64_t wyhash64_x;

static uint64_t tylo64_x[4];

uint64_t wyhash64() {

wyhash64_x += 0x60bee2bee120fc15;

__uint128_t tmp;

tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;

uint64_t m1 = (tmp >> 64) ^ tmp;

tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;

uint64_t m2 = (tmp >> 64) ^ tmp;

return m2;

}

uint64_t tylo64() {

enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};

const uint64_t b = tylo64_x[1], result = tylo64_x[0] ^ (tylo64_x[2] += tylo64_x[3]|1);

tylo64_x[0] = (b + (b << LSHIFT)) ^ (b >> RSHIFT);

tylo64_x[1] = ((b << LROT) | (b >> (64 - LROT))) + result;

return result;

}

#define COMPUTE_PI(prng) { \

hits = N; \

clock_t diff, before = clock(); \

for (unsigned i = hits; i > 0; i--) { \

const uint32_t x = prng(), y = prng(); \

hits -= z - (uint64_t) x * x < (uint64_t) y * y; \

} \

diff = clock() - before; \

printf("Pi = %lu.%06lu: %f secs\n", hits / (N/4), hits % (N/4), (float) diff / CLOCKS_PER_SEC); \

}

`int main()`

{

size_t N = 400000000, hits;

const uint64_t z = ~0ULL * ~0ULL;

wyhash64_x = tylo64_x[0] = tylo64_x[1] = tylo64_x[2] = tylo64_x[3] = time(NULL); // seed

COMPUTE_PI(tylo64)

COMPUTE_PI(wyhash64)

}