In software, we generate randomness with algorithms called “pseudo-random number generator”, sometimes simply called “random number generator” or RNG. A popular random number generator is xorshift128+: it is used by many JavaScript engines. It was designed by an influential computer science professor, Sebastiano Vigna, who has done a lot of great work. I suspect that the JavaScript engineers made a fine choice by selecting xorshift128+.

How can you tell that your random number generator has a good quality? A standard test is TestU01 designed by professor L’Ecuyer from the University of Montreal. TestU01 comes with different sets of tests, but BigCrush is the gold standard. A good random number generator should pass BigCrush when you pass the values as-is, as well as when you reverse the bits produced by the random number generator. Indeed, Vigna writes about another popular random number generator:

A recent example shows the importance of testing the reverse generator. Saito and Matsumoto [2014] propose a different way to eliminate linear artifacts (…) However, while their generator passes BigCrush, its reverse fails systematically (…) which highlights a significant weakness in its lower bits.

Passing BigCrush, even after reversing the bits, is not an impossible task. The SplittableRandom class in Java appears to pass (Steele et al., 2014), and so does the PCG family. And, of course, all cryptographic random number generators pass BigCrush, irrespective of the order of the bits.

On Wikipedia, we can read about xorshift128+ passing BigCrush robustly:

the following xorshift128+ generator uses 128 bits of state (…) It passes BigCrush, even when reversed.

Is Wikipedia correct? It offers a reference by Vigna who invented xorshift128+ (Vigna, 2017). Vigna’s journal article states:

(…) we propose a tightly coded xorshift128+ generator that does not fail any test from the BigCrush suite of TestU01 (even reversed) (…) xorshift128+ generator (…) is currently the fastest full-period generator we are aware of that does not fail systematically any BigCrush test (not even reversed)

That seems like a pretty definitive statement. It admits that there might be statistical failure, but no systematic one. So it would seem to support the Wikipedia entry.

The xorshift128+ code is not difficult:

#include <stdint.h> uint64_t s[2]; uint64_t next(void) { uint64_t s1 = s[0]; uint64_t s0 = s[1]; uint64_t result = s0 + s1; s[0] = s0; s1 ^= s1 << 23; // a s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c return result; }

This the version with the paramater recommended by Vigna, but the V8 JavaScript engine went with a slight variation:

uint64_t s[2]; uint64_t next(void) { uint64_t s1 = s[0]; uint64_t s0 = s[1]; uint64_t result = s0 + s1; s[0] = s0; s1 ^= s1 << 23; // a s[1] = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26); // b, c return result; }

(The code I offer is taken directly from the author’s website and is equivalent to what we find on Wikipedia and in the research paper.)

Can we check the claim? The BigCrush benchmark is available as free software, but it is a pain to set it up and run it. So I published a package to test it out. Importantly, you can check my software, compile it, run it, review it… at your leisure. I encourage you to do it! I use Vigna’s own C implementation.

Statistical tests are never black and white, but you can use p-values to arrive at a reasonable decision as to whether the test passes or not. The BigCrush implementation does this analysis for us. To make things more complicated, random number generators rely on an initial “seed” that you input to initiate the generator. Provide a different seed and you will get different results.

So I did the following when running BigCrush:

- I use four different input seeds. I only care if a given test fails for all four seeds. I use 64-bit seeds from which I generate the needed 128-bit seed using another generator (splitmix64), as recommended by Vigna. I just chose my seeds arbitrarily, you can try with your own if you have some free time!
- I only care when BigCrush gives me a conclusive p-value that indicates a clear problem.
- I use the least significant 32 bits produced by xorshift128+, in reversed bit order. (BigCrush expects 32-bit inputs.)

Let us review the results.

The following tests gave p-values outside [0.001, 0.9990]: (eps means a value < 1.0e-300): (eps1 means a value < 1.0e-15): Test p-value ---------------------------------------------- 68 MatrixRank, L=1000, r=0 eps 71 MatrixRank, L=5000 eps 80 LinearComp, r = 0 1 - eps1 ----------------------------------------------

The following tests gave p-values outside [0.001, 0.9990]: (eps means a value < 1.0e-300): (eps1 means a value < 1.0e-15): Test p-value ---------------------------------------------- 68 MatrixRank, L=1000, r=0 eps 71 MatrixRank, L=5000 eps 80 LinearComp, r = 0 1 - eps1 -----------------------------------------------

The following tests gave p-values outside [0.001, 0.9990]: (eps means a value < 1.0e-300): (eps1 means a value < 1.0e-15): Test p-value ---------------------------------------------- 68 MatrixRank, L=1000, r=0 eps 71 MatrixRank, L=5000 eps 80 LinearComp, r = 0 1 - eps1 ----------------------------------------------

The following tests gave p-values outside [0.001, 0.9990]: (eps means a value < 1.0e-300): (eps1 means a value < 1.0e-15): Test p-value ---------------------------------------------- 24 ClosePairs mNP1, t = 9 9.2e-5 24 ClosePairs NJumps, t = 9 1.0e-4 68 MatrixRank, L=1000, r=0 eps 71 MatrixRank, L=5000 eps 80 LinearComp, r = 0 1 - eps1 ----------------------------------------------

The failure is equivalent if I use the alternate version of the code, as used by the V8 JavaScript engine.

**Analysis**

Xorshift128+ fails BigCrush when selecting the least significant 32 bits and reversing the bit order. The evidence is clear: I used four different seeds, and it failed MatrixRank and LinearComp in all four cases. Thus the Wikipedia entry is misleading in my opinion.

While I reversed the bit orders, you can also get systematic failures by simply reversing the byte orders. You select the least significant 32 bits, and reverse the resulting four bytes.

The recommended replacement for xorshift128+, xoroshiro128+, also fails BigCrush in a similar manner (as you can verify by yourself). Yet the xoroshiro128+ Wikipedia page could serve as an unequivocal recommendation:

Xoroshiro is the best general purpose algorithm currently available. (…) Mersenne Twister might still be a better choice for highly conservative projects unwilling to switch to such a new algorithm, but the current generation of statistically tested algorithms brings a baseline of assurance from the outset that previous generations lacked.

I feel that this would need to be qualified. In my tests, Xoroshiro is no faster than SplittableRandom (which I call splitmix64 following Vigna’s terminology), and SplittableRandom passes BigCrush while Xoroshiro does not. Recall that the PCG functions also pass BigCrush and they are quite fast. There are other choices as well.

To be clear, there is probably no harm whatsoever at using xorshift128+, but it does systematically fail reasonable statistical tests. If your core argument for choosing a generator is that it passes hard statistical test, and it fails, I think you have to change your argument somewhat.

Why is so much salience given to the lower bits? This would ignore very obvious patterns in the middle bits. Would patterns in the middle bits matter? Are there tests based on arbitrary byte-wise permutations, or even just turning a number “inside out” by reversing the higher 32 bits and lower 32 bits?

A number of the very early random number generators had obvious patterns in the low bits while appearing to be random in the upper bits – several of them would repeat the same last 3 or 4 bits over and over for ever.

More recent algorithms tend to remove that problem, so it’s perhaps not as significant now, but its wise to check everything just in case a problem has been reintroduced.

For an example of bad random numbers, see if you can find details of the PlayStation 2 hardware random number generator, which generated floating point random numbers in the vector units. This was found to suffer from a very strong pattern, so much so that if you were to take several thousand values, and assign them in groups of three to points in space, you would end up with a series of rotated grid-like layers in space, which could hardly be less random if it tried. Try the same with your current favourite random number generator and see what you get!

The code snippet that you tested for xorshift128+ appears to be taken from this paper here http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf .

While the figure annotation implies that this is the exact implementation used, the surrounding paper implies that it only passed BigCrush using different choices for a, b, and c than the ones in that code snippet.

Could you reproduce these results with a=23, b=17, c=26? These are the constants listed as the “best” in that paper, as well as the constants used by the V8 runtime (https://v8project.blogspot.se/2015/12/theres-mathrandom-and-then-theres.html) and the constants used in the wikipedia example (https://en.wikipedia.org/wiki/Xorshift#xorshift.2B).

The code snippet that you tested for xorshift128+ appears to be taken from this paper here (…)I copied and pasted it from the author’s web site: http://xoroshiro.di.unimi.it/xorshift128plus.c. But it is also identical to Figure 1 in the peer-reviewed paper with the label

The xorshift128+ generator used in the tests.While the figure annotation implies that this is the exact implementation used, the surrounding paper implies that it only passed BigCrush using different choices for a, b, and c than the ones in that code snippet.Vigna specifically recommends the triple (23, 18, 5) that I have tested, and he recommends against the triples you suggest (a=23, b=17, c=26) which appears first in Table 1:

(This is a direct quote from the paper.)

Could you reproduce these results with a=23, b=17, c=26?Tests are running. Big Crush takes many hours, but I already have the results from PractRand:

So we have a failure. You should expect a Big Crush failure (since statistical tests tend to be redundant).

So, yes, the failure is confirmed: