In software, we use random number generators to emulate “randomness” in games, simulations, probabilistic algorithms and so on. There are many definitions of what it means to be random, but in practice, what we do is run statistical tests on the output of the random number generators.
These tests are not perfect, because even a purely random sequence could fail particular tests just out of bad luck. Extraordinarily bad events do happen on occasion. However, we can repeat these tests and if they keep failing, we can gain confidence that something is wrong. By “wrong”, we mean that the output does not quite look random.
One concern is that running these tests can be difficult, and inconvenient. To alleviate this problem, I created a GitHub repository where you can find scripts that should allow you to test several different random number generators using a single command line.
I am focusing on fast non-cryptographic random number generators, the type that most programmers use for hash tables, simulations, games, and so forth. I stress that we are not interested in cryptographic security. Cryptography is a whole other world that we are going to leave to cryptographers.
I chose the following generators since they are widespread and fast:
- splitmix64 is a random number generator part of the standard Java API (SplittableRandom). It produces 64-bit numbers.
- pcg32 and pcg64 are instances of the PCG family designed by O’Neill. They produce either 32-bit or 64-bit outputs.
- xorshift32 is a classical xorshift random number generator you can read about in textbooks.
- Finally, xorshift128plus and xoroshiro128plus are recently proposed random number generator by Vigna et al. They produce 64-bit numbers.
No doubt I could have added many more…
First, I decided to look at raw speed on recent x64 processors:
|xoroshiro128plus||0.85 cycles per byte|
|splitmix64||0.89 cycles per byte||xorshift128plus||0.91 cycles per byte||pcg64||1.27 cycles per byte||pcg32||1.81 cycles per byte||xorshift32||2.33 cycles per byte|
Basically, splitmix64 and the generators proposed by Vigna et al. are roughly equally fast. O’Neill’s PCG schemes are slightly slower. I should point out that all of them are much faster than whatever your C library provides (rand).
Let us move on to statistical testing.
A convenient testing framework is Practically Random. It is a recently proposed piece of code that will eat as many random bytes as you want and check for randomness. You can let the program run for as long as you want. I went with a nice round number: I test 512GB of random bytes.
Only splitmix64 and the PCG schemes pass Practically Random (512GB).
You can, however, make xoroshiro128plus pass if you turn it into a 32-bit generator by dropping the least significant bits. Naturally, if you do so, you will diminish the speed of the generator by half. You might be able to do well by discarding fewer than 32 bits, but I did not investigate this approach because I prefer generators to produce either 32 bits or 64 bits.
Another well-established framework is L’Ecuyer’s TestU01. I run TestU01 in “big crush” mode using different seeds. Only when I see repeated failures with distinct seeds do I report a failure.
- xoroshiro128+ fails MatrixRank and LinearComp
- splitmix64 passes
- xorshift128+ fails MatrixRank and LinearComp
- pcg32 passes
- pcg64 passes
- xorshift32 fails decisively
Evidently, L’Ecuyer’s big crush benchmark is hard to pass. I should stress that it is likely possible to cause more failures by changing the conditions of the tests. That is, it is not because I do not report a failure that one does not exist, I may simply not have detected it.
I stress that these results are entirely reproducible. I provide all the software, all the scripts as well as my own outputs for public review.
- Blackman and Vigna report that xoroshiro128+ passes their big-crush tests. Sadly, it does not pass my big-crush tests. Note that you can verify my results for yourself and rerun the code!Of course, Blackman and Vigna did run big crush and did get passing scores, but their testing conditions no doubt differ.
It is already documented that xoroshiro128+ fails practically random.
- Though xorshift128+ has been widely adopted, it fails both big crush and practically random. The failure is rather decisive.
- Fortunately, Java’s splitmix64 appears to pass big crush and practically random.
- The PCG schemes pass all tests.
Part of my motivation when writing this blog post was Vigna’s remark: “Note that (smartly enough) the PCG author avoids carefully to compare with xorshift128+ or xorshift1024*.”
I am hoping that this blog post helps fill this gap. Evidently, my analysis is incomplete and we need to keep investigating. However, I hope that I have given you an interest for testing random number generators. If you grab my GitHub repository, it should be easy to get started.
Speaking for myself, as an independent party, I would rather have independent assessments. It is fine for O’Neill and Vigna to have Web sites where they compare their work to other techniques, but it is probably best for all of us if we collectively review these techniques independently. Please join me. To get you started: it is possible that I made mistakes. Please apply Linus’s Law and help dig out the bugs!
What would be interesting is to help document better these random number generators. For example, Xoroshiro128+ has a wikipedia entry (that looks messy to me), but the other schemes I have considered do not, as far as I can tell, have wikipedia entries, yet they are worthy of documentation in my opinion.
I am also in the process of adding more generators to the benchmark.
Disclaimer: I have no vested interest whatsoever in the success or failure of any of these generators. As a Java programmer, I am slightly biased in favor of splitmix64, however.
Further reading: See Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ fail statistical tests for linearity, Journal of Computational and Applied Mathematics, to appear (Available online 22 October 2018)
22 thoughts on “Testing non-cryptographic random number generators: my results”
I’m glad that working on PRNG testing. Not enough people are! But I think you need to distinguish between systemic failures (which have really low p-values very close to zero) and occasional blips were there is an â€œunlikelyâ€ p-value. Blips must happen from time to time. Three runs is ample to show a repeatable systemic failure, but not nearly enough to say that a blip occurs too often, you’ll most likely be seeing false patterns in random noise.
Another option is to use Adaptive Crush which if it sees a questionable p-value on a given test it increases the number of samples and reruns that test. It “rinse and repeats” up to a limit looking for pass/fail.
I emailed Daniel and pointed out to him that he was erroneously repeating the same test with the same seeds. I’m pleased to see he has updated his post.
I think it’s important, though, to be clear about how to interpret the p-value results from TestU01. Here’s a quote from the TestU01 manual
â€œWhen a p-value is extremely close to 0 or to 1 (for example, if it is less than 10^âˆ’10), one can obviously conclude that the generator fails the test. If the p-value is suspicious but failure is not clear enough, (p = 0.0005, for example), then the test can be replicated independently until either failure becomes obvious or suspicion disappears (i.e., one finds that the suspect p-value was obtained only by chance). This approach is possible because there is no limit (other than CPU time) on the amount of data that can be produced by a RNG to increase the sample size and the power of the test.â€
Thus the proper terminology for many of the p-values Daniel got is *suspicious*. As L’Ecuyer notes, any time you get a suspicious p-value, you are obligated to investigate. Arguably the current code in Daniel’s test suite doesn’t do this part well enough, because he just reruns the whole test three times. That’s not enough investigation!
As the author of PCG, so if someone thinks they see an issue, I have an obligation to investigate. I have looked into the specific suspicious p-values Daniel observed for specific tests, and here are the results I get repeating those same tests:
Running CollisionOver (BigCrush Test #9) on pcg32 twenty times, I see these p-values: 0.05811, 0.144, 0.9842, 0.797, 0.9561, 0.4523, 0.3093, 0.1909, 0.9385, 0.6832, 0.9418, 0.1008, 0.06458, 0.1211, 0.4523, 0.6231, 0.03508, 0.9561, 0.2548, 0.8652, which averages out to be 0.4964.
Running Run, r = 0 (BigCrush Test# 38) on the byte reverse (not the usual bit-reversed) of the least significant 32-bits of pcg64 lsb twenty times, I see these p-values: 0.992, 0.8454, 0.08798, 0.6355, 0.8053, 0.1355, 0.06824, 0.25, 0.1758, 0.4459, 0.5009, 0.4907, 0.8229, 0.6984, 0.3183, 0.4428, 0.6964, 0.8723, 0.0706, 0.7646, which averages out to be 0.506.
Thus, we can conclude that Daniel was just seeing some normal variation, not an actual problem.
Likewise, Daniel’s claims about TestU01 finding flaws in SplitMix are similarly false, merely what we normally see in thorough testing where many tests are applied, and some of the fails he reports for Vigna’s generators are also not indicators of serious problems.
I’ve only just started (re)checking SplitMix to make sure it gets a clean bill of health too, but once it’s done I’ll post again with a pastebin link for all the results.
Even though Daniel made a few mistakes (and gave me a bit more excitement than I was expecting this morning!), I applaud the idea of more widespread testing of PRNGs in general. We’ve already seen that mistakes are easy to make, and in that context, we really want multiple sources of independent validation, not just PRNG authors posting results. (You might hope that if the work has been formally published, that peer-reviewers would do this kind of work, but independently replicating someone’s results, and checking over their code for bugs, is above and beyond what typically happens in that context.)
Thanks for your comments. I will be updating my results this week.
Here are links to my retest results for PCG and SplitMix:
* pcg32 â€” https://pastebin.com/g9Fe2KdZ
* pcg64 â€” https://pastebin.com/MTsGm9uN
* splitmix â€” https://pastebin.com/ut80JMMz and https://pastebin.com/gPP4NA88
As expected, like the earlier retesting I reported for PCG, these retests show, as expected, that no real flaw had been discovered in SplitMix via simple TestU01 testing.
(But, regarding the statistical quality of SplitMix, it does, however, only produce each output *once* (when used as a 64-bit PRNG). Once you’ve seen a given 64-bit value, you’ll never see it again. That means it fails a really simple 64-bit birthday-problem test, due to lack of repeats. But actually observe this issue, you need either lots of time or lots of memory to observe 64-bit repeats (i.e., you must remember every 64-bit value you have seen so far). I have a custom tester to check for plausible behavior, and it uses 100 GB of RAM to perform the test. Obviously that’s not a test everyone can run (although you can use smaller sizes if you’re willing to wait longer), but for SplitMix, you don’t actually have to run the test to know that it’ll fail, since it’s an integral part of the design. FWIW, Vigna’s PRNGs, pcg64, pcg64_fast, a 128-bit MCG, and std::mt19937_64 all pass a simple 64-bit birthday test. As with any statistical issue, you might decide it’s not important for your application. And in some specialized situations, the once-only property is actually desirable, which is why PCG reluctantly provides some variants (explicitly labelled with the suffix _once_insecure) to provide it.)
(It’s also the case that Daniel tested Vigna’s cut-down version of SplitMix, which drops its multi-stream functionality and the split() function that inspires its name.)
“Once you’ve seen a given 64-bit value, you’ll never see it again.”
Unless I’m missing the point, I see this as unimportant since for a generator with period 2^64 (or 2^64-1), we can only expect good stat quality with sample sizes < sqrt(2^64) = 2^32 for not birthday spacing related and b-day spacing no more than (2^64)^(1/3) ~= 2642246
If it really were the case that you could only demand 2^32 values from SplitMix, that would itself be a problem as well because on a modern machine we can produce 2^32 values well under 10 seconds. A modern PRNG that is only good for fewer than 10 seconds of output would be questionable at best.
The old rule about only being able to expect about sqrt(period) values from any PRNG is not the case for many/most modern PRNGs, and especially not cryptographically secure ones based on hashing a counter (e.g., Random123). The period of those is based on the width their internal counter, and they should give randomness that passes statistical tests all the way until the counter loops back.
SplitMix passes PractRand’s statistical tests up to at least 64 TB of output (that’s as far as I’ve run it so far), 2048x more than the 32 GB fail we’d see if it failed at the square root of its period.
In general though, I think we can agree that using a 64-bit PRNG with 2^64 period to produce 64-bit outputs is clearly problematic. We can agree that there will be birthday issues far sooner than the period might have you think.
In contrast, if you use SplitMix as a 32-bit PRNG, everything will seem as it should.
And the cubic root of period is only an LCG thing..bad me.
Oh, and thanks for putting the effort into writing a paper accessible to a wide audience. Certainly practitioners appreciate the extra effort.
Thank you. I will rerun everything from scratch. I tried patching up my results with a partial rerun, but it is too easy to mislabel files when proceeding this way. I will just precisely capture the output of the script on a single machine. This will hopefully clear all confusions.
“Running CollisionOver (BigCrush Test #9) on pcg32 twenty times, I see these p-values: 0.05811, 0.144, 0.9842, 0.797, 0.9561, 0.4523, 0.3093, 0.1909, 0.9385, 0.6832, 0.9418, 0.1008, 0.06458, 0.1211, 0.4523, 0.6231, 0.03508, 0.9561, 0.2548, 0.8652, which averages out to be 0.4964.
Running Run, r = 0 (BigCrush Test# 38) on the byte reverse (not the usual bit-reversed) of the least significant 32-bits of pcg64 lsb twenty times, I see these p-values: 0.992, 0.8454, 0.08798, 0.6355, 0.8053, 0.1355, 0.06824, 0.25, 0.1758, 0.4459, 0.5009, 0.4907, 0.8229, 0.6984, 0.3183, 0.4428, 0.6964, 0.8723, 0.0706, 0.7646, which averages out to be 0.506.”
The mean is a fairly useless metric for repeated tests. What you want to do is a goodness of fit for a U(0, 1) distribution.
p for the first series (Test #9) of p-values [octave’s implementation of the one-sided KS-test, kolmogorov_smirnov_test(x, “unif”, 0, 1)] is ~0.476, nothing suspicious.
p for the second series (Test #38) is 0.992 which at least would lead me to run that particular test more times.
(Anderson-Darling gives a somewhat high but not very high p-value for the second series, 0.96.)
I don’t get it: JavaDoc says “Instances of SplittableRandom are not cryptographically secure. Consider instead using SecureRandom in security-sensitive applications.” – so what is your point?
Crack SecureRandom and you will get my attention.
And about “widespread”: Please show me any relevant code using SplittableRandom in Java….
Instances of SplittableRandom are not cryptographically secure. Consider instead using SecureRandom in security-sensitive applications.
That is correct.
so what is your point?
I’m not sure what you are asking.
Maybe you think it is a post about cryptography?
My blog post is specifically about “non-cryptographic RNGs”, please see the title. The tests I run are specifically designed for non-crytographic RNGs.
Crack SecureRandom and you will get my attention.
Again, this post has nothing to do whatsoever with cryptography. We test non-cryptographic random number generators.
Please show me any relevant code using SplittableRandom in Java
SplittableRandom is part of the core Java API. It is difficult to be more widespread.
Curious how Mersenne Twister compares to the above (later?) random number generator. Maybe not important.
Mersenne Twister passes practically random, but I don’t have the results yet from “big crush”. From what I read elsewhere, it fails. You can certainly run the tests yourself if you are curious: I encourage you to do so.
The Mersenne Twister did not fail in Daniel’s test because he told PractRand to stop after 64 GB, which is about half an hour of testing. By default, PractRand tests up to 32 TB, which obviously takes considerably longer.
The 32-bit Mersenne Twister will fail at 256 GB of output and its 64-bit version will fail at 512 GB of output (which takes three hours to reach). It fails PractRand’s BRank test. You can find sample output showing it fail in the â€œHow to Test with PractRandâ€ tutorial on my blog.
So, a more precise answer from Daniel would be to say that it didn’t fail when he tested it, but he only ran a limited test. Saying that it â€œpasses PractRandâ€ (in general) is false; it does not.
[FWIW, a 74 bit Lehmer PRNG fails PractRand at about the same point at the Mersenne Twister (yes, I meant 74, not 64).]
That’s correct. It only passes the test as one can find it in the repository. So it is one seed, and only 64 GB. This does not catch all possible failures.
Looks like testingRNG/testu01/results in your github repo still has files from when you were accidentally reusing the same seed for every run (perhaps because your test script dumps results files in testingRNG/testu01, and not testingRNG/testu01/results ?).
In particular, the results for testpcg64-S848432-b-r.log and testpcg64-S987654-b-r.log are just duplicates of testpcg64-b-r.log. When you correct this issue, the actual runs for these seeds are completely free of any blips and the â€œrun failuresâ€ you think you’re seeing for pcg64 disappear.
Also, I suspect that the reason you thought SplitMix failed was because your Xorshift RNG misidentified itself as SplitMix. If you remove those misattributions, I think SplitMix will come out okay.
My script deliberately does not overwrite files in the results subdirectory (that would be inconvenient as they are checked into the repository). However, it is correct that there were issues with these archived log files. So, as stated in a previous comment, and in the revised blog post, I am re-running everything from scratch to hopefully get a clean and fully reproducible result dump. Thanks for warning me about these problems.
You are probably right about SplitMix, and once I get the final results, I might do a blog post specifically about SplitMix.
Are you aware of any generators which pass TestU01 that use 32-bit arithmetic exclusively? That is, composed of integer type
Lin: You can read the original TestU01 article that introduced Crush, BigCrush, etc.
P. L’Ecuyer and R. Simard, “TestU01: A C Library for Empirical Testing of Random Number Generators”, ACM Transactions on Mathematical Software, 33, 4, Article 22, August 2007, 40 pages.
It contains a large table with test results for many 32-bit generators. You will find there that the Mersenne Twister fails linearity tests in BigCrush. But there are also generators that pass all the tests and (more importantly) are also supported by solid theory.
You may subscribe to this blog by email.