Quickly sampling from two arrays (C++ edition)

Suppose that you are given two arrays. Maybe you have a list of cities from the USA and a list of cities from Europe. You want to generate a new list which mixes the two lists, taking a sample from one array (say 50%), and a sample from the other array (say 50%). So if you have 50 cities from the USA and 50 cities from Europe, you want a new array that contains, in random order, 25 cities from the USA and 25 cities from Europe.

We need this kind of mixed sampling all the time in machine learning or data science. This summer, I was running simulations and the bulk of the time was spent mixing arrays. I need to pick, say, 25% of all elements from one array and combine them with, say, 75% of all elements from another array.

There are many bad ways to solve this problem. But here is a reasonable one. First you pick a sample from the first array using reservoir sampling; then you pick a sample from the other array (again using reservoir sampling), and you finally apply a random shuffle to the result.

Reservoir sampling is an efficient way to sample N values from an array:

  for (i = 0; i < N; i++) {
    output[i] = source[i];
  }
  for (; i < ...; i++) {
    r = random_bounded(i);// value in [0,i)
    if (r < howmany) {
      output[r] = source[i];
    }
  }

Knuth shuffling is an efficient way to randomly shuffle the elements in an array:

  for (i = size; i > 1; i--) {
    r = random_bounded(i);// value in [0,i)
    swap(array[i-1], array[r]);
  }

With these two algorithms in place, I can sample from two source arrays using three function calls:

reservoirsampling(output, N1, source1, length1);
reservoirsampling(output + N1, N2, source2, length2);
shuffle(output, N1 + N2);

So how efficient is it? Suppose that I have two arrays made of a million elements each and I want to sample half a million elements from each. On my iMac, I use a bit over 12 CPU cycles per input element (so about 24 million cycles in total). You probably can go even faster, but this approach has the benefit of being both simple and efficient.

My source code is available.

Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)

Most people designing random number generators program using regular code. If they are aiming for speed, they probably write functions in C. However, our processors have fast “vectorized” (or SIMD) instructions that can allow you to go faster. These instructions do several operations at once. For example, recent Skylake-X processors from Intel can do eight 64-bit multiplications with a single instruction.

There is a vectorized version of the Mersenne Twister generator used in some C++ standard libraries, but the Mersenne Twister is not particularly fast to begin with.

What happens if we vectorize really fast generators?

I had good luck vectorizing Vigna’s xorshift128+ random number generator. A generator like it is part of some JavaScript engines. The xorshift128+ generator produces 64-bit values, but you can consider them as two 32-bit values. On my Skylake-X processor, I can generate 32-bit random integers at a rate of 2 cycles per integer using xorshift128+. That’s almost twice as fast as when using the default, scalar implementation in C.

scalar xorshift128+3.6 cycles per 32-bit word
SIMD xorshift128+1.0 cycles per 32-bit word

PCG is a family of random number generators invented by O’Neill. Can we do the same with PCG? I took a first stab at it with my simdpcg library. My vectorized PCG ends up using about 1 cycle per integer, so it has the same speed as the vectorized xorshift128+.

scalar PCG4.3 cycles per 32-bit word
SIMD PCG1.0 cycles per 32-bit word

In these tests, I simply write out the random number to a small array in cache. I only measure raw throughput. To get these good results, I “cheat” a bit by interleaving several generators in the vectorized versions. Indeed, without this interleave trick, I find that the processor is almost running idle due to data dependencies.

My C code is available:

Sadly, I expect that most of my readers do not yet have processors with support for AVX-512 instructions. They are available as part of the Skylake-X and Cannonlake processors. Intel has not been doing a great job at selling these new processors in great quantities. You may be able to have access to such processors using the cloud.

Update: In my initial version, I reported worse performance on xorshift128+. Samuel Neves pointed out to me that this is due to the lack of inlining, because I compile the xorshift128+ functions in their own object files. We can solve this particular problem using link time optimization (LTO), effectively by passing the -flto flag as part of the compile command line. As usual, results will vary depending on your compiler and processor.

Further reading: See Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ fail statistical tests for linearity, Journal of Computational and Applied Mathematics, to appear (Available online 22 October 2018)

The Xorshift1024* random number generator fails BigCrush

In software, we use random number generators to simulate randomness. They take the form of functions which return numbers that appear random. To test their quality, we apply statistical tests. The goal standard is a statical test called BigCrush. It tests that quality of 32-bit random values.

In an earlier post, I reported that contrary to what you can read on the Xorshift Wikipedia page, the Xorshift128+ random number generator fails BigCrush. This is somewhat significant because Xorshift128+ has been widely adopted.

The Xorshift Wikipedia page also states that a more expensive generator, xorshift1024*, passes BigCrush. So I wondered, does it, indeed, pass BigCrush?

So I used my testing framework to run BigCrush. I use four different “seeds” and I only worry about a failure if it occurs with all four seeds, and with an excessively improbable p-value. Because xorshift1024* generates 64-bit outputs, and BigCrush requires 32-bit inputs, I only keep the 32 least significant bits of each word produced by xorshift1024*.

Here are my results:

 ./summarize.pl testxorshift1024star*.log
reviewing xorshift1024star lsb 32-bits
Summary for xorshift1024star lsb 32-bits (4 crushes):
- #81: LinearComp, r = 29: FAIL!! -- p-values too unlikely (1 - eps1, 1 - eps1, 1 - eps1, 1 - eps1) -- ALL CRUSHES FAIL!!

reviewing xorshift1024star lsb 32-bits (bit reverse)
Summary for xorshift1024star lsb 32-bits (bit reverse) (4 crushes):
- #80: LinearComp, r = 0: FAIL!! -- p-values too unlikely (1 - eps1, 1 - eps1, 1 - eps1, 1 - eps1) -- ALL CRUSHES FAIL!!

So xorshift1024* fails BigCrush systematically when providing the values as is (log file 1, 2, 3, 4), and it fails again with reversing the bit order (log file 1, 2, 3, 4).

So the Wikipedia entry is misleading. Both xorshift128+ and xorshift1024* fail BigCrush.

My code is available for review, and you can rerun the tests for yourself.

Further reading: See Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ fail statistical tests for linearity, Journal of Computational and Applied Mathematics, to appear (Available online 22 October 2018)

The Xorshift128+ random number generator fails BigCrush

In software, we generate randomness with algorithms called “pseudo-random number generator”, sometimes simply called “random number generator” or RNG. A popular random number generator is xorshift128+: it is used by many JavaScript engines. It was designed by an influential computer science professor, Sebastiano Vigna, who has done a lot of great work. I suspect that the JavaScript engineers made a fine choice by selecting xorshift128+.

How can you tell that your random number generator has a good quality? A standard test is TestU01 designed by professor L’Ecuyer from the University of Montreal. TestU01 comes with different sets of tests, but BigCrush is the gold standard. A good random number generator should pass BigCrush when you pass the values as-is, as well as when you reverse the bits produced by the random number generator. Indeed, Vigna writes about another popular random number generator:

A recent example shows the importance of testing the reverse generator. Saito and Matsumoto [2014] propose a different way to eliminate linear artifacts (…) However, while their generator passes BigCrush, its reverse fails systematically (…) which highlights a significant weakness in its lower bits.

Passing BigCrush, even after reversing the bits, is not an impossible task. The SplittableRandom class in Java appears to pass (Steele et al., 2014), and so does the PCG family. And, of course, all cryptographic random number generators pass BigCrush, irrespective of the order of the bits.

On Wikipedia, we can read about xorshift128+ passing BigCrush robustly:

the following xorshift128+ generator uses 128 bits of state (…) It passes BigCrush, even when reversed.

Is Wikipedia correct? It offers a reference by Vigna who invented xorshift128+ (Vigna, 2017). Vigna’s journal article states:

(…) we propose a tightly coded xorshift128+ generator that does not fail any test from the BigCrush suite of TestU01 (even reversed) (…) xorshift128+ generator (…) is currently the fastest full-period generator we are aware of that does not fail systematically any BigCrush test (not even reversed)

That seems like a pretty definitive statement. It admits that there might be statistical failure, but no systematic one. So it would seem to support the Wikipedia entry.

The xorshift128+ code is not difficult:

#include <stdint.h>
uint64_t s[2];
uint64_t next(void) {
  uint64_t s1 = s[0];
  uint64_t s0 = s[1];
  uint64_t result = s0 + s1;
  s[0] = s0;
  s1 ^= s1 << 23; // a
  s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
  return result;
}

This the version with the paramater recommended by Vigna, but the V8 JavaScript engine went with a slight variation:

uint64_t s[2];
uint64_t next(void) {
  uint64_t s1 = s[0];
  uint64_t s0 = s[1];
  uint64_t result = s0 + s1;
  s[0] = s0;
  s1 ^= s1 << 23; // a
  s[1] = s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26); // b, c
  return result;
}

(The code I offer is taken directly from the author’s website and is equivalent to what we find on Wikipedia and in the research paper.)

Can we check the claim? The BigCrush benchmark is available as free software, but it is a pain to set it up and run it. So I published a package to test it out. Importantly, you can check my software, compile it, run it, review it… at your leisure. I encourage you to do it! I use Vigna’s own C implementation.

Statistical tests are never black and white, but you can use p-values to arrive at a reasonable decision as to whether the test passes or not. The BigCrush implementation does this analysis for us. To make things more complicated, random number generators rely on an initial “seed” that you input to initiate the generator. Provide a different seed and you will get different results.

So I did the following when running BigCrush:

  • I use four different input seeds. I only care if a given test fails for all four seeds. I use 64-bit seeds from which I generate the needed 128-bit seed using another generator (splitmix64), as recommended by Vigna. I just chose my seeds arbitrarily, you can try with your own if you have some free time!
  • I only care when BigCrush gives me a conclusive p-value that indicates a clear problem.
  • I use the least significant 32 bits produced by xorshift128+, in reversed bit order. (BigCrush expects 32-bit inputs.)

Let us review the results.

seed: 12345678

 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 ----------------------------------------------
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1
 ----------------------------------------------

seed: 412451

The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 ----------------------------------------------
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1
 -----------------------------------------------

seed: 987654

The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 ----------------------------------------------
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1
 ----------------------------------------------

seed: 848432

 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):
       Test                          p-value
 ----------------------------------------------
 24  ClosePairs mNP1, t = 9          9.2e-5
 24  ClosePairs NJumps, t = 9        1.0e-4
 68  MatrixRank, L=1000, r=0          eps  
 71  MatrixRank, L=5000               eps  
 80  LinearComp, r = 0              1 - eps1
 ----------------------------------------------

The failure is equivalent if I use the alternate version of the code, as used by the V8 JavaScript engine.

Analysis

Xorshift128+ fails BigCrush when selecting the least significant 32 bits and reversing the bit order. The evidence is clear: I used four different seeds, and it failed MatrixRank and LinearComp in all four cases. Thus the Wikipedia entry is misleading in my opinion.

While I reversed the bit orders, you can also get systematic failures by simply reversing the byte orders. You select the least significant 32 bits, and reverse the resulting four bytes.

The recommended replacement for xorshift128+, xoroshiro128+, also fails BigCrush in a similar manner (as you can verify by yourself). Yet the xoroshiro128+ Wikipedia page could serve as an unequivocal recommendation:

Xoroshiro is the best general purpose algorithm currently available. (…) Mersenne Twister might still be a better choice for highly conservative projects unwilling to switch to such a new algorithm, but the current generation of statistically tested algorithms brings a baseline of assurance from the outset that previous generations lacked.

I feel that this would need to be qualified. In my tests, Xoroshiro is no faster than SplittableRandom (which I call splitmix64 following Vigna’s terminology), and SplittableRandom passes BigCrush while Xoroshiro does not. Recall that the PCG functions also pass BigCrush and they are quite fast. There are other choices as well.

To be clear, there is probably no harm whatsoever at using xorshift128+, but it does systematically fail reasonable statistical tests. If your core argument for choosing a generator is that it passes hard statistical test, and it fails, I think you have to change your argument somewhat.

Further reading: See Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ fail statistical tests for linearity, Journal of Computational and Applied Mathematics, to appear (Available online 22 October 2018)

Testing non-cryptographic random number generators: my results

In software, we use random number generators to emulate “randomness” in games, simulations, probabilistic algorithms and so on. There are many definitions of what it means to be random, but in practice, what we do is run statistical tests on the output of the random number generators.

These tests are not perfect, because even a purely random sequence could fail particular tests just out of bad luck. Extraordinarily bad events do happen on occasion. However, we can repeat these tests and if they keep failing, we can gain confidence that something is wrong. By “wrong”, we mean that the output does not quite look random.

One concern is that running these tests can be difficult, and inconvenient. To alleviate this problem, I created a GitHub repository where you can find scripts that should allow you to test several different random number generators using a single command line.

I am focusing on fast non-cryptographic random number generators, the type that most programmers use for hash tables, simulations, games, and so forth. I stress that we are not interested in cryptographic security. Cryptography is a whole other world that we are going to leave to cryptographers.

The contenders

I chose the following generators since they are widespread and fast:

  • splitmix64 is a random number generator part of the standard Java API (SplittableRandom). It produces 64-bit numbers.
  • pcg32 and pcg64 are instances of the PCG family designed by O’Neill. They produce either 32-bit or 64-bit outputs.
  • xorshift32 is a classical xorshift random number generator you can read about in textbooks.
  • Finally, xorshift128plus and xoroshiro128plus are recently proposed random number generator by Vigna et al. They produce 64-bit numbers.

No doubt I could have added many more…

Speed

First, I decided to look at raw speed on recent x64 processors:

xoroshiro128plus0.85 cycles per byte
splitmix640.89 cycles per bytexorshift128plus0.91 cycles per bytepcg641.27 cycles per bytepcg321.81 cycles per bytexorshift322.33 cycles per byte

Basically, splitmix64 and the generators proposed by Vigna et al. are roughly equally fast. O’Neill’s PCG schemes are slightly slower. I should point out that all of them are much faster than whatever your C library provides (rand).

Let us move on to statistical testing.

Practically Random

A convenient testing framework is Practically Random. It is a recently proposed piece of code that will eat as many random bytes as you want and check for randomness. You can let the program run for as long as you want. I went with a nice round number: I test 512GB of random bytes.

Only splitmix64 and the PCG schemes pass Practically Random (512GB).

You can, however, make xoroshiro128plus pass if you turn it into a 32-bit generator by dropping the least significant bits. Naturally, if you do so, you will diminish the speed of the generator by half. You might be able to do well by discarding fewer than 32 bits, but I did not investigate this approach because I prefer generators to produce either 32 bits or 64 bits.

TestU01

Another well-established framework is L’Ecuyer’s TestU01. I run TestU01 in “big crush” mode using different seeds. Only when I see repeated failures with distinct seeds do I report a failure.

  • xoroshiro128+ fails MatrixRank and LinearComp
  • splitmix64 passes
  • xorshift128+ fails MatrixRank and LinearComp
  • pcg32 passes
  • pcg64 passes
  • xorshift32 fails decisively

Evidently, L’Ecuyer’s big crush benchmark is hard to pass. I should stress that it is likely possible to cause more failures by changing the conditions of the tests. That is, it is not because I do not report a failure that one does not exist, I may simply not have detected it.

I stress that these results are entirely reproducible. I provide all the software, all the scripts as well as my own outputs for public review.

Analysis

  • Blackman and Vigna report that xoroshiro128+ passes their big-crush tests. Sadly, it does not pass my big-crush tests. Note that you can verify my results for yourself and rerun the code!Of course, Blackman and Vigna did run big crush and did get passing scores, but their testing conditions no doubt differ.

    It is already documented that xoroshiro128+ fails practically random.

  • Though xorshift128+ has been widely adopted, it fails both big crush and practically random. The failure is rather decisive.
  • Fortunately, Java’s splitmix64 appears to pass big crush and practically random.
  • The PCG schemes pass all tests.

Going forward!

Part of my motivation when writing this blog post was Vigna’s remark: “Note that (smartly enough) the PCG author avoids carefully to compare with xorshift128+ or xorshift1024*.”

I am hoping that this blog post helps fill this gap. Evidently, my analysis is incomplete and we need to keep investigating. However, I hope that I have given you an interest for testing random number generators. If you grab my GitHub repository, it should be easy to get started.

Speaking for myself, as an independent party, I would rather have independent assessments. It is fine for O’Neill and Vigna to have Web sites where they compare their work to other techniques, but it is probably best for all of us if we collectively review these techniques independently. Please join me. To get you started: it is possible that I made mistakes. Please apply Linus’s Law and help dig out the bugs!

What would be interesting is to help document better these random number generators. For example, Xoroshiro128+ has a wikipedia entry (that looks messy to me), but the other schemes I have considered do not, as far as I can tell, have wikipedia entries, yet they are worthy of documentation in my opinion.

I am also in the process of adding more generators to the benchmark.

Disclaimer: I have no vested interest whatsoever in the success or failure of any of these generators. As a Java programmer, I am slightly biased in favor of splitmix64, however.

Further reading: See Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ fail statistical tests for linearity, Journal of Computational and Applied Mathematics, to appear (Available online 22 October 2018)