Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)

Most people designing random number generators program using regular code. If they are aiming for speed, they probably write functions in C. However, our processors have fast “vectorized” (or SIMD) instructions that can allow you to go faster. These instructions do several operations at once. For example, recent Skylake-X processors from Intel can do eight 64-bit multiplications with a single instruction.

There is a vectorized version of the Mersenne Twister generator used in some C++ standard libraries, but the Mersenne Twister is not particularly fast to begin with.

What happens if we vectorize really fast generators?

I had good luck vectorizing Vigna’s xorshift128+ random number generator. A generator like it is part of some JavaScript engines. The xorshift128+ generator produces 64-bit values, but you can consider them as two 32-bit values. On my Skylake-X processor, I can generate 32-bit random integers at a rate of 2 cycles per integer using xorshift128+. That’s almost twice as fast as when using the default, scalar implementation in C.

scalar xorshift128+3.6 cycles per 32-bit word
SIMD xorshift128+1.0 cycles per 32-bit word

PCG is a family of random number generators invented by O’Neill. Can we do the same with PCG? I took a first stab at it with my simdpcg library. My vectorized PCG ends up using about 1 cycle per integer, so it has the same speed as the vectorized xorshift128+.

scalar PCG4.3 cycles per 32-bit word
SIMD PCG1.0 cycles per 32-bit word

In these tests, I simply write out the random number to a small array in cache. I only measure raw throughput. To get these good results, I “cheat” a bit by interleaving several generators in the vectorized versions. Indeed, without this interleave trick, I find that the processor is almost running idle due to data dependencies.

My C code is available:

Sadly, I expect that most of my readers do not yet have processors with support for AVX-512 instructions. They are available as part of the Skylake-X and Cannonlake processors. Intel has not been doing a great job at selling these new processors in great quantities. You may be able to have access to such processors using the cloud.

Update: In my initial version, I reported worse performance on xorshift128+. Samuel Neves pointed out to me that this is due to the lack of inlining, because I compile the xorshift128+ functions in their own object files. We can solve this particular problem using link time optimization (LTO), effectively by passing the -flto flag as part of the compile command line. As usual, results will vary depending on your compiler and processor.

8 thoughts on “Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)”

  1. If you do not have an AVX-512 cpu, you can still experiment with these on some of the cloud providers, which offer AVX-512 vms.

    1. In these tests, I simply write out the random number to a small array in cache. I only measure raw throughput. To get these good results, I “cheat” a bit by interleaving several generators in the vectorized versions. Indeed, without this interleave trick, I find that the processor is almost running idle due to data dependencies.

      Isn’t the cheating in the “I simply write out the random number to a small array in cache”, chances of that happening consistently in the real world are small, unless you’re randomizing your hard-disk.

      1. Isn’t the cheating in the “I simply write out the random number to a small array in cache”, chances of that happening consistently in the real world are small, unless you’re randomizing your hard-disk.

        Cheating as opposed to what? How else do you propose to measure how quickly one can generate random numbers?

        1. As opposed to comparing the performance of prng’s in a situation that is more likely to reflect real-world-usage of any prng.

          Modern processors will make the operation you’re measuring really fast due to deep pipe-lines and out-of-order execution. In normal code branch mis-prediction will trash the instruction cache often.

          So it really comes down to what one wants to measure, only then one might be able to answer the question of how to do that.

          1. In normal code branch mis-prediction will trash the instruction cache often.

            The point you seem to be making is that random number generation might not be the bottleneck. Something else, like the branchy nature of my code, might be the limiting factor. That’s absolutely correct.

            1. Something else, like the branchy nature of my code, might be the limiting factor.

              I’m saying that in a “normal” program (not specifically testing through-put of a prng), other (surrounding) code will trash your instruction cache (and probably your data cache as well), and therefor in a real world situation the tested prng will not do so well as advertised. How less well it will do depends on the prng and the actual implementation of it.

              What I’m saying is that you are not testing that, or in other words, it’s not a very useful measure.

            2. Sorry for the split response, but there’s more I think.

              The efficiency of the intrinsics depends fully on that the values remain in registers throughout. Iff they don’t, intrinsics are very expensive as the whole lot (512 bytes, way bigger than your data cache) need to be written to memory and back. Again, surrounding code might force those values out of registers, which won’t happen in your test method..

              I am not criticizing your implementation(s). I’m just putting question marks along-side your measuring method.

              To be honest I don’t have a good response either. M. E. O’Neill is going to publish a blog-post on this particular issue (she told me it’s in the making), so I’m looking forward to that.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax