Default random-number generators are slow

Most languages like Java and Go, come with standard pseudo-random-number generators. Java uses a simple linear congruential generator. Starting with a seed value, it generates a new value with the reccurence formula:

seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);

The seed variable thus modified time and time again can be considered “random”. For many practical purposes, it is good enough.

It should also be quite fast. A multiplication has a latency of only 3 cycles on recent Intel processors, and that’s the most expensive operation. Java should be able to generate a new 32-bit random number integer every 10 cycles or so on a recent PC.

And that is the kind of speed you get with a straight-forwad implementation:

int next(int bits) {
    seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
    return (int) (seed >>> (48 - bits));

Sadly, you should not trust this analysis. Java’s Random.nextInt is several times slower at generating random numbers than you would expect as the next table shows.

Function Timing (ns) on Skylake processor
Random.nextInt 10.4
my next function above 2.7

That’s a four-fold difference in performance between my implementation and the standard Java one!

Languages like Go do not fare any better. Even the venerable rand function from the C standard library is several times slower than you would expect.

Why? Because the standard Java API provides you with a concurrent random number generator. In effect, if you use the Java random number generator in a multithreaded context, it is safe: you will get “good” random number generator. Go and other languages do the same thing.

It is unclear to me why it is needed. You can easily get concurrency in a multithreaded context by using one seed per thread.

Evidently, language designers feel that random-number generators should be particularly idiot-proof. Why have the random-number generators received this particular type of attention?

For users who want less overhead, the Java API provides a class in the concurrent package called ThreadLocalRandom that is nearly as far as my naive function, as the next table shows.

Function Timing (ns) on Skylake processor
ThreadLocalRandom 3.2

It turns out that the ThreadLocalRandom uses many optimization tricks, covering all of its functions, that the Random class does not have.

In any case, if you need to write fast software that depends on random numbers (such as a simulation), you probably want to pick your own random-number generator.

Reference: As usual, my benchmarking software is available online.

Credit: I am grateful to Viktor Szathmary for pointing out the ThreadLocalRandom class.

Daniel Lemire, "Default random-number generators are slow," in Daniel Lemire's blog, February 1, 2016.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

13 thoughts on “Default random-number generators are slow”

  1. Choosing the correct pseudo-random number generator for a Monte Carlo simulation requires a careful analysis, but as a rule of thumb linear congruential generators (LCG) are inadequate. (At least if you are doing computational science, and not video games. This is not because computational science is a more noble endeavour than video game developing: the simulation of physical systems has different requirements than the simulation of a synthetic world. )

    Moreover LCGs are unsuitable for cryptographic applications.

    In my opinion the real question is not whether the default pseudo-random generator is fast enough, but whether it is good enough for your application.

      1. Sound interesting, but I do not have hands on experience. (Disclaimer: my background is in computational science and engineering, but I’ve never run large scale Monte Carlo simulations. Speed never was my main concern… but this depends of the applications in my area of interest.)

  2. This kind of thing is exactly why I think threads should be used approximately never by applications. Non-parallel abstractions (events, coroutines, coop threads, etc) for multitasking, and pools of worker processes for parallelism! That way be default you never have to worry about the memory nightmares that lead to by-default thread safe RNGs.

      1. Goroutines are just dressed-up threads. The Go community promotes a CSP style of programming with channels, but you can share memory between goroutines and get data races just like you can in Java or C (or almost any other popular language these days).

  3. In mono-threaded env. i’ve used XorShift1024* and i’ve tested it with a bloom filter of 800Mil elem Precision 0.000001f and it’s seems real fast and good enough to not create clash: or not?

    1. Excellent. I have updated my blog post and credited you with this important observation.

      It does not change the message of my blog post, but it does point the programmers in a useful direction.

  4. Does the “concurrent random number generator” mean the one which *locks* the seed or the internal state every time during the computation? That might be *very* slow. I would appreciate if you could clarify this. Giving independent seeds for each threads would not cause this locking overhead.

    FYI, Erlang/OTP PRNGs in rand module has three algorithms: Xorshift116+ as default, Xorshift64* and Xorshift1024* are the other choices. Seeds must be given explicitly, or chosen to use the one stored in the process dictionary (a per-process storage area). Note in Erlang processes are roughly equivalent to C or Java threads, but each process has its own dictionary, not sharable with other processes.

  5. Another point, which can be important for a PRNG is the ability to split it. That is, support a function `split : rngstate -> rngstate * rngstate` which produces a fork in the RNG seed at that state.

    This allows you to carry out concurrent work but recreate the original state from a seed, since you can use this to handle non-determinism. And it allows you to use the same RNG for a binary tree, but cut off a subtree without having to roll forward the RNG state. You just split at each node.

    Some libraries, notably those who provide tooling for property-based-testing are extremely reliant on such a splitting function in the PRNG they use.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.