We often use random numbers in software, in applications such as simulations or machine learning. Fast random number generators tend to produce integers in [0, 2^32) or [0, 2^64). Yet many applications require integers in a given interval, say {0, 1, 2, 3, 4, 5, 6, 7}. Thus standard libraries frequently provide a way to request an integer within a range.
How does it work underneath? Typically, a 64-bit or 32-bit integer is produced and then converted into an integer within the desired range. Unfortunately, as pointed out in detail by Melissa O’Neill, doing so without introducing undue bias is slow. That is, it can take more time to convert the integer to the range than to produce the random integer in the first place!
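For reference, the classic unbiased approaches rely on divisions plus rejection. Here is a minimal C sketch in the spirit of OpenBSD’s arc4random_uniform, assuming a random64() function that returns uniform 64-bit integers:

#include <stdint.h>

uint64_t random64(void); // any uniform 64-bit generator

// unbiased, but costs two division instructions per successful call
uint64_t openbsd_style_bounded(uint64_t s) {
  uint64_t threshold = -s % s; // 2^64 mod s, via unsigned wraparound
  uint64_t x = random64();
  while (x < threshold)        // reject so the accepted range is a multiple of s
    x = random64();
  return x % s;
}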
In Fast Random Integer Generation in an Interval, we show how to drastically accelerate this computation compared to standard libraries. The main trick is to avoid the use of division instructions as much as possible, since divisions are much slower than multiplications.
In the paper, there is a precautionary note about the applicability of this technique to GPUs. Indeed, there are substantial differences between general-purpose 64-bit processors and common (32-bit) GPUs. A reader, Norbert Juffa, reached out to me to point out that the note might be unwarranted. Juffa wrote a benchmark using the NVIDIA API (CUDA) to support his claim.
The fast function that avoids divisions as much as possible can be expressed in a few lines of C; the __umul64hi intrinsic (CUDA) returns the high 64 bits of the full 128-bit product.
// returns a value in [0,s)
uint64_t nearlydivisionless(uint64_t s) {
  uint64_t x = random64();
  // compute (x * s) >> 64: h gets the high 64 bits of the 128-bit product
  uint64_t h = __umul64hi(x, s);
  uint64_t l = x * s; // low 64 bits of the product
  if (l < s) { // rare case; only here might a division run
    uint64_t t = -s % s; // 2^64 mod s
    while (l < t) { // reject the low values to remove the bias
      x = random64();
      h = __umul64hi(x, s);
      l = x * s;
    }
  }
  return h;
}
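Two remarks on this code. Since s is an unsigned 64-bit integer, -s wraps around to 2^64 - s, so -s % s computes 2^64 mod s without any 128-bit arithmetic. Moreover, that division only runs when l < s, which happens with probability s/2^64; for typical ranges, almost every call completes with just a multiplication and a comparison.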
What Juffa does is generate 10,000,000 integers in the interval [0,500,000] from Marsaglia’s KISS64 random number generator. On an NVIDIA Quadro P2000 card, he shows that the approach that minimizes the use of divisions is much faster:
Approach | Time
OpenBSD-like | 5 ms
Java-like | 2.9 ms
Our approach | 1.4 ms
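For illustration only, here is a hypothetical CUDA sketch of how such a benchmark kernel might be structured; this is not Juffa’s actual code, and the kernel name, launch configuration, and per-thread generator are assumptions:

#include <cstdint>

__device__ uint64_t random64(); // stand-in for a per-thread KISS64 generator

__device__ uint64_t nearlydivisionless(uint64_t s); // the function above

// each thread fills a slice of the output buffer via a grid-stride loop
__global__ void fill_bounded(uint64_t *out, size_t n, uint64_t s) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (; i < n; i += stride)
    out[i] = nearlydivisionless(s);
}

// launch, e.g.: fill_bounded<<<128, 256>>>(d_out, 10000000, 500000);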
Very interesting stuff, Daniel. Do I get it correctly that you use your previously posted trick to avoid the modulo operation?
I am not a CUDA expert, but it seems totally crazy that loops and conditionals are faster than a modulo operation on a GPU. The GPU must stop the cores that don’t follow a branch (and execute them later). Conditionals should have a much higher penalty than a poor modulo operation. Is the modulo operation so slow on GPUs, or am I missing something?
The same argument is also partially true for CPUs: a single branch misprediction costs about 10-15 instructions, the same as a modulo operation.
Do I get it correctly that you use your previously posted trick to avoid the modulo operation?
Yes.
I am not a CUDA expert, but it seems totally crazy that loops and conditionals are faster than a modulo operation on a GPU. The GPU must stop the cores that don’t follow a branch (and execute them later). Conditionals should have a much higher penalty than a poor modulo operation. Is the modulo operation so slow on GPUs, or am I missing something?
It is not possible to have unbiased numbers, as far as I can tell, without branching, so the code here compares three strategies involving branching. All should involve just about the same amount of branching.
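For context, the Java-like strategy might be sketched as follows: one division per attempt, rejecting candidates drawn from the biased tail (based on the description in the paper, with random64() assumed as before):

uint64_t java_style_bounded(uint64_t s) {
  uint64_t x = random64();
  uint64_t r = x % s;  // one division per attempt
  while (x - r > -s) { // -s wraps to 2^64 - s: x came from the biased tail
    x = random64();
    r = x % s;
  }
  return r;
}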
You can skip the branching at the expense of some (small) bias, but it was not tested here.
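The branch-free variant would keep only the high 64 bits of a single product; a one-line sketch (again, untested here):

// biased but branch-free: some values in [0,s) occur one extra time per 2^64 draws
uint64_t biased_bounded(uint64_t s) {
  return __umul64hi(random64(), s); // floor((x * s) / 2^64)
}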
I wonder if a different API could speed things up, for example bulk generation.
If the order of the numbers doesn’t matter (is it always important?), you could split the job: for example, instead of generating 1 million numbers in 0..99, generate the numbers in blocks, 0..63, 64..95, and 96..99. How many numbers go into each block would itself need to be somewhat random (a Poisson distribution, I think); calculating that would require O(log n).
Once I had to generate random numbers that needed to be unique and somewhat sorted. Sure, you can make them unique by sorting, but that is slow. I found a faster algorithm; see this discussion.