Doubling the speed of std::uniform_int_distribution in the GNU C++ library

The standard way in C++ to generate a random integer in a range is to call the std::uniform_int_distribution function. The current implementation of std::uniform_int_distribution in the GNU C++ library (libstdc++) to generate an integer in the interval [0,range] looks as follows.

scaling = random_range / range;
limit = range * scaling;
do
   answer = random;
while ( answer >= limit);
return  answer / scaling;

Typically, it requires two divisions by invocation. Even on recent processors, integer division is relatively expensive.

We can achieve the same algorithmic result with at most one division, and typically no division at all without requiring more calls to the random number generator. It speeds things up on various systems. It was recently added to Swift language.

Travis Downs suggested that someone should try to fix the GNU C++ implementation. So I prepared and submitted a patch. It is currently under review.

The main challenge is that we need to be able to compute the “full” product. E.g., given two 64-bit integers, we want the 128-bit result; given two 32-bit integers we want the 64-bit result. This is fast on common processors.

The 128-bit product is not natively supported in C/C++ but can be achieved with the __int128 128-bit integer extension when it is available. Thankfully, we can check for __int128 support, and when the support is lacking, we can fall back on another approach. The 128-bit integer extension appears in many compilers (LLMV clang, GNU GCC, Intel icc), but even within a given compiler (say GNU GCC), you cannot rely on the extension being always present, since the compiler implementation might be system specific.

The code I ended up with is somewhat ugly, but it works:

      __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
      uint64_t __lsb = uint64_t(__product);
      if(__lsb < __uerange)
      {
        uint64_t __threshold = -uint64_t(__uerange) % uint64_t(__uerange);
        while (__lsb < __threshold)
        {
          __product = (unsigned __int128)(__urng() - __urngmin) * __uerange;
          __lsb = uint64_t(__product);
        }
      }
      __ret = __product >> 64;

It mostly avoids divisions. I designed a simple random shuffling benchmark that mostly just calls std::shuffle which, in turn, relies on std::uniform_int_distribution. Given an array of one million elements, I get the following timings on a skylake processor with GNU GCC 8.3:

before patch 15 ns/value
after patch 8 ns/value

For fun, let us compare with the implementation that is present in the Swift standard library:

     var random: T = next()
     var m = random.multipliedFullWidth(by: upperBound)
     if m.low < upperBound {
       let t = (0 &- upperBound) % upperBound
       while m.low < t {
         random = next()
         m = random.multipliedFullWidth(by: upperBound)
       }
     }
     return m.high

I think it is more readable in part because the Swift language has built in support for the full multiplication. It is somewhat puzzling that the C++ language does not see fit to make it easy to compute the full product between two integers.

Reference: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

37 thoughts on “Doubling the speed of std::uniform_int_distribution in the GNU C++ library”

  1. Following my comments on earlier posts, I put together a repo with a few variations on this method which might be interesting.

    One observation after plugging in std::random_device is that random bits are expensive, and another direction might be to minimize how many we use. Shuffle, for instance, can keep track of the bit width of its index as it works its way down, and rejection within just that many bits may prove faster due to lower entropy consumption.

    With pseudorandom generators of course it’s different, so we almost need different algorithms for the different sources.

    1. Wouldn’t bit-optimal use of entropy effectively correspond with arithmetic decoding of a compressed bitstream? How hard that would be? I guess there would be a spectrum from fast and bit-expensive to slow and (almost?) bit-optimal solutions to such a problem.

      1. AC did seem relevant, and it might be good to look through the different solutions. I have a feeling that left-to-right infinite-precision multiplication is already well known somewhere, but I haven’t found a reference.

        1. If someone wants to work through this, I am available.

          I think that there are papers on using as few random bits as possible, I read a few but I did not pay attention because that’s not what interested me at the time.

          1. I have a preliminary implementation which consumes “optimal” amount of whole bits per call. (It’s not based on AC, though.) There are two variants: one that is easy to understand, and another that’s a little hairier but branch predictor friendly and a bit faster, although it conceptually does lot of unnecessary work.

            The branch predictor friendly version would seem to have overhead of about 9 ns per call for small non-power-of-two values on my couple year old 2.9 GHz laptop. I still have to clean up my code…

          2. I have experimented with some code that uses bits a bit more efficiently: a variant of “bitmask with rejection” algorithm (to which the “mostly avoiding divisions” algorithm transforms when variable number of bits are used, I believe), and a “long multiplication” algorithm. Of both there are naive and loop-unrolled versions (latter to please branch predictors).

            I think the bitmask with rejection algorithm consumes log(4)log2(n) bits on average for large n, and the long multiplication algorithm does alway use less than log2(n)+2 on average for a specific value of n, the number of possible result values. Thus, the long multiplication algorithm uses the bit material more efficiently, although for almost all values of n it imposes a two-bit overhead over optimal fractional bit algorithms, and actually loses to the bitmask with rejection algorithm when n is little less than an integer power of two.

            My oversimplified benchmark (including xorshift64s and bit management) spends around 4.5 ns per call on naive bitmask w/ rejection function, and 4.3 ns on unrolled variant (for n=24; these can vary significantly depending on value of n). In the case of long multiplication algorithm corresponding numbers on my 2.9 GHz ’16 laptop are 11.3 and 9.2 ns.

            My code is hopefully readable enough: https://pastebin.com/5c8zMk6B

            1. I fixed my long multiplication routine, and interestingly enough the non-unrolled variant is now faster, although it’s slower than than before. It now takes about 12.4 ns per call.

              I think this doesn’t really reflect performance of the long multiplication algorithm itself (which should be have run time independent of n on average, and shows with BOGUS_RANDOMNESS on my code to drop typical call time to <4 ns), but rather management overhead of the random bits pool. It might be that the numbers would be much better if one would acquire randomness in larger batches than 64 bits at a time and shift it from the pool in a branchless manner.

        2. This problem is awfully painful to think about without rational bignums if one wants to handle fractional bits (in the style of arithmetic coding), but if we choose to consume an whole (but possibly variable) number of input bits for every output value it becomes probably more tractable. (I leave the implementation as an exercise to the reader as it’s quite late evening on my time zone.:)

          1. “Efficient” AC is widespread in compression algorithms, and doesn’t need any bignum stuff. You just don’t get “exact” AC but rather just precision down to the number of bits you have (there are some tradeoffs e.g., with how often you flush).

            So I don’t see a problem to adapt those algorithms to the shuffle case: the analogy is more or less exact. However they all need one division per value, so I have my doubts.

            That said, I think Daniel’s technique works with any number of bits, right? So you don’t need a different algorithm to be efficient down to the bit: AC would only get you sub-bit precision, and your generator would have to be really slow for that to pay off, I think.

            We’ll that’s not strictly correct, since in the retry case Daniel’s algorithm will use another set of bits: Daniel is there a formula for the number of bits used in expectation for a given range?

            1. Okay, bad terminology (and I was also pretty tired).

              One thing I fear on using efficient arithmetic coding methods is introducing some of the skew that these routines are intending to avoid. Practical AC doesn’t need to care about that so much.

              Reasoning about expectation of number of bits on the case of code above is an interesting question. I think it should be easier to reason at least about a case of and “infinite-precision” binary fixed point number presenting range between 0 and the number of alternatives (non-inclusive). What is the expected amount of bits of such a random number that determines the random integer without uncertainty? The best case would be log2(maxval), the worst would be infinity for almost all values of maxval, I think.

              I wonder if such a routine could be turned into something akin to Bresenham’s algorithm in practice. Relying on single bit input steps would probably make that ugly on the branch predictor, though…

              1. Really good point about the bias. Of course in compression you don’t really care since it’s just a microscopic difference in the compression ratio but for this type of thing it would be a big deal.

                1. I suspect working around bias turns into nasty bignum math very quickly. On the other hand, if you want to generate uniformly distributed numbers on the same, relatively small (< square root of native register) range s, you can simply search uniform integers over range s^n (smaller than register range) using the routine I present on godbolt link below, and later chop them relatively efficiently to individual results using precomputed division/residual magic constants.

                  This drops expected number of entropy bits used for range 3 from 3 bits to 1.63 on 64-bit register implementation (optimal would be 1.58), for instance.

            2. You approximately have a geometric distribution with p = 1 – s/2^L where s is your range. I think that the expected number of trials before you have a success is 1/p, so 2^L/(2^L-s)… You consume L bits per trial, so it is about 2^L/(2^L-s) * L bits consumed to generate a random number in [0,s).

              You can solve for the optimal value of L. It is going to be higher than log2(s), evidently. My expectation is that the optimal L is not going to be very far from log2(s). But I am sure someone can derive a solid bound.

              Sorry, I have not done the math yet.

              1. Some empirical averages (of million random runs each) on used bits I observed on my code for 1 to 64 values:

                {0., 1., 2.99913, 2., 3.68596, 3.75048, 4.09385, 3., 4.56261, 4.56261, 4.7658, 4.62472, 5.00652, 5.03252, 5.21029, 4., 5.53107, 5.53158, 5.62542, 5.56342, 5.71718, 5.71846, 5.82045, 5.62477, 5.9298, 5.93728, 6.04736, 5.9685, 6.16978, 6.17932, 6.27255, 5., 6.51585, 6.51485, 6.56376, 6.53185, 6.60912, 6.60959, 6.65559, 6.56258, 6.70191, 6.70274, 6.75153, 6.71835, 6.79677, 6.79703, 6.84769, 6.62494, 6.90202, 6.90547, 6.95634, 6.92183, 7.0106, 7.01561, 7.07071, 6.96757, 7.12957, 7.13322, 7.18957, 7.14959, 7.25198, 7.2558, 7.30239, 6.}

                The plot doesn’t suggest it would be too obvious, but yes, it’s like 2-based logarithm and some overhead…

              2. No wonder my plots seemed odd, I had a bug on my code. But when corrected, it’d seem that average number of bits stays under log2(n + 1) + 2 bits for outputs of n alternatives.

              3. Average number of bits required for s alternatives (with a bit of help from OEIS in the form of mystery sequence A104594) is, at least for my code which consumes whole bits:

                x = ceil(log2(s)) + bitand(s, s-1)/2^(ceil(log2(s))-1)

                And for s >= 1: ceil(log2(s)) <= x < ceil(log2(s))+2

    2. The “bits are expensive” scenario is interesting. Note that the existing two-division approach does not use fewer bits… you know this, but a casual reader may not.

      In the case where bits are infinitely expensive, I guess the solution is known.

      1. The existing thing is suboptimal in just about every way.

        With expensive bits, the rightward extension idea may be adaptable to consuming only a few at a time, eg an 8-bit random times a 32-bit range. It would trade more adds and multiplies (every 8 bits instead of every 32) for fewer random bits consumed.

      2. Here’s a rough idea: precompute magic constants which convert divisions into multiplications on initialization of std::uniform_int_distribution. Maintain reasonable amount of usable bits of entropy (like 64 bits or something), and rescale a full native type of entropy bits to output range (using magic multiplicative constants). This is the preliminary result. Now, multiply this again by another magic constant to get a fixed-point fraction corresponding to “exact” value of bits which would represent this value on the native type, and compute xor of this with original entropy bits. If this result is not zero, find the most significant bit set, and return that bit and less significant remaining bits to the entropy pool (these bits were not necessary in deciding the output value) and return the preliminary result as the random value.

        If the result of xor is zero, continue by requesting more random bits and by performing long division algorithm (with above bit-checking) until the arbitrary-precision fraction and entropy bits diverge.

        If my reasoning is correct, this routine shouldn’t demand divisions beyond initialization, no excessively wide multiplications, and should result less than one bit of wasted entropy per generated random value.

          1. Naively so, or at least that they don’t change often. It is hard to reason how one could perform bit-efficient operations without that, it would eventually turn into implementing divisions in a way or another.

            (Extracting one bit of division result shouldn’t take more than something like 2-3 clock cycles, though, and every such bit would halve the likelihood of need to continue, but combining this with the approach of the blog post is a bit beyond me.)

              1. That definitely sabotages my approach. 🙂

                Maybe there’s a way to extend the multiplication method somehow, considering that need of extra bits should become increasingly unlikely, bit after bit.

                Bit-by-bit division wouldn’t be awfully hard to implement and it could run until the result would be known to be unbiased, but that would have a cost per produced bit. This might still be faster for many inputs than two hardware divisions (or similar overhead)!

              2. Here’s an alternative version. This is really pseudocode; I haven’t even tried compiling it. I tested the concept with ranges of 3 and 5 with corresponding Wolfram Language code and 2 and 3 bit “word size”, and it seemed to give unbiased results. So, likelihood of the code having significant bugs is large. (Writing the supporting code was just bigger hassle than translating the idea.)

                uint32_t get_uniform_rand(uint32_t range)
                {
                uint32_t shift = 32 - __builtin_ffs(range);
                uint32_t add = range << shift;
                uint64_t mul;
                uint32_t res;
                uint32_t frac;

                if (range == 0)
                {
                return 0;
                }

                frac = mul = (uint64_t)(get_random_bits(bits) << shift) * range;
                res = mul >> 32;

                /* Can the fractional part still cause res to increment? */
                while (frac > -add)
                {
                uint32_t oldfrac = frac *= 2;

                if (get_random_bit())
                {
                frac += add;
                }

                if (oldfrac > frac)
                {
                /* Wraparound, that is, carry. */
                return res + 1;
                }
                }

                return res;
                }

                The idea here is that first only limited amount of bits contributes to the multiplication, and then multiplication is performed one bit at a time until it’s known that further bits don’t contribute to the output value.

                Biggest problem with modern processors on this is probably branch prediction of the while loop. This code might be faster if the loop would be somehow unrolled couple of times and the point where all necessary bits were seen would be reasoned afterwards.

                    1. I think the bit-by-bit multiplication might be omitted (and full multiplication in the style of the blog post performed) for tracking the number of necessary bits of entropy, and this information could be extracted from the lower multiplication result world, but I have no proof for this, or clear hunch if this could be accomplished efficiently (without a division).

                      This still doesn’t remove the need that sometimes more than one word of entropy is truly needed to get an unbiased result.

      3. The “bits are expensive” scenario: arithmetic coding could be used, or ANS coding (“asymmetric numeral systems”, see wikipedia). ANS might be faster. The problem is probably correcting bias (I think for both arithmetic and ANS coding). For ANS coding, the generation loop looks something like this (Java), so it only uses multiplication / shift / add:

        private static final long TOP = 1L << 24;
        private static final int SHIFT = 12;
        private static final int MASK = (1 << SHIFT) - 1;

        // read 64 bit from the compressed stream
        long state = r.nextLong();
        // for all output bytes
        for (int i = 0; i < size; i++) {
        // read data from the state
        int x = (int) state & MASK;
        // lookup the code
        int c = freqToCode[x] & 0xff;
        // output
        out[i] = (byte) c;
        // update state depending on output and frequencies
        state = (freq[c] * (state >> SHIFT)) + x - cumulativeFreq[c];
        // if necessary, read from the compressed stream
        while (state < TOP) {
        state = (state << 32) | (r.nextInt() & 0xffffffffL);
        }
        }

        I don’t think tables are needed as frequencies should be equal for each entry in the alphabet. I implemented some ANS code in h2/src/test/org/h2/test/unit/TestAnsCompression.java (https://github.com/h2database/h2database).

  2. Great article. Since these are eventually arithmetic tricks, I hope that one day this will be the level of optimization that the compiler will give us.

  3. I was interested enough in this that I tried to understand it and make it work. Unfortunately, it didn’t. I found the problem.

    This code is very inefficient with memory, but it shows the problem and the fix.

    The idea is NOT to feed it random numbers, but to feed every number in the input integer size and see how they get binned out. Each output integer should have the an equal number of inputs that resolve to it.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int16_t nearlyDivisionlessY(uint16_t x, uint16_t divisor, uint16_t fixed) {
    uint32_t m;
    uint16_t l, v;
    m=(uint32_t)x*divisor;
    l=(uint16_t)m;
    if (l<divisor) {
    if (!fixed) {
    v=-divisor % divisor; //PROBLEM: referenced paper AND proposed kernel patch say use this
    } else {
    v=(uint16_t)(-divisor) % divisor; //FIX
    }
    if (l< v) {
    return -1;
    }
    }
    return m>>16;
    }

    #define MYMAX (UINT16_MAX+1)
    uint16_t bins[MYMAX];
    uint16_t bbins[MYMAX];

    int main(int argc, char *argv[]) {
    uint16_t i,divisor,fix;
    uint16_t b,base;
    for (fix=0; fix<=1; ++fix)
    for (divisor=6; divisor<=6; ++divisor) {
    if (!(divisor&(divisor-1))) continue; //skip powers of 2
    printf(" %u %s 0x10000 %% %u = %u\n",divisor, (fix?"fixed":"original"), divisor, (uint16_t)(0x10000UL % divisor));
    memset(bins,0,sizeof(bins));
    memset(bbins,0,sizeof(bbins));
    i=0;
    do {
    b=nearlyDivisionlessY(i,divisor,fix);
    if (0<=b)
    bins[b]++;
    } while (++i);
    i=0;
    do {
    bbins[bins[i]]++;
    if (bins[i]) {
    printf("%5u %5u\n", i, bins[i]);
    }
    } while (++i<divisor);
    i=0;
    do {
    if (bbins[i]) {
    printf("%5u %5u\n", i, bbins[i]);
    }
    } while (++i);
    }
    return 0;
    }
    /* output:
    6 original 0x10000 % 6 = 4
    0 10923
    1 10923
    2 10922
    3 10923
    4 10923
    5 10922
    10922 2
    10923 4
    6 fixed 0x10000 % 6 = 4
    0 10922
    1 10922
    2 10922
    3 10922
    4 10922
    5 10922
    10922 6
    */

  4. Another question on generating pseudorandom numbers on non-power-of-two interval: would it possible to implement PRNGs that emit uniformly distributed pseudorandom integers while evolving their state efficiently on conventional binary logic and performing fixed amount of work between outputs? Would there be a trivial method to construct such a PRNG for a specific interval size?

    Listed unbiased transforms from power-of-two PRNGs fail on the “fixed amount of work” part, and evolving the state as trits instead of bits, for instance, is unlikely to be particularly efficient to implement on conventional CPUs.

    1. Well, I should have thought it through…

      A simple uniform ternary toy PRNG (cycle length 243 == 3^5) might be, for instance:

      #include <stdint.h>

      uint8_t state;

      uint8_t random3(void)
      {
      state = (7 * state + 154) % 243;
      return state / 81;
      }

      1. If it’s any consolation if you hadn’t given me a minute to answer your question, I also would have though “no you can’t do it in guaranteed fixed time”, by analogy with the superficially similar problem “given a source of random bits, generate a random value between 1 and N, where N is not an integer power of 2” (or several equivalent formulations).

        At least for that problem (the one the post was about, I guess), I don’t think you can do it with a guaranteed fixed number of bits? Of course the expected number of bits is finite, due to the magic of exponential series, but you still need an arbitrary number of bits at least for some cases, IIRC.

        So I would have thought something similar here, but as you point out, obviously not.

        It makes sense though: the operators +, %, * aren’t inherently “binary” so it’s not too weird I guess that they work fine for non-power-of-two: it’s only when your underlying random source provides bits that the pow-2 becomes inherent. Or something handwavy like that.

        1. I guess I agree (in equally handwavy ways) on pretty much all of this.

          I think it’s intuitive that for non-integer powers of two you can’t guarantee an upper bound for number of bits that would produce a result when the input source is base-2, but the mean number of bits required, for any finite range is bounded. In an earlier comment I concluded that the long multiplication method should work with less than ceil(log2(n)) + 2 bits per call on average for any value (and at best log2(n) for integer powers of two, of course).

          1. I think the bit efficiency overhead (in practice, the rounding up of log2(n) and the + 2 part) could be largely eliminated for values of n much smaller than the word size: one could simply keep the seen random bits from the previous round, re-feed them to new calculation where n = n_new * n_previous, subtracting preceding value multiplied by n_previous from the result, and repeating this as long as n fits in the word size. This could provide bit efficiency approaching log2(n), about 3% (or 1/32) worse than optimal in the case of 64-bit words. The computational overhead in comparison with the naive implementation would be something like two multiplications and two additions per call.

            (This would be effectively a computation which would incrementally compute the result for range n_1 * n_2 * … * n_x and peel off values for individual values of n at the same time.)

            Thus one can get quite close to optimal bit efficiency wile maintaining unbiased results, without a need to resort to arbitrary precision arithmetic or divisions, at least when typical n is significantly smaller than the native word size – which should really be the common case for such a function.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax