Fastest way to compute the greatest common divisor

Given two positive integers x and y, the greatest common divisor (GCD) z is the largest number that divides both x and y. For example, given 64 and 32, the greatest common divisor is 32.

There is a fast technique to compute the GCD called the binary GCD algorithm or Stein’s algorithm. According to Wikipedia, it is 60% faster than more common ways to compute the GCD.

I have honestly never written a program where computing the GCD was the bottleneck. However, Pigeon wrote a blog post where the binary GCD fared very poorly compared to a simple implementation of Euler’s algorithm with remainders:

unsigned gcd_recursive(unsigned a, unsigned b)
    if (b)
        return gcd_recursive(b, a % b);
        return a;

Though Pigeon is a great hacker, I wanted to verify for myself. It seems important to know whether an algorithm that has its own wikipedia page is worth it. Unfortunately, the code on Wikipedia’s page implementing the binary GCD algorithm is either inefficient or slightly wrong. Here is a version using a GCC intrinsic function (__builtin_ctz) to find the number of trailing zeros:

unsigned int gcd(unsigned int u, unsigned int v) {
  int shift;
  if (u == 0)
    return v;
  if (v == 0)
    return u;
  shift = __builtin_ctz(u | v);
  u >>= __builtin_ctz(u);
  do {
    unsigned m;
    v >>= __builtin_ctz(v);
    v -= u;
    m = (int)v >> 31;
    u += v & m;
    v = (v + m) ^ m;
  } while (v != 0);
  return u << shift;

My result? Using integers in [0,2000), the simple version Pigeon proposed does 25 millions GCDs per second, whereas my binary GCD does 39 millions GCDs per second, a difference of 55% on an Intel core i7 desktop. Why do my results disagree with Pigeon? His version of the binary GCD did not make use of the intrinsic __builtin_ctz and used an equivalent loop instead. When I implemented something similarly inefficient, I also got a slower result (17 millions GCDs per second) which corroborates Pigeon’s finding.

My benchmarking code is available.

On a 64-bit machine, you probably can adapt this technique using the __builtin_ctzll intrinsic.

Update: You can read more about sophisticated GCD algorithms in the gmplib manual.

Conclusion: The Binary GCD is indeed a faster way to compute the GCD for 32-bit integers, but only if you use the right instructions (e.g., __builtin_ctz). And someone ought to update the corresponding Wikipedia page.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

33 thoughts on “Fastest way to compute the greatest common divisor”

  1. What I don’t get is that you have a speedup only on numbers of the form m—2^k and n—2^j, and the speed-up is proportional to min(j,k). How do you explain doubling the speed if asymptotically few pairs of numbers are of that form?

  2. By the way, the numbers you used for testing are relatively small. More complicated algorithms are often slower for small numbers and don’t show their efficiency until the numbers are bigger. Without using anything bigger than uint32 you could test numbers of size ~1’000’000’000.

  3. If you care about asymptotics, then both of these are quadratic. For a subquadratic algorithm, you need something like a half-gcd based algorithm.

  4. @Pigeon

    It is not necessary for the numbers to be divisible by two for them to benefit from the binary GCD.

    Take 3 and 5. After the first pass in the loop you get 3 and 2. The 2 gets back to 1 due to the ctz shift.

    The nice thing with the binary GCD is that it does not use any expensive operation (ctz is quite cheap on recent Intel processors) whereas the basic GCD relies on integer division.

  5. I wonder if you could save a cmpl by reusing the u > v comparison for the loop break as well. That is:

    if( u == v)
    else if (u > v)

    This will shorten the last iteration and probably speed up the speculative execution.

  6. @KWillets

    With clang, your version is faster. With GCC, the version in the blog post is faster. The difference is within 10%.

    If I played with compiler flags, there might be other differences as well.

    In any case, your version is on github if you want to benchmark it.

  7. Hi again Daniel, I can save a further 7.5% on my earlier suggestion by altering the loop to

    do {
    unsigned m;
    v >>= __builtin_ctz(v);
    m = (v ^ u) & -(v < u);
    u ^= m;
    v ^= m;
    v -= u;
    } while (v);

  8. For my tweak the assembler output from gcc has the comparison, then a branch to the top of the loop, then the same comparison :(. The second comparison isn’t reachable by any other path either.

    Maybe some syntactic shuffling would trigger the optimization; I may give it a few tries later.

  9. This is faster on my version of gcc:

    int shift, uz, vz;
    uz = __builtin_ctz(u);
    if ( u == 0) return v;

    vz = __builtin_ctz(v);
    if ( v == 0) return u;

    shift = uz > vz ? vz : uz;

    u >>= uz;

    do {
    v >>= vz;

    if (u > v) {

    unsigned int t = v;
    v = u;
    u = t;

    v = v – u;
    vz = __builtin_ctz(v);
    } while( v != 0 );

    return u << shift;


    gcd between numbers in [1 and 2000]
    26.4901 17.6991 32.7869 25.974 24.3902 31.746 36.6972

    I was actually trying to get it to utilize the fact that ctz sets the == 0 flag when its argument is 0, so a following test against 0 should not need an extra instruction. However the compiler didn't notice. Instead it set up some interesting instruction interleaving so that the v != 0 test is actually u == v before the subtraction; I believe this is to enable ILP.

    Also, using an inline xchg instruction for the swap doubles the speed:

    gcd between numbers in [1 and 2000]
    26.1438 16.3934 33.6134 25.974 25.4777 30.5344 72.7273
    gcd between numbers in [1000000001 and 1000002000]
    26.1438 16 33.8983 25.974 25.3165 29.6296 72.7273

  10. Here’s the asm for the swap; I just replaced the part inside the brackets with xswap(u,v):

    #define xswap(a,b) __asm__ (\
    “xchg %0, %1\n”\
    : : “r”(a), “r” (b));

    Unfortunately I don’t understand if this is correctly defined (I copied it from some poorly-documented examples), but the assembler output looks good.

  11. Looking at Steven’s asm listings, I realized that my compiler was significantly behind, so I downloaded 3G of Apple “updates” last night. These results are now from clang-500.2.79.

    I started playing around with various ways of getting abs(v-u) (especially when unsigned) and also realized that bsfl(x) == bsfl(-x), so this works for the inner loop on gcdwikipedia5fast:

    do {
    v >>= vz;
    unsigned int diff =v;
    diff -= u;
    vz = __builtin_ctz(diff);
    if( diff == 0 ) break;
    if ( v < u ) {
    u = v;
    v = 0 – diff;
    } else
    v = diff;

    } while( 1 );

    If diff is signed 32-bit it's slightly faster, abs(diff) can be used, and the v < u test can be switched to diff < 0 for a slight gain. But it becomes a 31-bit algorithm. I haven't tried signed 64-bit yet.

    Using bsfl(diff) instead of v seems to speed it up significantly; it's probably ILP again since it doesn't have to wait for v to finalize.

  12. Hold on, I just tried signed 64-bit and got a huge boost:

    do {
    v >>= vz;
    long long int diff = v ;
    diff -= u;
    vz = __builtin_ctz(diff);
    if( diff == 0 ) break;
    if ( diff < 0 )
    u = v;
    v = abs(diff);

    } while( 1 );

  13. @KWillets

    I added these two alternatives to the benchmark.

    I find that results vary a lot depending on the compiler and processor. It is hard to identify a clear winner… except that they are all faster than the Euclidean algorithm with remainder.

  14. I checked the new revision and the 64-bit version (7) should use abs() and a few other edits.

    Should I be submitting edits to github?

  15. It’s possible to slightly improve Ralph Corderoy’s branch-free code above (Dec. 28 comment) by using a difference delta rather than an xor delta.

    If you don’t mind limiting the input range to INT_MAX, the sign of (int)(v-u) can be used to control the swap:

    v -= u;
    mask = (int)v >> 31;
    u += v & mask; /* u + (v - u) = v */
    v = (v + mask) ^ mask; /* Conditional negate ~(v - 1) = -v */

    If you want to accept inputs up to UINT_MAX, it’s still possible to combine the subtract and mask formation with a bit of asm magic (x86 AT&T syntax) to get access to the carry flag:

    asm("sub %2,%1; sbb %0,%0" : "=r" (mask), "+r" (v) : "g" (u));

    Depending on the CPU, it may be worth spending an instruction to clear the mask to avoid a false dependency on its previous value. Add
    , “0” (0)
    to the end of the list of input parameters. (For those not familiar with GCC asm syntax, the 0 in quotes means that this input operand should be in the same register as output operand 0, the mask. The 0 in parens is the operand value. GCC will generate an xorl instruction to zero the mask.)

    1. Your proposal was added to the benchmark. Here are the numbers on my laptop:

      ❯ ./gcd
      gcd between numbers in [1 and 2000]
      Running tests... Ok!
      We proceed to report timings (smaller values are better).
      basicgcd                    54.0541
      gcdwikipedia2               23.5294
      gcdwikipedia2fast           66.6667
      gcd_recursive               54.0541
      gcd_iterative_mod           53.3333
      gcdFranke                   68.9655
      gcdwikipedia3fast           66.6667
      gcdwikipedia4fast           54.0541
      gcdwikipedia5fast           66.6667
      gcdwikipedia2fastswap       64.5161
      gcdwikipedia7fast           86.9565
      gcdwikipedia7fast32         85.1064
      gcdwikipedia8Spelvin        58.8235
      gcd between numbers in [1000000001 and 1000002000]
      Running tests... Ok!
      We proceed to report timings (smaller values are better).
      basicgcd                    54.0541
      gcdwikipedia2               23.6686
      gcdwikipedia2fast           66.6667
      gcd_recursive               53.3333
      gcd_iterative_mod           54.0541
      gcdFranke                   68.9655
      gcdwikipedia3fast           65.5738
      gcdwikipedia4fast           54.7945
      gcdwikipedia5fast           66.6667
      gcdwikipedia2fastswap       64.5161
      gcdwikipedia7fast           86.9565
      gcdwikipedia7fast32         83.3333
      gcdwikipedia8Spelvin        58.8235
      1. I’ve just noticed your benchmarks don’t change the number range between runs — the offset is used only in the tests. With the benchmarks corrected, the mod approach shows faster on the larger value range.

        Also noting the label on the results is misleading. The benchmark appears to be calculating ops per ms, not the timings directly, so larger is better.

        Results on my TGL-H laptop after corrections:

        gcd between numbers in [1 and 2000]
        Running tests... Ok!
        Kops/ms (larger values are better).
        basicgcd 50
        gcdwikipedia2 20.3046
        gcdwikipedia2fast 44.4444
        gcd_recursive 50
        gcd_iterative_mod 48.1928
        gcdFranke 46.5116
        gcdwikipedia3fast 45.4545
        gcdwikipedia4fast 61.5385
        gcdwikipedia5fast 44.9438
        gcdwikipedia2fastswap 43.0108
        gcdwikipedia7fast 48.1928
        gcdwikipedia7fast32 76.9231
        gcdwikipedia8Spelvin 64.5161

        gcd between numbers in [1000000001 and 1000002000]
        Running tests... Ok!
        Kops/ms (larger values are better).
        basicgcd 37.7358
        gcdwikipedia2 7.15564
        gcdwikipedia2fast 16.4609
        gcd_recursive 37.037
        gcd_iterative_mod 37.7358
        gcdFranke 16.5289
        gcdwikipedia3fast 16.3934
        gcdwikipedia4fast 22.3464
        gcdwikipedia5fast 16.4609
        gcdwikipedia2fastswap 16.3265
        gcdwikipedia7fast 18.3486
        gcdwikipedia7fast32 31.746
        gcdwikipedia8Spelvin 23.3918

        I’ve submitted a PR with the fixes

        1. Thanks. Here are my results after merging your fix.

          ❯ ./gcd
          gcd between numbers in [1 and 2000]
          Running tests... Ok! 
          We proceed to report kops/ms (larger values are better).
          basicgcd                    40.404
          gcdwikipedia2               20.1005
          gcdwikipedia2fast           54.7945
          gcd_recursive               42.5532
          gcd_iterative_mod           42.5532
          gcdFranke                   38.0952
          gcdwikipedia3fast           53.3333
          gcdwikipedia4fast           42.1053
          gcdwikipedia5fast           53.3333
          gcdwikipedia2fastswap       55.5556
          gcdwikipedia7fast           67.7966
          gcdwikipedia7fast32         51.2821
          gcdwikipedia8Spelvin        46.5116
          gcd_mod_faster              43.956
          gcd between numbers in [1000000001 and 1000002000]
          Running tests... Ok! 
          We proceed to report kops/ms (larger values are better).
          basicgcd                    30.5344
          gcdwikipedia2               7.28597
          gcdwikipedia2fast           19.802
          gcd_recursive               30.0752
          gcd_iterative_mod           30.303
          gcdFranke                   14.0351
          gcdwikipedia3fast           18.4332
          gcdwikipedia4fast           14.2857
          gcdwikipedia5fast           18.5185
          gcdwikipedia2fastswap       18.7793
          gcdwikipedia7fast           23.9521
          gcdwikipedia7fast32         18.5185
          gcdwikipedia8Spelvin        46.5116
          gcd_mod_faster              32.5203

Leave a Reply to Daniel Lemire Cancel reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend

You may subscribe to this blog by email.