Greatest common divisor, the extended Euclidean algorithm, and speed!

We sometimes need to find the greatest common divisor between two integers in software. The fastest way to compute the greatest common divisor might be the binary Euclidean algorithm. In C++20, it can be implemented generically as follows:

#include <bit>      // std::countr_zero (C++20)
#include <utility>  // std::swap

// int_type should be an unsigned integer type:
// std::countr_zero only accepts unsigned types
template <typename int_type>
int_type binary_gcd(int_type u, int_type v) {
  if (u == 0) { return v; }
  if (v == 0) { return u; }
  auto shift = std::countr_zero(u | v);
  u >>= std::countr_zero(u);
  do {
    v >>= std::countr_zero(v);
    if (u > v) { std::swap(u, v); }
    v = v - u;
  } while (v != 0);
  return u << shift;
}

The std::countr_zero function computes the “number of trailing zeroes” in an integer. A key insight is that this function often translates into a single instruction on modern hardware.

Its computational complexity is linear in the number of bits of the larger of the two integers: each iteration of the loop removes at least one bit from one of the operands.

There are many variations that might be more efficient. I like an approach proposed by Paolo Bonzini which is simpler as it avoids the swap:

template <typename int_type>
int_type binary_gcd_noswap(int_type u, int_type v) {
  if (u == 0) { return v; }
  if (v == 0) { return u; }
  auto shift = std::countr_zero(u | v);
  u >>= std::countr_zero(u);
  do {
    int_type t = v >> std::countr_zero(v);
    if (u > t) { v = u - t; u = t; }
    else { v = t - u; }
  } while (v != 0);
  return u << shift;
}

The binary Euclidean algorithm is typically faster than the textbook Euclidean algorithm which has to do divisions (a slow operation), although the resulting code is pleasantly short:

template <typename int_type>
int_type naive_gcd(int_type u, int_type v) {
  // guard against division by zero when v is 0
  if (v == 0) { return u; }
  return (u % v) == 0 ? v : naive_gcd(v, u % v);
}

There are cases where the naive GCD algorithm is faster. For example, if v divides u, which is always the case when v is 1, then the naive algorithm returns immediately whereas the binary GCD algorithm might require many steps if u is large.

To balance the result, we can use a hybrid approach where we first use a division, as in the conventional Euclidean algorithm, and then switch to the binary approach:

template <class int_type>
int_type hybrid_binary_gcd(int_type u, int_type v) {
  if (u < v) { std::swap(u, v); }
  if (v == 0) { return u; }
  u %= v;
  if (u == 0) { return v; }
  auto zu = std::countr_zero(u);
  auto zv = std::countr_zero(v);
  auto shift = std::min(zu, zv);
  u >>= zu;
  v >>= zv;
  do {
    int_type u_minus_v = u - v;
    if (u > v) { u = v, v = u_minus_v; }
    else { v = v - u; }
    // When u == v, the difference is zero and countr_zero would
    // return the word size, making the shift undefined.
    if (v == 0) { break; }
    // For unsigned types, u - v and v - u have the same number of
    // trailing zeros, so reusing the precomputed difference is safe
    // and shortens the dependency chain.
    v >>= std::countr_zero(u_minus_v);
  } while (v != 0);
  return u << shift;
}

I find it interesting that there is now a std::gcd function in the C++ standard library (since C++17), so you may not want to implement your own greatest-common-divisor function if you are programming in modern C++.

For the mathematically inclined, there is also an extended Euclidean algorithm. It computes not only the greatest common divisor, but also the Bézout coefficients. That is, given two integers a and b, it finds integers x and y such that x * a + y * b = gcd(a,b). I must admit that I never had any need for the extended Euclidean algorithm. Wikipedia says that it is useful to find multiplicative inverses in modular arithmetic, but the only multiplicative inverses I ever needed were computed with a fast Newton algorithm. Nevertheless, we might implement it as follows:

template <typename int_type> struct bezout {
  int_type gcd;
  int_type x;
  int_type y;
};

// computes the greatest common divisor between a and b,
// as well as the Bézout coefficients x and y such that
// a*x + b*y = gcd(a,b)
template <typename int_type>
bezout<int_type> extended_gcd(int_type u, int_type v) {
  std::pair<int_type, int_type> r = {u, v};
  std::pair<int_type, int_type> s = {1, 0};
  std::pair<int_type, int_type> t = {0, 1};
  while (r.second != 0) {
    auto quotient = r.first / r.second;
    r = {r.second, r.first - quotient * r.second};
    s = {s.second, s.first - quotient * s.second};
    t = {t.second, t.first - quotient * t.second};
  }
  return {r.first, s.first, t.first};
}

There is also a binary version of the extended Euclidean algorithm, although it is quite a bit more involved and it is not clear that it can be implemented at high speed, leveraging fast instructions, when working on integers that fit in general-purpose registers. It may be beneficial when working with big integers. I am not going to reproduce my implementation, but it is available in my software repository.

To compare these functions, I decided to benchmark them over random 64-bit integers. I found it interesting that the majority of pairs of random integers (about two thirds) were coprime, meaning that their greatest common divisor is 1. Mathematically, we would expect the ratio to be 6/π² ≈ 0.61, which matches the empirical result. At least some pairs had non-trivial greatest common divisors (e.g., 42954).

Computing the greatest common divisor takes hundreds of instructions and hundreds of CPU cycles. If you somehow need to do it often, it could become a bottleneck.

I find that the std::gcd implementation that is part of the GCC C++ library (libstdc++) under Linux is about as fast as the binary Euclidean function I presented. I have not looked at the implementation, but I assume that it is well designed. The version in the C++ library that ships on macOS (libc++) appears to be the naive implementation. Thus there is an opportunity to improve the libc++ implementation.

The extended Euclidean-algorithm implementation runs at about the same speed as a naive regular Euclidean-algorithm implementation, which is what you would expect. My implementation of the binary extended Euclidean algorithm is quite a bit slower and not recommended. I expect that it should be possible to optimize it further.

function           GCC 12 + Intel Ice Lake   Apple LLVM + M2
std::gcd           7.2 million/s             7.8 million/s
binary             7.7 million/s             12 million/s
binary (no swap)   9.2 million/s             14 million/s
hybrid binary      12 million/s              17 million/s
extended           2.9 million/s             7.8 million/s
binary ext.        0.7 million/s             2.9 million/s

It may seem surprising that the extended Euclidean algorithm runs at the same speed as std::gcd on some systems, despite the fact that it appears to do more work. However, the computation of the Bézout coefficients along with the greatest common divisor is not on the critical path and can be folded in with the rest of the computation on a superscalar processor, so the result is expected.

My source code is available.

As part of the preparation of this blog post, I had initially tried writing a C++ module. It worked quite well on my MacBook. However, it fell apart under Linux with GCC, so I reverted the change. I was quite happy with how modules made the code simpler, but they are not yet sufficiently portable.

Credit: Thanks to Harold Aptroot for a remark about the probability of two random integers being coprime.

Further reading: The libc++ library might update its std::gcd implementation.

Daniel Lemire, "Greatest common divisor, the extended Euclidean algorithm, and speed!," in Daniel Lemire's blog, April 13, 2024.


5 thoughts on “Greatest common divisor, the extended Euclidean algorithm, and speed!”

  1. libstdc++ (used by GCC) implements std::gcd() with Stein’s algorithm, so it’s no surprise it’s about as fast as your implementation of the same algorithm.

    libc++ (used by Clang) implements std::gcd() with the textbook modulo-based algorithm, so it makes sense that it’s slower than the binary algorithm (though not by that much).

    It’s interesting to see that the extended Euclidean algorithm incurs basically no additional cost beyond the basic algorithm; it’s probably because the additional bookkeeping necessary can be executed in parallel with the main algorithm. This is a testament to how well modern CPUs are able to execute instructions in parallel so long as they are independent. On much older CPUs, there probably would have been a significant performance penalty!

  2. You can obtain an even faster gcd (a double-digit percentage improvement for me) with a small trick to avoid an instruction dependency:

    template <typename int_type>
    int_type binary_gcd_noswap(int_type u, int_type v) {
      if ((u == 0) || (v == 0)) [[unlikely]] {
        return u + v;
      }
      auto zu = std::countr_zero(u);
      auto zv = std::countr_zero(v);
      auto shift = std::min(zu, zv);
      u >>= zu;
      v >>= zv;
      do {
        int_type u_minus_v = u - v;
        if (u > v)
          u = v, v = u_minus_v;
        else
          v = v - u;
        // v >>= std::countr_zero(v); !!! avoid instruction dependency
        v >>= std::countr_zero(u_minus_v);
      } while (v != 0);
      return u << shift;
    }

    1. The binary GCD still has, effectively, logarithmic complexity… it should limit how large the difference can be.

      The example provided in this pull request is one where v divides u. I am not certain how common that is, but if it is common, then the naive algorithm might be faster.
