Fast divisionless computation of binomial coefficients

Suppose that I give you a set of n objects and I ask you to pick k distinct objects, thus forming a new subset. How many such subsets are there? If you have taken college mathematics, you have probably learned that the answer is given by binomial coefficients. They are defined as n!/(k! * (n-k)!) where the exclamation point refers to the factorial. For a programmer, it is easier to think of binomial coefficients as the result of this loop:

def binom(n,k):
    top = 1
    bottom = 1
    for i in 1, 2, ..., k:
        top *= n - i + 1
        bottom *= i
    return top/bottom

Though simple enough, this algorithmic definition is not practical if you are interested in performance. Both the numerator and the denominator grow large quickly. They will soon require several machine words. A programming language like Python will happily give you the correct answer, but slowly. In Java or C, you are likely to get the wrong result due to a silent overflow.

Of course, if you know that the binomial coefficient is too large to fit in a machine word (64 bits), then you may as well go to a big-integer library. But what if you know that the result fits in a machine word? Maybe you have a reasonable bound on the size of n.

Then instead of waiting at the very end before doing the division, you can divide at each iteration in the loop:

def binom(n,k):
    answer = 1
    for i in 1, 2, ..., k:
        answer = answer * (n - k + 1) / k
    return answer

This new approach may still overflow even if the binomial coefficient itself fits in a machine word because we multiply before dividing. You can get around this issue by first finding a common divisor to both the multiplier and the divisor, and factoring it out. Or else, you can further restrict the values of n and k. Let us choose this last path.

We still have as a problem that we need k-1 multiplications and divisions. The multiplications are relatively cheap, but the divisions have longer latencies. We would prefer to avoid divisions entirely. If we assume that k is small, then we can just use the fact that we can always replace a division by a known value with a shift and a multiplication. All that is needed is that we precompute the shift and the multiplier. If there are few possible values of k, we can precompute it with little effort.

Hence, if you know that, say, n is smaller than 100 and k smaller than 10, the following function will work…

uint64_t fastbinomial(int n, int k) {
  uint64_t np = n - k;
  uint64_t answer = np + 1;
  for (uint64_t z = 2; z <= (uint64_t)k; z++) {
    answer = answer * (np + z); 
    fastdiv_t f = precomputed[z];
    answer >>= f.shift;
    answer *= f.inverse;
  return answer;

I provide a full portable implementation complete with some tests. Though I use C, it should work as-is in many other programming languages. It should only take tens of CPU cycles to run. It is going to be much faster than implementations relying on divisions.

Another trick that you can put to good use is that the binomial coefficient is symmetric: you can replace k by nk and get the same value. Thus if you can handle small values of k, you can also handle values of k that are close to n. That is, the above function will also work for n is smaller than 100 and k larger than 90, if you just replace k by nk.

Is that the fastest approach? Not at all. Because n is smaller than 100 and k smaller than 10, we can precompute (memoize) all possible values. You only need an array of 1000 values. It should fit in 8kB without any attempt at compression. And I am sure you can make it fit in 4kB with a little bit of compression effort. Still, there are instances where relying on a precomputed table of several kilobytes and keeping them in cache is inconvenient. In such cases, the divisionless function would be a good choice.

Alternatively, if you are happy with approximations, you will find floating-point implementations.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

4 thoughts on “Fast divisionless computation of binomial coefficients”

  1. I know you know this, but to clarify for others: a division may require a multiply and two shifts (case 3 in the Granlund-Montgomery-Warren algorithm) or more (case 4). Computing binomial coefficients can use a single shift because it falls into the special case of a division which is known a priori to be exact.

    Also, you can eke a tiny bit more range out of fastbinomial(n,k) if you do the multiplication by f.inverse before the shift.

    The largest value which must be represented internally within fastbinomial is fastbinomial(n,k) * k; if that overflows 64 bits, the result will be inaccurate. However, the overflow does not matter for the multiplication by the inverse, only the shift, so only fastbinomial(n,k) << f.shift needs to fit into 64 bits.

    1. Also, you can eke a tiny bit more range out of fastbinomial(n,k) if you do the multiplication by f.inverse before the shift.

      I do not understand this comment. Can you elaborate?

  2. I was just working on a similar problem, and for my application I have 0 <= n <= 64 and 0 <= k <= 64, and storing a lookup table is acceptable. Here’s a trick I used for using symmetry to only store half of each row.

    We start by precomputing all binom(n, k) via the recurrence B(n, k) = B(n - 1, k - 1) + B(n - 1, k) and B(n, n) = 1 and concatenating the rows into a flat array. Then, computing B(n, k) involves looking up the start of the nth row and taking the kth element of that row. The ith row has length i + 1, so the nth row starts at 1 + 2 + ... + n, which is n * (n + 1) / 2.

    1 1
    1 2 1
    1 3 3 1
    1 4 6 4 1

    But, we’d like to store only the first half of each row, computing B(n, k) as B(n, min(k, n - k)).

    1 2
    1 3
    1 4 6

    Now, the length of the nth row is ceil((i + 1) / 2), and since ceil((n + 1) / m) = floor(n / m) + 1, we can compute this as n // 2 + 1 using integer division. To compute the start of the nth row in our table, we want to compute \sum_{i=0}^{n-1} {i // 2 + 1} in closed form.

    n: 0 1 2 3 4 5 ...
    row_len: 1 1 2 2 3 3 ...
    row_start: 0 1 2 4 6 9 ...

    If we have even n = 2m, then the sum of the previous row_lens is just 2 * (1 + 2 + ... m):

    row_start(2m) = 2 * (1 + 2 + ... + m)
    = 2 * m * (m + 1) / 2
    = m * (m + 1)

    Then, if we have odd n = 2m + 1, we need to add in row_len(2m).

    row_start(2m + 1) = m * (m + 1) + row_len(2m)
    = m * (m + 1) + (m + 1)
    = (m + 1) * (m + 1)

    So we can now combine the two cases:

    row_start(n) = (n // 2 + n % 2) * (n // 2 + 1)

    The resulting table is compact, and querying it is efficient.

    fn binomial_coefficient(n: u8, k: u8) -> u64 {
    let (q, r) = (n as usize / 2, n as usize % 2);
    let row_start = (q + r) * (q + 1);
    let k = cmp::min(k, n - k) as usize;
    COEFFICIENT_TABLE[row_start + k]

    My code is in this pull request.

Leave a Reply to Sujay Jayakar Cancel reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    Markdown is turned off in code blocks:
     [This is not a link](

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see