Suppose that I give you a set of *n* objects and I ask you to pick *k* distinct objects, thus forming a new subset. How many such subsets are there? If you have taken college mathematics, you have probably learned that the answer is given by binomial coefficients. They are defined as n!/(k! * (n-k)!) where the exclamation point refers to the factorial. For a programmer, it is easier to think of binomial coefficients as the result of this loop:

def binom(n,k): top = 1 bottom = 1 for i in 1, 2, ..., k: top *= n - i + 1 bottom *= i return top/bottom

Though simple enough, this algorithmic definition is not practical if you are interested in performance. Both the numerator and the denominator grow large quickly. They will soon require several machine words. A programming language like Python will happily give you the correct answer, but slowly. In Java or C, you are likely to get the wrong result due to a silent overflow.

Of course, if you know that the binomial coefficient is too large to fit in a machine word (64 bits), then you may as well go to a big-integer library. But what if you know that the result fits in a machine word? Maybe you have a reasonable bound on the size of *n*.

Then instead of waiting at the very end before doing the division, you can divide at each iteration in the loop:

def binom(n,k): answer = 1 for i in 1, 2, ..., k: answer = answer * (n - k + 1) / k return answer

This new approach may still overflow even if the binomial coefficient itself fits in a machine word because we multiply before dividing. You can get around this issue by first finding a common divisor to both the multiplier and the divisor, and factoring it out. Or else, you can further restrict the values of *n* and *k*. Let us choose this last path.

We still have as a problem that we need *k-1* multiplications and divisions. The multiplications are relatively cheap, but the divisions have longer latencies. We would prefer to avoid divisions entirely. If we assume that *k* is small, then we can just use the fact that we can always replace a division by a known value with a shift and a multiplication. All that is needed is that we precompute the shift and the multiplier. If there are few possible values of *k*, we can precompute it with little effort.

Hence, if you know that, say, *n* is smaller than 100 and *k* smaller than 10, the following function will work…

uint64_t fastbinomial(int n, int k) { uint64_t np = n - k; uint64_t answer = np + 1; for (uint64_t z = 2; z <= (uint64_t)k; z++) { answer = answer * (np + z); fastdiv_t f = precomputed[z]; answer >>= f.shift; answer *= f.inverse; } return answer; }

I provide a full portable implementation complete with some tests. Though I use C, it should work as-is in many other programming languages. It should only take tens of CPU cycles to run. It is going to be much faster than implementations relying on divisions.

Another trick that you can put to good use is that the binomial coefficient is symmetric: you can replace *k* by *n*–*k* and get the same value. Thus if you can handle small values of *k*, you can also handle values of *k* that are close to *n*. That is, the above function will also work for *n* is smaller than 100 and *k* larger than 90, if you just replace *k* by *n*–*k*.

Is that the fastest approach? Not at all. Because *n* is smaller than 100 and *k* smaller than 10, we can precompute (memoize) all possible values. You only need an array of 1000 values. It should fit in 8kB without any attempt at compression. And I am sure you can make it fit in 4kB with a little bit of compression effort. Still, there are instances where relying on a precomputed table of several kilobytes and keeping them in cache is inconvenient. In such cases, the divisionless function would be a good choice.

Alternatively, if you are happy with approximations, you will find floating-point implementations.

Daniel Lemire, "Fast divisionless computation of binomial coefficients," in *Daniel Lemire's blog*, February 26, 2020.

I know you know this, but to clarify for others: a division may require a multiply and two shifts (case 3 in the Granlund-Montgomery-Warren algorithm) or more (case 4). Computing binomial coefficients can use a single shift because it falls into the special case of a division which is

known a priori to be exact.Also, you can eke a tiny bit more range out of

`fastbinomial(n,k)`

if you do the multiplication by`f.inverse`

before the shift.The largest value which must be represented internally within

`fastbinomial`

is`fastbinomial(n,k) * k`

; if that overflows 64 bits, the result will be inaccurate. However, the overflow does not matter for the multiplication by the inverse, only the shift, so only`fastbinomial(n,k) << f.shift`

needs to fit into 64 bits.I do not understand this comment. Can you elaborate?

And if anyone is wondering why the division is exact (like I was), I wrote up an explanation here: https://ro-che.info/articles/2020-03-22-binomial-coefficients-integer-division

I was just working on a similar problem, and for my application I have

`0 <= n <= 64`

and`0 <= k <= 64`

, and storing a lookup table is acceptable. Here’s a trick I used for using symmetry to only store half of each row.We start by precomputing all

`binom(n, k)`

via the recurrence`B(n, k) = B(n - 1, k - 1) + B(n - 1, k)`

and`B(n, n) = 1`

and concatenating the rows into a flat array. Then, computing`B(n, k)`

involves looking up the start of the`n`

th row and taking the`k`

th element of that row. The`i`

th row has length`i + 1`

, so the`n`

th row starts at`1 + 2 + ... + n`

, which is`n * (n + 1) / 2`

.`1`

1 1

1 2 1

1 3 3 1

1 4 6 4 1

But, we’d like to store only the first half of each row, computing

`B(n, k)`

as`B(n, min(k, n - k))`

.`1`

1

1 2

1 3

1 4 6

Now, the length of the

`n`

th row is`ceil((i + 1) / 2)`

, and since`ceil((n + 1) / m) = floor(n / m) + 1`

, we can compute this as`n // 2 + 1`

using integer division. To compute the start of the`n`

th row in our table, we want to compute`\sum_{i=0}^{n-1} {i // 2 + 1}`

in closed form.`n: 0 1 2 3 4 5 ...`

row_len: 1 1 2 2 3 3 ...

row_start: 0 1 2 4 6 9 ...

If we have even

`n = 2m`

, then the sum of the previous`row_len`

s is just`2 * (1 + 2 + ... m)`

:`row_start(2m) = 2 * (1 + 2 + ... + m)`

= 2 * m * (m + 1) / 2

= m * (m + 1)

Then, if we have odd

`n = 2m + 1`

, we need to add in`row_len(2m)`

.`row_start(2m + 1) = m * (m + 1) + row_len(2m)`

= m * (m + 1) + (m + 1)

= (m + 1) * (m + 1)

So we can now combine the two cases:

`row_start(n) = (n // 2 + n % 2) * (n // 2 + 1)`

The resulting table is compact, and querying it is efficient.

`fn binomial_coefficient(n: u8, k: u8) -> u64 {`

let (q, r) = (n as usize / 2, n as usize % 2);

let row_start = (q + r) * (q + 1);

let k = cmp::min(k, n - k) as usize;

COEFFICIENT_TABLE[row_start + k]

}

My code is in this pull request.