# Computing the number of digits of an integer even faster

In my previous blog post, I documented how one might proceed to compute the number of digits of an integer quickly. E.g., given the integer 999, you want 3, but given the integer 1000, you want 4. It is effectively the integer logarithm in base 10.
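As a baseline, the quantity in question can be computed with repeated division. This reference implementation (a slow sketch, not the fast path discussed below) makes the specification concrete:

```c
#include <stdint.h>

// Reference implementation: count digits by repeated division by 10.
// Slow but unambiguous: digit_count_ref(999) == 3, digit_count_ref(1000) == 4.
int digit_count_ref(uint32_t x) {
  int n = 1;
  while (x >= 10) {
    x /= 10;
    n++;
  }
  return n;
}
```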

On computers, you can quickly compute the integer logarithm in base 2, and it follows that you can move from one to the other rather quickly. You just need a correction which you can implement with a table. A very good solution found in references such as Hacker’s Delight is as follows:

```
static uint32_t table[] = {9, 99, 999, 9999, 99999,
                           999999, 9999999, 99999999, 999999999};
int y = (9 * int_log2(x)) >> 5;
y += x > table[y];
return y + 1;
```

Except for the computation of the integer logarithm, it involves a multiplication by 9, a shift, a conditional move, a table lookup and an increment. Can you do even better? You might! Kendall Willets found an even more economical solution.

```
int fast_digit_count(uint32_t x) {
  static uint64_t table[] = {
      4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
      12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
      21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
      25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
      34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
      38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
      42949672960, 42949672960};
  return (x + table[int_log2(x)]) >> 32;
}
```

If I omit the computation of the integer logarithm in base 2, it requires just a table lookup, an addition and a shift:

```
add     rax, qword ptr [8*rcx + table]
shr     rax, 32
```

The table contains the numbers ceil(log10(2^j)) · 2^32 + 2^32 − 10^ceil(log10(2^j)) for j from 2 to 30, and then just ceil(log10(2^j)) · 2^32 for j = 31 and j = 32. The first value is 2^32.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 28 thoughts on “Computing the number of digits of an integer even faster”

1. KWillets says:

Let me give Josh Bleecher Snyder credit for the idea that int_log2 splits the input into ranges where log10 can only go up by 1 at most. Within each such interval the problem is reduced to comparing against a table-derived threshold and incrementing if greater.

1. foobar says:

Just a quick idea: wouldn’t zeroing the three bottom bits of the int_log2 result and indexing the array (without the multiplier 8) also create the possibility of a smaller table? Sure, this would increase latency by one clock cycle, but just in case the size of the table annoys somebody…

1. foobar says:

Err, not quite. It would probably be meaningful to try out those ideas before writing a comment…

1. Josh Bleecher Snyder says:

Relatedly, see this suggestion by Pete Cawley: https://mobile.twitter.com/i/timeline

1. Daniel Lemire says:

The comment in question is “As a tiny optimisation to make the tables three entries smaller: within `int_log2`, replace `|1` with `|8` – although then relying on C compiler to rewrite `table[int_log2(x)-3]` as `(table-3)[int_log2(x)]` and form `table-3` as a constant.”
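A sketch of what that suggestion might look like (my reconstruction, not Pete Cawley's actual code): `| 8` forces `int_log2` to return at least 3, so the first three table entries become dead and can be dropped, at the cost of an index adjustment that the compiler can fold into the table's base address:

```c
#include <stdint.h>

// With | 8 instead of | 1, the result is always >= 3, even for x < 8.
static int int_log2_min3(uint32_t x) { return 31 - __builtin_clz(x | 8); }

int fast_digit_count_small_table(uint32_t x) {
  // Entries for int_log2 classes 3..31 only (29 entries instead of 32);
  // values in [0, 7] fall into class 3, whose entry (2 * 2^32 - 10)
  // still classifies them correctly.
  static const uint64_t table[] = {
      8589934582,  12884901788, 12884901788, 12884901788, 17179868184,
      17179868184, 17179868184, 21474826480, 21474826480, 21474826480,
      21474826480, 25769703776, 25769703776, 25769703776, 30063771072,
      30063771072, 30063771072, 34349738368, 34349738368, 34349738368,
      34349738368, 38554705664, 38554705664, 38554705664, 41949672960,
      41949672960, 41949672960, 42949672960, 42949672960};
  return (int)((x + table[int_log2_min3(x) - 3]) >> 32);
}
```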

2. Josh Bleecher Snyder says:

Very nice! Three cheers for collaboration. 🙂

3. Thomas Müller Graf says:

I wonder if this very nice trick could be used for other problems, e.g. fast square root, especially the integer square root.

4. Tushar Bali says:

exciting stuff

5. George Spelvin says:

I still prefer the first solution because it extends to 64 bits naturally. The second solution can be considered a variant of the first where:
* The high half of the 9*bits/32 approximation is replaced by a table lookup,
* The threshold table lookup is indexed by number of bits, not number of digits (making it larger by a factor of log2(10) ≈ 3.32), and
* The threshold comparison is done in the (semantically equivalent) form of an add with carry to the low half of the table.

Bloating the table by a factor of 6.64 (for 32 bits, from 11×4 = 44 bytes to 32×8 = 256 bytes, or +3 cache lines) doesn’t seem like a good use of L1 cache, especially as any digit-counting function is going to be part of some much larger output formatting code and not an inner loop by itself.

There’s one more trick that hasn’t shown up yet and might be useful. Given a safe underestimate such as
y = 9*int_log2(x) / 32;
then x >> y can be used to compute the number of digits without losing accuracy; the low y bits of x don’t affect the result (because 10^y is a multiple of 2^y).

This might enable people to fit the necessary low and high parts of each table entry into a single word.
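A minimal sketch of that trick in C (my own illustration of the idea, assuming `int_log2` via a count-leading-zeros builtin): both x and the threshold 10^(y+1) are shifted right by y, and because 10^(y+1) is divisible by 2^(y+1), the comparison result is unchanged:

```c
#include <stdint.h>

static int int_log2(uint32_t x) { return 31 - __builtin_clz(x | 1); }

int digit_count_shifted(uint32_t x) {
  // Thresholds 10^(y+1); for 32-bit inputs, y = (9 * int_log2(x)) >> 5 is
  // at most 8, so the largest threshold needed is 10^9.
  static const uint32_t pow10[] = {10,     100,     1000,     10000,
                                   100000, 1000000, 10000000, 100000000,
                                   1000000000};
  int y = (9 * int_log2(x)) >> 5;  // safe underestimate of floor(log10(x))
  // 10^(y+1) is divisible by 2^(y+1), so shifting both sides right by y
  // loses nothing relevant to the comparison:
  // floor(x / 2^y) >= 10^(y+1) / 2^y  <=>  x >= 10^(y+1).
  return y + 1 + ((x >> y) >= (pow10[y] >> y));
}
```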

1. jk-jeon says:

Okay, I actually tried George Spelvin’s idea and got this: https://godbolt.org/z/3h971c8K8.
So the difference between the two implementations is that for the “fused” table lookup, we have

```
shrx    eax, edi, eax
add     eax, DWORD PTR table[0+rdx*4]
shr     eax, 26
```

and for the normal method explained in Prof. Lemire’s previous blog post, we have

```
cmp     DWORD PTR table[0+rdx*4], edi
adc     eax, 1
```

I believe the first one might be very slightly faster, as the `adc` instruction is usually a bit slower than `add`. (But really, there should be no real difference.) I tested through quick-bench.com and it seems that GCC slightly favors the first one and Clang slightly favors the second one, but I think this is unfair, since quick-bench.com does not let me specify the architecture flag; thus it generates `shr eax, cl` instead of `shrx eax, edi, eax`, which I believe is usually a bit slower.

2. KWillets says:

I was able to get the table down to 32 bit entries by using lg2/4 as the shift — table[lg2]<<lg2/4 gives the original 64-bit entry.

The arithmetic after that still has to be wider than the argument, i.e., 64-bit would need uint128_t. I’m trying to shift the argument right instead, but I have a few edge cases.

6. jk-jeon says:

FYI, here is an application of the idea presented in this post into the 64-bit case: https://godbolt.org/z/e76eEKKj3

1. Daniel Lemire says:
```#include <cassert>
#include <cstdint>
#include <cstddef>

constexpr int floor_log10_pow2(int e) noexcept {
return (e * 1262611) >> 22;
}

constexpr int ceil_log10_pow2(int e) noexcept {
return e == 0 ? 0 : floor_log10_pow2(e) + 1;
}

struct digit_count_table_holder_t {
std::uint64_t entry[64];
};

constexpr digit_count_table_holder_t generate_digit_count_table() {
digit_count_table_holder_t table{ {} };
constexpr std::uint64_t pow10[] = {
1ull,
10ull,
100ull,
1000ull,
1'0000ull,
10'0000ull,
100'0000ull,
1000'0000ull,
1'0000'0000ull,
10'0000'0000ull,
100'0000'0000ull,
1000'0000'0000ull,
1'0000'0000'0000ull,
10'0000'0000'0000ull,
100'0000'0000'0000ull,
1000'0000'0000'0000ull,
1'0000'0000'0000'0000ull,
10'0000'0000'0000'0000ull,
100'0000'0000'0000'0000ull,
1000'0000'0000'0000'0000ull
};

for (int i = 0; i < 64; ++i) {
auto const ub = std::uint64_t(ceil_log10_pow2(i));
assert(ub <= 19);
table.entry[i] = ((ub + 1) << 52) - (pow10[ub] >> (i / 4));
}

return table;
}

constexpr inline auto digit_count_table = generate_digit_count_table();

int floor_log2(std::uint64_t n) noexcept {
return 63 ^ __builtin_clzll(n);
}

int count_digit(std::uint64_t n) noexcept {
return int((digit_count_table.entry[floor_log2(n)] + (n >> (floor_log2(n) / 4))) >> 52);
}
```

7. Thomas Müller Graf says:

Just a crazy idea… As far as I understand, Intel CPUs have (limited) BCD support. I guess using it just to calculate the number of decimal digits is not all that fast (would need to be tested). But would “itoa” be faster that way?

1. Thomas Müller Graf says:

I just tested this approach, and on my machine it is terribly slow. But it works 🙂

```
int slow_digit_count(uint32_t x) {
  double d = x;
  uint64_t y = 0;
  asm("fbstp %0" : "=m" (y) : "t" (d) : "st");
  return (67 - __builtin_clzll(y | 1)) >> 2;
}
```

I doubt that such a “itoa” method would be fast.

1. Thomas Müller Graf says:

Just FYI, I implemented a simple version of this “use BCD opcode” idea in the itoa-benchmark, and this seems to be about 10 times slower than the fastest implementation of itoa. It probably is one of the shortest (in number of CPU instructions), but the slowest.

```
t_inline void uint32toa_bcd(uint64_t x, char* out) {
  double d = x;
  asm("fbstp %0" : "=m" (out) : "t" (d) : "st");
  uint64_t y;
  memcpy(&y, out, sizeof(y));
  int len = (67 - __builtin_clzll(y | 1)) >> 2;
  out[len] = 0;
  while (len > 0) {
    out[--len] = '0' + (y & 0xf);
    y >>= 4;
  }
}
```

8. KWillets says:

Another fun way to get approximate log10 is to put the increments in a bitmap (1 means log10 goes up in that range), and use CLZ and a shift to discard the bits above lg2. The popcount of the masked-off bits is the approximate log10 similar to 9/32*lg2.

I don’t think this is faster than multiply and shift; the only possible time difference is multiply vs. popcount.
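Here is my attempt at sketching that bitmap idea (an illustration under my own assumptions, not KWillets’ code): bit j of the bitmap is set when the interval [2^j, 2^(j+1)) contains a power of 10, and the popcount of the bits below int_log2(x) is a safe underestimate of log10(x), corrected by the usual threshold comparison:

```c
#include <stdint.h>

static int int_log2(uint32_t x) { return 31 - __builtin_clz(x | 1); }

int digit_count_popcount(uint32_t x) {
  // Bits 3, 6, 9, 13, 16, 19, 23, 26, 29: the int_log2 classes whose
  // ranges contain a power of 10 (10, 100, ..., 10^9).
  const uint32_t bitmap = (1u << 3) | (1u << 6) | (1u << 9) | (1u << 13) |
                          (1u << 16) | (1u << 19) | (1u << 23) |
                          (1u << 26) | (1u << 29);
  // Thresholds; the final 4294967295 entry ensures inputs already known
  // to have 10 digits are never incremented further.
  static const uint32_t table[] = {9,         99,      999,     9999,
                                   99999,     999999,  9999999, 99999999,
                                   999999999, 4294967295u};
  int y = __builtin_popcount(bitmap & ((1u << int_log2(x)) - 1));
  y += x > table[y];
  return y + 1;
}
```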

9. JB says:

I would observe that the definition of the lookup table is missing a “const” qualifier.
I would also want to look at where the table is placed in memory by the loader: if it’s in a DATA section, it is necessarily going to live in a separate Read-Only page, and “table[idx]” is a memory reference that might go through a page fault.

Since for 32 bits the table has only a few entries, changing the lookup table to a switch statement might in fact be faster (if only because the search in the table would in fact only reference memory that is already in I-cache). This would need to be benchmarked, obviously.

1. Daniel Lemire says:

> I would also want to look at where the table is placed in memory by
> the loader: if it’s in a DATA section, it is necessarily going to live
> in a separate Read-Only page, and “table[idx]” is a memory reference
> that might go through a page fault.

Do you know of an optimizing C compiler that would build the code in such a manner? That would seem incredibly wasteful.

> Since for 32 bits the table has only a few entries, changing the
> lookup table to a switch statement might in fact be faster (if only
> because the search in the table would in fact only reference memory

I expect most optimizing compilers to turn a 32-entry switch/case into a table lookup. There might be exceptions, evidently, but my experience has been that compilers are not shy about using tables to implement switch/case code.

10. camel-cdr says:

I found that a naive approach (without a lookup table) actually beats all of the more advanced ones presented here, see https://godbolt.org/z/5W4xfqjGj:

```
static inline int bh_ilog10_32_naive(uint32_t x) {
  return (x < 10 ? 1 : (x < 100 ? 2 : (x < 1000 ? 3 : (x < 10000 ? 4 :
         (x < 100000 ? 5 : (x < 1000000 ? 6 : (x < 10000000 ? 7 :
         (x < 100000000 ? 8 : (x < 1000000000 ? 9 : 10)))))))));
}
```

Result from local runs:

```
bh_ilog10_32_naive: 1.000000e+00 +/- 1e-05
bh_ilog10_32d:      1.452496e+00 +/- 5e-06
bh_ilog10_32b:      1.691123e+00 +/- 1e-05
bh_ilog10_32c:      2.195664e+00 +/- 3e-05
bh_ilog10_32a:      2.212707e+00 +/- 3e-05
```

I’m guessing that this is due to accessing memory being expensive. Maybe I’ve done something wrong with the benchmark, please correct me if something is wrong.

1. camel-cdr says:

Actually, I’ve put the relevant benchmark into a quick-bench version, which should be easier to understand than my benchmarking library:

https://www.quick-bench.com/q/QZh-vs3H77D7hDvJL8Vn3FJNPj4

This still has the same result.

1. camel-cdr says:

Ok, I think I get it now.
I generate high entropy random numbers and thus numbers are likely very large, which means that the branch predictor can be quite accurate.
If I change the random number generation to

```1 << (bench_hash64(i)&63); ```

the naive approach is now much slower.

I assume that the usual use case doesn’t call the function repeatedly, so the branch predictor will have a harder time, although this should probably be benchmarked with a real world use case.

1. Daniel Lemire says:

There are definitely cases where the number of digits is predictable, and in such cases a branchy approach will be faster. Your scenario is one of those.

11. Kas says:

I have this idea but can’t figure it out yet: I am trying to halve the table or remove it completely. It may be possible because, by shifting the index of the highest bit right by 2 (division by 4), we end up with a value less than 16. The problem we have is the repeated values (0, 4, 9, 14.). If there is a mathematical trick to remove (differentiate) these repeated values, we can either shrink the table to 15-16 items or remove it completely by building a comparison value from our new index. If a cheap trick can be found, like shifting the index by one and then subtracting again, it would be a table-less and branchless algorithm.

```
1 : 1 0 0
10 : 2 3 0
100 : 3 6 1
1000 : 4 9 2
10000 : 5 13 3
100000 : 6 16 4
1000000 : 7 19 4
10000000 : 8 23 5
100000000 : 9 26 6
1000000000 : 10 29 7
10000000000 : 11 33 8
100000000000 : 12 36 9
1000000000000 : 13 39 9
10000000000000 : 14 43 10
100000000000000 : 15 46 11
1000000000000000 : 16 49 12
10000000000000000 : 17 53 13
100000000000000000 : 18 56 14
100000
```

12. Kas says:

Found a cheap one, but with a multiplication. The last column is
(Log2(x) * 37) >> 7
and here is the result:

```
1 : 1 0 0 0
10 : 2 3 1 0
100 : 3 6 3 1
1000 : 4 9 4 2
10000 : 5 13 6 3
100000 : 6 16 8 4
1000000 : 7 19 9 5
10000000 : 8 23 11 6
100000000 : 9 26 13 7
1000000000 : 10 29 14 8
10000000000 : 11 33 16 9
100000000000 : 12 36 18 10
1000000000000 : 13 39 19 11
10000000000000 : 14 43 21 12
100000000000000 : 15 46 23 13
1000000000000000 : 16 49 24 14
10000000000000000 : 17 53 26 15
100000000000000000 : 18 56 28 16
1000000000000000000 : 19 59 29 17
9 : 1 3 0 0
99 : 2 6 1 1
999 : 3 9 2 2
9999 : 4 13 3 3
99999 : 5 16 4 4
999999 : 6 19 4 5
9999999 : 7 23 5 6
99999999 : 8 26 6 7
999999999 : 9 29 7 8
9999999999 : 10 33 8 9
99999999999 : 11 36 9 10
999999999999 : 12 39 9 11
9999999999999 : 13 43 10 12
99999999999999 : 14 46 11 13
999999999999999 : 15 49 12 14
9999999999999999 : 16 53 13 15
99999999999999999 : 17 56 14 16
999999999999999999 : 18 59 14 17
```

Now the table can have only 18 entries. The multiplication above can, I think, be replaced with a better approach, but that is not my main target. The target is now to remove the table by either of:
1) Can we calculate 10^n (for 100… or 999…) in a few cycles? Would it be worth removing the memory access?
2) The fact that the decimal 10 takes 4 bits means 10^n is always 3 or 4 bits away from 10^(n+1) and 10^(n-1), so we have 2 bits to maneuver between where x is equal to or bigger than 2^(Log10(x)) yet smaller than 2^(Log10(x)+1); this comes from x/10 < x/8. So by changing the simple compare-against-a-value into a compare against a rank of neighbors, we can guess the rank (length) of x.
3) Because we don’t need a fully proven mathematical theory for every n, all we need are magic numbers, just like the multiply by 37. So we need these 18 values to be calculated or approximated to fit one single profile of usage. Looking at the bits of these values (10, 100, … and 9, 99, 999, …), we see a nice pattern where all the bits after the highest one are bigger than in the previous value, except for one single case: the one between 15-16 in the 99xx… table, or 16-17 in the 10xx… table. Now, trying to apply the logic of case (2) above, the following is true for all except that one case: x − 2^(Log2(x)) > 10^(Log10(x)−1). Is there a way to use this?

1. Daniel Lemire says:

1. Andriy says: