Computing the number of digits of an integer even faster

I my previous blog post, I documented how one might proceed to compute the number of digits of an integer quickly. E.g., given the integer 999, you want 3 but given the integer 1000, you want 4. It is effectively the integer logarithm in base 10.

On computers, you can quickly compute the integer logarithm in base 2, and it follows that you can move from one to the other rather quickly. You just need a correction which you can implement with a table. A very good solution found in references such as Hacker’s Delight is as follows:

    static uint32_t table[] = {9, 99, 999, 9999, 99999,
    999999, 9999999, 99999999, 999999999};
    int y = (9 * int_log2(x)) >> 5;
    y += x > table[y];
    return y + 1;

Except for the computation of the integer logarithm, it involves a multiplication by 9, a shift, a conditional move, a table lookup and an increment. Can you do even better? You might! Kendall Willets found an even more economical solution.

int fast_digit_count(uint32_t x) {
  static uint64_t table[] = {
      4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
      12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
      21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
      25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
      34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
      38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
      42949672960, 42949672960};
  return (x + table[int_log2(x)]) >> 32;
}

If I omit the computation of the integer logarithm in base 2, it requires just a table lookup, an addition and a shift:

add     rax, qword ptr [8*rcx + table]
shr     rax, 32

The table contains the numbers ceil(log10(2j)) * 232 + 232 – 10ceil(log10(2j)) for j from 2 to 30, and then just ceil(log10(2j)) for j = 31 and j = 32. The first value is 232 .

My implementation of Kendall’s solution is available.

Using modern C++, you can compute the table using constant expressions.

Further reading: Josh Bleecher Snyder has a blog post on this topic which tells the whole story.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

28 thoughts on “Computing the number of digits of an integer even faster”

  1. Let me give Josh Bleecher Snyder credit for the idea that int_log2 splits the input into ranges where log10 can only go up by 1 at most. Within each such interval the problem is reduced to comparing against a table-derived threshold and incrementing if greater.

    1. Just a quick idea: wouldn’t zeroing three bottom bits of int_log2 result and indexing array (without the multiplier 8) also create a possibility of a smaller table? Sure this would increase latency by one clock cycle, but just in case the size of the table annoys somebody…

      1. Err, not quite. It would probably be meaningful to try out those ideas before writing a comment…

          1. The comment in question is “As a tiny optimisation to make the tables three entries smaller: within int_log2, replace |1 with |8 – although then relying on C compiler to rewrite table[int_log2(x)-3] as (table-3)[int_log2(x)] and form table-3 as a constant.”

  2. I still prefer the first solution because it extends to 64 bits naturally. The second solution can be considered a variant of the first where:
    * The high half of the 9*bits/32 approximation is replaced by a table lookup,
    * The threshold table lookup is indexed by number of bits, not number of digits (making it larger by a factor of log2(10) = 3.32), and
    * The threshold comparison is done in the (semantically equivalent) form of an add with carry to the low half of the table.

    Bloating the table by a factor or 6.64 (for 32 bits, from 114 = 44 bytes to 328 = 256 bytes, or +3 cache lines) doesn’t seem like a good use of L1 cache, especially as any digit-counting function is going to be part of some much larger output formatting code and not an inner loop by itself.

    There’s one more trick that hasn’t shown up yet and might be useful. Given a safe underestimate such as
    y = 9*int_log2(x) / 32;
    then x >> y can be used to compute the number of digits without losing accuracy; the low digit_count(x) bits of x don’t affect the result (because 10 is a multiple of 2).

    This might enable people to fit the necessary low and high parts of each table entry into a single word.

    1. Okay, I actually tried George Spelvin’s idea and got this: https://godbolt.org/z/3h971c8K8.
      So difference between two implementations are that for the “fused” table lookup, we have

      shrx eax, edi, eax
      add eax, DWORD PTR table[0+rdx*4]
      shr eax, 26

      and for the normal method explained in Prof. Lemire’s previous blog post, we have

      cmp DWORD PTR table[0+rdx*4], edi
      adc eax, 1

      I believe the first one might be very slightly faster as adc instruction is usually a bit slower than add. (But really, there should be no real difference.) I tested through quick-bench.com and it seems that GCC slightly favors the first one and Clang slightly favors the second one, but I think this is unfair, since quick-bench.com does not let me to specify the architecture flag, thus it generates shr eax, cl instead of shrx eax, edi, eax, which I believe is usually a bit slower.

    2. I was able to get the table down to 32 bit entries by using lg2/4 as the shift — table[lg2]<<lg2/4 gives the original 64-bit entry.

      The arithmetic after that still has to be wider than the argument, ie 64-bit would need uint_128t. I’m trying to shift the argument right instead but I have a few edge cases.

    1. #include <cassert>
      #include <cstdint>
      #include <cstddef>
      
      constexpr int floor_log10_pow2(int e) noexcept {
          return (e * 1262611) >> 22;
      }
      
      constexpr int ceil_log10_pow2(int e) noexcept {
          return e == 0 ? 0 : floor_log10_pow2(e) + 1;
      }
      
      struct digit_count_table_holder_t {
        std::uint64_t entry[64];
      };
      
      constexpr digit_count_table_holder_t generate_digit_count_table() {
          digit_count_table_holder_t table{ {} };
          constexpr std::uint64_t pow10[] = {
              1ull,
              10ull,
              100ull,
              1000ull,
              1'0000ull,
              10'0000ull,
              100'0000ull,
              1000'0000ull,
              1'0000'0000ull,
              10'0000'0000ull,
              100'0000'0000ull,
              1000'0000'0000ull,
              1'0000'0000'0000ull,
              10'0000'0000'0000ull,
              100'0000'0000'0000ull,
              1000'0000'0000'0000ull,
              1'0000'0000'0000'0000ull,
              10'0000'0000'0000'0000ull,
              100'0000'0000'0000'0000ull,
              1000'0000'0000'0000'0000ull
          };
      
          for (int i = 0; i < 64; ++i) {
              auto const ub = std::uint64_t(ceil_log10_pow2(i));
              assert(ub <= 19);
              table.entry[i] = ((ub + 1) << 52) - (pow10[ub] >> (i / 4));
          }
      
          return table;
      }
      
      constexpr inline auto digit_count_table = generate_digit_count_table();
      
      int floor_log2(std::uint64_t n) noexcept {
          return 63 ^ __builtin_clzll(n);
      }
      
      int count_digit(std::uint64_t n) noexcept {
          return int((digit_count_table.entry[floor_log2(n)] + (n >> (floor_log2(n) / 4))) >> 52);
      }
      

    1. I just tested this approach, and on my machine it is terribly slow. But it works 🙂

      int slow_digit_count(uint32_t x) {
      double d = x;
      uint64_t y = 0;
      asm("fbstp %0" : "=m" (y) : "t" (d) : "st");
      return (67 - __builtin_clzll(y | 1)) >> 2;
      }

      I doubt that such a “itoa” method would be fast.

      1. Just FYI, I implemented a simple version of this “use BCD opcode” idea in the itoa-benchmark, and this seems to be about 10 times slower than the fastest implementation of itoa. It probably is one of the shortest (in number of CPU instructions), but the slowest.

        t_inline void uint32toa_bcd(uint64_t x, char* out) {
        double d = x;
        asm("fbstp %0" : "=m" (out[0]) : "t" (d) : "st");
        uint64_t y;
        memcpy(&y, out, sizeof(y));
        int len = (67 - __builtin_clzll(y | 1)) >> 2;
        out[len] = 0;
        while (len > 0) {
        out[--len] = '0' + (y & 0xf);
        y >>= 4;
        }
        }

  3. Another fun way to get approximate log10 is to put the increments in a bitmap (1 means log10 goes up in that range), and use CLZ and a shift to discard the bits above lg2. The popcount of the masked-off bits is the approximate log10 similar to 9/32*lg2.

    I don’t think this is faster than multiply and shift; the only possible time difference is multiply vs. popcount.

  4. I would observe that the definition of the lookup table is missing a “const” qualifier.
    I would also want to look at where the table is placed in memory by the loader: if it’s in a DATA section, it is necessarily going to live in a separate Read-Only page, and “table[idx]” is a memory reference that might go through a page fault.

    Since for 32 bits the table has only a few entries, changing the lookup table to a switch statement might in fact be faster (if only because the search in the table would in fact only reference memory that is already in I-cache). This would need to be benchmarked, obviously.

    1. I would also want to look at where the table is placed in memory by
      the loader: if it’s in a DATA section, it is necessarily going to live
      in a separate Read-Only page, and “table[idx]” is a memory reference
      that might go through a page fault.

      Do you know of an optimizing C compiler that would build the code in such a manner? That would seem incredibly wasteful.

      Since for 32 bits the table has only a few entries, changing the
      lookup table to a switch statement might in fact be faster (if only
      because the search in the table would in fact only reference memory
      that is already in I-cache).

      I expect most optimizing compilers to turn a 32-entry switch/case into a table lookup. There might be exceptions, evidently, but my experience has been that there compilers are not shy about using tables to implement switch/case code.

  5. I found that a naive approach (without a lookup table) actually beats all of the more advanced ones presented here, see https://godbolt.org/z/5W4xfqjGj:

    static inline int
    bh_ilog10_32_naive(uint32_t x)
    {
    return (x < 10 ? 1 :
    (x < 100 ? 2 :
    (x < 1000 ? 3 :
    (x < 10000 ? 4 :
    (x < 100000 ? 5 :
    (x < 1000000 ? 6 :
    (x < 10000000 ? 7 :
    (x < 100000000 ? 8 :
    (x < 1000000000 ? 9 :
    10)))))))));
    }

    Result from local runs:

    bh_ilog10_32_naive: 1.000000e+00 +/- 1e-05
    bh_ilog10_32d: 1.452496e+00 +/- 5e-06
    bh_ilog10_32b: 1.691123e+00 +/- 1e-05
    bh_ilog10_32c: 2.195664e+00 +/- 3e-05
    bh_ilog10_32a: 2.212707e+00 +/- 3e-05

    I’m guessing that this is due to accessing memory being expensive. Maybe I’ve done something wrong with the benchmark, please correct me if something is wrong.

      1. Ok, I think I get it now.
        I generate high entropy random numbers and thus numbers are likely very large, which means that the branch predictor can be quite accurate.
        If I change the random number generation to

        1 << (bench_hash64(i)&63);

        the naive approach is now much slower.

        I assume that the usual use case doesn’t call the function repeatedly, so the branch predictor will have a harder time, although this should probably be benchmarked with a real world use case.

  6. I have this idea but can’t figure it out yet, i am trying to halve the table or completely removing it, will be possible because by shifting the index of highest bit to the right by 2 (division by 4) we end up with value less than 16, the problem is we have is the repeated values (0,4,9,14.), if there is a mathematical trick to remove (differentiate) these repeated values, we can either minimize the table into 15-16 items or remove the table completely by building a value from our new index and then compare, if a cheap trick can be found then it will be table-less and branchless algorithm, a cheap trick like shift the index by one then subtract again

    1 : 1 0 0
    10 : 2 3 0
    100 : 3 6 1
    1000 : 4 9 2
    10000 : 5 13 3
    100000 : 6 16 4
    1000000 : 7 19 4
    10000000 : 8 23 5
    100000000 : 9 26 6
    1000000000 : 10 29 7
    10000000000 : 11 33 8
    100000000000 : 12 36 9
    1000000000000 : 13 39 9
    10000000000000 : 14 43 10
    100000000000000 : 15 46 11
    1000000000000000 : 16 49 12
    10000000000000000 : 17 53 13
    100000000000000000 : 18 56 14
    100000

  7. Found a cheap one but with multiplication, the last column is
    Log2(x) * 37 >> 7
    and here the result

    1 : 1 0 0 0
    10 : 2 3 1 0
    100 : 3 6 3 1
    1000 : 4 9 4 2
    10000 : 5 13 6 3
    100000 : 6 16 8 4
    1000000 : 7 19 9 5
    10000000 : 8 23 11 6
    100000000 : 9 26 13 7
    1000000000 : 10 29 14 8
    10000000000 : 11 33 16 9
    100000000000 : 12 36 18 10
    1000000000000 : 13 39 19 11
    10000000000000 : 14 43 21 12
    100000000000000 : 15 46 23 13
    1000000000000000 : 16 49 24 14
    10000000000000000 : 17 53 26 15
    100000000000000000 : 18 56 28 16
    1000000000000000000 : 19 59 29 17
    9 : 1 3 0 0
    99 : 2 6 1 1
    999 : 3 9 2 2
    9999 : 4 13 3 3
    99999 : 5 16 4 4
    999999 : 6 19 4 5
    9999999 : 7 23 5 6
    99999999 : 8 26 6 7
    999999999 : 9 29 7 8
    9999999999 : 10 33 8 9
    99999999999 : 11 36 9 10
    999999999999 : 12 39 9 11
    9999999999999 : 13 43 10 12
    99999999999999 : 14 46 11 13
    999999999999999 : 15 49 12 14
    9999999999999999 : 16 53 13 15
    99999999999999999 : 17 56 14 16
    999999999999999999 : 18 59 14 17

    Now the table can be only 18 entry, the multiplication above i think can be replaced with better approach, but it is not my main target, the target is to remove the table now by either of :
    1) Can we calculate 10^n (for 100… or 999…) with few cycles, will it be worth removing memory access.
    2) The fact of the decimal 10 is 4 bits means 10^n always is far by 3 or 4 bits from 10^(n+1) and 10^(n-1), means we have 2 bits to maneuverer between where x is equal or bigger 2^(Log10(x)) yet is is smaller then 2^(Log10(x)+1), this is coming from x/10 < x/8, so by changing a the simple compare against value to compare against a rank of neighbors, can guess the length the rank of x.
    3) Because we don’t need full proven mathematical theory to for any n all what we need is magical numbers just like the mult by 37, so we need these 18 values to be calculated or approximated to fix one single profile of usage, looking at the bits of these values (10,100… and 9,99,999…) we see a nice pattern where all bits after than the highest one is bigger than previous value except for one single case the one between 15-16 in 99xx.. table or 16-17 in 10xx.. ,now trying to apply the logic like the above case (2), the following is true for all except that one case x-2^(Log2(x)) > 10^(Log10(x)-1), is there a way to use this ?

      1. Here is a table with length == 65 that used to calculate digit count for up to 58-bit numbers using the following function:

        def f(x: Long): Int = (offsets(java.lang.Long.numberOfLeadingZeros(x)) + x >> 58).toInt

Leave a Reply to JB Cancel reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax