Converting integers to fixed-digit representations quickly

It is tricky to convert integers into strings because the number of characters varies with the magnitude of the integer: the integer 1 requires a single character whereas the integer 100 requires three characters. So a general solution may need a hard-to-predict branch.

Let us simplify the problem.

Imagine that you want to serialize integers to fixed-digit strings. For instance, you may want to convert 16-digit integers (integers smaller than 10000000000000000) to exactly 16 characters, including leading zeros if needed. In this manner, it is easy to write code that contains only trivial branches.

The simplest approach could be a character-by-character routine where I use the fact that the character ‘0’ in ASCII is just 0x30 (in hexadecimal):

void to_string_backlinear(uint64_t x, char *out) {
    for(int z = 0; z < 16; z++) {
        out[15-z] = (x % 10) + 0x30;
        x /= 10;
    }
}

It is somewhat strange to write the characters backward, starting from the least significant digit. You can try to go forward instead, but it is a bit trickier. Here is one ugly approach that is probably not efficient:

void to_string_linear(uint64_t x, char *out) {
  out[0] = x / 1000000000000000 + 0x30;
  x %= 1000000000000000;
  out[1] = x / 100000000000000 + 0x30;
  x %= 100000000000000;
  out[2] = x / 10000000000000 + 0x30;
  x %= 10000000000000;
  out[3] = x / 1000000000000 + 0x30;
  x %= 1000000000000;
  out[4] = x / 100000000000 + 0x30;
  x %= 100000000000;
  out[5] = x / 10000000000 + 0x30;
  x %= 10000000000;
  out[6] = x / 1000000000 + 0x30;
  x %= 1000000000;
  out[7] = x / 100000000 + 0x30;
  x %= 100000000;
  out[8] = x / 10000000 + 0x30;
  x %= 10000000;
  out[9] = x / 1000000 + 0x30;
  x %= 1000000;
  out[10] = x / 100000 + 0x30;
  x %= 100000;
  out[11] = x / 10000 + 0x30;
  x %= 10000;
  out[12] = x / 1000 + 0x30;
  x %= 1000;
  out[13] = x / 100 + 0x30;
  x %= 100;
  out[14] = x / 10 + 0x30;
  x %= 10;
  out[15] = x + 0x30;
}

Instead we could proceed in a tree-like manner to reduce data dependencies during the computation and hope for more instruction-level parallelism. We first divide the integer by 100000000 to compute the first and last 8 digits separately, and so forth. This should drastically decrease data dependencies:

void to_string_tree(uint64_t x, char *out) {
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;      
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;
  uint64_t toptoptop = toptop / 100;
  uint64_t toptopbottom = toptop % 100;
  uint64_t topbottomtop = topbottom / 100;
  uint64_t topbottombottom = topbottom % 100;
  uint64_t bottomtoptop = bottomtop / 100;
  uint64_t bottomtopbottom = bottomtop % 100;
  uint64_t bottombottomtop = bottombottom / 100;
  uint64_t bottombottombottom = bottombottom % 100;
  //
  out[0] = toptoptop / 10 + 0x30;
  out[1] = toptoptop % 10 + 0x30;
  out[2] = toptopbottom / 10 + 0x30;
  out[3] = toptopbottom % 10 + 0x30;
  out[4] = topbottomtop / 10 + 0x30;
  out[5] = topbottomtop % 10 + 0x30;
  out[6] = topbottombottom / 10 + 0x30;
  out[7] = topbottombottom % 10 + 0x30;
  out[8] = bottomtoptop / 10 + 0x30;
  out[9] = bottomtoptop % 10 + 0x30;
  out[10] = bottomtopbottom / 10 + 0x30;
  out[11] = bottomtopbottom % 10 + 0x30;
  out[12] = bottombottomtop / 10 + 0x30;
  out[13] = bottombottomtop % 10 + 0x30;
  out[14] = bottombottombottom / 10 + 0x30;
  out[15] = bottombottombottom % 10 + 0x30;
}

We could also try to accelerate the computation with table lookups. We want to keep the tables small. We can effectively handle the tail end of the tree-based technique by looking up the conversion of every integer smaller than 100: the integer 12 becomes the 2-character string ’12’ and so forth (my code could be nicer):

void to_string_tree_table(uint64_t x, char *out) {
  static const char table[200] = {
      0x30, 0x30, 0x30, 0x31, 0x30, 0x32, 0x30, 0x33, 0x30, 0x34, 0x30, 0x35,
      0x30, 0x36, 0x30, 0x37, 0x30, 0x38, 0x30, 0x39, 0x31, 0x30, 0x31, 0x31,
      0x31, 0x32, 0x31, 0x33, 0x31, 0x34, 0x31, 0x35, 0x31, 0x36, 0x31, 0x37,
      0x31, 0x38, 0x31, 0x39, 0x32, 0x30, 0x32, 0x31, 0x32, 0x32, 0x32, 0x33,
      0x32, 0x34, 0x32, 0x35, 0x32, 0x36, 0x32, 0x37, 0x32, 0x38, 0x32, 0x39,
      0x33, 0x30, 0x33, 0x31, 0x33, 0x32, 0x33, 0x33, 0x33, 0x34, 0x33, 0x35,
      0x33, 0x36, 0x33, 0x37, 0x33, 0x38, 0x33, 0x39, 0x34, 0x30, 0x34, 0x31,
      0x34, 0x32, 0x34, 0x33, 0x34, 0x34, 0x34, 0x35, 0x34, 0x36, 0x34, 0x37,
      0x34, 0x38, 0x34, 0x39, 0x35, 0x30, 0x35, 0x31, 0x35, 0x32, 0x35, 0x33,
      0x35, 0x34, 0x35, 0x35, 0x35, 0x36, 0x35, 0x37, 0x35, 0x38, 0x35, 0x39,
      0x36, 0x30, 0x36, 0x31, 0x36, 0x32, 0x36, 0x33, 0x36, 0x34, 0x36, 0x35,
      0x36, 0x36, 0x36, 0x37, 0x36, 0x38, 0x36, 0x39, 0x37, 0x30, 0x37, 0x31,
      0x37, 0x32, 0x37, 0x33, 0x37, 0x34, 0x37, 0x35, 0x37, 0x36, 0x37, 0x37,
      0x37, 0x38, 0x37, 0x39, 0x38, 0x30, 0x38, 0x31, 0x38, 0x32, 0x38, 0x33,
      0x38, 0x34, 0x38, 0x35, 0x38, 0x36, 0x38, 0x37, 0x38, 0x38, 0x38, 0x39,
      0x39, 0x30, 0x39, 0x31, 0x39, 0x32, 0x39, 0x33, 0x39, 0x34, 0x39, 0x35,
      0x39, 0x36, 0x39, 0x37, 0x39, 0x38, 0x39, 0x39,
  };
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;
  uint64_t toptoptop = toptop / 100;
  uint64_t toptopbottom = toptop % 100;
  uint64_t topbottomtop = topbottom / 100;
  uint64_t topbottombottom = topbottom % 100;
  uint64_t bottomtoptop = bottomtop / 100;
  uint64_t bottomtopbottom = bottomtop % 100;
  uint64_t bottombottomtop = bottombottom / 100;
  uint64_t bottombottombottom = bottombottom % 100;
  //
  memcpy(out, &table[2 * toptoptop], 2);
  memcpy(out + 2, &table[2 * toptopbottom], 2);
  memcpy(out + 4, &table[2 * topbottomtop], 2);
  memcpy(out + 6, &table[2 * topbottombottom], 2);
  memcpy(out + 8, &table[2 * bottomtoptop], 2);
  memcpy(out + 10, &table[2 * bottomtopbottom], 2);
  memcpy(out + 12, &table[2 * bottombottomtop], 2);
  memcpy(out + 14, &table[2 * bottombottombottom], 2);
}

You can extend this trick if you are willing to include a 40kB table in your code:

void to_string_tree_bigtable(uint64_t x, char *out) {
  #include "bigtable.h"

  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  //
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;

  memcpy(out, &bigtable[4 * toptop], 4);
  memcpy(out + 4, &bigtable[4 * topbottom], 4);
  memcpy(out + 8, &bigtable[4 * bottomtop], 4);
  memcpy(out + 12, &bigtable[4 * bottombottom], 4);
}

An intermediate solution with a 3-character table would require only a 3kB table. I also consider Muła’s SIMD-based approach, though I refer you to his article for the details. Effectively, Muła uses fancy Intel-specific instructions to get the job done.

If you cannot use SIMD instructions, you can use a related technique called SWAR (SIMD within a register). Effectively, you pack several integer values inside a wide integer (64 bits) and operate on them all at once to save instructions. Luckily, Khuong has a solution for us:

// credit: Paul Khuong
uint64_t encode_ten_thousands(uint64_t hi, uint64_t lo) {
  uint64_t merged = hi | (lo << 32);
  uint64_t top = ((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
  uint64_t bot = merged - 100ULL * top;
  uint64_t hundreds;
  uint64_t tens;
  hundreds = (bot << 16) + top;
  tens = (hundreds * 103ULL) >> 10;
  tens &= (0xFULL << 48) | (0xFULL << 32) | (0xFULL << 16) | 0xFULL;
  tens += (hundreds - 10ULL * tens) << 8;

  return tens;
}

void to_string_khuong(uint64_t x, char *out) {
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t first =
      0x3030303030303030 + encode_ten_thousands(top / 10000, top % 10000);
  memcpy(out, &first, sizeof(first));
  uint64_t second =
      0x3030303030303030 + encode_ten_thousands(bottom / 10000, bottom % 10000);
  memcpy(out + 8, &second, sizeof(second));
}

I refer you to Khuong’s blog post for a description.

I wrote a small benchmark in C++ which measures the time per integer. Remember that every call to my functions produces exactly 16 digits.

function         Apple M1, LLVM 12    AMD Zen 2, GCC 10
linear           14 ns                25 ns
backward linear  7.7 ns               18 ns
tree             6.9 ns               15 ns
Khuong           3.3 ns               8.0 ns
small table      3.7 ns               7.1 ns
SIMD             unavailable          4.8 ns
big table        1.5 ns               2.9 ns

On both processors, the crazy big-table (40kB) approach is about twice as fast as the version with a small table. Though a big table can be justified in some instances, my feeling is that only niche applications could accept such a large table for such a narrow task. Even the smaller 3kB table seems like overkill given the good results we get with the small table.

The SIMD approach has a rather minimal gain compared to the version with a small table (merely 25%).

At a glance, the small table wins on practical grounds: it is small, simple and portable.


Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

11 thoughts on “Converting integers to fixed-digit representations quickly”

  1. Another nifty technique which allows 64-bit conversion without 64-bit arithmetic is Douglas W. Jones’ technique described at http://homepage.divms.uiowa.edu/~jones/bcd/decimal.html and implemented in e.g. https://elixir.bootlin.com/linux/latest/source/lib/vsprintf.c#L325.

    The Linux code also implements division-by-constant using multiplication manually. While most compilers these days know how to optimize divide by constant to a multiply and shift, they usually can’t infer the limited ranges of the inputs which allows smaller multipliers and no fixups.

  2. 16-bit numbers need only one multiply per digit?

    void lulz_atoi(char* str, uint16_t val) {
      uint64_t lo = val;
      uint64_t hi;

      __uint128_t x = (__uint128_t)lo * ((0xFFFFFFFFFFFFFFFFULL / 10000) + 1);
      hi = x >> 64;
      lo = (uint64_t)x;

      str[0] = hi + 0x30;
      for (int i = 1; i < 5; i++) {
        x = (__uint128_t)lo * 10;
        hi = x >> 64;
        lo = (uint64_t)x;
        str[i] = hi + 0x30;
      }
      str[5] = 0;
    }

      1. Updated gist to 64-bits. I’ve not checked the generated assembly. Not benchmarked against the other implementations because a uint64_t should have 20 decimal digits…

  3. ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.

    # make
    c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp
    # ./convert
    khuong 7.15067
    backlinear 30.8381
    linear 21.6466
    tree 14.2078
    treetst 10.0964
    treest 6.32523
    treebt 2.30575
    sse2 4.81692
    sse2(2) 4.70681
    avx2 2.00375

    khuong 7.15235
    backlinear 30.8286
    linear 21.6496
    tree 14.2603
    treetst 10.0969
    treest 6.32516
    treebt 2.30584
    sse2 4.81617
    sse2(2) 4.70639
    avx2 2.00354

    khuong 7.15085
    backlinear 30.8319
    linear 21.6403
    tree 14.2665
    treetst 10.0935
    treest 6.32525
    treebt 2.30579
    sse2 4.81627
    sse2(2) 4.70653
    avx2 2.00359

  4. Are approaches that only work for streams of many integers of interest? I’m calculating a base-10 checksum of all 10 digits of a u32 in <1ns per u32 with AVX2. This includes an intermediate step in which the length of the encoded string (without leading zeros) is calculated, but of course it does not include outputting the string.

    The layout of the digits within the vectors is pretty bad for conversion-to-string purposes: the digits for a single integer are spread out between 2 adjacent bytes of 5 different vectors. It seems like it would not be too expensive to fix the layout to be friendlier for outputting, especially if the task was just to output u32 of at most 8 digits or u64 of at most 16 digits. I don't know a clever way to output u64 of 20 digits though; I would just have to add another scalar modmul at the beginning to split the u64 into groups of (4,8,8) digits.

    1. To compute the length, if you imagine you have the groups of 4 digits in 4 vectors, you can take max(andnot(digits == 0, digitcount))+1 to get the length where digitcount is a constant:
      [3,2,1,0]
      [7,6,5,4]
      [11,10,9,8]
      [15,14,13,12]
      and max() is actually a sequence of 3 max, shuffle, max, shuffle, max. After that you can use arithmetic to calculate the proper shuffle to move the digits into the front of the vector so you can output a single formatted integer. You end up with a bunch of data dependencies but no branches. The data dependencies will surely rain on your parade though.
