It is tricky to convert integers into strings because the number of characters can vary according to the amplitude of the integer. The integer ‘1’ requires a single character whereas the integer ‘100’ requires three characters. So a solution might possible need a hard-to-predict branch.

Let us simplify the problem.

Imagine that you want to serialize integers to fixed-digit strings. Thus you may want to convert 16-digit integers (up to 10000000000000000) to exactly 16 digits, including leading zeros if needed. In this manner, it is easy to write code that contains only trivial branches.

The simplest approach could be a character-by-character routine where I use the fact that the character ‘0’ in ASCII is just 0x30 (in hexadecimal):

void to_string_backlinear(uint64_t x, char *out) { for(int z = 0; z < 16; z++) { out[15-z] = (x % 10) + 0x30; x /= 10; } }

It is somewhat strange to write the characters backward, starting from the less significant digit. You can try to go forward, but it is a bit trickier. Here is one ugly approach that is probably not efficient:

void to_string_linear(uint64_t x, char *out) { out[0] = x / 1000000000000000 + 0x30; x %= 1000000000000000; out[1] = x / 100000000000000 + 0x30; x %= 100000000000000; out[2] = x / 10000000000000 + 0x30; x %= 10000000000000; out[3] = x / 1000000000000 + 0x30; x %= 1000000000000; out[4] = x / 100000000000 + 0x30; x %= 100000000000; out[5] = x / 10000000000 + 0x30; x %= 10000000000; out[6] = x / 1000000000 + 0x30; x %= 1000000000; out[7] = x / 100000000 + 0x30; x %= 100000000; out[8] = x / 10000000 + 0x30; x %= 10000000; out[9] = x / 1000000 + 0x30; x %= 1000000; out[10] = x / 100000 + 0x30; x %= 100000; out[11] = x / 10000 + 0x30; x %= 10000; out[12] = x / 1000 + 0x30; x %= 1000; out[13] = x / 100 + 0x30; x %= 100; out[14] = x / 10 + 0x30; x %= 10; out[15] = x + 0x30; }

Instead we could try to do it in a tree-like manner to reduce data dependency during the computation and hope for more core-level parallelism. We first divide the integer 100000000 to compute the first and last 8 digits separately, and so forth. It should drastically decrease data dependencies:

void to_string_tree(uint64_t x, char *out) { uint64_t top = x / 100000000; uint64_t bottom = x % 100000000; uint64_t toptop = top / 10000; uint64_t topbottom = top % 10000; uint64_t bottomtop = bottom / 10000; uint64_t bottombottom = bottom % 10000; uint64_t toptoptop = toptop / 100; uint64_t toptopbottom = toptop % 100; uint64_t topbottomtop = topbottom / 100; uint64_t topbottombottom = topbottom % 100; uint64_t bottomtoptop = bottomtop / 100; uint64_t bottomtopbottom = bottomtop % 100; uint64_t bottombottomtop = bottombottom / 100; uint64_t bottombottombottom = bottombottom % 100; // out[0] = toptoptop / 10 + 0x30; out[1] = toptoptop % 10 + 0x30; out[2] = toptopbottom / 10 + 0x30; out[3] = toptopbottom % 10 + 0x30; out[4] = topbottomtop / 10 + 0x30; out[5] = topbottomtop % 10 + 0x30; out[6] = topbottombottom / 10 + 0x30; out[7] = topbottombottom % 10 + 0x30; out[8] = bottomtoptop / 10 + 0x30; out[9] = bottomtoptop % 10 + 0x30; out[10] = bottomtopbottom / 10 + 0x30; out[11] = bottomtopbottom % 10 + 0x30; out[12] = bottombottomtop / 10 + 0x30; out[13] = bottombottomtop % 10 + 0x30; out[14] = bottombottombottom / 10 + 0x30; out[15] = bottombottombottom % 10 + 0x30; }

We could also try to accelerate the computation with table lookups. We want to keep the tables small. We can effectively process the tail end of the tree-based technique by looking up small integers smaller than 100 by looking up their conversion: the integer 12 becomes the 2-character string ’12’ and so forth (my code could be nicer):

void to_string_tree_table(uint64_t x, char *out) { static const char table[200] = { 0x30, 0x30, 0x30, 0x31, 0x30, 0x32, 0x30, 0x33, 0x30, 0x34, 0x30, 0x35, 0x30, 0x36, 0x30, 0x37, 0x30, 0x38, 0x30, 0x39, 0x31, 0x30, 0x31, 0x31, 0x31, 0x32, 0x31, 0x33, 0x31, 0x34, 0x31, 0x35, 0x31, 0x36, 0x31, 0x37, 0x31, 0x38, 0x31, 0x39, 0x32, 0x30, 0x32, 0x31, 0x32, 0x32, 0x32, 0x33, 0x32, 0x34, 0x32, 0x35, 0x32, 0x36, 0x32, 0x37, 0x32, 0x38, 0x32, 0x39, 0x33, 0x30, 0x33, 0x31, 0x33, 0x32, 0x33, 0x33, 0x33, 0x34, 0x33, 0x35, 0x33, 0x36, 0x33, 0x37, 0x33, 0x38, 0x33, 0x39, 0x34, 0x30, 0x34, 0x31, 0x34, 0x32, 0x34, 0x33, 0x34, 0x34, 0x34, 0x35, 0x34, 0x36, 0x34, 0x37, 0x34, 0x38, 0x34, 0x39, 0x35, 0x30, 0x35, 0x31, 0x35, 0x32, 0x35, 0x33, 0x35, 0x34, 0x35, 0x35, 0x35, 0x36, 0x35, 0x37, 0x35, 0x38, 0x35, 0x39, 0x36, 0x30, 0x36, 0x31, 0x36, 0x32, 0x36, 0x33, 0x36, 0x34, 0x36, 0x35, 0x36, 0x36, 0x36, 0x37, 0x36, 0x38, 0x36, 0x39, 0x37, 0x30, 0x37, 0x31, 0x37, 0x32, 0x37, 0x33, 0x37, 0x34, 0x37, 0x35, 0x37, 0x36, 0x37, 0x37, 0x37, 0x38, 0x37, 0x39, 0x38, 0x30, 0x38, 0x31, 0x38, 0x32, 0x38, 0x33, 0x38, 0x34, 0x38, 0x35, 0x38, 0x36, 0x38, 0x37, 0x38, 0x38, 0x38, 0x39, 0x39, 0x30, 0x39, 0x31, 0x39, 0x32, 0x39, 0x33, 0x39, 0x34, 0x39, 0x35, 0x39, 0x36, 0x39, 0x37, 0x39, 0x38, 0x39, 0x39, }; uint64_t top = x / 100000000; uint64_t bottom = x % 100000000; uint64_t toptop = top / 10000; uint64_t topbottom = top % 10000; uint64_t bottomtop = bottom / 10000; uint64_t bottombottom = bottom % 10000; uint64_t toptoptop = toptop / 100; uint64_t toptopbottom = toptop % 100; uint64_t topbottomtop = topbottom / 100; uint64_t topbottombottom = topbottom % 100; uint64_t bottomtoptop = bottomtop / 100; uint64_t bottomtopbottom = bottomtop % 100; uint64_t bottombottomtop = bottombottom / 100; uint64_t bottombottombottom = bottombottom % 100; // memcpy(out, &table[2 * toptoptop], 2); memcpy(out + 2, &table[2 * toptopbottom], 2); memcpy(out + 4, &table[2 * topbottomtop], 2); memcpy(out + 6, &table[2 * topbottombottom], 2); memcpy(out + 8, &table[2 * bottomtoptop], 2); memcpy(out + 10, &table[2 * bottomtopbottom], 2); memcpy(out + 12, &table[2 * bottombottomtop], 2); memcpy(out + 14, &table[2 * bottombottombottom], 2); }

You can extend this trick if you are willing to include a 40kB table in your code:

void to_string_tree_bigtable(uint64_t x, char *out) { #include "bigtable.h" uint64_t top = x / 100000000; uint64_t bottom = x % 100000000; // uint64_t toptop = top / 10000; uint64_t topbottom = top % 10000; uint64_t bottomtop = bottom / 10000; uint64_t bottombottom = bottom % 10000; memcpy(out, &bigtable[4 * toptop], 4); memcpy(out + 4, &bigtable[4 * topbottom], 4); memcpy(out + 8, &bigtable[4 * bottomtop], 4); memcpy(out + 12, &bigtable[4 * bottombottom], 4); }

An intermediate solution with a 3-character table would only require a 3kB table. I also consider Muła’s SIMD-based approach though I refer you to his article for details. Effectively Muła use fancy Intel-specific instructions to get the job done.

If you cannot use SIMD instructions, you can use something similar called SWAR. Effectively, you pack several integer values inside a wide integer (64 bits) and you try to somehow save instructions. Luckily, Khuong has a solution for us:

// credit: Paul Khuong uint64_t encode_ten_thousands(uint64_t hi, uint64_t lo) { uint64_t merged = hi | (lo << 32); uint64_t top = ((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL); uint64_t bot = merged - 100ULL * top; uint64_t hundreds; uint64_t tens; hundreds = (bot << 16) + top; tens = (hundreds * 103ULL) >> 10; tens &= (0xFULL << 48) | (0xFULL << 32) | (0xFULL << 16) | 0xFULL; tens += (hundreds - 10ULL * tens) << 8; return tens; } void to_string_khuong(uint64_t x, char *out) { uint64_t top = x / 100000000; uint64_t bottom = x % 100000000; uint64_t first = 0x3030303030303030 + encode_ten_thousands(top / 10000, top % 10000); memcpy(out, &first, sizeof(first)); uint64_t second = 0x3030303030303030 + encode_ten_thousands(bottom / 10000, bottom % 10000); memcpy(out + 8, &second, sizeof(second)); }

I refer you to Khuong’s blog post for a description.

I wrote a small benchmark in C++ which measures the time per integer. Remember that every call to my functions produces 16 digits, exactly.

function | Apple M1, LLVM 12 | AMD Zen 2, GCC 10 |
---|---|---|

linear | 14 ns | 25 ns |

backward linear | 7.7 ns | 18 ns |

tree | 6.9 ns | 15 ns |

Khuong | 3.3 ns | 8.0 ns |

small table | 3.7 ns | 7.1 ns |

SIMD | non-available | 4.8 ns |

big table | 1.5 ns | 2.9 ns |

On both processors, the crazy big-table (40kB) approach is about 2 times faster than the version with a small table. Though a big-table can be justified in some instances, my feeling is that only in niche applications would such a big table be acceptable for such a narrow task. Even a smaller 3kB seems like an overkill given the good results we get with a small table.

The SIMD approach has a rather minimal gain compared to the version with a small table (merely 25%).

At a glance, the small table wins on practical ground. It is small, simple and portable.

uint64_t bottombottom = top % 10000;

it’s supposed to be

uint64_t bottombottom = bottom % 10000;

Fixed.

Another nifty technique which allows 64-bit conversion without 64-bit arithmetic is Douglas W. Jones’ technique described at http://homepage.divms.uiowa.edu/~jones/bcd/decimal.html and implemented in e.g. https://elixir.bootlin.com/linux/latest/source/lib/vsprintf.c#L325.

The Linux code also implements division-by-constant using multiplication manually. While most compilers these days know how to optimize divide by constant to a multiply and shift, they usually can’t infer the limited ranges of the inputs which allows smaller multipliers and no fixups.

16-bit numbers need only one multiply per digit?

void lulz_atoi(char* str, uint16_t val) {

uint64_t lo = val;

uint64_t hi;

__uint128_t x = (__uint128_t)lo * ((0xFFFFFFFFFFFFFFFFULL / 10000) + 1);

hi = x >> 64;

lo = (uint64_t)x;

str[0] = hi + 0x30;

for (int i = 1; i > 64;

lo = (uint64_t)x;

`str[i] = hi + 0x30;`

}

str[5] = 0;

}

https://gist.github.com/aqrit/2997713a00ad043b4bac42e342294259

Updated gist to 64-bits. I’ve not checked the generated assembly. Not benchmarked against the other implementations because a uint64_t should have 20 decimal digits…

ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.

# make

c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp

# ./convert

khuong 7.15067

backlinear 30.8381

linear 21.6466

tree 14.2078

treetst 10.0964

treest 6.32523

treebt 2.30575

sse2 4.81692

sse2(2) 4.70681

avx2 2.00375

khuong 7.15235

backlinear 30.8286

linear 21.6496

tree 14.2603

treetst 10.0969

treest 6.32516

treebt 2.30584

sse2 4.81617

sse2(2) 4.70639

avx2 2.00354

khuong 7.15085

backlinear 30.8319

linear 21.6403

tree 14.2665

treetst 10.0935

treest 6.32525

treebt 2.30579

sse2 4.81627

sse2(2) 4.70653

avx2 2.00359

ICX AVX2 numbers look pretty nice, although “-march=native” was needed to get all three SIMD versions.

# make

c++ -O3 -march=native -Wall -Wextra -std=c++17 -o convert convert.cpp

# ./convert

khuong 7.15067

backlinear 30.8381

linear 21.6466

tree 14.2078

treetst 10.0964

treest 6.32523

treebt 2.30575

sse2 4.81692

sse2(2) 4.70681

avx2 2.00375

khuong 7.15235

backlinear 30.8286

linear 21.6496

tree 14.2603

treetst 10.0969

treest 6.32516

treebt 2.30584

sse2 4.81617

sse2(2) 4.70639

avx2 2.00354

khuong 7.15085

backlinear 30.8319

linear 21.6403

tree 14.2665

treetst 10.0935

treest 6.32525

treebt 2.30579

sse2 4.81627

sse2(2) 4.70653

avx2 2.00359

Can we run this type of coding syntax on an online compiler like this https://www.interviewbit.com/online-java-compiler/

Are approaches that only work for streams of many integers of interest? I’m calculating a base-10 checksum of all 10 digits of a u32 in <1ns per u32 with AVX2. This includes an intermediate step in which the length of the encoded string (without leading zeros) is calculated, but of course it does not include outputting the string.

The layout of the digits within the vectors is pretty bad for conversion-to-string purposes: the digits for a single integer are spread out between 2 adjacent bytes of 5 different vectors. It seems like it would be not so expensive to fix the layout to be friendlier for outputting, especially if the task was just to output u32 of at most 8 digits or u64 of at most 16 digits. I don't know a clever way to output u64 of 20 digits though, I would just have to add another scalar modmul at the beginning to split the u64 into groups of (4,8,8) digit.

To compute the length, if you imagine you have the groups of 4 digits in 4 vectors, you can take max(andnot(digits == 0, digitcount))+1 to get the length where digitcount is a constant:

[3,2,1,0]

[7,6,5,4]

[11,10,9,8]

[15,14,13,12]

and max() is actually a sequence of 3 max, shuffle, max, shuffle, max. After that you can use arithmetic to calculate the proper shuffle to move the digits into the front of the vector so you can output a single formatted integer. You end up with a bunch of data dependencies but no branches. The data dependencies will surely rain on your parade though.