It is common to want to parse long strings of digits into integer values. Because it is a common task, we want to optimize it as much as possible.
In the blog post, Quickly parsing eight digits, I presented a very quick way to parse eight ASCII characters representing an integers (e.g., 12345678) into the corresponding binary value. I want to come back to it and explain it a bit more, to show that it is not magic. This works in most programming languages, but I will stick with C for this blog post.
To recap, the long way is a simple loop:
uint32_t parse_eight_digits(const unsigned char *chars) { uint32_t x = chars[0] - '0'; for (size_t j = 1; j < 8; j++) x = x * 10 + (chars[j] - '0'); return x; }
We use the fact that in ASCII, the numbers 0, 1, … are in consecutive order in terms of byte values. The character ‘0’ is 0x30 (or 48 in decimal), the character ‘1’ is 0x31 (49 in decimal) and so forth. At each step in the loop, we multiple the running value by 10 and add the value of the next digit.
It assumes that all characters are in the valid range (from ‘0’ to ‘9’): other code should check that it is the case.
An optimizing compiler will probably unroll the loop and produce code that might look like this in assembly:
movzx eax, byte ptr [rdi] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 1] lea eax, [rcx + 2*rax] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 2] lea eax, [rcx + 2*rax] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 3] lea eax, [rcx + 2*rax] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 4] lea eax, [rcx + 2*rax] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 5] lea eax, [rcx + 2*rax] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 6] lea eax, [rcx + 2*rax] lea eax, [rax + 4*rax] movzx ecx, byte ptr [rdi + 7] lea eax, [rcx + 2*rax] add eax, -533333328
Notice how there are many loads, and a whole lot of operations.
We can substantially shorten the resulting code, down to something that looks like the following:
imul rax, qword ptr [rdi], 2561 movabs rcx, -1302123111085379632 add rcx, rax shr rcx, 8 movabs rax, 71777214294589695 and rax, rcx imul rax, rax, 6553601 shr rax, 16 movabs rcx, 281470681808895 and rcx, rax movabs rax, 42949672960001 imul rax, rcx shr rax, 32
How do we do it? We use a technique called SWAR which stands for SIMD within a register. The intuition behind is that modern computers have 64-bit registers. Processing eight consecutive bytes as eight distinct words, as in the native code above, is inefficient given how wide our registers are.
The first step is to load all eight characters into a 64-bit register. In C, you might do it in this manner:
int64_t val; memcpy(&val, chars, 8);
It looks maybe expensive, but most compilers will translate the memcpy instruction into a single load, when compiling with optimizations turned on.
Computers store values in little-endian order. This means that the first byte you encounter is going to be used as the least significant byte, and so forth.
Then we want to subtract the character ‘0‘ (or 0x30 in hexadecimal). We can do it with a single operation:
val = val - 0x3030303030303030;
So if you had the string ‘12345678‘, you will now have the value 0x0807060504030201.
Next we are going to do a kind of pyramidal computation. We add pairs of successive bytes, then pairs of successive 16-bit values and then pairs of successive 32-bit bytes.
It goes something like this, suppose that you have the sequence of digit values b1, b2, b3, b4, b5, b6, b7, b8. You want to do…
- add pairs of bytes: 10*b1+b2, 10*b3+b4, 10*b5+b6, 10*b7+b8
- combine first and third sum: 1000000*(10*b1+b2) + 100*(10*b5+b6)
- combine second and fourth sum: 10*b7+b8 + 10000*(10*b3+b4)
I will only explain the first step (pairs of bytes) as the other two steps are similar. Consider the least significant two bytes, which have value 256*b2 + b1. We multiply it by 10, and we add the value shifted by 8 bits, and we get b1+10*b2 in the least significant byte. We can compute 4 such operations in one operation…
val = (val * 10) + (val >> 8);
The next two steps are similar:
val1 = (((val & 0x000000FF000000FF) * (100 + (1000000ULL << 32)));
val2 = (((val >> 16) & 0x000000FF000000FF) * (1 + (10000ULL << 32))) >> 32;
And the overall code looks as follows…
uint32_t parse_eight_digits_unrolled(uint64_t val) { const uint64_t mask = 0x000000FF000000FF; const uint64_t mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32) const uint64_t mul2 = 0x0000271000000001; // 1 + (10000ULL << 32) val -= 0x3030303030303030; val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8; val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32; return val; }
Appendix: You can do much the same in C# starting with a byte pointer (byte* chars):
ulong val = Unsafe.ReadUnaligned<ulong>(chars); const ulong mask = 0x000000FF000000FF; const ulong mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32) const ulong mul2 = 0x0000271000000001; // 1 + (10000ULL << 32) val -= 0x3030303030303030; val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8; val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
Please put a HTML code element inside of HTML pre text
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/code#notes
I am very much intrigued by the differences of the algorithms in the two code snippets.
The algorithm in the C code looks slower than the one in C# because it has 3 multiplications that depend on each other. The last two multiplications of the C# code are independent of each other.
I am also curious on why you did not use ‘val = (val * 2561) >> 8’ in the second last line of the C# code?
I have implemented both algorithms in Java, and measurements with JMH indicate that the C# algorithm is slower than the C algorithm. However when I use ‘val = (val * 2561) >> 8’ in the C# algorithm, it has the same speed as the C algorithm.
I have merged the two versions. Thank for your comment.
Hi, Daniel. A couple things I just noticed in
parse_eight_digits_swar
… (1) It seems to be missing the declaration and loading of the variableval
. (2)val
was declared earlier as typeint64_t
but the return value of the function isuint32_t
. This should be fine since $log_2(10^8) < 32$, but is likely to generate a compiler warning in the absence of an explicit cast, e.g.,return (uint32_t)val;
.is likely to generate a compiler warning
Yes. It is likely to generate a warning since there is an implicit cast.
It seems to be missing the declaration and loading of the variable val.
It is passed as a parameter to the function.
It is now (thank you for fixing it!), but at the time I posted my comment, it was not. The only parameter being passed was
const char *chars
(I’m looking at it right now on my screen, which still has the old code in another window.)Cheers!
Thanks for your comment. You are correct.
It may be worth noting that this technique is not only useful for converting an 8-digit decimal integer to binary, but can also be used to load 8 digits at a time from a larger number, if you treat the 8 characters as a base-100,000,000 digit. So, say you were loading a 40-digit base-10 number with leading zeros into a uint128_t variable. It would require only five invocations of the 8-digit parsing function, with five additional multiplications (by the constant 100,000,000) and five additions accumulate the five base-100,000,000 digits.
One other related thought for possible exploration is whether or not SWAR techniques can be used to efficiently count the number of ASCII base-10 digits (‘0’..’9′) appearing sequentially at a given position in a string or at a given memory location. If so, then one could construct 8 versions of the SWAR parser, one for each count of available characters, 1 to 8, and invoke the appropriate version, thus eliminating the requirement for leading zeros to be present.
Hey, Dave, it’s mb interesting for u:
Here, 1+ year ago i wrote a publication (in Russian) about parsing decimal numbers, without using LUT:
https://habr.com/ru/post/522820/
It s also using this trick, which u named SWAR :]
There’s a typo which is a little confusing.
1000000*(10*b1+b2) + 100*(10*b3+b6)
should read:
1000000*(10*b1+b2) + 100*(10*b5+b6)
Thank you!