# SWAR explained: parsing eight digits

It is common to want to parse long strings of digits into integer values. Because it is a common task, we want to optimize it as much as possible.

In the blog post, Quickly parsing eight digits, I presented a very quick way to parse eight ASCII characters representing an integers (e.g., 12345678) into the corresponding binary value. I want to come back to it and explain it a bit more, to show that it is not magic. This works in most programming languages, but I will stick with C for this blog post.

To recap, the long way is a simple loop:

```uint32_t parse_eight_digits(const unsigned char *chars) {
uint32_t x = chars - '0';
for (size_t j = 1; j < 8; j++)
x = x * 10 + (chars[j] - '0');
return x;
}
```

We use the fact that in ASCII, the numbers 0, 1, … are in consecutive order in terms of byte values. The character ‘0’ is 0x30 (or 48 in decimal), the character ‘1’ is 0x31 (49 in decimal) and so forth. At each step in the loop, we multiple the running value by 10 and add the value of the next digit.

It assumes that all characters are in the valid range (from ‘0’ to ‘9’): other code should check that it is the case.

An optimizing compiler will probably unroll the loop and produce code that might look like this in assembly:

```        movzx   eax, byte ptr [rdi]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 1]
lea     eax, [rcx + 2*rax]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 2]
lea     eax, [rcx + 2*rax]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 3]
lea     eax, [rcx + 2*rax]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 4]
lea     eax, [rcx + 2*rax]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 5]
lea     eax, [rcx + 2*rax]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 6]
lea     eax, [rcx + 2*rax]
lea     eax, [rax + 4*rax]
movzx   ecx, byte ptr [rdi + 7]
lea     eax, [rcx + 2*rax]
```

Notice how there are many loads, and a whole lot of operations.

We can substantially shorten the resulting code, down to something that looks like the following:

```        imul    rax, qword ptr [rdi], 2561
movabs  rcx, -1302123111085379632
shr     rcx, 8
movabs  rax, 71777214294589695
and     rax, rcx
imul    rax, rax, 6553601
shr     rax, 16
movabs  rcx, 281470681808895
and     rcx, rax
movabs  rax, 42949672960001
imul    rax, rcx
shr     rax, 32
```

How do we do it? We use a technique called SWAR which stands for SIMD within a register. The intuition behind is that modern computers have 64-bit registers. Processing eight consecutive bytes as eight distinct words, as in the native code above, is inefficient given how wide our registers are.

The first step is to load all eight characters into a 64-bit register. In C, you might do it in this manner:

```int64_t val;
memcpy(&val, chars, 8);
```

It looks maybe expensive, but most compilers will translate the memcpy instruction into a single load, when compiling with optimizations turned on.

Computers store values in little-endian order. This means that the first byte you encounter is going to be used as the least significant byte, and so forth.

Then we want to subtract the character ‘0‘ (or 0x30 in hexadecimal). We can do it with a single operation:

```val = val - 0x3030303030303030;
```

So if you had the string ‘12345678‘, you will now have the value 0x0807060504030201.

Next we are going to do a kind of pyramidal computation. We add pairs of successive bytes, then pairs of successive 16-bit values and then pairs of successive 32-bit bytes.

It goes something like this, suppose that you have the sequence of digit values b1, b2, b3, b4, b5, b6, b7, b8. You want to do…

• add pairs of bytes: 10*b1+b2, 10*b3+b4, 10*b5+b6, 10*b7+b8
• combine first and third sum: 1000000*(10*b1+b2) + 100*(10*b5+b6)
• combine second and fourth sum: 10*b7+b8 + 10000*(10*b3+b4)

I will only explain the first step (pairs of bytes) as the other two steps are similar. Consider the least significant two bytes, which have value 256*b2 + b1. We multiply it by 10, and we add the value shifted by 8 bits, and we get b1+10*b2 in the least significant byte. We can compute 4 such operations in one operation…

```val = (val * 10) + (val >> 8);
```

The next two steps are similar:

```val1 = (((val & 0x000000FF000000FF) * (100 + (1000000ULL << 32)));
```

```val2 = (((val >> 16) & 0x000000FF000000FF)
* (1 + (10000ULL << 32))) >> 32;
```

And the overall code looks as follows…

```uint32_t  parse_eight_digits_unrolled(uint64_t val) {
const uint64_t mask = 0x000000FF000000FF;
const uint64_t mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
const uint64_t mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
val -= 0x3030303030303030;
val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
return val;
}
```

Appendix: You can do much the same in C# starting with a byte pointer (byte* chars):

```ulong val = Unsafe.ReadUnaligned<ulong>(chars);
const ulong mask = 0x000000FF000000FF;
const ulong mul1 = 0x000F424000000064;
// 100 + (1000000ULL << 32)
const ulong mul2 = 0x0000271000000001;
// 1 + (10000ULL << 32)
val -= 0x3030303030303030;
val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
``` ### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 12 thoughts on “SWAR explained: parsing eight digits”

1. Werner Randelshofer says:

I am very much intrigued by the differences of the algorithms in the two code snippets.

The algorithm in the C code looks slower than the one in C# because it has 3 multiplications that depend on each other. The last two multiplications of the C# code are independent of each other.

I am also curious on why you did not use ‘val = (val * 2561) >> 8’ in the second last line of the C# code?

I have implemented both algorithms in Java, and measurements with JMH indicate that the C# algorithm is slower than the C algorithm. However when I use ‘val = (val * 2561) >> 8’ in the C# algorithm, it has the same speed as the C algorithm.

1. Daniel Lemire says:

I have merged the two versions. Thank for your comment.

2. Todd Lehman says:

Hi, Daniel. A couple things I just noticed in `parse_eight_digits_swar`… (1) It seems to be missing the declaration and loading of the variable `val`. (2) `val` was declared earlier as type `int64_t` but the return value of the function is `uint32_t`. This should be fine since \$log_2(10^8) < 32\$, but is likely to generate a compiler warning in the absence of an explicit cast, e.g., `return (uint32_t)val;`.

1. Daniel Lemire says:

is likely to generate a compiler warning

Yes. It is likely to generate a warning since there is an implicit cast.

It seems to be missing the declaration and loading of the variable val.

It is passed as a parameter to the function.

1. Todd Lehman says:

It is passed as a parameter to the function.

It is now (thank you for fixing it!), but at the time I posted my comment, it was not. The only parameter being passed was `const char *chars` (I’m looking at it right now on my screen, which still has the old code in another window.)
Cheers!

1. Daniel Lemire says:

Thanks for your comment. You are correct.

3. Todd Lehman says:

It may be worth noting that this technique is not only useful for converting an 8-digit decimal integer to binary, but can also be used to load 8 digits at a time from a larger number, if you treat the 8 characters as a base-100,000,000 digit. So, say you were loading a 40-digit base-10 number with leading zeros into a uint128_t variable. It would require only five invocations of the 8-digit parsing function, with five additional multiplications (by the constant 100,000,000) and five additions accumulate the five base-100,000,000 digits.

4. Todd Lehman says:

One other related thought for possible exploration is whether or not SWAR techniques can be used to efficiently count the number of ASCII base-10 digits (‘0’..’9′) appearing sequentially at a given position in a string or at a given memory location. If so, then one could construct 8 versions of the SWAR parser, one for each count of available characters, 1 to 8, and invoke the appropriate version, thus eliminating the requirement for leading zeros to be present.

5. Ioann_V says:

Hey, Dave, it’s mb interesting for u:

Here, 1+ year ago i wrote a publication (in Russian) about parsing decimal numbers, without using LUT:

https://habr.com/ru/post/522820/

It s also using this trick, which u named SWAR :]

6. David Fetter says:

There’s a typo which is a little confusing.

1000000*(10*b1+b2) + 100*(10*b3+b6)
1. Daniel Lemire says: