SWAR explained: parsing eight digits

It is common to want to parse long strings of digits into integer values. Because it is a common task, we want to optimize it as much as possible.

In the blog post, Quickly parsing eight digits, I presented a very quick way to parse eight ASCII characters representing an integers (e.g., 12345678) into the corresponding binary value. I want to come back to it and explain it a bit more, to show that it is not magic. This works in most programming languages, but I will stick with C for this blog post.

To recap, the long way is a simple loop:

uint32_t parse_eight_digits(const unsigned char *chars) {
  uint32_t x = chars[0] - '0';
  for (size_t j = 1; j < 8; j++)
    x = x * 10 + (chars[j] - '0');
  return x;
}

We use the fact that in ASCII, the numbers 0, 1, … are in consecutive order in terms of byte values. The character ‘0’ is 0x30 (or 48 in decimal), the character ‘1’ is 0x31 (49 in decimal) and so forth. At each step in the loop, we multiple the running value by 10 and add the value of the next digit.

It assumes that all characters are in the valid range (from ‘0’ to ‘9’): other code should check that it is the case.

An optimizing compiler will probably unroll the loop and produce code that might look like this in assembly:

        movzx   eax, byte ptr [rdi]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 1]
        lea     eax, [rcx + 2*rax]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 2]
        lea     eax, [rcx + 2*rax]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 3]
        lea     eax, [rcx + 2*rax]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 4]
        lea     eax, [rcx + 2*rax]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 5]
        lea     eax, [rcx + 2*rax]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 6]
        lea     eax, [rcx + 2*rax]
        lea     eax, [rax + 4*rax]
        movzx   ecx, byte ptr [rdi + 7]
        lea     eax, [rcx + 2*rax]
        add     eax, -533333328

Notice how there are many loads, and a whole lot of operations.

We can substantially shorten the resulting code, down to something that looks like the following:

        imul    rax, qword ptr [rdi], 2561
        movabs  rcx, -1302123111085379632
        add     rcx, rax
        shr     rcx, 8
        movabs  rax, 71777214294589695
        and     rax, rcx
        imul    rax, rax, 6553601
        shr     rax, 16
        movabs  rcx, 281470681808895
        and     rcx, rax
        movabs  rax, 42949672960001
        imul    rax, rcx
        shr     rax, 32

How do we do it? We use a technique called SWAR which stands for SIMD within a register. The intuition behind is that modern computers have 64-bit registers. Processing eight consecutive bytes as eight distinct words, as in the native code above, is inefficient given how wide our registers are.

The first step is to load all eight characters into a 64-bit register. In C, you might do it in this manner:

int64_t val; 
memcpy(&val, chars, 8);

It looks maybe expensive, but most compilers will translate the memcpy instruction into a single load, when compiling with optimizations turned on.

Computers store values in little-endian order. This means that the first byte you encounter is going to be used as the least significant byte, and so forth.

Then we want to subtract the character ‘0‘ (or 0x30 in hexadecimal). We can do it with a single operation:

val = val - 0x3030303030303030;

So if you had the string ‘12345678‘, you will now have the value 0x0807060504030201.

Next we are going to do a kind of pyramidal computation. We add pairs of successive bytes, then pairs of successive 16-bit values and then pairs of successive 32-bit bytes.

It goes something like this, suppose that you have the sequence of digit values b1, b2, b3, b4, b5, b6, b7, b8. You want to do…

  • add pairs of bytes: 10*b1+b2, 10*b3+b4, 10*b5+b6, 10*b7+b8
  • combine first and third sum: 1000000*(10*b1+b2) + 100*(10*b5+b6)
  • combine second and fourth sum: 10*b7+b8 + 10000*(10*b3+b4)

I will only explain the first step (pairs of bytes) as the other two steps are similar. Consider the least significant two bytes, which have value 256*b2 + b1. We multiply it by 10, and we add the value shifted by 8 bits, and we get b1+10*b2 in the least significant byte. We can compute 4 such operations in one operation…

val = (val * 10) + (val >> 8);

The next two steps are similar:

val1 = (((val & 0x000000FF000000FF) * (100 + (1000000ULL << 32)));

val2 = (((val >> 16) & 0x000000FF000000FF) 
          * (1 + (10000ULL << 32))) >> 32;

And the overall code looks as follows…

uint32_t  parse_eight_digits_unrolled(uint64_t val) {
  const uint64_t mask = 0x000000FF000000FF;
  const uint64_t mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
  const uint64_t mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
  val -= 0x3030303030303030;
  val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
  val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
  return val;
}

Appendix: You can do much the same in C# starting with a byte pointer (byte* chars):

ulong val = Unsafe.ReadUnaligned<ulong>(chars);
const ulong mask = 0x000000FF000000FF;
const ulong mul1 = 0x000F424000000064; 
// 100 + (1000000ULL << 32)
const ulong mul2 = 0x0000271000000001; 
// 1 + (10000ULL << 32)
val -= 0x3030303030303030;
val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

12 thoughts on “SWAR explained: parsing eight digits”

  1. I am very much intrigued by the differences of the algorithms in the two code snippets.

    The algorithm in the C code looks slower than the one in C# because it has 3 multiplications that depend on each other. The last two multiplications of the C# code are independent of each other.

    I am also curious on why you did not use ‘val = (val * 2561) >> 8’ in the second last line of the C# code?

    I have implemented both algorithms in Java, and measurements with JMH indicate that the C# algorithm is slower than the C algorithm. However when I use ‘val = (val * 2561) >> 8’ in the C# algorithm, it has the same speed as the C algorithm.

  2. Hi, Daniel. A couple things I just noticed in parse_eight_digits_swar… (1) It seems to be missing the declaration and loading of the variable val. (2) val was declared earlier as type int64_t but the return value of the function is uint32_t. This should be fine since $log_2(10^8) < 32$, but is likely to generate a compiler warning in the absence of an explicit cast, e.g., return (uint32_t)val;.

    1. is likely to generate a compiler warning

      Yes. It is likely to generate a warning since there is an implicit cast.

      It seems to be missing the declaration and loading of the variable val.

      It is passed as a parameter to the function.

      1. It is passed as a parameter to the function.

        It is now (thank you for fixing it!), but at the time I posted my comment, it was not. The only parameter being passed was const char *chars (I’m looking at it right now on my screen, which still has the old code in another window.)
        Cheers!

  3. It may be worth noting that this technique is not only useful for converting an 8-digit decimal integer to binary, but can also be used to load 8 digits at a time from a larger number, if you treat the 8 characters as a base-100,000,000 digit. So, say you were loading a 40-digit base-10 number with leading zeros into a uint128_t variable. It would require only five invocations of the 8-digit parsing function, with five additional multiplications (by the constant 100,000,000) and five additions accumulate the five base-100,000,000 digits.

  4. One other related thought for possible exploration is whether or not SWAR techniques can be used to efficiently count the number of ASCII base-10 digits (‘0’..’9′) appearing sequentially at a given position in a string or at a given memory location. If so, then one could construct 8 versions of the SWAR parser, one for each count of available characters, 1 to 8, and invoke the appropriate version, thus eliminating the requirement for leading zeros to be present.

  5. There’s a typo which is a little confusing.

    1000000*(10*b1+b2) + 100*(10*b3+b6)
    should read:
    1000000*(10*b1+b2) + 100*(10*b5+b6)

Leave a Reply to David Fetter Cancel reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.