Parsing 8-bit integers quickly

Suppose that you want to quickly parse 8-bit integers (0, 1, 2, …, 254, 255) from an ASCII/UTF-8 string. The problem comes up in the simdzone project led by Jeroen Koekkoek (NLnet Labs). You are given a string and its length: e.g., "22" with length 2. The naive approach in C might be:

int parse_uint8_naive(const char *str, size_t len, uint8_t *num) {
  uint32_t n = 0;
  for (size_t i = 0, r = len & 0x3; i < r; i++) {
    uint8_t d = (uint8_t)(str[i] - '0');
    if (d > 9)
      return 0; // non-digit character
    n = n * 10 + d;
  }
  *num = (uint8_t)n;
  return n < 256 && len && len < 4; // value fits in 8 bits, length in [1,3]
}

This code is a C function that takes a string of characters, its length, and a pointer to an unsigned 8-bit integer as input arguments. The function returns a Boolean value indicating whether the input string can be parsed into an unsigned 8-bit integer. It restricts the input to at most three characters, but it allows leading zeros (e.g., 002 is 2).

The function first initializes a 32-bit unsigned integer n to zero; we will store our answer there. The function then iterates over the input string, extracting each digit character and converting it to its numerical value. At each step, n is multiplied by 10 and the new digit is added. This continues until the end of the string, for at most three bytes. Finally, the function assigns the value of n to the unsigned 8-bit integer pointed to by num, and returns a Boolean value indicating whether the parsed integer is less than 256 and the length of the input string is between 1 and 3 characters. If the input string contains any non-digit character, or if its length is greater than 3 bytes, the function returns false.
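To make the contract concrete, here is a small usage sketch (assuming the function above is in scope):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint8_t v;
  printf("%d\n", parse_uint8_naive("002", 3, &v)); // 1, v == 2 (leading zeros allowed)
  printf("%d\n", parse_uint8_naive("255", 3, &v)); // 1, v == 255
  printf("%d\n", parse_uint8_naive("256", 3, &v)); // 0 (does not fit in 8 bits)
  printf("%d\n", parse_uint8_naive("1x", 2, &v));  // 0 (non-digit character)
  return 0;
}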

If the length of the input is predictable, then this naive function should be fast. However, if the length varies (between 1 and 3), then the processor will tend to mispredict branches, incurring an expensive penalty.

In C++, you could call std::from_chars:

// requires <charconv> (C++17)
int parse_uint8_fromchars(const char *str, size_t len, uint8_t *num) {
  auto [p, ec] = std::from_chars(str, str + len, *num);
  return (ec == std::errc());
}

The std::from_chars function takes three arguments: a pointer to the beginning of the input character sequence, a pointer to the end of the input character sequence, and a reference to the output variable that will hold the parsed integer value. The function returns a std::from_chars_result object that contains two members: a pointer to the first character that was not parsed, and an error code that indicates whether the parsing was successful or not.

In this function, std::from_chars is called with pointers to the beginning and end of the input string, along with a reference to the unsigned 8-bit integer that will hold the parsed value. The function then checks whether the error code returned by std::from_chars equals std::errc(), which indicates that the parsing was successful: if so, the function returns true; otherwise, it returns false.
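For instance (a small usage sketch):

#include <charconv>
#include <cstdint>
#include <cstdio>

int main() {
  const char s[] = "255";
  uint8_t v = 0;
  auto [ptr, ec] = std::from_chars(s, s + 3, v);
  if (ec == std::errc()) {
    printf("parsed %d, consumed %d characters\n", (int)v, (int)(ptr - s));
  }
  // "300" would fail with std::errc::result_out_of_range;
  // "12a" would succeed with the value 12, leaving ptr pointing at 'a'.
  return 0;
}

Note that std::from_chars succeeds even when it does not consume the whole input, so the wrapper above accepts trailing non-digits; as discussed below, it solves a slightly different problem than the other functions.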

Can we do better than these functions? It is not obvious that we can. A function that reads between 1 and 3 bytes is not a function you would normally try to optimize. But still: can we do it? Can we go faster?

Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer). This is often a safe assumption. If your numbers are within a larger string, then you can often check efficiently whether you are within 4 bytes of the end of the string. Even if that is not the case, reading 4 bytes is safe as long as you do not cross a page boundary, something you may check easily.
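For instance, with conventional 4 KiB pages, the check might look as follows (a quick sketch; the page size is an assumption):

#include <stdint.h>

// A 4-byte load starting at p cannot cross a page boundary if the offset
// of p within its page is at most page_size - 4. The 4 KiB page size is
// an assumption (it is typical on x64 and ARM systems).
static inline int four_byte_read_is_safe(const char *p) {
  return ((uintptr_t)p % 4096) <= 4096 - 4;
}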

So you can load your input into a 32-bit word and process all bytes at once, in a single register. This is often called SWAR: in computer science, SWAR means SIMD within a register, a technique for performing parallel operations on data contained in a processor register.
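As a minimal illustration of the idea:

#include <stdint.h>
#include <string.h>

// SWAR in one line: a single 32-bit XOR converts four ASCII digits at once.
// "1234" loads as 0x34333231 on a little-endian machine, and
// 0x34333231 ^ 0x30303030 == 0x04030201: the values 1, 2, 3, 4, one per byte.
uint32_t ascii_digits_to_values(const char *four_digits) {
  uint32_t w;
  memcpy(&w, four_digits, 4);
  return w ^ 0x30303030;
}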

Jeroen Koekkoek first came up with a valid SWAR approach, but it was only about 40% faster than the naive approach when the inputs were unpredictable, and potentially slower than the naive approach given predictable inputs. Let us examine an approach that might be competitive all around:

int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if (len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits)); // always reads 4 bytes
  digits.as_int ^= 0x30303030lu;               // ASCII digits -> values 0..9
  digits.as_int <<= ((4 - len) * 8);           // discard the bytes past the string
  // a byte lane holds a digit iff neither the lane nor lane + 6 has its high nibble set
  uint32_t all_digits =
      ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24); // 100*d1 + 10*d2 + d3
  return all_digits
      & (__builtin_bswap32(digits.as_int) <= 0x020505); // digits at most 2,5,5: value <= 255
}

Again, this code is a C function that takes a string of characters, its length, and a pointer to an unsigned 8-bit integer as input arguments, and returns a Boolean value indicating whether the input string can be parsed into an unsigned 8-bit integer. We check whether the length is in range ([1,3]); if not, we return false, terminating the function. After this initial check, the function declares a union digits that contains an array of 4 unsigned 8-bit integers and a 32-bit unsigned integer, and copies the input string into the 32-bit unsigned integer with memcpy. In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big-endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code in this blog post, I assume that you do not have a big-endian system, which is 99.99% certain.
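If you did want to handle a big-endian target, a guarded byte reversal right after the memcpy would restore the layout that the rest of the function expects (an untested sketch, relying on GCC/Clang predefined macros):

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  // put the first input character in the least significant byte,
  // as the little-endian code path assumes
  digits.as_int = __builtin_bswap32(digits.as_int);
#endif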

The function then XORs the 32-bit unsigned integer with the constant 0x30303030lu. This converts each digit character in the input string to its corresponding decimal value: the ASCII characters 0 to 9 have code-point values 0x30 to 0x39. The function then shifts the 32-bit unsigned integer to the left by a number of bits that depends on the length of the input string. This removes any trailing bytes that were not part of the input string. Technically, when the length is not within the allowed range ([1,3]), the shift would be undefined behaviour, but we return early in that case, before the shift is ever executed.
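To make these steps concrete, consider the input "25" (length 2) on a little-endian machine, writing ?? for whatever bytes happen to follow the string in memory:

// after memcpy:            digits.as_int == 0x????3532  ('2' = 0x32, '5' = 0x35)
// after ^= 0x30303030:     digits.as_int == 0x????0502  (digit values 2 and 5)
// after <<= (4 - 2) * 8:   digits.as_int == 0x05020000  (garbage shifted out)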

The next part is where I contributed to the routine. First, we check that all characters are digits using a concise expression. The function then multiplies the 32-bit unsigned integer by the constant 0x640a01. It is a concise way to do two multiplications (by 100 and by 10) and two sums at once: observe that 0x64 is 100 and 0x0a is 10. The result of this multiplication is then shifted to the right by 24 bits, which extracts the most significant byte of the 32-bit unsigned integer: the parsed value. Finally, the function assigns this value to the unsigned 8-bit integer pointed to by num, and returns a Boolean value indicating whether the parsed integer is less than 256 and made entirely of digits.
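Continuing the "25" example, the multiplication gathers everything in the top byte:

// (0x640a01 * 0x05020000) >> 24 == 25, because the top byte accumulates
//   5 * 1    (last digit, multiplied by 0x01)
// + 2 * 10   (middle digit, multiplied by 0x0a, one byte lower)
// + 0 * 100  (first digit lane, multiplied by 0x64, two bytes lower)
// and the lower partial products cannot carry into it here.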

The function might compile to 20 assembly instructions or so on x64 processors:

lea rcx, [rsi - 4]
xor eax, eax
cmp rcx, -3
jb .END
mov eax, 808464432
xor eax, dword ptr [rdi]
shl esi, 3
neg sil
mov ecx, esi
shl eax, cl
lea ecx, [rax + 101058054]
or ecx, eax
test ecx, -252645136
sete cl
imul esi, eax, 6556161
shr esi, 24
mov byte ptr [rdx], sil
bswap eax
cmp eax, 132358
setb al
and al, cl
movzx eax, al
.END: # %return
ret

To test these functions, I wrote a benchmark. The benchmark uses either random inputs or sequential inputs (0, 1, …), and this distinction ends up being very relevant. I use GCC 12 and an Ice Lake (Intel) Linux server running at 3.2 GHz. I report the number of millions of numbers parsed per second.

                 random numbers   sequential numbers
std::from_chars  145 M/s          255 M/s
naive            210 M/s          365 M/s
SWAR             425 M/s          425 M/s

So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable. For predictable inputs, we surpass the naive approach by about 15%. Whether it is helpful in your use case depends on your data, so you should run your own benchmarks.

Importantly, the SWAR approach is entirely equivalent to the naive approach, except for the fact that it always reads 4 bytes irrespective of the length.

The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library. It solves a slightly different problem but the difference is still quite large. It could be that there is room for optimization in the standard library (GCC 12).

Can you do better? The benchmark is available. As part of the benchmark, we check exhaustively that the validation is correct.
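If you want to reproduce that kind of check, a sketch might look as follows (assuming both functions above are in scope; it enumerates every input of length 1 to 3 on a little-endian machine, padding the 4-byte buffer with zero bytes):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  for (size_t len = 1; len <= 3; len++) {
    for (uint32_t bits = 0; bits < (1u << (8 * len)); bits++) {
      char buf[4];
      memcpy(buf, &bits, 4); // little endian: byte i of bits is buf[i]
      uint8_t a = 0, b = 0;
      int ra = parse_uint8_naive(buf, len, &a);
      int rb = parse_uint8_fastswar(buf, len, &b);
      if (ra != rb || (ra && a != b)) {
        printf("mismatch: len=%zu bits=0x%x\n", len, bits);
        return 1;
      }
    }
  }
  printf("the two functions agree\n");
  return 0;
}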

Credit: I am grateful to Jeroen Koekkoek from NLnet Labs for suggesting this problem. The approach described was improved based on proposals by Jean-Marc Bourguet.

Update: User “Perforated Bob” proposed a version which can be faster under some compilers:

int parse_uint8_fastswar_bob(const char *str, size_t len, uint8_t *num) {
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x303030lu;
  digits.as_int <<= (len ^ 3) * 8;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 16);
  return ((((digits.as_int + 0x767676) | digits.as_int) & 0x808080) == 0) 
   && ((len ^ 3) < 3) 
   && __builtin_bswap32(digits.as_int) <= 0x020505ff;
}


Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

69 thoughts on “Parsing 8-bit integers quickly”

    1. I think you’re right. Currently, the code doesn’t use the as_str field at all. Why do we even need the digits union? Am I missing something?

    2. Nope. The whole trick with the union here is UB. The compiler is allowed to optimize it the way it wants. The proper solution is std::bit_cast. And memcpy is a C function that is not constexpr, while std::bit_cast is, allowing the whole function to be constexpr.

        1. > The rarity of big endian is no excuse to disregard it in a supposedly high-level language

          Do you also consider systems that have less/more than 8 bits in a byte or machine word? What about systems that don’t use two’s complement?

        2. Please see the first rule on my terms of use: https://lemire.me/blog/terms-of-use/

          I copied and pasted code from your blog post and it does not compile, it contained a massive security hole, it contained a terrible bug or it crashed my computer. Can I blame you?
          No. Lemire’s rule: Blogging is literature, not engineering. Code taken from a blog post should not be expected to work, it is meant to illustrate an idea. Don’t build production systems by copying and pasting random code from the Internet. It will not end well.

          I do build a lot of code that is meant to be deployed in production systems. The code used on my blog post is not meant for that. I do not run tests on mainframe platforms, for example.

      1. Why would you reverse the string instead of changing 0x640a0100 to 0x010a6400 in big endian, or did I misunderstand the way that part works?

        Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?

        1. Why would you reverse the string instead of changing 0x640a0100 to 0x010a6400 in big endian, or did I misunderstand the way that part works?

          It is almost surely not sufficient. Can you prove that it is?

          Instead of doing a left shift by a variable amount and a right shift by a constant amount, wouldn’t it be profitable to do only a right shift by a variable amount?

          Can you share your full code? I’d be happy to benchmark it.

          1. Now that I’ve had time to look at it, the issue with simple multiplication for big endian is that there will be unwanted carry. So it doesn’t work.

            Combining the two shifts is essentially the same idea as KWillets and suffers the same issue of validation.

            On the other hand, if I’m not confused again, you can use

            uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;

            which is simpler but doesn’t reduce the depth of the chain of dependent instructions, so I fear any gain will depend on the micro-architecture.

            And you could use

            *num = (uint8_t)((UINT32_C(0x640a01) * digits.as_int) >> 24);

            which probably doesn’t change anything on 64-bit computers but needs only a 32-bit × 32-bit → 32-bit multiplication and thus could be useful on non-64-bit machines.

    1. I wondered whether and how much removing the comparison would speed up parsing. It turns out, a fair bit. On my machine (x86_64, g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) the benchmark goes from

      parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d

      to

      parse_uint8_fastswar : 1.64 GB/s 638.6 Ma/s 1.57 ns/d

      So GCC didn’t optimize it away.

  1. If I’ve got my arithmetic right (or left in this case), the lower 3 bytes of 0x640a01 * (the unshifted digits) will contain three quantities, in little-endian byte order:

    the first digit times one
    the first digit times ten plus the second digit
    the first digit times 100 plus the second digit times 10 plus the third digit.

    Pick which one to output from the length argument.

    This technique also tolerates cruft on the end, so you can unconditionally load 3-4 bytes and still pull the correct value even if bytes to the left overflow (eg “9” maps to 9/90/900, but the 900 doesn’t overflow into the part we want).

        1. I see; there are a lot of validation cycles in there that I didn’t consider.

          One thought on validation is that a well-formed ASCII integer will have the intermediate lanes bounded by 9/99/256 up to the output lane (eg “91” will have the first two lanes bounded but 910 > 256 in the third). A non-digit input should fail in one of the lanes (using 0/0/0 lower bound).

        2. It looks like parsing the way KWillets suggests then indexing the result is faster than parse_uint8_swar but slower than parse_uint8_fastswar.

          int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
            union {
              uint8_t as_str[4];
              uint32_t as_int;
            } digits;

            memcpy(&digits.as_int, str, sizeof(digits));
            digits.as_int ^= 0x30303030;
            digits.as_int *= 0x00640a01;
            *num = digits.as_str[(len & 0x3) - 1];
            return (digits.as_int & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
          }

          ---

          volume 25680 bytes
          parse_uint8_swar : 1.04 GB/s 404.3 Ma/s 2.47 ns/d
          parse_uint8_fastswar : 1.51 GB/s 586.4 Ma/s 1.71 ns/d
          parse_uint8_fastswar_tweak : 1.23 GB/s 478.7 Ma/s 2.09 ns/d
          parse_uint8_fromchars : 0.48 GB/s 187.8 Ma/s 5.32 ns/d
          parse_uint8_naive : 0.65 GB/s 252.8 Ma/s 3.96 ns/d

          If you replace indexing with a bit shift, you get performance comparable to parse_uint8_fastswar in a simpler function.

          int parse_uint8_fastswar_tweak(const char *str, size_t len, uint8_t *num) {
            uint32_t digits;

            memcpy(&digits, str, sizeof(digits));
            digits ^= 0x30303030;
            digits *= 0x00640a01;
            *num = (uint8_t)(digits >> (8 * ((len & 0x3) - 1)));
            return (digits & 0xf0f0f0f0) == 0 && len != 0 && len < 4;
          }

          ---

          volume 25680 bytes
          parse_uint8_swar : 1.04 GB/s 404.8 Ma/s 2.47 ns/d
          parse_uint8_fastswar : 1.51 GB/s 589.6 Ma/s 1.70 ns/d
          parse_uint8_fastswar_tweak : 1.50 GB/s 584.0 Ma/s 1.71 ns/d
          parse_uint8_fromchars : 0.48 GB/s 186.8 Ma/s 5.35 ns/d
          parse_uint8_naive : 0.64 GB/s 249.5 Ma/s 4.01 ns/d

  2. Are you not allowed to use a lookup table?

    if len==1, do the ASCII conversion
    else if len == 2, append two digits into a 16-bit int, subtract 0x0300 to convert the first byte and lookup in a 2560 item sparsely filled table
    else live with a very big table, or do the ASCII conversion on the first byte, multiply by 100 and add the len == 2 conversion.

  3. Hmm, doesn’t the SWAR method give the wrong return value given an input string like “12>”? The validation of only having ASCII digits seems to be missing a validation of 0x3A-0x3F.

    Also there is possibly room to be faster by shifting the result of the multiplication to the right by a variable amount (based on len), rather than shifting the masked input to the left by a variable amount; the computation of the shift amount is more likely to be able to be done in parallel with other work this way.
    Also the shift amount can be computed with len directly, rather than (len & 0x3), given that the validity of the result is indicated by the return value (i.e we don’t care about the shift amount when len>3).

  4. First, great work.

    However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior

    1. However the test on len!=0 should be at the very start, otherwise the left shift is (((4-0)*8)==32), which is undefined behavior.

      The C++14 and C17 standards make it defined behaviour. Here is the specification:

      The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 x 2^E2, reduced modulo one more than the maximum value representable in the result type.

      1. Yes, as you maybe saw on twitter, I was referring to C99 (this behaviour is the same in C89 IIRC). Anyway, better be safe, and in all cases it’ll return early if len==0, so that’s that.

  5. The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library.

    Indeed. Part of the reason is that the “naive” function has a fixed upper bound on the number of loop iterations (the `r = len & 0x3` clause). As a result (and I checked in Compiler Explorer), the compiler always unrolls the loop. std::from_chars allows an unbounded number of leading zeros, so it can’t unconditionally unroll the loop (without having a suffix afterwards, I guess).

    This doesn’t explain all of the performance difference, but, as usual, often getting performance out of code is being very precise about what the actual requirements for the inputs and outputs are.

  6. Cool post, pure fun and a few neat tricks 🙂

    Two comments:

    * If you want to make this routine “safe” (and hence the comparison really fair), it’s enough to check if the input crosses a page boundary, and take a slow path if it does (or have a SWAR version reading from “the other side”). That should be very branch-predictor safe, so hopefully wouldn’t slow things down much. Some hash functions use this approach.
    * The benchmarks give some advantage to SWAR considering you’re using a uniform distribution in range 0..255, making the average string length 2.23. There might be use cases where the expected number of digits is smaller, and then SWAR might be less beneficial or even slower.

    But I’m nitpicking here 🙂

      1. The first version of our fast JSON parser (simdjson) was page-boundary aware… but this gave us much trouble because people using sanitizers or valgrind would report bugs. I tried arguing back but it became tiresome. So the current version of simdjson uses padded strings. It works well. I have a friend who writes string functions for a runtime library, but they get their code to be excluded from valgrind checks and sanitizers… so people don’t see anything… The use case for the function in this blog post is simdzone (parsing zone files for DNS systems), in which case the author knows that the input is padded.

      2. The SWAR version is faster even in the case where we have a sequential input. So it has good chances of being faster in practice. Still people should benchmark on the data they care about!

      1. Is regular old SIMD on the menu? The byte range-checking would certainly be easier, and shuffling of quartets etc.

  7. Hi,

    Since reading up to 4 bytes from str is OK, here is another non-SWAR implementation that runs equally fast for the random and sequential cases:

    int parse_uint8_naive_md(const char *str, size_t len, uint8_t *num) {
      if (--len > 2) return 0;
      static const uint8_t sf[] = {0, 0, 1, 10, 100}; // scale factor
      uint64_t d1, d2, d3, s1, s2, s3, n;
      d1 = (uint64_t)str[0] - (uint64_t)'0';
      d2 = (uint64_t)str[1] - (uint64_t)'0';
      d3 = (uint64_t)str[2] - (uint64_t)'0';
      s1 = (uint64_t)sf[len + 2];
      s2 = (uint64_t)sf[len + 1];
      s3 = (uint64_t)sf[len];
      n = s1 * d1 + s2 * d2 + s3 * d3;
      *num = (uint8_t)n;
      return n < 256 && d1 < 10 && d2 < 10 && d3 < 10;
    }

    1. Interesting approach, but I think it does not work as is.

      What if the input is a digit followed by random bytes, and the len is set to 1? Your return value depends on str[1] and str[2], which can be garbage.

      1. Good catch! At the expense of a couple more checks, it should be fixed with:

        return n < 256 && d1<10 && (s2==0 || d2<10) && (s3 == 0 || d3<10);

        Meanwhile, for what it’s worth, the performance (of all of the implementations here) appears to be sensitive to the mixture of inputs from the [0,255] range and the [256,999] range. It must be primarily due to the short circuiting of the various terms in the return value expressions. (The quick test is to change val to uint16_t and change %256 to %1000 in the benchmarker; otherwise the inputs are all <256 and we don’t need that validation…).

        Presumably the performance will also depend on the mixture of invalid inputs as well (cases where len=0, len>3, or non digit chars).

        1. The fast function from the blog post should compile to branchless code, except for the check on the length parameter; see the assembly in the blog post. If you use lengths that are outside the allowed range ([1,3]), then the performance can vary… But otherwise, it should be flat. The function’s performance is not data dependent, as you can see in my table of results.

          Make sure you build your code in release mode with full optimizations.

          1. I agree that the fastswar performance is flat between sequential and random. I am finding that using values in [0,999] is a bit faster (about +13%, the mixture is approx 25%/75% valid/invalid), and even faster for values in [256,999] (about +18%, which is 100% invalid values). The fastswar return statement uses & and not &&, so indeed it does not short-circuit… so I’m not sure what the source of that extra performance for inputs >255 is.

        2. I don’t think that the following is sufficient because you don’t know the state of s2 and s3, at least in how I interpret the benchmark.

          return n < 256 && d1<10 && (s2==0 || d2<10) && (s3 == 0 || d3<10);

          1. If we get past the first line, len will be 0, 1 or 2 (decremented by 1 from the initial 1, 2 or 3). When len=0, s1,s2,s3=1,0,0, for len=1, s1,s2,s3=10,1,0 and for len=2, s1,s2,s3 = 100,10,1. So s1,s2,s3 are always assigned. If s2 is zero we don’t care about the value of d2 (so the || short circuits), but if s2 is nonzero then we do need to check d2. Similar for s3 & d3. It’s a bit contorted in pursuit of avoiding branches.

  8. I got nerdsniped by this a bit. I think your approach is getting close to optimal; it’s hard to beat a constant-time operation that has a small constant. Nevertheless, my stab at trying to go faster involves creating a perfect hash table, with a very fast hash function (the key is the four-byte string cast to a u32). Obviously it trades quite a bit of space for some speed. It could probably be compacted in space with a bit of tweaking. This isn’t a complete solution as it does not do all the error checking and what not, but I believe this could be made to go a bit faster than parse_uint8_fastswar, as it should have a smaller constant. To parse the ascii int, it must be padded with nuls up to the four bytes, and should be aligned properly so the casting works. Then the parse should just be something like:
    lut[simple_hash(*(unsigned*)(str))];

    Lut building code follows:

    #define SIZE 16384

    uint8_t lut[SIZE] = {};

    uint32_t simple_hash(uint32_t u32_value) {
      uint64_t hash_val = (uint64_t)u32_value * 1000000000000000000;
      hash_val = (hash_val >> 32) % SIZE;
      return (uint32_t)hash_val;
    }

    void build_lut() {
      char strings[256 * 4];
      memset(strings, 0, sizeof(strings));
      char *iter = strings;
      for (int i = 0; i < 256; ++i) {
        sprintf(iter, "%d", i);
        iter += 4;
      }
      iter = strings;
      for (int i = 0; i < 256; ++i) {
        unsigned c = *(unsigned *)iter;
        iter += 4;
        unsigned idx = simple_hash(c) % SIZE;
        lut[idx] = i;
      }
    }

  9. Wonderful article – I really enjoy this kind of work. The only bad part is that I spent a bit of time implementing this in Rust and took some time to fully understand some of your “magic”.

    One minor observation – I believe that (digits.as_int | (0x06060606 + digits.as_int)) doesn’t need the digits.as_int | component. My test suite passes fully without it.

    My Rust implementation (can’t seem to get the formatting quite right):

    fn make_u8(s: &str) -> Option<u8> {
        if s.is_empty() || s.len() > 3 {
            return None;
        }
        let bytes = s.as_bytes();

        // using a union avoids branching on the length to initialize each byte
        // of the u32 interpretation.
        let mut working = unsafe {
            #[repr(C)]
            union U {
                bytes: [u8; 4],
                num: u32,
            }
            // could use uninit here to avoid initialization...
            let mut u = U { num: 0 };
            u.bytes[..s.len()].copy_from_slice(&bytes[..s.len()]);
            u.num
        };

        working ^= 0x30303030;

        working <<= (4 - s.len()) * 8;

        // Wrapping prevents panics on overflow.
        let mult = Wrapping(0x640a01) * Wrapping(working);
        // unwrap it now (could just use .0 but this is more explicit)
        let Wrapping(mult) = mult;

        let num = (mult >> 24) as u8;

        let all_digits = (0x06060606 + working) & 0xF0F0F0F0 == 0;
        let swapped = u32::from_be_bytes(working.to_le_bytes());

        if !all_digits || swapped > 0x00020505 {
            return None;
        }

        Some(num)
    }

    1. I think the digits.as_int | part is necessary to catch bad input that contains characters with byte values 0 to 9, e.g. "12\003".

          1. @Bruce: Your tests pass without the digits.as_int | part because they don’t actually test the byte values 0xCA to 0xCF, although they seem to test values up to 0xFF.

            The reason is that a Rust str contains UTF-8 bytes, not arbitrary bytes, and e.g. the value 0xFF gets converted to the two bytes 0xC3 and 0xBF.

            I posted an issue: https://github.com/bmacnaughton/u8-swar/issues/1

          2. yes, you’re right – thank you. I ignored Unicode characters (and their encoding). I will have some additional tests to verify that (and will add the OR’d digits.as_int back in).

  10. I had a go at a simple LUT-based approach, which seems to be faster in most situations I’ve tested, with a few caveats: https://quick-bench.com/q/r0wfNuy0JI0ZWT893FoR0oJHpSc

    There’s no hash table here, I just reinterpreted the 3-byte string as the index into an 8 MB array containing every possible result. Despite needing so much memory, only small parts of the array are actually accessed in practice (the ones containing valid values), so it doesn’t waste much cache space.

    There are two versions – one where I’ve transformed the input strings in advance to pad them with nulls, so they can be directly interpreted as an integer, and one that works with the original string data by masking out the unwanted bytes. The version that uses padded strings is almost always faster by a large margin, but it’s kind of cheating so I put in the version with the mask as a more fair comparison
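    A rough reconstruction of the masked variant might look like this (an illustration only, with my own guesses for the table and entry layout; the benchmarked code is in the quick-bench link above):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    // entry layout (a guess): bit 15 = valid, bits 8-9 = length, low byte = value
    static uint16_t lut[1 << 24];

    static void build_lut(void) {
      for (int i = 0; i < 256; i++) {
        char buf[4] = {0, 0, 0, 0};
        int len = sprintf(buf, "%d", i);
        uint32_t idx;
        memcpy(&idx, buf, 4);
        lut[idx & 0xFFFFFF] = (uint16_t)(0x8000 | (len << 8) | i);
      }
    }

    static int parse_uint8_lut_masked(const char *str, size_t len, uint8_t *num) {
      static const uint32_t mask[4] = {0, 0xFF, 0xFFFF, 0xFFFFFF};
      uint32_t w;
      memcpy(&w, str, sizeof(w)); // assumes 4 readable bytes, as in the post
      w &= mask[len & 3];         // drop the bytes beyond the string
      uint16_t e = lut[w];
      *num = (uint8_t)e;
      // valid only if the entry was built from a string of exactly this length;
      // note that, unlike the SWAR routine, this sketch rejects leading zeros
      return (e >> 15) & (((e >> 8) & 3) == len);
    }

    One would call build_lut() once at startup.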

    Some things I’ve noticed:
    * The results vary quite a lot depending on whether you use or discard the return code. If you comment out ‘benchmark::DoNotOptimize(valid);’ from the benchmarks, the bit-twiddling functions get much faster
    * Clang seems to be much better at optimising this than GCC
    * The LUT-based approach seems to randomly vary in performance more than the others, which isn’t surprising since it relies much more on data remaining in the cache, and could be affected by other processes on the machine

    1. The results vary quite a lot depending on whether you use or discard the return code.

      That’s likely because your functions are inline-able. If you check my benchmark, you will notice that the functions are compiled separately. That’s by design: I do not want inlining as it changes the problem.

      Clang seems to be much better at optimising this than GCC

      On my test machine, here is what I get with GCC 12:

      
      parse_uint8_fastswar_bob                 :   1.14 GB/s  443.3 Ma/s   2.26 ns/d   3.20 GHz   7.21 c/d  33.01 i/d   2.80 c/b  12.83 i/b   4.58 i/c 
      parse_uint8_fastswar                     :   1.04 GB/s  403.3 Ma/s   2.48 ns/d   3.20 GHz   7.92 c/d  36.01 i/d   3.08 c/b  13.99 i/b   4.54 i/c
      

      And with clang 16…

      parse_uint8_fastswar_bob                 :   1.11 GB/s  430.2 Ma/s   2.32 ns/d   3.20 GHz   7.43 c/d  30.01 i/d   2.89 c/b  11.66 i/b   4.04 i/c 
      parse_uint8_fastswar                     :   1.14 GB/s  444.3 Ma/s   2.25 ns/d   3.20 GHz   7.19 c/d  32.01 i/d   2.79 c/b  12.44 i/b   4.45 i/c
      

      So it is mixed bag but I don’t see a large difference.

      1. That’s likely because your functions are inline-able.

        Ah, that explains it. Here’s the same benchmark using __attribute__((noinline)): https://quick-bench.com/q/80Z_47-AYurJITIaCzYCntYjg9A

        Also, when I run the benchmarks on my machine (linux running in WSL on a Ryzen 5800X), I get this:

        GCC 12:
        parse_uint8_fastswar_bob : 1.55 GB/s 604.2 Ma/s 1.66 ns/d
        parse_uint8_fastswar : 1.60 GB/s 625.0 Ma/s 1.60 ns/d
        parse_uint8_swar : 1.43 GB/s 559.3 Ma/s 1.79 ns/d
        parse_uint8_lut_padded : 1.71 GB/s 668.0 Ma/s 1.50 ns/d
        parse_uint8_lut_masked : 1.86 GB/s 726.0 Ma/s 1.38 ns/d
        parse_uint8_fromchars : 0.61 GB/s 237.8 Ma/s 4.21 ns/d
        parse_uint8_naive : 0.91 GB/s 355.9 Ma/s 2.81 ns/d

        Clang 15:
        parse_uint8_fastswar_bob : 1.73 GB/s 673.2 Ma/s 1.49 ns/d
        parse_uint8_fastswar : 1.70 GB/s 659.8 Ma/s 1.52 ns/d
        parse_uint8_swar : 1.23 GB/s 478.4 Ma/s 2.09 ns/d
        parse_uint8_lut_padded : 2.07 GB/s 802.9 Ma/s 1.25 ns/d
        parse_uint8_lut_masked : 1.78 GB/s 691.3 Ma/s 1.45 ns/d
        parse_uint8_fromchars : 0.72 GB/s 281.3 Ma/s 3.55 ns/d
        parse_uint8_naive : 0.96 GB/s 372.6 Ma/s 2.68 ns/d

        1. I also tried running the benchmark on my laptop (linux with an alder lake processor), and saw some enormous improvements over the SWAR approaches (with clang at least, I’m still not sure why GCC does so badly on my machines). -march=native also makes quite a difference, especially for fastswar_bob:

          GCC 12
          parse_uint8_fastswar_bob : 1.94 GB/s 755.2 Ma/s 1.32 ns/d
          parse_uint8_fastswar : 1.84 GB/s 713.9 Ma/s 1.40 ns/d
          parse_uint8_lut_padded : 1.96 GB/s 761.1 Ma/s 1.31 ns/d
          parse_uint8_lut_masked : 1.99 GB/s 773.1 Ma/s 1.29 ns/d

          GCC 12 with -march=native:
          parse_uint8_fastswar_bob : 1.90 GB/s 738.0 Ma/s 1.35 ns/d
          parse_uint8_fastswar : 1.85 GB/s 720.3 Ma/s 1.39 ns/d
          parse_uint8_lut_padded : 1.95 GB/s 757.7 Ma/s 1.32 ns/d
          parse_uint8_lut_masked : 1.99 GB/s 773.7 Ma/s 1.29 ns/d

          clang 15:
          parse_uint8_fastswar_bob : 1.85 GB/s 719.6 Ma/s 1.39 ns/d
          parse_uint8_fastswar : 2.19 GB/s 851.0 Ma/s 1.18 ns/d
          parse_uint8_lut_padded : 4.02 GB/s 1560.5 Ma/s 0.64 ns/d
          parse_uint8_lut_masked : 2.98 GB/s 1158.7 Ma/s 0.86 ns/d

          clang 15 with -march=native:
          parse_uint8_fastswar_bob : 2.36 GB/s 918.9 Ma/s 1.09 ns/d
          parse_uint8_fastswar : 2.41 GB/s 937.0 Ma/s 1.07 ns/d
          parse_uint8_lut_padded : 4.02 GB/s 1561.2 Ma/s 0.64 ns/d
          parse_uint8_lut_masked : 3.30 GB/s 1281.0 Ma/s 0.78 ns/d

          Here’s the fork I used: https://github.com/PowellNGL/Code-used-on-Daniel-Lemire-s-blog

          1. Added with credit and some small modifications. I can confirm that GCC is slower. It seems to use more instructions to get the job done when using a LUT (about 5 extra instructions).

  11. Here’s another Rust version:

    fn parse_uint8_fastswar(b: &[u8]) -> Option<u8> {
        if b.len() == 0 || b.len() > 3 { return None; }
        let p = b.as_ptr() as *const u32;
        let mut digits = unsafe { p.read_unaligned() };
        digits ^= 0x30303030;
        digits <<= (4 - b.len()) * 8;
        let num = ((digits.wrapping_mul(0x640a01)) >> 24) as u8;
        let all_digits = ((digits | (digits.wrapping_add(0x06060606))) & 0xF0F0F0F0) == 0;
        (all_digits && digits.swap_bytes() <= 0x020505).then_some(num)
    }

    According to https://godbolt.org/z/Ts8xrqnc7, the resulting assembly code is very similar to the C version:

    lea rax, [rsi - 4]
    cmp rax, -3
    jae .LBB3_2
    xor eax, eax
    ret
    .LBB3_2:
    mov eax, 808464432
    xor eax, dword ptr [rdi]
    neg sil
    shl sil, 3
    mov ecx, esi
    shl eax, cl
    imul edx, eax, 6556161
    shr edx, 24
    lea ecx, [rax + 101058054]
    or ecx, eax
    test ecx, -252645136
    sete cl
    bswap eax
    cmp eax, 132358
    setb al
    and al, cl
    ret

    I haven’t run benchmarks yet. I hope this might be slightly faster than C because Option<u8> is returned in two registers, so there’s no write through a pointer.

      1. The benchmark results are somewhat inconclusive. Depending on CPU (Intel Xeon, Apple M2, Amazon Graviton) and compiler (GCC, Clang):

        * Sometimes Rust is faster than C, sometimes slower
        * Sometimes a C function returning an option in two registers is faster than writing the result through a pointer, sometimes slower
        * Sometimes fastswar is faster than fastswar_bob, sometimes slower
        * Sometimes one of the fastswar_* versions is as fast as the lut version, but usually lut is the fastest

        See the fork at https://github.com/jcsahnwaldt/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28 for details.

  12. Here is the thing. Your code sucks!

    It just causes UB, and after that your code is not C/C++. It is written for a SUPER specific case (little-endian architecture, a specific compiler which ignores the UB caused by reading more bytes than allowed, etc.).

    And, more than that, it is absolutely unreadable. I just wrote the most stupid code I could imagine:

    static int parse_uint8_switch_case(const char *str, size_t len, uint8_t *num) {
      uint8_t hi, mid, lo;

    #define as_u8(x) ((uint8_t)((x) - '0'))

      switch (len) {
      case 1:
        *num = as_u8(str[0]);
        return *num < 10;
      case 2:
        hi = as_u8(str[0]);
        lo = as_u8(str[1]);
        *num = hi * 10 + lo;
        return (hi < 10) && (lo < 10);
      case 3:
        hi = as_u8(str[0]);
        mid = as_u8(str[1]);
        lo = as_u8(str[2]);
        *num = hi * 100 + mid * 10 + lo;
        return (hi < 10) && (mid < 10) && (lo < 10);
      default:
        return 0;
      }

    #undef as_u8
    }

    And then I just launched it here.

    The results are on the screenshot.

    So: I cannot read your code, it causes UB, and after all it is just slow)

    1. Here is the thing. Your code sucks!

      The blog post addresses your criticism, explicitly so:

      • Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer).
      • In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big-endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code in this blog post, I assume that you do not have a big-endian system, which is 99.99% certain.

      You claim that there is UB, but you don’t explain why.

      As for code readability, that’s an important issue, of course. But then the blog post is specifically about optimizing for performance.

      For your benchmark, you use sequential numbers. The blog post explains that the approach being presented is only very beneficial when the inputs are unpredictable; it says so: So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable.

      If your input is predictable, then sure, a naive implementation works. That’s already covered by the blog post.

      On my MacBook (Apple M2, LLVM14):

      volume 51473 bytes
      parse_uint8_switch_case                  :   0.76 GB/s  295.6 Ma/s   3.38 ns/d 
      parse_uint8_fastswar_bob                 :   1.81 GB/s  702.8 Ma/s   1.42 ns/d 
      parse_uint8_fastswar                     :   1.51 GB/s  585.4 Ma/s   1.71 ns/d 
      parse_uint8_lut                          :   1.81 GB/s  702.8 Ma/s   1.42 ns/d 
      parse_uint8_swar                         :   1.67 GB/s  647.8 Ma/s   1.54 ns/d 
      parse_uint8_fromchars                    :   0.39 GB/s  150.9 Ma/s   6.62 ns/d 
      parse_uint8_naive                        :   0.76 GB/s  294.7 Ma/s   3.39 ns/d 
      

      With GCC 12 on an Intel Ice Lake processor:

      parse_uint8_switch_case                  :   0.54 GB/s  211.6 Ma/s   4.73 ns/d   3.19 GHz  15.09 c/d  34.07 i/d   5.86 c/b  13.24 i/b   2.26 i/c 
      parse_uint8_fastswar_bob                 :   1.14 GB/s  443.0 Ma/s   2.26 ns/d   3.20 GHz   7.21 c/d  33.01 i/d   2.80 c/b  12.83 i/b   4.58 i/c 
      parse_uint8_fastswar                     :   1.04 GB/s  402.6 Ma/s   2.48 ns/d   3.20 GHz   7.94 c/d  36.01 i/d   3.08 c/b  13.99 i/b   4.54 i/c 
      parse_uint8_lut                          :   1.17 GB/s  456.5 Ma/s   2.19 ns/d   3.20 GHz   7.00 c/d  32.01 i/d   2.72 c/b  12.44 i/b   4.57 i/c 
      parse_uint8_swar                         :   0.87 GB/s  337.5 Ma/s   2.96 ns/d   3.20 GHz   9.47 c/d  43.01 i/d   3.68 c/b  16.71 i/b   4.54 i/c 
      parse_uint8_fromchars                    :   0.37 GB/s  143.4 Ma/s   6.97 ns/d   3.19 GHz  22.27 c/d  57.89 i/d   8.65 c/b  22.50 i/b   2.60 i/c 
      parse_uint8_naive                        :   0.54 GB/s  210.5 Ma/s   4.75 ns/d   3.19 GHz  15.17 c/d  44.95 i/d   5.90 c/b  17.46 i/b   2.96 i/c 
      

      As you can see, your code runs at about half the speed of parse_uint8_fastswar_bob and parse_uint8_lut. In fact, it has the same performance as parse_uint8_naive in my tests.

      As for the criticism that the code posted on my blog is not ready for production, please see my terms of use. It is by design. The code presented in my blog posts is not meant to be copied and pasted into your projects.

  13. We can go faster if we limit dependencies(?) The idea is to ignore trash bytes and not shift them off. Given correct input, this is considerably faster on my machine. Obviously, it lacks proper error checking, but I think it’s an interesting enough approach to get input from others(?)

    __attribute__((noinline))
    static int parse_int8(const char *str, size_t len, uint8_t *num)
    {
      const uint32_t shr = ((len << 3) - 8) & 0x18;
      uint32_t dgts;

      memcpy(&dgts, str, sizeof(dgts));
      dgts &= 0x000f0f0flu;
      *num = (uint8_t)((0x640a01 * dgts) >> shr);

      return 1;
    }

  14. This is fun!

    I sent a PR that did a pretty silly optimization, namely pre-computing the number of bits to be shifted based on the length. At least on my machine, it’s faster than everything except the one that doesn’t run the checks.
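    Something along these lines, perhaps (my guess at the idea; the actual PR may differ):

    int parse_uint8_fastswar_shift_lut(const char *str, size_t len, uint8_t *num) {
      if (len == 0 || len > 3) { return 0; }
      static const uint8_t shift_by_len[4] = {0, 24, 16, 8}; // index 0 never used
      uint32_t digits;
      memcpy(&digits, str, sizeof(digits));
      digits ^= 0x30303030;
      digits <<= shift_by_len[len & 3]; // table lookup instead of (4 - len) * 8
      uint32_t all_digits =
          ((digits | (0x06060606 + digits)) & 0xF0F0F0F0) == 0;
      *num = (uint8_t)((0x640a01 * digits) >> 24);
      return all_digits & (__builtin_bswap32(digits) <= 0x020505);
    }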
