If I give a programmer a string such as "9223372036854775808" and I ask them to convert it to an integer, they might do the following in C++:
std::string s = ....
uint64_t val;
auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), val);
if (ec != std::errc()) {
  // I have an error!
}
// val holds the value
It is very fast: you can parse a sequence of random 32-bit integers at about 40 cycles per integer, using about 128 instructions.
Can you do better?
Recent Intel processors have new instructions, AVX-512, that can process many bytes at once and do so with masking, so that you can select just a range of data.
I am going to assume that you know the beginning and the end of the sequence of digits. The following code, using AVX-512 intrinsics, does the following:
- We compute the span in bytes (digit_count),
- If we have more than 20 bytes, we know that the integer is too large to fit in a 64-bit integer,
- We compute a “mask”: a 32-bit value with only the most significant digit_count bits set to 1,
- We load the ASCII (or UTF-8) string into a 256-bit register,
- We subtract the character value ‘0’ from each byte to get values between 0 and 9 (digit values),
- We check whether some value exceeds 9, in which case we had a non-digit character.
size_t digit_count = size_t(end - start);
// if (digit_count > 20) { error ....}
const __m256i ASCII_ZERO = _mm256_set1_epi8('0');
const __m256i NINE = _mm256_set1_epi8(9);
// 32-bit mask with only the most significant digit_count bits set
uint32_t mask = uint32_t(0xFFFFFFFF) << (start - end + 32);
// load the 32 bytes ending at 'end'; masked-off bytes are zeroed
auto in = _mm256_maskz_loadu_epi8(mask, end - 32);
// subtract '0' to turn ASCII digits into values 0 to 9
auto base10_8bit = _mm256_maskz_sub_epi8(mask, in, ASCII_ZERO);
// any value above 9 indicates a non-digit character
auto nondigits = _mm256_mask_cmpgt_epu8_mask(mask, base10_8bit, NINE);
if (nondigits) {
  // there is a non-digit
}
This is the key step that uses the new functionality of AVX-512. Afterward, we can use ‘old-school’ processing familiar to folks who use advanced Intel intrinsics on conventional x64 processors… Mostly, we just multiply by 10, by 100, and by 10000 to create four 32-bit values: the first one corresponds to the least significant 8 ASCII bytes, the second one to the next most significant 8 ASCII bytes, and the third one to the (up to) 4 most significant bytes. When the number has 8 digits or fewer, only one of these words is relevant, and when it has 16 or fewer, only the first two are significant. We always waste one 32-bit value that is made of zeroes. The code might look as follows:
auto DIGIT_VALUE_BASE10_8BIT = _mm256_set_epi8(
    1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10,
    1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10);
auto DIGIT_VALUE_BASE10E2_8BIT = _mm_set_epi8(
    1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100);
auto DIGIT_VALUE_BASE10E4_16BIT = _mm_set_epi16(
    1, 10000, 1, 10000, 1, 10000, 1, 10000);
// pair up adjacent digits: higher digit * 10 + lower digit (values 0..99 in 16-bit lanes)
auto base10e2_16bit = _mm256_maddubs_epi16(base10_8bit, DIGIT_VALUE_BASE10_8BIT);
// narrow back to bytes: each 2-digit value fits in one byte
auto base10e2_8bit = _mm256_cvtepi16_epi8(base10e2_16bit);
// pair up adjacent 2-digit chunks: higher chunk * 100 + lower chunk (values 0..9999)
auto base10e4_16bit = _mm_maddubs_epi16(base10e2_8bit, DIGIT_VALUE_BASE10E2_8BIT);
// pair up adjacent 4-digit chunks: higher chunk * 10000 + lower chunk (values 0..99999999)
auto base10e8_32bit = _mm_madd_epi16(base10e4_16bit, DIGIT_VALUE_BASE10E4_16BIT);
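The code above stops with four 32-bit values packed in base10e8_32bit. The final combination into a 64-bit integer is not shown in this post, so here is a minimal sketch of one way to do it, assuming the lane layout produced above (lane 3 holds the least significant 8 digits, lane 2 the next 8, lane 1 the leading digits, lane 0 is zero); the names low8, middle8 and high4 are mine, not from the original code:

// Hypothetical final step: weight the three meaningful 32-bit lanes by
// powers of ten and sum them into a 64-bit result.
uint64_t low8    = uint32_t(_mm_extract_epi32(base10e8_32bit, 3)); // least significant 8 digits
uint64_t middle8 = uint32_t(_mm_extract_epi32(base10e8_32bit, 2)); // next 8 digits
uint64_t high4   = uint32_t(_mm_extract_epi32(base10e8_32bit, 1)); // up to 4 leading digits
uint64_t val = high4 * 10000000000000000ULL + middle8 * 100000000ULL + low8;
// Caution: 20-digit inputs can exceed UINT64_MAX, so an overflow check is
// still needed for the largest values.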
I have implemented this function in C++ and compiled it using GCC 12. I ran the benchmark on an Ice Lake server, using random 32-bit integers for testing. The AVX-512 function is over twice as fast as the standard approach.
function | speed | instructions/number | cycles/number |
AVX-512 | 1.8 GB/s | 57 | 17 |
std::from_chars | 0.8 GB/s | 128 | 39 |
The comparison is not entirely fair at this point because the AVX-512 function assumes that it knows the start and the end of the sequence of digits.
You can improve the performance by using an inline function, which brings it up to 2.3 GB/s, roughly a 30% gain. However, that assumes that you are parsing the numbers in sequence within a loop.
My original code returned fancy std::optional values, but GCC's performance suffered, so I changed my function signatures to be more conventional. Even so, LLVM/clang is slightly faster than GCC in my tests.
Credit: The original code and the problem were suggested to me by John Keiser. My code is largely derivative of his code.
Oh nice, thanks for fleshing this one out!
I think it’s important to note that we’re not doing traditional parsing (at least as I’ve seen it), which involves parsing from left to right (most significant to least). The thing that makes this work is that we “right-justify” the number in the SIMD register (put the least significant digit in the most significant byte of the SIMD register).
This contrasts with traditional left-to-right number parsers: by knowing where the least significant digit of the number is, we know exactly how much each digit is worth (the rightmost digit is worth digit × 10^0, the penultimate is worth digit × 10^1, etc.). We no longer need to do the conventional “read the next digit, multiply the total by 10, read the next digit and add it in,” which creates a data dependency from digit to digit.
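For illustration, here is a minimal sketch of that conventional serial loop (using the start/end pointers from the post above; not the code that was benchmarked):

// Conventional left-to-right parsing: each iteration needs the previous
// total, so the multiply-add chain is strictly serial.
uint64_t total = 0;
for (const char* p = start; p != end; p++) {
  total = total * 10 + uint64_t(*p - '0');
}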
Because we support up to 20-digit numbers, we start with a 32-byte register (though we quickly shrink to a 16-byte register in step 2, since 2 digits can still fit in one byte):
String: 1234567890
SIMD step 1 (8-bit x 32): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 0
SIMD step 2 (8-bit x 16): 00 00 00 00 00 00 00 00 00 00 00 12 34 56 78 90
SIMD step 3 (16-bit x 8): 0000 0000 0000 0000 0000 0012 3456 7890
SIMD step 4 (32-bit x 4): 00000000 00000000 00000012 34567890
And after that, we manually extract the 32-bit numbers, multiply them by 10^16, 10^8 and 10^0, and add them to get the 64-bit result (in this case, 1234567890).
One interesting bit to me: with AVX-512, we could also parse 2 numbers simultaneously without modifying the algorithm. (We could parse many more if we knew they were all 8 bytes or less!) If we want 32-bit numbers, we could parse 4 at a time!
Note that you could do this without knowing the full size of the number: load the 32 bytes (this assumes padding), look for non-digits, use that to find the end, and then “right-justify” all the digits with a byte shift/shuffle.
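A minimal sketch of that end-detection idea, assuming the buffer is padded so that 32 bytes starting at start are readable (the right-justifying shuffle is left out):

// Hypothetical sketch: find where the digits end within a 32-byte window.
__m256i in = _mm256_loadu_si256((const __m256i*)start);
__m256i digits = _mm256_sub_epi8(in, _mm256_set1_epi8('0'));
__mmask32 nondigit = _mm256_cmpgt_epu8_mask(digits, _mm256_set1_epi8(9));
size_t digit_count = nondigit ? _tzcnt_u32(nondigit) : 32;
// a byte shift/shuffle can then move these digit_count bytes to the top of
// the register ("right-justify") before applying the steps above.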
A commenter on another blog suggests using vpshldvw then vpdpbusd to do the first 2 multiply-and-add steps at once by computing 4a, 4b, c, d then mul-adding with 250, 25, 10, 1.
I have various thoughts about this problem from https://highload.fun/tasks/1 but that seems to be a very different problem (it is a lot about detecting the edge of the number and zeroing out the rest of the register, or finding a way to shuffle certain digits from several numbers of unknown lengths into fixed locations in multiple registers).
The blogger here also discusses this problem, and it is pretty similar to your discussion:
https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html#comment-6285552965
Please look in particular at the comment by sharp-o, which describes how to use VNNI to simplify the byte-to-32-bit reduction, after which you only need a regular multiply and a horizontal add.
A horizontal add of u32 values can be done using _mm_mpsadbw_epu8 in a newer method you have probably heard about, but I can’t find a link.
This horizontal add technique sounds quite interesting!
Thanks a lot, this gives awesome performance. I didn’t need the non-digit check, so I skipped it.