Parsing integers quickly with AVX-512

If I give a programmer a string such as "9223372036854775808" and I ask them to convert it to an integer, they might do the following in C++:

#include <charconv>
#include <cstdint>
#include <string>

std::string s = ...; // the input string
uint64_t val;
auto [ptr, ec] =
    std::from_chars(s.data(), s.data() + s.size(), val);
if (ec != std::errc()) {} // I have an error!
// val holds the value

It is very fast: you can parse a sequence of random 32-bit integers at about 40 cycles per integer, using about 128 instructions.

Can you do better?

The recent Intel processors have new instructions, AVX-512, that can process multiple bytes at once and do it with masking, so that you can select just a range of data.

I am going to assume that you know the beginning and the end of the sequence of digits. The following code, using AVX-512 intrinsics, does the following:

  1. Computes the span in bytes (digit_count),
  2. If we have more than 20 bytes, we know that the integer is too large to fit in a 64-bit integer,
  3. We compute a “mask”: a 32-bit value with only the most significant digit_count bits set to 1,
  4. We load the ASCII (or UTF-8) bytes in a 256-bit register,
  5. We subtract the character value ‘0’ from each byte to get values between 0 and 9 (digit values),
  6. We check whether some value exceeds 9, in which case we had a non-digit character.
size_t digit_count = size_t(end - start);
// if (digit_count > 20) { error ....}
const __m256i ASCII_ZERO = _mm256_set1_epi8('0');
const __m256i NINE = _mm256_set1_epi8(9);
// set the most significant digit_count bits of a 32-bit mask
uint32_t mask = uint32_t(0xFFFFFFFF) << (start - end + 32);
auto in = _mm256_maskz_loadu_epi8(mask, end - 32);
auto base10_8bit = _mm256_maskz_sub_epi8(mask, in, ASCII_ZERO);
auto nondigits = _mm256_mask_cmpgt_epu8_mask(mask, base10_8bit, NINE);
if (nondigits) {
  // there is a non-digit
}

This is the key step that uses the functionality of AVX-512. Afterward, we can use ‘old-school’ processing, familiar to folks who work with advanced Intel intrinsics on conventional x64 processors… Mostly, we just multiply by 10, by 100, by 10000 to create four 32-bit values: the first one corresponds to the least significant 8 ASCII bytes, the second one to the next most significant 8 ASCII bytes, and the third one to the up to 4 most significant bytes. When the number has 8 digits or fewer, only one of these words is relevant, and when there are 16 or fewer, only the first two are significant. We always waste one 32-bit value that is made of zeroes. The code might look as follows:

auto DIGIT_VALUE_BASE10_8BIT =
    _mm256_set_epi8(1, 10, 1, 10, 1, 10, 1, 10,
                    1, 10, 1, 10, 1, 10, 1, 10,
                    1, 10, 1, 10, 1, 10, 1, 10,
                    1, 10, 1, 10, 1, 10, 1, 10);
auto DIGIT_VALUE_BASE10E2_8BIT = _mm_set_epi8(
    1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100);
auto DIGIT_VALUE_BASE10E4_16BIT =
    _mm_set_epi16(1, 10000, 1, 10000, 1, 10000, 1, 10000);
auto base10e2_16bit =
    _mm256_maddubs_epi16(base10_8bit, DIGIT_VALUE_BASE10_8BIT);
auto base10e2_8bit = _mm256_cvtepi16_epi8(base10e2_16bit);
auto base10e4_16bit =
    _mm_maddubs_epi16(base10e2_8bit, DIGIT_VALUE_BASE10E2_8BIT);
auto base10e8_32bit =
    _mm_madd_epi16(base10e4_16bit, DIGIT_VALUE_BASE10E4_16BIT);
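
To make the reduction concrete, here is a scalar sketch (my own illustration, not the benchmarked code) of what the multiply-add chain computes on eight digit values, most significant digit first:

#include <stdint.h>

// Pairwise reduction: digit pairs, then 2-digit pairs, then 4-digit pairs.
uint32_t combine_eight_digits(const uint8_t d[8]) {
  uint32_t p0 = 10 * d[0] + d[1]; // values in [0, 99]
  uint32_t p1 = 10 * d[2] + d[3];
  uint32_t p2 = 10 * d[4] + d[5];
  uint32_t p3 = 10 * d[6] + d[7];
  uint32_t q0 = 100 * p0 + p1;    // values in [0, 9999]
  uint32_t q1 = 100 * p2 + p3;
  return 10000 * q0 + q1;         // the full 8-digit value
}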

I have implemented this function in C++ and compiled it with GCC 12. I run the benchmark on an Ice Lake server, using random 32-bit integers for testing. The AVX-512 function is over twice as fast as the standard approach.

                 speed     instructions/number  cycles/number
AVX-512          1.8 GB/s  57                   17
std::from_chars  0.8 GB/s  128                  39

The comparison is not entirely fair because the AVX-512 function assumes that it knows the start and the end of the sequence of digits.

You can boost the performance by using an inline function, which brings it up to 2.3 GB/s, a roughly 30% performance boost. However, that assumes that you are parsing the numbers in sequence within a loop.

My original code returned fancy std::optional values, but GCC was negatively affected, so I changed my function signatures to be more conventional. Even so, LLVM/Clang is slightly faster than GCC in my tests.

Credit: The original code and the problem were suggested to me by John Keiser. My code is largely derivative of his code.

Transcoding Unicode strings at crazy speeds with AVX-512

In software, we store strings of text in memory as arrays of bytes, using one of the Unicode Transformation Formats (UTF), the most popular being UTF-8 and UTF-16. Windows, Java, C# and other common systems and languages default to UTF-16, whereas other systems and most of the web rely on UTF-8. There are benefits to both formats, and we are not going to adopt one over the other any time soon. It means that we must constantly convert our strings from UTF-8 to UTF-16 and back.

It is so important that IBM mainframes based on z/Architecture provide special-purpose instructions named “CONVERT UTF-8 TO UTF-16” and “CONVERT UTF-16 TO UTF-8” for translation between the two encodings. By virtue of being implemented in hardware, these exceed 10 GiB/s of processing speed for typical inputs. The rest of us do not have access to powerful mainframes, but our processors have single-instruction-multiple-data (SIMD) instructions. These SIMD instructions operate on larger registers (128 bits, 256 bits, 512 bits) representing vectors of numbers. Starting with the AMD Zen 4 family of processors, and with recent server Intel processors (i.e., Ice Lake or better), we have powerful SIMD instructions that can operate over 512 bits of data (AVX-512). The width (512 bits) is not the most interesting thing about these instructions: they are also distinctly more powerful than prior SIMD instruction sets.

We have a software library for this purpose called simdutf. The library automatically detects your processor and selects a ‘kernel’ of functions that is most appropriate. It supports a wide range of instruction sets (ARM NEON, SSE2, AVX2), but until last year, it did not support AVX-512.
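
Using the library requires no knowledge of the underlying instruction sets. Here is a sketch of a UTF-8 to UTF-16LE conversion, with function names taken from my reading of the simdutf documentation:

#include "simdutf.h"
#include <string>

// Sketch: convert a UTF-8 string to UTF-16LE using simdutf.
std::u16string to_utf16(const std::string &utf8) {
  size_t expected_words =
      simdutf::utf16_length_from_utf8(utf8.data(), utf8.size());
  std::u16string utf16(expected_words, u'\0');
  size_t words = simdutf::convert_utf8_to_utf16le(utf8.data(), utf8.size(),
                                                  utf16.data());
  if (words == 0) { /* the input was not valid UTF-8 */ }
  utf16.resize(words);
  return utf16;
}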

So, how quickly can you convert (or ‘transcode’) strings using AVX-512? It is the subject of a new paper: Transcoding unicode characters with AVX-512 instructions (Software: Practice and Experience, to appear). A reference point is the popular ICU library that is used almost universally for this task. It is mature and well optimized.

It is fairly easy to optimize transcoding performance for simple cases (e.g., English), but much more challenging for languages such as Arabic… and the most difficult case is probably streams of Emojis. The following graphic gives our performance in gigabytes of input per second, comparing against our fast AVX2 kernel and ICU for difficult cases: plot A is for UTF-8 to UTF-16 transcoding and plot B is for UTF-16 to UTF-8 transcoding. In these experiments, we use an Ice Lake processor and the Clang 14 C++ compiler from LLVM.

[Graphic: transcoding speed in GB/s on difficult inputs; plot A: UTF-8 to UTF-16, plot B: UTF-16 to UTF-8]

Using AVX-512 on recent Intel and AMD processors, we exceed 4 GB/s on Chinese and Emoji inputs and almost reach 8 GB/s on Arabic text when transcoding from UTF-8. When transcoding from UTF-16, we exceed 5 GB/s on Chinese and Emoji text and break the 20 GB/s barrier on Arabic text.

I also like to report the speed in ‘gigacharacters per second’: billions of characters processed per second. It is less impressive than gigabytes per second because characters can span multiple bytes, but with gigacharacters per second, we can directly compare the UTF-8 to UTF-16 transcoding with the UTF-16 to UTF-8 transcoding. We find it much easier to transcode from UTF-16, since it is a simpler format, than from UTF-8. The benefits compared to ICU depend on the data source, but our simdutf library can be several times faster, as the following tables show.

UTF-8 to UTF-16 (gigacharacters per second):

           ICU   AVX-512
Arabic     0.80  4.3
Chinese    0.50  1.8
Emoji      0.22  1.0
Hebrew     0.80  4.3
Hindi      0.43  1.7
Japanese   0.51  1.7
Korean     0.62  1.8
Latin      1.5   20.
Russian    0.46  4.2

UTF-16 to UTF-8 (gigacharacters per second):

           ICU   AVX-512
Arabic     0.67  11.
Chinese    0.36  3.9
Emoji      0.27  1.6
Hebrew     0.68  11.
Hindi      0.21  3.8
Japanese   0.37  3.8
Korean     0.37  3.8
Latin      0.91  20.
Russian    0.23  11.

Older Intel processors that supported AVX-512 raised serious concerns due to various forms of ‘frequency throttling’: whenever you used these instructions, they could trigger a lowering of the frequency. For this reason, we do not use AVX-512 instructions on these older processors, preferring to fall back on AVX instructions. There is no frequency throttling due to AVX-512 instructions on AMD Zen 4 or recent Intel processors.

Our functions have already been used in production systems for nearly a year as part of the simdutf library. The library includes an extensive collection of tests and benchmarks. It is part of several important projects such as Node.js, Bun and Oracle graaljs, and it is used by Couchbase and others. The simdutf library is a joint community effort, with multiple important contributors (e.g., Wojciech Muła, Yagiz Nizipli and so forth).

We are currently in the process of adding Latin 1 support to the library. In the future, I hope that we might add SVE/SVE2 support for the new generation of ARM processors.

Credit: The algorithmic design of the new AVX-512 functions should be credited to the always brilliant Robert Clausecker.

Locating ‘identifiers’ quickly (ARM NEON edition)

A common problem in parsing is that you want to find all identifiers (e.g., variable names, function names) in a document quickly. There are typically some fixed rules. For example, it is common to allow ASCII letters and digits as well as characters like ‘_’ in the identifier, but to forbid some characters at the beginning of the identifier (such as digits). E.g., ab123 is an identifier but 123ab might not be.

An efficient way to proceed is to use a 256-element table where allowed leading characters have a set value (say 255), non-identifier characters have the value 0, and all other characters have a non-zero, non-255 value.
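
For instance (a sketch under the assumption that identifiers are made of ASCII letters, digits and ‘_’, with digits forbidden in the leading position), the table might be built like so:

#include <stdint.h>

// Hypothetical initialization: 255 marks characters allowed to lead an
// identifier, 1 marks characters allowed only inside one, 0 everything else.
static uint8_t identifier_map[256];

static void init_identifier_map(void) {
  for (int c = 'a'; c <= 'z'; c++) { identifier_map[c] = 255; }
  for (int c = 'A'; c <= 'Z'; c++) { identifier_map[c] = 255; }
  identifier_map['_'] = 255;
  for (int c = '0'; c <= '9'; c++) { identifier_map[c] = 1; } // cannot lead
}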

In C, you might be able to count identifiers using the following routine:

while (source < end) {
  uint8_t c = identifier_map[*source++];
  if(c) {
    count += (c == 255);
    while (source < end && identifier_map[*source]) {
      source++;
    }
  }
}

Can you do better?

Suppose that you have an ARM-based machine (such as a recent MacBook). Then you have access to Single instruction, multiple data (SIMD) instructions which can process up to 16 bytes at a time. ARM has several types of SIMD instructions, but ARM NEON is the most well known.

We can apply vectorized classification to the problem (see Parsing Gigabytes of JSON per Second, The VLDB Journal, 28(6), 2019). I am not going to review the technique in detail, but the gist of it is that we replace the table lookup with a vectorized (or SIMD-based) table lookup. The relevant component of the code, using ARM intrinsic functions, looks as follows:

uint8x16_t low_nibble_mask = (uint8x16_t){
    42, 62, 62, 62, 62, 62, 62, 62, 62, 62, 60, 20, 20, 20, 20, 21};
uint8x16_t high_nibble_mask =
    (uint8x16_t){0, 0, 0, 2, 16, 33, 4, 8, 0, 0, 0, 0, 0, 0, 0, 0};
uint8x16_t chars = vld1q_u8((const uint8_t *)source);
uint8x16_t low =
    vqtbl1q_u8(low_nibble_mask, vandq_u8(chars, vdupq_n_u8(0xf)));
uint8x16_t high = vqtbl1q_u8(high_nibble_mask, vshrq_n_u8(chars, 4));
uint8x16_t v = vtstq_u8(low, high);
uint8x16_t v_no_number = vtstq_u8(low, vandq_u8(high, vdupq_n_u8(61)));

The register v contains a non-zero byte value wherever there is an identifier character, and the register v_no_number contains a non-zero byte value wherever there is a non-digit identifier character. We can map these SIMD registers to bitmaps from which we can identify the locations of the identifiers using regular C code.
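
Mapping a NEON register to a bitmap takes a short, well-known instruction sequence, since NEON lacks a movemask instruction. Here is a sketch (my own addition) that assumes the input bytes are 0xFF or 0x00, as vtstq_u8 produces:

#include <arm_neon.h>
#include <stdint.h>

// Reduce a vector of 0xFF/0x00 bytes to a 16-bit bitmap, one bit per byte.
static inline uint16_t neon_movemask(uint8x16_t v) {
  const uint8x16_t bit_mask = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                               0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
  uint8x16_t minput = vandq_u8(v, bit_mask);   // one distinct bit per lane
  uint8x16_t tmp = vpaddq_u8(minput, minput);  // horizontal pairwise sums
  tmp = vpaddq_u8(tmp, tmp);
  tmp = vpaddq_u8(tmp, tmp);
  return vgetq_lane_u16(vreinterpretq_u16_u8(tmp), 0);
}

The 16-bit masks used below can be produced in this manner from registers such as v and v_no_number.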

We need a few logical operations and bit shifts. Counting the number of identifiers is then as simple as the following:

while (source <= end16) {
  auto m = identifier_neon(source);
  uint16_t mask =
   m.leading_identifier_mask & ~(m.identifier_mask << 1 | lastbit);
  count += popcount(mask);
  lastbit = m.identifier_mask >> 15;
  source += 16;
}

ARM NEON is not particularly powerful, at least not compared with AVX-512 on Intel Ice Lake or AMD Zen 4 processors, but it is still quite good by historical standards, and the Apple implementation is fine. You can extend this approach to process 64 bytes at a time.

So how does the SIMD approach compare to the conventional approach? I am using a script that generates a large document. We want to count the number of identifiers as quickly as possible. The file is 10 MB and it contains half a million identifiers.

                                speed     time
SIMD (simdjson-like, 16 bytes)  10 GB/s   2 ns/identifier
SIMD (simdjson-like, 64 bytes)  17 GB/s   1.1 ns/identifier
conventional (table-based)      0.5 GB/s  25 ns/identifier

So the SIMD-based approach is over 10 times faster in this instance. My Apple M2 processor runs at 3.6 GHz. Using the SIMD-based code, it retires over 5 instructions per cycle. The conventional table-based approach retires far fewer instructions, and it is bound by load latencies.

For context, my MacBook has a disk that can read at well over 2 GB/s. So the conventional routine is not fast.

On AVX-512, it should be possible to reach even higher speeds.

My source code is available.

Credit: This problem was suggested to me by Kevin Newton (Shopify). He provided the test file. The 64-byte implementation was provided by reader Perforated Blob.

Science and Technology links (September 2 2023)

  1. Physicists have published a paper with 5154 authors. The list of authors takes up 24 of the paper’s 33 pages. The lesson is that if someone tells you that they have published an important paper, you should ask how many authors there were and what their exact role was.
  2. Vegetarians are at higher risk of hip fracture.
  3. Virgin Galactic brings private customers to space.
  4. Quercetin, a commonly available supplement, appears to reduce blood pressure and improve the lipid profile in human beings. (Quercetin is relatively expensive in Canada.)
  5. Scientists increasingly talk with certitude: they increasingly avoid words such as might or probably. It might be part of an effort to promote the research. In effect, science writing might be adapting strategies from marketing.
  6. I have long accepted the fact that if you encourage students to think that they can grow their abilities by hard work, they will get better results. Macnamara and Burgoyne find that this result may not hold:

    We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.

  7. Heterochronic parabiosis is the process by which you connect the blood vessels of old and young animals (typically mice). Researchers ran a heterochronic parabiosis experiment over a long period of time (3 months) and found that the old mice that had benefited from heterochronic parabiosis lived longer than control old mice. Furthermore, the treated old mice appeared rejuvenated, and the rejuvenation was still visible two months after the treatment. It seems that heterochronic parabiosis made the old mice younger at the level of gene expression. The approach is unlikely to be used on human beings as is. You would not want to connect the blood vessels of young and old human beings for extended periods of time. However, it is a proof of principle. It seems that you can rejuvenate old animals by acting on the composition of their blood.
  8. Researchers used young blood plasma from pigs to seemingly rejuvenate old rats.
  9. A specific compound, AOH1996, appears to selectively kill cancer cells without side effects:

    We report a small molecule inhibitor, AOH1996, of a cancer-associated isoform of PCNA (caPCNA), which notably, almost completely inhibits the growth of xenograft tumors without causing any discernible toxicity to experimental animals. (Gu et al., 2023)

  10. Old people tend to become frail, a process called sarcopenia. Though the most obvious effect is a loss of muscle mass, muscle quality also decreases. It is possible that the loss of muscle is a consequence of a loss of nerves in the muscles. Tezze et al. (2023) find that metformin and galantamine treat sarcopenia in old mice. Galantamine has been used to treat some of the side effects of Alzheimer’s, whereas metformin is routinely used to treat diabetes.
  11. It seems that a significant fraction (e.g., 40%) of the measured warming in recent decades could be explained by the urban effect: thermometers that were once located in the countryside become surrounded by a warm city.
  12. Naked mole rats are nearly ageless: they do not age the way we do. Researchers have transferred a gene from naked mole rats to mice and found that they were able to increase the longevity of the mice.
  13. The placebo effect is the phenomenon where people who receive a bogus treatment get better, despite the lack of an actual drug. Two doctors (Hróbjartsson and Gøtzsche) suggest that there is not much of a placebo effect outside of a small effect in the context of pain management.
  14. The number of natural disasters is not increasing.
  15. Used coffee grounds make strong concrete.
  16. China restricts how long young people can play video games. Their policies appear to be ineffectual: young people in China still play a lot of video games.
  17. Aging is conserved across mammalian species at select regions of the DNA. Thus if we rejuvenate some animals at the gene expression level, it is credible that it could extend to other animals.
  18. Psychiatrists choosing a treatment for themselves predominantly selected a different treatment than the one recommended to patients by participants asked in the regular recommendation role. Psychiatrists preferred the less invasive options for themselves (i.e., watchful waiting) whereas they recommended the more invasive ones for the patient (i.e., antidepressants and depot injections).

Transcoding Latin 1 strings to UTF-8 strings at 18 GB/s using AVX-512

Though most strings online today follow the Unicode standard (e.g., using UTF-8), the Latin 1 standard is still in widespread use inside some systems (such as browsers), as JavaScript strings are often stored internally as either Latin 1, UTF-8 or UTF-16. Latin 1 captures the first 256 characters from the Unicode standard and represents each of them as a single byte.

In a previous blog post, we examined how we could convert UTF-8 strings to Latin 1. I want to examine the reverse problem, the transcoding of Latin 1 strings to UTF-8. We receive a byte array, and each byte is one character. There is no need for any validation. However, some characters (the ASCII characters) are mapped to one byte whereas all others should be mapped to two bytes.

You can code it in C using code such as this routine:

 unsigned char byte = data[pos];
if ((byte & 0x80) == 0) { // if ASCII
  // will generate one UTF-8 byte
  *utf8_output++ = (char)(byte);
  pos++;
} else {
  // will generate two UTF-8 bytes
  *utf8_output++ = (char)((byte >> 6) | 0b11000000);
  *utf8_output++ = (char)((byte & 0b111111) | 0b10000000);
  pos++;
}

Can we do better using the new AVX-512 instructions available on recent Intel (e.g. Ice Lake) and AMD (e.g., Zen4) processors?

We can try to do it as follows:

  1. Load 32 bytes into a register.
  2. Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  3. Identify the bytes that have their two most significant bits set by comparing with 192.
  4. Cast all bytes to 16-bit words into a 64-byte register.
  5. Add 0xc200 to all 16-bit words, as the most significant byte in a 16-bit word must be either 0xc2 or 0xc3.
  6. Add 0x00c0 to the 16-bit words where the corresponding byte had its two most significant bits set: the addition carries into the most significant byte, which becomes 0xc3.
  7. Flip the order of the bytes within each 16-bit word since UTF-8 is big endian.
  8. Remove one byte (‘compress’) where we had an ASCII byte.
  9. Store the result (can be between 32 bytes and 64 bytes).

Using Intel intrinsics, the result might look as follows:

__m256i input = _mm256_loadu_si256((__m256i *)(buf + pos));
__mmask32 nonascii = _mm256_movepi8_mask(input);
int output_size = 32 + _popcnt32(nonascii);
uint64_t compmask = ~_pdep_u64(~nonascii, 0x5555555555555555);
__mmask32 sixth =
_mm256_cmpge_epu8_mask(input, _mm256_set1_epi8(192));
__m512i input16 = _mm512_cvtepu8_epi16(input);
input16 =_mm512_add_epi16(input16, _mm512_set1_epi16(0xc200));
__m512i output16 =
_mm512_mask_add_epi16(input16, sixth, input16,
_mm512_set1_epi16(0x00c0));
output16 = _mm512_shuffle_epi8(output16, byteflip);
__m512i output = _mm512_maskz_compress_epi8(compmask, output16);

Our AVX-512 registers can load and process 64 bytes at a time, but this approach consumes only 32 input bytes (instead of 64 bytes). An anonymous reader has contributed a better approach:

  • Load 64 bytes into a register.
  • Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  • Identify the bytes that have their two most significant bits set by comparing with 192.
  • Create a ‘compression’ mask where ASCII characters correspond to 01 whereas non-ASCII characters correspond to 11: we need two distinct 64-bit masks.
  • We use the vpshldw instruction (instead of the vpmovzxbw instruction) to upscale the bytes to 16-bit values, adding the 0b11000000 leading byte in the process. We adjust for the bytes that have their two most significant bits set. This takes care of the first 32 bytes, assuming we interleaved the bytes.
  • The second part (the next 32 bytes) is handled with bitwise logical operations.
  • We save two blocks of 32 bytes.

The code using AVX-512 intrinsics might look as follows:

__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 nonascii = _mm512_movepi8_mask(input);
__mmask64 sixth =
_mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
const uint64_t alternate_bits = UINT64_C(0x5555555555555555);
uint64_t ascii = ~nonascii;
uint64_t maskA = ~_pdep_u64(ascii, alternate_bits);
uint64_t maskB = ~_pdep_u64(ascii>>32, alternate_bits);
// interleave bytes from top and bottom halves (abcd...ABCD -> aAbBcCdD)
__m512i input_interleaved = _mm512_permutexvar_epi8(_mm512_set_epi32(
0x3f1f3e1e, 0x3d1d3c1c, 0x3b1b3a1a, 0x39193818,
0x37173616, 0x35153414, 0x33133212, 0x31113010,
0x2f0f2e0e, 0x2d0d2c0c, 0x2b0b2a0a, 0x29092808,
0x27072606, 0x25052404, 0x23032202, 0x21012000
), input);
// double size of each byte, and insert the leading byte
__m512i outputA = _mm512_shldi_epi16(input_interleaved, _mm512_set1_epi8(-62), 8);
outputA = _mm512_mask_add_epi16(outputA, (__mmask32)sixth, outputA, _mm512_set1_epi16(1 - 0x4000));
__m512i leadingB = _mm512_mask_blend_epi16((__mmask32)(sixth>>32), _mm512_set1_epi16(0x00c2), _mm512_set1_epi16(0x40c3));
__m512i outputB = _mm512_ternarylogic_epi32(input_interleaved, leadingB, _mm512_set1_epi16((short)0xff00), (240 & 170) ^ 204); // (input_interleaved & 0xff00) ^ leadingB
// prune redundant bytes
outputA = _mm512_maskz_compress_epi8(maskA, outputA);
outputB = _mm512_maskz_compress_epi8(maskB, outputB);

It is possible to do even better if you expect the input to often be ASCII or to contain few non-ASCII bytes. You can branch on the case where a sequence of 64 bytes is all ASCII and use a fast path: trivially, we just store the 64 bytes we just read, as-is.
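
A sketch of such a fast path (my own illustration, reusing the names from the code above):

// Inside the transcoding loop: a 64-byte ASCII fast path.
__m512i input = _mm512_loadu_si512((const __m512i *)(buf + pos));
if (_mm512_movepi8_mask(input) == 0) { // no most significant bit set: all ASCII
  _mm512_storeu_si512((__m512i *)utf8_output, input); // copy the bytes as-is
  utf8_output += 64;
  pos += 64;
  continue; // move on to the next 64 bytes
}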

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars Wikipedia entry.

technique                  CPU cycles/byte  instructions/byte
conventional               1.7              5.0
AVX-512 (32 bytes)         0.26             0.77
AVX-512 (64 bytes)         0.21             0.54
AVX-512 (64 bytes+branch)  0.17             0.45

On a 3.2 GHz processor, the 32-byte AVX-512 routine reaches 12 GB/s. It is nearly seven times faster than the conventional routine, and it uses six times fewer instructions. The approach consuming 64 bytes at a time exceeds 14 GB/s, and the version with an ASCII branch breaks the 18 GB/s barrier.

This server has memory bandwidth of at least 15 GB/s, but the CPU cache bandwidth is several times higher. There are disks with bandwidths of over 10 GB/s.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

How accurate is the birthday’s paradox formula?

Given a set of r random values from a large set (of size N), I have been using the formula 1-exp(-r**2/(2N)) to approximate the probability of a collision. It assumes that r is much smaller than N. The formula suggests that if you have hundreds of millions of random 64-bit numbers, you will start getting collisions with non-trivial probabilities, meaning that at least two values will be equal. At the one billion range, the probability of a collision is about 3% according to this formula: with r = 10**9 and N = 2**64, we have r**2/(2N) ≈ 0.027, and 1-exp(-0.027) ≈ 2.7%.

It is somewhat unintuitive because if I give you two random 64-bit values, the probability of a collision is so low that it might as well be zero.

Though it is a textbook formula, we should still test it out to make sure that it is reasonable. Let us generate 32-bit random values for speed. I use a simple frequentist approximation: I generate many sets of 32-bit random values, I count the number of sets containing a collision, and I divide this number by the total number of sets.
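
A minimal sketch of the experiment (the function names and the number of sets are my own choices):

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>

// Fraction of sets of r random 32-bit values containing at least one collision.
double measured(size_t r, size_t number_of_sets = 10000) {
  std::mt19937_64 rng(1234);
  std::uniform_int_distribution<uint32_t> dist;
  size_t sets_with_collision = 0;
  for (size_t s = 0; s < number_of_sets; s++) {
    std::unordered_set<uint32_t> seen;
    for (size_t i = 0; i < r; i++) {
      if (!seen.insert(dist(rng)).second) { // insertion failed: collision
        sets_with_collision++;
        break;
      }
    }
  }
  return double(sets_with_collision) / double(number_of_sets);
}

// The textbook approximation with N = 2^32.
double theory(size_t r) {
  return 1.0 - std::exp(-double(r) * double(r) / (2.0 * 4294967296.0));
}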

My results are as follows. The formula agrees with my measurements: I get a maximal relative error of 23%. The exact measured output depends on the random number generation and will vary depending on how you set it up. Nevertheless, it looks good! As you can see, with only 51,200 32-bit random values, the probability of a collision reaches 25%. My code is available.

number  theory    measured  relative error
100     0.000001  0.000001  23%
200     0.000005  0.000005  13%
400     0.000019  0.000014  23%
800     0.000075  0.000073  2%
1600    0.000298  0.000254  15%
3200    0.001191  0.001079  9%
6400    0.004757  0.004700  1%
12800   0.018893  0.017570  7%
25600   0.073456  0.071261  3%
51200   0.263006  0.240400  9%

Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512

Most strings online are Unicode strings in the UTF-8 format. Other systems (e.g., Java, Microsoft) might prefer UTF-16. However, Latin 1 is still a common encoding (e.g., within JavaScript runtimes). Its relationship with Unicode is simple: Latin 1 includes the first 256 Unicode characters. It is rich enough to cover most of the standard European languages. If something is stored in Latin 1, it can be encoded using Unicode. The reverse is obviously false. Nevertheless, let us assume that you have a Unicode string in UTF-8 that you want to quickly transcode to Latin 1.

The transcoding can be done with a short routine in C:

uint8_t leading_byte = data[pos]; // leading byte
if (leading_byte < 0b10000000) {
  // ASCII byte: one byte, one character
  *latin_output++ = leading_byte;
  pos++;
} else if ((leading_byte & 0b11100000) == 0b11000000) {
  // two-byte character: the continuation byte already carries the most
  // significant bit, so we only need the least significant bit of the
  // leading byte, shifted into position
  *latin_output++ = (leading_byte & 1) << 6 | (data[pos + 1]);
  pos += 2;
}

It processes the data one byte at a time. There are two cases: ASCII bytes (one byte, one character) and two-byte characters (one leading byte, one continuation byte).

Can we do better?

We can use Single instruction, multiple data (SIMD) instructions and specifically the advanced SIMD instructions available on recent AMD Zen 4 and Intel Ice Lake processors: AVX-512 with VBMI2.

We can then process the data 64 bytes at a time. Using AVX-512, we load 64 bytes. We identify the locations of the ASCII bytes, the leading bytes and the continuation bytes. These are identified using masks. We then modify the leading bytes, keeping one bit (just the least significant one) and shifting it up by six positions. We then do two compressions: one where we omit the continuation bytes, and one where we omit the newly transformed leading bytes. We then combine the two results with a byte-wise logical OR, and we are done. Using Intel intrinsics, the code might look as follows:

__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 ascii = _mm512_cmplt_epu8_mask(input, mask_80808080);
__mmask64 continuation =
    _mm512_cmplt_epi8_mask(input, _mm512_set1_epi8(-64));
__mmask64 leading = _kxnor_mask64(ascii, continuation);
__m512i highbits =
    _mm512_maskz_add_epi8(leading, input, _mm512_set1_epi8(62));
highbits = _mm512_slli_epi16(highbits, 6); // shift in position
input = _mm512_mask_blend_epi8(leading, input, highbits);
__m512i ascii_continuation =
    _mm512_maskz_compress_epi8(ascii | continuation, input);
__m512i ascii_leading =
    _mm512_maskz_compress_epi8(ascii | leading, input);
__m512i output = _mm512_or_si512(ascii_continuation, ascii_leading);
_mm512_storeu_epi8((__m512i *)latin_output, output);

Of course, we must also validate the input, and it adds some complexity, but not too much.

An anonymous reader points out a significantly faster and simpler approach:

  1. Load 64 bytes
  2. Identify the leading bytes (non-continuation non-ASCII bytes) with a single comparison.
  3. Test whether the leading bytes have their least significant bit set, and construct a mask with it.
  4. Shift the mask by one position so that it points at the following continuation bytes, and set the second most significant bit to 1 in those continuation bytes. We can do it by subtracting 0b11000000 because 0b10000000 - 0b11000000 is 0b11000000 (modulo 256).
  5. Prune all the leading bytes (non-continuation non-ASCII bytes) and write it out.

Using Intel intrinsics, the core implementation might be like so:

__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
__mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1));
// next_bit6 carries the last bit across to the next 64-byte block
input = _mm512_mask_sub_epi8(input, (bit6 << 1) | next_bit6, input,
                             _mm512_set1_epi8(-64));
next_bit6 = bit6 >> 63;
__mmask64 retain = ~leading;
__m512i output = _mm512_maskz_compress_epi8(retain, input);
int64_t written_out = _popcnt64(retain);
__mmask64 store_mask = (1ULL << written_out) - 1;
_mm512_mask_storeu_epi8((__m512i *)latin_output, store_mask, output);

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars Wikipedia entry that has been modified to fit in Latin 1.

technique       CPU cycles/byte  instructions/byte
conventional    1.9              5.4
AVX-512         0.3              0.75
faster AVX-512  0.2              0.43

On a 3.2 GHz processor, the AVX-512 routine reaches 12 GB/s. It is about 6 times faster than the conventional routine, and it uses 7 times fewer instructions. The faster AVX-512 routine is 10 times faster than the conventional routine while using 11 times fewer instructions. It reaches 18 GB/s. That is likely faster than the RAM bandwidth, which I estimate to be at least 15 GB/s, though the CPU cache bandwidth is several times higher. Keep in mind that there are disks with a 14 GB/s bandwidth.

This problem illustrates that AVX-512 can really do well on non-trivial string transformations without excessive cleverness.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

Coding of domain names to wire format at gigabytes per second

When you enter the domain name lemire.me in your browser, it eventually gets encoded into a so-called wire format. The name lemire.me contains two labels, one of length 6 (lemire) and one of length 2 (me). The wire format starts with 6lemire2me: that is, imagining that the name starts with an imaginary dot, all dots are replaced by the length (in bytes) of the upcoming label. The lengths are stored as byte values, not ASCII digits. There is also a final zero byte. RFC 1035 specifies the format:

Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. The high order two bits of every length octet must be zero, and the remaining six bits of the length field limit the label to 63 octets or less. To simplify implementations, the total length of a domain name (i.e., label octets and label length octets) is restricted to 255 octets or less.

I find the problem interesting because the wire format is an “indexed” format: you can easily iterate over the labels, jumping in constant time over them, knowing the exact length, and so forth. Thus, coding strings into the wire format is a form of indexing.

How quickly can we do it?

You can use conventional code to convert the name to the wire format. You copy non-dot characters and when you encounter a dot, you write the distance to the previous dot (imagining that there is a virtual first dot). In C, it might look as follows (omitting the final zero byte):

const char *src = "lemire.me";
uint8_t *dst = ...; // output buffer
uint8_t *counter = dst++;
do {
  while (is_name_character(*src)) {
    *dst++ = (uint8_t)*src;
    src++;
  }
  *counter = (uint8_t)(dst - counter - 1);
  counter = dst++;
  if (*src != '.') { break; }
  src++;
} while (true);

Can you do better?

You can use Single instruction, multiple data (SIMD) instructions. You load 16, 32 or 64 characters in a wide register. You locate the dots. You construct a new register where the dots are replaced by position indexes and everything else is zeroed. E.g., starting with

.a.bcde.

You mark the location of the dots:

0xff 0 0xff 0 0 0 0 0xff

You compute the byte-wise AND with the following constant:

0 1 2 3 4 5 6 7

and you get an array where the dots have been replaced by indexes…

0 0 2 0 0 0 0 7

You then do a prefix computation, propagating the values:

0 2 2 7 7 7 7 7

This can be done using a logarithmic algorithm: shifting the values off by 1 byte, comparing, shifting the values off by 2 bytes, comparing, and so forth.

Finally, we can shift by one and subtract. Masking off the result so that we only keep the values where the dots are, we get the distances to the next dot:

2 - 5 - - - - 0

These distances exceed the label lengths by one, so we subtract one (with saturation, so that the final dot keeps its zero). You can then blend the results with the input:

1 a 4 b c d e 0

If you use Intel intrinsics, the code might look like this… It is a bit technical so I am not going to explain it further, but the idea is as I explained…

__m128i dot_to_counters(__m128i input) {
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
  __m128i sequential =
      _mm_setr_epi8(-128, -127, -126, -125, -124, -123, -122, -121, -120,
                    -119, -118, -117, -116, -115, -114, -113);
  __m128i zero = _mm_setzero_si128();
  __m128i dotscounts = _mm_and_si128(dots, sequential);
  __m128i origdotscounts = dotscounts;
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 1),
                            dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 2),
                            dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 4),
                            dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 8),
                            dotscounts);
  __m128i next = _mm_alignr_epi8(zero, dotscounts, 1);
  dotscounts = _mm_subs_epu8(next, origdotscounts);
  dotscounts = _mm_subs_epu8(dotscounts, _mm_set1_epi8(1));
  return _mm_blendv_epi8(input, dotscounts, dots);
}

I call this strategy “Prefix-Minimum”. It is essentially data independent: it does not matter where the dots are, the code is always the same. Processors like that a lot!

Prefix-Minimum will be fast, especially if you use wide registers (e.g., 32 bytes with AVX2 instructions). However, it does not offer a direct path to validating the inputs: are the counters in the proper range? Do you have overly long labels?

If you need to reason about where the dots are, the best way is probably to build an index in the spirit of the simdjson library (our original paper is available as PDF). For example, you can load 64 bytes of data and construct a 64-bit word where each bit corresponds to a byte, and the bit is set to 1 if the corresponding byte is a dot (‘.’). You can then iterate through the bits, as with a bitset. The construction of the 64-bit word might look as follows using Intel intrinsics and AVX2:

__m256i input1 = _mm256_loadu_si256((const __m256i *)src);
__m256i input2 = _mm256_loadu_si256((const __m256i *)(src + 32));
uint32_t dots1 = (uint32_t)_mm256_movemask_epi8(
    _mm256_cmpeq_epi8(input1, _mm256_set1_epi8('.')));
uint32_t dots2 = (uint32_t)_mm256_movemask_epi8(
    _mm256_cmpeq_epi8(input2, _mm256_set1_epi8('.')));
uint64_t DOTS = (((uint64_t)dots2) << 32) | dots1;
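
Iterating over the dot positions is then a standard bit-twiddling loop; a sketch:

while (DOTS != 0) {
  int dot_position = __builtin_ctzll(DOTS); // index of the lowest set bit
  // ... handle the label boundary at src[dot_position] ...
  DOTS &= DOTS - 1; // clear the lowest set bit
}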

I implemented different strategies using AVX2 intrinsics. We can use a database of popular domain names for testing. Results vary depending on the system. I use GCC 11 and an Ice Lake server.

technique                  CPU cycles/string  instructions/string
conventional               150                240
Prefix-Minimum (256 bits)  44                 77
simdjson-like (256 bits)   56                 77

My source code is available.

The Prefix-Minimum approach is fastest. Observe how the Prefix-Minimum approach is faster than the simdjson-like approach, despite executing the same number of instructions. That is because it is able to retire more instructions per cycle. The downside of the Prefix-Minimum approach is that it does not validate the label length.

The simdjson-like approach does full validation and is still quite fast: over three times faster than the conventional approach. It is sufficient to break the gigabyte per second barrier (1.7 GB/s in these experiments).

Being able to validate the counter values with the Prefix-Minimum without adding too much overhead would be fantastic: it might be possible to get the best speed without compromising on the validation. It remains an open problem whether it is possible.

I suspect that my implementations can be improved, by at least 20%.

Future work should consider AVX-512. I have the sketch of an implementation using AVX-512 but it is not faster. Porting to ARM NEON would be interesting.

Credit: The work was motivated by the simdzone project by NLnetLabs.

Science and Technology links (August 6 2023)

  1. In an extensive study, You et al. (2022) found that meat consumption was correlated with higher life expectancies:

    Meat intake is positively correlated with life expectancies. This relationship remained significant when influences of caloric intake, urbanization, obesity, education and carbohydrate crops were statistically controlled. Stepwise linear regression selected meat intake, not carbohydrate crops, as one of the significant predictors of life expectancy. In contrast, carbohydrate crops showed weak and negative correlation with life expectancy.

  2. Tupy and Deutsch write:

    The world’s population has increased eightfold since 1800, and standards of living have never been higher. Despite increases in consumption, and contrary to the prophecies of generations of Malthusians, the world hasn’t run out of a single metal or mineral. In fact, resources have generally grown cheaper relative to income over the past two centuries. Even on the largest cosmic scale, resources may well be limitless.

  3. Though the United States has one of the fastest-growing economies, its growth might have been limited since the 1970s by unequal rates of innovation among industries. That is, economic progress might be held back by bottlenecks.
  4. In the past, people built extensive cities underground, where tens of thousands of people lived, protected from invaders.
  5. In some cases, workers are less productive when working from home. The productivity of workers randomly assigned to working from home is 18% lower than that of those in the office. Workers who prefer home work are substantially less productive at home than at the office (27% less, compared to 13% less for workers who prefer the office).
  6. Law students are optimistic: 95% believed they would end up in the top half of the class.
  7. HMB, a common muscle-building supplement, might also protect your brain.
  8. Antibiotics are associated with inflammatory bowel disease.
  9. Statins, a commonly prescribed class of drugs, might increase insulin resistance and cause diabetes.
  10. Artificial intelligence (deep reinforcement learning) has improved the speed of some data sorting algorithms, in practice. Meanwhile a simple algorithm based on zip compression can beat fancy artificial intelligence techniques at text classification.
  11. A cocktail based on metformin (anti-diabetes drug) and leucine (a common supplement) can protect against muscle atrophy.

Decoding base16 sequences quickly

We sometimes represent binary data using the hexadecimal notation. We use a base-16 representation where the first 10 digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and where the remaining digits are A, B, C, D, E, F (or a, b, c, d, e, f). Thus each character represents 4 bits. A pair of characters can represent any byte value (8 bits).

In Parsing short hexadecimal strings efficiently, I examined the problem of decoding short sequences (e.g., 4 characters). A conventional parser can be based on table lookups, where unexpected characters are mapped to a special value (e.g., -1) but where the characters 0 to 9 are mapped to 0 to 9, and A to F are mapped to 10 to 15. A conventional decoder can rely on a simple routine such as…

int8_t n1 = digittoval[src[0]];
int8_t n2 = digittoval[src[1]];
src += 2;
*dst = (uint8_t)((n1 << 4) | n2);
dst += 1;
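
The snippet assumes a precomputed digittoval table; a sketch of its construction:

#include <stdint.h>

static int8_t digittoval[256];

// Map '0'-'9' to 0-9, 'A'-'F' and 'a'-'f' to 10-15, everything else to -1.
static void init_digittoval(void) {
  for (int i = 0; i < 256; i++) { digittoval[i] = -1; }
  for (int i = 0; i < 10; i++) { digittoval['0' + i] = (int8_t)i; }
  for (int i = 0; i < 6; i++) {
    digittoval['A' + i] = (int8_t)(10 + i);
    digittoval['a' + i] = (int8_t)(10 + i);
  }
}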

Can we go faster? In Parsing hex numbers with validation, Langdale and Muła considered the question. They use SIMD instructions. They get good results, but let us revisit the question.

I am considering the problem of parsing sequences of 56 characters (representing 28 bytes). I want to compare the best approach described by Langdale and Muła with a straight adaptation of our base32 decoder. Generally speaking, the main difference between our approach and the Langdale-Muła approach is in the validation. The Langdale-Muła approach does straightforward arithmetic and comparisons, whereas we favour vectorized classification (e.g., we use vectorized table lookups).

Our main routine to decode 16 characters goes as follows:

  1. Load 16 bytes into a vector register.
  2. Subtract 1 from all character values.
  3. Using a 4-bit shift and a mask, select the most significant 4 bits of each byte value.
  4. Do a vectorized table lookup using the most significant 4 bits, and add the result to the value. This inexpensive computation can detect any non-hexadecimal character: any such character maps to a byte value with the most significant bit set. We later branch out based on a movemask which detects these bytes.
  5. We do a similar computation, a vectorized table lookup using the most significant 4 bits followed by an addition, to map each character to a 4-bit value (between 0 and 15).
  6. We use a multiply-add and a pack instruction to pack the 4-bit values, going from 16 bytes to 8 bytes.
  7. We write the 8 bytes to the output.

Using Intel instructions, the hot loop looks as follows:

__m128i v = _mm_loadu_si128((__m128i *)src);
__m128i vm1 = _mm_add_epi8(v, _mm_set1_epi8(-1));
__m128i hash_key =
    _mm_and_si128(_mm_srli_epi32(vm1, 4), _mm_set1_epi8(0x0F));
__m128i check = _mm_add_epi8(_mm_shuffle_epi8(delta_check, hash_key), vm1);
v = _mm_add_epi8(vm1, _mm_shuffle_epi8(delta_rebase, hash_key));
unsigned int m = (unsigned)_mm_movemask_epi8(check);
if (m) {
  // error
}
__m128i t3 = _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110));
__m128i t5 = _mm_packus_epi16(t3, t3);
_mm_storeu_si128((__m128i *)dst, t5);

You can do the equivalent with 256-bit (and even 512-bit) registers if you want. I refer the interested reader to our paper Faster Base64 Encoding and Decoding using AVX2 Instructions.

Your results will vary depending on the system. I use GCC 11 and an Ice Lake server. The inputs are rather small and there is overhead due to the function calls so it is not an ideal case for fancy algorithms.

technique                 CPU cycles/string  instructions/string
conventional              72                 360
Langdale-Muła (128 bits)  24                 110
ours (128 bits)           21                 88
ours (256 bits)           16                 61

My source code is available.

Compared to a conventional decoder, our fast techniques are 3 to 5 times faster.

At least in this code, our approach is slightly faster than the Langdale-Muła approach. However, the difference is small and if you consider different hardware or different compilers, the results might flip. The 256-bit approach is noticeably faster, but if your inputs are small (shorter than 32 characters), then that would not hold.

If you have access to AVX-512, you can assuredly go much faster, but I did not consider this case.

In the real world, you may need to handle spaces in the encoded sequence. My benchmark does not handle such cases, and some careful engineering based on real-world data would be needed to deal efficiently with such inputs at high speed. I nevertheless include a fast decoder that can ignore spaces in the input for demonstration purposes (see source code). It would be easier to prune spaces using AVX-512.

Porting this code to ARM NEON would be easy.

Conclusion: You can decode base16 inputs using about one instruction per input character which is about six times better than a conventional decoder. It seems that you can beat the fast Langdale-Muła decoder, at least on my test system. Your main tool to go faster is to use wider SIMD registers. Future work should consider AVX-512: it might be possible to double the performance.

Credit: The work was motivated by the simdzone project by NLnetLabs. GitHub user @aqrit lent his expertise.

Science and Technology links (July 23 2023)

  1. People increasingly consume ultra processed foods. They include energy drinks, mass-produced packaged breads, margarines, cereal, energy bars, fruit yogurts, fruit drinks, vegan meat and cheese, infant formulas, pizza, chicken nuggets, and so forth. Ultra processed foods are correlated with poorer health.
  2. Homo sapiens was not the first human species to venture out of Africa, it was the last.
  3. Researchers claim to be able to reprogram the cells in our eyes to repair retinal degeneration.
  4. Older people are often prescribed statins to keep their risk of cardiovascular disease low. However, statins reduce short-term cognitive performance.
  5. Killifish are animals that have amazing regeneration abilities, being even able to regenerate part of their brain. As they age, this regenerative ability is lost. Killifish, like other animals, tend to accumulate dysfunctional cells with age, called senescent cells. Though having a few senescent cells is not a problem, and may even be necessary, having too many is believed to reduce fitness. Thankfully, we have drugs called senolytics that can eliminate some of the senescent cells. By using these drugs in old killifish, researchers were able to restore some of their amazing regenerative abilities.
  6. There may be a mushroom that is able to increase nerve growth and improve memory in human beings.
  7. Resistance training might be a viable anti-skin-aging strategy for women (and possibly men).

Fast decoding of base32 strings

We often need to encode binary data into ASCII strings (e.g., email). The standards to do so include base16, base32 and base64.

There are some research papers on fast base64 encoding and decoding: Base64 encoding and decoding at almost the speed of a memory copy and Faster Base64 Encoding and Decoding using AVX2 Instructions.

For the most part, these base64 techniques are applicable to base32. Base32 works in the following manner: you use 8 ASCII characters to encode 5 bytes. Each ASCII character carries 5 bits of information: it can be one of 32 characters. For reference, base64 uses 64 different ASCII characters, so each character carries more information (6 bits). However, base64 requires using both upper case and lower case letters and other special characters, so it is less portable. Base32 can be case invariant.

There are different variations, but we can consider Base 32 Encoding with Extended Hex Alphabet, which uses the digits 0 to 9 for the values 0 to 9 and the letters A to V for the values 10 to 31. So each character represents a value between 0 and 31. If required, you can pad the encoding with the ‘=’ character so that its length is divisible by 8 characters. However, that is not always required. Instead, you may simply stop decoding as soon as an out-of-range character is found.

‘0’ 0
‘1’ 1
‘2’ 2
‘4’ 4
‘9’ 9
‘A’ 10
‘V’ 31

A conventional decoder might use branchy code:

if (ch >= '0' && ch <= '9')
  d = ch - '0';
else if (ch >= 'A' && ch <= 'V')
  d = ch - 'A' + 10;
else if (ch >= 'a' && ch <= 'v')
  d = ch - 'a' + 10;
else
  return -1;

Though that’s not going to be fast, it can be convenient. You can do better with tables. For example, you may populate a 256-entry table (one entry per character byte value) with the values from the table above, using a special error code (e.g., 32) when the character is out of range. That can produce efficient code:

uint64_t r = 0;
for (size_t i = 0; i < 8; i++) {
  uint8_t x = table[*src++];
  if (x > 31) {
    r <<= (5 * (8 - i));
    break;
  }
  r <<= 5;
  r |= x;
}
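
After the loop, r holds the 40 decoded bits in its least significant positions; one might then emit the five decoded bytes (a sketch of my own, assuming an output pointer out):

for (int j = 4; j >= 0; j--) {
  *out++ = (uint8_t)(r >> (8 * j)); // most significant decoded byte first
}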

You can also program a version using SIMD instructions. I am not going to present the code, it is similar to the code described in a base64 paper.

I wrote a benchmark focused on short inputs (32 bytes). My benchmark makes function inlining difficult, thus the function call overhead and the population of the constant is a non-negligible cost.

I run it on an Ice Lake server, and the code is compiled with GCC 11 (targeting an old processor, Haswell). We can use either 128-bit or 256-bit SIMD registers.

technique        CPU cycles/string  instructions/string
branchy          500                1400
table            43                 230
SIMD (128 bits)  15                 70
SIMD (256 bits)  13                 61

In this instance, the table-based approach is reasonable and is only about three times slower than the SIMD-based approaches. Because I am using small inputs, the 256-bit SIMD code has only marginal benefits, but I expect it would do better over longer strings. The branchy code is terrible performance-wise, but it is more flexible and can easily deal with skipping white space characters and so forth.

My source code is available.

Credit: The work was motivated by the simdzone project by NLnetLabs. The initial base32 decoder implementation was provided by GitHub user @aqrit.

Science and Technology links (July 16 2023)

  1. Most people think that they are more intelligent than average.
  2. Lack of vitamin C may damage the arteries. Make sure you have enough!
  3. A difficult problem in software is caching. Caching is the idea that you keep some values in fast memory. But how do you choose which values to keep? A standard technique is to evict from the cache the least recently used value. Intuitively, you just get rid of the value you have not needed in a long time. However, that is relatively expensive. Yang et al. find that a simpler technique (first-in first-out) can be just as good. You just enter values in a queue, and the first value to have entered the cache is the first one to be considered for eviction. For best performance, they explain that you should reinsert values that are about to be evicted if they have been used at least once, and that you should have a probation period for newly inserted values.
  4. Soybean oil induces obesity, diabetes, insulin resistance, and fatty liver in mice. Unlike other oils (e.g., coconut oil), soybean oils affect gene expression.
  5. Receiving a Nobel Prize might lower your intellectual productivity.
  6. We are currently in the holocene, which started at the end of the last glacial period (about 12,000 years ago). During the first half of the holocene, the Sahara was green, filled with grassland and tropical trees.
  7. Conspiracy theorizing is most prevalent in freer societies (including Australia, Canada, Germany, Japan, the Netherlands and the United Kingdom).
  8. Cancer thrives on sugar. Thus tumors do not grow as much on a low-sugar diet. However, it does not follow that a ketogenic diet is helpful in fighting cancer because the lack of sugar may accelerate frailty and thus hasten death.
  9. Offspring of parents with severe mental illnesses are more at risk for diabetes.
  10. Vitamin D supplements might reduce your risk of myocardial infarction.
  11. If you take away much of the genetic material of a cell, leaving it less fit, it can quickly gain back the fitness by natural selection.
  12. You can tell apart men and women by how they smell.

Recognizing string prefixes with SIMD instructions

Suppose that I give you a long list of string tokens (e.g., “A”, “A6”, “AAAA”, “AFSDB”, “APL”, “CAA”, “CDS”, “CDNSKEY”, “CERT”, “CH”, “CNAME”, “CS”, “CSYNC”, “DHC”, etc.). I give you a pointer inside a much larger string and I ask you whether you are pointing at one of these tokens, and if so, which one. To make things slightly more complicated, you want the token to be followed by a valid separator (e.g., a space, a semi-colon, etc.) and you want to ignore the case (so that “aaaa” matches “AAAA”).

How might you solve this efficiently?

Comparing against each token might do well if you have few of them, but it is clearly a bad idea if you have many (say 70).

A reasonable approach is to do a binary search through the sorted list of tokens. The C function bsearch is well suited. You need a comparison function as part of the implementation. You may use the C function strncasecmp to compare the strings while ignoring the case, and you add a check to make sure that you only return a match (value 0) when the input has a terminating character at the right position.

Then the logarithmic-time (O(log n)) implementation looks like this…

std::string *lookup_symbol(const char *input) {
  return (std::string *)bsearch(input, strings.data(), strings.size(),
                                sizeof(std::string), compare);
}
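
The comparison function might look as follows (a sketch: is_separator is assumed, and strncasecmp comes from <strings.h>):

// Match ignoring case, and only declare a match (0) when a valid
// separator follows the candidate token.
int compare(const void *keyptr, const void *elemptr) {
  const char *input = (const char *)keyptr;
  const std::string *candidate = (const std::string *)elemptr;
  int r = strncasecmp(input, candidate->data(), candidate->size());
  if (r != 0) { return r; }
  // the candidate is a prefix of the input: a match only if a separator follows
  return is_separator((uint8_t)input[candidate->size()]) ? 0 : 1;
}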

Simple enough.

Another approach is to use a trie. You implement a tree where the first level checks for the first character, the second level for the second character and so forth.

It gets a little bit lengthy, but you can use a script or a library to generate the code. You can use a series of switch/case like so…

switch (s[0]) {
  case 'A': case 'a':
    switch (s[1]) {
      case '6': return is_separator(s[2]) ? 1 : -1;
      case 'A': case 'a':
        switch (s[2]) {
          case 'A': case 'a':
            switch (s[3]) {
              case 'A': case 'a':
                return is_separator(s[4]) ? 2 : -1;
              default:
                return -1;
...

The running time complexity depends on the data, but it is at most the length of the longest string in your set. The trie is a tempting solution, but it is branchy: if the processor can predict the upcoming content, it should do well, but if the input is random, you might be unlucky and get poor performance.

We can also use a finite-state machine, which requires a relatively large table, but has really simple execution:

int s = 71;
while (*str && s >= 0) {
  uint8_t tok = char2token[uint8_t(*str)];
  str++;
  s = statetable[32 * s + tok];
}
*token = (uint8_t)s;
return s != 0;

With SIMD instructions, you can write a tight implementation that is effectively branchless: its execution does not depend on the input data.

The code works in this manner:

  1. We load unconditionally 16 bytes in a register.
  2. We first find the location of the first separator, if any. We can do this with vectorized classification. It is a significant cost.
  3. We set to zero all bytes in the register starting from this first separator. We also switch all characters in A-Z to a lower case.
  4. We use a hash function to map the processed bytes to a table containing our tokens in 16-byte blocks. The hash function is designed in a such a way that if the input matches one of our tokens, then it should be mapped to an identical value. We can derive the hash function empirically (by trial and error). Computing the hash function is a significant cost so we have the be careful. In this instance, I use a simple function:
    (((val >> 32) ^ val)&0xffffffff) * (uint64_t)3523216699) >> 32.
  5. We compare the processed input with the loaded value from the hash function.

The C function written using Intel intrinsic functions is as follows:

bool sse_type(const char *type_string, uint8_t *type) {
  // 1. load 16 bytes unconditionally
  __m128i input = _mm_loadu_si128((const __m128i *)type_string);
  // 2. vectorized classification: locate the separator bytes
  // (each byte is compared against the delimiter candidate selected
  // by its low nibble; the -33 mask entries fold the space character)
  __m128i delimiters =
      _mm_setr_epi8(0x00, 0x00, 0x22, 0x00, 0x00, 0x00,
                    0x00, 0x00, 0x28, 0x09, 0x0a, 0x3b,
                    0x00, 0x0d, 0x00, 0x00);
  __m128i mask = _mm_setr_epi8(-33, -1, -1, -1, -1,
                    -1, -1, -1, -1, -33, -1, -1,
                    -1, -1, -1, -1);
  __m128i pattern = _mm_shuffle_epi8(delimiters, input);
  __m128i inputc = _mm_and_si128(input,
      _mm_shuffle_epi8(mask, input));
  int bitmask = _mm_movemask_epi8(
      _mm_cmpeq_epi8(inputc, pattern));
  // we assume that a separator occurs within the 16 bytes
  uint16_t length = __builtin_ctz(bitmask);
  // 3. zero the bytes from the first separator onward and lower-case
  __m128i zero_mask = _mm_loadu_si128(
       (const __m128i *)(zero_masks + 16 - length));
  __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  input = _mm_andnot_si128(zero_mask, inputlc);
  // 4. hash the first eight processed bytes to get a table index
  uint8_t idx = hash((uint64_t)_mm_cvtsi128_si64(input));
  *type = idx;
  // 5. compare the processed input with the token stored at that index
  __m128i compar = _mm_loadu_si128((const __m128i *)buffers[idx]);
  __m128i xorthem = _mm_xor_si128(compar, input);
  return _mm_test_all_zeros(xorthem, xorthem);
}

We expect it to compile to branch-free code, as follows:

sse_type(char const*, unsigned char*):
  movdqu xmm1, XMMWORD PTR [rdi]
  mov edx, 16
  movdqa xmm2, XMMWORD PTR .LC0[rip]
  movdqa xmm0, XMMWORD PTR .LC1[rip]
  pshufb xmm2, xmm1
  pshufb xmm0, xmm1
  pand xmm0, xmm1
  pcmpeqb xmm0, xmm2
  por xmm1, XMMWORD PTR .LC2[rip]
  pmovmskb eax, xmm0
  bsf eax, eax
  cdqe
  sub rdx, rax
  movdqu xmm0, XMMWORD PTR zero_masks[rdx]
  pandn xmm0, xmm1
  movq rax, xmm0
  movq rdx, xmm0
  shr rax, 32
  xor eax, edx
  mov edx, 3523216699
  imul rax, rdx
  shr rax, 32
  mov BYTE PTR [rsi], al
  movzx eax, al
  sal rax, 4
  pxor xmm0, XMMWORD PTR buffers[rax]
  ptest xmm0, xmm0
  sete al
  ret

I wrote a benchmark where we repeatedly try to check for matching tokens, using many thousands of random tokens (enough to prevent the processor from relying on trivial branch prediction). I run it on an Ice Lake server, and the code is compiled with GCC11 (targeting an old processor, Westmere).

technique CPU cycles/string instructions/string
binary search 236 335
trie 71 39
finite state 42 61
SIMD 15 39

In this particular test, the SIMD-based approach is more than four times faster than the trie despite the fact that it retires as many instructions. The trie struggles with branch mispredictions. The SIMD-based approach has a relatively high number of instructions retired per cycle (2.5). The binary search is disastrous in this case, being more than 10 times slower than the SIMD-based approach. The finite-state approach is interesting: it is only about three times slower than the SIMD-based approach and significantly faster than the other non-SIMD approaches. It uses nearly twice as many instructions as the trie, but it is nearly twice as fast. However, the finite-state approach requires a relatively large table, larger than the alternatives.

The trie can match the SIMD-based approach when the input is predictable. I can simulate this scenario by repeatedly trying to match a small number of tokens (say 100), always in the same order. In this easy case, I find that the trie can be just as fast as the SIMD-based approach.

My code is available. It could be easily ported to ARM NEON or to AVX-512.

Credit: The problem was suggested to me by Jeroen Koekkoek from NLnet Labs. He sketched part of the solution. I want to thank GitHub user @aqrit for their comments.

Note: The problem considered in this blog post is not the recognition of strings from a set. I have a blog post on that other topic: Quickly checking that a string belongs to a small set.

Further reading: Is Prefix Of String In Table? by Nelson and Modern perfect hashing for strings by Muła.

Stealth, not secrecy

The strategy for winning is simple: do good work and tell the world about it. In that order! This implies some level of stealth as you are doing the good work.

If you plan to lose weight, don’t announce it… lose the weight and then do the reveal.

Early feedback frames the problem and might derail you. You can end up feeling stuck in your path and it may increase your risk of failure.

If you want to feel free to let go of bad ideas, do not commit to them publicly.

Or you may feel like you have already succeeded and take the “do good work” part for granted: this is called hubris, and smart people are susceptible to it.

The timing of the first reveal matters because people are lazy. If you reveal something that is flawed, it will be hard to correct this impression later. You are better off waiting a bit later and making sure you present good work. Furthermore, announcing your next product early can drain interest for your current products. Steve Jobs once said: “It’s really simple, if we tell people what our next product is, they stop buying our current products.”

But the reverse does not apply: the most secretive people are not those who have genuine good work upcoming. For every Apple, you have hundreds of companies that have nothing worth stealing. And they will make sure you never find out.[1, 2, 3]

“Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.” Howard Aiken.

1. Theranos had an insane culture of secrecy. The company lasted 15 years. It was once valued at $10 billion. The company only failed after John Ioannidis famously questioned its wild claims while the rest of the medical community went along. It turned out that there was no innovation, no technology worth having. The company was soon worth $0.

2. Apple is successful and secretive. But Apple is secretive about features that their competitors already have.

3. In Surprisingly Small: The Effect of Trade Secret Breaches on Firm Performance, Searl and Vivian (2021) find that trade secret breaches have little effect on a firm’s performance. They write:

For policy makers, the implication is the narrative of trade secret theft as a fundamental threat to a country’s economy and innovation (in our case, the US), may simply be rhetoric when contrasted to the market’s understanding of such theft. Therefore, calls for the expansion of trade secrecy protections, such as the expansion of its definition or further criminalisation of trade secret theft, may do less to protect innovation overall and instead expand negative externalities such as increased litigation and constraints on labour mobility.
For managers, the implication is that the benefits of the protection of trade secrets may be overstated. Counterintuitively, the findings suggest managers should not prioritise trade secret protections and cybersecurity if the main goal is protecting shareholders.


Packing a string of digits into an integer quickly

Suppose that I give you a short string of digits, containing possibly spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the string matches the ordering of the integers.

We can use the fact that in ASCII, the digits have byte values 0x30, 0x31 and so forth. Thus, as a sequence of bytes, the string "20141103 012910" is actually 0x32, 0x30, 0x31… so we must select the least significant 4 bits of each byte, discarding the rest. Intel and AMD processors have a convenient instruction (pext) which allows us to select any bits from a word and pack them into a new word.
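
To illustrate, pext gathers the bits selected by a mask into the least significant positions. A toy example (not part of the actual routine):

#include <stdint.h>
#include <x86intrin.h>

// two ASCII digits '1' (0x31) and '2' (0x32), stored little endian
uint64_t two_digits = 0x3231;
// select the low 4 bits of each byte: the result is 0x21
uint64_t packed = _pext_u64(two_digits, 0x0f0f);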

A problem remains: Intel and AMD processors are little endian, which means that if I load the string in memory, the first byte becomes the least significant, not the most significant. Thankfully, Intel and AMD can handle this byte order during the load process.

In C, the desired function using Intel intrinsics might look like this:

#include <stdint.h>
#include <string.h>
#include <x86intrin.h> // Windows: <intrin.h>

// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char *c) {
  uint64_t part1, part2;
  // two overlapping 8-byte loads: bytes 0-7 and bytes 7-14
  memcpy(&part1, c, sizeof(uint64_t));
  memcpy(&part2, c + 7, sizeof(uint64_t));
  // byte swap so that the first character is the most significant
  part1 = _bswap64(part1);
  part2 = _bswap64(part2);
  // gather the low nibble of each byte; the second mask skips the space
  part1 = _pext_u64(part1, 0x0f0f0f0f0f0f0f0f);
  part2 = _pext_u64(part2, 0x0f000f0f0f0f0f0f);
  // the loads overlap at byte 7, so the two parts share one nibble
  return (part1 << 24) | part2;
}

It compiles to relatively few instructions: only 4 non-load instructions. The memcpy calls in my code translate into 64-bit load instructions. The register loading instructions (movabs) are nearly free in practice.

movbe rax, QWORD PTR [rdi]
movbe rdx, QWORD PTR [rdi+7]
movabs rcx, 1085102592571150095
pext rax, rax, rcx
movabs rcx, 1080880467920490255
sal rax, 24
pext rdx, rdx, rcx
or rax, rdx

Prior to the Zen 3 generation, pext had terrible performance on AMD processors. Recent AMD processors have performance on par with Intel, meaning that pext has a latency of about 3 cycles and can run once per cycle. So it is about as expensive as a multiplication. Not counting the loads, the above function could be completed in about 5 cycles, which is quite fast.
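
For illustration, a quick check of the function might look like this (assuming an x86-64 processor with BMI2 support):

#include <stdio.h>

int main() {
  uint64_t packed = extract_nibbles("20141103 012910");
  printf("%llx\n", (unsigned long long)packed); // prints 20141103012910
  return 0;
}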

For ARM processors, you can do it with ARM NEON like so: mask the high nibbles, shuffle the bytes, then shift/or, then narrow (16->8), extract to general register.

#include <arm_neon.h>
#include <stdint.h>
// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char *c) {
  const uint8_t *ascii = (const uint8_t *)(c);
  uint8x16_t in = vld1q_u8(ascii);
  // mask the high nibbles
  in = vandq_u8(in, vmovq_n_u8(0x0f));
  // shuffle the bytes, reversing the order and skipping the space
  const uint8x16_t shuf = {14, 13, 12, 11, 10, 9, 7, 6,
                           5, 4, 3, 2, 1, 0, 255, 255};
  in = vqtbl1q_u8(in, shuf);
  // shift/or so that each 16-bit lane packs two digits into one byte
  uint16x8_t ins =
      vsraq_n_u16(vreinterpretq_u16_u8(in),
                  vreinterpretq_u16_u8(in), 4);
  // then narrow (16->8)
  uint8x8_t packed = vmovn_u16(ins);
  // extract to a general-purpose register
  return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
}

It might compile to something like this:

adrp x8, .LCPI0_0
ldr q1, [x0]
movi v0.16b, #15
ldr q2, [x8, :lo12:.LCPI0_0]
and v0.16b, v1.16b, v0.16b
tbl v0.16b, { v0.16b }, v2.16b
usra v0.8h, v0.8h, #4
xtn v0.8b, v0.8h
fmov x0, d0

My source code is available.

Having fun with string literal suffixes in C++

The C++11 standard introduced user-defined string literal suffixes. It also added regular expressions to the standard library. I wanted to have fun and see whether we could combine these features.

Regular expressions are useful to check whether a given string matches a pattern. For example, the expression \d+ checks that the string is made of one or more digits. Unfortunately, the backslash character needs to be escaped in C++ string literals, so the expression \d+ may need to be written as "\\d+", or you may use a raw string literal: it starts with R"( and ends with )", so you can write R"(\d+)". For complicated expressions, a raw string might be better.

A user-defined string literal is a way to specialize a string literal according to your own needs. It is effectively a convenient way to design your own “string types”. You can code it up as:

myclass operator"" _mysuffix(const char *str, size_t len) {
  return myclass(str, len);
}

And once it is defined, instead of writing myclass("mystring", 8), you can write "mystring"_mysuffix.

In any case, we would like to have a syntax such as this:

bool is_digit = "\\d+"_re("123");

I can start with a user-defined string suffix:

convenience_matcher operator"" _re(const char *str, size_t) {
  return convenience_matcher(str);
}

I want my convenience_matcher to construct a regular expression instance, and to call the matching function whenever a parameter is passed in parentheses. The following class might work:

#include <regex>
struct convenience_matcher {
  convenience_matcher(const char *str) : re(str) {}
  bool match(const std::string &s) {
    std::smatch base_match;
    return std::regex_match(s, base_match, re);
  }
  bool operator()(const std::string &s) { return match(s); }
  std::regex re;
};

And that is all. The following expressions will then return a Boolean value indicating whether we have the required pattern:

 "\\d+"_re("123") // true
 "\\d+"_re("a23") // false
 R"(\d+)"_re("123") // true
 R"(\d+)"_re("a23") // false

I have posted a complete example. It is just for illustration and I do not recommend using this code for anything serious. I am sure that you can do better!

Parsing time stamps faster with SIMD instructions

In software, it is common to represent time as a time-stamp string. It is usually specified by a time format string. Some standards use the format %Y%m%d%H%M%S meaning that we print the year, the month, the day, the hours, the minutes and the seconds. The current time as I write this blog post would be 20230701205436 as a time stamp in this format. It is convenient because it is short, easy to read and if you sort the strings lexicographically, you also sort them chronologically.

You can generate time stamps using any programming language. In C, the following program will print the current time (universal, not local time):

#include <time.h>
#include <stdio.h>
int main() {
  char buffer[15];
  struct tm timeinfo;
  time_t rawtime;
  time(&rawtime);
  gmtime_r(&rawtime, &timeinfo);
  strftime(buffer, 15, "%Y%m%d%H%M%S", &timeinfo);
  buffer[14] = '\0';
  puts(buffer);
}

We are interested in the problem of parsing these strings. In practice, this means that we want to convert them to an integer representing the number of seconds since the Unix epoch. The Unix epoch is January 1st, 1970. For my purposes, I consider the time to be an unsigned 32-bit integer, so we can represent times between 1970 and 2106. It is not difficult to switch over to a 64-bit integer or to signed integers.

The way you typically solve this problem is to use something like the C function strptime. Can we do better?

Modern processors have fast instructions that operate on several words at once, called SIMD instructions. We have a block of 14 characters. Let us assume that we can read 16 characters safely, ignoring the content of the leftover characters.

We load the block of digits in a SIMD register. We subtract 0x30 (the code-point value of the character ‘0’), and all byte values should be between 0 and 9, inclusively. We also know that some digits must be smaller still: for example, the seconds field never exceeds 59 (Unix time does not include leap seconds), so the tens digit of the seconds must be between 0 and 5. Similarly, the hours run from 00 to 23, so the tens digit of the hours must be between 0 and 2. For each position, we do a saturating subtraction of the maximum: the result is zero if and only if the value does not exceed the maximum. We then use a special instruction to multiply each tens digit by 10 and add it to the adjacent units digit, producing 16-bit values. We repeat the same bound check on these two-digit values.

The code might look as follow using Intel intrinsic functions:

__m128i v = _mm_loadu_si128((const __m128i *)date_string);
// convert ASCII digits to values in 0..9
v = _mm_sub_epi8(v, _mm_set1_epi8(0x30));
// digit-level maxima (refined by the 16-bit check below);
// the last two bytes are ignored (-1)
__m128i limit =
    _mm_setr_epi8(9, 9, 9, 9, 1, 9, 3, 9, 2, 9, 5, 9, 5, 9, -1, -1);
__m128i abide_by_limits = _mm_subs_epu8(v, limit); // must be all zero
// combine digit pairs: tens digit * 10 + units digit
const __m128i weights = _mm_setr_epi8(
    10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 0, 0);
v = _mm_maddubs_epi16(v, weights);
// two-digit maxima: century, year, month, day, hour, minute, second
__m128i limit16 =
    _mm_setr_epi16(99, 99, 12, 31, 23, 59, 59, -1);
__m128i abide_by_limits16 = _mm_subs_epu16(v, limit16);
__m128i limits = _mm_or_si128(abide_by_limits16, abide_by_limits);
if (!_mm_test_all_zeros(limits, limits)) {
  return false; // a field was out of range or a non-digit was present
}

It does not get all the parsing done, but at this point, you have the months, days, hours, minutes and seconds as valid binary integer values. The year is parsed in two components (the first two digits, and the next two digits).

We can use standard C code for the rest of the computation.
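
For instance, the scalar tail might use the classic days-from-civil formula. This is a sketch, assuming the SIMD step produced the two year components, the month, the day, the hours, the minutes and the seconds as small integers; it is not necessarily the code from my repository:

uint32_t to_unix_time(uint32_t year, uint32_t month, uint32_t day,
                      uint32_t hour, uint32_t minute, uint32_t second) {
  // days-from-civil (Howard Hinnant's algorithm), valid for 1970-2106
  uint32_t y = year - (month <= 2);
  uint32_t era = y / 400;
  uint32_t yoe = y - era * 400;
  uint32_t doy = (153 * (month + (month > 2 ? -3 : 9)) + 2) / 5 + day - 1;
  uint32_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
  uint32_t days = era * 146097 + doe - 719468; // days since 1970-01-01
  return days * 86400 + hour * 3600 + minute * 60 + second;
}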

Is it fast? I wrote a benchmark that I compile using GCC 12 on an Intel Ice Lake Linux server.

technique instructions per stamp time per stamp (ns)
standard C with strptime 700 46
SIMD approach 65 7.9

We use about 10 times fewer instructions, and we go 6 times faster. That is not bad, but I suspect it is not nearly optimal.

Importantly, we do full error checking, and abide by the standard.

The source code is available.

Credit: Thanks to Jeroen Koekkoek from NLnetLabs for initial work and for proposing the problem, and to @aqrit for sketching the current code.


Dynamic bit shuffle using AVX-512

Suppose that you want to reorder, arbitrarily, the bits in a 64-bit word. This question was raised on Twitter by @experquisite. Formally, you provide, for each of the 64 bit positions, the original bit position it should be copied from.

Hence, the following code would reverse the bit order in your 64-bit word:

uint64_t w = ...; // some value
uint8_t indexes[64] = {63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51,
                       50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
                       37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25,
                       24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12,
                       11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
bit_shuffle(w, indexes); // returns a reversed version 

A naive way to do it in C might be as follows:

uint64_t slow_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  uint64_t out{};
  for (size_t i = 0; i < 64; i++) {
    bool bit_set = w & (uint64_t(1) << indexes[i]);
    out |= (uint64_t(bit_set) << i);
  }
  return out;
}

This might be an acceptable implementation, but what if you want to do it using few instructions? You can do it on recent Intel and AMD processors with support for AVX-512 instructions. You go from the general-purpose register to a mask register, then to a 512-bit AVX-512 register, you apply a shuffle (vpermb), you go back to a mask register and finally back to a general-purpose register.

The code with Intel intrinsic functions looks as follows:

uint64_t bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  // spread the 64 bits of w over 64 bytes (0x00 or 0xFF)
  __mmask64 as_mask = _cvtu64_mask64(w);
  __m512i as_vec_register =
      _mm512_maskz_mov_epi8(as_mask, _mm512_set1_epi8(0xFF));
  // shuffle the 64 bytes according to the provided indexes (vpermb)
  __m512i as_vec_register_shuf =
      _mm512_permutexvar_epi8(_mm512_loadu_si512(indexes), as_vec_register);
  // compress the 64 bytes back into a 64-bit word
  return _cvtmask64_u64(_mm512_movepi8_mask(as_vec_register_shuf));
}

It might compile to about six instructions:

kmovq k0, rdi
vpmovm2b zmm0, k0
vmovdqu8 zmm1, ZMMWORD PTR [rsi]
vpermb zmm0, zmm1, zmm0
vpmovb2m k1, zmm0
kmovq rax, k1

As one reader points out, you can do better because AVX-512 has a dedicated bit-shuffle instruction (vpshufbitqmb) which directly produces a mask and works directly from the 64-bit word, as long as it is loaded in a vector register:

uint64_t faster_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  __m512i as_vec_register = _mm512_set1_epi64(w);
  __mmask64 as_mask = _mm512_bitshuffle_epi64_mask(as_vec_register,
     _mm512_loadu_si512(indexes));
  return _cvtmask64_u64(as_mask);
}

The resulting assembly is quite short:

vpbroadcastq zmm0, rdi
vpshufbitqmb k0, zmm0, ZMMWORD PTR [rsi]
kmovq rax, k0
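
As a sanity check, both versions should agree with the naive implementation. A hypothetical test (it requires a processor with AVX-512 VBMI and BITALG support):

#include <assert.h>

int main() {
  uint8_t reversed[64];
  for (int i = 0; i < 64; i++) { reversed[i] = (uint8_t)(63 - i); }
  uint64_t w = 0x0123456789abcdefULL;
  assert(bit_shuffle(w, reversed) == slow_bit_shuffle(w, reversed));
  assert(faster_bit_shuffle(w, reversed) == slow_bit_shuffle(w, reversed));
  return 0;
}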

Loading your indexes is likely to have a long latency, so if you can reuse the loaded indexes (_mm512_loadu_si512(indexes)) across calls, you will significantly reduce the latency.

I have an implementation in C++.

Science and Technology links (June 25 2023)

  1. Women in highly religious relationships report the highest levels of relationship quality.
  2. US politics is largely divided into two parties (Republicans and Democrats). People who are affiliated with the Republicans have many more kids.
  3. The Antarctic ice shelves gained 661 gigatons of ice over the past decade.
  4. A low protein diet increases mortality among older men.
  5. Consuming magnesium may preserve your brain, especially if you are a woman.
  6. A third of the planets around common stars in our galaxy could be in the habitable zone.
  7. Taurine (a supplement) might keep you young. It appears to work, in some respects, in different animal species. McGaunn et al. (2023) make a case for human beings, but they do not have much in the way of quality clinical trials.
  8. Older people often lose their thymus, a gland that plays an important role in the immune system. Sandstedt et al. report that about two-thirds of their middle-aged subjects had lost the thymus entirely (complete degeneration). However, a small percentage (7%) had maintained it. These lucky people had much lower abdominal fat. The authors conclude that obesity might be driving immunological aging.
  9. Online shaming is motivated by schadenfreude (pleasure derived from the misfortunes of others).
  10. Obese people tend not to feel satisfied even after eating. This phenomenon remains even after they have lost weight.
  11. Oral contraceptives may increase the risk of depression (HR=1.7).
  12. Trees are growing at higher and higher altitudes.
  13. Female conservative candidates are usually more attractive. Conservative candidates are more likely to express happiness.
  14. 80% of participants rated their intelligence as above average.