Decoding base16 sequences quickly

We sometimes represent binary data using the hexadecimal notation. We use a base-16 representation where the first 10 digits are 0, 1, 2, 3, 5, 6, 7, 8, 9 and where the following digits are A, B, C, D, E, F (or a, b, c, d, e, f). Thus each character represents 4 bits. A pair of characters can represent any byte value (8 bits).

In Parsing short hexadecimal strings efficiently, I examined the problem of decoding short sequences (e.g., 4 characters). A conventional parser can be based on table lookups, where unexpected characters are mapped to a special value (e.g., -1) but where the characters 0 to 9 are mapped to 0 to 9, and A to F are mapped to 10 to 15. A conventional decoder can rely on a simple routine such as…

int8_t n1 = digittoval[src[0]];
int8_t n2 = digittoval[src[1]];
src += 2;
*dst = (uint8_t)((n1 << 4) | n2);
dst += 1;

Can we go faster? In Parsing hex numbers with validation, Langdale and Muła considered the question. They use SIMD instructions. They get good results, but let us revisit the question.

I am considering the problem of parsing sequences of 56 characters (representing 28 bytes). I want to compare the best approach described by Langdale and Muła with a straight adaptation of our base32 decoder. Generally speaking, the main difference between our approach and the Langdale-Muła is in the validation. The Langdale-Muła approach does straight-forward arithmetic and comparisons, whereas we favour vectorized classification (e.g., we use vectorized table lookups).

Our main routine to decode 16 characters goes as follows:

  1. Load 16 bytes into a vector register.
  2. Subtract 1 from all character values.
  3. Using a 4-bit shift and a mask, select the most significant 4 bits of each byte value.
  4. Do a vectorized table lookup using the most significant 4 bits, and add the result to the value. This inexpensive computation can detect any non-hexadecimal character: any such character would map to a byte value with the most significant bit set. We later branch out based on a movemask which detects these bytes.
  5. We do a similar computation, a vectorized table lookup using the most significant 4 bits, and add the result to the value, to map the character to a 4-bit value (between 0 and 16).
  6. We use a multiply-add and a pack instruction to pack the 4-bit values, going from 16 bytes to 8 bytes.
  7. We write the 8 bytes to the output.

Using Intel instructions, the hot loop looks as follows:

  __m128i v = _mm_loadu_si128((__m128i *)src);
  __m128i vm1 = _mm_add_epi8(v, _mm_set1_epi8(-1));
  __m128i hash_key =
  _mm_and_si128(_mm_srli_epi32(vm1, 4), _mm_set1_epi8(0x0F));
 __m128i check = _mm_add_epi8(_mm_shuffle_epi8(delta_check, hash_key), vm1);
  v = _mm_add_epi8(vm1, _mm_shuffle_epi8(delta_rebase, hash_key));
  unsigned int m = (unsigned)_mm_movemask_epi8(check);
  if (m) {
   // error
  }
__m128i t3 = _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110));
__m128i t5 = _mm_packus_epi16(t3, t3);
_mm_storeu_si128((__m128i *)dst, t5);

You can do the equivalent with 256-bit (and even 512-bit) registers if you want. I refer the interested reader to our paper Faster Base64 Encoding and Decoding using AVX2 Instructions.

Your results will vary depending on the system. I use GCC 11 and an Ice Lake server. The inputs are rather small and there is overhead due to the function calls so it is not an ideal case for fancy algorithms.

technique CPU cycles/string instructions/string
conventional 72 360
Langdale-Muła (128 bits) 24 110
ours (128 bits) 21 88
ours (256 bits) 16 61

My source code is available.

Compared to a conventional decoder, our fast techniques are 3 to 5 times faster.

At least in this code, our approach is slightly faster than the Langdale-Muła approach. However, the difference is small and if you consider different hardware or different compilers, the results might flip. The 256-bit approach is noticeably faster, but if your inputs are small (shorter than 32 characters), then that would not hold.

If you have access to AVX-512, you can assuredly go much faster, but I did not consider this case.

In the real-world, you may need to handle spaces in the encoded sequence. My benchmark does not handle such cases and some careful engineering based on real-world data would be needed to deal efficiently with such inputs at high speed. I nevertheless include a fast decoder that can ignore spaces in the input for demonstration purposes (see source code). It would be easier to prune spaces using AVX-512.

Porting this code to ARM NEON would be easy.

Conclusion: You can decode base16 inputs using about one instruction per input character which is about six times better than a conventional decoder. It seems that you can beat the fast Langdale-Muła decoder, at least on my test system. Your main tool to go faster is to use wider SIMD registers. Future work should consider AVX-512: it might be possible to double the performance.

Credit: The work was motivated by the simdzone project by NLnetLabs. GitHub user @aqrit lent his expertise.

Science and Technology links (July 23 2023)

  1. People increasingly consume ultra processed foods. They include
    energy drinks, mass-produced packaged breads, margarines, cereal, energy bars, fruit yogurts, fruit drinks, vegan meat and cheese, infant formulas, pizza, chicken nuggets, and so forth. Ultra processed foods are correlated with poorer health.
  2. Homo sapiens was not the first human species to venture out of Africa, it was the last.
  3. Researchers claim to be able to reprogram the cells in our eyes to repair retinal degenerence.
  4. Older people are often prescribed statins to keep their risk of cardiovascular disease low. However, statins reduce short-term cognitive performance.
  5. Killifish are animals that have amazing regeneration abilities, being even able to regenerate part of their brain. As they age, this regenerative ability is lost. Killifish, like other animals, tend to accumulate dysfunctional cells with age, called senescent cells. Though having a few senescent cells is not a problem, and may even be necessary, having too many is believed to reduce fitness. Thankfully, we have drugs called senolytics that can eliminate some of the senescent cells. By using these drugs in old killifish, they were able to restore some of the amazing regenerative abilities.
  6. There may be a mushroom that is able to increase nerve growth and improve memory, in human beings.
  7. Resistance training might be a viable anti-skin-aging strategy for women (and possibly men).

Fast decoding of base32 strings

We often need to encode binary data into ASCII strings (e.g., email). The standards to do so include base16, base32 and base64.

There are some research papers on fast base64 encoding and decoding: Base64 encoding and decoding at almost the speed of a memory copy and Faster Base64 Encoding and Decoding using AVX2 Instructions.

For the most parts, these base64 techniques are applicable to base32. Base32 works in the following manner: you use 8 ASCII characters to encode 5 bytes. Each ASCII characters carries 5 bits of information: it can be one of 32 characters. For reference, base64 uses 64 different ASCII characters so each character carries more information. However, base64 requires using both upper case and lower case letters and other special characters, so it is less portable. Base32 can be case invariant.

There are different variations, but we can consider Base 32 Encoding with Extended Hex Alphabet which uses the letters 0 to 9 for the values 0 to 9 and the letters A to V for the numbers from 10 to 31. So each character represents a value between 0 to 31. If required, you can pad the coding with the ‘=’ character so that it is divisible by 8 characters. However, that is not always required. Instead, you may simply stop decoding as soon as an out-of-range character is found.

‘0’ 0
‘1’ 1
‘2’ 2
‘4’ 4
‘9’ 9
‘A’ 10
‘V’ 31

A conventional decoder might use branchy code:

if (ch >= '0' && ch <= '9')
  d = ch - '0';
else if (ch >= 'A' && ch <= 'V')
  d = ch - 'A' + 10;
else if (ch >= 'a' && ch <= 'v')
  d = ch - 'a' + 10;
else
  return -1;

Though that’s not going to be fast, it can be convenient. You can do better with tables. For example, you may populate a 256-long table (one for each character byte value) with the values from the table above, and use a special error code when the value is out of range (e.g., 32). That can produce efficient code:

uint64_t r = 0;
for (size_t i = 0; i < 8; i++) {
  uint8_t x = table[*src++];
  if (x > 31) {
    r <<= (5 * (8 - i));
    break;
  }
  r <<= 5;
  r |= x;
}

You can also program a version using SIMD instructions. I am not going to present the code, it is similar to the code described in a base64 paper.

I wrote a benchmark focused on short inputs (32 bytes). My benchmark makes function inlining difficult, thus the function call overhead and the population of the constant is a non-negligible cost.

I run it on an Ice Lake server, and the code is compiled with GCC11 (targeting an old processor, Haswell). We can use either 128-bit SIMD registers or 256-bit SIMD registers.

technique CPU cycles/string instructions/string
branchy 500 1400
table 43 230
SIMD (128 bits) 15 70
SIMD (256 bits) 13 61

In this instance, the table-based approach is reasonable and is only about three times slower than the SIMD-based approaches. Because I am using small inputs, the 256-bit SIMD code has only marginal benefits, but I expect it would do better over longer strings. The branchy code is terrible performance-wise, but it is more flexible and can easily deal with skipping white space characters and so forth.

My source code is available.

Credit: The work was motivated by the simdzone project by NLnetLabs. The initial base32 decoder implementation was provided by GitHub user @aqrit.

Science and Technology links (July 16 2023)

  1. Most people think that they are more intelligent than average.
  2. Lack of vitamin C may damage the arteries. Make sure you have enough!
  3. A difficult problem in software is caching. Caching is the idea that you keep some values in fast memory. But how do you choose which values to keep? A standard technique is to evict from cache the least recently used value. Intuitively, you just get rid of the value you have not needed in a long time. However, that’s relatively expensive. Yang et al. find that a simpler technique (first-in first-out) can be just as good. You just enter values in queue and the last value to enter cache is the first one to be considered for evication. For best performance, they explain that you should reinsert values that have been evicted when it has been used at least once, and that you should have a probation period for the newly inserted values.
  4. Soybean oil induces obesity, diabetes, insulin resistance, and fatty liver in mice. Unlike other oils (e.g., coconut oil), soybean oils affect gene expression.
  5. Receiving a Nobel Prize might lower your intellectual productivity.
  6. We are currently in the holocene, which started at the end of the last glacial period (about 12,000 years ago). During the first half of the holocene, the Sahara was green, filled with grassland and tropical trees.
  7. Conspiracy theorizing is most prevalent in freer societies (including Australia, Canada, Germany, Japan, the Netherlands and the United Kingdom).
  8. Cancer thrives on sugar. Thus tumors do not grow as much on a low-sugar diet. However, it does not follow that a ketogenic diet is helpful in fighting cancer because the lack of sugar may accelerate frailty and thus hasten death.
  9. Offsprings of parents with severe mental illnesses are more at risk for diabetes.
  10. Vitamin D supplements might reduce your risk of myocardial infarction.
  11. If you take away much of the genetic material of a cell, leaving it less fit, it can quickly gain back the fitness by natural selection.
  12. You can tell apart men and women by how they smell.

Recognizing string prefixes with SIMD instructions

Suppose that I give you a long list of string tokens (e.g., “A”, “A6”, “AAAA”, “AFSDB”, “APL”, “CAA”, “CDS”, “CDNSKEY”, “CERT”, “CH”, “CNAME”, “CS”, “CSYNC”, “DHC”, etc.). I give you a pointer inside a much larger string and I ask you whether you are pointing at one of these tokens, and if so, which one. To make things slightly more complicated, you want the token to be followed by a valid separator (e.g., a space, a semi-colon, etc.) and you want to ignore the case (so that “aaaa” matches “AAAA”).

How might you solve this efficiently?

Comparing against each token might do well if you have few of them, but it is clearly a bad idea if you have many (say 70).

A reasonable approach is to do a binary search through the sorted list of tokens. The C function bsearch is well suited. You need a comparison function as part of the implementation. You may use the C function strncasecmp to compare the strings while ignoring the case, and you add a check to make sure that you only return a match (value 0) when the input has a terminating character at the right position.

Then the linearithmic (O(n log n)) implementation looks like this…

std::string *lookup_symbol(const char *input) {
  return bsearch(input, strings.data(), strings.size(),
  sizeof(std::string), compare);
}

Simple enough.

Another approach is to use a trie. You implement a tree where the first level checks for the first character, the second level for the second character and so forth.

It gets a little bit lengthy, but you can use a script or a library to generate the code. You can use a series of switch/case like so…

switch (s[0]) {
  case 'A': case 'a':
  switch (s[1]) {
    case '6': return is_separator(s[2]) ? 1 : -1;
  case 'A': case 'a':
    switch (s[2]) {
     case 'A': case 'a':
       switch (s[3]) { 
         case 'A': case 'a':
          return is_separator(s[4]) ? 2 : -1;
       default:
         return -1;
...

The running time complexity depends on the data, but is at most the length of the longest string in your set. The trie is a tempting solution but it is branchy: if the processor can predict the upcoming content, it should do well, but if the input is random, you might be unlikely and get poor performance.

We can also use a finite-state machine which requires a relative large table, but has really simple execution:

int s = 71;
while (*str && s >= 0) {
  uint8_t tok = char2token[uint8_t(*str)];
  str++;
  s = statetable[32 * s + tok];
}
*token = (uint8_t)s;
return s != 0;

With SIMD instructions, you can write a tight implementation that is effectively branchless: its execution does not depend on the input data.

The code works in this manner:

  1. We load unconditionally 16 bytes in a register.
  2. We first find the location of the first separator, if any. We can do this with vectorized classification. It is a significant cost.
  3. We set to zero all bytes in the register starting from this first separator. We also switch all characters in A-Z to a lower case.
  4. We use a hash function to map the processed bytes to a table containing our tokens in 16-byte blocks. The hash function is designed in a such a way that if the input matches one of our tokens, then it should be mapped to an identical value. We can derive the hash function empirically (by trial and error). Computing the hash function is a significant cost so we have the be careful. In this instance, I use a simple function:
    (((val >> 32) ^ val)&0xffffffff) * (uint64_t)3523216699) >> 32.
  5. We compare the processed input with the loaded value from the hash function.

The C function written using Intel intrinsic functions is as follows:

bool sse_type(const char *type_string, uint8_t *type) {
  __m128i input = _mm_loadu_si128((__m128i *)type_string);
  __m128i delimiters =
    _mm_setr_epi8(0x00, 0x00, 0x22, 0x00, 0x00, 0x00, 
                  0x00, 0x00, 0x28, 0x09, 0x0a, 0x3b, 
                  0x00, 0x0d, 0x00, 0x00);
  __m128i mask = _mm_setr_epi8(-33, -1, -1, -1, -1, 
                  -1, -1, -1, -1, -33, -1, -1,
                  -1, -1, -1, -1);
  __m128i pattern = _mm_shuffle_epi8(delimiters, input);
  __m128i inputc = _mm_and_si128(input, 
      _mm_shuffle_epi8(mask, input));
  int bitmask = _mm_movemask_epi8(
      _mm_cmpeq_epi8(inputc, pattern));
  uint16_t length = __builtin_ctz(bitmask);
  __m128i zero_mask = _mm_loadu_si128(
       (__m128i *)(zero_masks + 16 - length));
  __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  input = _mm_andnot_si128(zero_mask, inputlc);
  uint8_t idx = hash((uint64_t)_mm_cvtsi128_si64(input));
  *type = idx;
  __m128i compar = _mm_loadu_si128((__m128i *)buffers[idx]);
  __m128i xorthem = _mm_xor_si128(compar, input);
  return _mm_test_all_zeros(xorthem, xorthem);
}

We expect it to compile to branchfree code, as follows:

sse_type(char const*, unsigned char*):
  movdqu xmm1, XMMWORD PTR [rdi]
  mov edx, 16
  movdqa xmm2, XMMWORD PTR .LC0[rip]
  movdqa xmm0, XMMWORD PTR .LC1[rip]
  pshufb xmm2, xmm1
  pshufb xmm0, xmm1
  pand xmm0, xmm1
  pcmpeqb xmm0, xmm2
  por xmm1, XMMWORD PTR .LC2[rip]
  pmovmskb eax, xmm0
  bsf eax, eax
  cdqe
  sub rdx, rax
  movdqu xmm0, XMMWORD PTR zero_masks[rdx]
  pandn xmm0, xmm1
  movq rax, xmm0
  movq rdx, xmm0
  shr rax, 32
  xor eax, edx
  mov edx, 3523216699
  imul rax, rdx
  shr rax, 32
  mov BYTE PTR [rsi], al
  movzx eax, al
  sal rax, 4
  pxor xmm0, XMMWORD PTR buffers[rax]
  ptest xmm0, xmm0
  sete al
  ret

I wrote a benchmark where we repeatedly try to check for matching tokens, using many thousands random tokens (enough to prevent the processor from having trivial branch prediction). I run it on an Ice Lake server, and the code is compiled with GCC11 (targeting an old processor, Westmere).

technique CPU cycles/string instructions/string
binary search 236 335
trie 71 39
finite state 42 61
SIMD 15 39

In this particular test, the SIMD-based approach is four times faster than the trie despite the fact that it retires as many instructions. The trie struggles with branch mispredictions. The SIMD-based approach has a relatively high number of instructions retired per cycle (2.5). The binary search is disastrous in this case, being more than 10 times slower. The finite-state approach is interesting as it is only three times slower than the SIMD-based approach and significantly faster than the other non-SIMD approaches. It uses near twice as many instructions as  the trie, but it is nearly twice as fast. However, the finite-state approach requires a relatively large table, larger than the alternatives.

The trie can match the SIMD-based approach when the input is predictable. I can simulate this scenario by repeatedly trying to match a small number of tokens (say 100) always in the same order. I get that the trie can then be just as fast in this easy case.

My code is available. It could be easily ported to ARM NEON or to AVX-512.

Credit: The problem was suggested to me by Jeroen Koekkoek from NLnet Labs. He sketched part of the solution. I want to thank GitHub user @aqrit for their comments.

Note: The problem considered in this blog post is not the recognition of strings from a set. I have a blog post on that other topic: Quickly checking that a string belongs to a small set.

Further reading: Is Prefix Of String In Table? by Nelson and Modern perfect hashing for strings by Muła.

Stealth, not secrecy

The strategy for winning is simple: do good work and tell the world about it. In that order! This implies some level of stealth as you are doing the good work.

If you plan to lose weight, don’t announce it… lose the weight and then do the reveal.

Early feedback frames the problem and might derail you. You can end up feeling stuck in your path and it may increase your risk of failure.

If you want to feel free to let go of bad ideas, do not commit to them publicly.

Or you may feel like you already succeeded and take the “do good work” for granted: it is called hubris and smart people are susceptible to it.

The timing of the first reveal matters because people are lazy. If you reveal something that is flawed, it will be hard to correct this impression later. You are better off waiting a bit later and making sure you present good work. Furthermore, announcing your next product early can drain interest for your current products. Steve Jobs once said: “It’s really simple, if we tell people what our next product is, they stop buying our current products.”

But the reverse does not apply: the most secretive people are not those who have genuine good work upcoming. For every Apple, you have hundreds of companies that have nothing worth stealing. And they will make sure you never find out.1, 2, 3

“Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.” Howard Aiken.

1. Theranos had an insane culture of secrecy. The company lasted 15 years. It was once valued at $10 billion. The company only failed after John Ioannidis famously questioned the wild claims of the company while the rest of the medical community just went along. It turns out that there was no innovation, no technology worth having. The company was soon worth $0.

2. Apple is successful and secretive. But Apple is secretive about features that their competitors already have.

3. In Surprisingly Small: The Effect of Trade Secret Breaches on Firm Performance, Searl and Vivian (2021) find that security breaches have little effect on a firm’s performance. They write:

For policy makers, the implication is the narrative of trade secret theft as a fundamental threat to a country’s economy and innovation (in our case, the US), may simply be rhetoric when contrasted to the market’s understanding of such theft. Therefore, calls for the expansion of trade secrecy protections, such as the expansion of its definition or further criminalisation of trade secret theft, may do less to protect innovation overall and instead expand negative externalities such as increased litigation and constraints on labour mobility.
For managers, the implication is that the benefits of the protection of trade secrets may be overstated. Counterintuitively, the findings suggest managers should not prioritise trade secret protections and cybersecurity if the main goal is protecting shareholders.

 

Packing a string of digits into an integer quickly

Suppose that I give you a short string of digits, containing possibly spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the string matches the ordering of the integers.

We can use the fact that in ASCII, the digits have byte values 0x30, 0x31 and so forth. Thus as a sequence of bytes, the string "20141103 012910" is actually 0x32, 0x30, 0x31… so we must select the least significant 4 bits of each byte, discarding the rest. Intel and AMD processors have a convenient instruction which allows us to select any bits from a word to construct a new word (pext).

A problem remains: Intel and AMD processors are little endian, which means that if I load the string in memory, the first byte becomes the least significant, not the most significant. Thankfully, Intel and AMD can handle this byte order during the load process.

In C, the desired function using Intel intrinsics might look like this:

#include <x86intrin.h> // Windows: <intrin.h>
#include <string.h>

// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char* c) {
  uint64_t part1, part2;
  memcpy(&part1, c, sizeof(uint64_t));
  memcpy(&part2 , c + 7, sizeof(uint64_t));
  part1 = _bswap64(part1);
  part2 = _bswap64(part2);
  part1 = _pext_u64(part1, 0x0f0f0f0f0f0f0f0f);
  part2 = _pext_u64(part2, 0x0f000f0f0f0f0f0f);
  return (part1<<24) | (part2);
}

It compiles to relatively few instructions: only 4 non-load instructions. The memcpy calls in my code translate into 64-bit load instructions. The register loading instructions (movabs) are nearly free in practice.

movbe rax, QWORD PTR [rdi]
movbe rdx, QWORD PTR [rdi+7]
movabs rcx, 1085102592571150095
pext rax, rax, rcx
movabs rcx, 1080880467920490255
sal rax, 24
pext rdx, rdx, rcx
or rax, rdx

Prior to the AMD Zen 3 processors, pext had terrible performance on AMD processors. Recent AMD processors have performance on par with Intel, meaning that pext has a latency of about 3 cycles, and can run once per cycle. So it is about as expensive as a multiplication. Not counting the loads, the above function could nearly be complete in about 5 cycles, which is quite fast.

For ARM processors, you can do it with ARM NEON like so: mask the high nibbles, shuffle the bytes, then shift/or, then narrow (16->8), extract to general register.

#include <arm_neon.h>
// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char *c) {
  const uint8_t *ascii = (const uint8_t *)(c);
  uint8x16_t in = vld1q_u8(ascii);
  // masking the high nibbles,
  in = vandq_u8(in, vmovq_n_u8(0x0f));
  // shuffle the bytes
  const uint8x16_t shuf = {14, 13, 12, 11, 10, 9, 7, 6,
    5, 4, 3, 2, 1, 0, 255, 255};
  in = vqtbl1q_u8(in, shuf);
  // then shift/or
  uint16x8_t ins =
    vsraq_n_u16(vreinterpretq_u16_u8(in),
    vreinterpretq_u16_u8(in), 4);
  // then narrow (16->8),
  int8x8_t packed = vmovn_u16(ins);
  // extract to general register.
  return vget_lane_u64(vreinterpret_u64_u16(packed), 0);
}

It might compile to something like this:

adrp x8, .LCPI0_0
ldr q1, [x0]
movi v0.16b, #15
ldr q2, [x8, :lo12:.LCPI0_0]
and v0.16b, v1.16b, v0.16b
tbl v0.16b, { v0.16b }, v2.16b
usra v0.8h, v0.8h, #4
xtn v0.8b, v0.8h
fmov x0, d0

My source code is available.

Having fun with string literal suffixes in C++

The C++11 standard introduced user-defined string suffixes. It also added regular  expressions to the C++ language as a standard feature. I wanted to have fun and see whether we could combine these features.

Regular expressions are useful to check whether a given string matches a pattern. For example, the expression \d+ checks that the string is made of one or more digits. Unfortunately, the backlash character needs to be escaped in C++, so the string \d+ may need to be written as "\\d+" or you may use a raw string: a raw string literal starts with R"( and ends in )" so you can write R"(\d+)". For complicated expressions, a raw string might be better.

A user-defined string literal is a way to specialize a string literal according to your own needs. It is effectively a convenient way to design your own “string types”. You can code it up as:

myclass operator"" _mysuffix(const char *str, size_t len) {
  return myclass(str, len);
}

And once it is defined, instead of writing myclass("mystring", 8), you can write "mystring"_mysuffix.

In any case, we would like to have a syntax such as this:

bool is_digit = "\\d+"_re("123");

I can start with a user-defined string suffix:

convenience_matcher operator "" _re(const char *str, size_t) {
return convenience_matcher(str);
}

I want my convenience_matcher to construct a regular expression instance, and to call the matching function whenever a parameter is passed in parenthesis. The following class might work:

#include <regex>
struct convenience_matcher {
  convenience_matcher(const char *str) : re(str) {}
  bool match(const std::string &s) {
    std::smatch base_match;
    return std::regex_match(s, base_match, re);
  }
  bool operator()(const std::string &s) { return match(s); }
  std::regex re;
};

And that is all. The following expressions will then return a Boolean value indicating whether we have the required pattern:

 "\\d+"_re("123") // true
 "\\d+"_re("a23") // false
 R"(\d+)"_re("123") // true
 R"(\d+)"_re("a23") // false

I have posted a complete example. It is just for illustration and I do not recommend using this code for anything serious. I am sure that you can do better!

Parsing time stamps faster with SIMD instructions

In software, it is common to represent time as a time-stamp string. It is usually specified by a time format string. Some standards use the format %Y%m%d%H%M%S meaning that we print the year, the month, the day, the hours, the minutes and the seconds. The current time as I write this blog post would be 20230701205436 as a time stamp in this format. It is convenient because it is short, easy to read and if you sort the strings lexicographically, you also sort them chronologically.

You can generate time stamps using any programming language. In C, the following program will print the current time (universal, not local time):

#include <time.h>
#include <stdio.h>
int main() {
  char buffer[15];
  struct tm timeinfo;
  time_t rawtime;
  time(&rawtime);
  gmtime_r(&rawtime, &timeinfo);
  size_t len = strftime(buffer, 15, "%Y%m%d%H%M%S", &timeinfo);
  buffer[14] = '\0';
  puts(buffer);
}

We are interested in the problem of parsing these strings. In practice, this means that we want to convert them to an integer presenting the number of seconds since the Unix epoch. The Unix epoch is January 1st 1970. For my purposes, I will consider the time to be an unsigned 32-bit integer so we can represent time between 1970 and 2106. It is not difficult to switch over to a 64-bit integer or to signed integers.

The way you typically solve this problem is to use something like the C function strptime. Can we do better?

Modern processors have fast instructions that operate on several words at once, called SIMD instructions. We have a block of 14 characters. Let us assume that we can read 16 characters safely, ignoring the content of the leftover characters.

We load the block of digits in a SIMD register. We subtract 0x30 (the code point value of the character ‘0’), and all bytes values should be between 0 and 9, inclusively. We know that some character must be smaller than 9, for example, we generally cannot have more than 59 seconds and never 60 seconds, in the time stamp string. In Unix time, we never allow 60 seconds. So one character must be between 0 and 5. Similarly, we start the hours at 00 and end at 23, so one character must be between 0 and 2. We do a saturating subtraction of the maximum: the result of such a subtraction should be zero if the value is no larger. We then use a special instruction to multiply one byte by 10, and sum it up with the next byte, getting a 16-bit value. We then repeat the same approach as before, checking that the result is not too large.

The code might look as follow using Intel intrinsic functions:

 __m128i v = _mm_loadu_si128((const __m128i *)date_string);
v = _mm_sub_epi8(v, _mm_set1_epi8(0x30));
__m128i limit =
_mm_setr_epi8(9, 9, 9, 9, 1, 9, 3, 9, 2, 9, 5, 9, 5, 9, -1, -1);
__m128i abide_by_limits = _mm_subs_epu8(v, limit); // must be all zero
const __m128i weights = _mm_setr_epi8(
10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 0, 0);
v = _mm_maddubs_epi16(v, weights);
__m128i limit16 =
_mm_setr_epi16(99,99, 12, 31, 23, 59, 59, -1);
__m128i abide_by_limits16 = _mm_subs_epu16(v, limit16);
__m128i limits = _mm_or_si128(abide_by_limits16,abide_by_limits);
if (!_mm_test_all_zeros(limits, limits)) {
  return false;
}

It does not get all the parsing done, but at this point, you have the months, days, hours, minutes and seconds as valid binary integer values. The year is parsed in two components (the first two digits, and the next two digits).

We can just use standard C code for the result.

Is it fast? I wrote a benchmark that I compile using GCC 12 on an Intel Ice Lake Linux server.

instructions per stamp time per stamp
standard C with strptime 700 46
SIMD approach 65 7.9

We use about 10 times fewer instructions, and we go 6 times faster. That is not bad, but I suspect it is not nearly optimal.

Importantly, we do full error checking, and abide by the standard.

The source code is available.

Credit: Thanks to Jeroen Koekkoek from NLnetLabs for initial work and for proposing the problem, and to @aqrit for sketching the current code.

 

Dynamic bit shuffle using AVX-512

Suppose that you want to reorder, arbitrarily, the bits in a 64-bit word. This question was raised on Twitter by @experquisite. Formally, you might want to provide, for each of the 64 bit position, an original bit position you want to copy.

Hence, the following code would reverse the bit order in your 64-bit word:

uint64_t w = some value;
uint8_t indexes[64] = {63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51,
                       50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
                       37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25,
                       24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12,
                       11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
bit_shuffle(w, indexes); // returns a reversed version 

A naive way to do it in C might be as follows:

uint64_t slow_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  uint64_t out{};
  for (size_t i = 0; i < 64; i++) {
    bool bit_set = w & (uint64_t(1) << indexes[i]);
    out |= (uint64_t(bit_set) << i);
  }
  return out;
}

This might be an acceptable implementation, but what if you want do it using few instructions? You can do it on recent Intel and AMD processors with support for AVX-512 instructions. You go from the general-purpose register to a mask register, to a 512-bit AVX-512 register, you apply a shuffle (vpermb), you go back to a mask register and finally back to a general-purpose register.

The code with Intel intrinsic functions looks as follows:

uint64_t bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  __mmask64 as_mask = _cvtu64_mask64(w);
  __m512i as_vec_register =
  _mm512_maskz_mov_epi8(as_mask, _mm512_set1_epi8(0xFF));
  __m512i as_vec_register_shuf =
  _mm512_permutexvar_epi8(_mm512_loadu_si512(indexes), as_vec_register);
  return _cvtmask64_u64(_mm512_movepi8_mask(as_vec_register_shuf));
}

It might compile to about six instructions:

kmovq k0, rdi
vpmovm2b zmm0, k0
vmovdqu8 zmm1, ZMMWORD PTR [rsi]
vpermb zmm0, zmm1, zmm0
vpmovb2m k1, zmm0
kmovq rax, k1

As one reader points out, you can do better because AVX-512 has a dedicated instruction for bit shuffling which directly returns a mask and work directly from the 64-bit word as long as it is loaded in a vector register:

uint64_t faster_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
  __m512i as_vec_register = _mm512_set1_epi64(w);
  __mmask64 as_mask = _mm512_bitshuffle_epi64_mask(as_vec_register,
     _mm512_loadu_si512(indexes));
  return _cvtmask64_u64(as_mask);
}

The resulting assembly is quite short:

vpbroadcastq zmm0, rdi
vpshufbitqmb k0, zmm0, ZMMWORD PTR [rsi]
kmovq rax, k0

Loading your indexes is likely to have a long latency, so if you can buffer the load (_mm512_loadu_si512(indexes)), you will reduce significantly the latency.

I have an implementation in C++.

Science and Technology links (June 25 2023)

  1. Women in highly religious relationships report the highest levels of relationship quality.
  2. US politics is largely divided into two parties (Republicans and Democrats). People who are affiliated with the Republicans have many more kids.
  3. The Antartic ice shelves gained 661 gigaton of ice over the past decade.
  4. A low protein diet increases mortality among older men.
  5. Consuming magnesium may preserve your brain, especially if you are a woman.
  6. A third of the planet around common stars in our galaxy could be in the habitable zone.
  7. Taurine (a supplement) might keep you young. It works, in some ways, in different animals. McGaunn et al. (2023) make a case for human beings, but they do not have much in the way of quality clinical trials.
  8. Older people often lose their thymus, a gland that plays an important role in your immune system. Sandstedt et al. report that about two-third of their middle-age subjects had lost entirely the thymus (complete degeneration). However, a small percentage (7%) has maintained the thymus. These lucky people had much lower abnominal fat. The authors conclude that obesity might be driving immunological aging.
  9. Online shaming is motivated by schadenfreude (pleasure derived from the misfortunes of others.).
  10. Obese people tend not to feel satisfied even after eating. This phenomenon remains even after they have lost weight.
  11. Oral contraceptives may increase the risk of depression (HR=1.7).
  12. Trees are growing at higher and higher altitudes.
  13. Female conservative candidates are usually more attractive. Conservative candidates are more likely to express happiness.
  14. 80 % of participants rated their intelligence as above average.

Citogenesis in science and the importance of real problems

Scientists publish papers in refereed journals and conferences: they write up their results and we ask anonymous referees to assess it. If the work is published, presumably because the anonymous referees found nothing objectionable, the published paper joins the “literature”.

It is not a strict requirement: you can do excellent research without publishing in refereed journals. Einstein refused to publish in refereed journals. The famous computer scientist Dijkstra mostly published his work in the form of letters he would send to his friends: today, we would refer to this model as a blog. Dijkstra invited computer scientists to become independent of peer review as he viewed the peer review process as a path toward mediocrity. More recently, the folks from OpenAI appear to mostly publish unrefereed online papers. Yet OpenAI has probably produced the greatest scientific breakthrough of the decade.

Unfortunately, some people confuse “science” with the publication of refereed papers. They may also confuse our current knowledge with what is currently published in refereed journals.

Many papers in Computer Science tell the following story:

  • There is a pre-existing problem P.
  • There are few relatively simple but effective solutions to problem P. Among them is solution X.
  • We came up with a new solution X+ which is a clever variation on X. It looks good on paper.
  • We ran some experiments and tweaked our results until X+ looked good. We found a clever way to avoid comparing X+ and X directly and fairly, as it might then become obvious that the gains are small, or even negative! They may think: We would gladly report negative results, but then our paper could not be published. Some years ago, I attended a talk by a highly productive research who was providing advice to the students: never run an experiment unless you are sure you can turn it into a positive result.

It seems hard to believe that you can make sure that all your ideas turn out to be correct. But it is not so difficult. A popular approach to get positive results is to use a model as validation. Testing in the real world takes a lot of effort and your results could be negative, so why bother? Even when running experiments in the real world, there are many ways to cheat to ensure you get the result you need.

It looks harmless enough: just people trying to build up their careers. But there might be real harm down the line. Sometimes, especially if the authors are famous and the idea is compelling, the results will spread. People will adopt X+ and cite it in their work. And the more they cite it, the more enticing it is to use X+ as every citation becomes further validation for X+. And why bother with algorithm X given that it is older and X+ is the state-of-the-art?

Occasionally, someone might try both X and X+, and they may report results showing that the gains due to X+ are small, or negative. But they have no incentive to make a big deal of it because they are trying to propose yet another better algorithm (X++).

This process is called citogenesis. It is what happens when the truth is determined solely by the literature, not by independent experiments. Everyone assumes, implicitly, that X+ is better than X. The beauty of it is that you do not even need for anyone to have claimed so. You simply need to say that X+ is currently considered the best technique.

Some claim that science is self-correcting. People will stop using X+ or someone will try to make a name for himself by proving that X+ is no better and maybe worse than X. But in a business of science driven by publications, it is not clear why it should happen. Publishing that X+ is no better than X is an unimpressive negative result and those are rarely presented in prestigious venues.

Of course, the next generation of scientist may have an incentive to displace old ideas. There is an intergenerational competition. If you are young and you want to make a name for yourself, displacing the ideas of the older people is a decent strategy. So it is credible that science may self-correct, one funeral at a time:

A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die (…) An important scientific innovation rarely makes its way by gradually winning over and converting its opponents (…) What does happen is that its opponents gradually die out, and that the growing generation is familiarized with the ideas from the beginning: another instance of the fact that the future lies with the youth.

— Max Planck, Scientific autobiography, 1950, p. 33, 97
There are cases where self-correction does happen: if there is a free market and the scientific result can make a difference by offering a better product or a better service. That is why computer science has made so much progress, so rapidly: we have clients that will adopt our ideas if they are worthwhile. Once your new idea is in the smartphones, it is hard to deny that it works. Similarly, we know that Physics work because we have nuclear bombs and space exploration. What keeps us honest are the real problems.

John Regehr made a similar point about our inability to address mistakes in the literature:

In many cases an honest retrospective would need to be a bit brutal, for example to indicate which papers really just were not good ideas (of course some of these will have won best paper awards). In the old days, these retrospectives would have required a venue willing to publish them, (…), but today they could be uploaded to arXiv. I would totally read and cite these papers if they existed (…)

But there is hope! If problem P is a real problem, for example, a problem that engineers are trying to solve, then you can get actual and reliable validation. Good software engineers do not trust research papers: they run experiments. Is this algorithm faster, really? They verify.

We can actually see this effect. Talk to any Computer Scientist and he will tell you of clever algorithms that have never been adopted by the industry. Most often, there is an implication that industry is backward and that it should pay more attention to academic results. However, I suspect that in a lot of cases, the engineers have voted against X+ and in favor of X after assessing them, fairly and directly. That is what you do when you are working on real problems and really need good results.

It gets trickier in fields such a medicine because success may not ever be measured. Do you know if your doctor is more likely to cure you than another doctor? They may not even know themselves. So you need to work on problems where people measure the results, and where they have an incentive to  adopt ideas that work. In effect, you need real problems with people who have skin in the game.

We fall too often for the reification fallacy. A science paper is a like a map. You can create a map of a whole new territory, even if you have never been there. The new territory might exist or it might not exist, or it could be quite different from what you have described. If you never visit the territory, if nobody visits the territory in question, then you may never find out about your mistake. In effect, we need actual explorers, not just map makers.
 
And adding more map makers, and more maps, does not help. In some sense, it makes things harder. Who is going to compare all these maps? We need more explorers.

Science and Technology links (June 11 2023)

  1. Similar species can have vastly different lifespan. Researchers have been looking for the limiting factors that explain these differences. As we age, our genes are expressed differently through methylation. Different species vary their methylation at different speeds. There is some evidence that it could be that the methylation rate that determines the lifespan. If that is true, then it suggests that any process that could reset or slow methylation could improve lifespan.
  2. Employers regard female applicants as more suitable for jobs in female-dominated occupations, while they do not regard male applicants as more suitable anywhere.
  3. Supplementing older people with Glycine and N-Acetylcysteine improves and reverses multiple age-associated abnormalities.
  4. According to Audretsch et al., academic freedom has recently been declining. It seems that it is having a measurable effect on innovation.
  5. Fynn-Paul (2023) questions the idea that Europeans committed genocide when they arrived in America:

    In the United States, where the native population might have approached 2,000,000 individuals prior to Christopher Columbus’ arrival, widely-accepted tallies show that the total number of natives massacred by whites prior to 1848 amounted to less than 8,000 individuals

  6. The Lincoln Sea borders northern Greenland and Canada, and it is covered by ice all year long. According to climate models, it could transition to seasonal sea-ice, as it were 10,000 years in the early Holocene. For reference, we believe that human beings were present in American 30,000 years ago.
  7. We do not appear to be running out of exhaustible resources, like minerals.
  8. Consuming more magnesium might preserve your brain as you age. Chocolate is rich in magnesium.
  9. According to Krispenz and Bertrams (2023), strong ideological view, according to which a violent revolution against existing societal structures is legitimate (i.e., anti-hierarchical aggression), was associated with antagonistic narcissism and psychopathy. They conclude: some leftist political activists do not actually strive for social justice and equality but rather use political activism to endorse or exercise violence against others to satisfy their own ego-focused needs.
  10. DiMarco and Savitz (2023) estimate that 17% of all heterosexual women have coerced a man into having sex.

Parsing IP addresses crazily fast

Most of us are familiar with IP addresses: they are strings typically of the form “ddd.ddd.ddd.ddd” where ddd is a decimal number of up to three digits in the range 0 to 255. For example, 127.0.0.1 or 192.168.0.2.

Each of the four number is a byte value, and the address is an IPv4 network address that fits in 32 bits. There is a more recent extension (IPv6) that spans 128 bits, but the conventional IPv4 addresses are common. They can be part of URLs (e.g., http://127.0.0.1/home).

I have a blog post on how you can go from the 32-bit binary value to the string quickly: Serializing IPs quickly in C++. It turns out that you can write the strings very, very fast.

For our fast URL parsing library, I also wrote the counterpart, where we go from the string to the 32-bit value. Our URL parsing library supports various IP address formats and lengths. However, I did not optimize it particularly well because there are many other challenges in efficient URL parsing.

Recently, Jeroen Koekkoek brought to my attention that my friend Wojciech has a recent article on the parsing of IP addresses. As usual, it is brillant. Koekkoek is working on accelerating DNS record parsing in his simdzone project and we expect to parse a lot of IP addresses.

Wojciech’s results are slightly depressing, however. He suggests that you can beat a standard approach by a factor of 2 or 2.5 in speed, which is excellent, but at the cost of relatively large tables.

Let summarize the strategy developed by Wojciech:

  1. Identify the location of the dots.
  2. Construct an integer (a dot “mask”) which has its bits set according to the location of the dots (if a dot is present as the second character, we set the second bit to 1). You can check that there are only 81 possible masks, after adding a virtual dot at the end of the address.
  3. Build a function which maps this dot mask to precomputed values which can be used to reorganize the digits and process them with single instruction, multiple data (SIMD) instructions.

Wojciech implemented many variations. Generally, he distinguishes between the case where we have no 3-digit numbers (e.g., 12.1.0.1), and so forth. I don’t expect that this optimization is very useful in practice. He came up with two different mapping functions, one that is slow but requires little memory, and one that is faster but requires a fair amount of memory. After some effort, I was able to compute with a compromise that is nearly as fast as his fastest approach, but without using much more memory than his slow routine. I end up using about 255 bytes for the mapping function, which is reasonable.

He has also separate steps to validate the input: i.e., ensure that we are only dealing with digits and dots. I feel that this can be accomplished as part of the computation. He loads the expected dot mask as part of the validation, but I expect that you can more simply validate the input by comparing the expected length of the address (which we get in any case). Overall, I only need 81 times sixteen bytes of data to do the main processing. All in all, I use nearly only half as much storage as Wojciech’s fastest routine with the expectation that I can go faster because I have simplified a few steps.

A standard C function (under Linux and similar system) to parse IP addresses is inet_pton. It is similar in performance to the routines that Wojciech tested against. I called my own function (coded in C), sse_inet_aton. Unlike the standard function, mine will always read sixteen  bytes, though it will only process the specified number of bytes. This means that the input string must either be part of a large string or you must overallocate. This limitation was already there in Wojciech’s code, and I could not find a good way around it. If you have a fancy new processors with advanced SIMD instructions (AVX-512 or SVE), there are masked load and store instructions which solve this problem nicely. But like Wojciech, I decided to stick with old school SIMD instructions.

For my benchmark, I just generate a lot of random IP addresses, and I try to parse them as quickly as possible. I use an Intel IceLake server with GCC 11 and clang (LLVM 16). My source code is available.

instructions/address speed addresses per s
inet_pton 300 0.38 GB/s 35 million
sse_inet_aton (LLVM 16) 52 2.7 GB/s 200 million
sse_inet_aton (GCC 11) 63 2.3 GB/s 170 million

So the optimized function is six times faster than the standard one using GCC. Switching to clang (LLVM), you go seven times faster. The fact that LLVM has such an edge over GCC warrants further examination.

I suspect that my code is not nearly optimal, but it is definitively worth it in some cases in its current state. I should stress that the optimized function includes full validation: we are not cheating.

Future work: The code has the hard-coded assumption that you have a Linux or macOS system with an SSE 4.1 compatible processor: that’s virtually all Intel and AMD processors in operation today. The code could be faster on AVX-512 machines, but I leave this for future work. I have also not included support for ARM (through NEON instructions), but it should be easy. Furthermore, I only support the conventional IPv4 addresses in its most common format (ddd.ddd.ddd.ddd) and I do not support IPv6 at all.

Science and Technogy links (June 3 2023)

    1. There are fewer serial killers these days. Some suggests it is due to better forensic techniques: we catch the killers faster and more efficiently.
    2. Between the beginnings of the Web (1996) and today, the household Internet connection bandwidth got over 5000 times faster. Thus what took nearly two hours of download time back then can now be downloaded in a second.
    3. Managers make a difference: The abilities of retail store managers explain most of the store-to-store variability in productivity.
    4. Rausch et al. report that the latest generation of college students favours emotional well-being, and is less interested in academic rigor.
      But female and male students differ in what they favour. Males tend to value academic freedom, while females value emotional well-being.
    5. Switching away from sugar-sweetened beverage to water or unsweetened beverage had little effect over a twelve-week period. However, switching to water had a significant effect in that it reduce how much people sought sugar. I have switched my own diet away from sugar many years ago, and I find sugar in coffee offensive. This may be surprising to some but sugar is an acquired taste.
    6. Giving old mice both glycine and N-acetylcysteine (NAC) reverses age-associated cognitive decline. Glycine is a cheap and safe supplement which you may take right now. NAC is also available as a supplement.
    7. The rate at which people die of a heart attack has fallen massively: Researchers found the overall rate of death from heart attack, adjusted for age, fell from about 87 deaths per 100,000 people in 1999 to about 38 deaths per 100,000 people in 2020. (source) Unfortunately, there was a setback during the pandemic era.
    8. There is ongoing research to find out why some dogs tilt their head when hearing a verbal command.
    9. Eat less might be an effective way to improve your cardiovascular health: Weight loss was found to promote atherosclerosis resolution independent of plasma cholesterol. It seems that as you reduce your food intake, macrophages help clear out your arteries, at least if you are a mouse.
    10. Consuming large volumes of vitamin C might clear your allergies (speculative).
    11. White people tend to avoid Black people on the street in New York City.
    12. About 12% of total atmospheric CO2 in 2018 was due to fossil fuels.
    13. Organic agriculture could support up to 4.7 billion human beings. There are 8 billion people on Earth currently.
    14. An increase in regulatory costs results in lower sales for small firms, and higher sales for large firms. In other words, government regulations lead to market concentration in the hands of fewer large players.
    15. Start-up founders who are low in conscientiousness are at disavantage when first raising capital, but they have an overall advantage when it comes to success.

Peak credentialism

How much is a degree from a prestigious university worth? The answer is a bit difficult to answer because there are many cofounding factors: people from the connected class  (folks that ‘know people’) tend to attend the most prestigious universities, and they also tend to do well professionally. It is likely that highly connected people do well no matter what.

Sorting it out is difficult, but we should remember that the objective of an employer is often to hire the most qualified person. The credentials are only an approximation. In jobs where it is easy to identify the highly productive individuals, you would expect that credentials would have less value.

In After credentials (2008), Paul Graham wrote:

Credentials are a step beyond bribery and influence. But they’re not the final step. There’s an even better way to block the transmission of power between generations: to encourage the trend toward an economy made of more, smaller units. Then you can measure what credentials merely predict.

The era of credentials began to end when the power of large organizations peaked in the late twentieth century. Now we seem to be entering a new era based on measurement. The reason the new model has advanced so rapidly is that it works so much better. It shows no sign of slowing.

The New York Times told us in 2017 about how tech jobs rely on measuring skills rather than degrees. They wrote:

In the last two years, nearly a third of IBM’s new hires there and in a few other locations have not had four-year college degrees. IBM has jointly developed curriculums with the local community college, as well as one-year and two-year courses aligned with the company’s hiring needs.

My oldest son will be attending college next fall, studying in computer science. We reviewed job ads together. It was interesting that few job ads in technology made much of a case for education. Unsurprisingly, the more senior the position, the less likely the job was to require specific education.

In the United Kingdom, the most prestigious universities are part of the Russell Group. Klein (2021) finds that graduating from these universities is not especially advantageous for most students, except maybe for those from low backgrounds:

The findings show no discernible differences in occupational prestige between graduates from Russell Group universities and other universities once conditioning on covariates. When considering career trajectories, graduates from Russell Group universities had no advantages at immediate labor market entry but gained somewhat higher occupational prestige levels than students from other universities in the first two years since their first significant job. This advantage remained stable until their sixth year in the labor market. After six years in the labor market, however, graduates from other universities had a steeper career progression and caught up with their peers from Russell Group universities.

Looking a Germany, Lang and Schwabe (2021) find that attending ‘excellent’ universities is not as helpful as one might think:

Applying a difference-in-differences approach in combination with a simulation study, we do not identify a statistically significant excellence premium in the wages for graduates of ‘universities of excellence’.

Thus there is evidence that Graham was correct in 2008: we are moving beyond credentials. In effect, you probably need to worry less today about which college you tend, or even whether you should attend college. Keep in mind that after you have had a few real jobs, few people will care about your degrees. After all, 8% of CEOs have never even completed college.

Expected performance of a Bloom filter

A hash function is a function that maps a value (such as a string) to an integer value. Typically, we want random-looking values.

A Bloom filter is a standard data structure in computer science to approximate a set. Basically, you start with a large array of bits, all initialized at zero. Each time you want to add an element to the set, you compute k different hash values and you set the bits at the k corresponding locations to one.

A fun application of a Bloom filter is a compromised password application: given a large set of passwords, you want a concise data structure that indicates whether a password belongs to the set.

To check whether an element is in the set, you compute, again, k different hash values and you check the bits at the corresponding k locations. If the value is in the set, then all k bits have to have value 1. If not, then you know for sure that the value is not in the set.

The way we naturally implement a Bloom filter is through a loop, where we check for each bit value in the sequence, stopping as soon as a 0 bit is found.

Unfortunately, it is possible for the Bloom filter to be wrong: all the k bits could be 1s out of luck. If that happens, you have a false positive. The false positives are often not a problem in practice because you can prune them out with a secondary application. For example, if you want to make sure that the password has been compromised, you may simply look it up in a conventional database. The Bloom filter will still handle most cases.

By spending about 12 bits per value (12 bits time the size of the set) and using 8 hash functions, the probability of a false positive is about 1/256 (or 0.3%) which is reasonable.

If you know a bit about software performance, the 8 bits could be concerning. Looking up 8 values at random location in memory is not cheap. And, indeed, when the element is in the set, you must check all 8 locations. It is expensive.

What if the value is not in the set, however? If your Bloom filter is configured optimally, for the lowest false-positive rate given a storage budget per element, about 50% of all bits in the array of bits are 1s.

How many random bits must we access before we find a 0 in such a case? We stop after one bit 1/2 the time, then a quarter of the time after two bits, and so forth. So the expected value is 1/2 + 1/4*2 + 1/8*3 + 1/16 *4 +… which tends towards 2. The answer does not depend much on the number of hash functions k.

This means that if you are using a Bloom filter in a scenario where the values are likely not in the set, you can get away with very few memory accesses.

Of course, the processor is not going to naively load each bit one by one. It will do speculative loads. Thankfully, we can use performance counters to measure the actual number of loads.

Another concern for programmers who are interested in performance is the number of mispredicted branches. Interestingly, the relationship is reversed: if your elements are not in the set, then you face unpredictable branches (so a bad performance) but if they are in the set, then the branches are perfectly predictable.

Let us consider the performance metrics of the Bloom filter when we have a false positive (miss) and an actual positive (hit). I use an Intel Ice Lake server with a filter that exceeds the CPU cache, and I record CPU performance counters.

number of hash functions bits per values cache misses per miss cache misses per hit mispredict per miss mispredict per hit
8 12 3.5 7.5 0.95 0.0
11 16 3.8 10.5 0.95 0.0

For the misprediction tests, I use either the regime where all values are in the sets, or no value is in the set.

What is the performance of both hits and misses? We find that the performance of a miss does not depend too much on the number of hash functions, as you would expect. The cost of a hit grows roughly in proportion of the number of hash functions, as you would expect.

number of hash functions bits per values cycles/miss cycles/hit
8 12 135 170
11 16 140 230

You can reproduce my results with the fastfilter_cpp.

Let us now consider, for the same problem size, binary fuse filters (Go implementation, C single-header implementation, Rust implementation, zig implementation, Julia implementation). They use a slightly different approach (see paper): we can either use a flat 3 accesses per value or a flat 4 accesses per value.

cache misses per value mispredict per value
3-wise binary fuse 2.8 0.0
4-wise binary fuse 3.7 0.0

So the 4-wise filters has about as many cache misses as the Bloom filter. Yet they are significantly faster than Bloom filters.

cycles per access
3-wise binary fuse 85
4-wise binary fuse 100

The absurd cost of finalizers in Go

The Go programming language makes it easy to call C code. Suppose you have the following C functions:

char* allocate() {
  return (char*)malloc(100);
}
void free_allocated(char *c) {
  free(c);
}

Then you can call them from Go as follows:

c := C.allocate()
C.free_allocated(c)

It works well.

You might argue that my functions are useless, but I designed them to be trivial on purpose. In practice, you will call C code to do something actually useful. Importantly, there is an allocation followed by a necessary deallocation: this is typical in realistic code.

Reasonably, Go programmers are likely to insist that the memory allocated from Go be released automatically. Thankfully, Go has a mechanism for this purpose. You put the C data in a Go structure which is subject to automatic garbage collection. Then you tie the instances of this structure to a “finalizer” function which will be called before the Go structures is garbage collected. The code might look as follows:

type Cstr struct {
  cpointer *C.char
}

func AllocateAuto() *Cstr {
  answer := &Cstr{C.allocate()}
  runtime.SetFinalizer(answer, func(c *Cstr) {  C.free_allocated(c.cpointer); runtime.KeepAlive(c) })
  return answer
}

 

So far so good. Go is doing very well up until now.

But what is the performance impact? We are comparing these two routines. First, the inconvenient version where you manually have to free the allocated memory…

p := Allocate()
Free(p)

and then the version which relies on Go’s memory management…

AllocateAuto()

Let us benchmark it. My benchmarking code is available, your results will differ from mine but I care only about the big picture.

In my case, the automated version is nearly ten times slower.

AllocateAuto 650 ns
Allocate-Free 75 ns

The 650 ns result is silly: it is thousands of CPU cycles.

Maybe it is the overhead due to garbage collection ? Thankfully, Go allows us to disable garbage collection with GOGC=off:

AllocateAuto (no GC) 580 ns
Allocate-Free (no GC) 75 ns

So the numbers are slightly better, but barely so.

We can try to see where the problem lies with profiling:

go test -cpuprofile gah -bench BenchmarkAllocateAuto -run -
go tool pprof gah
> top

We get that most of the processing time is spent in runtime.cgocall:

     1.67s 70.17% 70.17%      1.67s 70.17%  runtime.cgocall
     0.23s  9.66% 79.83%      0.23s  9.66%  runtime.usleep
     0.12s  5.04% 84.87%      0.12s  5.04%  runtime.pthread_cond_signal

What if I try a dummy finalizer?

func AllocateDummy() *Cstr {
 answer := &Cstr{C.allocate()}
 runtime.SetFinalizer(answer, func(c *Cstr) {})
 return answer
}

I get the same poor performance, suggesting that it is really the finalizer that is expensive.

This is seemingly consistent with Java, which also has finalizers:

Oh, and there’s one more thing: there is a severe performance penalty for using finalizers. On my machine, the time to create and destroy a simple object is about 5.6ns. Adding a finalizer increases the time to 2,400ns. In other words, it is about 430 times slower to create and destroy objects with finalizers. (Effective Java 2nd Edition: Item 7: Avoid finalizers)

Maybe there is a way to do better, I hope there is, but I suspect not.

Further reading. Some notes on the cost of Go finalizers (in Go 1.20) by Chris Siebenmann.

Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)

While most of our software relies on Unicode strings, we often still encounter legacy encodings such as Latin 1. JavaScript strings are often stored as either Latin 1, UTF-8 or UTF-16 internally. Before we convert Latin 1 strings to Unicode (e.g., UTF-8), we must compute the size of the UTF-8 string. It is fairly easy: all ASCII characters map 1 byte to 1 byte, while other characters (with code point values from 128 to 256) map 1 Latin byte to 2 Unicode bytes (in UTF-8).

Computers represent strings using bytes. Most often, we use the Unicode standard to represent characters in bytes. The universal format to exchange strings online is called UTF-8. It can represent over a million characters while retaining compatibility with the ancient ASCII format.

Though most of our software stack has moved to Unicode strings, there are still older standards like Latin 1 used for European strings. Thus we often need to convert Latin 1 strings to UTF-8. It is useful to first compute the size (in bytes) of the eventual UTF-8 strings. You can code a simple C function to compute the UTF-8 size from the Latin 1 input as follow:

size_t scalar_utf8_length(const char * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i<len; i++) {
    if((c[i]>>7)) { answer++;}
  }
  return answer + len;
}

In Computing the UTF-8 size of a Latin 1 string quickly (AVX edition), I reviewed faster techniques to solve this problem on x64 processors.

What about ARM processors (as in your recent MacBook)?

Keyhan Vakil came up with a nice solution with relies on the availability for “narrowing instructions” in ARM processors. Basically you can take a 16-byte vector registers and create a 8-byte register (virtually) by truncating or rounding the results. Conveniently, you can also combine bit shifting with narrowing.

Consider pairs of successive 8-bit words as a 16-bit word. E.g., if the 16 bits start out as aaaaaaaabbbbbbbb then a shift-by-four-and-narrow creates the byte value aaaabbbb. Indeed, if you shift a 16-bit word by 4 bits and keep only the least significant 8 bits of the result, then

  1. the most significant 4 bits from the second 8-bit word become the least significant 4 bits in the result
  2. and the least significant 4 bits from the first 8-bit word become the most significant 4 bits.

This is convenient because vectorized comparison functions often generate filled bytes when the comparison is true (all 1s). The final algorithm in C looks as follows:

uint64_t utf8_length_kvakil(const uint8_t *data, uint32_t length) {
  uint64_t result = 0;
  const int lanes = sizeof(uint8x16_t);
  uint8_t rem = length % lanes;
  const uint8_t *simd_end = data + (length / lanes) * lanes;
  const uint8x16_t threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80)
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // shift and narrow
    uint8x8_t highbits = vshrn_n_u16(vreinterpretq_u16_u8(withhighbit), 4);
    // we have 0, 4 or 8 bits per byte
    uint8x8_t bitsperbyte = vcnt_u8(highbits);
    // sum the bytes vertically to uint16_t
   result += vaddlv_u8(bitsperbyte);
  }
  result /= 4; // we overcounted by a factor of 4
  // scalar tail
  for (uint8_t j = 0; j < rem; j++) {
    result += (simd_end[j] >> 7);
  }
  return result + length;
}

Can you beat Vakil? You can surely reduce the instruction count but once you reach speeds like 20 GB/s, it becomes difficult to go much faster without hitting memory and cache speed limits.

Pete Cawley proposed a simpler algorithm which avoids the narrowing shifts, and does a vertical addition instead:

uint64_t utf8_length_simpler(const uint8_t *data, uint32_t length) {
  uint64_t result = 0;
  const int lanes = sizeof(uint8x16_t);
  uint8_t rem = length % lanes;
  const uint8_t *simd_end = data + (length / lanes) * lanes;
  const uint8x16_t threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80)
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // vertical addition
    result -= vaddvq_s8(withhighbit);
  }
  // scalar tail
  for (uint8_t j = 0; j < rem; j++) {
    result += (simd_end[j] >> 7);
  }
  return result + length;
}

Are the hand-tuned NEON functions fast?

On my Apple M2, they are three times as fast as what the compiler produces from the scalar code on large enough inputs. Observe that the compiler already relies on vector instructions even when compiling scalar code.

scalar code ~6 GB/s
NEON code (both functions) ~20 GB/s

On my Apple laptop, both NEON functions are equally fast. Using Graviton 3 processors on AWS (with GCC 11), I can tell them apart:

scalar code ~7 GB/s
NEON code (Vakil) ~27 GB/s
NEON code (Cawley) ~30 GB/s

The Cawley function is slightly better. My source code is available. Your results will vary.

Update: if you check out my source code, you will find two new versions that are quite fast. It seems that there is a lot of room for optimization on this problem.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

ARM instructions do “less work”?

Modern processors can execute several instructions per cycle. Because processors cannot easily run faster (in terms of clock speed), vendors try to get their processors to do more work per cycle.

Apple processors are wide in the sense that they can retire many more instructions per cycle than comparable Intel or AMD processors. However, some people argue that it is unfair because ARM instructions are less powerful and do less work than x64 (Intel/AMD) instructions so that we have performance parity.

Let us verify.

I have a number parsing benchmark that records the number of cycles, instructions and nanosecond spent parsing numbers on average. The number of instructions and cycles is measured using performance counters, as reported by Linux. I parse a standard dataset of numbers (canada.txt), I keep the fast_float numbers (ASCII mode).

system instructions per float cycles per float instructions per cycle
Intel Ice Lake, GCC 11 302 64 4.7
Apple M1, LLVM 14 299 45 6.6

Of course, that’s a single task, but number parsing is fairly generic as a computing task.

Looking the assembly output often does not reveal a massive benefit for x64. Consider the following simple routine:

// parses an integer of length 'l'
// into an int starting with value
// x.
for(int i = 0; i < l; i++) {
  x = 10 * x + (c[i]-'0');
}
return x;

LLVM 16 compiles this to the following optimized ARM assembly:

start:
 ldrb w10, [x1], #1
 subs x8, x8, #1
 madd w10, w0, w9, w10
 sub w0, w10, #48
 b.ne start

Or the following x64 assembly…

start:
  lea eax, [rax + 4*rax]
  movsx edi, byte ptr [rsi + rdx]
  lea eax, [rdi + 2*rax]
  add eax, -48
  inc rdx
  cmp rcx, rdx
  jne start

Though your mileage will vary, I find that for the tasks that I benchmark, I often see as many ARM instructions being retired than x64 instructions. There are differences, but they are small.

For example, in a URL parsing benchmark, I find that ARM requires 2444 instructions to parse a URL on average, against 2162 instructions for x64: a 13% benefit for x64. That’s not zero but it is not a massive benefit that overrides other concerns.

However, Apple processors definitively retire more instructions per cycle than Intel processors.