Fast Buffer-to-String conversion in JavaScript with a Lookup Table

When programming in a JavaScript environment such as Node.js, you might recover raw data from the network and need to convert the bytes into strings. In a system such as Node.js, you may represent such raw bytes using a Buffer instance.

You can conveniently convert a Buffer instance into a JavaScript string (mybuffer.toString()). But, maybe surprisingly, creating new strings can be a bottleneck. Thus a worthwhile optimization might be to try to recognize that your incoming bytes match one of a list of known strings. This is not a problem unique to JavaScript.

One example of such a problem occurs when parsing HTTP headers. These headers contain common strings such as ‘accept’, ‘accept-encoding’, ‘content-encoding’, ‘content-language’, ‘content-length’, ‘from’, ‘host’, etc. If we can recognize the bytes as one of these strings, we can just point at the existing strings. To make things more complicated, we might want to ignore the case, so that both inputs ‘Accept’ and ‘ACCEPT’ should be mapped to ‘accept’.

This problem has been addressed recently in a library called undici. This library provides Node.js with an HTTP/1.1 client. GitHub user tsctx initially proposed solving this problem using a trie implemented with JavaScript objects. A trie is a type of data structure that is used to store and search for strings in an efficient way. In its simplest implementation (sometimes called a digital search trie), we first branch out on the first character, and each possible character becomes a new tree based on the second character, and so forth. The last node of each string is marked as the end of the word, and may also store some additional information, such as a pointer to a desired output. The problem with such an approach is that it can use a lot of memory. And, indeed, I estimated that tsctx’s implementation might cost between 1 MB and 2 MB of memory. Whether that is a concern depends on your priorities. Following my comments, user tsctx opted for a new strategy that may be less time efficient, but that is significantly more economical memory-wise: a ternary tree. A ternary tree is similar to a binary tree, but each node can have up to three children, usually called left, mid, and right.

I think that tsctx’s solution is excellent, but it is sometimes important to compare with a few competitors.

I decided to implement my own approach based on the observation that it is fairly easy to quickly identify a candidate using solely the length of the input. For example, there is only one candidate string of length 2: ‘TE’. So it makes sense to write code like this:

function toLowerCase(s) {
  var len = s.length;
  switch (len) {
    case 2:
      // check that the buffer is equal to te and return it if so
      if ((s[0] == 116 || s[0] == 84) && (s[1] == 101 || s[1] == 69)) {
        return "te";
      }

This code is a function that takes a buffer of bytes as input and returns the matching lowercase string, when there is one. The function works as follows: It declares a variable called len and assigns it the value of the length of the input s. It uses a switch statement to check the value of len and execute different code blocks depending on the case.
In this example, the function only has one case, which is 2. In the case of 2, the function checks that the input string is equal to “te” or “TE” or “Te” or “tE”. It does this by comparing the ASCII codes of the characters in the string. The ASCII code of t is 116, the ASCII code of T is 84, the ASCII code of e is 101, and the ASCII code of E is 69. The function uses the logical operators || (or) and && (and) to combine the conditions. If the input string matches any of these four combinations, the function will return “te”. Here is an example of how the function works: If the input string is “te”, the function will return “te”. If the input string is “TE”, the function will return “te”. If the input string is “Te”, the function will return “te”. If the input string is “tE”, the function will return “te”. If the input string is “ta”, the function will continue.

If the buffer has length 3, then I have four possible candidate strings (age, ect, rtt, via). I can differentiate them by looking only at the first character. The logic is much the same:

case 3:
  switch (s[0]) {
   case 97: case 65:
    // check that the buffer is equal to age and return it if so
    if ((s[1] == 103 || s[1] == 71) && (s[2] == 101 || s[2] == 69)) {
      return "age";
    }
    break;
   case 101:case 69:
    // check that the buffer is equal to ect and return it if so
    if ((s[1] == 99 || s[1] == 67) && (s[2] == 116 || s[2] == 84)) {
     return "ect";
    }
    break;
   case 114:case 82:
    // check that the buffer is equal to rtt and return it if so
    if ((s[1] == 116 || s[1] == 84) && (s[2] == 116 || s[2] == 84)) {
      return "rtt";
    }
    break;
   case 118:case 86:
    // check that the buffer is equal to via and return it if so
    if ((s[1] == 105 || s[1] == 73) && (s[2] == 97 || s[2] == 65)) {
     return "via";
    }
    break;
   default:
    break;
   }

It is easy enough to do it by hand, but it gets tedious, so I wrote a little Python script. It is not complicated… I just repeat the same logic in a loop.

Pay attention to the fact that the switch key is made of nearly contiguous integers from 2 to 35… It means that a good compiler will almost surely use a jump table and not a series of comparisons.

First let us compare the memory usage of the four approaches: the original (simple) code used by undici, the naive switch-case approach, the ternary tree and the digital search trie. I use various recent versions of Node.js on a Linux server. I wrote scripts that simply include the function, and only the function, and I print the memory usage. I repeat five times and report the lowest figure. When using Node.js, I call the garbage collector and pause to try to minimize the memory usage.

|                     | Node.js 21 | Node.js 20 | Node.js 19 | Bun 1.0 |
|---------------------|------------|------------|------------|---------|
| original            | 43.3 MB    | 42.4 MB    | 44.9 MB    | 20.2 MB |
| naive switch        | 43.3 MB    | 42.9 MB    | 42.9 MB    | 23.8 MB |
| ternary tree        | 43.5 MB    | 44.2 MB    | 45.2 MB    | 29.3 MB |
| digital search trie | 45.1 MB    | 44.6 MB    | 47.0 MB    | 26.7 MB |

Thus only the digital search trie appears to add substantial memory usage with Node.js 21. If you use a different version of Node.js or a different operating system, results will differ… but I verified that the conclusion is the same on my MacBook.

What about performance? I use an Intel Ice Lake processor running at 3.2 GHz. I wrote a small benchmark that parses a few headers. I rely on a well-known JavaScript benchmark framework (mitata).

|                     | Node.js 21 | Node.js 20 | Node.js 19 | Bun 1.0 |
|---------------------|------------|------------|------------|---------|
| original            | 15 µs      | 14 µs      | 15 µs      | 12 µs   |
| naive switch        | 7.8 µs     | 7.9 µs     | 7.8 µs     | 8.2 µs  |
| ternary tree        | 9.4 µs     | 9.4 µs     | 9.0 µs     | 8.8 µs  |
| digital search trie | 12 µs      | 12 µs      | 11 µs      | 10 µs   |

I am not quite certain why the digital search trie does poorly in this case. I also ran the same experiment on my 2022 MacBook (Apple M2 processor). I am usually against benchmarking on laptops, but these MacBooks tend to give very stable numbers.

|                     | Node.js 21 | Node.js 20 | Node.js 19 | Bun 1.0 |
|---------------------|------------|------------|------------|---------|
| original            | 8.5 µs     | 9.1 µs     | 8.5 µs     | 8.2 µs  |
| naive switch        | 5.0 µs     | 4.9 µs     | 4.7 µs     | 5.3 µs  |
| ternary tree        | 5.8 µs     | 5.8 µs     | 5.6 µs     | 6.1 µs  |
| digital search trie | 5.3 µs     | 5.5 µs     | 5.4 µs     | 5.5 µs  |

Thus I would conclude that both the naive switch and the ternary tree are consistently faster than the original. The original implementation is about 1.8 times slower than the naive switch when using Node.js 21.

One approach I did not try is perfect hashing. I am concerned that it might be difficult to pull off because JavaScript might not compile the code efficiently. One benefit of perfect hashing is that it can be nearly branchless, so it provides consistent performance. We could combine the perfect hashing strategy with the switch-case approach: we would have just one hash computation, and then we would jump straight to a single buffer-to-target comparison. It would be like a C/C++ implementation in spirit, although it would generate more code.
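
To illustrate the idea, here is a minimal C++ sketch of my own (it is not code from undici; the toy hash and the small subset of header names are chosen purely for illustration). The hash is the length times the lowercased first byte, modulo 64; it happens to be collision-free for this particular set of names, which is what makes it ‘perfect’ here: one hash computation, then a single case-insensitive comparison against the unique candidate.

#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Toy "perfect" hash: length times the lowercased first byte, modulo 64.
// For the seven names below, every name lands in a distinct bucket.
static inline size_t toy_hash(const uint8_t *s, size_t len) {
  return (len * (size_t)(s[0] | 0x20)) % 64;
}

static bool equal_ignoring_case(const uint8_t *s, std::string_view target) {
  for (size_t i = 0; i < target.size(); i++) {
    if ((uint8_t)(s[i] | 0x20) != (uint8_t)target[i]) { return false; }
  }
  return true;
}

// Returns the canonical lowercase name, or an empty view when unknown.
std::string_view known_header(const uint8_t *s, size_t len) {
  static constexpr std::string_view names[] = {"te",  "age",  "ect", "rtt",
                                               "via", "from", "host"};
  // Bucket index -> unique candidate name (built once).
  static const auto table = [] {
    std::array<std::string_view, 64> t{};
    for (auto name : names) {
      t[toy_hash((const uint8_t *)name.data(), name.size())] = name;
    }
    return t;
  }();
  if (len == 0) { return {}; }
  std::string_view candidate = table[toy_hash(s, len)];
  if (candidate.size() == len && equal_ignoring_case(s, candidate)) {
    return candidate; // single comparison against the unique candidate
  }
  return {};
}

For the full list of header names, such a trivial hash would not be collision-free: a real implementation would have to search for a hash function and table size without collisions over the whole set, but the lookup cost would remain one hash computation plus one comparison.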

We rely on the fact that this function was identified as a bottleneck. We ran a microbenchmark, but it would be useful to see whether these functions make a difference in a realistic application.

My source code and benchmark are online. They can be improved; pull requests are invited.

How fast can you validate UTF-8 strings in JavaScript?

When you recover textual content from the disk or from the network, you may expect it to be a Unicode string in UTF-8. It is the most common format. Unfortunately, not all sequences of bytes are valid UTF-8 and accepting invalid UTF-8 without validating it is a security risk.

How might you validate a UTF-8 string in a JavaScript runtime?

You might use the valid-8 module:

import valid8 from "valid-8";
if(!valid8(file_content)) { console.log("not UTF-8"); }

Another recommended approach is to use the fact that TextDecoder can throw an exception upon error:
new TextDecoder("utf8", { fatal: true }).decode(file_content)

Or you might use the isUtf8 function which is part of Node.js and Bun.
import { isUtf8 } from "node:buffer";
if(!isUtf8(file_content)) { console.log("not UTF-8"); }

How do they compare? Using Node.js 20 on a Linux server (Intel Ice Lake), I get the following speeds with three files representative of different languages. The Latin file is just ASCII. My benchmark is available.
|             | Arabic    | Chinese   | Latin     |
|-------------|-----------|-----------|-----------|
| valid-8     | 0.14 GB/s | 0.17 GB/s | 0.50 GB/s |
| TextDecoder | 0.18 GB/s | 0.19 GB/s | 7 GB/s    |
| node:buffer | 17 GB/s   | 17 GB/s   | 44 GB/s   |

The current isUtf8 function in Node.js was implemented by Yagiz Nizipli. It uses the simdutf library underneath. John Keiser should be credited for the UTF-8 validation algorithm.

Parsing 8-bit integers quickly

Suppose that you want to quickly parse 8-bit integers (0, 1, 2, …, 254, 255) from an ASCII/UTF-8 string. The problem comes up in the simdzone project led by Jeroen Koekkoek (NLnet Labs). You are given a string and its length: e.g., ‘22’ with a length of 2. The naive approach in C might be:

int parse_uint8_naive(const char *str, size_t len, uint8_t *num) {
  uint32_t n = 0;
  for (size_t i = 0, r = len & 0x3; i < r; i++) {
    uint8_t d = (uint8_t)(str[i] - '0');
    if (d > 9)
     return 0;
    n = n * 10 + d;
  }
  *num = (uint8_t)n;
  return n < 256 && len && len < 4;
}

This code is a C function that takes a string of characters, its length, and a pointer to an unsigned 8-bit integer as input arguments. The function returns a Boolean value indicating whether the input string can be parsed into an unsigned 8-bit integer or not. It restricts the input to at most three characters but it allows leading zeros (e.g. 002 is 2).

The function first initializes a 32-bit unsigned integer n to zero; we will store our answer there. The function then iterates over the input string, extracting each digit character from the string and converting it to an unsigned 8-bit integer. The extracted digit is then added to n after n is multiplied by 10. This process continues until the end of the string or until the function has processed three bytes of the string. Finally, the function assigns the value of n to the unsigned 8-bit integer pointed to by num. It then returns a Boolean value indicating whether the parsed integer is less than 256 and the length of the input string is between 1 and 3 characters. If the input string contains any non-digit characters or if the length of the string is greater than 3 bytes, the function returns false.

If the length of the input is predictable, then this naive function should be fast. However, if the length varies (between 1 and 3), then the processor will tend to mispredict branches, an expensive penalty.

In C++, you could call from_chars:

int parse_uint8_fromchars(const char *str, size_t len, uint8_t *num) {
  auto [p, ec] = std::from_chars(str, str + len, *num);
  return (ec == std::errc());
}

The std::from_chars function takes three arguments: a pointer to the beginning of the input character sequence, a pointer to the end of the input character sequence, and a reference to the output variable that will hold the parsed integer value. The function returns a std::from_chars_result object that contains two members: a pointer to the first character that was not parsed, and an error code that indicates whether the parsing was successful or not.

In this function, the std::from_chars function is called with the input string and its length as arguments, along with a pointer to the unsigned 8-bit integer that will hold the parsed value. The function then checks whether the error code returned by std::from_chars is equal to std::errc(), which indicates that the parsing was successful. If the parsing was successful, the function returns true. Otherwise, it returns false.

Can we do better than these functions? It is not obvious that we can. A function that reads between 1 and 3 bytes is not a function you would normally try to optimize. But still: can we do it? Can we go faster?

Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer). This is often a safe assumption. If your numbers are within a larger string, then you can often check efficiently whether you are within 4 bytes of the end of the string. Even if that is not the case, reading 4 bytes is always safe as long as you do not cross a page boundary, something you may check easily.
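
For instance, here is one way such a check might look (a sketch of my own, assuming 4 KiB pages, which is the common default; the page size can be queried at runtime if needed):

#include <stdint.h>

// Sketch only: assumes 4 KiB (4096-byte) pages.
// A 4-byte load starting at 'p' stays within a single page exactly when the
// offset of 'p' inside its page is no more than 4096 - 4.
static inline int four_byte_load_stays_in_page(const char *p) {
  return ((uintptr_t)p & 4095) <= 4096 - 4;
}

When the check fails, you can fall back on a byte-by-byte routine.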

So you can load your input into a 32-bit word and process all bytes at once, in a single register. This is often called SWAR: in computer science, SWAR means SIMD within a register, which is a technique for performing parallel operations on data contained in a processor register.

Jeroen Koekkoek first came up with a valid SWAR approach, but it was only about 40% faster than the naive approach in the case where we had unpredictable inputs, and potentially slower than the naive approach given predictable inputs. Let us examine an approach that might be competitive all around:

int parse_uint8_fastswar(const char *str, size_t len, 
    uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x30303030lu;
  digits.as_int <<= ((4 - len) * 8);
  uint32_t all_digits = 
    ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) 
       == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24);
  return all_digits 
   & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

Again, this code is a C function that takes a string of characters, its length, and a pointer to an unsigned 8-bit integer as input arguments. The function returns a Boolean value indicating whether the input string can be parsed into an unsigned 8-bit integer or not. We check whether the length is in range ([1,3]); if not, we return false, terminating the function. After this initial check, the function initializes a union digits that contains an array of 4 unsigned 8-bit integers and a 32-bit unsigned integer. The function then copies the input string into the 32-bit unsigned integer using the memcpy function. In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big-endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code in my blog post, I assume that you do not have a big-endian system, which is 99.99% certain.

The function then flips the bits of the 32-bit unsigned integer using the XOR operator and the constant value 0x30303030lu. This operation converts each digit character in the input string to its corresponding decimal value. Indeed, the digit characters 0 to 9 have code point values 0x30 to 0x39 in ASCII. The function then shifts the 32-bit unsigned integer to the left by a certain number of bits, depending on the length of the input string. This operation removes any trailing bytes that were not part of the input string. Technically, when the length is not within the allowed range ([1,3]), the shift might be undefined behaviour, but we return a false value earlier in this case, indicating that the result of the computation is invalid.

The next part is where I contributed to the routine. First, we check that all characters are digits using a concise expression. The function then multiplies the 32-bit unsigned integer by the constant value 0x640a01, using 32-bit arithmetic. It is a concise way to do two multiplications (by 100 and by 10) and two sums at once. Observe that 0x64 is 100 and 0x0a is 10. The result of this multiplication is then shifted to the right by 24 bits. This operation extracts the most significant byte of the 32-bit unsigned integer, which represents the parsed unsigned 8-bit integer. Finally, the function assigns the value of the parsed unsigned 8-bit integer to the unsigned 8-bit integer pointed to by num. It then returns a Boolean value indicating whether the parsed integer is less than 256 and made entirely of digits.
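
To make the multiplication concrete, here is a worked example of my own with the input ‘123’. After the XOR and the shift, the register holds the hundreds digit h = 1, the tens digit t = 2 and the ones digit o = 3 in its second, third and fourth bytes, so its value is V = h·2^8 + t·2^16 + o·2^24, while 0x640a01 = 100·2^16 + 10·2^8 + 1. Multiplying and keeping only the low 32 bits, the terms that reach 2^32 are exact multiples of 2^32 and vanish, leaving

$$ V \times \mathtt{0x640a01} \equiv h \cdot 2^{8} + (10h + t)\cdot 2^{16} + (100h + 10t + o)\cdot 2^{24} \pmod{2^{32}}, $$

so the shift by 24 bits leaves 100h + 10t + o = 123 in the low byte (when the value fits in a byte; larger values are rejected by the final comparison).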

The function might compile to 20 assembly instructions or so on x64 processors:

lea rcx, [rsi - 4]
xor eax, eax
cmp rcx, -3
jb .END
mov eax, 808464432
xor eax, dword ptr [rdi]
shl esi, 3
neg sil
mov ecx, esi
shl eax, cl
lea ecx, [rax + 101058054]
or ecx, eax
test ecx, -252645136
sete cl
imul esi, eax, 6556161
shr esi, 24
mov byte ptr [rdx], sil
bswap eax
cmp eax, 132358
setb al
and al, cl
movzx eax, al
.END: # %return
ret

To test these functions, I wrote a benchmark. The benchmark uses random inputs or sequential inputs (0, 1, …), and the distinction ends up being very relevant. I use GCC 12 and an Ice Lake (Intel) Linux server running at 3.2 GHz. I report the number of millions of numbers parsed per second.

|                 | random numbers | sequential numbers |
|-----------------|----------------|--------------------|
| std::from_chars | 145 M/s        | 255 M/s            |
| naive           | 210 M/s        | 365 M/s            |
| SWAR            | 425 M/s        | 425 M/s            |

So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable. Otherwise, for predictable inputs, we surpass the naive approach by about 15%. Whether it is helpful in your use case depends on your data, so you should run your own benchmarks.

Importantly, the SWAR approach is entirely equivalent to the naive approach, except for the fact that it always reads 4 bytes irrespective of the length.

The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library. It solves a slightly different problem but the difference is still quite large. It could be that there is room for optimization in the standard library (GCC 12).

Can you do better? The benchmark is available. As part of the benchmark, we check exhaustively that the validation is correct.

Credit: I am grateful to Jeroen Koekkoek from NLnet Labs for suggesting this problem. The approach described was improved based on proposals by Jean-Marc Bourguet.

Update: User “Perforated Bob”, proposed a version which can be faster under some compilers:

int parse_uint8_fastswar_bob(const char *str, size_t len, uint8_t *num) {
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x303030lu;
  digits.as_int <<= (len ^ 3) * 8;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 16);
  return ((((digits.as_int + 0x767676) | digits.as_int) & 0x808080) == 0) 
   && ((len ^ 3) < 3) 
   && __builtin_bswap32(digits.as_int) <= 0x020505ff;
}

 

A simple WebSocket benchmark in Python

Modern web applications often use the http/https protocols. However, when the server and client need to talk to each other in a symmetrical fashion, the WebSocket protocol might be preferable. For example, if you want to program a multiplayer video game, the WebSocket protocol is almost surely better than http. In A simple WebSocket benchmark in JavaScript, I showed that JavaScript (through Node.js) can produce efficient WebSocket servers, at least in the simplest cases.

My benchmark is what I call a ping-pong. Client 1 sends a message to the server. The server receives the message and broadcasts it to the second client. Client 2 receives the message from the server. Client 2 replies back to the server. Client 1 receives the message. And so forth.

I want to know how many roundtrips I can generate per second. For Node.js (JavaScript), the answer is about 20,000. If you use a faster JavaScript engine (bun), you might get twice as many.

What about Python? I wrote my client using standard code, without any tweaking:

import asyncio
import websockets
async def client1():
  async with websockets.connect('ws://localhost:8080') as websocket:
    message = 'client 1!'
    await websocket.send(message)
    while True:
      response = await websocket.recv()
      await websocket.send(message)

async def client2():
  async with websockets.connect('ws://localhost:8080') as websocket:
    message = 'client 2!'
    while True:
      response = await websocket.recv()
      await websocket.send(message)

async def main():
  task1 = asyncio.create_task(client1())
  task2 = asyncio.create_task(client2())
  await asyncio.gather(task1, task2)

asyncio.run(main())


Python has several different frameworks to build WebSocket servers. I picked three that looked popular and mature: sanic, blacksheep, and aiohttp. By default, a module like sanic should use good optimizations like the uvloop module.

My source code is available. I run the benchmark on a Linux machine with Python 3.9. The packets are local, they don’t go out to the Internet. There is no docker container involved.

|                   | Python client | Node.js 20 client |
|-------------------|---------------|-------------------|
| sanic server      | 3700          | 5200              |
| blacksheep server | 3000          | 200               |
| aiohttp server    | 3600          | 270               |
| Node.js 20 server | 6000          | 19,000            |

Mixing the blacksheep and aiohttp servers I wrote with the Node.js client gives terrible results: I have not investigated the cause, but it would be interesting to see if others can reproduce it, or diagnose it.

Otherwise, I get that the sanic server is nearly 4 times slower than the Node.js server. Writing the client in Python appears to cut the performance significantly (except for the blacksheep and aiohttp servers anomaly).

JavaScript shines in comparison with Python in these tests. My implementations are surely suboptimal, and there might be mistakes. Nevertheless, I believe that it is as expected: the standard Python interpreter is not very fast. Node.js has a great just-in-time compiler. It would be interesting to switch to a faster implementation of Python such as PyPy.

This being said, 3000 roundtrips per second might be quite sufficient for several applications. Yet, a real-world WebSocket server would be assuredly slower: it would go through the Internet, it would do non-trivial work, and so forth.

A simple WebSocket benchmark in JavaScript: Node.js versus Bun

Conventional web applications use the http protocol (or the https variant). The http protocol is essentially asymmetrical: a client application such as a browser issues requests and the server responds. It is not possible for the server to initiate communication with the client. Certain types of applications are therefore more difficult to design. For example, if we wanted to design a multiplayer video game using the http protocol, such as a chess game, we could have one server, and two browsers connected to the server. When one of the players moves a piece within its browser, the browser can inform the server via an http request. But how do you inform the second browser? One solution is to have the browsers make requests to the server at regular intervals. A better solution is to use another protocol, the WebSocket protocol.

WebSocket is a network protocol for creating bidirectional communication channels between browsers and web servers. Most browsers support WebSocket, although the standard is relatively recent (2011). It enables the client to be notified of a change in server status, without having to make a request.

You expect WebSocket to be relatively efficient. I wrote an elementary WebSocket benchmark in JavaScript. I use the standard module ws. In my benchmark, I have one server. The server takes whatever messages it receives, and it sends them to other clients. Meanwhile I create two clients. Both clients initiate a connection to the server, so we have two connections. The clients then engage in a continual exchange:

  1. Client 1 sends a message to the server.
  2. The server receives the message and broadcasts it to the second client.
  3. Client 2 receives the message from the server.
  4. Client 2 replies back to the server.
  5. Client 1 receives the message.

My code is as simple as possible. I do not use any tricks to go faster. It is ‘textbook’ code.

Importantly, this benchmark has a strong data dependency: only one connection is active at any time while the other one is stalled. So we are measuring the latency (how long the round trips take) rather than how many requests we can support simultaneously.

How fast can it go? I run my benchmark locally on a Linux server with a server processor (Xeon Gold). The tests are local so that they do not go through the Internet; they do not use docker or a VM, etc. Obviously, if you run these benchmarks on the Internet, you will get slower results due to the network overhead. Furthermore, my benchmark does not do any processing, it just sends simple messages. But I am interested in how low the latency can get.

I use Node.js as a runtime environment (version 20). There is an alternative JavaScript runtime environment called Bun which I also use for comparison (1.0.14). Because I have two JavaScript processes, I have four possibilities: the two processes may run Node.js, or they may run Bun, or a mix of the two.

Can we do better? Bun has its own WebSocket API. I wrote a script specifically for it. I am keeping the clients unchanged (i.e., I am not using a bun-specific client).

I measure the number of roundtrips per second in steady state.

|                               | Node.js 20 (client) | Bun 1.0 (client) |
|-------------------------------|---------------------|------------------|
| Node.js 20 (server with ws)   | 19,000              | 23,000           |
| Bun 1.0 (server with ws)      | 15,000              | 27,000           |
| Bun 1.0 (bun-specific server) | 44,000              | 50,000           |

It seems fair to compare the pure Node.js configuration (19,000) with the pure Bun configuration (27,000) when using the ws module. At least in my tests, I am getting that Node.js clients are faster when using a Node.js server. I am not sure why that is. Bun is 40% faster than Node.js in this one test. Once you switch to the bun-specific JavaScript code, then bun is twice as fast.

In a simple http benchmark, I got that Node.js could support about 45,000 http queries per second, while Bun was nearly twice as capable. However, to get these high numbers, we need multiple requests in flight at all times. So while I am not making a direct comparison, it seems likely that WebSocket is more efficient than repeatedly polling the server from both clients.

Importantly, all of my source code is available. The benchmark should be fully reproducible.

Science and Technology links (November 12 2023)

  1. Vitamin K2 supplements might reduce the risk of myocardial infarction (heart attacks) and of all-cause death (Hasific et al. 2022). You find vitamin K2 in some Gouda cheeses and in eggs.
  2. Most of the water on Earth is salinated (in the oceans) and cannot be consumed. Fresh water is often scarce. Yet Israel is desalinating water for less than a dollar per cubic meter.
  3. People living in South America engaged in warfare for 10,000 years before the arrival of the Europeans (Standen et al. 2023).
  4. The last glacial period ended about 12,000 years ago and lasted about 100,000 years. About 26,000 years ago, all of Canada was covered by a permanent ice sheet. Thus many of us were taught in school that human beings first colonized America about 12,000 years ago by the Bering land bridge that existed back then between modern-day Russia and modern-day Alaska. The evidence accumulates that there were human beings in America much earlier than initially thought. They would have been present 21,000 to 23,000 years ago in New Mexico. We even have their footprints.
  5. As recently as 20,000 years ago—not long in geological terms—Britain was not, in fact, an island. Instead, the terrain that became the British Isles was linked to mainland Europe by Doggerland, a tract of now-submerged territory where early Mesolithic hunter-gatherers lived, settled and traveled (McGreevy, 2020). Correspondingly, there were human beings in Ireland 31,000 years ago.
  6. Gray et al. (2023) argue that the limited freedom that children enjoy in our modern societies is leading to a rise in mental disorders.
  7. Most people cannot understand the bat and ball problem, even after the solution is given. The problem can be stated as follows: “A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?” ChatGPT can solve it.
  8. When hiring, we find a slight bias in favour of females in male-dominated fields, and a strong bias in favour of females in female-dominated fields (Schaerer et al., 2023). Overall, people greatly overestimate gender biases in hiring.
  9. Retinol, a common cosmetic product, keeps one’s skin younger.
  10. Unwarranted financial optimism might be the result of low cognitive abilities.

Generating arrays at compile-time in C++ with lambdas

Suppose that you want to check whether a character in C++ belongs to a fixed set, such as '\0', '\x09', '\x0a', '\x0d', ' ', '#', '/', ':', '<', '>', '?', '@', '[', '\\', ']', '^', '|'. A simple way is to generate a 256-byte array of Boolean values and look up the value. This approach is sometimes called memoization (and not memorization!!!). You might do it as follows:

constexpr static bool is_forbidden_host_code_point_table[] = {
  1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
bool is_forbidden_host_code_point(char c) {
  return is_forbidden_host_code_point_table[uint8_t(c)];
}

It is reasonably efficient in practice. Some people might object to how the table is generated. Can you have the C++ compiler generate the array at compile-time from a function?

Using C++17, you might do it with an std::array as follows:

constexpr static std::array<uint8_t, 256> is_forbidden_array = []() {
  std::array<uint8_t, 256> result{};
  for (uint8_t c : {'\0', '\x09', '\x0a','\x0d', ' ', '#', '/', ':',
    '<', '>', '?', '@', '[', '\\', ']', '^', '|'}) {
   result[c] = true;
  }
  return result;
}();

bool is_forbidden_host_code_point_array(char c) {
  return is_forbidden_array[uint8_t(c)];
}

These two approaches should be equivalent in practice. This might compile down to a single lookup instruction, or the equivalent.

You may want to compare it against the default (without memoization) which might be…

bool is_forbidden_host_code_point_default(char c) noexcept {
  return c == '\0' || c == '\x09' || c == '\x0a'
   || c == '\x0d' || c == ' ' || c == '#'
   || c == '/'|| c == ':' || c == '<'|| c == '>'
   || c == '?' || c == '@' || c == '['|| c == '\\'
  || c == ']'|| c == '^' || c == '|';
}

A compiler like GCC might compile this routine to a bitset approach such as …

 cmp dil, 62
ja .L2
movabs rax, 6052978675329017345
bt rax, rdi
setc al
ret
.L2:
sub edi, 63
cmp dil, 61
ja .L4
movabs rax, 2305843013240225795
bt rax, rdi
setc al
ret

The default approach certainly generates more instructions, and might be less efficient in some cases.

Further reading. The evolution of constexpr: compile-time lookup tables in C++

Appending to an std::string character-by-character: how does the capacity grow?

In C++, suppose that you append to a string one character at a time:

while(my_string.size() <= 10'000'000) {
  my_string += "a";
}

In theory, it might be possible for the C++ runtime library to implement this routine as the creation of a new string with each append: it could allocate a new memory region that contains just one extra character, and copy to the new region. It would be very slow in the worst case. Of course, the people designing the runtime libraries are aware of such a potential problem. Instead of allocating memory and copying with each append, they will typically grow the memory usage in bulk. That is, every time new memory is needed, they double the memory usage (for example).

Empirically, we can measure the allocation. Starting with an empty string, we may add one character at a time. I find that GCC 12 uses capacities of 15 × 2^k for k = 0, 1, 2, …, so that the string capacities are 15, 30, 60, 120, 240, 480, 960, 1920, etc. Under macOS (LLVM 15), I get that clang doubles the capacity and adds one, except for the initial doubling, so you get capacities of 22, 47, 95, 191, 383, 767, etc. So the string capacity grows by successive doubling in all cases.
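
Here is a minimal sketch of how one might observe these capacities; the exact sequence depends on your standard library and its version:

#include <iostream>
#include <string>

int main() {
  std::string s;
  auto last_capacity = s.capacity();
  std::cout << "initial capacity: " << last_capacity << "\n";
  while (s.size() <= 10'000'000) {
    s += "a";
    if (s.capacity() != last_capacity) { // a reallocation happened
      last_capacity = s.capacity();
      std::cout << "size " << s.size() << " -> capacity " << last_capacity << "\n";
    }
  }
}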

If you omit the cost of writing the character, what is the cost of these allocations and copies for long strings? Assume that allocating N bytes costs you N units of work. Let us consider the GCC 12 model (the LLVM model leads to the same conclusion). To construct a string of size up to 15 × 2^n, it costs you 15 + 15 × 2^1 + 15 × 2^2 + … + 15 × 2^n, which is a geometric series with value 15 × (2^(n+1) - 1). Generally speaking, you find that this incremental doubling approach costs you no more than 2N units of work to construct a string of size N, after rounding N up to a power of two. In computer science parlance, the complexity is linear. Insertion in a dynamic array whose capacity is expanded by a constant factor ensures that inserting an element takes constant time (amortized). In common-sense parlance, it scales well.
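
To spell out the geometric series above:

$$ \sum_{k=0}^{n} 15 \cdot 2^{k} = 15\,(2^{n+1} - 1) < 2 \cdot 15 \cdot 2^{n}, $$

so the total allocation work is less than twice the final size.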

In the general case, where you replace 2 by another value (e.g., 1.5), you still get a geometric series, but in a different basis: 1 + b^1 + b^2 + … + b^n, which sums up to (b^(n+1) - 1)/(b - 1). The factor of 2 becomes b/(b - 1) asymptotically, so for a basis of 1.5, you get 3N units of work instead of 2N. So a smaller scaling factor b leads to more work, but it is still just a constant factor.

If I benchmark the following function for various values of ‘volume’, I get a practically constant time-per-value:

std::string my_string;
while (my_string.size() <= volume) {
  my_string += "a";
}

On my laptop, I get the following results. You can run my benchmark yourself.

| volume    | time/entry (direct measure) |
|-----------|-----------------------------|
| 100       | 5.83 ns                     |
| 1000      | 5.62 ns                     |
| 10000     | 5.47 ns                     |
| 100000    | 5.62 ns                     |
| 1000000   | 5.68 ns                     |
| 10000000  | 5.69 ns                     |
| 100000000 | 5.80 ns                     |

A consequence of how strings allocate memory is that you may find that many of your strings have excess capacity if you construct them by repeatedly appending characters. To save memory, you may call the method shrink_to_fit() to remove this excess capacity. If you are using a temporary string, it is not a concern since the memory is recovered immediately.
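
For instance, the following small illustration prints the capacity before and after the call (how much excess capacity there is depends on your standard library):

#include <iostream>
#include <string>

int main() {
  std::string s;
  while (s.size() < 1'000'000) { s += "a"; }
  std::cout << "capacity before shrink_to_fit: " << s.capacity() << "\n";
  s.shrink_to_fit(); // request that the excess capacity be released
  std::cout << "capacity after shrink_to_fit: " << s.capacity() << "\n";
}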

For processing strings, streams in C++ can be slow

The C++ library has long been organized around stream classes, at least when it comes to reading and parsing strings. But streams can be surprisingly slow. For example, if you want to parse numbers, then this C++ routine is close to being the worst possible choice for performance:

std::stringstream in(mystring);
while(in >> x) {
   sum += x;
}
return sum;

I recently learned that some Node.js engineers prefer stream classes when building strings, for performance reasons. I am skeptical.

Let us run an experiment. We shall take strings containing the ‘%’ character and we build new strings where the ‘%’ character is replaced by ‘%25’ but the rest of the string is otherwise unchanged.

A straight-forward string construction is as follows:

std::string string_escape(const std::string_view file_path) {
  std::string escaped_file_path;
  for (size_t i = 0; i < file_path.length(); ++i) {
    escaped_file_path += file_path[i];
    if (file_path[i] == '%')
      escaped_file_path += "25";
  }
  return escaped_file_path;
}

A version using streams is as follows:

std::string stream_escape(const std::string_view file_path) {
  std::ostringstream escaped_file_path;
  for (size_t i = 0; i < file_path.length(); ++i) {
    escaped_file_path << file_path[i];
    if (file_path[i] == '%')
      escaped_file_path << "25";
  }
  return escaped_file_path.str();
}

I envision using these functions over strings that contain few ‘%’ characters. It is possible that most of the strings do not contain the ‘%’. In such cases, I can just search for the character and only do non-trivial work when one is found. The following code should do:

std::string find_string_escape(std::string_view file_path) {
  // Avoid unnecessary allocations.
  size_t pos = file_path.empty() ? std::string_view::npos :
    file_path.find('%');
  if (pos == std::string_view::npos) {
   return std::string(file_path);
  }
  // Escape '%' characters to a temporary string.
  std::string escaped_file_path;
  do {
    escaped_file_path += file_path.substr(0, pos + 1);
    escaped_file_path += "25";
    file_path = file_path.substr(pos + 1);
    pos = file_path.empty() ? std::string_view::npos :
      file_path.find('%');
  } while (pos != std::string_view::npos);
  escaped_file_path += file_path;
  return escaped_file_path;
}

I wrote a benchmark that uses a large collection of actual file URLs as a data source. The benchmark runs under macOS and Linux. I use Linux, a recent Intel server and GCC 12:

| approach      | time per string | throughput |
|---------------|-----------------|------------|
| naive strings | 260 ns/string   | 0.45 GB/s  |
| stream        | 1000 ns/string  | 0.12 GB/s  |
| find          | 33 ns/string    | 3.49 GB/s  |

At least in this case, I find that the stream version is four times slower than  naive string processing, and it is 30 times slower than the optimized ‘find’ approach.

Your results will vary depending on your system, but I generally consider the use of streams in C++ as a hint that there might be poor performance.

Further reading: I turned this blog post into a pull request to Node.js.

How many billions of transistors in your iPhone processor?

In about 10 years, Apple has multiplied the number of transistors in its mobile processors by 19. It corresponds roughly to a steady rate of improvement of 34% per year in the number of transistors, or a doubling every 2.5 years. In real dollars, an iPhone has a roughly constant price: the price tag of a new iPhone increases every year, but it does so while tracking inflation. Thus you are getting ever more transistors in your iPhone for the same price.
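
As a quick check of these rates, the table below goes from 1 billion transistors (A7, 2013) to 19 billion (A17, 2023), so over ten years we have

$$ 19^{1/10} \approx 1.34 \qquad\text{and}\qquad \frac{\ln 2}{\ln 1.34} \approx 2.4, $$

that is, about 34% more transistors per year and a doubling roughly every two and a half years.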

| processor | release year | transistors  |
|-----------|--------------|--------------|
| Apple A7  | 2013         | 1 billion    |
| Apple A8  | 2014         | 2 billion    |
| Apple A9  | 2015         | 2 billion    |
| Apple A10 | 2016         | 3.2 billion  |
| Apple A11 | 2017         | 4.3 billion  |
| Apple A12 | 2018         | 6.9 billion  |
| Apple A13 | 2019         | 8.5 billion  |
| Apple A14 | 2020         | 11.8 billion |
| Apple A15 | 2021         | 15 billion   |
| Apple A16 | 2022         | 16 billion   |
| Apple A17 | 2023         | 19 billion   |

Randomness in programming (with Go code)

Computer software is typically deterministic on paper: if you run the same program twice with the same inputs, you should get the same outputs. In practice, the complexity of modern computing makes it unlikely that you could ever run the same program twice and get exactly the same result, down to the exact same execution time. For example, modern operating systems randomize the memory addresses as a security precaution: a technique called address space layout randomization. Thus if you run a program twice, you cannot be guaranteed that the memory is stored at the same memory addresses. In Go, you can print the address of a pointer with the %p directive. The following program will allocate a small array of integers, and print the corresponding address, using a pointer to the first value. If you run this program multiple times, you may get different addresses.

package main

import (
    "fmt"
)

func main() {
    x := make([]int, 3)
    fmt.Printf("Hello %p", &x[0])
}

Thus, in some sense, software programs are already randomized whether we like it or not. Randomization can make programming more challenging. For example, a bad program might behave correctly most of the time and only fail intermittently. Such unpredictable behavior is a challenge for a programmer.

Nevertheless, we can use randomization to produce better software: for example by testing our code with random inputs. Furthermore, randomness is a key element of security routines.

Though randomness is an intuitive notion, defining it requires more care. Randomness is usually tied to a lack of information. For example, it may be measured by our inability to predict an outcome. Maybe you are generating numbers, one every second, and after looking at the last few numbers you generated, I still cannot predict the next number you will generate. It does not imply that the approach you are using to generate numbers is magical. Maybe you are applying a perfectly predictable mathematical routine. Thus randomness is relative to the observer and their knowledge.

In software, we distinguish between pseudo-randomness and randomness. If I run a mathematical routine that generates random-looking numbers, but these numbers are perfectly determined, I will say that they are ‘pseudo-random’. What random looking means is subjective and the concept of pseudo-randomness is likewise subjective.

It is possible, on a computer, to produce numbers that cannot be predicted by the programmer. For example, you might use a temperature sensor in your processor to capture physical ‘noise’ that can serve as a random-looking input. You might use the time of day when a program was started as a random input. We often refer to such values as random (as opposed to pseudo-random). We consider them random in the sense that, even in principle, it is not possible for the software to predict them: they are produced by a process from outside of the software system.

Hashing

Hashing is the process by which we design a function that takes various inputs (for example variable-length strings) and outputs a convenient value, often an integer value. Because hashing involves a function, given the same input, we always get the same output. Typically, hash functions produce a fixed number of bits: e.g., 32 bits, 64 bits, and so forth.

One application of hashing has to do with security: given a file recovered from the network, you may compute a hash value from it. You may then compare it with the hash value that the server provides. If the two hash values match, then it is likely that the file you recovered is a match for the file on the server. Systems such as git rely on this strategy.

Hashing can also be used to construct useful data structures. For example, you can create a hash table: given a set of key values, you compute a hash value representing an index in an array. You can then store the key and a corresponding value at the given index, or nearby. When a key is provided, you can hash it, seek the address in the array, and find the matching value.

If the hashing is random-looking, then you should be able to hash N objects into an array of M elements, for M slightly larger than N, so that only a few objects are hashed to the same location in the array. It is difficult to ensure that no two objects are ever mapped to the same array element: it requires that M be much, much larger than N. For M much larger than N, the probability of a collision is about 1 - exp(-N*N/(2*M)). Though this probability falls to zero as M grows large, it requires M to be much larger than N for it to be practically zero. Solving for p in 1 - exp(-N*N/(2*M)) = p, we get M = -1/2 N*N / ln(1-p). That is, to maintain a collision probability p, M must grow quadratically (proportionally to N*N) with respect to N. Thus we should expect that there will be collisions in a hash table even if the hash function appears random.

We can handle collisions in various ways. For example, you may use chaining: each element in the array stores a reference to a dynamic container that may contain several keys. You may also use linear probing: if the array location is occupied, you find the next unoccupied location and store the key there. When searching for a key under linear probing, you first access the element indicated by the hash value, and if it is occupied by a different key, you move to the next element and so forth, until you have found the key, reached an unoccupied element, or visited all of the array. There are many variations on the same theme (linear probing).
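
The stated probability is the usual birthday-problem approximation: if N keys are hashed uniformly and independently into M slots, the probability that they all land in distinct slots is

$$ \prod_{k=0}^{N-1}\left(1 - \frac{k}{M}\right) \approx \exp\!\left(-\frac{N(N-1)}{2M}\right) \approx \exp\!\left(-\frac{N^2}{2M}\right), $$

so the probability of at least one collision is about 1 - exp(-N*N/(2*M)); setting this equal to p and solving for M gives M = -1/2 N*N / ln(1-p), which is the formula above.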

An ideal hash function might take each possible input and assign it to a purely random value given by an Oracle. Unfortunately, such hash functions are often impractical. They require the storage of large tables of input values and matching random values. In practice, we aim to produce hash functions that behave as if they were purely random while still being easy to implement efficiently.

A reasonable example to hash non-zero integer values is the murmur function. The murmur function consists of two multiplications and three shift/XOR operations. The following Go program will display random-looking 64-bit integers, using the murmur function:

package main

import (
    "fmt"
    "math/bits"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

func main() {

    for i := 0; i < 10; i++ {
        fmt.Println(i, murmur64(uint64(i)))
    }
}

It is a reasonably fast function. One downside of the murmur64 function is that zero is mapped to zero, so some care is needed.

In practice, your values might not be integers. If you want to hash a string, you might use a recursive function. You process the string character by character. At each character, you combine the character value with the hash value computed so far, generating a new hash value. Once the function is completed, you may then apply murmur to the result:

package main

import (
    "fmt"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

func hash(s string) (v uint64) {
    v = uint64(0)
    for _, c := range s {
        v = uint64(c) + 31*v
    }
    return murmur64(v)
}

func main() {
    fmt.Print(hash("la vie"), hash("Daniel"))
}

There are better and faster hash functions, but the result from recursive hashing with a murmur finalizer is reasonable.

Importantly, it is reasonably easy to generate two strings that hash to the same values, i.e., to create a collision. For example, you can verify that the strings "Ace", "BDe", "AdF", "BEF" all have the same hash value:

fmt.Print(hash("Ace"), hash("BDe"), hash("AdF"), hash("BEF"))

When hashing arbitrarily long strings, collisions are always possible. However, we can use more sophisticated (and more computationally expensive) hash functions to reduce the probability that we encounter a problem.

An interesting characteristic of the provided murmur64 function is that it is invertible. If you consider the steps, you have two multiplications by odd integers. A multiplication by an odd integer is always invertible: the multiplicative inverse of 0xff51afd7ed558ccd is 0x4f74430c22a54005 and the multiplicative inverse of 0xc4ceb9fe1a85ec53 is 0x9cb4b2f8129337db, as 64-bit unsigned integers. It may be slightly less obvious that h ^= h >> 33 is invertible. But if h is a 64-bit integer, we have that h and h ^ (h >> 33) are identical in their most significant 33 bits, by inspection. Thus if we are given z = h ^ (h >> 33), we have that z >> (64-33) == h >> (64-33). That is, we have identified the most significant 33 bits of h from h ^ (h >> 33). Extending this reasoning, we have that g is the inverse of f in the following code, in the sense that g(f(i)) == i.

func f(h uint64) uint64 {
    return h ^ (h >> 33)
}

func g(z uint64) uint64 {
    h := z & 0xffffffff80000000
    h = (h >> 33) ^ z
    return h
}

We often need hash values to fit within an interval starting at zero. E.g., to get a hash value in [0,max), you might use the following function:

func toIntervalBias(random uint64, max uint64) uint64 {
    hi,_ := bits.Mul64(random, max)
    return hi
}

This function outputs a value in [0,max) using a single multiplication. There are alternatives such as random % max, but an integer remainder operation may compile to a division instruction, and a division is typically more expensive than a multiplication. Whenever possible, you should avoid division instructions when performance is a factor.

Importantly, the toIntervalBias function introduces a slight bias: we start with 2^64 distinct values and we map them to N distinct values. This means that out of 2^64 original values, about 2^64/N values correspond to each output value. Let ⌈x⌉ be the smallest integer no smaller than x and ⌊x⌋ be the largest integer no larger than x. When 2^64/N is not an integer, then some output values match ⌈2^64/N⌉ original values, whereas others match ⌊2^64/N⌋ original values. For example, mapping 8 equally likely values to N = 3 outputs gives buckets of sizes 3, 3 and 2, so the first two outputs are 50% more likely than the third. When N is small, the bias may be negligible, but as N grows, the bias is relatively more important. In some sense, it is the smallest possible bias if we are starting from original values that are uniformly distributed over a set of 2^64 possible values.

Putting it all together, the following program will hash a string into a value in the interval [0,10).

package main

import (
    "fmt"
    "math/bits"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

func hash(s string) (v uint64) {
    v = uint64(0)
    for _, c := range s {
        v = uint64(c) + 31*v
    }
    return murmur64(v)
}

func toIntervalBias(random uint64, max uint64) uint64 {
    hi,_ := bits.Mul64(random, max)
    return hi
}

func main() {
    fmt.Print(toIntervalBias(hash("la vie"),10))
}

Though the toIntervalBias function is generally efficient, it is unnecessarily expensive when the range is a power of two. If max is a power of two (e.g., 32), then random % max == random & (max-1). A bitwise AND with the decremented maximum is faster than even just a multiplication, typically. Thus the following function is preferable.

func toIntervalPowerOfTwo(random uint64, max uint64) uint64 {
    return random & (max-1)
}

Estimating cardinality

One use case for hashing is to estimate the cardinality of the values in an array or stream of values. Suppose that your software receives billions of identifiers, how many distinct identifiers are there? You could build a database of all identifiers, but it could use a lot of memory and be relatively expensive. Sometimes, you only want a crude approximation, but you want to compute it quickly.

There are many techniques to estimate cardinalities using hashing: Probabilistic Counting, LOGLOG Probabilistic Counting, and so forth. We can explain the core idea and even produce a useful function without any advanced mathematics. If you hash all values (e.g., identifiers), and if the hash function is of good quality, you would expect each distinct value to hash to a random-looking value within the set of all possible hash values.

The space between distinct hash values should be about 2^64/(N+1) where N is the number of distinct values. If we find the smallest hash value m, then we should have approximately m = 2^64/(N+1), or N = 2^64/m - 1. When N is much larger than 1 but much smaller than 2^64, this is approximately N = (2^64 - 1)/m. The following function applies this formula to estimate the cardinality:

// estimateCardinality estimates the number of distinct values
func estimateCardinality(values []uint64) int {
    if len(values) < 2 {
        return len(values)
    }
    mi1 := murmur64(values[0])

    for i := 1; i < len(values); i++ {
        t := murmur64(values[i])
        if t < mi1 {
            mi1 = t
        }
    }
    return int(math.MaxUint64 / mi1)
}

We can apply this function in the following program. The approximation is rather crude, but it can be good enough in some practical cases.

package main

import (
    "fmt"
    "math"
)

func mu(h uint64, step uint64) uint64 {
    return h * step
}

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

// fillArray fills the array with up to howmany distinct values
func fillArray(arr []uint64, howmany int) {
    for i := 0; i < len(arr); i++ {
        // careful not to include zero because murmur64(0) == 0
        arr[i] = 1 + uint64(i%howmany)
    }
}

// estimateCardinality estimates the number of distinct values
func estimateCardinality(values []uint64) int {
    if len(values) < 2 {
        return len(values)
    }
    m := murmur64(values[0])

    for i := 1; i < len(values); i++ {
        t := murmur64(values[i])
        if t < m {
            m = t
        }
    }
    return int(math.MaxUint64 / m)
}

func main() {
    values := make([]uint64, 5000000) // 5 million values
    distinct := 2200000               // 2.2 million distinct values
    fillArray(values, distinct)
    fmt.Println(estimateCardinality(values), distinct)

}

Integers

There are many ways to generate random integers, but a particularly simple approach is to rely on hashing. For example, we could start from an integer (e.g., 10) and return the random integer murmur64(10), and then increment the integer (e.g., to 11) and next return the integer murmur64(11).

Steele et al. (2014) propose a similar strategy which they call SplitMix: it is part of the Java standard library. It works much like what we just described but instead of incrementing the counter by one, they increment it by a large odd integer. They also use a slightly different version of the murmur64 function. The following program prints 10 different random values, following the SplitMix formula:

package main

import "fmt"

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func main() {
    seed := uint64(1234)
    for z := 0; z < 10; z++ {
        r := splitmix64(&seed)
        fmt.Println(r)
    }
}

Each time the splitmix64 function is called, the hidden seed variable is advanced by a constant (0x9E3779B97F4A7C15). If you start from the same seed, you always get the same random values.

The function then performs a series of bitwise operations on z. First, it performs an XOR operation between z and z shifted right by 30 bits. It then multiplies the result by the constant value 0xBF58476D1CE4E5B9. Next, it performs another XOR operation between the result and the result shifted right by 27 bits. Finally, it multiplies the result by the constant value 0x94D049BB133111EB and returns the result XORed with the result shifted right by 31 bits.

It produces integers using the full 64-bit range. If one needs a random integer in an interval (e.g., [0,N)), then more work is needed. If the size of the interval is a power of two (e.g., [0,32)), then we may simply use the same technique as for hashing:

// randomInPowerOfTwo -> [0,max)
func randomInPowerOfTwo(seed *uint64, max uint64) uint64 {
    r := splitmix64(seed)
    return r & (max-1)
}

However, when the bound is arbitrary (not a power of two) and we want to avoid biases, a slightly more complicated algorithm is needed. Indeed, if we assume that the 64-bit integers are truly random, then all values are equally likely. However, if we are not careful, we can introduce a bias when converting the 64-bit integers to values in [0,N). It is not a concern when N is a power of two, but it becomes a concern when N is arbitrary. A fast routine was described by Lemire (2019) to solve this problem:

func toIntervalUnbiased(seed *uint64, max uint64) uint64 {
    x := splitmix64(seed)
    hi, lo := bits.Mul64(x, max)
    if lo < max {
        t := (-max) % max // division!!!
        for lo < t {
            x := splitmix64(seed)
            hi, lo = bits.Mul64(x, max)
        }
    }
    return hi
}

The toIntervalUnbiased function takes two arguments: a pointer to a 64-bit unsigned integer seed and a 64-bit unsigned integer max. It returns a 64-bit unsigned integer. The function first calls the splitmix64 function with the seed pointer as an argument to generate a random 64-bit unsigned integer x. It then multiplies x with max using the bits.Mul64 function, which returns the product of two 64-bit unsigned integers as two 64-bit unsigned integers. The higher 64 bits of the product are stored in the variable hi, and the lower 64 bits are stored in the variable lo. If lo is less than max, the function enters a loop that generates new random numbers using splitmix64 and recalculates the product of x and max until lo is greater than or equal to -max % max. This is done to ensure that the distribution of random numbers is unbiased.

The general strategy used by this function is called the rejection method: we repeatedly try to generate a random integer until we can produce an unbiased result. However, when the interval is much smaller than 2^64 (a common case), we are very unlikely to use the rejection method or even to compute an integer remainder: the probability that lo is less than max is at most max/2^64. Most of the time, the function never enters the rejection loop.

Testing that a random generator appears random is challenging. We can use many testing strategies, and each testing strategy can be more or less extensive. Thankfully, it is not difficult to think of some tests we can apply. For example, we want the distribution of values to be uniform: the probability that any one value is generated should be 1 over the number of possible values. With 2^64 possible values, it is technically challenging to test for uniformity. However, we can conveniently restrict the size of the output with a function such as toIntervalUnbiased.

The following program computes the relative standard deviation of a frequency histogram based on 100 million values. The relative standard deviation is far smaller than 1% (0.05655%), which suggests that the distribution is uniform.

package main

import (
    "fmt"
    "math"
    "math/bits"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func toIntervalUnbiased(seed *uint64, max uint64) uint64 {
    x := splitmix64(seed)
    hi, lo := bits.Mul64(x, max)
    if lo < max {
        t := (-max) % max // division!!!
        for lo < t {
            x = splitmix64(seed)
            hi, lo = bits.Mul64(x, max)
        }
    }
    return hi
}

func meanAndStdDev(arr []int) (float64, float64) {
    var sum, sumSq float64
    for _, val := range arr {
        sum += float64(val)
        sumSq += math.Pow(float64(val), 2)
    }
    n := float64(len(arr))
    mean := sum / n
    stdDev := math.Sqrt((sumSq / n) - math.Pow(mean, 2))
    return mean, stdDev
}

func main() {
    seed := uint64(1234)
    const window = 30
    var counter [window]int

    for z := 0; z < 100000000; z++ {
        counter[toIntervalUnbiased(&seed, window)] += 1
    }
    moyenne, ecart := meanAndStdDev(counter[:])
    fmt.Println("relative std ", ecart/moyenne*100, "%")
}

Random shuffle

Sometimes, you are given an array that you want to randomly shuffle. An elegant algorithm described by Knuth is the standard approach. The algorithm works by iterating over the array from the last element to the first element. At each iteration, it selects a random index between 0 and the current index (inclusive) and swaps the element at the current index with the element at the randomly generated index.

The following program randomly shuffles an array based on a seed. Changing the seed would change the order of the array. For large arrays, the number of possible permutations exceeds the number of possible seeds (for example, 21! already exceeds 2^64): it implies that not all permutations can be produced by such an algorithm with a simple fixed-length seed.

package main

import (
    "fmt"
    "math/bits"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func toIntervalUnbiased(seed *uint64, max uint64) uint64 {
    x := splitmix64(seed)
    hi, lo := bits.Mul64(x, max)
    if lo < max {
        t := (-max) % max // division!!!
        for lo < t {
            x = splitmix64(seed)
            hi, lo = bits.Mul64(x, max)
        }
    }
    return hi
}

func shuffle(seed *uint64, arr []int) {
    for i := len(arr)-1; i >= 1; i-- {
        j := toIntervalUnbiased(seed, uint64(i+1))
        arr[i], arr[j] = arr[j], arr[i]
    }
}

func main() {
    seed := uint64(1234)
    numbers := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
    shuffle(&seed, numbers)
    fmt.Println(numbers)
}

Floats

It is often necessary to generate random floating-point numbers. Software systems typically use IEEE 754 floating-point numbers.

To generate 32-bit floating-point numbers in the interval [0,1), it may seem that we could generate a 32-bit integer (in [0, 2^32)) and divide it by 2^32 to get a random floating-point value in [0, 1). That’s certainly “approximately true”, but we are making an error when doing so. How much of an error?

Floating-point (normal) numbers are represented as a sign bit, a mantissa, and an exponent, as follows (a short program after the list makes the layout concrete):

  • There is a single sign bit. Because we only care about positive numbers, this bit is fixed and can be ignored.
  • The mantissa of a 32-bit floating point number is 23 bits. It is implicitly preceded by the number 1.
  • There are eight bits dedicated to the exponent. For normal numbers, the exponent ranges from -126 to 127. To represent zero, you need an exponent value of -127 and zero mantissa.
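
To make the layout concrete, here is a small Go program (my own sketch, not part of the original discussion) that splits a 32-bit float into its three fields using math.Float32bits. For 0.75 = 1.5 × 2^-1, it prints a sign of 0, an exponent of -1 and a mantissa field of 4194304 (the bit pattern encoding the fractional part 0.5):

package main

import (
    "fmt"
    "math"
)

func main() {
    x := float32(0.75) // 0.75 = 1.5 * 2^-1
    bits := math.Float32bits(x)
    sign := bits >> 31                     // 1 sign bit
    exponent := int((bits>>23)&0xff) - 127 // 8 exponent bits, biased by 127
    mantissa := bits & 0x7fffff            // 23 explicit mantissa bits
    fmt.Println(sign, exponent, mantissa)  // prints: 0 -1 4194304
}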

So how many normal non-zero numbers are there between 0 and 1? The negative exponents range from -1 to -126. In each case, we have 2^23 distinct floating-point numbers because the mantissa is made of 23 bits. So we have 126 × 2^23 normal floating-point numbers in [0,1). If you don’t have a calculator handy, that’s 1,056,964,608. If we want to add the numbers 0 and 1, that’s 126 × 2^23 + 2, slightly over a billion distinct values. There are 2^32 32-bit words, or slightly over 4 billion, so about a quarter of them are in the interval [0,1]. Of all the floating-point numbers your computer can represent, a quarter of them lie in [0,1]. By extension, half of the floating-point numbers are in the interval [-1,1].

The number 2^32 is not divisible by 126 × 2^23 + 2, so we can’t take a 32-bit non-negative integer, divide it by 2^32 and hope that this will generate a number in [0,1] or [0,1) in an unbiased way.

We can use the fact that the mantissa uses 23 bits. This means in particular that if you pick any integer in [0, 2^24) and divide it by 2^24, then you can recover your original integer by multiplying the result again by 2^24. This works with 2^24 but not with 2^25 or any larger number. For 64-bit floating-point numbers, you have greater accuracy: you can replace 24 with 53.

So you can pick a random integer in [0, 2^24), divide it by 2^24 and you will get a random number in [0,1) without bias, meaning that for every integer in [0, 2^24), there is one and only one number in [0,1). Moreover, the distribution is uniform in the sense that the possible floating-point numbers are evenly spaced (the distance between them is a flat 2^-24).
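
As a sanity check (a sketch of mine, not from the original text), the following Go program verifies that every integer in [0, 2^24) survives a division by 2^24 followed by a multiplication by 2^24 in 32-bit floating-point arithmetic, whereas a value such as 2^24 + 1 does not survive the corresponding roundtrip with 2^25:

package main

import "fmt"

func main() {
    // every integer in [0, 2^24) is exactly representable as a float32,
    // and dividing or multiplying by a power of two is exact
    exact := true
    for i := uint32(0); i < 1<<24; i++ {
        f := float32(i) / float32(1<<24)
        if uint32(f*float32(1<<24)) != i {
            exact = false
            break
        }
    }
    fmt.Println("roundtrip exact for [0, 2^24):", exact) // true

    // 2^24 + 1 is not exactly representable as a float32, so the
    // corresponding roundtrip with 2^25 loses it
    i := uint32(1<<24 + 1)
    f := float32(i) / float32(1<<25)
    fmt.Println("2^24 + 1 survives with 2^25:", uint32(f*float32(1<<25)) == i) // false
}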

So even though single-precision floating-point numbers use 32-bit words, and even though your computer can represent about 2^30 distinct and normal floating-point numbers in [0, 1), chances are good that your random generator only produces 2^24 distinct 32-bit floating-point numbers in the interval [0, 1), and only 2^53 distinct 64-bit floating-point numbers.

A common way to generate random integers in an interval [0,N) is to first generate a random floating-point number in [0,1) and then multiply the result by N. Should N exceed 2^24 (or 2^53), you are unable to generate all integers in the interval [0,N). Similarly, to generate numbers in [a,b), you would generate a random floating-point number in [0,1), multiply the result by b-a, and add a. The result may not be ideal in general.

The following program generates random floating-point numbers:

package main

import (
    "fmt"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

// toFloat32 -> [0,1)
func toFloat32(seed *uint64) float32 {
    x := splitmix64(seed)
    x &= 0xffffff // keep 24 bits: an integer in [0, 2^24)
    return float32(x) / float32(1<<24)
}


// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // keep 53 bits: an integer in [0, 2^53)
    return float64(x) / float64(1<<53)
}

func main() {
    seed := uint64(1231114)
    fmt.Println(toFloat32(&seed))
    fmt.Println(toFloat64(&seed))
}

An amusing application of floating-point numbers is to estimate the value of pi. If we generate two floating-point numbers x and y in [0, 1), they fall within the unit square, which has an area of 1; the fraction of points satisfying x*x+y*y <= 1 should be pi/4. The following program prints an estimate of the value of pi.

package main

import (
    "fmt"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // keep 53 bits: an integer in [0, 2^53)
    return float64(x) / float64(1<<53)
}

func main() {
    seed := uint64(1231114)
    N := 100000000
    circle := 0
    for i := 0; i < N; i++ {
        x := toFloat64(&seed)
        y := toFloat64(&seed)
        if x*x+y*y <= 1 {
            circle += 1
        }

    }
    fmt.Println(4 * float64(circle)/float64(N))
}

Of course, practical algorithms might require other distributions such as the normal distribution. We can generate high-quality normally distributed floating-point values at high speed using the Ziggurat method (Marsaglia & Tsang, 2000). The implementation is not difficult, but it is technical. In particular, it requires a precomputed table. Typically, we generate normally distributed values with a mean of zero and a standard deviation of one: we then multiply the result by the desired standard deviation, and we add the desired mean.
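
The Ziggurat method itself is too long to reproduce here, but the scaling step is easy to illustrate. The following sketch (my own, relying on the simpler and slower Box-Muller transform rather than the Ziggurat method) produces standard normal values from the toFloat64 function defined above and then shifts them to a desired mean and standard deviation; it assumes that the math package is imported:

// boxMuller returns one normally distributed value with mean 0 and
// standard deviation 1, computed from two uniform values.
func boxMuller(seed *uint64) float64 {
    u := 1.0 - toFloat64(seed) // in (0,1], avoids log(0)
    v := toFloat64(seed)
    return math.Sqrt(-2*math.Log(u)) * math.Cos(2*math.Pi*v)
}

// normal scales a standard normal value: we multiply by the desired
// standard deviation and add the desired mean.
func normal(seed *uint64, mean, stddev float64) float64 {
    return mean + stddev*boxMuller(seed)
}

For example, normal(&seed, 5.0, 2.0) returns values with a mean of 5 and a standard deviation of 2.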

Discrete distributions

Sometimes we are given a collection of possible values and each value has a corresponding probability. For example, we might pick at random one of three colors (red, blue, green) with corresponding probabilities (20%, 40%, 40%). If there are few such values (for example three), a standard approach is a roulette wheel selection. We divide the interval from 0 to 1 into three distinct components, one for each colour: from 0 to 0.2, we pick red, from 0.2 to 0.6, we pick blue, from 0.6 to 1.0, we pick green.

The following program illustrates this algorithm:

package main

import (
    "fmt"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // keep 53 bits: an integer in [0, 2^53)
    return float64(x) / float64(1<<53)
}

func rouletteWheelSelection(seed *uint64, colors []string, probabilities []float64) string {
    // Create a slice of cumulative probabilities
    cumulativeProbabilities := make([]float64, len(probabilities))
    cumulativeProbabilities[0] = probabilities[0]
    for i := 1; i < len(probabilities); i++ {
        cumulativeProbabilities[i] = cumulativeProbabilities[i-1] + probabilities[i]
    }

    // Generate a random number between 0 and 1
    randomNumber := toFloat64(seed)

    // Select the color based on the random number and cumulative probabilities
    if randomNumber < cumulativeProbabilities[0] {
        return colors[0]
    }
    for i := 1; i < len(cumulativeProbabilities); i++ {
        if randomNumber >= cumulativeProbabilities[i-1] && randomNumber < cumulativeProbabilities[i] {
            return colors[i]
        }
    }

    return colors[len(colors)-1]
}

func main() {
    seed := uint64(1231114)

    colors := []string{"red", "blue", "green"}
    probabilities := []float64{0.2, 0.4, 0.4}

    fmt.Println(rouletteWheelSelection(&seed, colors, probabilities))
}

If you have to pick a value out of a large set, a roulette-wheel selection approach can become inefficient. In such cases, we may use the alias method.
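
The alias method requires a precomputed table, but each sample then takes constant time no matter how many values there are. The following Go sketch (my own illustration, using Vose's variant of the alias method) assumes the probabilities sum to one and reuses the splitmix64, toFloat64 and toIntervalUnbiased functions defined earlier:

type aliasTable struct {
    prob  []float64
    alias []int
}

// newAliasTable builds the table in linear time (Vose's method).
func newAliasTable(probabilities []float64) aliasTable {
    n := len(probabilities)
    t := aliasTable{prob: make([]float64, n), alias: make([]int, n)}
    scaled := make([]float64, n)
    var small, large []int
    for i, p := range probabilities {
        scaled[i] = p * float64(n)
        if scaled[i] < 1.0 {
            small = append(small, i)
        } else {
            large = append(large, i)
        }
    }
    for len(small) > 0 && len(large) > 0 {
        l := small[len(small)-1]
        small = small[:len(small)-1]
        g := large[len(large)-1]
        large = large[:len(large)-1]
        t.prob[l] = scaled[l]
        t.alias[l] = g
        scaled[g] += scaled[l] - 1.0
        if scaled[g] < 1.0 {
            small = append(small, g)
        } else {
            large = append(large, g)
        }
    }
    // entries left over are filled with probability one
    for _, i := range append(small, large...) {
        t.prob[i] = 1.0
    }
    return t
}

// sample picks a column uniformly, then either keeps it or jumps to its alias.
func (t aliasTable) sample(seed *uint64) int {
    i := int(toIntervalUnbiased(seed, uint64(len(t.prob))))
    if toFloat64(seed) < t.prob[i] {
        return i
    }
    return t.alias[i]
}

With the colors example above, we would build newAliasTable([]float64{0.2, 0.4, 0.4}) once and then call sample repeatedly.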

Cryptographic hashing and random number generation

We do not typically reimplement cryptographic functions: it is preferable to use well-tested implementations. Cryptographic functions are typically reserved for cases where security is a concern because they often use more resources.

Cryptographic hashing of strings is designed so that it is difficult to find two strings that collide (have the same hash value). Thus if you receive a message, and you were given its hash value ahead of time, and you check that the hash value and the message correspond, there are good chances that the message has not been corrupted. It is difficult (but not impossible) for an attacker to produce a message that matches the hash value you were given. To hash a string in Go cryptographically, you may use the following code:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    message := "Hello, world!"
    hash := sha256.Sum256([]byte(message))
    fmt.Printf("Message: %s\nHash: %x\n", message, hash)
}

Similarly, you may want to generate random numbers in a cryptographically strong manner: in such cases, the produced random numbers are difficult to predict. Even if I were to give you the last ten numbers, it would be difficult to predict the next one. If you were to implement software for an online casino, you should probably use cryptographic random numbers.

package main

import (
    "crypto/rand"
    "fmt"
    "math/big"
)

func main() {
    nBig, err := rand.Int(rand.Reader, big.NewInt(100))
    if err != nil {
        panic(err)
    }
    n := nBig.Int64()
    fmt.Printf("Here is a random %T between 0 and 99: %d\n", n, n)
}

Web server ‘hello world’ benchmark: Go vs Node.js vs Nim vs Bun

The Web is a convenient interface to your software. Many times, if you have an existing application, you may want to allow Web access to it using HTTP. Or you may want to build a small specialized Web application. In such instances, you do not want to use an actual Web server (e.g., Apache or IIS).

There are many popular frameworks for writing little web applications. Go and JavaScript (Node.js) are among the most popular choices. Reportedly, Netflix runs on Node.js; Uber moved from Node.js to Go for better performance. There are also less popular options such as Nim.

An in-depth review of their performance characteristics would be challenging. But if I just write a little toy web application, will I see a difference? A minimalist application gives you a reference point since more complex applications are likely to run slower.

Let us try it out. I want the equivalent of ‘hello world’ for web servers. I also do not want to do any fiddling: let us keep things simple.

A minimalist Go server might look as follows:

package main
import (
  "io"
  "fmt"
  "log"
  "net/http"
)
func main() {
  http.HandleFunc("/simple", func(w http.ResponseWriter, r *http.Request){
    io.WriteString(w, "Hello!")
  })
  fmt.Printf("Starting server at port 3000\n")
  if err := http.ListenAndServe(":3000", nil); err != nil {
    log.Fatal(err)
  }
}

A basic JavaScript (Node.js) server might look like this:

const f = require('fastify')()
f.get('/simple', async (request) => {
  return "hello"
})
f.listen({ port: 3000})
  .then(() => console.log('listening on port 3000'))
  .catch(err => console.error(err))

It will work as-is in an alternative runtime such as Bun, but to get the most out of the Bun runtime, you may need to write Bun-specific code:

const server = Bun.serve({
  port: 3000,
  fetch(req) {
   let url = new URL(req.url);
   let pname = url.pathname;
   if(pname == '/simple'){
     return new Response('Hello');
   }
   return new Response("Not Found.");
  }
});

Nim offers a nice way to achieve the same result:

import options, asyncdispatch
import httpbeast
proc onRequest(req: Request): Future[void] =
  if req.httpMethod == some(HttpGet):
    case req.path.get()
    of "/simple":
      req.send("Hello World")
    else:
      req.send(Http404)
run(onRequest, initSettings(port=Port(3000)))

An interesting alternative is to use uWebSockets.js with Node:

const uWS = require('uWebSockets.js')
const port = 3000;
const app = uWS.App({
}).get('/simple', (res, req) => {
  res.end('Hello!');
}).listen(port, (token) => {
  if (token) {
    console.log('Listening to port ' + port);
  } else {
    console.log('Failed to listen to port ' + port);
  }
});

We can also use C++ with the lithium library:

#include <lithium_http_server.hh>
int main() {
  li::http_api my_api;
  my_api.get("/simple") =
    [&](li::http_request& request, li::http_response& response) {
      response.write("hello world.");
    };
  li::http_serve(my_api, 3000);
}

I wrote a benchmark; my source code is available. I ran it on a powerful IceLake-based server with 64 cores. As is typical, such big servers have relatively low clock speeds (base frequency of 2 GHz, up to 3.2 GHz). I use a simple bombardier command as part of the benchmark:

bombardier -c 10 http://localhost:3000/simple

You can increase the number of concurrent connections to 1000 (-c 1000). My initial tests used autocannon which is a poor choice for this task.

My result indicates that Nim is doing quite well on this toy example.

system requests/second (10 connections) requests/second (1000 connections)
Nim 2.0 and httpbeast 315,000 +/- 18,000 350,000 +/- 60,000
GCC12 (C++) + lithium 190,000 +/- 60,000 385,000 +/- 80,000
Go 1.19 95,000 +/- 30,000 250,000 +/- 45,000
Node.js 20 and uWebSockets.js 100,000 +/- 25,000 100,000 +/- 35,000
Bun 1.04 80,000 +/- 15,000 65,000 +/- 20,000
Node.js 20 (JavaScript) 45,000 +/- 7,000 41,000 +/- 10,000
Bun + fastify 40,000 +/- 6,000 35,000 +/- 9,000

Jarred Sumner, the author of Bun, said on X that fastify is not fast in bun right now but that Bun.serve() is more than twice as fast as node:http in bun.

My web server does very little work, so it is an edge case. I have also not done any configuration: it is ‘out of the box’ performance. Furthermore, the server is probably more powerful than anything web developers will use in practice.

There is considerable noise in these results, and you should not trust my numbers entirely. I recommend you try running the benchmark for yourself.

I reviewed some blog posts comparing these systems, all concluding that Go is faster.

It would be interesting to add C, Rust and Zig to this benchmark.

Regarding the C++ solution, I initially encountered many difficulties. Using Lithium turned out to be simple: the most difficult part is to ensure that you have installed OpenSSL and Boost on your system. My solution is just as simple as the alternatives. The author of Lithium has a nice twist where he explains how to run a Lithium server using a docker container with a script. Doing it in this manner means that you do not have to worry about installing libraries on your system. Running a server in a docker container is perfectly reasonable but there is a performance overhead, so I did not use this solution in my benchmark.

While preparing this blog post, I had the pleasure of compiling software written in the Nim language for the first time. I must say that it left a good impression. The authors state that Nim was inspired by Python. It does feel quite like Python. I will revisit Nim later.

Parsing integers quickly with AVX-512

If I give a programmer a string such as "9223372036854775808" and I ask them to convert it to an integer, they might do the following in C++:

std::string s = ....
uint64_t val;
auto [ptr, ec] =
std::from_chars(s.data(), s.data() + s.size(), val);
if (ec != std::errc()) {} // I have an error !
// val holds the value

It is very fast: you can parse a sequence of random 32-bit integers at about 40 cycles per integer, using about 128 instructions.

Can you do better?

The recent Intel processors have new instructions, AVX-512, that can process multiple bytes at once and do it with masking, so that you can select just a range of data.

I am going to assume that you know the beginning and the end of the sequence of digits. The following code with AVX-512 intrinsics does the following:

  1. It computes the span in bytes (digit_count).
  2. If we have more than 20 bytes, we know that the integer is too large to fit in a 64-bit integer.
  3. We compute a “mask”: a 32-bit value that has only the most significant digit_count bits set to 1.
  4. We load the ASCII or UTF-8 string into a 256-bit register.
  5. We subtract the character value ‘0’ to get values between 0 and 9 (digit values).
  6. We check whether some value exceeds 9, in which case we had a non-digit character.

size_t digit_count = size_t(end - start);
// if (digit_count > 20) { error ....}
const simd8x32 ASCII_ZERO = _mm256_set1_epi8('0');
const simd8x32 NINE = _mm256_set1_epi8(9);
uint32_t mask = uint32_t(0xFFFFFFFF) << (start - end + 32);
auto in = _mm256_maskz_loadu_epi8(mask, end - 32);
auto base10_8bit = _mm256_maskz_sub_epi8(mask, in, ASCII_ZERO);
auto nondigits = _mm256_mask_cmpgt_epu8_mask(mask, base10_8bit, NINE);
if (nondigits) {
// there is a non-digit
}

This is the key step that uses the functionality of AVX-512. Afterward, we can use ‘old-school’ processing familiar to folks who work with advanced Intel intrinsics on conventional x64 processors. Mostly, we just multiply by 10, by 100 and by 10000 to create four 32-bit values: the first one corresponds to the least significant 8 ASCII bytes, the second one to the next most significant 8 ASCII bytes, and the third one to the up to 4 most significant bytes. When the number has 8 digits or fewer, only one of these words is relevant, and when there are 16 or fewer, only the first two are significant. We always waste one 32-bit value that is made of zeroes. The code might look as follows:

 auto DIGIT_VALUE_BASE10_8BIT =
_mm256_set_epi8(1, 10, 1, 10, 1, 10, 1, 10,
1, 10, 1, 10, 1, 10, 1, 10,
1, 10, 1, 10, 1, 10, 1, 10,
1, 10, 1, 10, 1, 10, 1, 10);
auto DIGIT_VALUE_BASE10E2_8BIT = _mm_set_epi8(
1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100);
auto DIGIT_VALUE_BASE10E4_16BIT =
_mm_set_epi16(1, 10000, 1, 10000, 1, 10000, 1, 10000);
auto base10e2_16bit =
_mm256_maddubs_epi16(base10_8bit, DIGIT_VALUE_BASE10_8BIT);
auto base10e2_8bit = _mm256_cvtepi16_epi8(base10e2_16bit);
auto base10e4_16bit =
_mm_maddubs_epi16(base10e2_8bit, DIGIT_VALUE_BASE10E2_8BIT);
auto base10e8_32bit =
_mm_madd_epi16(base10e4_16bit, DIGIT_VALUE_BASE10E4_16BIT);

I have implemented this function in C++ and compiled it using GCC 12. I ran the benchmark on an Ice Lake server. I use random 32-bit integers for testing. The AVX-512 function is over twice as fast as the standard approach.

AVX-512 1.8 GB/s 57 instructions/number 17 cycles/number
std::from_chars 0.8 GB/s 128 instructions/number 39 cycles/number

The comparison is not currently entirely fair because the AVX-512 function assumes that it knows the start and the end of the sequence of digits.

You can boost the performance by using an inline function, which brings it up to 2.3 GB/s, so a 30% performance boost. However, that assumes that you are parsing the numbers in sequence within a loop.

My original code would return fancy std::optional values, but GCC was negatively affected, so I changed my function signatures to be more conventional. Even so, LLVM/clang is slightly faster than GCC in my tests.

Credit: The original code and the problem were suggested to me by John Keiser. My code is largely derivative of his code.

Transcoding Unicode strings at crazy speeds with AVX-512

In software, we store strings of text as arrays of bytes in memory using one of the Unicode Transformation Formats (UTF), the most popular being UTF-8 and UTF-16. Windows, Java, C# and other common languages and systems default to UTF-16, whereas other systems and most of the web rely on UTF-8. There are benefits to both formats and we are not going to adopt one over the other any time soon. It means that we must constantly convert our strings from UTF-8 to UTF-16 and back.

It is so important that IBM mainframes based on z/Architecture provide special-purpose instructions named “CONVERT UTF-8 TO UTF-16” and “CONVERT UTF-16 TO UTF-8” for translation between the two encodings. By virtue of being implemented in hardware, these exceed 10 GiB/s of processing speed for typical inputs. The rest of us do not have access to powerful mainframes, but our processors have single-instruction-multiple-data (SIMD) instructions. These SIMD instructions operate on larger registers (128 bits, 256 bits, 512 bits) representing vectors of numbers. Starting with the AMD Zen 4 family of processors, and with recent server Intel processors (i.e., Ice Lake or better), we have powerful SIMD instructions that can operate over 512 bits of data (AVX-512). The width (512 bits) is not the most interesting thing about these instructions: they are also distinctly more powerful than prior SIMD instruction sets.

We have a software library for this purpose called simdutf. The library automatically detects your processor and selects a ‘kernel’ of functions that is most appropriate. It supports a wide range of instruction sets (ARM NEON, SSE2, AVX2), but until last year, it did not support AVX-512.

So, how quickly can you convert (or ‘transcode’) strings using AVX-512? It is the subject of a new paper: Transcoding unicode characters with AVX-512 instructions (Software: Practice and Experience, to appear). A reference point is the popular ICU library that is used almost universally for this task. It is mature and well optimized.

It is fairly easy to optimize the performance of transcoding for simple cases (e.g., English), but much more challenging for languages such as Arabic… and the most difficult case is probably streams of Emojis. The following graphic indicates our performance in gigabytes of input per second, comparing against our fast AVX2 kernel and ICU for difficult cases: plot A is for UTF-8 to UTF-16 transcoding and plot B is for UTF-16 to UTF-8 transcoding. In these experiments, we use an Ice Lake processor and the Clang 14 C++ compiler from LLVM.

[Figure: transcoding speed in gigabytes per second for difficult inputs; plot A: UTF-8 to UTF-16, plot B: UTF-16 to UTF-8.]

Using AVX-512 on recent Intel and AMD processors, we exceed 4 GB/s on Chinese and Emoji inputs and almost reach 8 GB/s on Arabic text when transcoding from UTF-8. When transcoding from UTF-16, we exceed 5 GB/s on Chinese and Emoji text and break the 20 GB/s barrier on Arabic text.

I also like to report the speed in ‘gigacharacters per second’: billions of characters processed per second. It is less impressive than gigabytes per second because characters can span multiple bytes, but with gigacharacters per second, we can directly compare the UTF-8 to UTF-16 transcoding to the UTF-16 to UTF-8 transcoding. We find it much easier to transcode from UTF-16, since it is a simpler format, than to transcode from UTF-8. The benefits compared to ICU depend on the data source, but our simdutf library can be several times faster, as the following tables show.

UTF-8 to UTF-16 (gigacharacters per second):

ICU AVX-512
Arabic 0.80 4.3
Chinese 0.50 1.8
Emoji 0.22 1.0
Hebrew 0.80 4.3
Hindi 0.43 1.7
Japanese 0.51 1.7
Korean 0.62 1.8
Latin 1.5 20.
Russian 0.46 4.2

UTF-16 to UTF-8 (gigacharacters per second):

ICU AVX-512
Arabic 0.67 11.
Chinese 0.36 3.9
Emoji 0.27 1.6
Hebrew 0.68 11.
Hindi 0.21 3.8
Japanese 0.37 3.8
Korean 0.37 3.8
Latin 0.91 20.
Russian 0.23 11.

Older Intel processors that supported AVX-512 raised serious concerns due to various forms of ‘frequency throttling’: whenever you’d use these instructions, they would trigger a lowering of the frequency. For this reason, we do not use AVX-512 instructions on these older processors, preferring to fall back on AVX instructions. There is no frequency throttling on AMD Zen 4 or recent Intel processors due to AVX-512 instructions.

Our functions have already been used in production systems for nearly a year as part of the simdutf library. The library includes an extensive collection of tests and benchmarks. It is part of several important projects such as Node.js, Bun, Oracle graaljs, and it is used by Couchbase and others. The simdutf library is a joint community effort, with multiple important contributors (e.g., Wojciech Muła, Yagiz Nizipli and so forth).

We are currently in the process of adding Latin 1 support to the library. In the future, I hope that we might add SVE/SVE2 support for the new generation of ARM processors.

Credit: The algorithmic design of the new AVX-512 functions should be credited to the always brilliant Robert Clausecker.

Locating ‘identifiers’ quickly (ARM NEON edition)

A common problem in parsing is that you want to find all identifiers (e.g., variable names, function names) in a document quickly. There are typically some fixed rules. For example, it is common to allow ASCII letters and digits as well as characters like ‘_’ in the identifier, but to forbid some characters at the beginning of the identifier (such as digits). E.g., ab123 is an identifier but 123ab might not be.

An efficient way to proceed is to use a 256-element table where allowed leading characters have a set value (say 255), non-identifier characters have the value 0, and all other characters have a non-zero, non-255 value.

In C, you might count identifiers using the following routine:

while (source < end) {
  uint8_t c = identifier_map[*source++];
  if(c) {
    count += (c == 255);
    while (source < end && identifier_map[*source]) {
      source++;
    }
  }
}

Can you do better?

Suppose that you have an ARM-based machine (such as a recent MacBook). Then you have access to single-instruction-multiple-data (SIMD) instructions which can process up to 16 bytes at a time. ARM has several types of SIMD instructions, but ARM NEON is the most well known.

We can apply vectorized classification to the problem (see Parsing Gigabytes of JSON per Second, The VLDB Journal, 28(6), 2019). I am not going to review the technique in detail, but the gist of it is that we replace the table lookup with a vectorized (or SIMD-based) table lookup. The relevant component of the code, using ARM intrinsic functions, looks as follows:

uint8x16_t low_nibble_mask = (uint8x16_t){
42, 62, 62, 62, 62, 62, 62, 62, 62, 62, 60, 20, 20, 20, 20, 21};
uint8x16_t high_nibble_mask =
(uint8x16_t){0, 0, 0, 2, 16, 33, 4, 8, 0, 0, 0, 0, 0, 0, 0, 0};
uint8x16_t chars = vld1q_u8((const uint8_t *)source);
uint8x16_t low =
vqtbl1q_u8(low_nibble_mask, vandq_u8(chars, vdupq_n_u8(0xf)));
uint8x16_t high = vqtbl1q_u8(high_nibble_mask, vshrq_n_u8(chars, 4));
uint8x16_t v = vtstq_u8(low, high);
uint8x16_t v_no_number = vtstq_u8(low, vandq_u8(high, vdupq_n_u8(61)));

The register v contains a non-zero byte value wherever there is an identifier character, and v_no_number contains a non-zero byte value wherever there is a non-digit identifier character. We can map these SIMD registers into bitmaps from which we can identify the location of the identifiers using regular C code.

We need to do a few logical operations and bit shifts. Counting the number of identifiers is as simple as the following:

while (source <= end16) {
  auto m = identifier_neon(source);
  uint16_t mask =
   m.leading_identifier_mask & ~(m.identifier_mask << 1 | lastbit);
  count += popcount(mask);
  lastbit = m.identifier_mask >> 15;
  source += 16;
}

ARM NEON is not particularly powerful, not compared with AVX-512 on Intel Ice Lake or AMD Zen 4 processors, but it is still quite good by historical standards, and the Apple implementation is fine. You can extend this approach to go to 64 bytes.

So how does the SIMD approach compare to the conventional approach? I am using a script that generates a large document. We want to count the number of identifiers as quickly as possible. The file is 10 MB and it contains half a million identifiers.

SIMD (simdjson-like, 16 bytes) 10 GB/s 2 ns/identifier
SIMD (simdjson-like, 64 bytes) 17 GB/s 1.1 ns/identifier
conventional (table-based) 0.5 GB/s 25 ns/identifier

So the SIMD-based approach is over 10 times faster in this instance. My Apple M2 processor runs at 3.6 GHz. Using the SIMD-based code, it retires over 5 instructions per cycle. The conventional table-based approach retires far fewer instructions, and it is bound by load latencies.

For context, my MacBook has a disk that can read at well over 2 GB/s. So the conventional routine is not fast.

On AVX-512, it should be possible to reach even higher speeds.

My source code is available.

Credit: This problem was suggested to me by Kevin Newton (Shopify). He provided the test file. The 64-byte implementation was provided by reader Perforated Blob.

Science and Technology links (September 2 2023)

  1. Physicists have published a paper with 5154 authors. The list of authors takes 24 pages out of the 33 pages. The lesson is that if someone tells you that they have published an important paper, you should ask how many authors there were and what their exact role was.
  2. Vegetarians are at higher risk for hip fracture.
  3. Virgin Galactic brings private customers to space.
  4. Quercetin, a commonly available supplement, appears to reduce blood pressure and improve the lipid profile in human beings. (Quercetin is relatively expensive in Canada.)
  5. Scientists increasingly talk with certitude: they increasingly avoid words such as might or probably. It seems that this might be part of an effort to promote the research. In effect, science writing might be adapting strategies from marketing.
  6. I have long accepted the fact that if you encourage students to think that they can grow their abilities by hard work, they will get better results. Macnamara and Burgoyne find that this result may not hold:

    We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.

  7. Heterochronic parabiosis is the process by which you connect the blood vessels of old and young animals (typically mice). Researchers ran a heterochronic parabiosis experiment over a long period of time (3 months) and they found that the old mice that had benefited from heterochronic parabiosis lived longer than control old mice. Furthermore, the treated old mice appeared rejuvenated, and the rejuvenation was still visible two months after the treatment. It seems that heterochronic parabiosis made the old mice younger at the level of gene expression. The approach is unlikely to be used on human beings as is. You would not want to connect the blood vessels of young human beings and old human beings for extended periods of time. However, it is a proof of principle. It seems that you can rejuvenate old animals by acting on the composition of their blood.
  8. Researchers used young blood plasma from pigs to seemingly rejuvenate old rats.
  9. A specific compound, AOH1996, appears to selectively kill cancer cells without side effect:

    We report a small molecule inhibitor, AOH1996, of a cancer-associated isoform of PCNA (caPCNA), which notably, almost completely inhibits the growth of xenograft tumors without causing any discernible toxicity to experimental animals. (Gu et al., 2023)

  10. Old people tend to become frail, a process called sarcopenia. Though the most obvious effect is a loss of muscle mass, the muscle quality is also decreased. It is possible that the loss of muscle might be a consequence of a loss of nerves in the muscles. Tezze et al. (2023) find that metformin and galantamine treat sarcopenia in old mice. Galantamine has been used to treat some of the side effects of Alzheimer’s whereas metformin is routinely used to treat diabetes.
  11. It seems that a significant fraction (e.g., 40%) of the measured warming in the recent decades could be explained by the urban effect: thermometers that were once located in the country become surrounded by a warm city.
  12. Naked mole rats are nearly ageless: they don’t age the way we do. Researchers have transferred a gene from naked mole rats to mice and found that they were able to increase the longevity of mice.
  13. The placebo effect is the phenomenon where people who receive a bogus treatment get better, despite the lack of an actual drug. Two doctors (Hróbjartsson and Gøtzsche) suggest that there is not much of a placebo effect outside of a small effect in the context of pain management.
  14. The number of natural disasters is not increasing.
  15. Used coffee grounds make strong concrete.
  16. China restricts how long young people can play video games. Their policies appear to be ineffectual: young people in China still play a lot of video games.
  17. Aging is conserved across mammalian species at select regions of the DNA. Thus if we rejuvenate some animals at the gene expression level, it is credible that it could extend to other animals.
  18. Psychiatrists choosing a treatment for themselves predominantly selected a different treatment than the one recommended to patients by participants asked in the regular recommendation role. Psychiatrists preferred the less invasive options for themselves (i.e. watchful waiting) whereas they recommended the more invasive ones for the patient (i.e. antidepressants and depot injections).

Transcoding Latin 1 strings to UTF-8 strings at 18 GB/s using AVX-512

Though most strings online today follow the Unicode standard (e.g., using UTF-8), the Latin 1 standard is still in widespread use inside some systems (such as browsers), as JavaScript strings are often stored as either Latin 1, UTF-8 or UTF-16 internally. Latin 1 captures the first 256 characters from the Unicode standard and represents them as a single byte per character.

In a previous blog post, we examined how we could convert UTF-8 strings to Latin 1. I want to examine the reverse problem, the transcoding of Latin 1 strings to UTF-8. We receive a byte array, and each byte is one character. There is no need for any validation. However, some characters (the ASCII characters) are mapped to one byte whereas all others should be mapped to two bytes.

You can code it in C using code such as this routine:

 unsigned char byte = data[pos];
if ((byte & 0x80) == 0) { // if ASCII
  // will generate one UTF-8 byte
  *utf8_output++ = (char)(byte);
  pos++;
} else {
  // will generate two UTF-8 bytes
  *utf8_output++ = (char)((byte >> 6) | 0b11000000);
  *utf8_output++ = (char)((byte & 0b111111) | 0b10000000);
  pos++;
}

Can we do better using the new AVX-512 instructions available on recent Intel (e.g. Ice Lake) and AMD (e.g., Zen4) processors?

We can try to do it as follows:

  1. Load 32 bytes into a register.
  2. Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  3. Identify the bytes that have their two most significant bits set by comparing with 192.
  4. Cast all bytes to 16-bit words into a 64-byte register.
  5. Add 0xc200 to all 16-bit words, as the most significant byte in a 16-bit word must be either 0xc2 or 0xc3.
  6. Add 0x00c0 to the 16-bit words where the corresponding byte had its two most significant bits set; this will move up a bit value so that the most significant byte may become 0xc3.
  7. Flip the order of the bytes within each 16-bit word since UTF-8 is big endian.
  8. Remove one byte (‘compress’) where we had an ASCII byte.
  9. Store the result (can be between 32 bytes and 64 bytes).

Using Intel intrinsics, the result might look as follows:

__m256i input = _mm256_loadu_si256((__m256i *)(buf + pos));
__mmask32 nonascii = _mm256_movepi8_mask(input);
int output_size = 32 + _popcnt32(nonascii);
uint64_t compmask = ~_pdep_u64(~nonascii, 0x5555555555555555);
__mmask32 sixth =
_mm256_cmpge_epu8_mask(input, _mm256_set1_epi8(192));
__m512i input16 = _mm512_cvtepu8_epi16(input);
input16 =_mm512_add_epi16(input16, _mm512_set1_epi16(0xc200));
__m512i output16 =
_mm512_mask_add_epi16(input16, sixth, input16,
_mm512_set1_epi16(0x00c0));
output16 = _mm512_shuffle_epi8(output16, byteflip);
__m512i output = _mm512_maskz_compress_epi8(compmask, output16);

Our AVX-512 registers can load and process 64 bytes at a time, but this approach consumes only 32 bytes (instead of 64 bytes). An anonymous reader has contributed a better approach:

  • Load 64 bytes into a register.
  • Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  • Identify the bytes that have their two most significant bits set by comparing with 192.
  • Create a ‘compression’ mask where ASCII characters correspond to 01 whereas non-ASCII characters correspond to 11: we need two distinct 64-bit masks.
  • We use the vpshldw instruction (instead of the vpmovzxbw instruction) to upscale the bytes to 16-bit values, adding the 0b11000000 leading byte in the process. We adjust for the bytes that have their two most significant bits set. This takes care of the first 32 bytes, assuming we interleaved the bytes.
  • The second part (the next 32 bytes) is handled with bitwise logical operations.
  • We save two blocks of 32 bytes.

 

The code using AVX-512 intrinsics might look as follows:

__m512i input = _mm512_loadu_si512((const __m512i *)(buf + pos));
__mmask64 nonascii = _mm512_movepi8_mask(input);
__mmask64 sixth =
_mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
const uint64_t alternate_bits = UINT64_C(0x5555555555555555);
uint64_t ascii = ~nonascii;
uint64_t maskA = ~_pdep_u64(ascii, alternate_bits);
uint64_t maskB = ~_pdep_u64(ascii>>32, alternate_bits);
// interleave bytes from top and bottom halves (abcd...ABCD -> aAbBcCdD)
__m512i input_interleaved = _mm512_permutexvar_epi8(_mm512_set_epi32(
0x3f1f3e1e, 0x3d1d3c1c, 0x3b1b3a1a, 0x39193818,
0x37173616, 0x35153414, 0x33133212, 0x31113010,
0x2f0f2e0e, 0x2d0d2c0c, 0x2b0b2a0a, 0x29092808,
0x27072606, 0x25052404, 0x23032202, 0x21012000
), input);
// double size of each byte, and insert the leading byte
__m512i outputA = _mm512_shldi_epi16(input_interleaved, _mm512_set1_epi8(-62), 8);
outputA = _mm512_mask_add_epi16(outputA, (__mmask32)sixth, outputA, _mm512_set1_epi16(1 - 0x4000));
__m512i leadingB = _mm512_mask_blend_epi16((__mmask32)(sixth>>32), _mm512_set1_epi16(0x00c2), _mm512_set1_epi16(0x40c3));
__m512i outputB = _mm512_ternarylogic_epi32(input_interleaved, leadingB, _mm512_set1_epi16((short)0xff00), (240 & 170) ^ 204); // (input_interleaved & 0xff00) ^ leadingB
// prune redundant bytes
outputA = _mm512_maskz_compress_epi8(maskA, outputA);
outputB = _mm512_maskz_compress_epi8(maskB, outputB);

It is possible to do even better if you expect that the input is often ASCII or contains few non-ASCII bytes. You can branch on the case where a sequence of 64 bytes is all ASCII and use a fast path in this case: trivially, we just store the 64 bytes we just read, as is.

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars wikipedia entry.

technique CPU cycles/byte instructions/byte
conventional 1.7 5.0
AVX-512 (32 bytes) 0.26 0.77
AVX-512 (64 bytes) 0.21 0.54
AVX-512 (64 bytes+branch) 0.17 0.45

On a 3.2 GHz processor, the 32-byte AVX-512 routine reaches 12 GB/s. It is nearly seven times faster than the conventional routine, and it uses six times fewer instructions. The approach consuming 64 bytes is faster than 14 GB/s, and the version with an ASCII branch breaks the 18 GB/s limit.

This server has memory bandwidth of at least 15 GB/s, but the CPU cache bandwidth is several times higher. There are disks with bandwidths of over 10 GB/s.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

How accurate is the birthday paradox formula?

Given a set of r random values from a large set (of size N), I have been using the formula 1-exp(-r**2/(2N)) to approximate the probability of a collision. It assumes that r is much smaller than N. The formula suggests that if you have hundreds of millions of random 64-bit numbers, you will start getting collisions with non-trivial probabilities, meaning that at least two values will be equal. At the one billion range, the probability of a collision is about 3% according to this formula.

It is somewhat unintuitive because if I give you two random 64-bit values, the probability of a collision is so low that it might as well be zero.

Though it is a textbook formula, we should still test it out to make sure that it is reasonable. Let us generate 32-bit random values for speed. I use a simple frequentist approximation: I generate many sets of 32-bit random values, I count the number of sets with a collision, and I divide this number by the total number of sets.
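
The following sketch (mine, using Go's math/rand for brevity rather than the splitmix64 generator above, with an arbitrary number of trials) compares the formula 1-exp(-r**2/(2N)) with such a frequentist estimate for 32-bit values (N = 2^32):

package main

import (
    "fmt"
    "math"
    "math/rand"
)

// theory is the textbook approximation of the collision probability for
// r random values drawn from a set of size n
func theory(r, n float64) float64 {
    return 1 - math.Exp(-r*r/(2*n))
}

// measured draws r random 32-bit values, checks whether any two are equal,
// and repeats over many trials to estimate the collision probability
func measured(r int, trials int) float64 {
    collisions := 0
    for t := 0; t < trials; t++ {
        seen := make(map[uint32]bool, r)
        for i := 0; i < r; i++ {
            v := rand.Uint32()
            if seen[v] {
                collisions++
                break
            }
            seen[v] = true
        }
    }
    return float64(collisions) / float64(trials)
}

func main() {
    const n = 1 << 32
    for _, r := range []int{800, 6400, 51200} {
        fmt.Printf("r=%d theory=%f measured=%f\n", r, theory(float64(r), n), measured(r, 1000))
    }
}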

My results are as follows. The formula agrees with my results: I get a maximal relative error of 23%. The exact measured output depends on the random number generation and will vary depending on how you set it up. Nevertheless, it looks good! As you can see, with even only 51,200 32-bit random values, the probability of a collision reaches about 25%. My code is available.

number theory measured relative error
100 0.000001 0.000001 error: 23%
200 0.000005 0.000005 error: 13%
400 0.000019 0.000014 error: 23%
800 0.000075 0.000073 error: 2%
1600 0.000298 0.000254 error: 15%
3200 0.001191 0.001079 error: 9%
6400 0.004757 0.004700 error: 1%
12800 0.018893 0.017570 error: 7%
25600 0.073456 0.071261 error: 3%
51200 0.263006 0.240400 error: 9%

Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512

Most strings online are Unicode strings in the UTF-8 format. Other systems (e.g., Java, Microsoft) might prefer UTF-16. However, Latin 1 is still a common encoding (e.g., within JavaScript runtimes). Its relationship with Unicode is simple: Latin 1 includes the first 256 Unicode characters. It is rich enough to represent most of the standard European languages. If something is stored in Latin 1, it can be encoded using Unicode. The reverse is obviously false. Nevertheless, let us assume that you have a Unicode string in UTF-8 that you want to quickly transcode to Latin 1.

The transcoding can be done with a short routine in C:

uint8_t leading_byte = data[pos]; // leading byte
if (leading_byte < 0b10000000) {
  *latin_output++ = leading_byte;
  pos++;
} else if ((leading_byte & 0b11100000) == 0b11000000) {
  *latin_output++ = (leading_byte & 1) << 6 | (data[pos + 1]);
  pos += 2;
}

It processes the data one byte at a time. There are two cases: ASCII bytes (one byte, one character) and two-byte characters (one leading byte, one continuation byte).

Can we do better?

We can use Single instruction, multiple data (SIMD) and specifically the advanced  SIMD instructions available on recent AMD Zen 4 and Intel Ice Lake processors: AVX-512 with VBMI2.

We can then process the data 64 bytes at a time. Using AVX-512, we load 64 bytes. We identify the location of the ASCII bytes, the leading bytes and the continuation bytes. These are identified using masks. We then modify the leading bytes, keeping one bit (just the least significant one) and shifting it up by six positions. We then do two compressions: one where we omit the continuation bytes, and one where we omit the newly transformed leading bytes. We then combine the two results with a byte-wise logical OR, and we are done. Using Intel intrinsics, the code might look as follows:

 __m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 ascii = _mm512_cmplt_epu8_mask(input, mask_80808080);
__mmask64 continuation = _mm512_cmplt_epi8_mask(input,
    _mm512_set1_epi8(-64));
__mmask64 leading = _kxnor_mask64(ascii, continuation);
__m512i highbits = _mm512_maskz_add_epi8(leading, input,
    _mm512_set1_epi8(62));
highbits = _mm512_slli_epi16(highbits, 6); // shift in position
input = _mm512_mask_blend_epi8(leading, input, highbits);
__m512i ascii_continuation = _mm512_maskz_compress_epi8(ascii | 
    continuation, input);
__m512i ascii_leading = _mm512_maskz_compress_epi8(ascii | leading,
    input);
__m512i output = _mm512_or_si512(ascii_continuation, ascii_leading);
_mm512_storeu_epi8((__m512i*)latin_output, output);

Of course, we must also validate the input, and it adds some complexity, but not too much.

An anonymous reader points out a significantly faster and simpler approach:

  1. Load 64 bytes
  2. Identify the leading bytes (non-continuation non-ASCII bytes) with a single comparison.
  3. Test whether the leading bytes have their least significant bit set, and construct a mask with it.
  4. Shift the mask by one position and set the second most significant bit to 1 in the continuation bytes that are preceded by a leading byte. We can do it by subtracting 0b11000000 because 0b10000000 - 0b11000000 is 0b11000000 (modulo 256).
  5. Prune all the leading bytes (non-continuation non-ASCII bytes) and write it out.

Using Intel intrinsics, the core implementation might be like so:

__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
__mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1));
input = _mm512_mask_sub_epi8(input, (bit6<<1) | next_bit6, input, _mm512_set1_epi8(-64));
next_bit6 = bit6 >> 63;
__mmask64 retain = ~leading;
__m512i output = _mm512_maskz_compress_epi8(retain, input);
int64_t written_out = _popcnt64(retain);
__mmask64 store_mask = (1ULL << written_out) - 1;
_mm512_mask_storeu_epi8((__m512i *)latin_output, store_mask, output);

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars wikipedia entry that has been modified to fit in Latin 1.

technique CPU cycles/byte instructions/byte
conventional 1.9 5.4
AVX-512 0.3 0.75
Faster AVX-512 0.2 0.43

On a 3.2 GHz processor, the AVX-512 routine reaches 12 GB/s. It is about 6 times faster than the conventional routine, and it uses 7 times fewer instructions. The faster AVX-512 routine is 10 times faster than the conventional routine while using 11 times fewer instructions: it reaches 18 GB/s. That is likely faster than the RAM bandwidth, which I estimate to be at least 15 GB/s, but the CPU cache bandwidth is several times higher. Keep in mind that there are disks with a 14 GB/s bandwidth.

This problem illustrates that AVX-512 can really do well on non-trivial string transformations without excessive cleverness.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

Coding of domain names to wire format at gigabytes per second

When you enter in your browser the domain name lemire.me, it eventually gets encoded into a so-called wire format. The name lemire.me contains two labels, one of length 6 (lemire) and one of length two (me). The wire format starts with 6lemire2me: that is, imagining that the name starts with an imaginary dot, all dots are replaced by the length (in bytes) of the upcoming label. The numbers are stored as byte values, not ASCII digits. There is also a final zero byte. RFC 1035 specifies the format:

Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. The high order two bits of every length octet must be zero, and the remaining six bits of the length field limit the label to 63 octets or less. To simplify implementations, the total length of a domain name (i.e., label octets and label length octets) is restricted to 255 octets or less.

I find the problem interesting because the wire format is an “indexed” format: you can easily iterate over the labels, jumping in constant time over them, knowing the exact length, and so forth. Thus, coding strings into the wire format is a form of indexing.

How quickly can we do it?

You can use conventional code to convert the name to the wire format. You copy non-dot characters and when you encounter a dot, you write the distance to the previous dot (imagining that there is a virtual first dot). In C, it might look as follows (omitting the final zero byte):

 char *src = "lemire.me";
 uint8_t *dst = ...; // buffer
 uint8_t *counter = dst++;
 do {
   while (is_name_character(*src)) {
    *dst++ = (uint8_t)*src; src++;
  }
  *counter = (uint8_t)(dst - counter - 1);
  counter = dst++;
  if (*src != '.') { break; }
  src++;
 } while (true);

Can you do better?

You can use Single instruction, multiple data (SIMD) instructions. You load 16, 32 or 64 characters in a wide register. You locate the dots. You construct a new register where the dots are replaced by position indexes and everything else is zeroed. E.g., starting with

.a.bcde.

You mark the location of the dots:

0xff 0 0xff 0 0 0 0 0xff

You compute the byte-wise AND with the following constant:

0 1 2 3 4 5 6 7

and you get an array where the dots have been replaced by indexes…

0 0 2 0 0 0 0 7

You then do a prefix computation, propagating the values:

0 2 2 7 7 7 7 7

This can be done using a logarithmic algorithm: shifting the values by 1 byte and taking the element-wise minimum, then shifting by 2 bytes and taking the minimum, and so forth.

Finally, we shift by one position, subtract the original values, and subtract one more. Masking off the result so that we only keep the values where the dots are, we get the desired label lengths:

1 - 4 - - - - 0

You can then blend the results with the original characters:

1 a 4 b c d e 0

If you use Intel intrinsics, the code might look like this… It is a bit technical so I am not going to explain it further, but the idea is as I explained…

__m128i dot_to_counters(__m128i input) {
  __m128i zero = _mm_setzero_si128();
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
  __m128i sequential =
_mm_setr_epi8(-128, -127, -126, -125, -124, -123, -122, -121, -120, -119,
-118, -117, -116, -115, -114, -113);
  __m128i dotscounts = _mm_and_si128(dots, sequential);
  __m128i origdotscounts = dotscounts;
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 1),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 2),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 4),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 8),
    dotscounts);
  __m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
  dotscounts = _mm_subs_epu8(next, origdotscounts);
  dotscounts = _mm_subs_epu8(dotscounts, _mm_set1_epi8(1));
  return _mm_blendv_epi8(input, dotscounts, dots);
}

I call this strategy  “Prefix-Minimum”. It is essentially data independent. It does not matter where the dots are, the code is always the same. Processors like that a lot!

Prefix-Minimum will be fast, especially if you use wide registers (e.g., 32 bytes with AVX2 instructions). However, it does not offer a direct path to validating the inputs: are the counters in the proper range? Do you have overly long labels?

If you need to reason about where the dots are, the best way is probably to build an index in the spirit of the simdjson library (our original paper is available as PDF). For example, you can load 64 bytes of data and construct a 64-bit word where each bit corresponds to a byte, and the bit is set to 1 if the corresponding byte is a dot (‘.’); you can then iterate through the bits, as a bitset. The construction of the 64-bit word might look as follows using Intel instructions and AVX2:

__m256i input1 = _mm256_loadu_si256((const __m256i *)src);
__m256i input2 = _mm256_loadu_si256((const __m256i *)(src + 32));
uint32_t dots1 = (uint32_t)_mm256_movemask_epi8(
_mm256_cmpeq_epi8(input1, _mm256_set1_epi8('.')));
uint32_t dots2 = (uint32_t)_mm256_movemask_epi8(
_mm256_cmpeq_epi8(input2, _mm256_set1_epi8('.')));
uint64_t DOTS = (((uint64_t)dots2) << 32) | dots1;

I implemented different strategies using AVX2 intrinsics. We can use a database of popular domain names for testing. Results vary depending on the system. I use GCC 11 and an Ice Lake server.

technique CPU cycles/string instructions/string
conventional 150 240
Prefix-Minimum (256 bits) 44 77
simdjson-like (256 bits) 56 77

My source code is available.

The Prefix-Minimum approach is fastest. Observe how the Prefix-Minimum approach is faster than the simdjson-like approach, despite executing the same number of instructions. That is because it is able to retire more instructions per cycle. The downside of the Prefix-Minimum approach is that it does not validate the label length.

The simdjson-like approach does full validation and is still quite fast: over three times faster than the conventional approach. It is sufficient to break the gigabyte per second barrier (1.7 GB/s in these experiments).

Being able to validate the counter values with the Prefix-Minimum without adding too much overhead would be fantastic: it might be possible to get the best speed without compromising on the validation. It remains an open problem whether it is possible.

I suspect that my implementations can be improved, by at least 20%.

Future work should consider AVX-512. I have the sketch of an implementation using AVX-512 but it is not faster. Porting to ARM NEON would be interesting.

Credit: The work was motivated by the simdzone project by NLnetLabs.