Scan HTML even faster with SIMD instructions (C++ and C#)

Earlier this year, both major Web engines (WebKit/Safari and Chromium/Chrome/Edge/Brave) accelerated HTML parsing using SIMD instructions. These ‘SIMD’ instructions, present in virtually all modern processors, can process multiple bytes at once (e.g., 16 bytes).

The problem that WebKit and Chromium solve is to jump to the next target character as fast as possible: one of <, &, \r and \0. In C++, you can use std::find_first_of for this problem whereas in C#, you might use the SearchValues class. But we can do better with a bit of extra work.

The WebKit and Chromium engineers use vectorized classification (Langdale and Lemire, 2019): we load blocks of bytes and identify the characters we seek using a few instructions. Then we identify the first such character loaded, if any. If no target character is found, we just move forward and load another block of characters. I reviewed the new methods used by the Web engines in C++ and C# in earlier posts, along with minor optimizations. The results are good, and they suggest that the WebKit and Chromium engineers were right to adopt this optimization.

But can we do better?

Let us consider an example to see how the WebKit/Chromium approach works. Suppose that my HTML file is as follows:

<!doctype html><html itemscope="" itemtype="http://schema.o'...

We load the first 16 bytes…

<!doctype html><

We find the target characters: the first character and the 16th character (<!doctype html><). We move to the first index. Later we start again from this point, this time loading slightly more data…

!doctype html><h

And this time, we will identify the 15th character as a target character, and move there.

Can you see why it is potentially wasteful? We have loaded the same data twice, and identified the same character again as a target character.

Instead, we can adopt an approach similar to that used by systems like simdjson. We load non-overlapping blocks of 64 bytes. Each block of 64 bytes is turned into a 64-bit word where each bit corresponds to a loaded character. If the character is a match, the corresponding bit is set to 1. The computed 64-bit word serves as an index for the 64 characters. Once we have used it up, we load another block, and so forth.

Let us build a small C++ structure to solve this problem. We focus on ARM NEON, but the concept is general. ARM NEON is available on your mobile phone, on some servers, and on your MacBook.

Our constructor initializes the neon_match64 object with a character range defined by start and end. We define three public methods:

  • get(): Returns a pointer to the current position within the character range.
  • consume(): Increments the offset and shifts the matches value right by 1.
  • advance(): Advances the position to the next target character; returns false when the range is exhausted.

To iterate over the target characters, we might use the class as follows:

neon_match64 m(start, end);
while (m.advance()) {
  const char *match = m.get();
  // ... process the match ...
  m.consume();
}

The class code might look like the following:

#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Note: for simplicity, we assume the input is padded so that it is
// always safe to read a full 64-byte block.
struct neon_match64 {
  neon_match64(const char *start, const char *end)
       : start(start), end(end), offset(0) {
    low_nibble_mask = {0, 0, 0, 0, 0, 0, 0x26, 0, 0, 0, 0, 0,
        0x3c, 0xd, 0, 0};
    bit_mask = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
    v0f = vmovq_n_u8(0xf);
    update(); // classify the first 64-byte block
  }
  const char *get() const { return start + offset; }
  void consume() {
    matches >>= 1;
    offset++;
  }
  bool advance() {
    while (matches == 0) {
      start += 64;
      if (start >= end) { return false; }
      update(); // classify the next 64-byte block
    }
    int off = __builtin_ctzll(matches);
    matches >>= off;
    offset += off;
    return true;
  }

  inline void update() { update((const uint8_t *)start); }
  inline void update(const uint8_t *buffer) {
    uint8x16_t data1 = vld1q_u8(buffer);
    uint8x16_t data2 = vld1q_u8(buffer + 16);
    uint8x16_t data3 = vld1q_u8(buffer + 32);
    uint8x16_t data4 = vld1q_u8(buffer + 48);

    uint8x16_t lowpart1 = vqtbl1q_u8(low_nibble_mask,
      vandq_u8(data1, v0f));
    uint8x16_t lowpart2 = vqtbl1q_u8(low_nibble_mask,
      vandq_u8(data2, v0f));
    uint8x16_t lowpart3 = vqtbl1q_u8(low_nibble_mask,
      vandq_u8(data3, v0f));
    uint8x16_t lowpart4 = vqtbl1q_u8(low_nibble_mask,
      vandq_u8(data4, v0f));

    uint8x16_t matchesones1 = vceqq_u8(lowpart1, data1);
    uint8x16_t matchesones2 = vceqq_u8(lowpart2, data2);
    uint8x16_t matchesones3 = vceqq_u8(lowpart3, data3);
    uint8x16_t matchesones4 = vceqq_u8(lowpart4, data4);

    uint8x16_t sum0 =
        vpaddq_u8(matchesones1 & bit_mask, matchesones2 & bit_mask);
    uint8x16_t sum1 =
        vpaddq_u8(matchesones3 & bit_mask, matchesones4 & bit_mask);
    sum0 = vpaddq_u8(sum0, sum1);
    sum0 = vpaddq_u8(sum0, sum0);
    matches = vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
    offset = 0;
  }

  const char *start;
  const char *end;
  size_t offset;
  uint64_t matches{};
  uint8x16_t low_nibble_mask;
  uint8x16_t v0f;
  uint8x16_t bit_mask;
};

The most complicated function is the update function, because ARM NEON makes it a bit difficult. The call vld1q_u8(buffer) loads 16 bytes (128 bits) of data from the memory location pointed to by buffer into the data1 variable. Similarly, data2, data3, and data4 load subsequent 16-byte chunks from buffer + 16, buffer + 32, and buffer + 48, respectively. The expression vandq_u8(data1, v0f) performs a bitwise AND operation between data1 and the vector v0f (which contains the value 0xF in each lane). The expression vqtbl1q_u8(low_nibble_mask, ...) uses the low_nibble_mask vector as a lookup table indexed by the low nibbles produced by the AND operation. The results are stored in lowpart1, lowpart2, lowpart3, and lowpart4. The expression vceqq_u8(lowpart1, data1) compares each lane of lowpart1 with data1; if they are equal, the corresponding lane in the result is set to all ones (0xFF). These lanes correspond to target characters. We repeat the same computation four times. We then use a bitwise AND with bit_mask, followed by several applications of pairwise sums (vpaddq_u8) to compute the 64-bit word.

I have added this C++ structure to an existing benchmark. Furthermore, I have ported it to C# for good measure.

How well does it do?

I use a small C++ benchmark where I scan the HTML of the Google home page. I run the benchmark on my Apple M2 laptop (LLVM 15).

method speed instructions/byte instructions/cycle
find_first_of 2.0 GB/s 9.8 instructions/byte 5.1 instructions/cycle
vectorized classification 6.8 GB/s 1.9 instructions/byte 2.8 instructions/cycle
64-bit vectorized classification 33 GB/s 0.7 instructions/byte 6.6 instructions/cycle

The numbers speak for themselves: the 64-bit vectorized classification approach is the clear winner. It requires few instructions per byte and it retires many instructions per cycle on average.

What about C#?

method speed
Conventional 1.4 GB/s
SearchValues 4.2 GB/s
vectorized classification 6.3 GB/s
64-bit vectorized classification 29 GB/s

Again, the 64-bit vectorized-classification approach is the clear winner. The SearchValues class provided by .NET gets an honorable mention for its good performance.

I have not yet ported the code for x64 processors, but I expect equally good results.


Benchmarking ARM processors: Graviton 4, Graviton 3 and Apple M2

The world of commodity processors is roughly divided in two: x64 chips for servers and PCs, and ARM processors for mobile devices. However, ARM chips are increasingly common on servers and laptops. My own favorite laptop is an Apple MacBook with an M2 chip. Amazon has been producing its own ARM processors (Graviton) and it recently made available the latest of these chips, the Graviton 4. It is reportedly based on a Neoverse V2 design from ARM, while its previous chip (the Graviton 3) was based on a Neoverse V1. Qualcomm has also released high-performance ARM chips for Windows laptops, the Snapdragon Elite X.

I decided to quickly test Amazon’s new Graviton 4 processor. I am deliberately not comparing against x64 processors. It is far easier to compare ARM against ARM given that you can run exactly the same instructions.

In a previous benchmark, I found that, at least for the type of work that I do, the Graviton 3 processor ran slower than an Apple M2, even after correcting for its lower frequency. The Graviton 3 runs at up to 2.6 GHz, while the Apple M2 can run at up to 3.5 GHz. The Graviton 4 runs at 2.8 GHz. Going by clock speed alone, you would not expect much of a performance gain from the Graviton 3 to the Graviton 4 (at most 10%).

Let us run benchmarks. Under AWS, I am going to use Ubuntu 24 with GCC 13. Under macOS, I am using the latest Apple LLVM (15).

URL parsing (C++)

Let us start with a URL parsing benchmark. We use the Ada URL parser to parse thousands of URLs, as fast as possible. To reproduce, do the following:

git clone
cd ada
cmake -B build -DADA_BENCHMARKS=ON
cmake --build build
sudo ./build/benchmarks/benchdata

We focus on the BasicBench_AdaURL_aggregator_href results.

I am getting that the AWS Graviton 4, though it runs at a lower frequency, can match the Apple M2 performance.

system ns/url GHz instructions/cycle
AWS Graviton 3 260 2.6 3.4
AWS Graviton 4 168 2.8 4.7
Apple M2 160 3.4 4.4

Unicode Validation (C#)

We recently published a fast Unicode validation library for .NET 8/C#. It contains many benchmarks, but let me consider the validation of a JSON file.

To reproduce:

sudo apt-get install -y dotnet-sdk-8.0
git clone
cd SimdUnicode/
cd benchmark/
dotnet run --configuration Release --filter "*Twitter*"

This time I am getting that the AWS Graviton 4 is significantly slower than the Apple M2, though it is much faster than the Graviton 3. Even adjusting for CPU frequency, the AWS Graviton 4 is slower: the frequency is 20% lower, but the speed is 30% lower.

system GB/s (SimdUnicode) GB/s (.NET Runtime Library)
AWS Graviton 3 14 9
AWS Graviton 4 19 11
Apple M2 25 14

JSON parsing (C++)

Let us try the simdjson library. To reproduce, do the following:

git clone
cd simdjson
cmake -B build
cmake --build build -j
./build/benchmark/bench_ondemand --benchmark_filter="find_tweet"

On this benchmark, we find again that the AWS Graviton 4 is significantly better than the AWS Graviton 3, but somewhat behind the Apple M2, even after adjusting for its lower frequency.

system GB/s instructions/cycle
AWS Graviton 3 3.6 4.4
AWS Graviton 4 4.6 5.1
Apple M2 6.4 5.7

Base64 encoding and decoding (C++)

This time we are going to use the simdutf library and its fast base64 encoding and decoding functions. To reproduce:

git clone
cd simdutf
cmake -B build
cmake --build build
./build/benchmarks/base64/benchmark_base64 -r

We focus on the simdutf::arm64 results. Again, the AWS Graviton 4 is faster than the AWS Graviton 3, but slower than the Apple M2, even after adjusting for CPU frequency.

system GB/s instructions/cycle
AWS Graviton 3 2.8 2.2
AWS Graviton 4 3.5 2.6
Apple M2 6.7 3.7

Number Parsing (C++)

We can use a number-parsing benchmark used to assess the fast_float library. To reproduce:

git clone
cd simple_fastfloat_benchmark
cmake -B build
cmake --build build

We care about the fast_float results. We get similar results, again.

system GB/s
AWS Graviton 3 1.0
AWS Graviton 4 1.3
Apple M2 1.7


I ran the memory-level parallelism benchmark. I find that the Graviton 4 is slightly better than the Graviton 3. However, the difference is small and you might not notice it in practice. This is a pointer-chasing benchmark where you do several chases at once. As you get to over 10 ‘lanes’, it becomes sensitive to noise. The Graviton 3 ‘noise’ visible in the graph is likely measurement error.


These few tests suggest that the Graviton 4 processor is not quite a match for Apple Silicon on a per-core basis. However, it is a significant step up from the Graviton 3. Even though both Gravitons have nearly the same clock speed, the Graviton 4 is much faster (e.g., by 30%). The Graviton 4 can retire many more instructions per cycle than the Graviton 3.

CPU frequency
graviton 3 ▏ 2.6 GHz ███████████████████████▏
graviton 4 ▏ 2.8 GHz █████████████████████████

URL parsing
graviton 3 ▏ 3.8 Murl/s ████████████████
graviton 4 ▏ 5.9 Murl/s █████████████████████████

Unicode Validation
graviton 3 ▏ 14 GB/s ██████████████████▍
graviton 4 ▏ 19 GB/s █████████████████████████

JSON parsing
graviton 3 ▏ 3.6 GB/s ███████████████████▌
graviton 4 ▏ 4.6 GB/s █████████████████████████

base64
graviton 3 ▏ 2.8 GB/s ███████████████████▉
graviton 4 ▏ 3.5 GB/s ████████████████████████▉

number parsing
graviton 3 ▏ 1.0 GB/s ███████████████████▏
graviton 4 ▏ 1.3 GB/s █████████████████████████

Scan HTML faster with SIMD instructions: .NET/C# Edition

Recently, the two major Web engines (WebKit and Chromium) adopted fast SIMD routines to scan HTML content. The key insight is to use vectorized classification (Langdale and Lemire, 2019): you load blocks of characters and identify the characters you seek using a few instructions. In particular, we use ‘SIMD instructions’, special instructions that are available on practically all modern processors and can process 16 bytes or more at once.

The problem that WebKit and Chromium solve is to jump to the next relevant character: one of <, &, \r and \0. Thus we must identify quickly whether we have found one of these characters in a block. On my Apple MacBook, a fast SIMD-based approach can scan an HTML page at about 7 GB/s, with code written in C/C++.

But what about C#? The recent C# runtime (.NET8) supports fast SIMD instructions.

Let us first consider a simple version of the function:

public unsafe static void NaiveAdvanceString(ref byte* start,
                                             byte* end) {
  while (start < end) {
    if (*start == '<' || *start == '&'
         || *start == '\r' || *start == '\0') {
      return;
    }
    start++;
  }
}
This function just visits each character, one by one, and it compares it against the target characters. If one target character is found, we return.

Let us consider a SIMD version of the same function. It is slightly more complicated.

public unsafe static void SIMDAdvanceString(ref byte* start,
                                            byte* end) {
    Vector128<byte> low_nibble_mask = Vector128.Create(0, 0, 0, 0, 0,
                  0, (byte)0x26, 0, 0, 0, 0, 0, (byte)0x3c,
                 (byte)0xd, 0, 0);
    Vector128<byte> v0f = Vector128.Create((byte)0x0F);
    Vector128<byte> bit_mask = Vector128.Create((byte)16, 15, 14, 13,
                        12, 11, 10, 9, 8,
                        7, 6, 5, 4, 3, 2, 1);

    const int stride = 16;
    while (start + (stride - 1) < end) {
        Vector128<byte> data = AdvSimd.LoadVector128((byte*)start);
        Vector128<byte> lowpart
           = AdvSimd.Arm64.VectorTableLookup(low_nibble_mask, data & v0f);
        Vector128<byte> matchesones = AdvSimd.CompareEqual(lowpart, data);
        if (matchesones != Vector128<byte>.Zero) {
            Vector128<byte> matches = AdvSimd.And(bit_mask, matchesones);
            int m = AdvSimd.Arm64.MaxAcross(matches).ToScalar();
            start += 16 - m;
            return;
        }
        start += stride;
    }

    while (start < end) {
        if (*start == '<' || *start == '&' || *start == '\r'
                || *start == '\0') {
            return;
        }
        start++;
    }
}

The function takes two pointers (ref byte* start and byte* end) that mark the beginning and end of the byte array. The main loop continues as long as start is at least 16 bytes away from end; this ensures there is enough data for vectorized operations. We load into the variable data 16 bytes from the memory pointed to by start. We use a vectorized lookup table and a comparison to quickly identify the target characters. The code then checks whether any element in matchesones is non-zero. If there is a match, we locate the first one (out of 16 characters), advance start, and return. If no match is found, we advance by 16 characters and repeat. We conclude with a fallback loop that processes the leftover data (fewer than 16 bytes).

As an optimization, it is helpful to use a local variable for the reference to the first pointer. Doing so improves the performance substantially: C# is not happy when we repeatedly modify a reference. Thus, at the start of the function, you may set byte* mystart = start, use mystart throughout, and then, just before a return, set start = mystart.

The .NET runtime library has also a fast SearchValues class to help search characters.

SearchValues<byte> searchValues = SearchValues.Create(
   stackalloc byte[] { 0, 13, 38, 60 });

ReadOnlySpan<byte> data = allLinesUtf8;
while (!data.IsEmpty) {
    int first = data.IndexOfAny(searchValues);
    data = data.Slice(first >= 0 ? first + 1 : 1);
}

I wrote a benchmark (in C#) that you can run if you have an ARM-based processor.

Conventional 1.4 GB/s
SearchValues 4.2 GB/s
SIMD (ARM NEON) 6.3 GB/s

Incredibly, the SIMD-based function is over 4 times faster than the conventional function in these tests, and the accelerated C# function is about 15% slower than the C++ version. The non-SIMD C# version is also slightly slower than the C++ version.

Harold Aptroot provided support for x64 processor (up to AVX2) so I extended my benchmark to an Intel Ice Lake system:

Conventional 1.0 GB/s
SearchValues 3.8 GB/s
SIMD (x64 AVX2) 7.6 GB/s

This time, the SIMD version is over 7 times faster than the scalar version. In fact, it matches the performance numbers that I get with C/C++.

It is also important for performance to write the code in such a way that the C# compiler tends to inline the scanning function, since it is called repeatedly. Initially, I had written the benchmark with some abstraction, using a delegate function, but it limited the best possible speed.

In other words, .NET/C# allows you to write fast code using SIMD instructions. It may be well worth the effort.

Suggested reading. I have a follow-up blog post where I explain how we can go much faster (29 GB/s).

How much memory does a call to ‘malloc’ allocate?

In C, we allocate memory on the heap using the malloc function. Other programming languages like C++ or Zig (e.g., std.heap.c_allocator) may call on malloc underneath, so it is important to understand how malloc works. Furthermore, the same concepts apply broadly to other memory allocators.

In theory, you could allocate just one byte like so:

char * buffer = (char*) malloc(1);

How much memory does this actually allocate?

On modern systems, the request allocates virtual memory which may or may not be actual (physical) memory. On many systems (e.g., Linux), the physical memory tends to be allocated lazily, as late as possible. Other systems such as Windows are more eager to allocate physical memory. It is also common and easy to provide your own memory allocators, so the behavior varies quite a bit.

But how much virtual memory does my call to malloc(1) typically allocate?

There is likely some fixed overhead per allocation: you can expect 8 bytes of metadata per allocation although it could be less or more depending on the allocator. You cannot use this overhead in your own program: it is consumed by the system to keep track of the memory allocations. Even if you have a highly efficient allocator with little overhead per allocation, the pointer itself must typically be tracked and, on a 64-bit system, that represents 8 bytes of data.

If you asked for 1 byte, you are probably getting a larger chunk of usable memory: maybe between 16 bytes and 24 bytes. Indeed, most memory allocations are aligned (rounded up) so that the address is divisible by alignof(max_align_t) (typically 16 bytes), and there is a minimum size that you may get. The C language also has a function called realloc which can be used to extend a memory allocation, often for free because the memory is already available.

You can ask how much memory is available. Under Linux, you can use the malloc_usable_size function, while under FreeBSD and macOS, you can use malloc_size. So I can write a small program that asks how much (virtual) memory was actually granted given a request. For one byte, my macOS laptop gives out 16 bytes while my x64 Linux server seemingly gives 24 bytes. If I plot the memory actually granted versus the memory requested, you see a staircase where, on average, you get 8 extra bytes of memory. That is to be expected if the pointer returned must be at an address divisible by 16 (max_align_t). Overall, you should avoid broad generalizations about how Linux or macOS work or compare, and simply keep the broad picture in mind.

What do we conclude? You should probably avoid allocating tiny blocks of memory (i.e., smaller than 16 bytes) on the heap. Furthermore, you may not want to optimize the allocation size down to a few bytes since it gets rounded up in any case. Finally, you should make use of realloc if you can, as you can often extend a memory region, at least by a few bytes, for free.

Performance tip: avoid unnecessary copies

Copying data in software is cheap, but it is not at all free. As you start optimizing your code, you might find that copies become a performance bottleneck.

Let me be clear that copies really are cheap. It is often more performant to copy that data than to track the same memory across different threads. The case I am interested in is when copies turn a trivial operation into one that is relatively expensive.

Recently, the fast JavaScript runtime (Bun) optimized its base64 decoding routines. Base64 is a technique used to represent binary data, like images or files, as ASCII text. It is ubiquitous on the Internet (email formats, XML, JSON, HTML and so forth). In Node.js and in Bun, you can decode a base64 string with Buffer.from(s, "base64").

Importantly, when decoding base64 strings, you can be almost certain that you are dealing with a simple ASCII string. And systems like Node.js and Bun have optimized string representations for the ASCII case, using one byte per character.

We had optimized base64 decoding in Node.js some time ago (credit to Yagiz Nizipli for his work)… but I was surprised to learn that Bun was able to beat Node.js by a factor of three. Because both Node.js and Bun use the same base64 decoding library, I was confused.

I mistakenly thought, based on quick profiling, that Node.js would copy the base64 data from an ASCII format (one byte per character) to a UTF-16 format (two bytes per character) despite our best attempt at avoiding copies. You can review my technically incorrect analysis on X. What I like is the second half of my analysis:

The story illustrates why our software is slower than it should be. We have layers of abstractions to fight against. Sometimes you win, sometimes you lose.

These layers are there for a reason, but they are not free.

To make matters worse… these abstraction layers often thicken over time… and the friction goes up.

To be clear, I do not claim that the Node.js code is optimal. In fact, I know it can be better. But it is not trivial to make it go fast.

I sometimes hear people say… “well, it is C++ and C++ is hard”. No. The C++ part is easy relatively speaking. The difficulty is at a higher level. It is not a matter of syntax. It is a matter of architecture.

I say that I was technically incorrect. Why was I incorrect?

It turns out that the copy was not happening as part of base64 decoding but in a completely separate function. There is an innocuous function in Node.js called StringBytes::Size which basically must provide an upper bound on the memory needed by the Buffer.from function. I went back in time and looked at how this function looked in the original Node.js (0.10):

case BASE64: {
  String::AsciiValue value(str);
  data_size = base64_decoded_size(*value, value.length());
  break;
}

Can you see what it does? It takes the string it receives, it makes a copy, and from the copy, it computes how many bytes the decoded version will need.

That original code tried to make an ‘ASCII’ copy so I presume it created a copy using one byte per character. It still was a shame, but not terribly so.

But very soon after (Node.js 0.11), it was changed to a version that converted the string to UTF-16 (two bytes per character), for increased safety:

case BASE64: {
  String::Value value(str);
  data_size = base64_decoded_size(*value, value.length());
  break;
}

This new version is potentially much more expensive because it may copy an ASCII string to a temporary UTF-16 (2-byte per character) string. It uses more memory, but it is also a more complicated function. Internally, the JavaScript engine implements this with the C++ template std::copy_n. The generated code is probably fine, but it is hardly “down to the metal”.

As long as everything else was not too optimized, it ran just fine… but once the base64 decoding itself was optimized, this simple length computation took up more than half of the running time.

So, in concrete terms, what are we talking about as far as performance goes? Let me consider a JavaScript benchmark:

import { bench, run } from "mitata";
function makeBenchmark(size, isToString) {
  const base64Input = Buffer.alloc(size, "latin1").toString("base64");
  const base64From = Buffer.from(base64Input, "base64");
  bench(`Buffer.from(${size} bytes, 'base64')`, () => {
    Buffer.from(base64Input, "base64");
  });
}
[1024 * 1024 * 8].forEach(s => makeBenchmark(s, false));
await run();

I install the latest version of Bun (bun upgrade --canary). I compare Node.js 22 with a patched version. I use my Apple MacBook for testing (ARM-based M2 processor). You can see that by simply avoiding the unnecessary copy, I boosted the base64 decoding from 2.5 GB/s to 7.0 GB/s. Not bad for removing a single line of code.

function time speed
Node.js 18 6’400 µs 1.3 GB/s
Node.js 22 3’300 µs 2.5 GB/s
Node.js with fix 1’200 µs 7.0 GB/s
Bun 950 µs 8.8 GB/s

Sometimes people observe at this point that the performance of Node.js 18 was already fine: 1.3 GB/s is plenty fast. It might be fast enough, but you must take into account that we are measuring a single operation that is likely part of a string of operations. In practice, you do not just ingest base64 data. You do some work before and some work after. Maybe you decoded a JPEG image that was stored in base64, and next you might need to decode the JPEG and push it to the screen. And so forth. To have an overall fast system, every component should be fast.

You may observe that Bun is still faster than Node.js, even after I claim to have patched this issue. There are likely other architectural issues in Node.js that Bun does not have. Remember that both Node.js and Bun are using the same library in this instance: simdutf.

It is maybe interesting to review Bun’s code (in Zig):

const outlen = bun.base64.decodeLen(slice);
const to = allocator.alloc(u8, outlen) catch return &[_]u8{};
const wrote = bun.base64.decode(to[0..outlen], slice).count;
return to[0..wrote];

It is far simpler than the equivalent in Node where memory is allocated in one function, and then the resulting buffer is passed to another function where the decoding finally happens. It is likely that Bun is faster because it has a simpler architecture where it is easier to see where the unnecessary work happens.

Update. After the publication of this blog post,
Leszek Swirski added v8::String::ValueView to v8 which should reduce the need for copies.

Validating gigabytes of Unicode strings per second… in C#?

We have been working on a fast library to validate and transcode Unicode and other formats such as base64 in C++: simdutf. We wondered: could we achieve the same good results in C#?

Microsoft’s .NET framework has made great strides in leveraging advanced instructions. For instance, if your processor supports AVX-512, you can instantiate 512-bit registers right in C#! The standard .NET runtime library effectively utilizes these features, demonstrating that they practice what they preach.

Most strings on the Internet are Unicode strings stored in UTF-8. When you ingest such strings (from disk or from the network), you need to validate them. To test the waters, we set our eyes on UTF-8 validation. With John Keiser, I helped design a fast UTF-8 validation algorithm for modern-day CPUs. We call the algorithm ‘lookup’. It may require less than one instruction per byte to validate even challenging input. The lookup validation algorithm is part of Oracle GraalVM, Google Fuchsia, the Node.js and Bun JavaScript runtimes, and so forth.

The .NET library has its own fast UTF-8 validation function: Utf8Utility.GetPointerToFirstInvalidByte. It is highly optimized. As the name implies, it finds the location of the first byte where an error might occur. It also computes some parameters from which you can tell how the input could be transcoded. The function is an internal function, but we can expose it by copying and pasting it.

Could we beat the .NET runtime, at least some of the time? It seems that we can!

Maybe you want to know what our code looks like? Here is a simplified example where we load 64 bytes and check whether they are all ASCII.

Vector512<byte> currentBlock = Avx512F.LoadVector512(pInputBuffer + processedLength);
ulong mask = currentBlock.ExtractMostSignificantBits();
if (mask == 0) {
  // got 64 ASCII bytes
} else {
  // oh oh, got non-ASCII bytes
}
Of course, the whole algorithm is more complicated, but not that much more… It is maybe 30 lines of code. We implement various versions of the algorithm, one for ARM processors, one for older x64 processors, and so forth.

For benchmarking, we use valid strings. The first one we check is twitter.json, a JSON file that is mostly ASCII with some non-trivial Unicode content within strings. We also use various synthetic strings representative of various languages.

On an Intel Ice Lake system, our validation function is up to 13 times faster than the standard library. On twitter.json, we are 2.4 times faster.

data set SimdUnicode AVX-512 (GB/s) .NET speed (GB/s) speed up
Twitter.json 29 12 2.4 x
Arabic-Lipsum 12 2.3 5.2 x
Chinese-Lipsum 12 3.9 3.0 x
Emoji-Lipsum 12 0.9 13 x
Hebrew-Lipsum 12 2.3 5.2 x
Hindi-Lipsum 12 2.1 5.7 x
Japanese-Lipsum 10 3.5 2.9 x
Korean-Lipsum 10 1.3 7.7 x
Latin-Lipsum 76 76 1.0 x
Russian-Lipsum 12 1.2 10 x

On an Apple M2 system, our validation function is 1.5 to 4 times faster than the standard library.

data set SimdUnicode speed (GB/s) .NET speed (GB/s) speed up
Twitter.json 25 14 1.8 x
Arabic-Lipsum 7.4 3.5 2.1 x
Chinese-Lipsum 7.4 4.8 1.5 x
Emoji-Lipsum 7.4 2.5 3.0 x
Hebrew-Lipsum 7.4 3.5 2.1 x
Hindi-Lipsum 7.3 3.0 2.4 x
Japanese-Lipsum 7.3 4.6 1.6 x
Korean-Lipsum 7.4 1.8 4.1 x
Latin-Lipsum 87 38 2.3 x
Russian-Lipsum 7.4 2.7 2.7 x

Observe how the standard library provides a function that is already quite fast: it can run at gigabytes per second. We are several times faster, but evidently, C# makes it possible to write highly optimized functions.

You can run your own benchmarks by grabbing our code from

It is a pleasure doing this performance-oriented work in C#. It is definitely one of my favorite programming languages right now.

One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by the AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best, and it is not necessary to apply SimdUnicode in all cases. I expect good performance on recent ARM-based Qualcomm processors.

The SimdUnicode library is joint work with Nick Nuon.

Rolling your own fast matrix multiplication: loop order and vectorization

If you must multiply matrices, you should use dedicated libraries. However, we sometimes need to roll our own code. In C++, you can quickly write your own Matrix template:

template <typename T>
struct Matrix {
  Matrix(size_t rows, size_t cols)
    : data(new T[rows * cols]()), // value-initialized (zeroed)
      rows(rows), cols(cols) {}

  T &operator()(size_t i, size_t j) {
      return data.get()[i * cols + j];
  }
  const T &operator()(size_t i, size_t j) const {
      return data.get()[i * cols + j];
  }

  std::unique_ptr<T[]> data;
  size_t rows;
  size_t cols;
};

How do you implement a matrix multiplication? A matrix multiplication is a sequence of three nested loops. If you do not want to get fancy, you therefore have six possibilities:

template <typename T>
void multiply_ikj(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_ijk(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kij(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kji(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jki(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jik(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

If you use an optimizing compiler and you tell it to compile specifically for your processor, you should get some fast code, at least in some instances. Which order is best?

The exact result depends on your data type (double, float, int), on the size of the matrices, on your compiler and your hardware. I wrote a benchmark where I use 100 by 100 matrices containing double values. I use GCC 12 (with full optimization -O3) and an Intel Ice Lake processor. I tell the compiler to optimize for the exact processor I have, so I expect that it will use advanced AVX-512 instructions when possible.

The net result in my experiment is that the best ordering is ikj. The worst ordering is jik.

If you were to compute the matrix multiplication manually and naively, you would need 100 times 100 times 100 multiplications, that is, 1 million multiplications and 1 million additions. Interestingly, the best ordering (ikj) uses roughly a million instructions in total to load the data, do the multiplications and additions, and store the results.

order speed frequency instructions per cycle instructions per product
ikj 7240 mult/s 2 GHz 3.3 916k
ijk 3190 mult/s 2 GHz 2.4 1526k
kij 3160 mult/s 2 GHz 2.5 1561k
kji 3090 mult/s 2 GHz 2.4 1519k
jki 3120 mult/s 3.2 GHz 3.5 3526k
jik 1280 mult/s 3.2 GHz 1.7 4331k

The resulting assembly shows that, for ikj, the compiler generates the vfmadd213pd instruction. There are fancier routines that the compiler could use, so the result is likely far from optimal.

Further work. Justine Tunney suggests manually unrolling the outer loops. One might also use OpenMP to get good parallelism (e.g., #pragma omp parallel for collapse(2) if (m*n*k > 300000)).

Scan HTML faster with SIMD instructions: Chrome edition

Modern processors have instructions to process several bytes at once. Effectively all processors have the capability of processing 16 bytes at once. These instructions are called SIMD, for single instruction, multiple data.

It was once an open question whether these instructions could be useful to accelerate common tasks such as parsing HTML or JSON. However, the work on JSON parsing, as in the simdjson parser, has shown rather decisively that SIMD instructions could, indeed, be helpful in breaking speed records.

Inspired by such work, the engine under the Google Chrome browser (Chromium) has adopted SIMD parsing of the HTML inputs. It is the result of the excellent work by a Google engineer, Anton Bikineev.

The approach is used to quickly jump to four specific characters: <, &, \r and \0. You can implement something that looks a lot like it using regular C++ code as follows:

void NaiveAdvanceString(const char *&start, const char *end) {
  for (; start < end; start++) {
    if (*start == '<' || *start == '&'
        || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

A ‘naive’ approach using the SIMD instructions available on ARM processors looks as follows. Basically, you just do more or less the same thing as the naive regular/scalar approach, except that instead of taking one character at a time, you take 16 characters at a time.

void AdvanceString(const char *&start, const char *end) {
  uint8x16_t quote_mask = vmovq_n_u8('<');
  uint8x16_t escape_mask = vmovq_n_u8('&');
  uint8x16_t newline_mask = vmovq_n_u8('\r');
  uint8x16_t zero_mask = vmovq_n_u8('\0');
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t quotes = vceqq_u8(data, quote_mask);
    uint8x16_t escapes = vceqq_u8(data, escape_mask);
    uint8x16_t newlines = vceqq_u8(data, newline_mask);
    uint8x16_t zeros = vceqq_u8(data, zero_mask);
    uint8x16_t mask = vorrq_u8(vorrq_u8(quotes, zeros),
           vorrq_u8(escapes, newlines));
    uint8x16_t matches = vandq_u8(bit_mask, mask);
    int m = vmaxvq_u8(matches);
    if (m != 0) {
      start += 16 - m;
      return;
    }
  }
  for (; start < end; start++) {
    if (*start == '<' || *start == '&'
       || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

If you have a PC with an Intel or AMD processor, you can do the equivalent using SSE2 instructions:

void AdvanceString(const char*& start, const char* end) {
    const __m128i quote_mask = _mm_set1_epi8('<');
    const __m128i escape_mask = _mm_set1_epi8('&');
    const __m128i newline_mask = _mm_set1_epi8('\r');
    const __m128i zero_mask = _mm_set1_epi8('\0');

    static constexpr auto stride = 16;
    for (; start + (stride - 1) < end; start += stride) {
        __m128i data = _mm_loadu_si128(
           reinterpret_cast<const __m128i*>(start));
        __m128i quotes = _mm_cmpeq_epi8(data, quote_mask);
        __m128i escapes = _mm_cmpeq_epi8(data, escape_mask);
        __m128i newlines = _mm_cmpeq_epi8(data, newline_mask);
        __m128i zeros = _mm_cmpeq_epi8(data, zero_mask);
        __m128i mask = _mm_or_si128(_mm_or_si128(quotes, zeros),
             _mm_or_si128(escapes, newlines));
        int m = _mm_movemask_epi8(mask);
        if (m != 0) {
            start += __builtin_ctz(m);
            return;
        }
    }

    // Process any remaining bytes (fewer than 16)
    while (start < end) {
        if (*start == '<' || *start == '&'
             || *start == '\r' || *start == '\0') {
            return;
        }
        start++;
    }
}

You can do slightly better if you use an approach we call vectorized classification (see Langdale and Lemire, 2019). Basically, you combine a SIMD approach with vectorized table lookups to classify the characters. The ARM NEON version using two table lookups looks as follows:

void AdvanceStringTable(const char *&start, const char *end) {
  uint8x16_t low_nibble_mask = {0b0001, 0, 0, 0, 0, 0, 0b0100,
          0, 0, 0, 0, 0, 0b0010, 0b1000, 0, 0};
  uint8x16_t high_nibble_mask = {0b1001, 0, 0b0100, 0b0010,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
  uint8x16_t v0f = vmovq_n_u8(0xf);
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t lowpart = vqtbl1q_u8(low_nibble_mask, vandq_u8(data, v0f));
    uint8x16_t highpart = vqtbl1q_u8(high_nibble_mask,
           vshrq_n_u8(data, 4));
    uint8x16_t classify = vandq_u8(lowpart, highpart);
    uint8x16_t matchesones = vtstq_u8(classify, vdupq_n_u8(0xFF));
    uint8x16_t matches = vandq_u8(bit_mask, matchesones);
    int m = vmaxvq_u8(matches);
    if (m != 0) {
      start += 16 - m;
      return;
    }
  }
  for (; start < end; start++) {
    if (*start == '<' || *start == '&' || *start == '\r'
     || *start == '\0') {
      return;
    }
  }
}

This version is close to Bikineev’s code as it appears in the Google Chrome engine, except that I use standard intrinsics while Google engineers prefer the excellent Highway SIMD library by Jan Wassenberg.

We can do slightly better in this instance because Bikineev only needs to distinguish between four characters: a single table lookup suffices. We did not elaborate on this in Langdale and Lemire (2019), but vectorized classification works with one, two, three or more table lookups, depending on the complexity of the target set. The simpler version might look as follows:

void AdvanceStringTableSimpler(const char *&start, const char *end) {
  uint8x16_t low_nibble_mask = {0, 0, 0, 0, 0, 0, 0x26, 0, 0,
                            0, 0, 0, 0x3c, 0xd, 0, 0};
  uint8x16_t v0f = vmovq_n_u8(0xf);
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t lowpart = vqtbl1q_u8(low_nibble_mask, vandq_u8(data, v0f));
    uint8x16_t matchesones = vceqq_u8(lowpart, data);
    uint8x16_t matches = vandq_u8(bit_mask, matchesones);
    int m = vmaxvq_u8(matches);
    if (m != 0) {
      start += 16 - m;
      return;
    }
  }
  for (; start < end; start++) {
    if (*start == '<' || *start == '&'
     || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

On a processor with slower NEON instructions (e.g., Neoverse N1, Graviton 2), the following variation might be slightly faster:

void AdvanceStringTableSimpler(const char *&start, const char *end) {
  uint8x16_t low_nibble_mask = {0, 0, 0, 0, 0, 0, 0x26, 0, 0, 0,
                              0, 0, 0x3c, 0xd, 0, 0};
  uint8x16_t v0f = vmovq_n_u8(0xf);
  uint8x16_t bit_mask = {16, 15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1};
  static constexpr auto stride = 16;
  for (; start + (stride - 1) < end; start += stride) {
    uint8x16_t data = vld1q_u8(reinterpret_cast<const uint8_t *>(start));
    uint8x16_t lowpart = vqtbl1q_u8(low_nibble_mask, vandq_u8(data, v0f));
    uint8x16_t matchesones = vceqq_u8(lowpart, data);
    if (vmaxvq_u32(vreinterpretq_u32_u8(matchesones)) != 0) {
      uint8x16_t matches = vandq_u8(bit_mask, matchesones);
      int m = vmaxvq_u8(matches);
      start += 16 - m;
      return;
    }
  }

  for (; start < end; start++) {
    if (*start == '<' || *start == '&' || *start == '\r' || *start == '\0') {
      return;
    }
  }
}

How do these three techniques compare? I wrote a small benchmark where I scan the HTML of the Google home page. I ran the benchmark on my Apple M2 laptop (LLVM 15).

method speed instructions/byte
naive (scalar) 2.0 GB/s 9.8 instructions/byte
naive (SIMD) 5.8 GB/s 2.1 instructions/byte
vectorized classification (2 lookups) 6.0 GB/s 2.0 instructions/byte
vectorized classification (1 lookup) 6.8 GB/s 1.8 instructions/byte

The results follow my expectations: the simplest vectorized classification routine has the best performance. However, you may observe that even a rather naive SIMD approach can be quite fast in this instance.

If you have an old SSE2 PC, only the simple SIMD approach is available. My results suggest that it might be good enough to get good results.

Suggested reading. I have a follow-up blog post where I explain how we might go much faster.

Update. After the publication of this blog post, the WebKit engine (used by Safari) and Chromium adopted the optimization with a vectorized table lookup.

Quickly checking whether a string needs escaping

In software, we often represent strings by surrounding them with quotes ("). What happens if the string itself contains quotes? We then need to escape the string. For example, the quote character (") or the backslash character (\) should be replaced by \" or \\. Most programmers are familiar with this process.

Most strings do not need to be escaped. It is thus useful to quickly check whether a string requires escaping.

In the popular interchange format JSON, strings must be escaped if they contain the quote character, the backslash character or any ASCII control character (i.e., with a code point less than 32).

How might we check such a string? Let us assume that we are using C++. A reasonable function might look as follows.

bool simple_needs_escaping(std::string_view v) {
  for (char c : v) {
    if ((uint8_t(c) < 32) | (c == '"') | (c == '\\')) {
      return true;
    }
  }
  return false;
}

The function takes a std::string_view named v as its argument. It iterates over each character c in the input string v. Inside the loop, we first use the comparison
(uint8_t(c) < 32) which checks if the ASCII value of the character is less than 32. This condition covers control characters (such as newline, tab, etc.). Then the comparison (c == '"') checks if the character is a double quote (") and (c == '\\') checks if the character is a backslash (\). If any of the above conditions are true for a character, the function returns true, indicating that the character needs escaping. If none of the conditions are met for any character, the function returns false, indicating that no escaping is needed.

Importantly, this function should exit as soon as a character needing escaping is found. If we expect that no such character will be found, we might try a different approach where we always scan the whole input. This allows the compiler to try other optimizations. In particular, the compiler is more likely to use autovectorization: the ability of our optimizing compiler to compile our code using fancy instructions like Single Instruction, Multiple Data (SIMD) instructions. I call such a function “branchless” as a reference to the fact that it never branches out of the loop. The result might look as follows:

bool branchless_needs_escaping(std::string_view v) {
  bool b = false;
  for (char c : v) {
    b |= ((uint8_t(c) < 32) | (c == '"') | (c == '\\'));
  }
  return b;
}

We might also try a variation with table lookups. Instead of doing arithmetic and comparisons, we build a single table that records, for each possible byte value, whether it requires escaping.

It takes a bit more effort, but the result can be quite fast because each character is checked with a single load from a table, along with maybe one or two additional instructions.

static constexpr std::array<uint8_t, 256> json_quotable_character =
    []() constexpr {
  std::array<uint8_t, 256> result{};
  for (int i = 0; i < 32; i++) {
    result[i] = 1;
  }
  for (int i : {'"', '\\'}) {
    result[i] = 1;
  }
  return result;
}();

bool table_needs_escaping(std::string_view view) {
  uint8_t needs = 0;
  for (uint8_t c : view) {
    needs |= json_quotable_character[c];
  }
  return needs;
}

Can we do better? We might if we use SIMD instructions deliberately such as NEON or SSE2. For the most part, your computer is likely either an ARM machine, supporting at least NEON instructions or an x64 machine supporting at least SSE2 instructions. It is easy to distinguish at compile time between these two scenarios. Of course, your processor might support even fancier instructions, but NEON and SSE2 should be good enough to get a good speedup, especially if the strings are not very long.

A good general strategy is to load the data in blocks of 16 bytes and do a few comparisons over these 16 bytes. The magic of SIMD instructions is that they can do 16 comparisons at once, so they can be much faster than character-by-character processing. What about the case where we have fewer than 16 characters? If you do not want to read past the string, you can simply fall back on one of our more conventional functions.

The NEON code might look as follows:

bool simd_needs_escaping(std::string_view view) {
  if (view.size() < 16) {
    return simple_needs_escaping(view);
  }
  size_t i = 0;
  static uint8_t rnt_array[16] = {1, 0, 34, 0, 0,  0, 0, 0,
                                  0, 0, 0,  0, 92, 0, 0, 0};
  const uint8x16_t rnt = vld1q_u8(rnt_array);
  uint8x16_t running{0};
  for (; i + 15 < view.size(); i += 16) {
    uint8x16_t word = vld1q_u8((const uint8_t *)view.data() + i);
    running = vorrq_u8(running,
        vceqq_u8(vqtbl1q_u8(rnt, vandq_u8(word, vdupq_n_u8(15))), word));
    running = vorrq_u8(running, vcltq_u8(word, vdupq_n_u8(32)));
  }
  if (i < view.size()) {
    uint8x16_t word =
        vld1q_u8((const uint8_t *)view.data() + view.length() - 16);
    running = vorrq_u8(running,
        vceqq_u8(vqtbl1q_u8(rnt, vandq_u8(word, vdupq_n_u8(15))), word));
    running = vorrq_u8(running, vcltq_u8(word, vdupq_n_u8(32)));
  }
  return vmaxvq_u32(vreinterpretq_u32_u8(running)) != 0;
}

The SSE2 code might look as follows:

inline bool simd_needs_escaping(std::string_view view) {
  if (view.size() < 16) {
    return simple_needs_escaping(view);
  }
  size_t i = 0;
  __m128i running = _mm_setzero_si128();
  for (; i + 15 < view.size(); i += 16) {
    __m128i word = _mm_loadu_si128((const __m128i *)(view.data() + i));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(34)));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(92)));
    running = _mm_or_si128(
        running, _mm_cmpeq_epi8(_mm_subs_epu8(word, _mm_set1_epi8(31)),
                                _mm_setzero_si128()));
  }
  if (i < view.size()) {
    __m128i word =
        _mm_loadu_si128((const __m128i *)(view.data() + view.length() - 16));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(34)));
    running = _mm_or_si128(running, _mm_cmpeq_epi8(word, _mm_set1_epi8(92)));
    running = _mm_or_si128(
        running, _mm_cmpeq_epi8(_mm_subs_epu8(word, _mm_set1_epi8(31)),
                                _mm_setzero_si128()));
  }
  return _mm_movemask_epi8(running) != 0;
}

You can further optimize the SIMD-based functions to reduce the number of instructions, but they already use far fewer instructions per 16-byte block than the conventional functions do.

It can be tricky to benchmark such functions. You will find large differences depending on your compiler and your processor. And the results are sensitive to the data, obviously. However, my experience is that the SIMD approaches usually win, by a lot. To test it out, I wrote a small benchmark. In my benchmark, I use a few strings of different lengths. Some of my strings have only a handful of characters, and some are short sentences. I have dozens of strings. None of the strings require escaping: I believe that this is a common scenario.

system simple branchless table SIMD
Linux GCC 12, Intel Ice Lake (3.2 GHz) 0.65 GB/s 0.91 GB/s 1.9 GB/s 5.2 GB/s
Linux LLVM 16, Intel Ice Lake (3.2 GHz) 0.91 GB/s 2.6 GB/s 2.5 GB/s 5.7 GB/s

In these results, the table-based approach is quite competitive. However, it can be beaten by the branchless approach when using LLVM/clang, thanks to its good ability to autovectorize the code.

Yet, in all instances, the hand-coded SIMD functions are at least twice as fast. As usual, my source code is available and I invite you to run your own benchmarks.

Note: The character with code point value 127 is also a control character in ASCII. Furthermore, Unicode has many control characters.

Never reason from the results of a sampling profiler

In the quest for software optimization, a trusty companion is the sampling profiler, a tool available in most programming languages. These profilers work unobtrusively, taking snapshots of the program’s state and recording the currently executing function or instruction.

While profilers sound like a silver bullet for identifying performance bottlenecks, their usefulness has limitations. They excel at highlighting bottlenecks, but often lack the nuance to pinpoint the root cause. In simpler cases, a profiler’s report might be enough, but relying solely on this data can lead one astray. Misinterpretations of profiling results are a common pitfall I’ve observed amongst programmers.

Imagine a store manager facing customer complaints about long lines. A frustrated customer like myself might be stuck due to a malfunctioning cash register. However, if the manager, instead of fixing the register, simply photographs the queue to identify the “culprit,” I, the innocent bystander, might be wrongly blamed for the delay. Profilers can be just as misleading, highlighting symptoms without revealing the underlying cause.

You are taking a few simple snapshots of a complex system. Of course, you can take more snapshots, and make them richer, but then you start affecting the system, falling prey to a software equivalent of the Heisenberg uncertainty principle: you can know the state of your program very precisely at all times, but then you know little about its speed; or you can measure the speed quite precisely, but with little idea of the intermediate states.

Do use sampling profilers. I find them useful. Just do not reason about your problems from them. They merely offer a starting point.

Further reading: Sampling profilers can mislead, and mastering any one tool (e.g., perf or VTune or uPerf) won’t magically confer analysis expertise.

Science and Technology links (May 25 2024)

  1. Artificial intelligence is far more efficient at producing content than human beings, as far as carbon emissions go.
  2. Human brains got larger by over 5% between 1930 and 1970.
  3. Replacing plastics with ‘environmentally friendly’ alternatives typically results in greater greenhouse gas emissions.
  4. Prostate-specific antigen screening has only a small effect on men’s risk of dying in absolute terms.
  5. Local exposure to poor individuals reduces support for redistribution among the well-off. In other words, wealthy people are more likely to favor government programs helping the poor if they never see poor people.
  6. Happier looking people are judged better. If you want to be viewed as a good person, make sure you appear happy.
  7. Females mount stronger immune responses to many pathogens, they awaken more frequently at night, they express greater concern about physically dangerous stimuli, they exert more effort to avoid social conflicts, they exhibit a personality style more focused on life’s dangers, they react to threats with greater fear, disgust and sadness and they develop more threat-based clinical conditions than males. (Benenson et al.).
  8. The Lincoln Sea, north of Greenland, was ice free about 10,000 years ago.
  9. ADHD and autism referrals were up fivefold in 2023 in the UK. It is unclear why, but overdiagnosis is a possibility.
  10. We believe that one of the earliest cities might have been in modern-day Turkey, about 9,000 years ago.
  11. About 20,000 years ago, sea levels were over 100 meters lower, as we were in the last glacial maximum.
  12. High-intensity strength training exercises are an effective means to preserve bone density while improving muscle mass, strength, and balance in postmenopausal women.
  13. The average American is willing to pay over $500 to get a 3-month exemption from a medical mask mandate.
  14. Experiencing nature leads to healthier food choices.
  15. Australia is getting greener, rapidly.
  16. When you lose weight, you spend less energy. However, if you consume relatively more fat or protein during the weight loss, you tend to use more energy.
  17. Trees are getting bigger.
  18. Sun exposure may improve your health.
  19. They are conducting a clinical trial for tooth regrowth technology in Japan.

Learning from the object-oriented mania

Back when I started programming professionally, every expert and every software engineering professor would swear by object-oriented programming. Resistance was futile. History had spoken: the future was object-oriented.

It is hard to overstate how strong the mania was. In education, we started calling textbooks and videos ‘learning objects’. Educators would soon ‘combine learning objects and reuse them’.

A competitor of a client I was working with at the time had written a server in C. They had to pay lip service to object-oriented programming, so they said that their code was ‘object-oriented’.

I once led a project to build an image compression system. They insisted that before we even wrote a single line of code, we planned it out using ‘UML’. It had to be object-oriented from the start, you see.

You had to know your object-oriented design patterns, or you could not be taken seriously.

People rewrote their database engines so that they would be object-oriented.

More than 25 years later, we can finally say, without needing much courage, that it was insane, outrageous, and terribly wasteful. 

Yet, even today, the pressure remains on. Students are compelled to write simple projects using multiple classes. Not just learn the principles of object-oriented programming, which is fair enough, but we still demand that they embrace the ideology.

To be fair, some of the basic principles behind object-oriented programming can be useful. At least, you should know about them.

But the mania was unwarranted and harmful.

The lesson you should draw is not that object-oriented programming is bad, but rather that whatever the current trendy technique or idea happens to be, it is likely grossly overrated.

The social mechanism is constantly in action, though it is no longer acting for object-oriented programming. It takes many forms. Not long ago, you had to wear a mask to attend a conference. Everyone ‘knew’ that masks stopped viruses and had no side-effects… just like everyone just knew that object-oriented programming makes better and more maintainable software, without negative side-effects. All experts agree. All figures of authority agree. The written press agrees. The celebrities agree. The social pressure to conform is everywhere. It must be true, it has to be true. Anyone disagreeing is a bad person.

You can recognize such a social contagion by its telltale signs.

  1. Rapid Spread: A social contagion spreads quickly through a group or community, much like a wildfire. One day everyone is talking about the latest object-oriented pattern, and the next day, everyone is putting it into practice.
  2. Amplification: You often observe the emergence of ‘influencers’, people who gain high social status and use their newly found position to push further the hype. The object-oriented mania was driven by many key players who made a fortune in the process. They appeared in popular shows, magazines, and so forth.
  3. Peer Influence: Social contagion often relies on peer influence. E.g., everyone around you starts talking about object-oriented programming.
  4. Conformity: People often mimic the behaviors or attitudes of others in their group, leading to a conformity effect. People who do not conform are often excluded, either explicitly or implicitly. For example, object-oriented started to appear in job ads and was promoted by government agencies.
  5. Aggressive Behavior: You see a significant change from usual behavior as irrationality creeps in. If you criticize object-oriented programming, something is wrong with you!
  6. Grandiose Beliefs or Delusions: Claims that object-oriented programming would forever change the software industry for the better were everywhere. You could supposedly reuse your objects and classes easily from one project to the next. Never mind that none of these claims could ever be sustained.
  7. Risky Behavior: Entire businesses bet their capital on projects trying to reinvent some established tool in an object-oriented manner. People kept throwing caution to the wind: let us rebuild everything the one true way; what is the worst that can happen?

Appendix. There is a very good reason why hardly any of us wears a mask today. If masks prevented the transmission of respiratory diseases, it would have been a medical breakthrough. But they do no such thing. There is no evidence that masks have benefits and they may well create net harm. The one European country that did not embrace general mask mandates (Sweden) ended up with effectively the lowest excess mortality in Europe.

Cochrane conducted a thorough review of all the strong evidence gathered during the covid era. Here is what the Cochrane review concludes:

We included 12 trials (10 cluster‐RCTs) comparing medical/surgical masks versus no masks to prevent the spread of viral respiratory illness (two trials with healthcare workers and 10 in the community). Wearing masks in the community probably makes little or no difference to the outcome of influenza‐like illness (ILI)/COVID‐19 like illness compared to not wearing masks. Wearing masks in the community probably makes little or no difference to the outcome of laboratory‐confirmed influenza/SARS‐CoV‐2 compared to not wearing masks. Harms were rarely measured and poorly reported.

Here are the results from an earlier Cochrane review, based on pre-COVID studies:

Compared with wearing medical or surgical masks, wearing N95/P2 respirators probably makes little to no difference in how many people have confirmed flu; and may make little to no difference in how many people catch a flu-like illness.

Ian Miller wrote a fantastic book on the topic: Unmasked: The Global Failure of COVID Mask Mandates.

Atlas et al. (2024) included masks in their lessons to draw from the covid era: Masks Were of Little or No Value and Possibly Harmful.

German officials admitted that evidence for making masks mandatory was lacking, according to the health agency’s deliberations released after a long legal battle.

Sandlund et al. conclude:

Because benefits of masking for COVID-19 have not been identified, it should be recognised that mask recommendations for children are not supported by scientific evidence.

Forwarding references in C++

In C++, there are different ways to pass a value to a function. Typically, at any given time, an object in C++ ‘belongs’ to a single function. The various ways to call a function differ in who owns the object, the caller or the callee (the function being called).

The simplest one is that we pass by value. In such cases, a copy is typically made of the object and both the caller and the callee own a copy.

void value(MyClass obj) {}

We can pass by reference. You recognize a reference by the single ampersand (&). The caller owns the object, but the callee gets access to it.

void reference(MyClass& obj) {}

You can also pass by “rvalue reference”, which you recognize by the double ampersand (&&). In such cases, while the caller initially creates the object, its ownership is passed to the callee. I personally dislike the expression ‘rvalue reference’ and would have preferred something less technical.

void rvalue_reference(MyClass&& obj) {}

However, in some instances, you do not care whether your function gets to own the value, or has merely a reference to it. Writing two functions duplicates the code. Instead, you can then use a forwarding reference:

template <typename T>
void forwarding_reference(T&& obj) {}

It looks like an rvalue reference, but it is not: it can be either a normal reference or an rvalue reference depending on how you call it.

Here is how you might call these functions in practice:

MyClass obj;
value(obj);                           // pass a copy
reference(obj);                       // caller keeps ownership
rvalue_reference(std::move(obj));     // ownership transferred to the callee
forwarding_reference(obj);            // binds as MyClass&
forwarding_reference(std::move(obj)); // binds as MyClass&&

The following table is a summary. A forwarding reference might be either a regular reference or an rvalue reference depending on how it is called.

caller owns? callee owns?
by value yes yes
by reference (&) yes no
by rvalue reference (&&) no yes

Peer review is not the gold standard in science

Peer review as we know it today was introduced very late, over a century after the scientific revolution. It happened after Einstein’s time… arguably the most productive era in science. Current scientists often equate success with publication in a selective peer-reviewed venue. But that was never the scientific paradigm. In fact, it is pre-scientific thinking. Back in Einstein’s time, many scientists believed in the ether. It would have been difficult to dismiss the ether as a concept; the prudent approach would have been to pay lip service to it. Similarly, most scientists believed in eugenics. They believed in forced sterilization for the greater good. Many of the racist laws in the US followed straight from progressive science. It would have been difficult to challenge eugenics openly as a scientist in the context of peer review.

Siler et al. (2014) looked at published manuscripts that were initially rejected. They find:

Of the 808 eventually published articles in our dataset, our three focal journals rejected many highly cited manuscripts, including the 14 most popular; roughly the top 2 percent. Of those 14 articles, 12 were desk-rejected. This finding raises concerns regarding whether peer review is ill-suited to recognize and gestate the most impactful ideas and research.

Recently, people like Matt Ridley challenged the idea that the SARS-CoV-2 virus originated from nature. Back when he published his book on the topic, such a claim would have had difficulty passing peer review.

You may not remember, but early on, it was widely accepted that the lab origin of SARS-CoV-2 was only for far-right conspiracy theorists. The Canadian state broadcaster (CBC) told us, in its ‘science’ section:

One of the most persistent and widespread pieces of disinformation during the COVID-19 pandemic has been the conspiracy theory that the novel coronavirus that causes the disease was created in a lab — and was let loose either by accident or on purpose by some nefarious actor.

In the US, Senator Cotton suggested that the spread of the coronavirus was connected to research at the Wuhan Institute of Virology. In response, the Washington Post wrote:

Sen. Tom Cotton (R-Ark.) keeps repeating a coronavirus conspiracy theory that was already debunked. (…) In response to Cotton’s remarks, as well as in previous interviews with The Washington Post, numerous experts dismissed the possibility the coronavirus may be man-made.

Here is what one of the most reputed medical journals (The Lancet) published:

We stand together to strongly condemn conspiracy theories suggesting that COVID-19 does not have a natural origin.

The article omitted the fact that the authors had glaring, undisclosed conflicts of interest.

Thacker describes some of these events in a piece for the BMJ:

But the effort to brand serious consideration of a lab leak a “conspiracy theory” only ramped up. Filippa Lentzos, codirector of the Centre for Science and Security Studies at King’s College, London, told the Wall Street Journal, “Some of the scientists in this area very quickly closed ranks.” She added, “There were people that did not talk about this, because they feared for their careers. They feared for their grants.”

Daszak had support. After he wrote an essay for the Guardian in June 2020 attacking the former head of MI6 for saying that the pandemic could have “started as an accident,” Jeremy Farrar, director of the Wellcome Trust [a major funder] and co-signer of the Lancet letter, promoted Daszak’s essay on Twitter, saying that Daszak was “always worth reading.”

Daszak’s behind-the-scenes role in orchestrating the statement in the Lancet came to light in November 2020 in emails obtained through freedom of information requests by the watchdog group US Right To Know.

“Please note that this statement will not have EcoHealth Alliance logo on it and will not be identifiable as coming from any one organization or person,” wrote Daszak in a February email, while sending around a draft of the statement for signatories. In another email, Daszak considered removing his name from the statement “so it has some distance from us and therefore doesn’t work in a counterproductive way.”

Several of the 27 scientists who signed the letter Daszak circulated did so using other professional affiliations and omitted reporting their ties to EcoHealth Alliance.

For Richard Ebright, professor of molecular biology at Rutgers University in New Jersey and a biosafety expert, scientific journals were complicit in helping to shout down any mention of a lab leak. “That means Nature, Science, and the Lancet,” he says. In recent months he and dozens of academics have signed several open letters rejecting conspiracy theory accusations and calling for an open investigation of the pandemic’s origins.

Efforts to characterise the lab leak scenario as unworthy of serious consideration were far reaching, sometimes affecting reporting that had first appeared well before the covid-19 pandemic. For example, in March 2020 Nature Medicine added an editor’s note (“Scientists believe that an animal is the most likely source of the coronavirus”) to a 2015 paper on the creation of a hybrid version of a SARS virus, co-written by Shi.

Here are the facts as we knew them back then… as anyone could know…

  • There was an outbreak caused by a bat sarbecovirus, in the one city in the world that had been collecting hundreds of bat sarbecoviruses and experimenting on them.
  • It happened one year after that lab proposed inserting the one feature that distinguishes SARS‑CoV‑2 from all other viruses.
  • The lab in question refuses to this day to release the database of the viruses it had been working on.
  • Virus leaks have been common.

It was always sensible to ask whether SARS-CoV-2 came from the Wuhan lab. Yet this question was openly censored. As is often the case, instead of reflecting on this failure, many people rewrite history. “We never denied it could have come from a lab,” they say. “We never denied that it could have been human-made,” they say. But they very explicitly and strongly did so. They specifically and repeatedly said that this virus could not have been made in a laboratory: yet a funding application to do exactly that had been submitted to the US government a few years before by Daszak, the very man who insisted that the lab origin was a conspiracy theory.

Of course, knowledgeable scientists knew that the lab origin was a possibility. They did not dare to speak up. Would you speak up when it could mean the end of your career?

This was not at all an isolated incident. Dr. Scott Atlas was censored by Stanford for questioning the covid dogma. The Stanford Review writes:

This censure, now a black mark on the University, was unquestionably motivated by political animosity. Atlas, a health policy expert who worked as a professor at Stanford Medical School for fourteen years, chose to serve his country by taking an advisory role in the Trump Administration’s White House Coronavirus Task Force. As an advisor, Atlas suggested reopening schools and businesses and pushed back against draconian lockdown policies.

You might answer… “Attacking people for getting closer to the truth isn’t new.” But science seeks to address this very point. In fact, it is the very essence of the epistemology of science: the recognition that truth is not arrived at by social consensus or by following the powerful. There are many ways to describe science, but to a first approximation… science is the process whereby anyone can post ideas and results for others to replicate, and everyone gets to fail in public and, hopefully, correct themselves. Science is the opposite of a gatekeeping process; it is, by its very nature, a progressive and open process.

It does not mean you should avoid peer-reviewed publication. But you need to recognize that it is not the gold standard of science. Evidence is evidence. Consensus is not evidence.

Remember: ‘The reasonable man adapts himself to the world; the unreasonable man persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.’

How fast can you construct a small list of strings in C for Python?

Python is probably the most popular programming language in the world right now. Python is easy to extend using C code.

You may want to return from Python a small data structure. When crossing from C to Python, there is an overhead. Thus, if performance is a concern, you do not want to return lots of small values such as hundreds of individual small strings.

However, you may want to return a list of strings such as

['zero elephant', 'one elephant is having fun', 'two elephants are having fun',
'three elephants are meeting with the president', 'four elephants',
'five elephants in an hexagon', 'six elephants are playing the saxophone',
'seven elephants are visiting the school', 'eight elephants are at Church',
'nine elephants are having a party']

I am going to assume that these strings are in an array called ‘numbers’ in my C code. Of course, if the strings are known at compile time, I could just precompute the array and return it. However, I want to generate the content dynamically.

A reasonable C function which returns such a list is as follows:

PyObject* all_strings(PyObject* self) {
  size_t len = sizeof(numbers) / sizeof(numbers[0]);
  PyObject* pyList = PyList_New(len);
  for (size_t i = 0; i < len; ++i) {
    PyObject* pyString = PyUnicode_FromString(numbers[i]);
    PyList_SET_ITEM(pyList, i, pyString);
  }
  return pyList;
}

It looks a bit complicated, but it is not: it is just an application of the Python C API. The key element is the PyUnicode_FromString function, which takes a pointer to a UTF-8 string and converts it into a Python string. In our case, all our strings are ASCII, so the function does more work than needed. In this instance, I know ahead of time the size of the list (hence PyList_New(len)), but I could also have constructed an empty list (PyList_New(0)) and appended strings to it as needed.

Can we do better? We might. Python introduced lower-level string construction functions. You start by constructing a string that cannot be resized, and you must tell it whether the string content fits in one byte (ASCII or Latin1), or whether it requires more storage per character (UTF-16 without surrogate pairs) or whether you need the full Unicode range. In my case, it is easy: my strings are pure ASCII. In the general case, you would need to ascertain the string type (a functionality we plan to add to the simdutf library).

The code becomes slightly more complicated, but barely so…

PyObject* all_strings_new(PyObject* self) {
  size_t len = sizeof(numbers) / sizeof(numbers[0]);
  PyObject* pyList = PyList_New(len);
  for (size_t i = 0; i < len; ++i) {
    const char *str = numbers[i];
    size_t str_len = strlen(str);
    PyObject* py_string = PyUnicode_New(str_len, 127);
    Py_UCS1* data = PyUnicode_1BYTE_DATA(py_string);
    memcpy(data, str, str_len);
    PyList_SET_ITEM(pyList, i, py_string);
  }
  return pyList;
}

I wrote a small benchmark. You need Python and a development environment to build it. I ran the benchmark on my Apple M2 laptop, using LLVM 15 and Python 3.12. Your results will vary.

In my case, I get a very significant boost (2x) from the lower-level approach.

basic 270 ns/structure
lower-level 140 ns/structure

My data structure contains 295 bytes of strings, so the lower-level approach generates strings at over 2 GB/s, whereas the conventional approach runs at about half that speed.

Should Node.js be built with ClangCL under Windows?

Under Windows, when using Visual Studio to build C++ code, there are two possible compiler strategies. The Visual Studio compiler (often referred to as MSVC) is the default compiler provided by Microsoft for Windows development. ClangCL is part of the Clang/LLVM project, an open-source compiler toolchain. ClangCL is compatible with the Visual Studio runtime and links with the Microsoft implementation of the Standard Library. It is available as an optional component in Visual Studio 2019 and later versions.

In Debug mode, I find that the regular Visual Studio compiler builds faster. However, in Release mode, I found empirically that the ClangCL approach may produce more performant code. On some micro-benchmarks, the difference can be large (e.g., 40%), although I expect more modest gains on complex systems.

As of Chrome 64, Google Chrome for Windows is compiled with ClangCL. Thus Clang is now used to build Chrome for all platforms it runs on, including macOS, iOS, Linux, Chrome OS, Android, and Windows. Firefox switched to ClangCL in 2018. And at least some game developers have adopted ClangCL.

Node.js is an open-source, cross-platform JavaScript runtime environment. It allows developers to execute JavaScript code outside of a web browser, enabling server-side execution of JavaScript. Node.js is part of popular web development stacks. It relies on the V8 JavaScript engine, the same engine used by Google Chrome.

Node.js is built under Windows using the regular Visual Studio compiler. Thanks in large part to Michaël Zasso, it is possible to build Node.js under Windows with ClangCL. Could it improve the performance?

To start answering this question, I ran the standard V8 benchmarks from Node.js. These benchmarks mostly focus on V8 performance and are not affected by changes in other components. For my tests, I used Visual Studio 2022 on a Microsoft Surface Laptop Studio.

All results point to improvements. That is, on average, the speed is greater with ClangCL than with the standard Visual Studio compiler. However, there is much noise in my numbers. Using the V8 benchmarks, only one test was statistically strong (serialize.js len=256).

function improvement
v8\get-stats getHeapSpaceStatistics 3% +/- 11%
v8\get-stats getHeapStatistics 10% +/- 11%
v8\serialize.js len=256 6% +/- 2%
v8\serialize.js len=16384 2% +/- 2%
v8\serialize.js len=524288 19% +/- 50%

I should stress that compilers have strengths and weaknesses. The regular Visual Studio compiler is perfectly capable, and you should expect it to do better in some instances. And Microsoft has some of the best compiler engineers in the world: each new version might bring improvements.

Furthermore, some tasks and benchmarks are not necessarily affected by the choice of a compiler: e.g., a network access, a disk processing routine, or a memory allocation stress test.

Yet it appears that there might be benefits to a transition to ClangCL for Node.js.

Careful with Pair-of-Registers instructions on Apple Silicon

Egor Bogatov is an engineer working on C# compiler technology at Microsoft. He made an intriguing remark about a performance regression on Apple hardware following what appears to be an optimization. The .NET 9.0 runtime introduced an optimization whereby two loads (ldr) could be combined into a single load pair (ldp). It is a typical peephole optimization. Yet it made things much slower in some cases.

Under ARM, the ldr instruction is used to load a single value from memory into a register. It operates on a single register at a time. Its assembly syntax is straightforward ldr Rd, [Rn, #offset]. The ldp instruction (Load Pair of Registers) loads two consecutive values from memory into two registers simultaneously. Its assembly syntax is similar but there are two destination registers: ldp Rd1, Rd2, [Rn, #offset]. The ldp instruction loads two 32-bit words or two 64-bit words from memory, and writes them to two registers.

Given a choice, it seems that you should prefer the ldp instruction. After all, it is a single instruction. But there is a catch on Apple silicon: if you are loading data from a memory that was just written to, there might be a significant penalty to ldp.

To illustrate, let us consider the case where we write and load two values repeatedly using two loads and two stores:

for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  __asm__ volatile("ldr %w0, [%2]\n"
                   "ldr %w1, [%2, #4]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
                   : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) :);
}

Next, let us consider an optimized approach where we combine the two loads into a single one:

for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  __asm__ volatile("ldp %w0, %w1, [%2]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
                   : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) :);
}

It would be surprising if this new version were slower, but it can be. The code for the benchmark is available. I benchmarked both on AWS using Amazon’s Graviton 3 processors and on an Apple M2. Your results will vary.

function graviton 3 Apple M2
2 loads, 2 stores 2.2 ms/loop 0.68 ms/loop
1 load, 2 stores 1.6 ms/loop 1.6 ms/loop

I have no particular insight as to why this might be, but my guess is that Apple Silicon has a store-to-load forwarding optimization that does not work with Pair-of-Registers loads and stores.

There is an Apple Silicon CPU Optimization Guide which might provide better insight.

Large language models (e.g., ChatGPT) as research assistants

Software can beat human beings at most games… from Chess to Go, and even poker. Large language models like GPT-4 offered through services such as ChatGPT allow us to solve a new breed of problems. GPT-4 can beat 90% of human beings at the bar exam. Artificial intelligence can match math Olympians.

The primary skills of academics are language-related: synthesis, analogy, extrapolation, etc. Academics analyze the literature, identify gaps, and formulate research questions. They review and synthesize existing research. They write research papers, grant proposals, and reports. Being able to produce well-structured and grammatically correct prose is a vital skill for academics.

Unsurprisingly, software and artificial intelligence can help academics, and maybe replace them in some cases. Liang et al. found that an increasing number of research papers are written with tools like GPT-4 (up to 18% in some fields). It is quite certain that in the near future, a majority of all research papers will be written with the help of artificial intelligence. I suspect that they will be reviewed with artificial intelligence as well. We might soon face a closed loop where software writes papers while other software reviews them.

I encourage scholars to apply artificial intelligence immediately for tasks such as…

  1. Querying a document. A tool like BingChat from Microsoft allows you to open a PDF document and query it. You may ask “what are the main findings of this study?” or “are there any practical applications for this work?”.
  2. Improve text. Many academics, like myself, use English as a second language. Of course, large language models can translate, but they can also improve your wording. It is more than a mere grammar checker: it can rewrite part of your text, correcting bad usages as it goes.
  3. Idea generation. I used to spend a lot of time chatting with colleagues about a vague idea I had. “How could we check whether X is true?” A tool like ChatGPT can help you get started. If you ask how to design an experiment to check a given hypothesis, it can often do a surprisingly good job.
  4. Grant applications. You can use tools like ChatGPT to help you with grant applications. Ask it to make up short-term and long-term objectives, sketch a methodology, and discuss the impact of your work… it will come up with something credible right away. It is likely that thousands of grant applications have been written with artificial intelligence.
  5. Writing code. You are not much of a programmer, but you want an R script that will load data from your Excel spreadsheet and do some statistical analysis? ChatGPT will do it for you.
  6. Find reviewers and journals. Sometimes you have done some work and would like help picking the right journal; a tool like ChatGPT can help. If a student of yours finished their thesis, ChatGPT can help you identify prospective referees.

I suspect that much academic work will soon greatly benefit from artificial intelligence to the point where a few academics will be able to do the work that required an entire research institute in the past.

And this new technology should make mediocre academics even less useful, relatively speaking. If artificial intelligence can write credible papers and grant applications, what is the worth of someone who can barely do these things?

You would think that these technological advances should accelerate progress. But, as argued by Patrick Collison and Michael Nielsen, science productivity has been falling despite all our technological progress. Physics is not advancing faster today than it did in the first half of the XXth century. It may even be stagnant in relative terms. I do not think that we should hastily conclude that ChatGPT will somehow accelerate the rate of progress in Physics. As Clusmann et al. point out:  it may simply ease scientific misconduct. We could soon be drowning in a sea of automatically generated documents. Messeri and Crockett put it elegantly:

AI tools in science risks introducing a phase of scientific enquiry in which we produce more but understand less

Yet there are reasons to be optimistic. By allowing a small group of researchers to be highly productive, by freeing them to explore further with less funding, we could be on the verge of entering into a new era of scientific progress. However, it may not be directly measurable using our conventional tools. It may not appear as more highly cited papers or through large grants. A good illustration is Hugging Face, a site where thousands of engineers from all over the world explore new artificial-intelligence models. This type of work is undeniably scientific research: we have metrics, hypotheses, testing, reproducibility, etc. However, it does not look like ‘academic work’.

In any case, conventional academics will be increasingly challenged. Ironically, plumbers and electricians won’t be so easily replaced, a fact sometimes attributed to the Moravec paradox. Steven Pinker wrote in 1994 that cooks and gardeners are secure in their jobs for decades to come, unlike stock market analysts and engineers. But I suspect that the principle even extends within the academy: some work, like conducting actual experiments, is harder to automate than producing and running models. Theoretical work is likely more impacted by artificial intelligence than more applied, concrete work.

Note: This blog post was not written with artificial intelligence. Expect typos and grammatical mistakes.

How do you recognize an expert?

Go back to the roots: experience. An expert is someone who has repeatedly solved the concrete problem you are encountering. If your toilet leaks, an experienced plumber is an expert. An expert has a track record and has had to face the consequences of their work. Failing is part of what makes an expert: any expert should have stories about how things went wrong.

I associate the word expert with ‘the problem’ because we know that expertise does not transfer well: a plumber does not necessarily make a good electrician. And within plumbing, there are problems that only some plumbers should solve. Furthermore, you cannot abstract a problem: you can study fluid mechanics all you want, but it won’t turn you into an expert plumber.

That’s one reason why employers ask for relevant experience: they seek expertise they can rely on. It is sometimes difficult to acquire expertise in an academic or bureaucratic setting because the problems are distant or abstract. Your experience may not translate well into practice. Sadly we live in a society where we often lose track of and undervalue genuine expertise… thus you may take software programming classes from people who never built software or civil engineering classes from people who never worked on infrastructure projects.

So… how do you become an expert? Work on real problems. Do not fall for reverse causation: if all experts dress in white, dressing in white won’t turn you into an expert. Listening to the expert is not going to turn you into an expert. Lectures and videos can be inspiring but they don’t build your expertise. Getting a job with a company that has real problems, or running your own business… that’s how you acquire experience and expertise.

Why would you want to become an expert when you can make a good living otherwise, without the hard work of solving real problems? Actual expertise is capital that can survive a market crash or a political crisis. After Germany’s defeat in 1945, many of its aerospace experts went to work for the American government. Relevant expertise is robust capital.

Why won’t everyone seek genuine expertise? Because there is a strong countervailing force: showing a total lack of practical skill is a status signal. Wearing a tie shows that you don’t need to work with your hands.

But again: don’t fall for reverse causality… broadcasting that you don’t have useful skills might be fun if you are already of high status… but if not, it may not grant you a higher status.

And status games without a solid foundation might lead to anxiety. If you can get stuff done, if you can fix problems, you don’t need to worry so much about what people say about you. You may not like the color of your plumber’s shoes, but you won’t snub him over it.

So get expertise and maintain it. You are likely to become more confident and happier.

How quickly can you break a long string into lines?

Suppose that you receive a long string and you need to break it down into lines. Consider the simplified problem where you need to break the string into segments of (say) 72 characters. It is a relevant problem if your string is a base64 string or a Fortran formatted statement.

The real problem could be a bit more complicated because you might need to consider the syntax. Thus the speed of inserting a new line every 72 characters, irrespective of the content, provides an upper bound on the performance of breaking content into lines.

The most obvious algorithm could be to copy the content, line by line:

void break_lines(char *out, const char *in, size_t length,
                 size_t line_length) {
  size_t j = 0;
  size_t i = 0;
  for (; i + line_length <= length; i += line_length) {
    memcpy(out + j, in + i, line_length);
    out[j + line_length] = '\n';
    j += line_length + 1;
  }
  if (i < length) {
    memcpy(out + j, in + i, length - i);
  }
}

Copying data in blocks is usually quite fast unless you are unlucky and you trigger aliasing. However, allocating a whole new buffer could be wasteful, especially if you only need to extend the current buffer by a few bytes.

A better option could thus be to do the work in-place. The difficulty is that if you load the data from the current array, and then write it a bit further away, you might be overwriting the data you need to load next. A solution is to proceed in reverse: start from the end… move what would be the last line off by a few bytes, then move the second last line and so forth. Your code might look like the following C function:

void break_lines_inplace(char *in, size_t length, size_t line_length) {
  size_t left = length % line_length;
  size_t i = length - left;
  size_t j = length + length / line_length - left;
  memmove(in + j, in + i, left);
  while (i >= line_length) {
    i -= line_length;
    j -= line_length + 1;
    memmove(in + j, in + i, line_length);
    in[j + line_length] = '\n';
  }
}

I wrote a benchmark. I report the results only for a 64KB input. Importantly, my numbers do not include memory allocation which is separate.

A potentially important factor is whether we allow function inlining: without inlining, the compiler does not know the line length at compile-time and cannot optimize accordingly.

Your results will vary, but here are my own results:

method Intel Ice Lake, GCC 12 Apple M2, LLVM 14
memcpy 43 GB/s 70 GB/s
copy 25 GB/s 40 GB/s
copy (no inline) 25 GB/s 37 GB/s
in-place 25 GB/s 38 GB/s
in-place (no inline) 25 GB/s 38 GB/s

In my case, it does not matter whether we do the computation in-place or not. The in-place approach generates more instructions but we are not limited by the number of instructions.

At least in my results, I do not see a large effect from inlining. In fact, for the in-place routine, there appears to be no effect whatsoever.

Roughly speaking, I achieve a bit more than half the speed of a memory copy. We might be limited by the number of loads and stores. There might be a clever way to close the gap.