Expected performance of a Bloom filter

A hash function is a function that maps a value (such as a string) to an integer value. Typically, we want random-looking values.

A Bloom filter is a standard data structure in computer science to approximate a set. Basically, you start with a large array of bits, all initialized at zero. Each time you want to add an element to the set, you compute k different hash values and you set the bits at the k corresponding locations to one.
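
To make the insertion step concrete, here is a minimal C++ sketch (an illustration, not code from any particular library). It uses double hashing (h1 + i*h2), a common way to derive k probe positions from just two hash values of the element:

#include <cstdint>
#include <vector>

// Minimal Bloom filter insertion sketch: derive k bit positions from two hash
// values of the element and set the corresponding bits (assumes a non-empty bit array).
void bloom_add(std::vector<uint64_t> &bits, size_t k, uint64_t h1, uint64_t h2) {
  for (size_t i = 0; i < k; i++) {
    uint64_t pos = (h1 + i * h2) % (bits.size() * 64);
    bits[pos / 64] |= uint64_t(1) << (pos % 64);
  }
}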

A fun application of a Bloom filter is a compromised password application: given a large set of passwords, you want a concise data structure that indicates whether a password belongs to the set.

To check whether an element is in the set, you compute, again, k different hash values and you check the bits at the corresponding k locations. If the value is in the set, then all k bits must have value 1. If any of the k bits is 0, then you know for sure that the value is not in the set.

The way we naturally implement a Bloom filter lookup is through a loop, where we check each of the k bits in sequence, stopping as soon as a 0 bit is found.
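
The corresponding membership test, again only a sketch using the same double-hashing scheme, makes the early exit explicit:

#include <cstdint>
#include <vector>

// Minimal Bloom filter membership sketch: probe the k bit positions and stop
// at the first 0 bit (assumes a non-empty bit array).
bool bloom_contains(const std::vector<uint64_t> &bits, size_t k,
                    uint64_t h1, uint64_t h2) {
  for (size_t i = 0; i < k; i++) {
    uint64_t pos = (h1 + i * h2) % (bits.size() * 64);
    if ((bits[pos / 64] & (uint64_t(1) << (pos % 64))) == 0) {
      return false; // a 0 bit proves the value was never added
    }
  }
  return true; // all k bits set: probably in the set (false positives are possible)
}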

Unfortunately, it is possible for the Bloom filter to be wrong: all k bits could be 1s purely by chance. If that happens, you have a false positive. False positives are often not a problem in practice because you can prune them out with a secondary check. For example, if you want to make sure that the password has been compromised, you may simply look it up in a conventional database. The Bloom filter will still handle most cases.

By spending about 12 bits per value (12 bits times the size of the set) and using 8 hash functions, the probability of a false positive is about 1/256 (or 0.3%), which is reasonable.
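
For reference, the textbook approximation for the false-positive rate of a Bloom filter with m bits, n values and k hash functions is

\[ \left(1 - e^{-kn/m}\right)^k . \]

With m/n = 12 and k = 8, this gives roughly 0.3%; at the optimal k, about half of the bits end up set and the rate is close to \(2^{-k}\), hence the 1/256 figure.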

If you know a bit about software performance, the 8 hash functions could be concerning. Looking up 8 values at random locations in memory is not cheap. And, indeed, when the element is in the set, you must check all 8 locations. It is expensive.

What if the value is not in the set, however? If your Bloom filter is configured optimally, for the lowest false-positive rate given a storage budget per element, about 50% of all bits in the array of bits are 1s.

How many random bits must we access before we find a 0 in such a case? We stop after one bit half the time, after two bits a quarter of the time, and so forth. So the expected value is 1/2 + 1/4*2 + 1/8*3 + 1/16*4 + …, which tends towards 2. The answer does not depend much on the number of hash functions k.
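
In symbols, ignoring the cap at k probes (which matters little once k is 8 or more), the expected number of probed bits is

\[ \sum_{i \ge 1} \frac{i}{2^i} = 2 . \]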

This means that if you are using a Bloom filter in a scenario where the values are likely not in the set, you can get away with very few memory accesses.

Of course, the processor is not going to naively load each bit one by one. It will do speculative loads. Thankfully, we can use performance counters to measure the actual number of loads.

Another concern for programmers who are interested in performance is the number of mispredicted branches. Interestingly, the relationship is reversed: if your elements are not in the set, then you face unpredictable branches (and thus poor performance), but if they are in the set, then the branches are perfectly predictable.

Let us consider the performance metrics of the Bloom filter when we have a miss (the value is not in the set) and a hit (the value is in the set). I use an Intel Ice Lake server with a filter that exceeds the CPU cache, and I record CPU performance counters.

hash functions | bits per value | cache misses per miss | cache misses per hit | mispredicts per miss | mispredicts per hit
8              | 12             | 3.5                   | 7.5                  | 0.95                 | 0.0
11             | 16             | 3.8                   | 10.5                 | 0.95                 | 0.0

For the misprediction tests, I use either the regime where all values are in the set, or the regime where no value is in the set.

What is the performance of hits and misses? We find that the cost of a miss does not depend much on the number of hash functions, as you would expect, while the cost of a hit grows roughly in proportion to the number of hash functions.

hash functions | bits per value | cycles per miss | cycles per hit
8              | 12             | 135             | 170
11             | 16             | 140             | 230

You can reproduce my results with the fastfilter_cpp library.

Let us now consider, for the same problem size, binary fuse filters (Go implementation, C single-header implementation, Rust implementation, Zig implementation, Julia implementation). They use a slightly different approach (see paper): we can use either a flat 3 accesses per value or a flat 4 accesses per value.

filter             | cache misses per value | mispredicts per value
3-wise binary fuse | 2.8                    | 0.0
4-wise binary fuse | 3.7                    | 0.0

So the 4-wise filter has about as many cache misses as the Bloom filter. Yet binary fuse filters are significantly faster than Bloom filters.

filter             | cycles per access
3-wise binary fuse | 85
4-wise binary fuse | 100
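
To see where the flat, branch-free access pattern comes from, here is a minimal sketch of a 3-wise xor/binary-fuse style membership test. This is an illustration under stated assumptions, not the fastfilter_cpp code: the fingerprint array is assumed to have been built already (construction is the hard part and is not shown), and the hash mixer below is a simple placeholder.

#include <cstdint>
#include <vector>

// Sketch of a 3-wise xor/binary-fuse style query: the key's fingerprint must
// equal the XOR of three table entries. Always exactly three array accesses,
// no early exit, no data-dependent branch (assumes a non-empty table).
struct fuse3_sketch {
  std::vector<uint8_t> fingerprints; // filled during construction (not shown)

  static uint64_t mix(uint64_t key, uint64_t seed) {
    key += seed;
    key *= 0x9E3779B97F4A7C15ULL; // simple multiplicative mixer (placeholder)
    return key ^ (key >> 32);
  }
  size_t index(uint64_t key, uint64_t seed) const {
    return size_t(mix(key, seed) % fingerprints.size());
  }
  bool contains(uint64_t key) const {
    uint8_t f = fingerprints[index(key, 1)] ^ fingerprints[index(key, 2)] ^
                fingerprints[index(key, 3)];
    return f == uint8_t(mix(key, 0));
  }
};

Real binary fuse filters constrain how the three locations are chosen (see the paper); the sketch above only illustrates the query structure.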

The absurd cost of finalizers in Go

The Go programming language makes it easy to call C code. Suppose you have the following C functions:

#include <stdlib.h>

char* allocate() {
  return (char*)malloc(100);
}
void free_allocated(char *c) {
  free(c);
}

Then you can call them from Go as follows:

c := C.allocate()
C.free_allocated(c)

It works well.

You might argue that my functions are useless, but I designed them to be trivial on purpose. In practice, you will call C code to do something actually useful. Importantly, there is an allocation followed by a necessary deallocation: this is typical in realistic code.

Reasonably, Go programmers are likely to insist that the memory allocated from Go be released automatically. Thankfully, Go has a mechanism for this purpose. You put the C data in a Go structure which is subject to automatic garbage collection. Then you tie the instances of this structure to a “finalizer” function which will be called before the Go structure is garbage collected. The code might look as follows:

type Cstr struct {
  cpointer *C.char
}

func AllocateAuto() *Cstr {
  answer := &Cstr{C.allocate()}
  runtime.SetFinalizer(answer, func(c *Cstr) {  C.free_allocated(c.cpointer); runtime.KeepAlive(c) })
  return answer
}

 

So far so good. Go is doing very well up until now.

But what is the performance impact? We are comparing these two routines. First, the inconvenient version where you manually have to free the allocated memory…

p := Allocate()
Free(p)

and then the version which relies on Go’s memory management…

AllocateAuto()

Let us benchmark it. My benchmarking code is available; your results will differ from mine, but I care only about the big picture.

In my case, the automated version is nearly ten times slower.

AllocateAuto 650 ns
Allocate-Free 75 ns

The 650 ns result is silly: it is thousands of CPU cycles.

Maybe it is the overhead due to garbage collection? Thankfully, Go allows us to disable garbage collection with GOGC=off:

AllocateAuto (no GC) 580 ns
Allocate-Free (no GC) 75 ns

So the numbers are slightly better, but barely so.

We can try to see where the problem lies with profiling:

go test -cpuprofile gah -bench BenchmarkAllocateAuto -run -
go tool pprof gah
> top

We get that most of the processing time is spent in runtime.cgocall:

     1.67s 70.17% 70.17%      1.67s 70.17%  runtime.cgocall
     0.23s  9.66% 79.83%      0.23s  9.66%  runtime.usleep
     0.12s  5.04% 84.87%      0.12s  5.04%  runtime.pthread_cond_signal

What if I try a dummy finalizer?

func AllocateDummy() *Cstr {
 answer := &Cstr{C.allocate()}
 runtime.SetFinalizer(answer, func(c *Cstr) {})
 return answer
}

I get the same poor performance, suggesting that it is really the finalizer that is expensive.

This is seemingly consistent with Java, which also has finalizers:

Oh, and there’s one more thing: there is a severe performance penalty for using finalizers. On my machine, the time to create and destroy a simple object is about 5.6ns. Adding a finalizer increases the time to 2,400ns. In other words, it is about 430 times slower to create and destroy objects with finalizers. (Effective Java 2nd Edition: Item 7: Avoid finalizers)

Maybe there is a way to do better, I hope there is, but I suspect not.

Further reading. Some notes on the cost of Go finalizers (in Go 1.20) by Chris Siebenmann.

Computing the UTF-8 size of a Latin 1 string quickly (ARM NEON edition)

While most of our software relies on Unicode strings, we often still encounter legacy encodings such as Latin 1. Before we convert Latin 1 strings to Unicode (e.g., UTF-8), we must compute the size of the UTF-8 string. It is fairly easy: all ASCII characters map 1 byte to 1 byte, while other characters (with code point values from 128 to 255) map 1 Latin 1 byte to 2 UTF-8 bytes.

Computers represent strings using bytes. Most often, we use the Unicode standard to represent characters in bytes. The universal format to exchange strings online is called UTF-8. It can represent over a million characters while retaining compatibility with the ancient ASCII format.

Though most of our software stack has moved to Unicode strings, there are still older standards like Latin 1 used for European strings. Thus we often need to convert Latin 1 strings to UTF-8. It is useful to first compute the size (in bytes) of the eventual UTF-8 string. You can code a simple C function to compute the UTF-8 size from the Latin 1 input as follows:

#include <stddef.h>

// Count the bytes with the most significant bit set: each of them requires
// two bytes in UTF-8 while the other (ASCII) bytes require just one.
size_t scalar_utf8_length(const char * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i < len; i++) {
    if((c[i]>>7)) { answer++; }
  }
  return answer + len;
}

In Computing the UTF-8 size of a Latin 1 string quickly (AVX edition), I reviewed faster techniques to solve this problem on x64 processors.

What about ARM processors (as in your recent MacBook)?

Keyhan Vakil came up with a nice solution which relies on the availability of “narrowing instructions” on ARM processors. Basically, you can take a 16-byte vector register and create an 8-byte register (virtually) by truncating or rounding the results. Conveniently, you can also combine bit shifting with narrowing.

Consider pairs of successive 8-bit words as a 16-bit word. E.g., if the 16 bits start out as aaaaaaaabbbbbbbb then a shift-right-by-four-and-narrow creates the byte value aaaabbbb. Indeed, if you shift a 16-bit word right by 4 bits and keep only the least significant 8 bits of the result, then

  1. the most significant 4 bits from the second 8-bit word become the least significant 4 bits in the result
  2. and the least significant 4 bits from the first 8-bit word become the most significant 4 bits.

This is convenient because vectorized comparison functions often generate filled bytes when the comparison is true (all 1s). The final algorithm in C looks as follows:

uint64_t utf8_length_kvakil(const uint8_t *data, uint32_t length) {
  uint64_t result = 0;
  const int lanes = sizeof(uint8x16_t);
  uint8_t rem = length % lanes;
  const uint8_t *simd_end = data + (length / lanes) * lanes;
  const uint8x16_t threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80)
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // shift and narrow
    uint8x8_t highbits = vshrn_n_u16(vreinterpretq_u16_u8(withhighbit), 4);
    // we have 0, 4 or 8 bits per byte
    uint8x8_t bitsperbyte = vcnt_u8(highbits);
    // sum the narrowed bytes across the vector into a 16-bit total
    result += vaddlv_u8(bitsperbyte);
  }
  result /= 4; // we overcounted by a factor of 4
  // scalar tail
  for (uint8_t j = 0; j < rem; j++) {
    result += (simd_end[j] >> 7);
  }
  return result + length;
}

Can you beat Vakil? You can surely reduce the instruction count but once you reach speeds like 20 GB/s, it becomes difficult to go much faster without hitting memory and cache speed limits.

Pete Cawley proposed a simpler algorithm which avoids the narrowing shifts and uses an across-vector addition instead:

uint64_t utf8_length_simpler(const uint8_t *data, uint32_t length) {
  uint64_t result = 0;
  const int lanes = sizeof(uint8x16_t);
  uint8_t rem = length % lanes;
  const uint8_t *simd_end = data + (length / lanes) * lanes;
  const uint8x16_t threshold = vdupq_n_u8(0x80);
  for (; data < simd_end; data += lanes) {
    // load 16 bytes
    uint8x16_t input = vld1q_u8(data);
    // compare to threshold (0x80)
    uint8x16_t withhighbit = vcgeq_u8(input, threshold);
    // add across the vector: each matching byte is 0xFF, i.e., -1 as a signed byte
    result -= vaddvq_s8(vreinterpretq_s8_u8(withhighbit));
  }
  // scalar tail
  for (uint8_t j = 0; j < rem; j++) {
    result += (simd_end[j] >> 7);
  }
  return result + length;
}

Are the hand-tuned NEON functions fast?

On my Apple M2, they are three times as fast as what the compiler produces from the scalar code on large enough inputs. Observe that the compiler already relies on vector instructions even when compiling scalar code.

scalar code ~6 GB/s
NEON code (both functions) ~20 GB/s

On my Apple laptop, both NEON functions are equally fast. Using Graviton 3 processors on AWS (with GCC 11), I can tell them apart:

scalar code ~7 GB/s
NEON code (Vakil) ~27 GB/s
NEON code (Cawley) ~30 GB/s

The Cawley function is slightly better. My source code is available. Your results will vary.

Update: if you check out my source code, you will find two new versions that are quite fast. It seems that there is a lot of room for optimization on this problem.

ARM instructions do “less work”?

Modern processors can execute several instructions per cycle. Because processors cannot easily run faster (in terms of clock speed), vendors try to get their processors to do more work per cycle.

Apple processors are wide in the sense that they can retire many more instructions per cycle than comparable Intel or AMD processors. However, some people argue that the comparison is unfair: ARM instructions are less powerful and do less work than x64 (Intel/AMD) instructions, so that the two designs end up closer to performance parity than the raw counts suggest.

Let us verify.

I have a number parsing benchmark that records the number of cycles, instructions and nanoseconds spent parsing numbers on average. The number of instructions and cycles is measured using performance counters, as reported by Linux. I parse a standard dataset of numbers (canada.txt) and I report the results for the fast_float library (ASCII mode).

system                 | instructions per float | cycles per float | instructions per cycle
Intel Ice Lake, GCC 11 | 302                    | 64               | 4.7
Apple M1, LLVM 14      | 299                    | 45               | 6.6

Of course, that’s a single task, but number parsing is fairly generic as a computing task.

Looking at the assembly output often does not reveal a massive benefit for x64. Consider the following simple routine:

// Parses an integer of length 'l' starting from an initial value 'x'.
int parse_digits(const char *c, int l, int x) {
  for (int i = 0; i < l; i++) {
    x = 10 * x + (c[i] - '0');
  }
  return x;
}

LLVM 16 compiles this to the following optimized ARM assembly:

start:
 ldrb w10, [x1], #1
 subs x8, x8, #1
 madd w10, w0, w9, w10
 sub w0, w10, #48
 b.ne start

Or the following x64 assembly…

start:
  lea eax, [rax + 4*rax]
  movsx edi, byte ptr [rsi + rdx]
  lea eax, [rdi + 2*rax]
  add eax, -48
  inc rdx
  cmp rcx, rdx
  jne start

Though your mileage will vary, I find that for the tasks that I benchmark, I often see about as many ARM instructions being retired as x64 instructions. There are differences, but they are small.

For example, in a URL parsing benchmark, I find that ARM requires 2444 instructions to parse a URL on average, against 2162 instructions for x64: a 13% benefit for x64. That’s not zero but it is not a massive benefit that overrides other concerns.

However, Apple processors definitely retire more instructions per cycle than Intel processors.

Science and Technology links (May 6 2023)

  1. Artificial intelligence (ChatGPT) can provide better answers to patients than physicians.
  2. Eating chocolate might affect your brain and cognitive functions. It may not make you smarter, but it might brighten your mood.
  3. Obesity may shorten your disease-free lifespan.
  4. A supplement called Netrin-1 appears to rejuvenate the bone marrow.
  5. Prior research indicated that high status individuals tended to be unethical. It appears that these results are not robust: they fail to replicate.
  6. Young Americans are increasingly depressed.
  7. Garlic consumption might protect against cancer by affecting the bacteria in your body.

Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL parsing benchmark

Whenever you enter a URL into a system, it must be parsed and validated. It is a surprisingly challenging task: it may require hundreds of nanoseconds and possibly over a thousand cycles to parse a typical URL.

We can use URL parsing as a reasonable benchmark of a system's performance. Of course, no single measure is sufficient… but URL parsing is interesting because it is a fairly generic task involving strings, substrings, character searches and so forth.

I am going to compare the following ARM-based systems:

  • c7g.large: Amazon Graviton 3 running Ubuntu 22.04 (GCC 11)
  • MacBook Air 2022: Apple M2 (LLVM 14)
  • Windows Dev Kit 2023: Qualcomm 8cx 3rd gen running Ubuntu 22.04 (GCC 11) inside WSL (Windows 11)

The Windows Dev Kit is a little plastic box designed to allow Windows developers to get their applications ready for Windows on 64-bit ARM. It is a tiny low-power device that I leave on my desk. The Graviton 3 nodes are Amazon's best ARM-based servers. The MacBook Air contains one of the best laptop processors on the market.

The benchmark we run loads 100,000 URLs found on the top 100 most visited web sites. It is single-threaded and requires no disk or network access: it is a pure CPU test.

I run the following routine:

  • git clone https://github.com/ada-url/ada
  • cd ada
  • cmake -B build -D ADA_BENCHMARKS=ON
  • cmake --build build --target benchdata
  • ./build/benchmarks/benchdata --benchmark_filter=BasicBench_AdaURL_aggregator_href

Graviton 3           | 285 ns/url
Apple M2             | 190 ns/url
Qualcomm 8cx 3rd gen | 245 ns/url

We can also plot these average timings.

On this particular benchmark, the Qualcomm processor is 30% slower than the Apple M2 processor. That is to be expected: Apple Silicon is generally superior.

However, in this particular test, the Qualcomm system beats the Graviton 3 node from Amazon. On a related benchmark, I showed that the Graviton 3 had competitive performance and could beat state-of-the-art Intel Ice Lake nodes. Amazon themselves claim that Graviton 3 instances might be superior for machine learning tasks.

We can try to correct for frequency differences. The Graviton runs at 2.6 GHz, the Apple M2 runs at 3.5 GHz and the Qualcomm processor at 3.0 GHz. Let us correct the numbers:

Graviton 3 (model)   | 245 ns/url (corrected for 3 GHz)
Apple M2 (model)     | 220 ns/url (corrected for 3 GHz)
Qualcomm 8cx 3rd gen | 245 ns/url

Note that you cannot blindly correct for frequency in this manner because it is not physically possible to just change the frequency as I did: it is a model to help us think.
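
For transparency, the corrected numbers come from assuming that the cycle count stays constant while the clock changes:

\[ 285\,\mathrm{ns} \times \tfrac{2.6}{3.0} \approx 247\,\mathrm{ns}, \qquad 190\,\mathrm{ns} \times \tfrac{3.5}{3.0} \approx 222\,\mathrm{ns}, \]

which I round to the 245 ns/url and 220 ns/url figures above.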

Overall, these numbers suggest that the Qualcomm processor is competitive. It is not likely to establish speed records, but I would not shy away from a Qualcomm-based system if it is meant for low power usage.

How likely is it that my results are misleading? They seem to roughly match the results that Alex Ellis got running a more complete benchmark.

So I believe that my result is roughly correct: Qualcomm is inferior to Apple Silicon, but not by a very wide margin.

A separate issue is the Windows performance itself. Much of Windows is still x86 specific and though Windows can run x86 applications in emulation under 64-bit ARM, there is a penalty which could be substantial. Nevertheless, my own experience has been quite good. Of course, I do not play games on these machines nor do I do video editing. Your mileage will vary.

Further reading: Linux on Microsoft Dev Kit 2023

Under Linux, libSegFault and addr2line are underrated

Many modern programming languages like Java produce useful error messages when they fail. Some indicate where in the program (e.g., down to the line of source code) the failure occurred. Slowly, the C++ language is moving in this direction.

A particularly nasty problem in C and C++ is the segmentation fault: it occurs when you try to access a memory region that is not allocated. For example, this will occur if you dereference a null pointer. You can mostly avoid segmentation faults in C++ by avoiding pointers, but there are instances when pointers are useful or necessary. And even if you are not causing them yourself, the software that you are using might be triggering them.

Thankfully, you can often get a useful message when you have a segmentation fault by linking with the libSegFault library under Linux.

Let me illustrate. The following program has a segmentation fault on line six:

#include <stdio.h>
#include <stdlib.h>

int go() {
  int x = 0;
  x += *((int *)NULL);
  return x;
}
int main() { return go(); }

I am calling this program hello. Let us compile the program with the line

cc -g -o hello hello.c -lSegFault

I am linking against libSegFault, but not otherwise modifying my code. Importantly, I do not need to modify anything else. I do not need to run my program within the scope of another program (e.g., within a debugger like gdb or lldb). As far as I know, libSegFault has been available with Linux pretty much forever; you do not need to install anything. In the future, it will be removed from the standard C library but, at least under Ubuntu, it is being moved to the glibc-tools package. You can also build glibc-tools from its source code.

I run the program and I get the following…

Backtrace:
./hello[0x401116]
./hello[0x40112c]
/lib64/libc.so.6(+0x3feb0)[0x7faa8cc6eeb0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7faa8cc6ef60]
./hello[0x401045]

It does not tell me where exactly my program failed, instead I get a mysterious address (0x401116). There is a handy program under Linux called addr2line which will tell me where, in the source code, this address is. The error starts at address 0x401116, so let me type:

$ addr2line -e hello -a 0x401116
hello.c:6

And voilà! I get the exact location of the error. Again, this is lightweight and non-invasive, I only need my original program and the address.

The addr2line utility requires you to compile your code with debug symbols. If you try turning on optimization too much, you may not be able to get back to the line of code. However, you may use some optimization (-O1). The following works in this test:

cc -g -O1 -o hello hello.c -lSegFault

In fact, I recommend you compile your code with the -O1 flag generally when debugging.

My source code is available.

Appendix. Some of you might be wondering: why not wait for the application to dump core and then load the core into a debugger like gdb or lldb? That is not always possible. Some systems do not dump cores, or the cores might require privileged access. Furthermore, it may take more steps and more time if you have to redo the steps often.

Science and Technology links (April 29 2023)

  1. Ovaries age quickly in women. By the age of 40, most ovaries are poorly functional. However, there is an ongoing clinical trial to check whether the drug rapamycin might slow down this aging. It is one of several initiatives to postpone and maybe reverse procreative aging in women.
  2. Glycine, a cheap supplement, might prolong life by activating autophagy (the process by which your body eats up old cells) and by mimicking methionine restriction (effectively mimicking caloric restriction).
  3. Pregnant women who take antidepressants might be more at risk of having a child with autism. About one in ten women takes antidepressants in the USA.
  4. Scenarios set out under the UN Climate Panel (IPCC) show human welfare will likely increase to 450% of today's welfare over the 21st century. Climate damages would reduce this to an increase of merely 434%.
  5. Overall environmental knowledge and climate-specific knowledge are negatively related to climate change anxiety. In other words, people who suffer from climate-change anxiety are likely poorly informed.
  6. A lithium-air battery could have an energy density comparable to that of gasoline, and thus allow an electric car to go a long distance without recharging. There are promising laboratory results.
  7. Autonomous robots are burning unwanted weeds which might lead to higher farming productivity and lower usage of pesticides. These results are not new and the feasibility has been established. A few years ago, one of my graduate students worked on detecting weeds in a field of blueberries using deep learning.

Hotspot performance engineering fails

Developers often believe that software performance follows a Pareto distribution: 80% of the running time is spent in 20% of the code. Using this model, you can write most of your code without any care for performance and focus on the narrow pieces of code that are performance sensitive. Engineers like Casey Muratori have rightly criticized this model. You can read Muratori's excellent piece on his blog.

It is definitely true that not all of your code requires attention. For example, it is possible that 99% of the time, your code processes correct data. The code that handles errors could be quite slow and it would not impact most of your users.

But the hotspot model predicts something more precise: you should be able to keep all of your code, identify the specific bottlenecks, optimize these bottlenecks, and then get great performance all around. Muratori relies on empirical evidence to falsify the model: many companies embarked on large rewrites of their codebases to optimize them. Why bother with such expenses if they could simply identify the bottlenecks and fix those?

We have tools today called profilers that can tell you roughly where your software spends its time. And if you apply such a tool to your software, you may indeed find massive bottlenecks. It may sometimes work wonders. For example, there was a video game (GTA Online) that took a very long time to load a JSON file. Simply optimizing this one bottleneck solved a massive performance issue. It did not make the game more performant, but it made it start much faster. So bottlenecks do exist. We should hunt them down and optimize them. But that is not what the hotspot model predicts: it predicts that this is all you need to do to get performance. Hit a few bottlenecks and you are done.

Sadly, it does not work.

Let us run through a few reasons:

  • Overall architecture trumps everything. If you first build a bus, you cannot then turn it into a sports car with a few changes. A few years ago, a company came to me and offered me a ton of money if I could make their database engine faster. They had money, but their software was much too slow. At first I was excited by the project, but I started reading their code and doing benchmarks, and then I realized that the entire architecture was wrong. They insisted that they knew where the hotspots were and that they just needed the expertise to optimize these few components. They told me that their code was spending 80% of its time in maybe 100 lines. And that is what profilers said. It is true, formally speaking, that making these 100 lines of code much faster would have made the whole program much faster… but these lines of code were pulling data from memory, and software cannot beat physics. There are elementary operations that are not time compressible: you cannot move data faster than what the hardware allows. The key point is that if your software does not have a good overall architecture, if it is not well organized for performance, you may have no choice but to rewrite it from the ground up, to re-architect it.
  • As you optimize, the hotspots multiply. Going back to the example of GTA Online, it is easy to find that the program spends 10 seconds loading a 10 MB JSON file. However, the next steps are going to be more difficult. You will find that finding the bottlenecks becomes difficult: we are subject to a kind of Heisenberg principle, where measuring big effects is easy but measuring small ones becomes impossible because the act of measuring interferes with the software's execution. And even if you can find the bottlenecks, they become more numerous with each iteration. Eventually, much of your code needs to be considered.

The effort needed to optimize code grows exponentially. In other words, to multiply the performance by N, you need something like 2^N optimizations. The more you optimize, the more code you must consider. It is relatively easy to double the performance of an unoptimized piece of code, but much harder to multiply it by 10. You quickly hit walls that can be insurmountable: the effort needed to double the performance again would just be too much. In effect, we have a long tail effect: though there are clearly easy wins, much of the work lies in the ‘tail’.

And that explains why companies do full rewrites of their code for performance: the effort needed to squeeze more performance out of the existing code becomes too great, and a complete rewrite is cheaper.

It also means that you should be acutely concerned about performance when you design your software if you want to avoid a rewrite. Knuth told us that premature optimization is the root of all evil, but he meant that before you worry about replacing your switch/case routine with gotos, you should work on the algorithmic design and code architecture. In effect, Knuth was preemptively rejecting hotspot performance engineering, and you should too.

Further content: Federico Lois — Patterns for high-performance C#.

Vectorized trimming of line comments

A French graduate student reached out by email yesterday with the following problem. Consider a format such as TOML which has line comments: when a ‘#’ character is encountered, the rest of the line is omitted. Let us look at an example:

[build-system] # use this to indicate the name of the project
  requires = ["setuptools>=61.0"]
  build-backend = "setuptools.build_meta"
[tool.setuptools]
  package-dir = {"" = "pylib"}
  packages = ["gyp", "gyp.generator"] # looks good
# end of the file

The lines with ‘#’ characters have comments.

We want to process such a scenario in a purely vectorized manner, as we do within simdjson, the fast JSON parser. That is, we want to avoid branching as much as possible.

We can construct bitmaps which indicate where the ends of lines are (checking for the character ‘\n’) and where the ‘#’ characters are. Effectively, we turn each block of, say, 64 characters into two 64-bit words. Each bit in the 64-bit words corresponds to one character. The bit is set to 1 in the first 64-bit word if and only if the matching character is ‘\n’, and similarly for the second 64-bit word. These pairs of words are ‘bitsets’. They can be computed efficiently using SIMD instructions or other means.

From these bitsets, we can then compute a mask of bitsets (a stream of 64-bit words, each corresponding to a block of 64 characters) where the corresponding bits are set to 0 if and only if the character is commented out.

Effectively, considering both streams of bitsets (for end-of-lines and for hashes) as big integers, we just need to subtract the end-of-line “big integer” from the hash “big integer”. And then we need to make sure to remove all hashes, and put back the line endings. There is no clean way to subtract the big integers, but compilers like GCC and LLVM/clang have intrinsics for this purpose (__builtin_usubll_overflow). It is quite ugly and requires an overflow variable, but it compiles to highly efficient code. You can achieve much the same effect within Visual Studio with Microsoft intrinsics (left as an exercise for the reader). In Rust, you might call u64::overflowing_sub.

bool overflow = 1;
for (size_t i = 0; i < b.line_end.size(); i++) {
  // We subtract b.line_end from b.hash, with overflow (borrow) handling.
  overflow = __builtin_usubll_overflow(b.hash[i],
                                       b.line_end[i] + overflow,
                                       &b.comment[i]);
  b.comment[i] &= ~b.hash[i];    // when there is more than one '#', we want to remove it
  b.comment[i] |= b.line_end[i]; // we want to keep the line-ending bits
}

From there, I was able to write a function that can prune (remove) the comments from the input file, using vectorized (branchless) routines. My C++ code currently has a fast path only for ARM NEON (otherwise, there is a fallback), but it works. I assume that you have Linux or macOS.

Check out the full code if you would like. It is only a prototype; it is not going to be especially fast, even on ARM NEON. To make it run fast, I should avoid the unnecessary materialization of the whole bitset indexes: there is no need to map out the whole content. We can process the input block by block and avoid memory allocations.

Furthermore, I did not handle ‘#’ characters found inside strings. I suppose that they should be omitted. I have also simplified the code above slightly; in practice you need to be concerned with the case where you have a long stream of line-ending characters ‘\n’ ending with a line containing a comment. Please see my full code.

I think that this is a convincing proof-of-principle demonstration that you can identify and possibly remove line comments from a piece of code. I invite contributions and efforts toward turning this into something useful!

Science and Technology links (April 22 2023)

  1. There are many theories regarding what causes biological aging. Animals can differ by up to six orders of magnitude (100000x) in longevity. Some animal species like the lobster or the naked mole rat do not exhibit measurable biological aging (they have negligible senescence), whereas trees and other plants often have negative aging (they get more fit over time). Pamplona et al. (2023) argue that animal aging is the result of a genetic program. They report that long-lived species do not have greater defenses against damage nor greater repair abilities, except maybe for the skin (since long-lived species have greater abilities to repair their skin). In fact, long-lived species tend to have weaker antioxidant systems. Similar animal species can have large longevity differences, and it is difficult to manipulate them to change this longevity by large factors, which is more consistent with a programmed model of aging than with a damage model. How might programmed aging have evolved? It may have evolved through group selection: species with more rapid aging can evolve and adapt more quickly, with lesser chances that they will go extinct.
  2. Obesity seems to increase inflammation and may contribute to osteoarthritis. Obesity is directly associated with the clinical and functional consequences of knee osteoarthritis.
  3. The European Union passed a privacy-protection legal framework (GDPR) a few years ago. Those of us visiting European websites are exposed to it through the cookie notifiers and other warnings. Johnson finds that the economic consequences of GDPR are mostly bad for Europe:

    The GDPR reduced investment for EU technology ventures. The GDPR disproportionately hurts smaller firms, to the benefit of larger firms (often large international corporations). Many European firms report to be limited in their ability to innovate due to the GDPR.

    As a related fact, there is no European company on the lists of most successful technology companies provided by Wikipedia. Europe has some large technology companies such as Dassault Systèmes (France), Spotify (Sweden) and SAP (Germany), as well as several video-game companies.

  4. Female authors as well as people at top institutions are given preferential treatment during peer review among economists.
  5. Animals may have evolved to use sugar (fructose) as a protection against mass extinction: when you eat a lot of sugar, your body may tend to store more energy and water for use at a later date, it may reduce its oxygen usage, it may drive it to store salt, and raise blood pressure.

Will emerging artificial intelligence boost research productivity?

Bryan Caplan, an economist, raised an interesting question on Twitter: why aren’t people celebrating the fact that tools like GPT might soon allow us to produce many more research papers at the same cost, without sacrificing quality?

The American government is funding research to the tune of 138 billion dollars a year. It includes 45 billion dollars solely for the National Institutes of Health. There are governmental laboratories, special projects, the National Science Foundation, and so forth. And, of course, you have military research. For comparison, a large corporation like Google, with hundreds of thousands of employees, makes about 20 billion dollars in profit each year. We pour a significant effort into scientific research.

The number of research articles published per year follows an exponential curve. We publish millions of research articles a year. We have never published so many research papers, and we will publish many more next year. Thousands of researchers publish at least one research paper a week.

At a glance, the objective of much of this research is to produce research articles. It seems certain that if we could produce more of them for the same cost, the national research productivity would go up.

The problem, as pointed out by Caplan, is that nobody cares about these research articles. In fact, if you asked engineers to pay the authors to access their research articles, they almost certainly would not be willing to pay.

To be clear, Caplan does not mean ‘all of it’. Some of the work is widely read and influential. He estimates that the worthwhile publications make up 2%. Nevertheless, most published articles are not contributing anything at all. So our ability to write more of them will not help us in any way.

As people like Stonebraker have remarked, we have lost our way: people publish to get jobs and promotions, and no longer to advance science. The focus on peer-reviewed publications has more to do with a status competition between researchers and academics than with a genuine desire to advance science. The metrics have been gamed. Researchers, generally, no longer have customers: nobody uses the results of the work, except maybe other researchers. Peer-reviewed papers are getting increasingly boring.

It does not mean that actual scientific progress does not happen. Nor does it mean that we should stop writing papers and books. The research done at OpenAI is changing the world, and they are writing research papers. However, if you go on OpenAI's web site, under Research, you find a list of links to arXiv, a freely accessible repository of papers that are often not formally peer reviewed. In other words, the researchers at OpenAI seem more concerned with advancing the science than with playing a “publication game”.

Effectively, we have a severe misalignment problem. What gets you a job, a promotion or a research grant, no longer has much to do with whether you are advancing science.

Sometimes, when I sit on a recruiting committee for new professors, I ask the candidate to tell us about their most important scientific contribution. I then ask them why it is an important contribution. Almost invariably, some people will say “it was published in this important conference”. A large fraction of academia is thoroughly confused about what constitutes progress.

What will tools like GPT do to the business of science? It is hard to tell. However, it is possible that the net outcome could be greatly positive. If “publishing a paper” is no longer a significant achievement thanks to artificial intelligence, it is possible that it could stop being a status-seeking game. This would leave the field to people who seek to contribute real advances.

I think that it is conceivable that this new breakthrough in artificial intelligence could bring us to a new golden age of scientific research. Of course, people will still seek to game prestige signals, but a fast changing technological landscape could reset the metrics and re-align the incentives.


Defining interfaces in C++: concepts versus inheritance

In a previous blog post, I showed how you could define ‘an interface’ in C++ using concepts. For example, I can specify that a type should have the methods has_next, next and reset:

template <typename T>
concept is_iterable = requires(T v) {
                        { v.has_next() } -> std::convertible_to<bool>;
                        { v.next() } -> std::same_as<uint32_t>;
                        { v.reset() };
                      };

I can then define a function template taking as a parameter any type that satisfies the concept:

template <is_iterable T> size_t count(T &t) {
  t.reset();
  size_t count = 0;
  while (t.has_next()) {
    t.next();
    count++;
  }
  return count;
}

In that blog post, I stated that I did not take into account inheritance as a strategy. Let us do so. We can define a generic base class and a corresponding generic function:

class iter_base {
public:
  virtual bool has_next() = 0;
  virtual uint32_t next() = 0;
  virtual void reset() = 0;
  virtual ~iter_base() = default;
};

size_t count_inheritance(iter_base &t) {
  t.reset();
  size_t count = 0;
  while (t.has_next()) {
    t.next();
    count++;
  }
  return count;
}

I can define a class that is suitable for both functions, as it satisfies the inheritance condition, as well as the concept:

struct iterable_array : iter_base {
  std::vector<uint32_t> array{};
  size_t index = 0;
  void reset() { index = 0; }
  bool has_next() { return index < array.size(); }
  uint32_t next() {
    index++;
    return array[index - 1];
  }
};

So far so good. But what is the difference between these two expressions given that a is an instance of iterable_array?

  • count(a),
  • count_inheritance(a).

Given an optimizing compiler, the first function (count(a)) is likely to just immediately return the size of the backing vector. The function is nearly free.

The second function (count_inheritance(a)) does not know anything about the iterable_array type so it will iterate through the content naively, and might be hundreds of times more expensive.

Defining interfaces in C++ with ‘concepts’ (C++20)

In an earlier blog post, I showed that the Go programming language allows you to write generic functions once you have defined an interface. Java has a very similar concept under the same name (interface). I gave the following example:

type IntIterable interface {
    HasNext() bool
    Next() uint32
    Reset()
}

func Count(i IntIterable) (count int) {
    count = 0
    i.Reset()
    for i.HasNext() {
        i.Next()
        count++
    }
    return
}

From this code, all you have to do is provide a type that supports the interface (having the methods HasNext, Next and Reset with the right signature) and you can use the function Count.

What about C++? Assume that I do not want to use C++ inheritance. In conventional C++, you could just write a Count template like so:

template <class T> size_t count(T &t) {
  t.reset();
  size_t count = 0;
  while (t.has_next()) {
    t.next();
    count++;
  }
  return count;
}

That is fine when used in moderation, but if I am a programmer and I need to use a template function I am unfamiliar with, I might have to read the code to find out what my type needs to implement to be compatible with the function template. It also limits the tools that I use to program: they cannot know much about the type I am going to use in practice within the count function.

Thankfully, C++ now has the equivalent to a Go or Java interface, and it is called a concept (it requires a recent compiler with support for C++20). You would implement it as so…

template <typename T>
concept is_iterable = requires(T v) {
                        { v.has_next() } -> std::convertible_to<bool>;
                        { v.next() } -> std::same_as<uint32_t>;
                        { v.reset() };
                      };

template <is_iterable T> size_t count(T &t) {
  t.reset();
  size_t count = 0;
  while (t.has_next()) {
    t.next();
    count++;
  }
  return count;
}

It is even better than the Go equivalent because, as my example demonstrates, I do not have to require strictly that the has_next function return a Boolean; I can just require that it returns something that can be converted to a Boolean. In this particular example, I require that the next method returns a specific type (uint32_t), but I could have required, instead, an integer (std::is_integral<T>::value) or a number (std::is_arithmetic<T>::value).
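
For instance, here is a variant of the concept (my own illustration, written with the standard std::integral concept from <concepts> rather than the std::is_integral trait) that accepts any integer-returning next method:

#include <concepts>

// Variant concept: accept any integral return type for next().
template <typename T>
concept is_iterable_any_integer = requires(T v) {
                                    { v.has_next() } -> std::convertible_to<bool>;
                                    { v.next() } -> std::integral;
                                    { v.reset() };
                                  };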

In Go, I found that using an interface was not free: it can make the code slower. In C++, if you implement the following type and call count on it, you find that optimizing compilers are able to just figure out that they need to return the size of the inner vector. In other words: the use of a concept/template has no runtime cost. However, you pay for it up front with a greater compile time.

struct iterable_array {
    std::vector<uint32_t> array{};
    size_t index = 0;
    void reset() { index = 0; }
    bool has_next() { return index < array.size(); }
    uint32_t next() { index++; return array[index - 1]; }
};

size_t f(iterable_array & a) {
    return count(a);
}

In C++, you can even make the runtime cost absolutely nil by forcing compile-time computation. The trick is to make sure that your type can be instantiated as a compile-time constant, and then you just pass it to the count function. It works with a recent GCC right now, but should be eventually broadly supported. In the following code, the function just returns the integer 10.

template <is_iterable T>
constexpr size_t count(T&& t) {
    return count(t);
}

struct iterable_array {
    constexpr iterable_array(size_t s) : array(s) {}
    std::vector<uint32_t> array{};
    size_t index = 0;
    constexpr void reset() { index = 0; }
    constexpr bool has_next() { return index < array.size(); }
    constexpr uint32_t next() { index++; return array[index - 1]; }
};


consteval size_t f() {
    return count(iterable_array(10));
}


You can examine my source code if you would like.

So what are concepts good for? I think it is mostly about documenting your code. For example, in the simdjson library, we have template methods of the type get<T>() where T is meant to be one of a few select types (int64_t, double, std::string_view, etc.). Some users invariably hope for some magic and just do get<mytypefromanotherlibrary>(), hoping that simdjson will somehow have an adequate overload. They then get a nasty error message. By using concepts, we might limit these programming errors. In fact, IDEs and C++ editors might catch them right away.
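
As a purely hypothetical sketch (this is not the actual simdjson API), one could constrain such a method to a closed list of supported types, so that a call with an unsupported type fails at the constraint with a readable message rather than deep inside an overload set:

#include <concepts>
#include <cstdint>
#include <string_view>

// Hypothetical illustration: a concept listing the types that get<T>() accepts.
template <typename T>
concept supported_by_get =
    std::same_as<T, int64_t> || std::same_as<T, uint64_t> ||
    std::same_as<T, double> || std::same_as<T, bool> ||
    std::same_as<T, std::string_view>;

struct json_value {
  template <supported_by_get T> T get(); // only callable with supported types
};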

In the next blog post, I explain why concepts might be preferable to standard inheritance.

Science and Technology links (April 15 2023)

  1. Some university professors include ‘trigger warnings’ in their course material, to warn students that potentially disturbing content may be encountered. According to Bridgland et al., the research on the effectiveness of trigger warnings suggests that they are not beneficial; the warnings themselves cause anxiety.
  2. Switching 12% of the global livestock from pigs and chicken to ruminants (e.g., cows) could reduce nitrogen emissions by 2% and greenhouse gas emissions by 5% (Cheng et al., 2022).
  3. Lonely individuals are more likely to have original views.
  4. In a clinical trial consisting mostly of older women, the supplement urolithin A improved muscular endurance. You can buy urolithin A on Amazon and in health stores.
  5. Future mothers who are exposed to higher levels of lithium might be more likely to have autistic children.

Interfaces are not free in Go

We are all familiar with the concept even if we are not aware of it: when you learn about arithmetic in school, you use the same mathematical symbols whether you are dealing with integers, fractions or real numbers. In software programming, this concept is called polymorphism: it means that you can access objects of different types through the same interface.

The Go programming language has “polymorphism” through the notion of ‘interface’. It is somewhat similar to interfaces in Java, if you are a Java programmer.

Let us illustrate. Sometimes we like to ‘iterate’ through a set of values… we often call the software implementation of this concept an iterator. You can specify an iterator as an interface in Go… and once you have that, you can define a function to count the number of elements:

type IntIterable interface {
    HasNext() bool
    Next() uint32
    Reset()
}

func Count(i IntIterable) (count int) {
    count = 0
    i.Reset()
    for i.HasNext() {
        i.Next()
        count++
    }
    return
}

So far, so good. This function Count will count the number of elements in any Go instance that satisfies the interface (HasNext, Next, Reset): any structure in Go that implements these methods can be processed by the function.

Next you can provide Go with an actual implementation of the ‘IntIterable’ interface:

type IntArray struct {
    array []uint32
    index int
}

func (a *IntArray) HasNext() bool {
    return a.index < len(a.array)
}

func (a *IntArray) Next() uint32 {
    a.index++
    return a.array[a.index-1]
}

func (a *IntArray) Reset() {
    a.index = 0
}

And magically, you can call Count(&array) on an instance of IntArray and it will count the number of elements. This seems a lot better than specialized code such as…

func CountArray(a *IntArray) (count int) {
    count = 0
    a.Reset()
    for a.HasNext() {
        a.Next()
        count++
    }
    return
}

Unfortunately, it is not entirely better because the specialized function may be significantly faster than the function taking an interface. For sizeable arrays, the specialized function is about twice as fast in some of my tests:

Count (interface) 2 ns/element
CountArray 1 ns/element

Your results will vary, but the concept remains: Go does not ensure that interfaces are free computationally. If it is a performance bottleneck, it is your responsibility to optimize the code accordingly.

Sadly, both of these functions are too slow: the computation of the number of elements should be effectively free (0 ns/element for sizeable arrays) since it is just the length of the array, a constant throughout the benchmarks. In fact, in my benchmarks, the size of the array is even known at compile time.

My code is available.

In the next blog post, I compare with what is possible in C++.

19 random digits is not enough to uniquely identify all human beings

Suppose that you assigned everyone a 19-digit number. What is the probability that two human beings would have the same number? It is an instance of the birthday paradox.

Assuming that there are 8 billion people, the probability that at least two of them end up with the same number is given by the following table:

digits probability
18 99.9%
19 96%
20 27%
21 3%
22 0.3%
23 0.03%
24 0.003%
25 0.0003%

If you want the probability to be effectively zero, you should use 30 digits or so.
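
These numbers follow from the standard birthday-problem approximation: with n people and d possible numbers, the probability of at least one collision is roughly

\[ 1 - \exp\!\left(-\frac{n^2}{2d}\right) . \]

With n = 8 × 10^9 and d = 10^19 (19 digits), the exponent is about 3.2 and the probability is about 1 − e^(−3.2) ≈ 96%, matching the table.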

Consider using constexpr static function variables for performance in C++

When programming, we often need constant variables that are used within a single function. For example, you may want to look up characters from a table. The following function is efficient:

char table(int idx) {
  const char array[] = {'z', 'b', 'k', 'd'};
  return array[idx];
}

It gets trickier if you have constants that require initialization. For example, the following is terrible code:

std::string table(int idx) {
  const std::string array[] = {"a", "l", "a", "z"};
  return array[idx];
}

It is terrible because it is possible that the compiler will create all the string instances each time you enter the function, and then throw them away immediately. To fix this problem, you may declare the array to be ‘static’. It tells the compiler that you want the string instances to be initialized exactly once (and, since C++11, this initialization is guaranteed to be thread safe). The string instances are then tied to the function itself rather than to each call.

std::string table(int idx) {
  const static std::string array[] = {"a", "l", "a", "z"};
  return array[idx];
}

But how does the compiler ensure that the initialization occurs just once? It may do so by using a guard variable, and loading this guard variable each time the function is called. If the guard indicates that the strings may not have been instantiated yet, a thread-safe routine is called; this routine proceeds with the initialization if needed and sets the guard variable to indicate that no initialization is required in the future.
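
Conceptually (this is a simplification, not actual compiler output, and it ignores the thread safety that the real mechanism provides through helper routines such as __cxa_guard_acquire), the function behaves a bit like this:

#include <string>

// Rough, single-threaded emulation of a function-local static: a guard flag is
// checked on every call and a one-time slow path performs the initialization.
std::string table_emulated(int idx) {
  static bool initialized = false;   // the "guard variable"
  static std::string *array = nullptr;
  if (!initialized) {                // slow path, taken only on the first call
    array = new std::string[4]{"a", "l", "a", "z"};
    initialized = true;
  }
  return array[idx];
}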

This initialization is inexpensive, and the later checks are inexpensive as well. But they are not free and they generate a fair amount of binary code. A better approach is to tell the compiler that you want the initialization to occur at compile time. In this manner, there is no overhead whatsoever when you call the function. There is no guard variable. You get direct access to your constants. Unfortunately, it is not generally possible to have C++ string instances be instantiated at compile time, but it is possible with the C++17 counterpart ‘string_view’. We can declare the constant variables with the attributes constexpr static. The attribute constexpr tells the compiler to do the work at compile time. The resulting code is most efficient:

std::string_view table(int idx) {
  constexpr static std::string_view array[] = {"a", "l", "a", "z"};
  return array[idx];
}

I wrote a little benchmark to illustrate the effect. Your results will vary depending on your system. I use an AMD Rome processor with Linux Ubuntu 22 (GCC 11). My source code is available.

constexpr static string_view 2.4 ns/call
static string_view 2.6 ns/call
static string 7 ns/call
string 22 ns/call

The difference between using only static or constexpr static is not large as far as the runtime is concerned, and it may even be too small to measure. However, the variant with constexpr static should generate less code (less bloat) in general.

In this instance, other compilers like LLVM may make the constexpr qualifier unnecessary… but the principle is that if you want compile-time initialization, you should be explicit.

Science and Technology links (April 11 2023)

  1. Robert Metcalfe won the Turing Award (the ‘Computer Science Nobel Prize’) for his work on early networking. According to DBLP, Metcalfe published 11 journal articles and 7 peer-reviewed conference papers. Metcalfe was denied his PhD at first. Among other things, he is known for the following law:

    Metcalfe’s law states that the financial value or impact of a telecommunications network is proportional to the square of the number of connected users of the system.

  2. Loconte et al. propose using standardized psychological tests to evaluate tools like ChatGPT.
  3. Altay and Acerbi report that people fear misinformation because they assume that other people are gullible.
  4. Short videos in sequence may impair your cognition.
  5. You can train yourself for competitive high intensity exercises while on a high fat diet.
  6. 4% of boys and 1% of girls have an autism spectrum disorder (ASD) at age 8 according to the CDC.
  7. An amateur mathematician (David Smith) discovered for the first time how to create a tile that can cover a plane without ever repeating the pattern.
  8. According to Verger, as the rise of the West was already operating at great speeds prior to European colonialism, the often-suggested idea that the West’s ascent can be attributed to Europe’s colonialist history is untenable.
  9. Woody vegetation cover over sub-Saharan Africa increased by 8% between 1986 and 2016.
  10. Over the last century, the hardwood trees in Ohio have extended their growth season by an extra month, due to warmer temperatures.
  11. High-protein diets and intermittent fasting are effective at weight loss.

Programming-language popularity by GitHub pull requests

GitHub is probably the most popular software repository in the world. One important feature on GitHub is the ‘pull request’: we often contribute to a piece of software by proposing changes to a piece of code.

The number of pull requests is not, per se, an objective measure of how much one contributes to a piece of software. Pull requests can be large or small. It is also possible to commit directly to a software project without ever creating a pull request. Furthermore, some automated tools will generate pull requests (for security patches). This is especially common in JavaScript and TypeScript.

Nevertheless, in my view, the number of pull requests is an important indicator of how much people are willing and capable of contributing to your software in the open source domain.

The gist of the story goes as follows:

  1. The most popular languages are JavaScript/TypeScript and Python with roughly 20% of all pull requests each. In effect, if you put JavaScript/TypeScript and Python together, you get about 40% of all pull requests.
  2. Then you get the second tier languages: Java and Scala, C/C++, and Go. They all are in the 10% to 15% range.
  3. Finally, you have PHP, Ruby and C# that all manage to get about 5% of all pull requests.
  4. Other languages are typically far below 5%.

The popularity of JavaScript and derivative languages is strong. It matches my experience. I have published a few JavaScript/TypeScript libraries (FastPriorityQueue.js and FastBitSet.js) and they have received a continuous flow of contributions. It appears that TypeScript is progressively replacing part of JavaScript, but they are often used interchangeably.

Python is close behind: 15% to 20% of the pull requests. I suspect that being the default programming language of data science is helping sustain its well deserved popularity. I mostly use Python for quick scripts.

Java and Scala are still quite strong (10% to 15%): we do not observe a decline in the popularity of Java and it may even be gaining. It seems that the bet on faster release cycles in Java is beneficial. The language and the JVM are fast improving. Our RoaringBitmap library is receiving a constant flow of high-quality pull requests. I am a little bit surprised at the sustained popularity of Java, but it is undeniable. I find that building and publishing Java artefacts is unnecessarily challenging, compared to JavaScript and Python.

C/C++ is on the rise (above 10%). C++ has roughly doubled its relative popularity in terms of pull requests in the last ten years. I suspect that the great work that the C++ standard committee is doing, modernizing the language with every new standard, is helping. The tooling in C++ is also fast improving: it is easier than ever to write good C++ code. I have probably never spent as much time programming in C++ (simdjson, simdutf, ada, fast_float). I find it easy to find really smart collaborators.

Go is holding at nearly 10%: it underwent a fast rise but seems to have plateaued starting in 2018. I imagine that the imminent release of Go 2.0 could help. Our Bloom and bitset libraries in Go (see roaring too) receive many pull requests. Go is still one of my favourite programming languages. You can teach Go in a week-end, it comes with all the necessary tools (benchmarking, formatting, testing), its runtime library is accessible and complete, it is trivial to deploy a Go binary on a server, it builds quickly.

PHP and Ruby are falling (both at 5%) to C# level (around 5%). Ten years ago, both PHP and Ruby were default programming languages on the web. I am not exactly sure why PHP is falling in popularity among open source developers, but I  suspect that it might be suffering at the hands of JavaScript/TypeScript. C# is holding steady but I consider that it is an underrated programming language.

Source: A small place to discover languages in GitHub.