Passing recursive C++ lambdas as function pointers

In modern C++, as in many popular languages, you can create ‘lambdas’. Effectively, they are potentially anonymous function instances that you can create on the fly as you are programming, possibly inside another function. The following is a simple example.

auto return1 = [](int n) -> int { return 1; };

What about recursive functions? At first I thought you could do…

auto fact = [](int n) -> int {
 if (n == 0) {
   return 1;
 } else {
  return n * fact(n - 1);
 }
};

Sadly, it fails to compile. What seems to be happening is that while the compiler recognizes the name ‘fact’ within the definition of ‘fact’, it cannot use it without knowing its type. So you should specify the type of ‘fact’ right away. The following will work:

std::function<int(int)> fact = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
};

But using std::function templates may add complexity. For example, what if you have a function that takes a function as a parameter without using std::function, such as…

void print(int(*f)(int)) {
  for(int k = 1; k < 10; ++k) {
   std::cout << "Factorial of " << k << " is " << f(k) << std::endl;
  }
}

Then you would want to call print(fact), but it will not work directly. It may complain like so:

No known conversion from 'std::function' to 'int (*)(int)'

So let us avoid the std::function as much as possible:

int (*factorial)(int) = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * factorial(n - 1);
  }
};

And then everything works out fine:

    print(factorial); // OK

Let me finish with a word of caution: functional programming is sophisticated, but it has downsides. One potential downside is performance. Let us consider this conventional code:

int factorialc(int n) {
  if (n == 0) {
    return 1;
  } else {
    return n * factorialc(n - 1);
  }
}
int functionc() {
  return factorialc(10);
}

Most compilers should produce highly optimized code in such a scenario. In fact, it is likely that the returned value of ‘functionc’ gets computed at compile time. The alternative using lambdas might look as follows:

int (*lfactorial)(int) = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * lfactorial(n - 1);
  }
};

int functionl() {
  return lfactorial(10);
}

Though the results will depend on your system, I would expect far less efficient code in general.

Thus, when programming in C++, if you use lambdas in performance-critical code, run benchmarks or disassemble your function to make sure that you have, indeed, a zero-cost abstraction.

My source code is available.

Credit: Thanks to Ca Yi, Yagiz Nizipli and many X users for informing this post.

Further reading: Recursive lambdas from C++14 to C++23

Measuring your system’s performance using software (Go edition)

When programming software, we work with an abstraction of the system. The computer hardware may not know about your functions, your variables, and your data. It may only see bits and instructions. Yet to write efficient software, the programmer needs to be aware of the characteristics of the underlying system. Thankfully, we can also use the software itself to observe the behavior of the system through experiments.

Between the software and the hardware, there are several layers, such as the compiler and the operating system. A good programmer should take these layers into account when needed. A good programmer must also understand the behavior of their software in terms of these layers.

Benchmarks in Go

To measure the performance, we often measure the time required to execute some function. Because most functions are fast, it can be difficult to precisely measure the time that a function takes if we run it just once. Instead, we can run the function many times and record the total time. We can then divide the total time by the number of executions. It can be difficult to decide how many times we should execute the function: it depends in part on how fast the function is. If a function takes 6 seconds to run, we may not want or need to run it too often. An easier strategy is to specify a minimum duration and repeatedly call the function until we reach or exceed that minimum duration.
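
As a rough sketch of this strategy (the helper measure below is hypothetical, not part of any library), we can repeat a function until a minimum duration has elapsed and divide the total time by the number of calls:

package main

import (
    "fmt"
    "time"
)

// measure calls f repeatedly until at least minDuration has elapsed,
// then returns the average time per call.
func measure(f func(), minDuration time.Duration) time.Duration {
    count := 0
    start := time.Now()
    for time.Since(start) < minDuration {
        f()
        count++
    }
    return time.Since(start) / time.Duration(count)
}

func main() {
    work := func() {
        sum := 0
        for i := 0; i < 1000; i++ {
            sum += i
        }
        _ = sum
    }
    fmt.Println("average time per call:", measure(work, time.Second))
}

This is essentially what the Go testing package automates for us, as we see below.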

When the function has a short execution time, we often call the benchmark a microbenchmark. We use microbenchmarks to compare different implementations of the same functionality or to better understand the system or the problem. We should always keep in mind that a microbenchmark alone cannot be used to justify a software optimization. Real-world performance depends on multiple factors that are difficult to represent in a microbenchmark.

Importantly, all benchmarks are affected by measurement errors, and by interference from the system. To make matters worse, the distribution of timings may not follow a normal distribution.

All programming languages provide the ability to run benchmarks. In Go, the tools make it easy to write benchmarks. You can import the testing package and create a function whose name starts with Benchmark and that takes a single parameter of type *testing.B. For example, the following program benchmarks the time required to compute the factorial of 10 as an integer:

package main

import (
    "fmt"
    "testing"
)

var fact int

func BenchmarkFactorial(b *testing.B) {
    for n := 0; n < b.N; n++ {
        fact = 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
    }
}

func main() {
    res := testing.Benchmark(BenchmarkFactorial)
    fmt.Println("BenchmarkFactorial", res)
}

If you put functions with such a signature (BenchmarkSomething(b *testing.B)) as part of your tests in a project, you can run them with the command go test -bench . (the final dot is part of the command). To run just one of them, you can specify a pattern such as go test -bench Factorial, which only runs the benchmark functions whose names contain the word Factorial.

The b.N field indicates how many times the benchmark function runs. The testing package adjusts this value by increasing it until the benchmark runs for at least one second.

Measuring memory allocations

In Go, each function has its own ‘stack memory’. As the name suggests, stack memory is allocated and deallocated in a last-in, first-out (LIFO) order. This memory is typically only usable within the function, and it is often limited in size. The other type of memory that a Go program may use is heap memory. Heap memory is allocated and deallocated in a random order. There is only one heap shared by all functions.

With the stack memory, there is no risk that the memory may get lost or misused since it belongs to a specific function and can be reclaimed at the end of the function. Heap memory is more of a problem: it is sometimes unclear when the memory should be reclaimed. Programming languages like Go rely on a garbage collector to solve this problem. For example, when we create a new slice with the make function, we do not need to worry about reclaiming the memory. Go automatically reclaims it. However, it may still be bad for performance to constantly allocate and deallocate memory. In many real-world systems, memory management becomes a performance bottleneck.
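
As a small illustration (not part of the original text), the following sketch contrasts a buffer that can stay on the stack with one that must be allocated on the heap; compiling with go build -gcflags=-m should report which allocations escape to the heap:

package main

import "fmt"

// stackOnly uses a small, fixed-size local array: nothing escapes the
// function, so the compiler can keep the data in stack memory.
func stackOnly() int {
    var buffer [16]int
    buffer[0] = 1
    for i := 1; i < len(buffer); i++ {
        buffer[i] = i * buffer[i-1]
    }
    return buffer[len(buffer)-1]
}

// escapes returns a slice: the backing array outlives the function call,
// so it must be allocated on the heap and later reclaimed by the
// garbage collector.
func escapes(n int) []int {
    buffer := make([]int, n)
    for i := range buffer {
        buffer[i] = i
    }
    return buffer
}

func main() {
    fmt.Println(stackOnly())
    fmt.Println(len(escapes(100)))
}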

Thus it is sometimes interesting to include the memory usage as part of the benchmark. The Go testing package allows you to measure the number of heap allocations made. Typically, in Go, it roughly corresponds to the number of calls to make and to the number of objects that the garbage collector must handle. The following extended program computes the factorial by storing its computation in dynamically allocated slices:

package main

import (
    "fmt"
    "testing"
)

var fact int

func BenchmarkFactorial(b *testing.B) {
    for n := 0; n < b.N; n++ {
        fact = 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
    }
}
func BenchmarkFactorialBuffer(b *testing.B) {
    for n := 0; n < b.N; n++ {
        buffer := make([]int, 11)
        buffer[0] = 1
        for i := 1; i <= 10; i++ {
            buffer[i] = i * buffer[i-1]
        }
    }
    b.ReportAllocs()
}

func BenchmarkFactorialBufferLarge(b *testing.B) {
    for n := 0; n < b.N; n++ {
        buffer := make([]int, 100001)
        buffer[0] = 1
        for i := 1; i <= 100000; i++ {
            buffer[i] = i * buffer[i-1]
        }
    }
    b.ReportAllocs()
}

func main() {
    res := testing.Benchmark(BenchmarkFactorial)
    fmt.Println("BenchmarkFactorial", res)
    resmem := testing.Benchmark(BenchmarkFactorialBuffer)
    fmt.Println("BenchmarkFactorialBuffer", resmem, resmem.MemString())
    resmem = testing.Benchmark(BenchmarkFactorialBufferLarge)
    fmt.Println("BenchmarkFactorialBufferLarge", resmem, resmem.MemString())
}

If you run such a Go program, you might get the following result:

BenchmarkFactorial 90887572             14.10 ns/op
BenchmarkFactorialBuffer 88609930               11.96 ns/op        0 B/op              0 allocs/op
BenchmarkFactorialBufferLarge     4408      249263 ns/op   802816 B/op         1 allocs/op

The last function allocates 802816 bytes per operation, unlike the first two. If Go determines that data is not referenced after the function returns (a process called ‘escape analysis’), and if the amount of memory used is sufficiently small, it avoids allocating the memory on the heap, preferring stack memory instead. In the case of the last function, the memory usage is too high, hence the allocation happens on the heap rather than the stack.

Measuring memory usage

Your operating system provides memory to a running process in units of pages. The operating system cannot provide memory in smaller units than a page. Thus if you allocate memory in a program, it may either cost no additional memory if there are enough pages already; or it may force the operating system to provide more pages.

The size of a page depends on the operating system and its configuration. It can often vary between 4 kilobytes and 16 kilobytes although much larger pages are also possible (e.g., 1 gigabyte).

A page is a contiguous array of virtual memory addresses. A page may also represent actual physical memory. However, operating systems tend to map only used pages to physical memory. An operating system may provide a nearly endless supply of pages to a process without ever mapping them to physical memory. Thus it is not simple to ask how much memory a program uses. A program may appear to use a lot of (virtual) memory while not using much physical memory, and inversely.

The page size impacts both the performance and the memory usage. Allocating pages to a process is not free: it takes some effort. Among other things, the operating system cannot just reuse a memory page from another process as is. Doing so would be a security threat, because you could gain indirect access to the data stored in memory by another process. This other process could have held in memory your passwords or other sensitive information. Typically, an operating system has to initialize (e.g., set to zero) a newly assigned page. Furthermore, mapping the pages to their actual physical memory also takes time. To accelerate the mapping process, modern systems keep recently used mappings in a translation lookaside buffer. When the translation lookaside buffer runs out of space, the page mapping must be recomputed by walking the page tables, an expensive process. Thus large pages may improve the performance of some programs. However, large pages force the operating system to provide memory in larger chunks to a process, potentially wasting precious memory.

You can write a Go program which prints out the page size of your system:

package main

import (
    "fmt"
    "os"
)

func main() {
    pageSize := os.Getpagesize()
    fmt.Printf("Page size: %d bytes (%d KB)\n", pageSize, pageSize/1024)
}

Go makes it relatively easy to measure the number of pages allocated to a program by the operating system. Nevertheless, some care is needed. Because Go uses a garbage collector to free allocated memory, there might be a delay between the moment you no longer need some memory and the actual freeing of that memory. You may force Go to run the garbage collector immediately with the function call runtime.GC(). You should rarely invoke the garbage collector deliberately in practice, but for our purposes (measuring memory usage), it is useful.

There are several memory metrics. In Go, some of the most useful are HeapSys and HeapAlloc. The first indicates how much memory (in bytes) has been given to the program by the operating system. The second value, which is typically lower, indicates how much of that memory is actively in use by the program.

The following program allocates ever larger slices, and then ever smaller slices. In theory, the memory usage should first go up, and then go down:

package main

import (
    "fmt"
    "os"
    "runtime"
)

func main() {
    pageSize := os.Getpagesize()
    var m runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&m)
    fmt.Printf(
        "HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapAlloc)/1024.0/1024.0,
        float64(m.HeapSys)/float64(pageSize),
    )
    i := 100
    for ; i < 1000000000; i *= 10 {
        runtime.GC()
        s := make([]byte, i)
        runtime.ReadMemStats(&m)
        fmt.Printf(
            "%.3f MiB, HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
            float64(len(s))/1024.0/1024.0,
            float64(m.HeapSys)/1024.0/1024.0,
            float64(m.HeapAlloc)/1024.0/1024.0,
            float64(m.HeapSys)/float64(pageSize),
        )
    }
    for ; i >= 100; i /= 10 {
        runtime.GC()
        s := make([]byte, i)
        runtime.ReadMemStats(&m)
        fmt.Printf(
            "%.3f MiB, HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
            float64(len(s))/1024.0/1024.0,
            float64(m.HeapSys)/1024.0/1024.0,
            float64(m.HeapAlloc)/1024.0/1024.0,
            float64(m.HeapSys)/float64(pageSize),
        )
    }
    runtime.GC()
    runtime.ReadMemStats(&m)
    fmt.Printf(
        "HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapAlloc)/1024.0/1024.0,
        float64(m.HeapSys)/float64(pageSize),
    )
}

The program calls os.Getpagesize() to get the underlying system’s memory page size in bytes as an integer, and assigns it to a variable pageSize. It declares a variable m of type runtime.MemStats, which is a struct that holds various statistics about the memory allocator and the garbage collector. The program repeatedly calls runtime.GC() to trigger a garbage collection cycle manually, which may free some memory and make it available for release. It calls runtime.ReadMemStats(&m) to populate the m variable with the current memory statistics. We can reuse the same variable m from call to call. The purpose of this program is to demonstrate how the memory usage of a Go program changes depending on the size and frequency of memory allocations and deallocations, and how the garbage collector and the runtime affect the memory release. The program prints the memory usage before and after each allocation, and shows how the m.HeapSys, m.HeapAlloc, and m.HeapSys / pageSize values grow and shrink accordingly.

If you run this program, you may observe that a program tends to hold on to the memory you have allocated and later released. It is partly a matter of optimization: acquiring memory takes time and we wish to avoid giving back pages only to soon request them again. It illustrates that it can be challenging to determine how much memory a program uses.

The program may print something like the following:

$ go run mem.go
HeapSys = 3.719 MiB, HeapAlloc =  0.367 MiB,  238.000 pages
0.000 MiB, HeapSys = 3.719 MiB, HeapAlloc =  0.367 MiB,  238.000 pages
0.001 MiB, HeapSys = 3.719 MiB, HeapAlloc =  0.383 MiB,  238.000 pages
0.010 MiB, HeapSys = 3.688 MiB, HeapAlloc =  0.414 MiB,  236.000 pages
0.095 MiB, HeapSys = 3.688 MiB, HeapAlloc =  0.477 MiB,  236.000 pages
0.954 MiB, HeapSys = 3.688 MiB, HeapAlloc =  1.336 MiB,  236.000 pages
9.537 MiB, HeapSys = 15.688 MiB, HeapAlloc =  9.914 MiB,  1004.000 pages
95.367 MiB, HeapSys = 111.688 MiB, HeapAlloc =  95.750 MiB,  7148.000 pages
953.674 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  954.055 MiB,  68332.000 pages
95.367 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  95.750 MiB,  68332.000 pages
9.537 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  9.914 MiB,  68332.000 pages
0.954 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  1.336 MiB,  68332.000 pages
0.095 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.477 MiB,  68332.000 pages
0.010 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.414 MiB,  68332.000 pages
0.001 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.383 MiB,  68332.000 pages
0.000 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.375 MiB,  68332.000 pages
HeapSys = 1067.688 MiB, HeapAlloc =  0.375 MiB,  68332.000 pages

Observe how, at the very beginning, the program already holds 238 pages (3.7 MiB of HeapSys), with over a third of a megabyte reported as actively in use, and the same third of a megabyte is reported at the very end. Furthermore, over 68,000 pages remain allocated to the program at the very end, even though no data structure remains in scope within the main function.
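
If you want to encourage Go to return memory to the operating system sooner, the standard library offers runtime/debug.FreeOSMemory, which forces a garbage collection and attempts to release unused memory. The following sketch (how much memory is actually returned depends on your system and Go version) reports HeapSys and HeapReleased before and after the call:

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func report(label string) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("%-22s HeapSys = %8.3f MiB, HeapReleased = %8.3f MiB\n",
        label,
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapReleased)/1024.0/1024.0)
}

// allocate creates a large slice and lets it go out of scope, so that
// the memory becomes garbage once the function returns.
func allocate() byte {
    s := make([]byte, 1<<30) // about 1 GiB
    for i := range s {
        s[i] = 1
    }
    return s[len(s)-1]
}

func main() {
    _ = allocate()
    report("after allocation")
    runtime.GC()
    report("after GC")
    // Ask the runtime to return as much memory as possible to the OS.
    debug.FreeOSMemory()
    report("after FreeOSMemory")
}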

Inlining

One of the most powerful optimization techniques that a compiler may apply is function inlining: the compiler brings the bodies of some of the called functions directly into the calling functions.

Go makes it easy to tell which functions are inlined. We can also easily request that the compiler not inline a function by adding the line //go:noinline right before it.

Let us consider this program, which contains two benchmarks where we sum all odd integers in a range.

package main

import (
    "fmt"
    "testing"
)

func IsOdd(i int) bool {
    return i%2 == 1
}

//go:noinline
func IsOddNoInline(i int) bool {
    return i%2 == 1
}

func BenchmarkCountOddInline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sum := 0
        for i := 1; i < 1000; i++ {
            if IsOdd(i) {
                sum += i
            }
        }
    }
}

func BenchmarkCountOddNoinline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sum := 0
        for i := 1; i < 1000; i++ {
            if IsOddNoInline(i) {
                sum += i
            }
        }
    }
}

func main() {
    res1 := testing.Benchmark(BenchmarkCountOddInline)
    fmt.Println("BenchmarkCountOddInline", res1)
    res2 := testing.Benchmark(BenchmarkCountOddNoinline)
    fmt.Println("BenchmarkCountOddNoinline", res2)
}

In Go, the flag -gcflags=-m tells the compiler to report the main optimizations it does. If you call this program simpleinline.go and compile it with the command go build -gcflags=-m simpleinline.go, you may see the following:

$ go build -gcflags=-m simpleinline.go
./simpleinline.go:8:6: can inline IsOdd
./simpleinline.go:21:12: inlining call to IsOdd
...

If you run the benchmark, you should see that the inlined version is much faster:

$ go run simpleinline.go
BenchmarkCountOddInline  3716786           294.6 ns/op
BenchmarkCountOddNoinline  1388792         864.8 ns/op

Inlining is not always beneficial: in some instances, it can generate large binaries and it may even slow down the software. However, when it is applicable, it can have a large beneficial effect.

Go tries as hard as possible to inline functions, but it has limitations. For example, compilers often find it difficult to inline recursive functions. Let us benchmark two factorial functions, one that is recursive and one that is not.

package main

import (
    "fmt"
    "testing"
)

var array = make([]int, 1000)

func Factorial(n int) int {
    if n < 0 {
        return 0
    }
    if n == 0 {
        return 1
    }
    return n * Factorial(n-1)
}

func FactorialLoop(n int) int {
    result := 1
    for i := 1; i <= n; i++ {
        result *= i
    }
    return result
}

func BenchmarkFillNoinline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        for i := 1; i < 1000; i++ {
            array[i] = Factorial(i)
        }
    }
}

func BenchmarkFillInline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        for i := 1; i < 1000; i++ {
            array[i] = FactorialLoop(i)
        }
    }
}

func main() {
    res1 := testing.Benchmark(BenchmarkFillNoinline)
    fmt.Println("BenchmarkFillNoinline", res1)
    res2 := testing.Benchmark(BenchmarkFillInline)
    fmt.Println("BenchmarkFillInline", res2)
    fmt.Println(float64(res1.NsPerOp()) / float64(res2.NsPerOp()))
}

Though both FactorialLoop and Factorial are equivalent, running this program, you should find that the non-recursive function (FactorialLoop) is much faster.

Cache line

Our computers read and write memory using small blocks of memory called “cache lines”. The cache line size is usually fixed and small (e.g., 64 or 128 bytes). To attempt to measure the cache-line size, we may use a strided copy. From a large array, we copy every Nth byte to another large array, and we repeat this process N times. Thus, if the original array contains 1024 bytes, we always copy a total of 1024 bytes, whether N = 1, N = 2, N = 4, or N = 8.

When N is sufficiently large (say N = 16), the problem should be essentially memory bound: the performance is not limited by the number of instructions, but by the system’s ability to load and store cache lines. If N is larger than twice the cache line, then we can effectively skip one cache line out of two. If N is smaller than the cache line, then every cache line must be accessed. We expect a large stride to be significantly faster.

One limitation to this approach is that processors may fetch more cache lines than needed so we may overestimate the size of the cache line. However, unless memory bandwidth is overly abundant, we should expect processors to try to limit the number of cache lines fetched.

Let us run an experiment. For each stride size, we repeat 10 times and record the maximum, the minimum and the average. Consider the following program.

package main

import (
    "fmt"
    "time"
)

const size = 33554432 // 32 MB
func Cpy(arr1 []uint8, arr2 []uint8, slice int) {
    for i := 0; i < len(arr1); i += slice {
        arr2[i] = arr1[i]
    }
}

func AverageMinMax(f func() float64) (float64, float64, float64) {
    var sum float64
    var minimum float64
    var maximum float64

    for i := 0; i < 10; i++ {
        arr1 = make([]uint8, size)
        arr2 = make([]uint8, size)

        v := f()
        sum += v
        if i == 0 || v < minimum {
            minimum = v
        }
        if i == 0 || v > maximum {
            maximum = v
        }
    }
    return sum / 10, minimum, maximum
}

var arr1 []uint8
var arr2 []uint8

func run(size int, slice int) float64 {
    start := time.Now()
    times := 10
    for i := 0; i < times*slice; i++ {
        Cpy(arr1, arr2, slice)
    }
    end := time.Now()
    dur := float64(end.Sub(start)) / float64(times*slice)
    return dur
}

func main() {
    for slice := 16; slice <= 4096; slice *= 2 {
        a, m, M := AverageMinMax(func() float64 { return run(size, slice-1) })
        fmt.Printf("%10d: %10.1f GB/s [%4.1f - %4.1f]\n", slice-1, float64(size)/a, float64(size)/M, float64(size)/m)
    }
}

We may get the following result:

$ go run cacheline.go
        15:       23.6 GB/s [21.3 - 24.4]
        31:       24.3 GB/s [23.8 - 24.5]
        63:       24.2 GB/s [23.6 - 24.6]
       127:       26.9 GB/s [23.8 - 27.9]
       255:       40.8 GB/s [37.8 - 43.6]
       511:      162.0 GB/s [130.4 - 203.4]
      1023:      710.0 GB/s [652.0 - 744.4]
      2047:      976.1 GB/s [967.1 - 983.8]
      4095:     1247.4 GB/s [1147.7 - 1267.0]

We see that the performance increases substantially when the stride goes from 127 to 255. It suggests that the cache line has 128 bytes. If you run this same benchmark on your own system, you may get a different result.

The results need to be interpreted with care: we are not measuring a copy speed of 1247.4 GB/s. Rather, we can copy large arrays at such a speed if we only copy one byte out of every 4095 bytes.

CPU Cache

When programming, we often do not think directly about memory. When we do consider that our data uses memory, we often think of it as homogeneous: memory is like a large uniform canvas upon which the computer writes and reads its data. However, your main memory (RAM) is typically buffered using a small amount of memory that resides close to the processor core (the CPU cache). We often have several layers of cache memory (e.g., L1, L2, L3): L1 is typically small but very fast, whereas L3 is larger but slower.

You can empirically measure the effect of the cache. If you take a small array and shuffle it randomly, you will be moving data primarily within the CPU cache, which is fast. If you take a larger array, you will move data in memory without much help from the cache, a process that is much slower. Thus shuffling ever larger arrays is a way to determine the size of your cache. It may prove difficult to tell exactly how many layers of cache you have and how large each layer is. However, you can usually tell when your array is significantly larger than the CPU cache.

We are going to write a random shuffle function: Shuffle(arr []uint32). It uses an algorithm called the Fisher-Yates shuffle, which involves going through the array in reverse and swapping each element with another randomly chosen from those preceding it. The function uses a seed variable to generate random numbers from a simple mathematical formula: for our purposes, we use a simplistic generator that updates the seed with a constant and multiplies it by the index. The function bits.Mul64 calculates the full product of two 64-bit numbers and returns the result as two 64-bit values: the most significant (hi) and the least significant. The most significant value is necessarily between 0 and i (inclusively). We use this most significant value as the random index. The function then exchanges the elements using multiple assignment. We call this shuffle function several times, on inputs of different sizes, and we report the time normalized by the size of the input.

package main

import (
    "fmt"
    "math/bits"
    "time"
)

func Shuffle(arr []uint32) {
    seed := uint64(1234)
    for i := len(arr) - 1; i > 0; i-- {
        seed += 0x9E3779B97F4A7C15
        hi, _ := bits.Mul64(seed, uint64(i+1))
        j := int(hi)
        arr[i], arr[j] = arr[j], arr[i]
    }
}

func AverageMinMax(f func() float64) (float64, float64, float64) {
    var sum float64
    var minimum float64
    var maximum float64

    for i := 0; i < 10; i++ {
        v := f()
        sum += v
        if i == 0 || v < minimum {
            minimum = v
        }
        if i == 0 || v > maximum {
            maximum = v
        }
    }
    return sum / 10, minimum, maximum
}

func run(size int) float64 {
    arr := make([]uint32, size)

    for i := range arr {
        arr[i] = uint32(i + 1)
    }
    start := time.Now()
    end := time.Now()
    times := 0
    for ; end.Sub(start) < 100_000_000; times++ {
        Shuffle(arr)
        end = time.Now()
    }
    dur := float64(end.Sub(start)) / float64(times)
    return dur / float64(size)
}

func main() {
    for size := 4096; size <= 33554432; size *= 2 {
        fmt.Printf("%20d KB ", size/1024*4)
        a, m, M := AverageMinMax(func() float64 { return run(size) })
        fmt.Printf(" %.2f [%.2f, %.2f]\n", a, m, M)
    }
}

A possible output of running this program might be:

⚡  go run cache.go 
                  16 KB  0.70 [0.66, 0.93]
                  32 KB  0.65 [0.64, 0.66]
                  64 KB  0.64 [0.64, 0.66]
                 128 KB  0.64 [0.64, 0.67]
                 256 KB  0.65 [0.64, 0.66]
                 512 KB  0.70 [0.70, 0.71]
                1024 KB  0.77 [0.76, 0.79]
                2048 KB  0.83 [0.82, 0.84]
                4096 KB  0.87 [0.86, 0.90]
                8192 KB  0.92 [0.91, 0.95]
               16384 KB  1.10 [1.06, 1.24]
               32768 KB  2.34 [2.28, 2.52]
               65536 KB  3.90 [3.70, 4.25]
              131072 KB  5.66 [4.80, 9.78]

We see that between 16 KB and 16384 KB, the time per element does not increase much even though we repeatedly double the input size. However, between 16384 KB and 32768 KB, the time per element roughly doubles, and it keeps increasing as the array gets larger. It suggests that the size of the CPU cache is about 16384 KB in this instance.

Memory bandwidth

You can only read and write memory up to a maximal speed. It can be difficult to measure such limits. In particular, you may need several cores in a multi-core system to achieve the best possible bandwidth. For simplicity, let us consider the maximal read bandwidth.

Many large systems do not have a single bandwidth number. For example, many large systems rely on NUMA: NUMA stands for Non-Uniform Memory Access. In a NUMA system, each processor has its own local memory, which it can access faster than the memory of other processors.

The bandwidth also depends to some extent on the amount of memory requested. If the memory fits in the CPU cache, only the first access may be expensive. A very large memory region may not fit in RAM and may require disk storage. Even if it fits in RAM, an overly large memory region might require many memory pages, and accessing all of them may cause page walking due to the limits of the translation lookaside buffer.

If the memory is accessed at random locations, it might be difficult for the system to sustain a maximal bandwidth because the system cannot predict easily where the next memory load occurs. To get the best bandwidth, you may want to access the memory linearly or according to some predictable pattern.
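
To see the effect of the access pattern, we can compare summing the values of a large array visited in sequential order and in a random order. The following sketch is only an approximation (the index array itself adds memory traffic, and the exact numbers depend on your system), but the sequential version should reach a much higher effective bandwidth:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// sum reads the array following the given sequence of indexes.
func sum(arr []uint64, order []int) uint64 {
    var total uint64
    for _, i := range order {
        total += arr[i]
    }
    return total
}

// measure reports the effective read bandwidth for one traversal order.
func measure(name string, arr []uint64, order []int) {
    start := time.Now()
    total := sum(arr, order)
    bandwidth := float64(8*len(arr)) / time.Since(start).Seconds() / 1024 / 1024 / 1024
    fmt.Printf("%-10s %6.2f GB/s (checksum %d)\n", name, bandwidth, total)
}

func main() {
    const n = 1 << 25 // about 33 million 8-byte words, much larger than the cache
    arr := make([]uint64, n)
    sequential := make([]int, n)
    random := make([]int, n)
    for i := range arr {
        arr[i] = uint64(i)
        sequential[i] = i
        random[i] = i
    }
    // Shuffle the indexes to produce a random access order.
    rand.Shuffle(n, func(i, j int) { random[i], random[j] = random[j], random[i] })
    measure("sequential", arr, sequential)
    measure("random", arr, random)
}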

Let us consider the following code:

package main

import (
    "fmt"
    "time"
)

func run() float64 {
    bestbandwidth := 0.0
    arr := make([]uint8, 2*1024*1024*1024) // 2 GB
    for i := 0; i < len(arr); i++ {
        arr[i] = 1
    }
    for t := 0; t < 20; t++ {
        start := time.Now()
        acc := 0
        for i := 0; i < len(arr); i += 64 {
            acc += int(arr[i])
        }
        end := time.Now()
        if acc != len(arr)/64 {
            panic("!!!")
        }
        bandwidth := float64(len(arr)) / end.Sub(start).Seconds() / 1024 / 1024 / 1024
        if bandwidth > bestbandwidth {
            bestbandwidth = bandwidth
        }
    }
    return bestbandwidth
}

func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf(" %.2f GB/s\n", run())
    }
}

The code defines two functions: run and main. The main function is the entry point for the program, and it calls the run function 10 times, printing the result each time. The run function is a custom function that measures the memory bandwidth of the system. It does this by performing the following steps:

  • It declares a variable called bestbandwidth and initializes it to 0.0. This variable stores the highest bandwidth value obtained during the execution of the function.
  • It creates a slice of bytes (uint8) called arr, with a length of 2 GB, and initializes it with 1s.
  • The measurement loop only accesses every 64th element of the slice, skipping the rest. Given that most systems have a cache-line size of 64 bytes or more, it is enough to touch each cache line.
  • It calculates the bandwidth by dividing the size of the slice (in bytes) by the difference between the end and start times (in seconds), and then dividing by 1024 three times to convert the result to gigabytes per second (GB/s).
  • It repeats the measurement 20 times and returns the best result, to account for possible variations in the system performance.
  • The main function prints the result 10 times, to show the consistency of the measurement.

Memory latency and parallelism

Latency is often described as the time delay between the beginning of a request and the moment when you are served. Thus if you go to a restaurant, the latency you might be interested in is the time it takes before you can start eating. Latency is distinct from throughput: a restaurant might be able to serve hundreds of customers at once, but still have high latency (long delays for each customer). If you put a lot of data on a very large disk, you can put this disk in a truck and drive the truck between two cities. It could represent a large bandwidth (much data is moved per unit of time), but the latency could be quite poor (hours). Similarly, you could shine a laser at your partner when supper is ready: the information could arrive without much delay even if you are very far away, but you are communicating little information (low throughput).

One way to express this trade-off between latency and throughput is with Little’s Law: L = λW, where L is the average number of elements in the system, λ is the throughput (long-term average arrival rate of new elements), and W is the latency, or the average amount of time that elements spend waiting. Thus if you want to have L customers at all times in your restaurant, and fewer customers arrive, you should serve the customers with greater delays. And so forth. For example, if each memory request takes 100 ns to complete (W) and the processor can keep 10 requests in flight at all times (L), then the throughput is λ = L/W, or 100 million requests per second. Little’s Law works for our memory subsystems as well: computers can sustain a maximum number of concurrent memory requests, each memory request has a latency, and there is an overall bandwidth. If latency does not improve, we can still improve bandwidth or throughput by increasing the number of requests that can be sustained concurrently. System designers are often forced to make this choice, and so it is common to see stagnant or even worsening memory latencies despite fast-improving memory bandwidths.

A common illustration of the concept of memory latency is the traversal of a linked list. In computer science, a linked list is a data structure made of nodes, where each node is linked (by a pointer) to the next node. The nodes may not be laid out consecutively in memory, but even if they are, accessing each successive node requires at least a small delay. On current processors, it can often take at least 3 cycles to load data from memory, even if the memory is in cache. Thus determining the length of the list by traversing the whole linked list can take time, and most of this time is just made of the successive delays. The following code benchmarks the time required to traverse a linked list made of a million nodes. Though the time varies depending on your system, it may represent a sizeable fraction of a millisecond.

package main

import (
    "fmt"
    "testing"
)

type Node struct {
    data int
    next *Node
}

func build(volume int) *Node {
    var head *Node
    for i := 0; i < volume; i++ {
        head = &Node{i, head}
    }
    return head
}

var list *Node
var N int

func BenchmarkLen(b *testing.B) {
    for n := 0; n < b.N; n++ {
        len := 0
        for p := list; p != nil; p = p.next {
            len++
        }
        if len != N {
            b.Fatalf("invalid length: %d", len)
        }
    }
}

func main() {
    N = 1000000
    list = build(N)
    res := testing.Benchmark(BenchmarkLen)
    fmt.Println("milliseconds: ", float64(res.NsPerOp())/1e6)

    fmt.Println("nanoseconds per el.", float64(res.NsPerOp())/float64(N))
}

In this code, a Node struct is defined with two fields: data, an integer representing the value stored in the node, and next, a pointer to the next node in the linked list. We could also add a pointer to the previous node, but that is not necessary in our case. The build function creates a singly linked list of nodes from an integer volume given as an argument. It initializes an empty linked list (head is initially nil). It iterates from 0 to volume-1, creating a new node with value i and pointing its next field to the current head. The new node becomes the new head. The function returns the final head of the linked list. The main function initializes two global variables (list and N), storing respectively the head of the list and the expected length. These values are used by the BenchmarkLen function.

This code demonstrates how to create a linked list, calculate its length, and benchmark the performance of the length calculation. Our length computation is almost entirely bounded (limited) by the memory latency, the time it takes to access the memory. The computations that we are doing (comparisons, increments) are unimportant to the performance. To illustrate our observation, we can try traversing two linked lists simultaneously, as in this example:

package main

import (
    "fmt"
    "testing"
)

type Node struct {
    data int
    next *Node
}

func build(volume int) *Node {
    var head *Node
    for i := 0; i < volume; i++ {
        head = &Node{i, head}
    }
    return head
}

var list1 *Node
var list2 *Node

var N int

func BenchmarkLen(b *testing.B) {
    for n := 0; n < b.N; n++ {
        len := 0
        for p1, p2 := list1, list2; p1 != nil && p2 != nil; p1, p2 = p1.next, p2.next {
            len++
        }
        if len != N {
            b.Fatalf("invalid length: %d", len)
        }
    }
}

func main() {
    N = 1000000
    list1 = build(N)
    list2 = build(N)

    res := testing.Benchmark(BenchmarkLen)
    fmt.Println("milliseconds: ", float64(res.NsPerOp())/1e6)

    fmt.Println("nanoseconds per el.", float64(res.NsPerOp())/float64(N))
}

If you run this new code, you might find that the benchmark results are close to the single-list ones. It is not surprising: the processor is mostly just waiting for the next node, and waiting for two nodes is not much more expensive. For this reason, when programming, you should limit memory accesses as much as possible. Use simple arrays when you can instead of linked lists or node-based tree structures.

We would like to work with arbitrarily large data structures, so that we can stress memory accesses outside of the cache. Sattolo’s algorithm is a variant of the well-known random shuffle that generates a random cyclic permutation of an array or list. Sattolo’s algorithm ensures that the data is permuted using a single cycle. That is, starting with one element in a list of size n, we find that this element is moved to another position, which is itself moved to another position, and so forth, until, after n moves, we end up back at our initial position. To apply Sattolo’s algorithm, given an array or list of elements, we go through the indexes i from 0 to n-2, where n is the length of the array. For each index i, we choose a random index j such that i < j < n, and we swap the elements at indices i and j. For example, suppose we have the array [0, 1, 2, 3, 4]. The algorithm might produce the cyclic permutation [2, 3, 4, 0, 1]: starting at index 0, we read 2, jump to index 2, read 4, jump to index 4, read 1, jump to index 1, read 3, jump to index 3, and finally read 0, which brings us back to the start.

With this algorithm, we can visit all values in an array exactly once in random order. From an array containing the indexes 0 to n-1 permuted with Sattolo’s algorithm, we first load the first element, read its value, move to the corresponding index, and so forth. After n operations, we come back to the initial position. Because each operation involves a memory load, it is limited by memory latency. We can try to go faster with memory-level parallelism: we can pick k positions spread out in the cycle and move from these k initial positions n/k times through the cycle. Because computers can load many values in parallel, this algorithm should be faster for larger values of k. However, as k increases, we may see fewer and fewer gains because systems have limited memory-level parallelism and bandwidth. The following program implements this idea.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// makeCycle creates a cycle of a specified length starting at element 0
func makeCycle(length int) ([]uint64, []uint64) {
    array := make([]uint64, length)
    index := make([]uint64, length)
    // Create a cycle of maximum length within the big array
    for i := 0; i < length; i++ {
        array[i] = uint64(i)
    }

    // Sattolo shuffle
    for i := 0; i+1 < length; i++ {
        swapIdx := rand.Intn(length-i-1) + i + 1
        array[i], array[swapIdx] = array[swapIdx], array[i]
    }

    // Walk the cycle starting at position 0 and record the positions
    // visited, so that we can later pick starting points within the cycle.
    total := 0
    cur := array[0]
    for cur != 0 {
        index[total] = cur
        total++
        cur = array[cur]
    }
    return array, index
}

// setupPointers sets up pointers based on the given index
func setupPointers(index []uint64, length, mlp int) []uint64 {
    sp := make([]uint64, mlp)
    sp[0] = 0

    totalInc := 0
    for m := 1; m < mlp; m++ {
        totalInc += length / mlp
        sp[m] = index[totalInc]
    }
    return sp
}

func runBench(array []uint64, index []uint64, mlp int) time.Duration {
    length := len(array)
    sp := setupPointers(index, length, mlp)
    hits := length / mlp
    before := time.Now()
    for i := 0; i < hits; i++ {
        for m := 0; m < mlp; m++ {
            sp[m] = array[sp[m]]
        }
    }
    after := time.Now()
    return after.Sub(before)
}

func main() {
    const length = 100000000
    array, index := makeCycle(length)
    fmt.Println("Length:", length*8/1024/1024, "MB")
    base := runBench(array, index, 1)
    fmt.Println("Lanes:", 1, "Time:", base)

    for mlp := 2; mlp <= 40; mlp++ {
        t := runBench(array, index, mlp)
        fmt.Println("Lanes:", mlp, "Speedup:", fmt.Sprintf("%.1f", float64(base)/float64(t)))
    }
}

The function makeCycle creates a cycle of a specified length starting at element 0. It initializes two slices, array and index, both of type []uint64. The array slice represents the elements in the cycle. The index slice stores the positions visited along the cycle, so that we can more easily access a given position in the cycle. The function initializes array with values from 0 to length-1, applies Sattolo’s shuffle algorithm to array to create a random cyclic permutation, records the visited positions in index, and returns both array and index.

The function setupPointers calculates the increment value (totalInc) based on the length and the number of lanes (mlp), and assigns the corresponding entries of index to sp. This gives us mlp starting positions spread evenly along the cycle.

The function runBench benchmarks the execution time for a given number of lanes (mlp). It initializes a slice sp using setupPointers, then repeatedly updates each pointer in sp by following the indices in array. It measures the execution time and returns it as a time.Duration instance. The main function first computes the running time for 1 lane, and then it reports the gains when using multiple lanes.

Overall, this code generates a cycle of a specified length, sets up pointers, and benchmarks the execution time for different numbers of lanes. The primary purpose is to explore memory-level parallelism: because the mlp pointer chases are independent of one another, the processor can issue several memory requests at once. The speedup is calculated by comparing the execution time for different numbers of lanes; the larger the speedup, the more memory-level parallelism the system can exploit. The general principle is that you can often improve the performance of a system that faces high latencies by breaking the data dependencies. Instead of putting all your data in one long chain, try to have no chain at all or, if you must have chains, use several smaller chains.

Superscalarity and data dependency

Most current processors are superscalar (as opposed to ‘scalar’), meaning that they can execute and retire several instructions per CPU cycle. That is, even if you have a single CPU core, there is much parallelism involved. Some processors can retire 8 instructions per cycle or more. Not all code routines benefit equally from superscalar execution. Several factors can limit your processor to just a few instructions per cycle. Having to wait on memory accesses is one such factor. Another common factor is data dependency: when the next instruction depends on the result of a preceding instruction, it may have to wait before it starts executing. To illustrate, consider functions that compute successive differences between elements of an array (e.g., given 5, 7, 6, you might get the initial value 5 followed by 2 and -1), and the reverse operation which sums up all the differences to recover the original values. You may implement these functions as such:

func successiveDifferences(arr []int) {
    base := arr[0]
    for i := 1; i < len(arr); i++ {
        base, arr[i] = arr[i], arr[i]-base
    }
}

func prefixSum(arr []int) {
    for i := 1; i < len(arr); i++ {
        arr[i] = arr[i] + arr[i-1]
    }
}

Assuming that the compiler does not optimize these functions in a non-trivial manner (e.g., using SIMD instructions), we can reason relatively simply about the performance. For the successive differences, we need approximately one subtraction per element in the array. For the prefix sum, we need approximately one addition per element in the array. It looks quite similar at a glance. However, the data dependencies differ. To compute the difference between any two values in the array, we do not need to have computed the preceding differences. However, the prefix sum, as we implemented it, requires us to have computed all preceding sums before the next one can be computed. Let us write a small benchmarking program to test the performance difference:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func successiveDifferences(arr []int) {
    base := arr[0]
    for i := 1; i < len(arr); i++ {
        base, arr[i] = arr[i], arr[i]-base
    }
}

func prefixSum(arr []int) {
    for i := 1; i < len(arr); i++ {
        arr[i] = arr[i] + arr[i-1]
    }
}

var array []int

func BenchmarkPrefixSum(b *testing.B) {
    for n := 0; n < b.N; n++ {
        prefixSum(array)
    }
}

func BenchmarkSuccessiveDifferences(b *testing.B) {
    for n := 0; n < b.N; n++ {
        successiveDifferences(array)
    }
}

func main() {
    array = make([]int, 100)
    for i := range array {
        array[i] = rand.Int()
    }
    res2 := testing.Benchmark(BenchmarkSuccessiveDifferences)
    fmt.Println("BenchmarkSuccessiveDifferences", res2)
    res1 := testing.Benchmark(BenchmarkPrefixSum)
    fmt.Println("BenchmarkPrefixSum", res1)

}

Your result will vary depending on your system. However, you should not be surprised if the prefix sum takes more time. On an Apple system, we got the following results:

BenchmarkSuccessiveDifferences 39742334         30.04 ns/op
BenchmarkPrefixSum  8307944            142.8 ns/op

The prefix sum can be several times slower, even though it appears at a glance that it should use a comparable number of instructions. In general, you cannot trust a hasty analysis. Just because two functions appear to do a similar amount of work, does not mean that they have the same performance. Several factors must be taken into account, including data dependencies.

Branch prediction

In part because processors are superscalar, they have been designed to execute speculatively: when facing a branch, the processor tries to guess the direction that will be taken, and it begins the computation optimistically. When the processor makes the correct prediction, it usually improves the performance, sometimes by a large amount. However, when the processor is unable to predict the branch accurately, branch prediction may become a net negative. Indeed, when the branch is mispredicted, the processor may have to restart the computation from the point where it made the wrong prediction, an expensive process that can waste several CPU cycles. To illustrate, let us first consider a function that copies the content of a slice into another slice of the same size:

func Copy(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] = v
    }
}

A more sophisticated function may copy only the odd elements:

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

We may try to copy an array that contains random integers (both odd and even), only odd integers, or only even integers. The following program illustrates:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func Copy(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] = v
    }
}

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

var array []uint
var dest []uint

func BenchmarkCopyOdd(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOdd(dest, array)
    }
}

func BenchmarkCopy(b *testing.B) {
    for n := 0; n < b.N; n++ {
        Copy(dest, array)
    }
}

func main() {
    array = make([]uint, 10000)
    dest = make([]uint, len(array))

    for i := range array {
        array[i] = uint(rand.Uint32())
    }
    res0 := testing.Benchmark(BenchmarkCopy)
    fmt.Println("BenchmarkCopy (random)", res0)
    res1 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (random)", res1)
    for i := range array {
        array[i] = uint(rand.Uint32()) | 1
    }
    res2 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (odd data)", res2)
    for i := range array {
        array[i] = uint(rand.Uint32()) &^ 1
    }
    res3 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (even data)", res3)
}

On an Apple system, we got the following results:

BenchmarkCopy (random)   414158       2936 ns/op
BenchmarkCopyOdd (random)    55408           19518 ns/op
BenchmarkCopyOdd (odd data)   402670          2975 ns/op
BenchmarkCopyOdd (even data)   402738         2896 ns/op

The last three timings involve the same function; only the input data differs. We find that all timings are similar in this case, except for the benchmark where CopyOdd receives random data: it is several times slower in our tests. The much longer running time is due to the presence of an unpredictable branch in our inner loop. Observe that the same function, subject to the same volume of data, can have vastly different performance characteristics, even though the computational complexity of the function does not change: in all instances, we have linear time complexity.

If we expect our data to lead to poor branch prediction, we may reduce the number of branches in the code. The resulting code might be nearly branch free or branchless. For example, we can use an arithmetic and logical expression to replace a conditional copy:

func CopyOddBranchless(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] ^= uint(-(v & 1)) & (v ^ dest[i])
    }
}

Let us review the complicated expression:

  • v & 1: This operation checks if the least significant bit of v is set (i.e., if v is odd).
  • -(v & 1): This negates the result of the previous operation. If v is odd, this becomes -1; otherwise, it becomes 0. However, -1 as an unsigned integer becomes the maximal value, the one with all of its bits set to 1.
  • v ^ dest[i]: This XORs the value of v with the corresponding element in the dest slice.
  • uint(-(v & 1)) & (v ^ dest[i]): If v is odd, it returns the XOR of v with dest[i]; otherwise, it returns 0.
  • Finally, dest[i] ^= uint(-(v & 1)) & (v ^ dest[i]) leaves dest[i] unchanged if v is even; otherwise, it replaces dest[i] with v, using the fact that dest[i] ^ (v ^ dest[i]) == v.
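
To convince ourselves that the expression behaves like the conditional copy, we can check it on a few values; this quick sanity check is not part of the original benchmark:

package main

import "fmt"

func main() {
    // Compare the branchy conditional copy with the branchless update
    // for a few odd and even values of v.
    for _, v := range []uint{2, 3, 10, 11} {
        dest := uint(100)
        branchy := dest
        if v&1 == 1 {
            branchy = v
        }
        // Branchless update: leaves dest unchanged when v is even,
        // replaces it with v when v is odd.
        dest ^= uint(-(v & 1)) & (v ^ dest)
        fmt.Printf("v=%2d branchy=%3d branchless=%3d\n", v, branchy, dest)
    }
}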

We can put this function to good use in a benchmark:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

func CopyOddBranchless(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] ^= uint(-(v & 1)) & (v ^ dest[i])
    }
}

var array []uint
var dest []uint

func BenchmarkCopyOdd(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOdd(dest, array)
    }
}

func BenchmarkCopyOddBranchless(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOddBranchless(dest, array)
    }
}
func main() {
    array = make([]uint, 10000)
    dest = make([]uint, len(array))
    for i := range array {
        array[i] = uint(rand.Uint32())
    }
    res1 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (random)", res1)
    res2 := testing.Benchmark(BenchmarkCopyOddBranchless)
    fmt.Println("BenchmarkCopyOddBranchless (random)", res2)
}

On an Apple system, we got:

BenchmarkCopyOdd (random)    60782           19254 ns/op
BenchmarkCopyOddBranchless (random)   166863          7124 ns/op

In this test, the branchless approach is much faster. We should stress that branchless code is not always faster. In fact, we observe in our overall test results that the branchless function is significantly slower than the original when the branches are predictable (e.g., 2896 ns/op vs 7124 ns/op). In actual software, you should try to recognize where you have poorly predicted branches and, in those cases, check whether a branchless approach might be faster. Thankfully, in practice, most branches are well predicted in most projects.

How to read files quickly in JavaScript

Suppose you need to read several files on a server using JavaScript. There are many ways to read files in JavaScript with a runtime like Node.js. Which one is best? Let us consider the various approaches.

Using fs.promises

const fs = require('fs/promises');
const readFile = fs.readFile;
readFile("lipsum.txt", { encoding: 'utf-8' })
.then((data) => {...})
.catch((err) => {...})

Using fs.readFile and util.promisify

const fs = require('fs');
const util = require('util');
const readFile = util.promisify(fs.readFile);
readFile("lipsum.txt", { encoding: 'utf-8' })
.then((data) => {...})
.catch((err) => {...})

Using fs.readFileSync

const fs = require('fs');
const readFileSync = fs.readFileSync;
var data = readFileSync("lipsum.txt", { encoding: 'utf-8' })

Using await fs.readFileSync

const fs = require('fs');
const readFileSync = fs.readFileSync;
async function f(name, options) {
  return readFileSync(name, options);
}

Using fs.readFile

const fs = require('fs');
const readFile = fs.readFile;
fs.readFile('lipsum.txt', function read(err, data) {...});

Benchmark

I wrote a small benchmark where I repeatedly read a file from disk. It is a simple loop where the same file is accessed each time. I report the number of milliseconds needed to read the file 50,000 times. The file is relatively small (slightly over a kilobyte). I use a large server with dozens of Ice Lake Intel cores and plenty of memory. I use Node.js 20.1 and Bun 1.0.14. Bun is a competing JavaScript runtime.

I ran the benchmarks multiple times, and I report the best results in all cases. Your results will differ.

                                  time (Node.js)   time (Bun)
fs.promises                       2400 ms          110 ms
fs.readFile and util.promisify    1500 ms          180 ms
fs.readFileSync                   140 ms           140 ms
await fs.readFileSync             220 ms           180 ms
fs.readFile                       760 ms           90 ms

At least on my system, in this test, the fs.promises approach is significantly more expensive than anything else when using Node.js. The Bun runtime is much faster than Node.js in this test.

The results are worse than they appear for fs.promises in the following sense. I find that readFileSync uses 300 ms of CPU time whereas fs.promises uses 7 s of CPU time. That is because fs.promises triggers work on several cores during the benchmark.

Increasing the file size to, say, 32 kB does not change the conclusion. If you go to significantly larger files, many of the Node.js cases fail with “heap limit Allocation failed”. Bun keeps going even with large files. The test results do not change the conclusion with Bun: fs.readFile is consistently faster in my tests, even for large files.

Credit. My benchmark is inspired by a test case provided by Evgenii Stulnikov.

How many political parties rule Canada? Fun with statistics

Canada has several political parties with elected members of parliament: the Liberals, the Conservatives, the Bloc Québécois, the NDP, and the Greens. But do they behave as distinct political parties when voting, or are they somehow aligned?

Voting data for the members of parliament in Canada is easily accessible as JSON or XML. Thus I wrote a little Python script to compute, for each vote, what percentage of each party voted yea. I use the latest 394 votes. It turns out that, overwhelmingly, the percentage is either 0% or 100%. So individual members of parliament are not relevant; only caucuses matter.

We can first examine Pearson’s correlation between how the different parties vote:

Conserv. Liberal NDP Bloc Québécois Green Party
Conserv. 1 -0.5 -0.5 -0.1 -0.2
Liberal 1 0.8 0.4 0.5
NDP 1 0.4 0.6
Bloc Québécois 1 0.5
Green Party 1
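
For illustration, here is a minimal C++ sketch of how such a pairwise Pearson correlation can be computed from the per-vote ‘percentage yea’ series of two parties. It is only a sketch: my actual analysis was done in Python, and the loading of the voting data is left out.

#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equal-length series, e.g., the per-vote
// "percentage yea" values of two parties.
double pearson(const std::vector<double> &x, const std::vector<double> &y) {
  size_t n = x.size();
  double mean_x = 0, mean_y = 0;
  for (size_t i = 0; i < n; i++) { mean_x += x[i]; mean_y += y[i]; }
  mean_x /= n;
  mean_y /= n;
  double cov = 0, var_x = 0, var_y = 0;
  for (size_t i = 0; i < n; i++) {
    cov += (x[i] - mean_x) * (y[i] - mean_y);
    var_x += (x[i] - mean_x) * (x[i] - mean_x);
    var_y += (y[i] - mean_y) * (y[i] - mean_y);
  }
  return cov / std::sqrt(var_x * var_y);
}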

We observe that there is excellent correlation between the ruling party (Liberal) and the NDP (0.8), and to a lesser extent with the Bloc Québécois (0.4) and the Greens (0.5). The Conservatives are anti-correlated with everyone else, although they are less anti-correlated with the Bloc Québécois and the Greens than with the other parties (Liberal and NDP).

Though there are hundreds of votes, you can capture 85% of the variance by using only two dimensions with a principal component analysis. In effect, you create two fictional voting events (weighted combinations of the actual votes) that most accurately represent the stances of the various parties.

The result demonstrates that four of the Canadian political parties are clustered, meaning that they vote similarly, while one party (the Conservatives) is clearly distinct in its voting patterns.

My source code is available. It is made of two simple Python files that you can run yourself. I encourage you to run your own analysis. My work can be extended to include more data.

Book review: Theft of Fire by Devon Eriksen

When I was young, science fiction was the genre of choice for many engineers and scientists. But the genre has declined significantly in recent years. Part of the problem is the rise of dystopian fiction. In the imagined future, we are no longer conquering space or developing fantastic new technologies; rather, we are increasingly battling the consequences of climate change or of some evil corporation. In some respect, science fiction is always about the present, and the present has become unimaginative, fearful and generally anxious.

To illustrate, let me quote a recent piece in the Atlantic:

The United States is experiencing an extreme teenage mental-health crisis. From 2009 to 2021, the share of American high-school students who say they feel “persistent feelings of sadness or hopelessness” rose from 26 percent to 44 percent, according to a new CDC study. This is the highest level of teenage sadness ever recorded.

Civilizations are not eternal: they have a life cycle. When young people grow increasingly depressed, when we are more eager to take down than to build up, we are facing a decline. On a per capita basis, most countries in the West are stagnating. Economists like Cowen and Gordon blame, in part, the relatively modest benefits brought by the Internet and by advances in computers. But I am sure Roman intellectuals could have blamed the decline of their empire on the fact that late-stage innovations like the onager were not sufficiently impactful.

The covid era is a perfect illustration. Entire societies turned inward, locked down, shut down their industries and their entertainment. We did get an echo of science fiction, but of the dystopian kind. In communist China, we had the so-called big whites, a special police force wearing hazmat suits to enforce lockdowns. Worldwide, politicians took up the habit of wearing black masks. The Empire from Star Wars became a tangible reality, but without much in the way of rebels. No Luke, no Leia came to be celebrated.

But culture is not a mere reflection of an era. In some sense, culture defines the era; it is a beast that is larger than any one of us. As much as you would like to think that politicians are in charge, they are strictly limited by their culture. Václav Havel made this point in The Power of the Powerless (1985): the ruler of a communist country has no more power than a grocer to oppose the prevailing culture, and he may even have less.

So how do we get back to a future-oriented culture? One where the challenges are met with courage. One where we fight our way through instead of turning inward and regressing?

Culture is all of us. Culture is what we do. It is what we talk about. And it is what we read.

Few books have the power to inspire readers quite like Devon Eriksen’s Theft of Fire. This work of science fiction takes us on a journey through the frozen edge of the solar system, where a hidden treasure lies waiting to be discovered. It is a tale of adventure, intrigue, and the indomitable spirit of humanity.

It is an imperfect, even somewhat sad, future. There are poor people; there are oppressed people. But it is a world where you can hope. It is very much a future in which Elon Musk's futuristic and optimistic vision came true. Yes, Elon Musk himself shaped this universe. Eriksen does not tell us that if ‘the man’ (Elon Musk) has his way, we will live forever in harmony. Quite the opposite. And if you are intrigued by ChatGPT and Google Gemini, Eriksen has you covered as well.

Here is a quote to illustrate Eriksen’s style:

In space, everyone can hear you scream. Your vacsuit monitors your heartbeat, your blood pressure. It knows when you are injured. And it knows when to cry for help on the 100 MHz emergency band. And everyone is listening. Every suit radio. Every spacecraft. Every robot probe.

Theft of Fire offers a plausible and realistic universe. The characters, Marcus Warnoc and Miranda Foxgrove, are not mere archetypes; they are complex, flawed, and deeply human. Their struggle to trust one another and overcome their own demons is a powerful allegory for the modern human condition. There are class wars, between the very rich and the very poor, but it is not a Marxist story. Marcus might be poor and struggling, but he is never a victim. The rich people can be bad, but so can the poor people.
The world of Theft of Fire is one of contrasts: the cold, unforgiving vacuum of space and the warmth of human connection. It is a testament to the power of storytelling and the enduring appeal of the science fiction genre.

It may be telling to look at who published Eriksen's book: it is self-published. As far as I can tell, one of Eriksen's wives is in charge of marketing. She is the one who reached out to me and suggested this review. Maybe that is how we change our destiny: from the bottom up. Just like in the novel.

More information: Eriksen’s home page.

Measuring energy usage: regular code vs. SIMD code

Modern processors have fancy instructions that can do many operations at once using wide registers: SIMD instructions. Intel and AMD have 512-bit registers and associated instructions under AVX-512.

You expect these instructions to use more power, that is, more energy per unit of time. However, they get the job done faster. Do you save energy overall? You should expect so.

Let us consider an example. I can just sum all values in a large array.

float sum(float *data, size_t N) {
  double counter = 0;
  for (size_t i = 0; i < N; i++) {
    counter += data[i];
  }
  return counter;
}

If I leave it as is, the compiler might be tempted to optimize too much, but I can instruct it to avoid ‘autovectorization’: it will not do anything fancy.
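
How you disable autovectorization depends on your compiler. The following sketch shows the kind of annotations that GCC and LLVM (Clang) accept; passing a flag such as -fno-tree-vectorize to GCC for the relevant file is another option. It is not necessarily the exact mechanism used in my benchmark.

#include <cstddef>

// Sketch: keep the compiler from autovectorizing this particular function.
#if defined(__clang__)
float sum_scalar(const float *data, size_t N) {
  double counter = 0;
#pragma clang loop vectorize(disable)
  for (size_t i = 0; i < N; i++) {
    counter += data[i];
  }
  return (float)counter;
}
#elif defined(__GNUC__)
__attribute__((optimize("no-tree-vectorize"))) // equivalent to -fno-tree-vectorize
float sum_scalar(const float *data, size_t N) {
  double counter = 0;
  for (size_t i = 0; i < N; i++) {
    counter += data[i];
  }
  return (float)counter;
}
#endif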

I can write the equivalent function using AVX-512 intrinsic functions. The details do not matter too much, just trust me that it is expected to be faster for sufficiently long inputs.

float sum(float *data, size_t N) {
  __m512d counter = _mm512_setzero_pd();
  for (size_t i = 0; i < N; i += 16) {
    __m512 v = _mm512_loadu_ps((__m512 *)&data[i]);
    __m512d part1 = _mm512_cvtps_pd(_mm512_extractf32x8_ps(v, 0));
    __m512d part2 = _mm512_cvtps_pd(_mm512_extractf32x8_ps(v, 1));
    counter = _mm512_add_pd(counter, part1);
    counter = _mm512_add_pd(counter, part2);
  }
  double sum = _mm512_reduce_add_pd(counter);
  for (size_t i = N / 16 * 16; i < N; i++) {
    sum += data[i];
  }
  return sum;
}

Under Linux, we can ask the kernel about power usage. Your software can query the power usage of different components, but I query the overall power usage (per socket). It works well with Intel processors as long as you have privileged access on the system. I wrote a little benchmark that runs both functions.
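
For reference, one way to query the energy counters under Linux is through the powercap (RAPL) interface in sysfs. The sketch below reads the cumulative package energy for socket 0, expressed in microjoules; it typically requires privileged access, and it is not necessarily the exact mechanism used in my benchmark.

#include <cstdint>
#include <fstream>
#include <iostream>

// Read the cumulative energy counter (microjoules) for CPU package 0.
// The counter wraps around periodically, so long measurements need care.
uint64_t read_package_energy_uj() {
  std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
  uint64_t energy = 0;
  f >> energy;
  return energy;
}

int main() {
  uint64_t before = read_package_energy_uj();
  // ... run the code you want to measure ...
  uint64_t after = read_package_energy_uj();
  std::cout << "energy used: " << (after - before) << " microjoules" << std::endl;
}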

On a 32-core Ice Lake processor, my results are as follows with a float array that spans about 500 megabytes.

code routine Power (muJ/s) Energy per value (muJ/value)
naive code 0.055 muJ/s 0.11 muJ/value
AVX-512 0.061 muJ/s 0.032 muJ/value

So the AVX-512 uses 3.5 times less energy overall, despite consuming 10% more energy per unit of time.

My benchmark also reports the energy usage of the memory subsystem. In this particular case, the energy usage attributed to memory (DRAM) is low and even negligible, according to the kernel.

My benchmark is naive and should only serve as an illustration. The general principle holds, however: if your tasks complete much faster, you are likely to use less energy overall, even if you draw more power (more energy per unit of time).

JSON Parsing: Intel Sapphire Rapids versus AMD Zen 4

Intel has released a new generation of server processors (Sapphire Rapids) while the latest AMD technology (Zen 4) is now broadly available. There are extensive comparisons available. Of particular interest are the open benchmark results, which assess various aspects of processor speed, including JSON parsing performance. In these benchmarks, AMD systems appear to dominate.

I decided to run my own benchmarks using JSON parsing as a reference and commonly available Amazon big nodes. For these tests, I use Amazon Linux 2023 with GCC 11. I use two instances that cost about 5 dollars per hour. Amazon charges me about the same amount for both the AMD and Intel systems.

The AMD instance is of type c7a.24xlarge with an AMD EPYC 9R14 processor (Zen 4 microarchitecture). The Intel instance is of type c7i.metal-24xl with an Intel Xeon Platinum 8488C (Sapphire Rapids microarchitecture). I use systems with many cores, but my benchmark is entirely single-threaded. I could have optimized either system by choosing instances with fewer cores running hotter. In my case, both processors run in practice at a comparable frequency, with a slight advantage for AMD (3.5 GHz vs. 3.4 GHz).

The gist of the result is that neither system dominates the other one. In some benchmarks, Intel wins, in others AMD wins. It is very closely matched.

Intel results:

benchmark simdjson On-Demand simdjson DOM yyjson rapidjson nlohmann/json Boost JSON
json2msgpack 3.68 GB/s 2.67 GB/s 1.72 GB/s 0.71 GB/s 0.03 GB/s 0.42 GB/s
partial_tweets 6.83 GB/s 4.77 GB/s 2.41 GB/s 0.77 GB/s 0.13 GB/s 0.50 GB/s
distinct_user_id 6.99 GB/s 4.90 GB/s 2.52 GB/s 0.67 GB/s 0.14 GB/s 0.49 GB/s
kostya 2.92 GB/s 2.03 GB/s 0.83 GB/s 0.80 GB/s 0.12 GB/s 0.47 GB/s

AMD results:

benchmark simdjson On-Demand simdjson DOM yyjson rapidjson nlohmann/json Boost JSON
json2msgpack 3.09 GB/s 2.45 GB/s 1.93 GB/s 0.68 GB/s 0.03 GB/s 0.38 GB/s
partial_tweets 6.84 GB/s 4.22 GB/s 2.64 GB/s 0.77 GB/s 0.12 GB/s 0.46 GB/s
distinct_user_id 6.94 GB/s 4.26 GB/s 2.58 GB/s 0.77 GB/s 0.13 GB/s 0.47 GB/s
kostya 4.03 GB/s 2.71 GB/s 1.00 GB/s 0.78 GB/s 0.12 GB/s 0.52 GB/s

You can reproduce my results by grabbing simdjson and running bench_ondemand.

I do not pretend that this single data point is sufficient to make purchasing decisions or to assess the Intel and AMD technology. Take it as a data point.

Further reading. On-demand JSON: A better way to parse documents?, Software: Practice and Experience (to appear)

How fast is rolling Karp-Rabin hashing?

A hash function maps values (e.g., strings) to fixed-size values (e.g., 32-bit integers), typically much smaller than the original. It is useful, for example, to quickly compare two long strings: instead of comparing the strings, you may first compare the hash values.

A simple hash function consists in repeatedly multiplying the hash value by some constant B (e.g., you may pick a number such as 31):

uint32_t hash = 0;
for (size_t i = 0; i < len; i++) {
  hash = hash * B + data[i];
}
return hash;

I credit Karp and Rabin for this type of hash function. It is an example of recursive hashing: the hash is computed by taking the previous hash value and updating it with the next character.

Given a long string, you may want to hash all of its sequences (windows) of N characters. A naive approach might be as follows:

for(size_t i = 0; i + N <= len; i++) {
  uint32_t hash = 0;
  for(size_t j = 0; j < N; j++) {
    hash = hash * B + data[i+j];
  }
  //...
}

You are visiting each character value N times. If N is large, it is inefficient.

You can do better using a rolling hash function: instead of recomputing the hash function anew each time, you just update it. It is possible to only access each character twice:

uint32_t BtoN = 1;
for(size_t i = 0; i < N; i++) { BtoN *= B; }

uint32_t hash = 0;
for(size_t i = 0; i < N; i++) {
  hash = hash * B + data[i];
}
// ...
for(size_t i = N; i < len; i++) {
  hash = hash * B + data[i] - BtoN * data[i-N];
  // ...
}

The most expensive component of this routine is the line with two character accesses and two multiplications.
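
Putting the fragments together, a self-contained (unoptimized) C++ routine might look as follows: it returns the hash value of every window of N characters.

#include <cstddef>
#include <cstdint>
#include <vector>

// Rolling Karp-Rabin: hash every window of N characters, touching each
// character at most twice. B is the multiplier (e.g., 31).
std::vector<uint32_t> rolling_hashes(const uint8_t *data, size_t len,
                                     size_t N, uint32_t B) {
  std::vector<uint32_t> hashes;
  if (len < N) { return hashes; }
  uint32_t BtoN = 1;
  for (size_t i = 0; i < N; i++) { BtoN *= B; }
  uint32_t hash = 0;
  for (size_t i = 0; i < N; i++) { hash = hash * B + data[i]; }
  hashes.push_back(hash);
  for (size_t i = N; i < len; i++) {
    hash = hash * B + data[i] - BtoN * data[i - N];
    hashes.push_back(hash);
  }
  return hashes;
}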

I wrote a simple benchmark in C++ to see how fast a straightforward implementation could be. I use a fixed window of size 8 for the purpose of this analysis; the larger the window, the less competitive the naive implementation becomes.

rolling hashing 0.75 GB/s 13 instructions/byte
naive hashing 0.18 GB/s 61 instructions/byte

Karp-Rabin is not the only type of rolling hash function. Tabulated hashing is another approach that would be much more compute efficient.

However, I suspect we could greatly improve my naive implementation of the Karp-Rabin rolling hash function.

Further reading:

Possibly relevant software:

C23: a slightly better C

One of the established and most popular programming languages is the C programming language. It is relatively easy to learn, and highly practical.

Maybe surprisingly, the C programming language keeps evolving, slowly and carefully. If you have GCC 13 or LLVM (Clang) 16, you already have a compiler with some of the features from the latest standard (C23).

// Only include stdio.h if it exists
#if __has_include (<stdio.h>)
  #include <stdio.h>
#endif

#include <stdlib.h>

[[deprecated]]
void f() {}

[[nodiscard]]
int g(int x) {
  return x + 1;
}

int main() {
  f(); // compile-time warning: 'f' is deprecated
  g(1); // compile-time warning
  auto x = 0b1111;
  typeof(x) y = 1'000'000; // type of y is the same as x
  printf("%d\n", x); // prints 15
  printf("%d\n", y); // prints 1000000
  constexpr int N = 10;
  // compile-time asserts using static_assert
  static_assert (N == 10, "N must be 10");
  bool a[N]; // array of N booleans
  for (int i = 0; i < N; ++i) {
    a[i] = true;
  }
  printf("%d\n", a[0]); // prints 1
}

  1. The first part of the code contains some preprocessor directives, which are instructions for the compiler to process the source code before compiling it. The #if directive checks a condition at compile time and includes the following code only if the condition is true. The __has_include macro is a feature of C++17 adopted by C23 that checks if a header file exists and can be included. In this instance, it is not useful because we know that stdio.h is present, but in other instances, this can prove useful to determine what headers are available.
  2. The next part of the code defines two functions with attributes, which are annotations that provide additional information to the compiler about the behavior or usage of a function, variable, type, etc.
    • The [[deprecated]] attribute is a feature of C++14 adopted by C23 that marks a function as obsolete and discourages its use. The compiler will issue a warning if the function is called or referenced.
    • The [[nodiscard]] attribute is a feature of C++17 adopted by C23 that indicates that the return value of a function should not be ignored or discarded. The compiler will issue a warning if the function is called from a discarded-value expression.

    In this case, the function f is deprecated and does nothing, while the function g returns the input value plus one and should not be ignored. The first two lines of the main function call the functions f and g and trigger the warnings.

  3. The third line of the main function declares a variable x with the auto keyword, which is a feature of C++11 that lets the compiler deduce the type of the variable from its initializer. In this case, the initializer is a binary literal, which is a feature of C++14 and adopted by C23 that allows writing integer constants in binary notation using the prefix 0b. The value of x is 0b1111, which is equivalent to 15 in decimal.
  4. The fourth line declares another variable y with the typeof operator, which returns the type of an expression. In this case, the expression is x, so the type of y is the same as the type of x. The initializer of y uses digit separators, a feature of C++14 adopted by C23 that allows inserting single quotes between digits in numeric literals to improve readability. The value of y is 1'000'000, which is equivalent to 1000000 in decimal.
  5. The seventh line declares a constant variable N with the constexpr keyword, which is a feature of C++11 adopted by C23 that indicates that the value of the variable can be evaluated at compile time. The value of N is 10. Previously, one would often use a macro to define a compile-time constant (e.g., #define N 10).
  6. The eighth line uses the static_assert keyword, which is a syntax of C++11 adopted by C23 that performs a compile-time assertion check. The keyword takes a boolean expression and an optional string message as arguments. If the expression is false, the compiler will emit an error and stop the compilation, displaying the message. If the expression is true, the compiler will do nothing. In this case, the expression is N == 10, which is true, so the compilation continues.
  7. The ninth line declares an array of N booleans named a. An array is a collection of elements of the same type that are stored in contiguous memory locations. The size of the array must be a constant expression for standard C arrays (otherwise it becomes a variable-length array which may be less efficient), which is why N is declared with constexpr. We also use the keywords true and false which become standard keywords in C23.

There are many more features in C23, but it will take some time for compilers and system libraries to catch up.

My thoughts so far:

  • The introduction of constexpr in C will probably help reduce the dependency on macros, which is generally a good idea. Macros work well in C, but when a bug is introduced, it can be difficult to get meaningful error messages. It does not happen too often, but in large code bases, it can be a problem.
  • I personally rarely use auto and typeof in other languages, so I don't expect to use them very much in C. In some specific cases, they can greatly simplify one's code, however. They are also likely to help reduce the reliance on macros.
  • The idea behind static_assert is great. You run a check that has no impact on the performance of the software, and may even help it. It is cheap and it can catch nasty bugs. It is not new to C, but adopting the C++ syntax is a good idea.
  • The __has_include feature can simplify supporting diverse standard libraries and testing for available headers. For example, it becomes easy to check whether the headers for AVX-512 intrinsics are available. If a header is missing, you can fail the compilation with instructions (e.g., you need to install library X); see the sketch after this list. It is generally a good idea for people who need to write portable code that others can rely upon.
  • I did not include the introduction of `char8_t` in the language. I worked extensively with Unicode in C++ and I have not found good use cases for the `char8_t` type so far: `char` is always sufficient in my experience.
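
As an illustration of the point about failing the compilation with instructions, the following sketch refuses to build when a header is missing; the header name and message are made up for the example.

#if !__has_include(<zlib.h>)
  #error "zlib.h is missing: please install the zlib development package"
#endif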

How much memory bandwidth do large Amazon instances offer?

In my previous post, I described how you can write a C++ program to estimate your read memory bandwidth. It is not very difficult: you allocate a large memory region and you read it as fast as you can. To see how much bandwidth you may have if you use multithreaded applications, you can use multiple threads, where each thread reads a section of the large memory region.

The server I used for that blog post, a two-CPU Intel Ice Lake server, has a maximal bandwidth of about 130 GB/s. You can double this amount of bandwidth with NUMA-aware code, but it will require further engineering.

But you do not have access to my server. What about a big Amazon server? So I spun up an r6i.metal instance from Amazon. These servers can support 128 physical threads, they have 1 terabyte of RAM (1024 GB) and 6.25 GB/s of network bandwidth.

Running my benchmark program on this Amazon server revealed about 115 GB/s of read memory bandwidth. That is not counting NUMA and other sophisticated tricks. Plotting the bandwidth versus the number of threads reveals that, once again, you need about 20 threads to maximize the memory bandwidth, although you get most of it with only 15 threads.

My source code is available.

Estimating your memory bandwidth

One of the limitations of a computer is its memory bandwidth. For the scope of this article, I define “memory bandwidth” as the maximal number of bytes you can bring from memory to the CPU per unit of time. E.g., if your system has 5 GB/s of bandwidth, you can read up to 5 GB from memory in one second.

To measure this memory bandwidth, I propose to read data sequentially. E.g., you may use a function where we sum the byte values in a large array. It is not necessary to sum every byte value, you can skip some because the processor operates in units of cache lines. I do not know of a system that uses cache lines smaller than 64 bytes, so reading one value every 64 bytes ought to be enough.

uint64_t sum(const uint8_t *data,
    size_t start, size_t len, size_t skip) {
  uint64_t sum = 0;
  for (size_t i = start; i < len; i+= skip) {
    sum += data[i];
  }
  return sum;
}

It may not be good enough to maximize the bandwidth usage: your system has surely several cores. Thus we should use multiple threads. The following C++ code divides the input into consecutive segments, and assigns one thread to each segment, dividing up the task as fairly as possible:

size_t segment_length = data_volume / threads_count;
size_t cache_line = 64;
for (size_t i = 0; i < threads_count; i++) {
  threads.emplace_back(sum, data, segment_length*i,
       segment_length*(i+1), cache_line);
}
for (std::thread &t : threads) {
  t.join();
}

I ran this code on a server with two Intel Ice Lake processors. The more threads I use, the more bandwidth I get, up to around 15 threads: I start out at 15 GB/s and I go up to over 130 GB/s. Once I reach about 20 threads, it is no longer possible to get more bandwidth out of the system. The system has a total of 64 cores over two CPUs. My program does not pin threads to cores and it is not optimized for NUMA. I have transparent huge pages enabled by default on this Linux system.

My benchmark ought to make it easy for the processor to maximize bandwidth usage, so I would not expect more complicated software to hit the bandwidth limit with as few as 20 threads.

My source code is available.

This machine has two NUMA nodes. You can double the bandwidth by running the same benchmark using two NUMA nodes. E.g., under Linux you might call:

numactl --cpunodebind=1 --membind=1 ./bandwidth  & numactl --cpunodebind=0 --membind=0 ./bandwidth

Be aware that NUMA has some downsides. For example, the communication between NUMA nodes is relatively expensive.

Further reading: Many tools to measure bandwidth. I also wrote a second blog post on this theme: How much memory bandwidth do large Amazon instances offer?

Implementing the missing sign instruction in AVX-512

Intel and AMD have expanded the x64 instruction sets over time. In particular, the SIMD (single instruction, multiple data) instructions have become progressively wider and more general: from 64 bits to 128 bits (SSE2), to 256 bits (AVX/AVX2), to 512 bits (AVX-512). Interestingly, many instructions defined on 256-bit registers through AVX/AVX2 are not available on 512-bit registers.

With SSSE3, Intel introduced sign instructions, with corresponding intrinsic functions (e.g., _mm_sign_epi8). There are 8-bit, 16-bit and 32-bit versions. They were extended to 256-bit registers with AVX2.

What these instructions do is apply the sign of one parameter to the other parameter. It is most easily explained as pseudocode:

function sign(a, b): # a and b are integers
   if b == 0 : return 0
   if b < 0 : return -a 
   if b > 0 : return a

The SIMD equivalent does the same operation but with many values at once. Thus, with SSSE3 and psignb, you can generate sixteen signed 8-bit integers at once.

You can view it as a generalization of the absolute function: abs(a) = sign(a,a). The sign instructions are very fast. They are used in numerical analysis and machine learning: e.g., it is used in llama.cpp, the open source LLM project.

When Intel designed AVX-512, they decided to omit the sign instructions. So while we have the intrinsic function _mm256_sign_epi8, we don't have _mm512_sign_epi8. The same instructions are missing for 16-bit and 32-bit integers (e.g., there is no _mm512_sign_epi16).

You may implement it for AVX-512 with several instructions. I found this approach:

#include <x86intrin.h>

__m512i _mm512_sign_epi8(__m512i a, __m512i b) {
  __m512i zero = _mm512_setzero_si512();
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  __mmask64 ble0 = _mm512_cmple_epi8_mask(b, zero);
  __m512i a_blt0 = _mm512_mask_mov_epi8(zero, blt0, a);
  return _mm512_mask_sub_epi8(a, ble0, zero, a_blt0);
}

It is disappointingly expensive. It might compile to four or five instructions:

vpmovb2m k2, zmm1
vpxor xmm2, xmm2, xmm2
vpcmpb k1, zmm1, zmm2, 2
vpblendmb zmm1{k2}, zmm2, zmm0
vpsubb zmm0{k1}, zmm2, zmm1

In practice, you may not need to pay such a high price. The reason the problem is difficult is that we have three cases to handle (b = 0, b > 0, b < 0). If you do not care about the case ‘b = 0’, then you can do it in two instructions, not counting the generation of the zero register (one xor):

#include <x86intrin.h>

__m512i _mm512_sign_epi8_cheated(__m512i a, __m512i b) {
   __m512i zero = _mm512_setzero_si512();
  __mmask64 blt0 = _mm512_movepi8_mask(b);
  return _mm512_mask_sub_epi8(a, blt0, zero, a);
}

E.g., we implemented…

function sign_cheated(a, b): # a and b are integers
   if b < 0 : return -a 
   if b ≥ 0 : return a

Science and Technology links (December 30th 2023)

  1. Parenting does not appear to be able to determine the personality traits of a child.
  2. When the last ice age ended, 12,000 years ago, the Sahara was green and full of life. It turned into a desert about 5,500 years ago.
  3. Fadnes et al. claim that the UK population could live 10 years older if it changed its eating habits.
  4. By studying 175 different populations, You et al. find that meat intake predicts longevity: people who eat more meat live longer.
  5. According to an editorial in the journal Nature, scientists who work in industry are more satisfied and better paid than are colleagues in academia. Industry scientists report less bullying and discrimination.
  6. The Asch experiment examined the extent to which individuals would conform to the majority view, even when that view was clearly incorrect. The experiment involved a group of participants, one of whom was the actual subject of the experiment, while the rest knew the true purpose of the experiment and acted according to a script. The group was shown a series of images with lines of different lengths and asked to identify which two lines were the same length. The results showed that a significant number of participants conformed to the majority view, even when it was clearly wrong. The Asch experiment is important because it highlights the influence of social factors on individual beliefs. Most people just adopt the prevailing beliefs, even when they are clearly incorrect. In other words, very few people can think for themselves. They just reproduce what they are shown or what they see others doing, like mere monkeys. Unfortunately, the original experiment replicates well. We also find that even financial incentives fail to make people more critical.
  7. Weather prediction is one of the first applications of powerful computers. To this day, we rely on predictions made by specialized services: we don't generally compute our own weather predictions. Google DeepMind claims to be able to predict the weather accurately on a normal computer using artificial intelligence.

Measuring the size of the cache line empirically

Our computers do not read or write memory in units of bits or even bytes. Rather, memory is accessed in small blocks called “cache lines”. For a given system, the cache line size is usually fixed and small (e.g., 16 to 256 bytes). All Intel/AMD x64 systems I have used rely on a 64-byte cache line. My current Apple laptop with its ARM-based M2 processor relies on a 128-byte cache line.

How can you measure the size of the cache line? I asked on Twitter/X and a few people (e.g., Peter Cawley, Will Bickford) pointed to a phenomenon called false sharing. I will come back to false sharing in a future blog post; I do not want to rely on it here because it depends on parallelism, and it may be trickier than it seems. I want a simpler solution.

Many people (Robert Clausecker, Sam Westrick, Tomasz Kowalczewski, Vinoth Deivasigamani, Sergey Slotin and many others) proposed a ‘strided access’ benchmark.

I finally decided to test a strided copy: from a large array, I copy every N bytes to another large array. It is a problem that should be largely “memory access bound” as long as N is not too small. I start N at 16. Importantly, I never read my own writes, so I avoid concerns with 4K aliasing on Intel processors.

If N is larger than twice the cache line, then I can effectively skip one cache line out of two. If N is smaller than the cache line, then every cache line must be accessed. Having a stride value just above the cache line should not be sufficient to see large gains; but you expect the speed to nearly double once you reach twice the size of the cache line, if the only thing that matters is cache lines.
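
A minimal sketch of the measured kernel looks as follows; my actual benchmark adds timing, repetitions and a range of stride values.

#include <cstddef>
#include <cstdint>

// Copy one byte out of every 'stride' bytes from 'source' to 'destination'.
// Once the stride exceeds twice the cache line, whole cache lines get skipped.
void strided_copy(uint8_t *destination, const uint8_t *source,
                  size_t length, size_t stride) {
  for (size_t i = 0; i < length; i += stride) {
    destination[i] = source[i];
  }
}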

Sadly, several other factors come into play on a modern system with such a benchmark. There is more than just the cache-line size as a variable! So we need to verify the model experimentally.

I wrote the benchmark in C, but the actual C compiler is unlikely to be relevant. The original code was in Go and I got the same results, but I switched to C to make sure I avoided Go-specific issues. Interestingly, ChatGPT converted the Go code to C code automatically for  me, with just a couple of small mistakes. Mind you, it is deliberately simple code.

We are measuring the time it takes to copy one byte out of every N bytes, using a total buffer size of 32 MB. We report the numbers in GB/s, effectively dividing the buffer size by the total time. I stress that it is a strided copy, not an actual copy, so as the stride (N) grows large, the speed should increase. However, as long as the stride (N) is smaller than the cache line, we expect a flat speed.

I run each experiment, for each stride size, 10 times and I record the maximum, the minimum and the average. I use an Apple M2 processor running on my laptop and an Intel-based server. I do not require a particular memory alignment when allocating memory. I do not make any other attempt to control the results.

The numbers on the server are quite nice, with hardly any difference between the average, the maximum and the minimum. If your stride is 129, you are 66% faster than when your stride is 64. This suggests that the cache-line size is 64 bytes. The gain is not 2x as I would have expected, but the processor might be loading cache lines speculatively. Observe how a stride that is a multiple of 64 (e.g., 128 or 256) is slightly bad: we see the performance dip visibly. It might be due to a caching issue: e.g., only half the cache-line addresses are used, which makes it more difficult for the processor to make full use of its cache (due to address aliasing).

The results on my laptop are much less clean, even though I kept the laptop unused during the benchmarking. In this instance, if your stride is 257, you are more than two times faster than when your stride is 128. It suggests that the cache-line size is 128 bytes. Just like on the Intel system, a stride of 128 is unfortunate: there is a visible performance dip.

Note that we do not actually copy data at 300 GB/s, that would be impossible on its face on my hardware: but we can copy an array that fast if we just copy one byte per block of 512 bytes.

Thus you can empirically measure the size of the cache line with a strided copy of a large array… As soon as you use a stride that is twice the cache line, you should be more than 50% faster.

Fast Buffer-to-String conversion in JavaScript with a Lookup Table

When programming in a JavaScript environment such as Node.js, you might recover raw data from the network and need to convert the bytes into strings. In a system such as Node.js, you may represent such raw bytes using a Buffer instance.

You can conveniently convert a Buffer instance into a JavaScript string (mybuffer.toString()). But, maybe surprisingly, creating new strings can be a bottleneck. Thus a worthwhile optimization might be to recognize when the incoming bytes correspond to one of a list of known strings. This is not a problem unique to JavaScript.

One example of such a problem occurs when parsing HTTP headers. These headers contain common strings such as ‘accept’, ‘accept-encoding’, ‘content-encoding’, ‘content-language’, ‘content-length’, ‘from’, ‘host’, etc. If we can recognize the bytes as one of these strings, we can just point at the existing strings. To make things more complicated, we might want to ignore the case, so that both inputs ‘Accept’ and ‘ACCEPT’ should be mapped to ‘accept’.

This problem has been addressed recently in a library called undici. This library provides Node.js with an HTTP/1.1 client. GitHub user tsctx initially proposed solving this problem using a trie implemented with JavaScript objects. A trie is a type of data structure used to store and search for strings efficiently. In its simplest implementation (sometimes called a digital search trie), we first branch out on the first character, and each possible character leads to a new subtree based on the second character, and so forth. The last node of each string is marked as the end of the word, and may also store some additional information, such as a pointer to the desired output. The problem with such an approach is that it can use much memory. And, indeed, I estimated that tsctx's implementation might cost between 1 MB and 2 MB of memory. Whether that is a concern depends on your priorities. Following my comments, user tsctx opted for a new strategy that may be less time efficient, but that is significantly more economical memory-wise: a ternary tree. A ternary tree is similar to a binary tree, but each node can have up to three children, usually called left, mid, and right.

I think that tsctx's approach is excellent, but it is sometimes important to compare with a few competitors.

I decided to implement my own approach based on the observation that it is fairly easy to quickly identify a candidate using solely the length of the input. For example, there is only one candidate string of length 2: ‘TE’. So it makes sense to write code like this:

function toLowerCase(s) {
  var len = s.length;
  switch (len) {
   case 2:
    // check that the buffer is equal to 'te' (ignoring case) and return it if so
    if ((s[0] == 116 || s[0] == 84) && (s[1] == 101 || s[1] == 69)) {
     return "te";
    }
    break;
   // ... cases for the other lengths follow the same pattern ...
  }
}

This code is a function that takes a buffer of character codes as input and returns the corresponding known lowercase string when the buffer matches one. The function works as follows: It declares a variable called len and assigns it the value of the length of the input s. It uses a switch statement to check the value of len and execute different code blocks depending on the case.
In this example, the function only has one case, which is 2. In the case of 2, the function checks that the input string is equal to “te” or “TE” or “Te” or “tE”. It does this by comparing the ASCII codes of the characters in the string. The ASCII code of t is 116, the ASCII code of T is 84, the ASCII code of e is 101, and the ASCII code of E is 69. The function uses the logical operators || (or) and && (and) to combine the conditions. If the input string matches any of these four combinations, the function will return “te”. Here is an example of how the function works: If the input string is “te”, the function will return “te”. If the input string is “TE”, the function will return “te”. If the input string is “Te”, the function will return “te”. If the input string is “tE”, the function will return “te”. If the input string is “ta”, the function will continue.

If the buffer has length 3, then I have four possible candidate strings (age, ect, rtt, via). I can differentiate them by looking only at the first character. The logic is much the same:

case 3:
  switch (s[0]) {
   case 97: case 65:
    // check that the buffer is equal to age and return it if so
    if ((s[1] == 103 || s[1] == 71) && (s[2] == 101 || s[2] == 69)) {
      return "age";
    }
    break;
   case 101:case 69:
    // check that the buffer is equal to ect and return it if so
    if ((s[1] == 99 || s[1] == 67) && (s[2] == 116 || s[2] == 84)) {
     return "ect";
    }
    break;
   case 114:case 82:
    // check that the buffer is equal to rtt and return it if so
    if ((s[1] == 116 || s[1] == 84) && (s[2] == 116 || s[2] == 84)) {
      return "rtt";
    }
    break;
   case 118:case 86:
    // check that the buffer is equal to via and return it if so
    if ((s[1] == 105 || s[1] == 73) && (s[2] == 97 || s[2] == 65)) {
     return "via";
    }
    break;
   default:
    break;
   }

It is easy enough to do it by hand, but it gets tedious, so I wrote a little Python script. It is not complicated… I just repeat the same logic in a loop.

Pay attention to the fact that the switch keys are nearly contiguous integers from 2 to 35… It means that a good compiler will almost surely use a jump table and not a series of comparisons.

First let us compare the memory usage of the four approaches: the original (simple) code used by undici, the naive switch-case approach, the ternary tree and the digital search trie. I use various recent versions of Node.js on a Linux server. I wrote scripts that simply include the function, and only the function, and I print the memory usage. I repeat five times and report the lowest figure. When using Node.js, I call the garbage collector and pause to try to minimize the memory usage.

approach Node.js 21 Node.js 20 Node.js 19 Bun 1.0
original 43.3 MB 42.4 MB 44.9 MB 20.2 MB
naive switch 43.3 MB 42.9 MB 42.9 MB 23.8 MB
ternary tree 43.5 MB 44.2 MB 45.2 MB 29.3 MB
digital search trie 45.1 MB 44.6 MB 47.0 MB 26.7 MB

Thus only the digital search trie appears to add substantial memory usage with Node.js 21. If you use a different version of Node.js or a different operating system, results will differ… but I verified that the conclusion is the same on my MacBook.

What about performance? I use an Intel Ice Lake processor running at 3.2 GHz. I wrote a small benchmark that parses a few headers. I rely on a well-known JavaScript benchmark framework (mitata).

approach Node.js 21 Node.js 20 Node.js 19 Bun 1.0
original 15 µs 14 µs 15 µs 12 µs
naive switch 7.8 µs 7.9 µs 7.8 µs 8.2 µs
ternary tree 9.4 µs 9.4 µs 9.0 µs 8.8 µs
digital search trie 12 µs 12 µs 11 µs 10 µs

I am not quite certain why the digital search trie does poorly in this case. I also ran the same experiment on my 2022 MacBook (Apple M2 processor). I am usually against benchmarking on laptops, but these MacBooks tend to give very stable numbers.

approach Node.js 21 Node.js 20 Node.js 19 Bun 1.0
original 8.5 µs 9.1 µs 8.5 µs 8.2 µs
naive switch 5.0 µs 4.9 µs 4.7 µs 5.3 µs
ternary tree 5.8 µs 5.8 µs 5.6 µs 6.1 µs
digital search trie 5.3 µs 5.5 µs 5.4 µs 5.5 µs

Thus I would conclude that both the naive switch and the ternary tree are consistently faster than the original. The original implementation is about 1.8 times slower than the naive switch when using Node.js 21.

One approach I did not try is perfect hashing. I am concerned that it might be difficult to pull off because JavaScript might not compile the code efficiently. One benefit of perfect hashing is that it can be nearly branchless, so it provides consistent performance. We could combine the perfect hashing strategy with the switch-case approach: we would do just one hash computation and then go straight to a single buffer-to-target comparison. It would be like a C/C++ implementation in spirit, although it would generate more code.
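
For illustration, in C or C++ the strategy might look like the sketch below: combine the length and a character into a slot index, then confirm with a single case-insensitive comparison against the unique candidate stored in that slot. The hash function is made up and is not an actual perfect hash for the HTTP header names; the table would have to be filled offline so that each known header gets its own slot.

#include <cstddef>
#include <cstdint>

// Illustrative sketch of the perfect-hashing idea (not a real perfect hash).
struct Candidate {
  const char *name; // canonical lowercase header name
  size_t len;
};

// Filled offline so that every known header name lands in a distinct slot.
static const Candidate table[64] = {};

static bool equal_ignoring_case(const char *a, const char *b, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if ((a[i] | 0x20) != (b[i] | 0x20)) { return false; } // ASCII case folding
  }
  return true;
}

const char *lookup(const char *input, size_t len) {
  size_t slot = (len * 31 + (uint8_t)(input[0] | 0x20)) % 64; // made-up hash
  const Candidate &c = table[slot];
  if (c.name != nullptr && c.len == len && equal_ignoring_case(input, c.name, len)) {
    return c.name; // pointer to the canonical lowercase string
  }
  return nullptr;  // unknown header
}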

We rely on the fact that this function was identified as a bottleneck. We ran a microbenchmark, but it would be useful to see whether these functions make a difference in a realistic application.

My source code and benchmark is online. It can be improved, pull requests invited.

How fast can you validate UTF-8 strings in JavaScript?

When you recover textual content from the disk or from the network, you may expect it to be a Unicode string in UTF-8. It is the most common format. Unfortunately, not all sequences of bytes are valid UTF-8 and accepting invalid UTF-8 without validating it is a security risk.

How might you validate a UTF-8 string in a JavaScript runtime?

You might use the valid-8 module:

import valid8 from "valid-8";
if(!valid8(file_content)) { console.log("not UTF-8"); }

Another recommended approach is to use the fact that TextDecoder can throw an exception upon error:
new TextDecoder("utf8", { fatal: true }).decode(file_content)

Or you might use the isUtf8 function which is part of Node.js and Bun.
import { isUtf8 } from "node:buffer";
if(!isUtf8(file_content)) { console.log("not UTF-8"); }

How do they compare? Using Node.js 20 on a Linux server (Intel Ice Lake), I get the following speeds with three files representative of different languages. The Latin file is just ASCII. My benchmark is available.
approach Arabic Chinese Latin
valid-8 0.14 GB/s 0.17 GB/s 0.50 GB/s
TextDecoder 0.18 GB/s 0.19 GB/s 7 GB/s
node:buffer 17 GB/s 17 GB/s 44 GB/s

The current isUtf8 function in Node.js was implemented by Yagiz Nizipli. It uses the simdutf library underneath. John Keiser should be credited for the UTF-8 validation algorithm.

Parsing 8-bit integers quickly

Suppose that you want to quickly parse 8-bit integers (0, 1, 2, …, 254, 255) from an ASCII/UTF-8 string. The problem comes up in the simdzone project led by Jeroen Koekkoek (NLnet Labs). You are given a string and its length: e.g., ‘22’ with length 2. The naive approach in C might be:

int parse_uint8_naive(const char *str, size_t len, uint8_t *num) {
  uint32_t n = 0;
  for (size_t i = 0, r = len & 0x3; i < r; i++) {
    uint8_t d = (uint8_t)(str[i] - '0');
    if (d > 9)
     return 0;
    n = n * 10 + d;
  }
  *num = (uint8_t)n;
  return n < 256 && len && len < 4;
}

This code is a C function that takes a string of characters, its length, and a pointer to an unsigned 8-bit integer as input arguments. The function returns a Boolean value indicating whether the input string can be parsed into an unsigned 8-bit integer or not. It restricts the input to at most three characters but it allows leading zeros (e.g. 002 is 2).

The function first initializes a 32-bit unsigned integer n to zero; we will store our answer there. The function then iterates over the input string, converting each digit character to its numerical value. Each extracted digit is added to n after n has been multiplied by 10. This process continues until the end of the string or until the function has processed three characters. Finally, the function assigns the value of n to the unsigned 8-bit integer pointed to by num. It then returns a Boolean value indicating whether the parsed integer is less than 256 and the length of the input string is between 1 and 3 characters. If the input string contains any non-digit character or if the length of the string is greater than 3 bytes, the function returns false.

If the length of the input is predictable, then this naive function should be fast. However, if the length varies (between 1 and 3), then the processor will tend to mispredict branches and pay an expensive penalty.

In C++, you could call from_chars:

int parse_uint8_fromchars(const char *str, size_t len, uint8_t *num) {
  auto [p, ec] = std::from_chars(str, str + len, *num);
  return (ec == std::errc());
}

The std::from_chars function takes three arguments: a pointer to the beginning of the input character sequence, a pointer to the end of the input character sequence, and a reference to the output variable that will hold the parsed integer value. The function returns a std::from_chars_result object that contains two members: a pointer to the first character that was not parsed, and an error code that indicates whether the parsing was successful or not.

In this function, the std::from_chars function is called with the input string and its length as arguments, along with a pointer to the unsigned 8-bit integer that will hold the parsed value. The function then checks whether the error code returned by std::from_chars is equal to std::errc(), which indicates that the parsing was successful. If the parsing was successful, the function returns true. Otherwise, it returns false.

Can we do better than these functions? It is not obvious that we can. A function that reads between 1 and 3 bytes is not a function you would normally try to optimize. But still: can we do it? Can we go faster?

Suppose that you can always read 4 bytes, even if the string is shorter (i.e., there is a buffer). This is often a safe assumption. If your numbers are within a larger string, then you can often check efficiently whether you are within 4 bytes of the end of the string. Even if that is not the case, reading 4 bytes is always safe as long as you do not cross a page boundary, something you may check easily.
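
For example, with conventional 4 kB pages, the check might look as follows (a sketch; the page size is an assumption):

#include <cstdint>

// True if a 4-byte read starting at 'p' stays within the current 4 kB page,
// so it cannot touch an unmapped page even if the string itself is shorter.
bool safe_to_read_4_bytes(const char *p) {
  return (reinterpret_cast<std::uintptr_t>(p) % 4096) <= 4096 - 4;
}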

So you can load your input into a 32-bit word and process all the bytes at once, in a single register. This is often called SWAR: in computer science, SWAR means SIMD within a register, a technique for performing parallel operations on data contained in a processor register.

Jeroen Koekkoek first came up with a valid SWAR approach, but it was only about 40% faster than the naive approach in the case where we had unpredictable inputs, and potentially slower than the naive approach given predictable inputs. Let us examine an approach that might be competitive all around:

int parse_uint8_fastswar(const char *str, size_t len, 
    uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x30303030lu;
  digits.as_int <<= ((4 - len) * 8);
  uint32_t all_digits = 
    ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) 
       == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24);
  return all_digits 
   & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

Again, this code is a C function that takes a string of characters, its length, and a pointer to an unsigned 8-bit integer as input arguments. The function returns a Boolean value indicating whether the input string can be parsed into an unsigned 8-bit integer or not. We first check whether the length is in range ([1, 3]); if not, we return false, terminating the function. After this initial check, the function initializes a union digits that contains an array of 4 unsigned 8-bit integers and a 32-bit unsigned integer. The function then copies the input string into the 32-bit unsigned integer using the memcpy function. In production code where you do not know the target platform, you would want to reverse the bytes when the target is a big-endian system. Big-endian systems are vanishingly rare: mostly just mainframes. Modern systems compile a byte reversal to a single fast instruction. For the code in this blog post, I assume that you do not have a big-endian system, which is 99.99% certain.

The function then flips some of the bits of the 32-bit unsigned integer using the XOR operator and the constant 0x30303030. This operation converts each digit character in the input string to its corresponding decimal value: the ASCII characters from 0 to 9 have code-point values 0x30 to 0x39. The function then shifts the 32-bit unsigned integer to the left by a number of bits that depends on the length of the input string. This operation discards any trailing bytes that were not part of the input string. Technically, the shift would be undefined behaviour if the length were not within the allowed range ([1, 3]), but we have already returned a false value in that case, indicating that the result of the computation is invalid.

The next part is where I contributed to the routine. First, we check that all characters are digits using a concise expression. The function then multiplies the 32-bit unsigned integer by the constant 0x640a01. It is a concise way to do two multiplications (by 100 and by 10) and two sums at once. Observe that 0x64 is 100 and 0x0a is 10. The result of this multiplication is then shifted to the right by 24 bits. This extracts the most significant byte of the 32-bit unsigned integer, which is the parsed unsigned 8-bit integer. Finally, the function assigns this value to the unsigned 8-bit integer pointed to by num. It then returns a Boolean value indicating whether the parsed integer is less than 256 and the input is made entirely of digits.
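
To make the multiplication trick concrete, here is a small worked example with the input ‘37’; the byte values correspond to a little-endian load.

#include <cassert>
#include <cstdint>

int main() {
  // Parsing "37" (len = 2), step by step:
  uint32_t w = 0x3733;   // bytes '3' (0x33) and '7' (0x37), little-endian
  w ^= 0x30303030;       // w == 0x30300703: digit values 3 and 7 in the low bytes
  w <<= (4 - 2) * 8;     // w == 0x07030000: tens digit in byte 2, ones digit in byte 3
  // 0x640a01 == (100 << 16) + (10 << 8) + 1. The products that land in the top
  // byte are 10*3 (tens) and 1*7 (ones); everything else overflows out or stays lower.
  assert(((0x640a01 * w) >> 24) == 37);
  return 0;
}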

The function might compile to 20 assembly instructions or so on x64 processors:

lea rcx, [rsi - 4]
xor eax, eax
cmp rcx, -3
jb .END
mov eax, 808464432
xor eax, dword ptr [rdi]
shl esi, 3
neg sil
mov ecx, esi
shl eax, cl
lea ecx, [rax + 101058054]
or ecx, eax
test ecx, -252645136
sete cl
imul esi, eax, 6556161
shr esi, 24
mov byte ptr [rdx], sil
bswap eax
cmp eax, 132358
setb al
and al, cl
movzx eax, al
.END: # %return
ret

To test these functions, I wrote a benchmark. The benchmark uses either random inputs or sequential inputs (0, 1, …), and the distinction turns out to matter. I use GCC 12 and an Ice Lake (Intel) Linux server running at 3.2 GHz. I report the number of millions of numbers parsed per second.

approach random numbers sequential numbers
std::from_chars 145 M/s 255 M/s
naive 210 M/s 365 M/s
SWAR 425 M/s 425 M/s

So the SWAR approach is twice as fast as the naive approach when the inputs are unpredictable. Otherwise, for predictable inputs, we surpass the naive approach by about 15%. Whether it is helpful in your use case depends on your data, so you should run your own benchmarks.

Importantly, the SWAR approach is entirely equivalent to the naive approach, except for the fact that it always reads 4 bytes irrespective of the length.

The from_chars results are disappointing all around. I am puzzled as to why the naive approach is so much faster than the standard library. It solves a slightly different problem but the difference is still quite large. It could be that there is room for optimization in the standard library (GCC 12).

Can you do better? The benchmark is available. As part of the benchmark, we check exhaustively that the validation is correct.

Credit: I am grateful to Jeroen Koekkoek from NLnet Labs for suggesting this problem. The approach described was improved based on proposals by Jean-Marc Bourguet.

Update: User “Perforated Bob”, proposed a version which can be faster under some compilers:

int parse_uint8_fastswar_bob(const char *str, size_t len, uint8_t *num) {
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x303030lu;
  digits.as_int <<= (len ^ 3) * 8;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 16);
  return ((((digits.as_int + 0x767676) | digits.as_int) & 0x808080) == 0) 
   && ((len ^ 3) < 3) 
   && __builtin_bswap32(digits.as_int) <= 0x020505ff;
}

 

A simple WebSocket benchmark in Python

Modern web applications often use the http/https protocols. However, when the server and the client need to talk to each other in a symmetrical fashion, the WebSocket protocol might be preferable. For example, if you want to program a multiplayer video game, the WebSocket protocol is almost surely better than http. In A simple WebSocket benchmark in JavaScript, I showed that JavaScript (through Node.js) can produce efficient WebSocket servers, at least in the simplest cases.

My benchmark is what I call a ping-pong. Client 1 sends a message to the server. The server receives the message and broadcasts it to the second client. Client 2 receives the message from the server. Client 2 replies back to the server. Client 1 receives the message. And so forth.

I want to know how many roundtrips I can generate per second. For Node.js (JavaScript), the answer is about 20,000. If you use a faster JavaScript engine (bun), you might get twice as many.

What about Python? I wrote my clients using standard code, without any tweaking:

import asyncio
import websockets
async def client1():
  async with websockets.connect('ws://localhost:8080') as websocket:
    message = 'client 1!'
    await websocket.send(message)
    while True:
      response = await websocket.recv()
      await websocket.send(message)

async def client2():
  async with websockets.connect('ws://localhost:8080') as websocket:
    message = 'client 2!'
    while True:
      response = await websocket.recv()
      await websocket.send(message)

async def main():
  task1 = asyncio.create_task(client1())
  task2 = asyncio.create_task(client2())
  await asyncio.gather(task1, task2)

asyncio.run(main())


Python has several different frameworks to build WebSocket servers. I picked three that looked popular and mature: sanic, blacksheep, and aiohttp. By default, a module like sanic should use good optimizations like the uvloop module.

My source code is available. I run the benchmark on a Linux machine with Python 3.9. The packets are local, they don’t go out to the Internet. There is no docker container involved.

server Python client Node.js 20 client
sanic server 3700 5200
blacksheep server 3000 200
aiohttp server 3600 270
Node.js 20 server 6000 19,000

Mixing the blacksheep and aiohttp servers I wrote with the Node.js server gives terrible results: I have not investigated the cause, but it would be interesting to see if others can reproduce it, or diagnose it.

Otherwise, I get that the sanic server is nearly 4 times slower than the Node.js server. Writing the client in Python appears to cut the performance significantly (except for the blacksheep and aiohttp servers anomaly).

JavaScript shines in comparison with Python in these tests. My implementations are surely suboptimal, and there might be mistakes. Nevertheless, I believe that it is as expected: the standard Python interpreter is not very fast, while Node.js has a great just-in-time compiler. It would be interesting to switch to a faster implementation of Python such as PyPy.

This being said, 3000 roundtrips per second might be quite sufficient for several applications. Yet, a real-world WebSocket server would be assuredly slower: it would go through the Internet, it would do non-trivial work, and so forth.

A simple WebSocket benchmark in JavaScript: Node.js versus Bun

Conventional web applications use the http protocol (or the https variant). The http protocol is essentially asymmetrical: a client application such as a browser issues requests and the server responds. It is not possible for the server to initiate communication with the client. Certain types of applications are therefore more difficult to design. For example, if we wanted to design a multiplayer video game using the http protocol, such as a chess game, we could have one server, and two browsers connected to the server. When one of the players moves a piece within its browser, the browser can inform the server via an http request. But how do you inform the second browser? One solution is to have the browsers make requests to the server at regular intervals. A better solution is to use another protocol, the WebSocket protocol.

WebSocket is a network protocol for creating bidirectional communication channels between browsers and web servers. Most browsers support WebSocket, although the standard is relatively recent (2011). It enables the client to be notified of a change in server status, without having to make a request.

You expect WebSocket to be relatively efficient. I wrote an elementary WebSocket benchmark in JavaScript. I use the standard module ws. In my benchmark, I have one server. The server takes whatever messages it receives, and it sends them to other clients. Meanwhile I create two clients. Both clients initiate a connection to the server, so we have two connections. The clients then engage in a continual exchange:

  1. Client 1 sends a message to the server.
  2. The server receives the message and broadcasts it to the second client.
  3. Client 2 receives the message from the server.
  4. Client 2 replies back to the server.
  5. Client 1 receives the message.

My code is as simple as possible: I do not use any tricks to go faster. It is ‘textbook’ code.
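To give an idea of what such textbook code looks like, here is a minimal sketch of the relay server with the ws module. It is not the exact benchmark script, but it follows the description above: each message received from one client is forwarded to the other connected clients (the port number is the same one used by the clients).

const { WebSocketServer, WebSocket } = require('ws');

// Sketch of the relay server: forward every message received from one
// client to all the other connected clients.
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (data) => {
    for (const client of wss.clients) {
      if (client !== ws && client.readyState === WebSocket.OPEN) {
        client.send(data.toString());
      }
    }
  });
});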

Importantly, this benchmark has a strong data dependency: only one connection is active at any given time while the other is stalled. So we are measuring latency (how long the round trips take) rather than how many requests we can support simultaneously.

How fast can it go? I run my benchmark locally on a Linux server with a server processor (Xeon Gold). The tests are local so that they do not go through the Internet, they do not use docker or a VM, etc. Obviously, if you run these benchmarks over the Internet, you will get slower results due to the network overhead. Furthermore, my benchmark does not do any processing: it just sends simple messages. But I am interested in how low the latency can get.

I use Node.js as a runtime environment (version 20). There is an alternative JavaScript runtime environment called Bun which I also use for comparison (version 1.0.14). Because I have two JavaScript processes, I have four possibilities: both processes may run Node.js, both may run Bun, or we may mix the two.

Can we do better? Bun has its own WebSocket API. I wrote a script specifically for it. I am keeping the clients unchanged (i.e., I am not using a bun-specific client).
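The exact script is part of the available source code; the sketch below only illustrates the approach. It relies on Bun.serve, which upgrades http requests and handles WebSocket events directly, and on Bun's built-in publish/subscribe support to forward messages to the other clients. The topic name 'relay' is an arbitrary choice for this illustration.

// Sketch of a bun-specific relay server using Bun.serve: incoming http
// requests are upgraded to WebSocket connections, and each message is
// forwarded to the other subscribers through Bun's pub/sub mechanism.
// The topic name 'relay' is an arbitrary choice for this illustration.
Bun.serve({
  port: 8080,
  fetch(req, server) {
    if (server.upgrade(req)) return; // the connection is now a WebSocket
    return new Response('expected a WebSocket request', { status: 400 });
  },
  websocket: {
    open(ws) {
      ws.subscribe('relay');
    },
    message(ws, message) {
      // ws.publish sends to all subscribers of the topic except the sender
      ws.publish('relay', message);
    },
  },
});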

I measure the number of roundtrips per second in steady state.

                                Node.js 20 (client)    Bun 1.0 (client)
 Node.js 20 (server with ws)         19,000                23,000
 Bun 1.0 (server with ws)            15,000                27,000
 Bun 1.0 (bun-specific server)       44,000                50,000

It seems fair to compare the pure Node.js configuration (19,000) with the pure Bun configuration (27,000) when using the ws module. At least in my tests, the Node.js client is faster when paired with a Node.js server. I am not sure why that is. Bun is about 40% faster than Node.js in this one test. Once you switch to the bun-specific JavaScript code, bun is twice as fast.

In a simple http benchmark, I found that Node.js could support about 45,000 http queries per second while bun was nearly twice as capable. However, to get these high numbers, we need multiple requests in flight at all times. So while I am not making a direct comparison, it seems likely that WebSocket is more efficient than repeatedly polling the server from both clients.

Importantly, all of my source code is available. The benchmark should be fully reproducible.

Science and Technology links (November 12 2023)

  1. Vitamin K2 supplements might reduce the risk of myocardial infarction (heart attacks) and of all-cause death (Hasific et al. 2022). You find vitamin K2 in some Gouda cheeses and in eggs.
  2. Most of the water on Earth is salt water (in the oceans) and cannot be consumed. Fresh water is often scarce. Yet Israel is desalinating water for less than a dollar per cubic meter.
  3. People living in South America engaged in warfare for 10,000 years before the arrival of the Europeans (Standen et al. 2023).
  4. The last glacial period ended about 12,000 years ago and lasted about 100,000 years. About 26,000 years ago, all of Canada was covered by a permanent ice sheet. Thus many of us were taught in school that human beings first colonized America about 12,000 years ago via the Bering land bridge, which existed back then between modern-day Russia and modern-day Alaska. The evidence accumulates that there were human beings in America much earlier than initially thought: they would have been present 21,000 to 23,000 years ago in New Mexico. We even have their footprints.
  5. As recently as 20,000 years ago—not long in geological terms—Britain was not, in fact, an island. Instead, the terrain that became the British Isles was linked to mainland Europe by Doggerland, a tract of now-submerged territory where early Mesolithic hunter-gatherers lived, settled and traveled (McGreevy, 2020). Correspondingly, there were human beings in Ireland 31,000 years ago.
  6. Gray et al. (2023) argue that the limited freedom that children enjoy in our modern societies is leading to a rise in mental disorders.
  7. Most people cannot understand the bat and ball problem, even after the solution is given. The problem can be stated as follows: “A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?” ChatGPT can solve it: the correct answer is that the ball costs 5 cents (and the bat $1.05).
  8. When hiring, we find a slight bias in favour of females in male-dominated fields, and a strong bias in favour of females in female-dominated fields (Schaerer et al., 2023). Overall, people greatly overestimate gender biases in hiring.
  9. Retinol, a common cosmetic product, keeps one’s skin younger.
  10. Unwarranted financial optimism might be the result of low cognitive abilities.