Randomness in programming (with Go code)

Computer software is typically deterministic on paper: if you run the same program twice with the same inputs, you should get the same outputs. In practice, the complexity of modern computing makes it unlikely that you could ever run a program twice and get exactly the same result, down to the exact same execution time. For example, modern operating systems randomize memory addresses as a security precaution, a technique called address space layout randomization. Thus if you run a program twice, you have no guarantee that your data is stored at the same memory addresses. In Go, you can print the address referenced by a pointer with the %p directive. The following program allocates a small array of integers and prints the address of its first value. If you run this program multiple times, you may get different addresses.

package main

import (
    "fmt"
)

func main() {
    x := make([]int, 3)
    fmt.Printf("Hello %p\n", &x[0])
}

Thus, in some sense, software programs are already randomized whether we like it or not. Randomization can make programming more challenging. For example, a bad program might behave correctly most of the time and only fail intermittently. Such unpredictable behavior is a challenge for a programmer.

Nevertheless, we can use randomization to produce better software: for example by testing our code with random inputs. Furthermore, randomness is a key element of security routines.

Though randomness is an intuitive notion, defining it requires more care. Randomness is usually tied to a lack of information. For example, it may be measured by our inability to predict an outcome. Maybe you are generating numbers, one every second, and after looking at the last few numbers you generated, I still cannot predict the next number you will generate. It does not imply that the approach you are using to generate numbers is magical. Maybe you are applying a perfectly predictable mathematical routine. Thus randomness is relative to the observer and their knowledge.

In software, we distinguish between pseudo-randomness and randomness. If I run a mathematical routine that generates random-looking numbers, but these numbers are perfectly determined, I say that they are ‘pseudo-random’. What ‘random-looking’ means is subjective, and the concept of pseudo-randomness is likewise subjective.

It is possible, on a computer, to produce numbers that cannot be predicted by the programmer. For example, you might use a temperature sensor in your processor to capture physical ‘noise’ that can serve as a random-looking input. You might use the time of day when a program was started as a random input. We often refer to such values as random (as opposed to pseudo-random). We consider them random in the sense that, even in principle, it is not possible for the software to predict them: they are produced by a process from outside of the software system.

Hashing

Hashing is the process by which we design a function that takes various inputs (for example, variable-length strings) and outputs a convenient value, often an integer. Because hashing involves a function, given the same input, we always get the same output. Typically, hash functions produce a fixed number of bits: e.g., 32 bits, 64 bits, and so forth.

One application of hashing has to do with security: given a file recovered from the network, you may compute a hash value from it. You may then compare it with the hash value that the server provides. If the two hash values match, then it is likely that the file you recovered is a match for the file on the server. Systems such as git rely on this strategy.

Hashing can also be used to construct useful data structures. For example, you can create a hash table: given a set of keys, you compute a hash value representing an index in an array. You can then store the key and a corresponding value at the given index, or nearby. When a key is provided, you can hash it, seek the index in the array, and find the matching value.

If the hashing is random-looking, then you should be able to hash N objects into an array of M elements, for M slightly larger than N, so that only a few objects are hashed to the same location in the array. It is difficult to ensure that no two objects are ever mapped to the same array element: it requires that M be much, much larger than N. For M much larger than N, the probability of a collision is about 1 - exp(-N*N/(2*M)). Though this probability falls to zero as M grows large, it requires M to be much larger than N for it to be practically zero. Solving for p in 1 - exp(-N*N/(2*M)) = p, we get M = -N*N/(2*ln(1-p)). That is, to maintain a collision probability p, M must grow quadratically (proportionally to N*N) with respect to N. Thus we should expect that there will be collisions in a hash table even if the hash function appears random.

We can handle collisions in various ways. For example, you may use chaining: each element in the array stores a reference to a dynamic container that may contain several keys. You may also use linear probing: if the array location is occupied, you find the next unoccupied location and store the key there. When searching for a key under linear probing, you first access the element indicated by the hash value; if it is occupied by a different key, you move to the next element, and so forth, until you have found the key, found an unoccupied element, or visited the whole array. There are many variations on this theme.

An ideal hash function might take each possible input and assign it to a purely random value given by an Oracle. Unfortunately, such hash functions are often impractical. They require the storage of large tables of input values and matching random values. In practice, we aim to produce hash functions that behave as if they were purely random while still being easy to implement efficiently.

A reasonable function to hash non-zero integer values is the murmur function. The murmur function consists of two multiplications and three shift/xor operations. The following Go program will display random-looking 64-bit integers, using the murmur function:

package main

import (
    "fmt"
    "math/bits"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

func main() {

    for i := 0; i < 10; i++ {
        fmt.Println(i, murmur64(uint64(i)))
    }
}

It is a reasonably fast function. One downside of the murmur64 function is that zero is mapped to zero, so some care is needed.
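
We can confirm the zero fixed point, and sketch one possible workaround (an arbitrary choice on our part, not a standard fix): xor the input with a non-zero constant before hashing. Because murmur64 is invertible and maps zero to zero, a non-zero input can never map to zero.

```go
package main

import "fmt"

func murmur64(h uint64) uint64 {
	h ^= h >> 33
	h *= 0xff51afd7ed558ccd
	h ^= h >> 33
	h *= 0xc4ceb9fe1a85ec53
	h ^= h >> 33
	return h
}

func main() {
	// Zero is a fixed point of the function:
	fmt.Println(murmur64(0)) // prints 0
	// One possible workaround (an arbitrary choice): xor the input
	// with a non-zero constant first, guaranteeing a non-zero output.
	fmt.Println(murmur64(0 ^ 0x9E3779B97F4A7C15))
}
```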

In practice, your values might not be integers. If you want to hash a string, you might use a recursive approach: you process the string character by character, and at each character, you combine the character value with the hash value computed so far, producing a new hash value. Once the string is fully processed, you may apply murmur64 to the result:

package main

import (
    "fmt"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

func hash(s string) (v uint64) {
    v = uint64(0)
    for _, c := range s {
        v = uint64(c) + 31*v
    }
    return murmur64(v)
}

func main() {
    fmt.Print(hash("la vie"), hash("Daniel"))
}

There are better and faster hash functions, but the result from recursive hashing with a murmur finalizer is reasonable.

Importantly, it is reasonably easy to generate two strings that hash to the same value, i.e., to create a collision. For example, you can verify that the strings "Ace", "BDe", "AdF", and "BEF" all have the same hash value:

fmt.Print(hash("Ace"), hash("BDe"), hash("AdF"), hash("BEF"))

When hashing arbitrarily long strings, collisions are always possible. However, we can use more sophisticated (and more computationally expensive) hash functions to reduce the probability that we encounter a problem.

An interesting characteristic of the provided murmur64 function is that it is invertible. If you consider the steps, you have two multiplications by odd integers. A multiplication by an odd integer is always invertible: the multiplicative inverse of 0xff51afd7ed558ccd is 0x4f74430c22a54005 and the multiplicative inverse of 0xc4ceb9fe1a85ec53 is 0x9cb4b2f8129337db, as 64-bit unsigned integers. It may be slightly less obvious that h ^= h >> 33 is invertible. But if h is a 64-bit integer, then h and h ^ (h >> 33) are identical in their most significant 33 bits, by inspection. Thus if we are given z = h ^ (h >> 33), we have that z >> (64-33) == h >> (64-33). That is, we have recovered the most significant 33 bits of h from h ^ (h >> 33). Extending this reasoning, we have that g is the inverse of f in the following code, in the sense that g(f(i)) == i.

func f(h uint64) uint64 {
    return h ^ (h >> 33)
}

func g(z uint64) uint64 {
    h := z & 0xffffffff80000000
    h = (h >> 33) ^ z
    return h
}
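
Using the multiplicative inverses quoted above, we can invert the whole murmur64 function. The following sketch (the name unmurmur64 is our own) applies the inverse of each step in reverse order. Note that because the shift amount (33) is at least half the word size, h ^= h >> 33 undoes itself when applied twice, so the inverse of that step is the step itself.

```go
package main

import "fmt"

func murmur64(h uint64) uint64 {
	h ^= h >> 33
	h *= 0xff51afd7ed558ccd
	h ^= h >> 33
	h *= 0xc4ceb9fe1a85ec53
	h ^= h >> 33
	return h
}

// unmurmur64 undoes murmur64: each step is replaced by its inverse,
// applied in reverse order.
func unmurmur64(h uint64) uint64 {
	h ^= h >> 33
	h *= 0x9cb4b2f8129337db // inverse of 0xc4ceb9fe1a85ec53
	h ^= h >> 33
	h *= 0x4f74430c22a54005 // inverse of 0xff51afd7ed558ccd
	h ^= h >> 33
	return h
}

func main() {
	for i := uint64(1); i <= 5; i++ {
		fmt.Println(unmurmur64(murmur64(i)) == i) // prints true
	}
}
```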

We often need hash values to fit within an interval starting at zero. E.g., to get a hash value in [0,max), you might use the following function:

func toIntervalBias(random uint64, max uint64) uint64 {
    hi, _ := bits.Mul64(random, max)
    return hi
}

This function outputs a value in [0,max) using a single multiplication. There are alternatives such as random % max, but an integer remainder operation may compile to a division instruction, and a division is typically more expensive than a multiplication. Whenever possible, you should avoid division instructions when performance is a factor.

Importantly, the toIntervalBias function introduces a slight bias: we start with 2^64 distinct values and we map them to N distinct values. This means that out of 2^64 original values, about 2^64/N values correspond to each output value. Let ⌈x⌉ be the smallest integer no smaller than x and ⌊x⌋ be the largest integer no larger than x. When 2^64/N is not an integer, then some output values match ⌈2^64/N⌉ original values, whereas others match ⌊2^64/N⌋ original values. When N is small, the bias may be negligible, but as N grows, the bias becomes relatively more important. In some sense, it is the smallest possible bias if we are starting from original values that are uniformly distributed over a set of 2^64 possible values.
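
The effect is easiest to see at a small scale. The following sketch (our own illustration, not from the original) maps all 8 possible 3-bit words to 3 buckets with the same multiply-shift technique (bucket = x*3/8): some buckets receive ⌈8/3⌉ = 3 inputs, others ⌊8/3⌋ = 2.

```go
package main

import "fmt"

func main() {
	// Scaled-down illustration of the bias: map the 8 possible 3-bit
	// words to 3 buckets using multiply-shift, i.e. bucket = x*3 >> 3.
	var counts [3]int
	for x := uint64(0); x < 8; x++ {
		counts[x*3>>3]++
	}
	// Since 8 is not divisible by 3, the buckets cannot all be equal.
	fmt.Println(counts) // prints [3 3 2]
}
```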

Putting it all together, the following program will hash a string into a value in the interval [0,10).

package main

import (
    "fmt"
    "math/bits"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

func hash(s string) (v uint64) {
    v = uint64(0)
    for _, c := range s {
        v = uint64(c) + 31*v
    }
    return murmur64(v)
}

func toIntervalBias(random uint64, max uint64) uint64 {
    hi, _ := bits.Mul64(random, max)
    return hi
}

func main() {
    fmt.Print(toIntervalBias(hash("la vie"),10))
}

Though the toIntervalBias function is generally efficient, it is unnecessarily expensive when the range is a power of two. If max is a power of two (e.g., 32), then random % max == random & (max-1). A bitwise AND with the decremented maximum is faster than even just a multiplication, typically. Thus the following function is preferable.

func toIntervalPowerOfTwo(random uint64, max uint64) uint64 {
    return random & (max-1)
}

Estimating cardinality

One use case for hashing is to estimate the cardinality of the values in an array or stream of values. Suppose that your software receives billions of identifiers: how many distinct identifiers are there? You could build a database of all identifiers, but it could use a lot of memory and be relatively expensive. Sometimes, you only want a crude approximation, and you want to compute it quickly.

There are many techniques to estimate cardinalities using hashing: Probabilistic Counting, LogLog Probabilistic Counting, and so forth. We can explain the core idea and even produce a useful function without any advanced mathematics. If you hash all values (e.g., identifiers), and if the hash function is of good quality, you would expect the distinct values to hash to random-looking values spread uniformly over the whole 64-bit range.

The space between distinct hash values should be about 2^64/(N+1) where N is the number of distinct values. If we find the smallest hash value m, then we should have approximately m = 2^64/(N+1), or N = 2^64/m − 1. When N is much larger than 1 but much smaller than 2^64, this is approximately N = (2^64−1)/m. The following function applies this formula to estimate the cardinality:

// estimateCardinality estimates the number of distinct values
func estimateCardinality(values []uint64) int {
    if len(values) < 2 {
        return len(values)
    }
    m := murmur64(values[0])

    for i := 1; i < len(values); i++ {
        t := murmur64(values[i])
        if t < m {
            m = t
        }
    }
    return int(math.MaxUint64 / m)
}

We can apply this function in the following program. The approximation is rather crude, but it can be good enough in some practical cases.

package main

import (
    "fmt"
    "math"
)

func murmur64(h uint64) uint64 {
    h ^= h >> 33
    h *= 0xff51afd7ed558ccd
    h ^= h >> 33
    h *= 0xc4ceb9fe1a85ec53
    h ^= h >> 33
    return h
}

// fillArray fills the array with up to howmany distinct values
func fillArray(arr []uint64, howmany int) {
    for i := 0; i < len(arr); i++ {
        // careful not to include zero because murmur64(0) == 0
        arr[i] = 1 + uint64(i%howmany)
    }
}

// estimateCardinality estimates the number of distinct values
func estimateCardinality(values []uint64) int {
    if len(values) < 2 {
        return len(values)
    }
    m := murmur64(values[0])

    for i := 1; i < len(values); i++ {
        t := murmur64(values[i])
        if t < m {
            m = t
        }
    }
    return int(math.MaxUint64 / m)
}

func main() {
    values := make([]uint64, 5000000) // 5 million values
    distinct := 2200000               // 2.2 million distinct values
    fillArray(values, distinct)
    fmt.Println(estimateCardinality(values), distinct)

}

Integers

There are many ways to generate random integers, but a particularly simple approach is to rely on hashing. For example, we could start from an integer (e.g., 10) and return the random integer murmur64(10), then increment the integer (e.g., to 11) and next return the integer murmur64(11).
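
A sketch of this counter-based approach, reusing the murmur64 function (the helper name counterRandom is our own):

```go
package main

import "fmt"

func murmur64(h uint64) uint64 {
	h ^= h >> 33
	h *= 0xff51afd7ed558ccd
	h ^= h >> 33
	h *= 0xc4ceb9fe1a85ec53
	h ^= h >> 33
	return h
}

// counterRandom hashes an incrementing counter: each call returns
// murmur64 of the current counter value, then advances the counter.
func counterRandom(counter *uint64) uint64 {
	r := murmur64(*counter)
	*counter++
	return r
}

func main() {
	counter := uint64(10)
	for i := 0; i < 3; i++ {
		fmt.Println(counterRandom(&counter))
	}
}
```

Starting from the same counter value always reproduces the same sequence, which is convenient for debugging.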

Steele et al. (2014) propose a similar strategy which they call SplitMix: it is part of the Java standard library. It works much like what we just described, but instead of incrementing the counter by one, they increment it by a large odd integer. They also use a slightly different version of the murmur64 function. The following program prints 10 different random values, following the SplitMix formula:

package main

import "fmt"

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func main() {
    seed := uint64(1234)
    for z := 0; z < 10; z++ {
        r := splitmix64(&seed)
        fmt.Println(r)
    }
}

Each time the splitmix64 function is called, the seed variable is advanced by a constant (0x9E3779B97F4A7C15). If you start from the same seed, you always get the same random values.

The function then performs a series of bitwise operations on z. First, it performs an XOR operation between z and z shifted right by 30 bits. It then multiplies the result by the constant value 0xBF58476D1CE4E5B9. Next, it performs another XOR operation between the result and the result shifted right by 27 bits. Finally, it multiplies the result by the constant value 0x94D049BB133111EB and returns the result XORed with the result shifted right by 31 bits.

It produces integers using the full 64-bit range. If one needs a random integer in an interval (e.g., [0,N)), then more work is needed. If the size of the interval is a power of two (e.g., [0,32)), then we may simply use the same technique as for hashing:

// randomInPowerOfTwo -> [0,max)
func randomInPowerOfTwo(seed *uint64, max uint64) uint64 {
    r := splitmix64(seed)
    return r & (max-1)
}

However, when the bound is arbitrary (not a power of two) and we want to avoid biases, a slightly more complicated algorithm is needed. Indeed, if we assume that the 64-bit integers are truly random, then all values are equally likely. However, if we are not careful, we can introduce a bias when converting the 64-bit integers to values in [0,N). It is not a concern when N is a power of two, but it becomes a concern when N is arbitrary. A fast routine was described by Lemire (2019) to solve this problem:

func toIntervalUnbiased(seed *uint64, max uint64) uint64 {
    x := splitmix64(seed)
    hi, lo := bits.Mul64(x, max)
    if lo < max {
        t := (-max) % max // division!!!
        for lo < t {
            x = splitmix64(seed)
            hi, lo = bits.Mul64(x, max)
        }
    }
    return hi
}

The toIntervalUnbiased function takes two arguments: a pointer to a 64-bit unsigned integer seed and a 64-bit unsigned integer max. It returns a 64-bit unsigned integer. The function first calls the splitmix64 function with the seed pointer as an argument to generate a random 64-bit unsigned integer x. It then multiplies x with max using the bits.Mul64 function, which returns the product of two 64-bit unsigned integers as two 64-bit unsigned integers. The higher 64 bits of the product are stored in the variable hi, and the lower 64 bits are stored in the variable lo. If lo is less than max, the function enters a loop that generates new random numbers using splitmix64 and recalculates the product of x and max until lo is greater than or equal to -max % max. This is done to ensure that the distribution of random numbers is unbiased.

The general strategy used by this function is called the rejection method: we repeatedly generate a random integer until we can produce an unbiased result. However, when the interval is much smaller than 2^64 (a common case), we are very unlikely to reject a value or even to compute an integer remainder. Most of the time, the function never enters the rejection loop.
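
To see why rejections are rare, we can compute the threshold for a small bound (our own illustration). For max = 6, the threshold t equals 2^64 mod 6, which is tiny compared with the 2^64 possible inputs:

```go
package main

import "fmt"

func main() {
	// In unsigned 64-bit arithmetic, -max is 2^64 - max, so
	// (-max) % max equals 2^64 mod max.
	max := uint64(6)
	t := (-max) % max
	fmt.Println(t) // prints 4
	// The probability of entering the rejection loop is t/2^64,
	// about 2.2e-19 in this case.
}
```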

Testing that a random generator appears random is challenging. We can use many testing strategies, and each testing strategy can be more or less extensive. Thankfully, it is not difficult to think of some tests we can apply. For example, we want the distribution of values to be uniform: the probability that any one value is generated should be 1 over the number of possible values. With 2^64 possible values, it is technically challenging to test for uniformity directly. However, we can conveniently restrict the size of the output with a function such as toIntervalUnbiased.

The following program computes the relative standard deviation of a frequency histogram based on 100 million values. The relative standard deviation is far smaller than 1% (0.05655%) which suggests that the distribution is uniform.

package main

import (
    "fmt"
    "math"
    "math/bits"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func toIntervalUnbiased(seed *uint64, max uint64) uint64 {
    x := splitmix64(seed)
    hi, lo := bits.Mul64(x, max)
    if lo < max {
        t := (-max) % max // division!!!
        for lo < t {
            x = splitmix64(seed)
            hi, lo = bits.Mul64(x, max)
        }
    }
    return hi
}

func meanAndStdDev(arr []int) (float64, float64) {
    var sum, sumSq float64
    for _, val := range arr {
        sum += float64(val)
        sumSq += math.Pow(float64(val), 2)
    }
    n := float64(len(arr))
    mean := sum / n
    stdDev := math.Sqrt((sumSq / n) - math.Pow(mean, 2))
    return mean, stdDev
}

func main() {
    seed := uint64(1234)
    const window = 30
    var counter [window]int

    for z := 0; z < 100000000; z++ {
        counter[toIntervalUnbiased(&seed, window)] += 1
    }
    mean, stdDev := meanAndStdDev(counter[:])
    fmt.Println("relative std ", stdDev/mean*100, "%")
}

Random shuffle

Sometimes, you are given an array that you want to shuffle randomly. An elegant algorithm described by Knuth (often called the Fisher-Yates or Knuth shuffle) is the standard approach. The algorithm iterates over the array from the last element to the first. At each iteration, it selects a random index between 0 and the current index (inclusive) and swaps the element at the current index with the element at the randomly selected index.

The following program randomly shuffles an array based on a seed. Changing the seed changes the resulting order. For large arrays, the number of possible permutations is likely to exceed the number of possible seeds: it follows that not all permutations can be produced by such an algorithm with a simple fixed-length seed.

package main

import (
    "fmt"
    "math/bits"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func toIntervalUnbiased(seed *uint64, max uint64) uint64 {
    x := splitmix64(seed)
    hi, lo := bits.Mul64(x, max)
    if lo < max {
        t := (-max) % max // division!!!
        for lo < t {
            x = splitmix64(seed)
            hi, lo = bits.Mul64(x, max)
        }
    }
    return hi
}

func shuffle(seed *uint64, arr []int) {
    for i := len(arr)-1; i >= 1; i-- {
        j := toIntervalUnbiased(seed, uint64(i+1))
        arr[i], arr[j] = arr[j], arr[i]
    }
}

func main() {
    seed := uint64(1234)
    numbers := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
    shuffle(&seed, numbers)
    fmt.Println(numbers)
}

Floats

It is often necessary to generate random floating-point numbers. Software systems typically use IEEE 754 floating-point numbers.

To generate 32-bit floating-point numbers in the interval [0,1), it may seem that we could generate a 32-bit integer (in [0, 2^32)) and divide it by 2^32 to get a random floating-point value in [0, 1). That’s certainly “approximately true”, but we are making an error when doing so. How much of an error?

Floating-point (normal) numbers are represented as a sign bit, a mantissa, and an exponent as follows:

  • There is a single sign bit. Because we only care about positive numbers, this bit is fixed and can be ignored.
  • The mantissa of a 32-bit floating point number is 23 bits. It is implicitly preceded by the number 1.
  • There are eight bits dedicated to the exponent. For normal numbers, the exponent ranges from -126 to 127. To represent zero, you need an exponent value of -127 and zero mantissa.

So how many normal non-zero numbers are there between 0 and 1? The negative exponents range from -1 to -126. In each case, we have 2^23 distinct floating-point numbers because the mantissa is made of 23 bits. So we have 126 × 2^23 normal floating-point numbers in [0,1). If you don’t have a calculator handy, that’s 1,056,964,608. If we want to add the numbers 0 and 1, that’s 126 × 2^23 + 2, slightly over a billion distinct values. There are 2^32 32-bit words, or slightly over 4 billion, so about a quarter of them are in the interval [0,1]. Of all the floating-point numbers your computer can represent, a quarter of them lie in [0,1]. By extension, half of the floating-point numbers are in the interval [-1,1].

The number 2^32 is not divisible by 126 × 2^23 + 2, so we can’t take a 32-bit non-negative integer, divide it by 2^32 and hope that this will generate a number in [0,1] or [0,1) in an unbiased way.

We can use the fact that the mantissa uses 23 bits. This means in particular that if you pick any integer in [0, 2^24) and divide it by 2^24, then you can recover your original integer by multiplying the result again by 2^24. This works with 2^24 but not with 2^25 or any larger power. For 64-bit floating-point numbers, you have greater accuracy as you can replace 24 with 53.
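
We can check this round-trip property directly (our own illustration): every integer below 2^24 survives the divide-then-multiply round trip with float32, while 2^24 + 1 does not, because it is not exactly representable with a 24-bit significand.

```go
package main

import "fmt"

func main() {
	// Any integer in [0, 2^24) survives the float32 round trip:
	i := uint32(16777215) // 2^24 - 1
	f := float32(i) / (1 << 24)
	fmt.Println(uint32(f*(1<<24)) == i) // prints true

	// ...but 2^24 + 1 does not: float32 rounds it to 2^24, so the
	// round trip through division by 2^25 loses the original value.
	j := uint32(16777217) // 2^24 + 1
	g := float32(j) / (1 << 25)
	fmt.Println(uint32(g*(1<<25)) == j) // prints false
}
```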

So you can pick a random integer in [0, 2^24), divide it by 2^24, and you will get a random number in [0,1) without bias, meaning that for every integer in [0, 2^24), there is one and only one number in [0,1). Moreover, the distribution is uniform in the sense that the possible floating-point numbers are evenly spaced (the distance between them is a flat 2^-24).

So even though single-precision floating-point numbers use 32-bit words, and even though your computer can represent about 2^30 distinct normal floating-point numbers in [0, 1), chances are good that your random generator only produces 2^24 distinct 32-bit floating-point numbers in the interval [0, 1), and only 2^53 distinct 64-bit floating-point numbers.

A common way to generate random integers in an interval [0,N) is to first generate a random floating-point number in [0,1) and then multiply the result by N. Should N exceed 2^24 (or 2^53), then you are unable to generate all integers in the interval [0,N). Similarly, to generate numbers in [a,b), you would generate a random floating-point number in [0,1), multiply the result by b-a, and add a. The result may not be ideal in general.

The following program generates random floating-point numbers:

package main

import (
    "fmt"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

// toFloat32 -> [0,1)
func toFloat32(seed *uint64) float32 {
    x := splitmix64(seed)
    x &= 0xffffff // keep 24 bits
    return float32(x) / (1 << 24)
}

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // keep 53 bits
    return float64(x) / (1 << 53)
}

func main() {
    seed := uint64(1231114)
    fmt.Println(toFloat32(&seed))
    fmt.Println(toFloat64(&seed))
}

An amusing application of floating-point numbers is to estimate the value of pi. If we generate a random point (x, y) in the unit square by picking x and y in [0, 1), the probability that it falls within the unit circle (that is, that x*x+y*y <= 1) is pi/4, the area of the quarter disk. The following program prints an estimate of the value of pi.

package main

import (
    "fmt"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // keep 53 bits
    return float64(x) / (1 << 53)
}

func main() {
    seed := uint64(1231114)
    N := 100000000
    circle := 0
    for i := 0; i < N; i++ {
        x := toFloat64(&seed)
        y := toFloat64(&seed)
        if x*x+y*y <= 1 {
            circle += 1
        }

    }
    fmt.Println(4 * float64(circle)/float64(N))
}

Of course, practical algorithms might require other distributions such as the normal distribution. We can generate high-quality normally distributed floating-point values at high speed using the Ziggurat method (Marsaglia & Tsang, 2000). The implementation is not difficult, but it is technical. In particular, it requires a precomputed table. Typically, we generate normally distributed values with a mean of zero and a standard deviation of one: to get other parameters, we multiply the result by the desired standard deviation and add the desired mean.
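
If a precomputed table is undesirable, a simpler though slower alternative is the Box-Muller transform, which converts two uniform values into a normally distributed one. The following sketch (the function name boxMuller is our own) reuses the splitmix64 generator:

```go
package main

import (
	"fmt"
	"math"
)

func splitmix64(seed *uint64) uint64 {
	*seed += 0x9E3779B97F4A7C15
	z := *seed
	z = (z ^ (z >> 30))
	z *= (0xBF58476D1CE4E5B9)
	z = (z ^ (z >> 27))
	z *= (0x94D049BB133111EB)
	return z ^ (z >> 31)
}

// toFloat64 -> [0,1)
func toFloat64(seed *uint64) float64 {
	x := splitmix64(seed)
	x &= 0x1fffffffffffff // keep 53 bits
	return float64(x) / (1 << 53)
}

// boxMuller converts two uniform values into one normally distributed
// value with mean 0 and standard deviation 1.
func boxMuller(seed *uint64) float64 {
	u := toFloat64(seed)
	for u == 0 { // avoid log(0)
		u = toFloat64(seed)
	}
	v := toFloat64(seed)
	return math.Sqrt(-2*math.Log(u)) * math.Cos(2*math.Pi*v)
}

func main() {
	seed := uint64(1231114)
	// Scale to mean 5 and standard deviation 2 by multiplying by the
	// standard deviation and adding the mean.
	fmt.Println(5 + 2*boxMuller(&seed))
}
```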

Discrete distributions

Sometimes we are given a collection of possible values where each value has a corresponding probability. For example, we might pick at random one of three colors (red, blue, green) with corresponding probabilities (20%, 40%, 40%). If there are few such values (for example, three), a standard approach is roulette wheel selection. We divide the interval from 0 to 1 into three distinct components, one for each color: from 0 to 0.2, we pick red; from 0.2 to 0.6, we pick blue; from 0.6 to 1.0, we pick green.

The following program illustrates this algorithm:

package main

import (
    "fmt"
)

func splitmix64(seed *uint64) uint64 {
    *seed += 0x9E3779B97F4A7C15
    z := *seed
    z = (z ^ (z >> 30))
    z *= (0xBF58476D1CE4E5B9)
    z = (z ^ (z >> 27))
    z *= (0x94D049BB133111EB)
    return z ^ (z >> 31)
}

func toFloat64(seed *uint64) float64 {
    x := splitmix64(seed)
    x &= 0x1fffffffffffff // keep 53 bits
    return float64(x) / (1 << 53)
}

func rouletteWheelSelection(seed *uint64, colors []string, probabilities []float64) string {
    // Create a slice of cumulative probabilities
    cumulativeProbabilities := make([]float64, len(probabilities))
    cumulativeProbabilities[0] = probabilities[0]
    for i := 1; i < len(probabilities); i++ {
        cumulativeProbabilities[i] = cumulativeProbabilities[i-1] + probabilities[i]
    }

    // Generate a random number between 0 and 1
    randomNumber := toFloat64(seed)

    // Select the color based on the random number and cumulative probabilities
    if randomNumber < cumulativeProbabilities[0] {
        return colors[0]
    }
    for i := 1; i < len(cumulativeProbabilities); i++ {
        if randomNumber >= cumulativeProbabilities[i-1] && randomNumber < cumulativeProbabilities[i] {
            return colors[i]
        }
    }

    return colors[len(colors)-1]
}

func main() {
    seed := uint64(1231114)

    colors := []string{"red", "blue", "green"}
    probabilities := []float64{0.2, 0.4, 0.4}

    fmt.Println(rouletteWheelSelection(&seed, colors, probabilities))
}

If you have to pick a value out of a large set, a roulette-wheel selection approach can become inefficient. In such cases, we may use the alias method.

Cryptographic hashing and random number generation

We do not typically reimplement cryptographic functions: it is preferable to use well-tested implementations. Cryptographic functions are typically reserved for cases where security is a concern because they tend to use more resources.

Cryptographic hashing of strings is designed so that it is difficult to find two strings that collide (have the same hash value). Thus if you receive a message, and you were given its hash value ahead of time, and you check that the hash value and the message correspond, there are good chances that the message has not been corrupted. It is difficult (but not impossible) for an attacker to produce a message that matches the hash value you were given. To hash a string in Go cryptographically, you may use the following code:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    message := "Hello, world!"
    hash := sha256.Sum256([]byte(message))
    fmt.Printf("Message: %s\nHash: %x\n", message, hash)
}

Similarly, you may want to generate random numbers cryptographically: in such cases, the produced random numbers are difficult to predict. Even if I were to give you the last ten numbers, it would be difficult to predict the next one. If you were to implement software for an online casino, you should probably use cryptographic random numbers.

package main

import (
    "crypto/rand"
    "fmt"
    "math/big"
)

func main() {
    nBig, err := rand.Int(rand.Reader, big.NewInt(100))
    if err != nil {
        panic(err)
    }
    n := nBig.Int64()
    fmt.Printf("Here is a random %T between 0 and 99: %d\n", n, n)
}

Web server ‘hello world’ benchmark: Go vs Node.js vs Nim vs Bun

The Web is a convenient interface to your software. Many times, if you have an existing application, you may want to allow Web access to it using HTTP. Or you may want to build a small specialized Web application. In such instances, you may not want to deploy a full-fledged Web server (e.g., Apache or IIS).

There are many popular frameworks for writing little web applications. Go and JavaScript (Node.js) are among the most popular choices. Reportedly, Netflix runs on Node.js; Uber moved from Node.js to Go for better performance. There are also less popular options such as Nim.

An in-depth review of their performance characteristics would be challenging. But if I just write a little toy web application, will I see a difference? A minimalist application gives you a reference point, since more complex applications are likely to run slower.

Let us try it out. I want the equivalent of ‘hello world’ for web servers. I also do not want to do any fiddling: let us keep things simple.

A minimalist Go server might look as follows:

package main
import (
  "io"
  "fmt"
  "log"
  "net/http"
)
func main() {
  http.HandleFunc("/simple", func(w http.ResponseWriter, r *http.Request){
    io.WriteString(w, "Hello!")
  })
  fmt.Printf("Starting server at port 3000\n")
  if err := http.ListenAndServe(":3000", nil); err != nil {
    log.Fatal(err)
  }
}

A basic JavaScript (Node.js) server might look like this:

const f = require('fastify')()
f.get('/simple', async (request) => {
  return "hello"
})
f.listen({ port: 3000})
  .then(() => console.log('listening on port 3000'))
  .catch(err => console.error(err))

It will work as-is in an alternative runtime such as Bun, but to get the most out of the Bun runtime, you may need to write Bun-specific code:

const server = Bun.serve({
  port: 3000,
  fetch(req) {
   let url = new URL(req.url);
   let pname = url.pathname;
   if(pname == '/simple'){
     return new Response('Hello');
   }
   return new Response("Not Found.");
  }
});

Nim offers a nice way to achieve the same result:

import options, asyncdispatch
import httpbeast
proc onRequest(req: Request): Future[void] =
  if req.httpMethod == some(HttpGet):
    case req.path.get()
    of "/simple":
      req.send("Hello World")
    else:
      req.send(Http404)
run(onRequest, initSettings(port=Port(3000)))

An interesting alternative is to use uWebSockets.js with Node:

const uWS = require('uWebSockets.js')
const port = 3000;
const app = uWS.App({
}).get('/simple', (res, req) => {
  res.end('Hello!');
}).listen(port, (token) => {
  if (token) {
    console.log('Listening to port ' + port);
  } else {
    console.log('Failed to listen to port ' + port);
  }
});

We can also use C++ with the lithium library:

#include <lithium_http_server.hh>
int main() {
  li::http_api my_api;
  my_api.get("/simple") =
    [&](li::http_request& request, li::http_response& response) {
      response.write("hello world.");
    };
  li::http_serve(my_api, 3000);
}

I wrote a benchmark; my source code is available. I ran it on a powerful Ice Lake-based server with 64 cores. As is typical, such big servers have relatively low clock speeds (a base frequency of 2 GHz, up to 3.2 GHz). I use a simple bombardier command as part of the benchmark:

bombardier -c 10 http://localhost:3000/simple

You can increase the number of concurrent connections to 1000 (-c 1000). My initial tests used autocannon, which is a poor choice for this task.

My results indicate that Nim does quite well on this toy example.

system requests/second (10 connections) requests/second (1000 connections)
Nim 2.0 and httpbeast 315,000 +/- 18,000 350,000 +/- 60,000
GCC12 (C++) + lithium 190,000 +/- 60,000 385,000 +/- 80,000
Go 1.19 95,000 +/- 30,000 250,000 +/- 45,000
Node.js 20 and uWebSockets.js 100,000 +/- 25,000 100,000 +/- 35,000
Bun 1.04 80,000 +/- 15,000 65,000 +/- 20,000
Node.js 20 (JavaScript) 45,000 +/- 7,000 41,000 +/- 10,000
Bun + fastify 40,000 +/- 6,000 35,000 +/- 9,000

*Jarred Sumner, the author of Bun, said on X that fastify is not fast in Bun right now, but that Bun.serve() is more than twice as fast as node:http in Bun.

My web server does very little work, so it is an edge case. I have also not done any configuration: it is ‘out of the box’ performance. Furthermore, the server is probably more powerful than anything web developers will use in practice.

There is considerable noise in these results, and you should not trust my numbers entirely. I recommend you try running the benchmark for yourself.

I reviewed some blog posts on the topic, all concluding that Go is faster.

It would be interesting to add C, Rust and Zig to this benchmark.

Regarding the C++ solution, I initially encountered many difficulties. Using Lithium turned out to be simple: the most difficult part is to ensure that you have installed OpenSSL and Boost on your system. My solution is just as simple as the alternatives. The author of Lithium has a nice approach where he explains how to run a Lithium server using a docker container with a script. Doing it in this manner means that you do not have to worry about installing libraries on your system. Running a server in a docker container is perfectly reasonable, but there is a performance overhead, so I did not use this solution in my benchmark.

While preparing this blog post, I had the pleasure of compiling software written in the Nim language for the first time. I must say that it left a good impression. The authors state that Nim was inspired by Python, and it does feel quite like Python. I will revisit Nim later.

Parsing integers quickly with AVX-512

If I give a programmer a string such as "9223372036854775808" and I ask them to convert it to an integer, they might do the following in C++:

#include <charconv> // std::from_chars
#include <cstdint>
#include <string>

std::string s = "9223372036854775808";
uint64_t val;
auto [ptr, ec] =
    std::from_chars(s.data(), s.data() + s.size(), val);
if (ec != std::errc()) { /* I have an error! */ }
// otherwise, val holds the value

It is very fast: you can parse a sequence of random 32-bit integers at about 40 cycles per integer, using about 128 instructions.

Can you do better?

Recent Intel processors have new instructions, AVX-512, that can process multiple bytes at once and do it with masking, so that you can select just a range of data.

I am going to assume that you know the beginning and the end of the sequence of digits. The following code, using AVX-512 intrinsics, does the following:

  1. Compute the span in bytes (digit_count).
  2. If we have more than 20 bytes, we know that the integer is too large to fit in a 64-bit integer.
  3. Compute a “mask”: a 32-bit value that has only the most significant digit_count bits set to 1.
  4. Load the ASCII (or UTF-8) string into a 256-bit register.
  5. Subtract the character value ‘0’ to get values between 0 and 9 (digit values).
  6. Check whether some value exceeds 9, in which case we had a non-digit character.

size_t digit_count = size_t(end - start);
// if (digit_count > 20) { error ....}
const simd8x32 ASCII_ZERO = _mm256_set1_epi8('0');
const simd8x32 NINE = _mm256_set1_epi8(9);
uint32_t mask = uint32_t(0xFFFFFFFF) << (start - end + 32);
auto in = _mm256_maskz_loadu_epi8(mask, end - 32);
auto base10_8bit = _mm256_maskz_sub_epi8(mask, in, ASCII_ZERO);
auto nondigits = _mm256_mask_cmpgt_epu8_mask(mask, base10_8bit, NINE);
if (nondigits) {
  // there is a non-digit character
}

This is the key step that uses the functionality of AVX-512. Afterward, we can use ‘old-school’ processing, familiar to folks who work with advanced Intel intrinsics on conventional x64 processors. Mostly, we just multiply by 10, by 100 and by 10000 to create four 32-bit values: the first one corresponds to the least significant 8 ASCII bytes, the second one to the next most significant 8 ASCII bytes, and the third one to the up to 4 most significant bytes. When the number has 8 digits or fewer, only one of these words is relevant, and when it has 16 or fewer, only the first two are significant. We always waste one 32-bit value that is made entirely of zeroes. The code might look as follows:

auto DIGIT_VALUE_BASE10_8BIT =
    _mm256_set_epi8(1, 10, 1, 10, 1, 10, 1, 10,
                    1, 10, 1, 10, 1, 10, 1, 10,
                    1, 10, 1, 10, 1, 10, 1, 10,
                    1, 10, 1, 10, 1, 10, 1, 10);
auto DIGIT_VALUE_BASE10E2_8BIT = _mm_set_epi8(
    1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100, 1, 100);
auto DIGIT_VALUE_BASE10E4_16BIT =
    _mm_set_epi16(1, 10000, 1, 10000, 1, 10000, 1, 10000);
auto base10e2_16bit =
    _mm256_maddubs_epi16(base10_8bit, DIGIT_VALUE_BASE10_8BIT);
auto base10e2_8bit = _mm256_cvtepi16_epi8(base10e2_16bit);
auto base10e4_16bit =
    _mm_maddubs_epi16(base10e2_8bit, DIGIT_VALUE_BASE10E2_8BIT);
auto base10e8_32bit =
    _mm_madd_epi16(base10e4_16bit, DIGIT_VALUE_BASE10E4_16BIT);

I have implemented this function in C++ and compiled it using GCC 12. I ran the benchmark on an Ice Lake server, using random 32-bit integers for testing. The AVX-512 approach is over twice as fast as the standard approach.

technique speed instructions cycles
AVX-512 1.8 GB/s 57 instructions/number 17 cycles/number
std::from_chars 0.8 GB/s 128 instructions/number 39 cycles/number

The comparison is not entirely fair because the AVX-512 function assumes that it knows the start and the end of the sequence of digits.

You can boost the performance by using an inline function, which brings the speed up to 2.3 GB/s, a 30% performance boost. However, that assumes that you are parsing the numbers in sequence within a loop.

My original code would return fancy std::optional values, but GCC was negatively affected, so I changed my function signatures to be more conventional. Notably, LLVM/Clang is slightly faster than GCC in my tests.

Credit: The original code and the problem were suggested to me by John Keiser. My code is largely derivative of his code.

Transcoding Unicode strings at crazy speeds with AVX-512

In software, we store strings of text as arrays of bytes in memory using one of the Unicode Transformation Formats (UTF), the most popular being UTF-8 and UTF-16. Windows, Java, C# and other common languages and systems default to UTF-16, whereas other systems and most of the web rely on UTF-8. There are benefits to both formats and we are not going to adopt one over the other any time soon. This means that we must constantly convert our strings from UTF-8 to UTF-16 and back.

It is so important that IBM mainframes based on z/Architecture provide special-purpose instructions named “CONVERT UTF-8 TO UTF-16” and “CONVERT UTF-16 TO UTF-8” for translation between the two encodings. By virtue of being implemented in hardware, these instructions exceed 10 GiB/s in processing speed for typical inputs. The rest of us do not have access to powerful mainframes, but our processors have single-instruction-multiple-data (SIMD) instructions. These SIMD instructions operate on larger registers (128 bits, 256 bits, 512 bits) representing vectors of numbers. Starting with the AMD Zen 4 family of processors, and with recent Intel server processors (i.e., Ice Lake or better), we have powerful SIMD instructions that can operate over 512 bits of data (AVX-512). The width (512 bits) is not the most interesting thing about these instructions: they are also distinctly more powerful than prior SIMD instruction sets.

We have a software library for this purpose called simdutf. The library automatically detects your processor and selects a ‘kernel’ of functions that is most appropriate. It supports a wide range of instruction sets (ARM NEON, SSE2, AVX2), but until last year, it did not support AVX-512.

So, how quickly can you convert (or ‘transcode’) strings using AVX-512? It is the subject of a new paper: Transcoding unicode characters with AVX-512 instructions (Software: Practice and Experience, to appear). A reference point is the popular ICU library that is used almost universally for this task. It is mature and well optimized.

It is fairly easy to optimize the performance of transcoding for simple cases (e.g., English), but much more challenging for languages such as Arabic… and the most difficult case is probably streams of Emojis. The following graphic indicates our performance in gigabytes of input per second, comparing against our fast AVX2 kernel and ICU for difficult cases: plot A is for UTF-8 to UTF-16 transcoding and plot B is for UTF-16 to UTF-8 transcoding. In these experiments, we use an Ice Lake processor and the Clang 14 C++ compiler from LLVM.

 

Using AVX-512 on recent Intel and AMD processors, we exceed 4 GB/s on Chinese and Emoji inputs and almost reach 8 GB/s on Arabic text when transcoding from UTF-8. When transcoding from UTF-16, we exceed 5 GB/s on Chinese and Emoji text and break the 20 GB/s barrier on Arabic text.

I also like to report the speed in ‘gigacharacters per second’: billions of characters processed per second. It is less impressive than gigabytes per second because characters can span multiple bytes, but with gigacharacters per second, we can directly compare UTF-8 to UTF-16 transcoding with UTF-16 to UTF-8 transcoding. We find it much easier to transcode from UTF-16, since it is a simpler format, than to transcode from UTF-8. The benefits compared to ICU depend on the data source, but our simdutf library can be several times faster, as the following tables show.

UTF-8 to UTF-16 (gigacharacters per second):

ICU AVX-512
Arabic 0.80 4.3
Chinese 0.50 1.8
Emoji 0.22 1.0
Hebrew 0.80 4.3
Hindi 0.43 1.7
Japanese 0.51 1.7
Korean 0.62 1.8
Latin 1.5 20.
Russian 0.46 4.2

UTF-16 to UTF-8 (gigacharacters per second):

ICU AVX-512
Arabic 0.67 11.
Chinese 0.36 3.9
Emoji 0.27 1.6
Hebrew 0.68 11.
Hindi 0.21 3.8
Japanese 0.37 3.8
Korean 0.37 3.8
Latin 0.91 20.
Russian 0.23 11.

Older Intel processors that supported AVX-512 raised serious concerns due to various forms of ‘frequency throttling’: whenever you used these instructions, they could trigger a lowering of the processor frequency. For this reason, we do not use AVX-512 instructions on these older processors, preferring to fall back on AVX instructions. There is no frequency throttling due to AVX-512 instructions on AMD Zen 4 or recent Intel processors.

Our functions have already been used in production systems for nearly a year as part of the simdutf library. The library includes an extensive collection of tests and benchmarks. It is part of several important projects such as Node.js, Bun, Oracle graaljs, and it is used by Couchbase and others. The simdutf library is a joint community effort, with multiple important contributors (e.g., Wojciech Muła, Yagiz Nizipli and so forth).

We are currently in the process of adding Latin 1 support to the library. In the future, I hope that we might add SVE/SVE2 support for the new generation of ARM processors.

Credit: The algorithmic design of the new AVX-512 functions should be credited to the always brilliant Robert Clausecker.

Locating ‘identifiers’ quickly (ARM NEON edition)

A common problem in parsing is that you want to find all identifiers (e.g., variable names, function names) in a document quickly. There are typically some fixed rules. For example, it is common to allow ASCII letters and digits as well as characters like ‘_’ in the identifier, but to forbid some characters at the beginning of the identifier (such as digits). E.g., ab123 is an identifier but 123ab might not be.

An efficient way to proceed is to use a 256-element table where allowed leading characters have a set value (say 255), non-identifier characters have the value 0, and all other characters have a non-zero, non-255 value.

In C, you might count identifiers using the following routine:

while (source < end) {
  uint8_t c = identifier_map[*source++];
  if(c) {
    count += (c == 255);
    while (source < end && identifier_map[*source]) {
      source++;
    }
  }
}
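
For reference, here is a minimal sketch of how such a 256-element table might be built. This is my own illustration (the names identifier_map and build_identifier_map are assumptions, not the benchmark's code), assuming C-style identifiers where ASCII letters and ‘_’ may lead, and digits are allowed only after the first character:

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

static uint8_t identifier_map[256];

// 255 marks characters that may start an identifier, 1 marks characters
// allowed only after the first position, 0 marks everything else.
static void build_identifier_map(void) {
  for (int c = 0; c < 256; c++) {
    if (isalpha(c) || c == '_') {
      identifier_map[c] = 255; // may lead an identifier
    } else if (isdigit(c)) {
      identifier_map[c] = 1; // allowed, but not in the leading position
    } else {
      identifier_map[c] = 0; // not an identifier character
    }
  }
}
```

With such a table, the routine above counts ab123 as one identifier, while the digits of 123ab do not start a new one.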

Can you do better?

Suppose that you have an ARM-based machine (such as a recent MacBook). Then you have access to Single instruction, multiple data (SIMD) instructions which can process up to 16 bytes at a time. ARM has several types of SIMD instructions, but ARM NEON is the most well known.

We can apply vectorized classification to the problem (see Parsing Gigabytes of JSON per Second, The VLDB Journal, 28(6), 2019). I am not going to review the technique in detail, but the gist of it is that we replace the table lookup with a vectorized (or SIMD-based) table lookup. The relevant part of the code, using ARM intrinsic functions, looks as follows:

uint8x16_t low_nibble_mask = (uint8x16_t){
    42, 62, 62, 62, 62, 62, 62, 62, 62, 62, 60, 20, 20, 20, 20, 21};
uint8x16_t high_nibble_mask =
    (uint8x16_t){0, 0, 0, 2, 16, 33, 4, 8, 0, 0, 0, 0, 0, 0, 0, 0};
uint8x16_t chars = vld1q_u8((const uint8_t *)source);
uint8x16_t low =
    vqtbl1q_u8(low_nibble_mask, vandq_u8(chars, vdupq_n_u8(0xf)));
uint8x16_t high = vqtbl1q_u8(high_nibble_mask, vshrq_n_u8(chars, 4));
uint8x16_t v = vtstq_u8(low, high);
uint8x16_t v_no_number = vtstq_u8(low, vandq_u8(high, vdupq_n_u8(61)));

The register v contains a non-zero byte value wherever there is an identifier character, and the register v_no_number contains a non-zero byte value wherever there is a non-digit identifier character. We can map these SIMD registers to bitmaps from which we can identify the locations of the identifiers using regular C code.

We need to do a few logical operations and bit shifts. Counting the number of identifiers is as simple as the following:

while (source <= end16) {
  auto m = identifier_neon(source);
  uint16_t mask =
   m.leading_identifier_mask & ~(m.identifier_mask << 1 | lastbit);
  count += popcount(mask);
  lastbit = m.identifier_mask >> 15;
  source += 16;
}

ARM NEON is not particularly powerful, at least not compared with AVX-512 on Intel Ice Lake or AMD Zen 4 processors, but it is still quite good by historical standards, and the Apple implementation is a good one. You can extend this approach to process 64 bytes at a time.

So how does the SIMD approach compare to the conventional approach? I am using a script that generates a large document. We want to count the number of identifiers as quickly as possible. The file is 10 MB and it contains half a million identifiers.

technique speed time
SIMD (simdjson-like, 16 bytes) 10 GB/s 2 ns/identifier
SIMD (simdjson-like, 64 bytes) 17 GB/s 1.1 ns/identifier
conventional (table-based) 0.5 GB/s 25 ns/identifier

So the SIMD-based approach is over 10 times faster in this instance. My Apple M2 processor runs at 3.6 GHz. Using the SIMD-based code, it retires over 5 instructions per cycle. The conventional table-based approach retires far fewer instructions per cycle, and it is bound by load latencies.

For context, my MacBook has a disk that can read at well over 2 GB/s. So the conventional routine is not fast by comparison.

On AVX-512, it should be possible to reach even higher speeds.

My source code is available.

Credit: This problem was suggested to me by Kevin Newton (Shopify). He provided the test file. The 64-byte implementation was provided by reader Perforated Blob.

Science and Technology links (September 2 2023)

  1. Physicists have published a paper with 5154 authors. The list of authors takes up 24 of the paper's 33 pages. The lesson is that if someone tells you that they have published an important paper, you should ask how many authors there were and what their exact role was.
  2. Vegetarians are at higher risk for hip fracture.
  3. Virgin Galactic brings private customers to space.
  4. Quercetin, a commonly available supplement, appears to reduce blood pressure and improve the lipid profile in human beings. (Quercetin is relatively expensive in Canada.)
  5. Scientists increasingly write with certitude: they avoid hedging words such as might or probably. It might be part of an effort to promote their research. In effect, science writing might be adapting strategies from marketing.
  6. I have long accepted the fact that if you encourage students to think that they can grow their abilities by hard work, they will get better results. Macnamara and Burgoyne find that this result may not hold:

    We conclude that apparent effects of growth mindset interventions on academic achievement are likely attributable to inadequate study design, reporting flaws, and bias.

  7. Heterochronic parabiosis is the process by which you connect the blood vessels of old and young animals (typically mice). Researchers ran a heterochronic parabiosis experiment over a long period of time (3 months) and found that the old mice that had benefited from heterochronic parabiosis lived longer than control old mice. Furthermore, the treated old mice appeared rejuvenated, and the rejuvenation was still visible two months after the treatment. It seems that heterochronic parabiosis made the old mice younger at the level of gene expression. The approach is unlikely to be used on human beings as is: you would not want to connect the blood vessels of young and old human beings for extended periods of time. However, it is a proof of principle. It seems that you can rejuvenate old animals by acting on the composition of their blood.
  8. Researchers used young blood plasma from pigs to seemingly rejuvenate old rats.
  9. A specific compound, AOH1996, appears to selectively kill cancer cells without side effects:

    We report a small molecule inhibitor, AOH1996, of a cancer-associated isoform of PCNA (caPCNA), which notably, almost completely inhibits the growth of xenograft tumors without causing any discernible toxicity to experimental animals. (Gu et al., 2023)

  10. Old people tend to become frail, a process called sarcopenia. Though the most obvious effect is a loss of muscle mass, muscle quality is also decreased. It is possible that the loss of muscle might be a consequence of a loss of nerves in the muscles. Tezze et al. (2023) find that metformin and galantamine treat sarcopenia in old mice. Galantamine has been used to treat some of the side effects of Alzheimer’s whereas metformin is routinely used to treat diabetes.
  11. It seems that a significant fraction (e.g., 40%) of the measured warming in the recent decades could be explained by the urban effect: thermometers that were once located in the country become surrounded by a warm city.
  12. Naked mole rats are nearly ageless: they do not age the way we do. Researchers have transferred a gene from naked mole rats to mice and found that they were able to increase the longevity of the mice.
  13. The placebo effect is the phenomenon whereby people who receive a bogus treatment get better, despite the lack of an actual drug. Two doctors (Hróbjartsson and Gøtzsche) suggest that there is not much of a placebo effect outside of a small effect in the context of pain management.
  14. The number of natural disasters is not increasing.
  15. Used coffee grounds make strong concrete.
  16. China restricts how long young people can play video games. Their policies appear to be ineffectual: young people in China still play a lot of video games.
  17. Aging is conserved across mammalian species at select regions of the DNA. Thus if we rejuvenate some animals at the gene expression level, it is credible that it could extend to other animals.
  18. Psychiatrists choosing a treatment for themselves predominantly selected a different treatment than the one recommended to patients by participants asked in the regular recommendation role. Psychiatrists preferred the less invasive options for themselves (i.e. watchful waiting) whereas they recommended the more invasive ones for the patient (i.e. antidepressants and depot injections).

Transcoding Latin 1 strings to UTF-8 strings at 18 GB/s using AVX-512

Though most strings online today follow the Unicode standard (e.g., using UTF-8), the Latin 1 standard is still in widespread use inside some systems (such as browsers), as JavaScript strings are often stored internally as either Latin 1, UTF-8 or UTF-16. Latin 1 captures the first 256 characters from the Unicode standard and represents them as a single byte per character.

In a previous blog post, we examined how we could convert UTF-8 strings to Latin 1. I want to examine the reverse problem, the transcoding of Latin 1 strings to UTF-8. We receive a byte array, and each byte is one character. There is no need for any validation. However, some characters (the ASCII characters) are mapped to one byte whereas all others should be mapped to two bytes.

You can code it in C using code such as this routine:

unsigned char byte = data[pos];
if ((byte & 0x80) == 0) { // if ASCII
  // will generate one UTF-8 byte
  *utf8_output++ = (char)(byte);
  pos++;
} else {
  // will generate two UTF-8 bytes
  *utf8_output++ = (char)((byte >> 6) | 0b11000000);
  *utf8_output++ = (char)((byte & 0b111111) | 0b10000000);
  pos++;
}
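
For clarity, here is the same scalar logic wrapped into a self-contained function. This is a sketch of my own (the function name and buffer convention are assumptions); the output buffer must be able to hold up to twice the input length:

```c
#include <assert.h>
#include <stddef.h>

// Convert 'len' Latin 1 bytes to UTF-8. Returns the number of bytes
// written; the output buffer must have room for up to 2 * len bytes.
size_t latin1_to_utf8(const unsigned char *data, size_t len,
                      char *utf8_output) {
  char *start = utf8_output;
  for (size_t pos = 0; pos < len; pos++) {
    unsigned char byte = data[pos];
    if ((byte & 0x80) == 0) { // ASCII: one UTF-8 byte
      *utf8_output++ = (char)(byte);
    } else { // non-ASCII: two UTF-8 bytes (leading byte 0xc2 or 0xc3)
      *utf8_output++ = (char)((byte >> 6) | 0b11000000);
      *utf8_output++ = (char)((byte & 0b111111) | 0b10000000);
    }
  }
  return (size_t)(utf8_output - start);
}
```

For example, the Latin 1 byte 0xe9 (‘é’) becomes the two UTF-8 bytes 0xc3 0xa9.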

Can we do better using the new AVX-512 instructions available on recent Intel (e.g. Ice Lake) and AMD (e.g., Zen4) processors?

We can try to do it as follows:

  1. Load 32 bytes into a register.
  2. Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  3. Identify the bytes that have their two most significant bits set by comparing with 192.
  4. Cast all bytes to 16-bit words into a 64-byte register.
  5. Add 0xc200 to all 16-bit words, as the most significant byte in a 16-bit word must be either 0xc2 or 0xc3.
  6. Add 0x00c0 to the 16-bit words where the corresponding byte had its two most significant bits set; this moves up a bit value so that the most significant byte may become 0xc3.
  7. Flip the order of the bytes within each 16-bit word since UTF-8 is big endian.
  8. Remove one byte (‘compress’) where we had an ASCII byte.
  9. Store the result (can be between 32 bytes and 64 bytes).

Using Intel intrinsics, the result might look as follows:

__m256i input = _mm256_loadu_si256((__m256i *)(buf + pos));
__mmask32 nonascii = _mm256_movepi8_mask(input);
int output_size = 32 + _popcnt32(nonascii);
uint64_t compmask = ~_pdep_u64(~nonascii, 0x5555555555555555);
__mmask32 sixth =
    _mm256_cmpge_epu8_mask(input, _mm256_set1_epi8(192));
__m512i input16 = _mm512_cvtepu8_epi16(input);
input16 = _mm512_add_epi16(input16, _mm512_set1_epi16(0xc200));
__m512i output16 =
    _mm512_mask_add_epi16(input16, sixth, input16,
                          _mm512_set1_epi16(0x00c0));
output16 = _mm512_shuffle_epi8(output16, byteflip);
__m512i output = _mm512_maskz_compress_epi8(compmask, output16);

Our AVX-512 registers can load and process 64 bytes at a time, but this approach consumes only 32 input bytes (instead of 64 bytes). An anonymous reader has contributed a better approach:

  • Load 64 bytes into a register.
  • Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  • Identify the bytes that have their two most significant bits set by comparing with 192.
  • Create a ‘compression’ mask where ASCII characters correspond to 01 whereas non-ASCII characters correspond to 11: we need two distinct 64-bit masks.
  • We use the vpshldw instruction (instead of the vpmovzxbw instruction) to upscale the bytes to 16-bit values, adding the 0b11000000 leading bits in the process. We adjust for the bytes that have their two most significant bits set. This takes care of the first 32 bytes, assuming we interleaved the bytes.
  • The second part (the next 32 bytes) is handled with bitwise logical operations.
  • We save two blocks of 32 bytes.

 

The code using AVX-512 intrinsics might look as follows:

__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 nonascii = _mm512_movepi8_mask(input);
__mmask64 sixth =
    _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
const uint64_t alternate_bits = UINT64_C(0x5555555555555555);
uint64_t ascii = ~nonascii;
uint64_t maskA = ~_pdep_u64(ascii, alternate_bits);
uint64_t maskB = ~_pdep_u64(ascii >> 32, alternate_bits);
// interleave bytes from top and bottom halves (abcd...ABCD -> aAbBcCdD)
__m512i input_interleaved = _mm512_permutexvar_epi8(
    _mm512_set_epi32(0x3f1f3e1e, 0x3d1d3c1c, 0x3b1b3a1a, 0x39193818,
                     0x37173616, 0x35153414, 0x33133212, 0x31113010,
                     0x2f0f2e0e, 0x2d0d2c0c, 0x2b0b2a0a, 0x29092808,
                     0x27072606, 0x25052404, 0x23032202, 0x21012000),
    input);
// double the size of each byte, and insert the leading byte
__m512i outputA =
    _mm512_shldi_epi16(input_interleaved, _mm512_set1_epi8(-62), 8);
outputA = _mm512_mask_add_epi16(outputA, (__mmask32)sixth, outputA,
                                _mm512_set1_epi16(1 - 0x4000));
__m512i leadingB =
    _mm512_mask_blend_epi16((__mmask32)(sixth >> 32),
                            _mm512_set1_epi16(0x00c2),
                            _mm512_set1_epi16(0x40c3));
__m512i outputB = _mm512_ternarylogic_epi32(
    input_interleaved, leadingB, _mm512_set1_epi16((short)0xff00),
    (240 & 170) ^ 204); // (input_interleaved & 0xff00) ^ leadingB
// prune redundant bytes
outputA = _mm512_maskz_compress_epi8(maskA, outputA);
outputB = _mm512_maskz_compress_epi8(maskB, outputB);

It is possible to do even better if you expect that the input is often ASCII or contains few non-ASCII bytes. You can branch on the case where a sequence of 64 bytes is all ASCII and use a fast path in this case: trivially, we just store the 64 bytes we just read, as is.
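
The all-ASCII test itself is cheap even without SIMD. Here is a scalar sketch of my own (not the benchmarked code, which reads the mask of non-ASCII bytes directly from an AVX-512 comparison), checking eight bytes at a time:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Returns 1 if all 'len' bytes are ASCII, i.e., no byte has its most
// significant bit set. We OR the input together, eight bytes at a time,
// and check the high bit of each byte lane at the end.
int is_all_ascii(const uint8_t *data, size_t len) {
  uint64_t bits = 0;
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    memcpy(&word, data + i, 8); // safe unaligned load
    bits |= word;
  }
  for (; i < len; i++) {
    bits |= data[i];
  }
  return (bits & UINT64_C(0x8080808080808080)) == 0;
}
```

When the test succeeds for a block, we can copy the block verbatim to the output.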

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars Wikipedia entry.

technique CPU cycles/byte instructions/byte
conventional 1.7 5.0
AVX-512 (32 bytes) 0.26 0.77
AVX-512 (64 bytes) 0.21 0.54
AVX-512 (64 bytes+branch) 0.17 0.45

On a 3.2 GHz processor, the 32-byte AVX-512 routine reaches 12 GB/s. It is nearly seven times faster than the conventional routine, and it uses six times fewer instructions. The approach consuming 64 bytes at a time exceeds 14 GB/s, and the version with an ASCII branch breaks the 18 GB/s barrier.

This server has memory bandwidth of at least 15 GB/s, but the CPU cache bandwidth is several times higher. There are disks with bandwidths of over 10 GB/s.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

How accurate is the birthday paradox formula?

Given a set of r random values from a large set (of size N), I have been using the formula 1-exp(-r**2/(2N)) to approximate the probability of a collision. It assumes that r is much smaller than N. The formula suggests that if you have hundreds of millions of random 64-bit numbers, you will start getting collisions with non-trivial probabilities, meaning that at least two values will be equal. At the one billion range, the probability of a collision is about 3% according to this formula.

It is somewhat unintuitive because if I give you two random 64-bit values, the probability of a collision is so low that it might as well be zero.

Though it is a textbook formula, we should still test it out to make sure that it is reasonable. Let us generate 32-bit random values for speed. I use a simple frequentist approximation: I generate many sets of 32-bit random values, I count the number of sets with a collision, and I divide this number by the total number of sets.

My results are as follows. The formula agrees with my measurements: I get a maximal relative error of 23%. The exact measured output depends on the random number generation and will vary depending on how you set it up. Nevertheless, it looks good! As you can see, with only 51,200 32-bit random values, the probability of a collision already exceeds 25%. My code is available.

number theory measured relative error
100 0.000001 0.000001 23%
200 0.000005 0.000005 13%
400 0.000019 0.000014 23%
800 0.000075 0.000073 2%
1600 0.000298 0.000254 15%
3200 0.001191 0.001079 9%
6400 0.004757 0.004700 1%
12800 0.018893 0.017570 7%
25600 0.073456 0.071261 3%
51200 0.263006 0.240400 9%

Transcoding UTF-8 strings to Latin 1 strings at 18 GB/s using AVX-512

Most strings online are Unicode strings in the UTF-8 format. Other systems (e.g., Java, Microsoft) might prefer UTF-16. However, Latin 1 is still a common encoding (e.g., within JavaScript runtimes). Its relationship with Unicode is simple: Latin 1 encodes the first 256 Unicode characters. It is rich enough to represent most standard Western European languages. If something is stored in Latin 1, it can be encoded using Unicode. The reverse is obviously false. Nevertheless, let us assume that you have a Unicode string in UTF-8 that you want to quickly transcode to Latin 1.

The transcoding can be done with a short routine in C:

uint8_t leading_byte = data[pos]; // leading byte
if (leading_byte < 0b10000000) {
  // ASCII: one byte, one character
  *latin_output++ = leading_byte;
  pos++;
} else if ((leading_byte & 0b11100000) == 0b11000000) {
  // Two-byte character: the continuation byte (0b10xxxxxx) supplies
  // bit 7 and the six low bits; the leading byte supplies bit 6.
  *latin_output++ = (leading_byte & 1) << 6 | (data[pos + 1]);
  pos += 2;
}

It processes the data one byte at a time. There are two cases: ASCII bytes (one byte, one character) and two-byte characters (one leading byte, one continuation byte).
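The same byte-at-a-time logic can be written as a self-contained Go sketch with basic validation (the function name is mine). Note that the continuation byte is deliberately left unmasked: its high bit supplies bit 7 of the Latin 1 byte, while the low bit of the leading byte supplies bit 6:

```go
package main

import (
	"errors"
	"fmt"
)

// utf8ToLatin1 transcodes a UTF-8 byte slice to Latin 1. There are two
// cases: ASCII bytes and two-byte sequences whose leading byte is 0xC2 or
// 0xC3; any other input cannot be represented in Latin 1.
func utf8ToLatin1(data []byte) ([]byte, error) {
	out := make([]byte, 0, len(data))
	for pos := 0; pos < len(data); {
		b := data[pos]
		switch {
		case b < 0x80: // ASCII: one byte, one character
			out = append(out, b)
			pos++
		case (b == 0xC2 || b == 0xC3) && pos+1 < len(data) &&
			data[pos+1]&0xC0 == 0x80:
			// The continuation byte 0b10xxxxxx already carries bit 7 and
			// the six low bits; the leading byte's low bit becomes bit 6.
			out = append(out, (b&1)<<6|data[pos+1])
			pos += 2
		default:
			return nil, errors.New("not representable in Latin 1")
		}
	}
	return out, nil
}

func main() {
	latin, err := utf8ToLatin1([]byte("café"))
	fmt.Println(latin, err) // 'é' (U+00E9) becomes the single byte 0xE9
}
```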

Can we do better?

We can use Single instruction, multiple data (SIMD) instructions, specifically the advanced SIMD instructions available on recent AMD Zen 4 and Intel Ice Lake processors: AVX-512 with VBMI2.

We can then process the data 64 bytes at a time. Using AVX-512, we load 64 bytes. We identify the locations of the ASCII bytes, the leading bytes and the continuation bytes. These are identified using masks. We then modify the leading bytes, keeping one bit (just the least significant one) and shifting it up by six positions. We then do two compressions: one where we omit the continuation bytes, and one where we omit the newly transformed leading bytes. We then combine the two results with a byte-wise logical OR, and we are done. Using Intel intrinsics, the code might look as follows:

 __m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 ascii = _mm512_cmplt_epu8_mask(input, mask_80808080);
__mmask64 continuation = _mm512_cmplt_epi8_mask(input,
    _mm512_set1_epi8(-64));
__mmask64 leading = _kxnor_mask64(ascii, continuation);
__m512i highbits = _mm512_maskz_add_epi8(leading, input,
    _mm512_set1_epi8(62));
highbits = _mm512_slli_epi16(highbits, 6); // shift in position
input = _mm512_mask_blend_epi8(leading, input, highbits);
__m512i ascii_continuation = _mm512_maskz_compress_epi8(ascii | 
    continuation, input);
__m512i ascii_leading = _mm512_maskz_compress_epi8(ascii | leading,
    input);
__m512i output = _mm512_or_si512(ascii_continuation, ascii_leading);
_mm512_storeu_epi8((__m512i*)latin_output, output);

Of course, we must also validate the input, and it adds some complexity, but not too much.

An anonymous reader points out a significantly faster and simpler approach:

  1. Load 64 bytes.
  2. Identify the leading bytes (non-continuation non-ASCII bytes) with a single comparison.
  3. Test whether the leading bytes have their least significant bit set, and construct a mask with it.
  4. Shift the mask by one position and set the second-to-last bit to 1 in the continuation bytes that are preceded by a leading byte whose least significant bit is set. We can do it by subtracting 0b11000000 because we have that 0b10000000 - 0b11000000 is 0b11000000.
  5. Prune all the leading bytes (non-continuation non-ASCII bytes) and write the result out.

Using Intel intrinsics, the core implementation might be like so:

__m512i input = _mm512_loadu_si512((__m512i *)(buf + pos));
__mmask64 leading = _mm512_cmpge_epu8_mask(input, _mm512_set1_epi8(-64));
__mmask64 bit6 = _mm512_mask_test_epi8_mask(leading, input, _mm512_set1_epi8(1));
input = _mm512_mask_sub_epi8(input, (bit6<<1) | next_bit6, input, _mm512_set1_epi8(-64));
next_bit6 = bit6 >> 63;
__mmask64 retain = ~leading;
__m512i output = _mm512_maskz_compress_epi8(retain, input);
int64_t written_out = _popcnt64(retain);
__mmask64 store_mask = (1ULL << written_out) - 1;
_mm512_mask_storeu_epi8((__m512i *)latin_output, store_mask, output);

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars Wikipedia entry, modified to fit in Latin 1.

technique CPU cycles/byte instructions/byte
conventional 1.9 5.4
AVX-512 0.3 0.75
Faster AVX-512 0.2 0.43

On a 3.2 GHz processor, the AVX-512 routine reaches 12 GB/s. It is about 6 times faster than the conventional routine, and it uses 7 times fewer instructions. The faster AVX-512 routine is 10 times faster than the conventional routine while using 11 times fewer instructions. It reaches 18 GB/s. That is likely faster than the RAM bandwidth, which I estimate to be at least 15 GB/s, though the CPU cache bandwidth is several times higher. Keep in mind that there are disks with a 14 GB/s bandwidth.

This problem illustrates that AVX-512 can really do well on non-trivial string transformations without excessive cleverness.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages: I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.

Coding of domain names to wire format at gigabytes per second

When you enter in your browser the domain name lemire.me, it eventually gets encoded into a so-called wire format. The name lemire.me contains two labels, one of length 6 (lemire) and one of length 2 (me). The wire format starts with 6lemire2me: that is, imagining that the name starts with an imaginary dot, all dots are replaced by the length (in bytes) of the upcoming label. The lengths are stored as byte values, not ASCII digits. There is also a final zero byte. RFC 1035 specifies the format:

Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. The high order two bits of every length octet must be zero, and the remaining six bits of the length field limit the label to 63 octets or less. To simplify implementations, the total length of a domain name (i.e., label octets and label length octets) is restricted to 255 octets or less.

I find the problem interesting because the wire format is an “indexed” format: you can easily iterate over the labels, jumping in constant time over them, knowing the exact length, and so forth. Thus, coding strings into the wire format is a form of indexing.
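To illustrate this indexed nature, here is a Go sketch (function names are mine) that encodes a name into wire format and then iterates over the labels, jumping from length byte to length byte in constant time per label:

```go
package main

import "fmt"

// toWireFormat encodes a domain name: each label is preceded by its length
// in bytes (stored as a byte value), and a final zero byte ends the name.
func toWireFormat(name string) []byte {
	out := []byte{}
	start := 0
	for i := 0; i <= len(name); i++ {
		if i == len(name) || name[i] == '.' {
			out = append(out, byte(i-start))
			out = append(out, name[start:i]...)
			start = i + 1
		}
	}
	return append(out, 0) // final zero byte (the root label)
}

// labels walks the wire format, skipping over each label in constant time
// using the length byte, until the terminating zero byte.
func labels(wire []byte) []string {
	var result []string
	for pos := 0; wire[pos] != 0; pos += int(wire[pos]) + 1 {
		result = append(result, string(wire[pos+1:pos+1+int(wire[pos])]))
	}
	return result
}

func main() {
	wire := toWireFormat("lemire.me") // length-prefixed bytes, ending with 0
	fmt.Println(labels(wire))
}
```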

How quickly can we do it?

You can use conventional code to convert the name to the wire format. You copy non-dot characters and when you encounter a dot, you write the distance to the previous dot (imagining that there is a virtual first dot). In C, it might look as follows (omitting the final zero byte):

 char *src = "lemire.me";
 uint8_t *dst = ...; // buffer
 uint8_t *counter = dst++;
 do {
   while (is_name_character(*src)) {
    *dst++ = (uint8_t)*src; src++;
  }
  *counter = (uint8_t)(dst - counter - 1);
  counter = dst++;
  if (*src != '.') { break; }
  src++;
 } while (true);

Can you do better?

You can use Single instruction, multiple data (SIMD) instructions. You load 16, 32 or 64 characters into a wide register. You locate the dots. You construct a new register where the dots are replaced by position indexes and everything else is zeroed. E.g., starting with

.a.bcde.

You mark the location of the dots:

0xff 0 0xff 0 0 0 0 0xff

You compute the byte-wise AND with the following constant:

0 1 2 3 4 5 6 7

and you get an array where the dots have been replaced by indexes…

0 0 2 0 0 0 0 7

You then do a prefix computation, propagating each dot's index to the left (in the actual code, the indexes are offset by -128 so that a byte-wise minimum always prefers the nearest dot index over the zeroed bytes):

0 2 2 7 7 7 7 7

This can be done using a logarithmic algorithm: shifting the values by 1 byte and taking the byte-wise minimum, then shifting by 2 bytes and taking the minimum, and so forth.

Finally, we can shift by one, subtract, and subtract one more to account for the dot itself. Masking off the result so that we only keep the values at the dot positions, we get the desired label lengths:

1 - 4 - - - - 0

You can then blend the results with the original input:

1 a 4 b c d e 0
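Putting the steps together, here is a scalar Go model of the whole transform (including the final minus-one that the intrinsics code applies with a saturating subtraction; the function name is mine):

```go
package main

import "fmt"

// dotsToCounters replaces each dot by the length of the label that follows
// it (the distance to the next dot, minus one), or by zero when no further
// dot exists in the block, mimicking the SIMD prefix-minimum transform.
func dotsToCounters(input []byte) []byte {
	out := append([]byte{}, input...)
	next := len(input) // position of the nearest dot seen so far
	haveNext := false
	for i := len(input) - 1; i >= 0; i-- { // scan right to left
		if input[i] == '.' {
			if haveNext {
				out[i] = byte(next - i - 1)
			} else {
				out[i] = 0 // no following dot in this block
			}
			next, haveNext = i, true
		}
	}
	return out
}

func main() {
	fmt.Println(dotsToCounters([]byte(".a.bcde.")))
}
```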

If you use Intel intrinsics, the code might look like this… It is a bit technical so I am not going to explain it further, but the idea is as I explained…

__m128i dot_to_counters(__m128i input) {
  __m128i zero = _mm_setzero_si128();
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
  __m128i sequential =
_mm_setr_epi8(-128, -127, -126, -125, -124, -123, -122, -121, -120, -119,
-118, -117, -116, -115, -114, -113);
  __m128i dotscounts = _mm_and_si128(dots, sequential);
  __m128i origdotscounts = dotscounts;
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 1),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 2),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 4),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 8),
    dotscounts);
  __m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
  dotscounts = _mm_subs_epu8(next, origdotscounts);
  dotscounts = _mm_subs_epu8(dotscounts, _mm_set1_epi8(1));
  return _mm_blendv_epi8(input, dotscounts, dots);
}

I call this strategy “Prefix-Minimum”. It is essentially data independent: no matter where the dots are, the code is always the same. Processors like that a lot!

Prefix-Minimum will be fast, especially if you use wide registers (e.g., 32 bytes with AVX2 instructions). However, it does not offer a direct path to validating the inputs: are the counters in the proper range? Do you have overly long labels?

If you need to reason about where the dots are, the best way is probably to build an index in the spirit of the simdjson library (our original paper is available as PDF). For example, you can load 64 bytes of data and construct a 64-bit word where each bit corresponds to a byte of the input: the bit is set to 1 if the corresponding byte is a dot (‘.’). You can then iterate through the set bits, as with a bitset. The construction of the 64-bit word might look as follows using Intel intrinsics and AVX2:

__m256i input1 = _mm256_loadu_si256((const __m256i *)src);
__m256i input2 = _mm256_loadu_si256((const __m256i *)(src + 32));
uint32_t dots1 = (uint32_t)_mm256_movemask_epi8(
_mm256_cmpeq_epi8(input1, _mm256_set1_epi8('.')));
uint32_t dots2 = (uint32_t)_mm256_movemask_epi8(
_mm256_cmpeq_epi8(input2, _mm256_set1_epi8('.')));
uint64_t DOTS = (((uint64_t)dots2) << 32) | dots1;
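The bitset iteration itself can be modeled in portable Go (building the 64-bit word with a scalar loop rather than AVX2; names are mine):

```go
package main

import (
	"fmt"
	"math/bits"
)

// dotBitmap builds a 64-bit word where bit i is set when input[i] is a dot.
// It assumes len(input) <= 64.
func dotBitmap(input []byte) uint64 {
	var word uint64
	for i, c := range input {
		if c == '.' {
			word |= uint64(1) << i
		}
	}
	return word
}

func main() {
	word := dotBitmap([]byte("lemire.me"))
	// Iterate through the set bits as a bitset: find the lowest set bit,
	// then clear it with w &= w - 1.
	for w := word; w != 0; w &= w - 1 {
		fmt.Println("dot at position", bits.TrailingZeros64(w))
	}
}
```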

I implemented different strategies using AVX2 intrinsics. We can use a database of popular domain names for testing. Results vary depending on the system. I use GCC 11 and an Ice Lake server.

technique CPU cycles/string instructions/string
conventional 150 240
Prefix-Minimum (256 bits) 44 77
simdjson-like (256 bits) 56 77

My source code is available.

The Prefix-Minimum approach is fastest. Observe how the Prefix-Minimum approach is faster than the simdjson-like approach, despite executing the same number of instructions. That is because it is able to retire more instructions per cycle. The downside of the Prefix-Minimum approach is that it does not validate the label length.

The simdjson-like approach does full validation and is still quite fast: over three times faster than the conventional approach. It is sufficient to break the gigabyte per second barrier (1.7 GB/s in these experiments).

Being able to validate the counter values with the Prefix-Minimum without adding too much overhead would be fantastic: it might be possible to get the best speed without compromising on the validation. It remains an open problem whether it is possible.

I suspect that my implementations can be improved by at least 20%.

Future work should consider AVX-512. I have the sketch of an implementation using AVX-512 but it is not faster. Porting to ARM NEON would be interesting.

Credit: The work was motivated by the simdzone project by NLnetLabs.

Science and Technology links (August 6 2023)

  1. In an extensive study, You et al. (2022) found that meat consumption was correlated with higher life expectancies:

    Meat intake is positively correlated with life expectancies. This relationship remained significant when influences of caloric intake, urbanization, obesity, education and carbohydrate crops were statistically controlled. Stepwise linear regression selected meat intake, not carbohydrate crops, as one of the significant predictors of life expectancy. In contrast, carbohydrate crops showed weak and negative correlation with life expectancy.

  2. Tupy and Deutsch write:

    The world’s population has increased eightfold since 1800, and standards of living have never been higher. Despite increases in consumption, and contrary to the prophecies of generations of Malthusians, the world hasn’t run out of a single metal or mineral. In fact, resources have generally grown cheaper relative to income over the past two centuries. Even on the largest cosmic scale, resources may well be limitless.

  3. Though the United States has one of the fastest-growing economies, its growth might have been limited since the 1970s by unequal rates of innovation among industries. That is, economic progress might be held back by bottlenecks.
  4. In the past, people built extensive cities underground, where tens of thousands of people lived, protected from invaders.
  5. In some cases, workers are less productive when working from home. The productivity of workers randomly assigned to working from home is 18% lower than those in the office. Workers who prefer home work are substantially less productive at home than at the office (27% less compared to 13% less for workers who prefer the office).
  6. Law students are optimistic: 95% believed they would end up in the top half of the class.
  7. HMB, a common muscle-building supplement, might also protect your brain.
  8. Antibiotics are associated with inflammatory bowel disease.
  9. Statins, a commonly prescribed class of drugs, might increase insulin resistance and cause diabetes.
  10. Artificial intelligence (deep reinforcement learning) has improved the speed of some data sorting algorithms, in practice. Meanwhile a simple algorithm based on zip compression can beat fancy artificial intelligence techniques at text classification.
  11. A cocktail based on metformin (anti-diabetes drug) and leucine (a common supplement) can protect against muscle atrophy.

Decoding base16 sequences quickly

We sometimes represent binary data using the hexadecimal notation. We use a base-16 representation where the first 10 digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and where the remaining six digits are A, B, C, D, E, F (or a, b, c, d, e, f). Thus each character represents 4 bits. A pair of characters can represent any byte value (8 bits).

In Parsing short hexadecimal strings efficiently, I examined the problem of decoding short sequences (e.g., 4 characters). A conventional parser can be based on table lookups, where unexpected characters are mapped to a special value (e.g., -1) but where the characters 0 to 9 are mapped to 0 to 9, and A to F are mapped to 10 to 15. A conventional decoder can rely on a simple routine such as…

int8_t n1 = digittoval[src[0]];
int8_t n2 = digittoval[src[1]];
src += 2;
*dst = (uint8_t)((n1 << 4) | n2);
dst += 1;
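The same table-based strategy can be sketched in self-contained Go, including the table construction that the C snippet assumes (names are mine):

```go
package main

import (
	"errors"
	"fmt"
)

// digitToVal maps hexadecimal characters to their 4-bit value; every other
// byte value maps to -1, the special error code.
var digitToVal = func() [256]int8 {
	var t [256]int8
	for i := range t {
		t[i] = -1
	}
	for c := byte('0'); c <= '9'; c++ {
		t[c] = int8(c - '0')
	}
	for c := byte('a'); c <= 'f'; c++ {
		t[c] = int8(c - 'a' + 10)
		t[c-32] = int8(c - 'a' + 10) // 'A' to 'F'
	}
	return t
}()

// decodeBase16 decodes pairs of hexadecimal characters into bytes.
func decodeBase16(src []byte) ([]byte, error) {
	if len(src)%2 != 0 {
		return nil, errors.New("odd-length input")
	}
	dst := make([]byte, len(src)/2)
	for i := range dst {
		n1, n2 := digitToVal[src[2*i]], digitToVal[src[2*i+1]]
		if n1 < 0 || n2 < 0 {
			return nil, errors.New("invalid hexadecimal character")
		}
		dst[i] = byte(n1)<<4 | byte(n2)
	}
	return dst, nil
}

func main() {
	out, err := decodeBase16([]byte("48656c6c6f"))
	fmt.Println(string(out), err) // prints "Hello <nil>"
}
```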

Can we go faster? In Parsing hex numbers with validation, Langdale and Muła considered the question. They use SIMD instructions. They get good results, but let us revisit the question.

I am considering the problem of parsing sequences of 56 characters (representing 28 bytes). I want to compare the best approach described by Langdale and Muła with a straight adaptation of our base32 decoder. Generally speaking, the main difference between our approach and the Langdale-Muła approach is in the validation. The Langdale-Muła approach does straightforward arithmetic and comparisons, whereas we favour vectorized classification (e.g., we use vectorized table lookups).

Our main routine to decode 16 characters goes as follows:

  1. Load 16 bytes into a vector register.
  2. Subtract 1 from all character values.
  3. Using a 4-bit shift and a mask, select the most significant 4 bits of each byte value.
  4. Do a vectorized table lookup using the most significant 4 bits, and add the result to the value. This inexpensive computation can detect any non-hexadecimal character: any such character would map to a byte value with the most significant bit set. We later branch out based on a movemask which detects these bytes.
  5. We do a similar computation, a vectorized table lookup using the most significant 4 bits followed by an addition, to map each character to a 4-bit value (between 0 and 15).
  6. We use a multiply-add and a pack instruction to pack the 4-bit values, going from 16 bytes to 8 bytes.
  7. We write the 8 bytes to the output.

Using Intel instructions, the hot loop looks as follows:

  __m128i v = _mm_loadu_si128((__m128i *)src);
  __m128i vm1 = _mm_add_epi8(v, _mm_set1_epi8(-1));
  __m128i hash_key =
  _mm_and_si128(_mm_srli_epi32(vm1, 4), _mm_set1_epi8(0x0F));
 __m128i check = _mm_add_epi8(_mm_shuffle_epi8(delta_check, hash_key), vm1);
  v = _mm_add_epi8(vm1, _mm_shuffle_epi8(delta_rebase, hash_key));
  unsigned int m = (unsigned)_mm_movemask_epi8(check);
  if (m) {
   // error
  }
__m128i t3 = _mm_maddubs_epi16(v, _mm_set1_epi16(0x0110));
__m128i t5 = _mm_packus_epi16(t3, t3);
_mm_storeu_si128((__m128i *)dst, t5);

You can do the equivalent with 256-bit (and even 512-bit) registers if you want. I refer the interested reader to our paper Faster Base64 Encoding and Decoding using AVX2 Instructions.

Your results will vary depending on the system. I use GCC 11 and an Ice Lake server. The inputs are rather small and there is overhead due to the function calls so it is not an ideal case for fancy algorithms.

technique CPU cycles/string instructions/string
conventional 72 360
Langdale-Muła (128 bits) 24 110
ours (128 bits) 21 88
ours (256 bits) 16 61

My source code is available.

Compared to a conventional decoder, our fast techniques are 3 to 5 times faster.

At least in this code, our approach is slightly faster than the Langdale-Muła approach. However, the difference is small and if you consider different hardware or different compilers, the results might flip. The 256-bit approach is noticeably faster, but if your inputs are small (shorter than 32 characters), then that would not hold.

If you have access to AVX-512, you can assuredly go much faster, but I did not consider this case.

In the real-world, you may need to handle spaces in the encoded sequence. My benchmark does not handle such cases and some careful engineering based on real-world data would be needed to deal efficiently with such inputs at high speed. I nevertheless include a fast decoder that can ignore spaces in the input for demonstration purposes (see source code). It would be easier to prune spaces using AVX-512.

Porting this code to ARM NEON would be easy.

Conclusion: You can decode base16 inputs using about one instruction per input character which is about six times better than a conventional decoder. It seems that you can beat the fast Langdale-Muła decoder, at least on my test system. Your main tool to go faster is to use wider SIMD registers. Future work should consider AVX-512: it might be possible to double the performance.

Credit: The work was motivated by the simdzone project by NLnetLabs. GitHub user @aqrit lent his expertise.

Science and Technology links (July 23 2023)

  1. People increasingly consume ultra processed foods. They include
    energy drinks, mass-produced packaged breads, margarines, cereal, energy bars, fruit yogurts, fruit drinks, vegan meat and cheese, infant formulas, pizza, chicken nuggets, and so forth. Ultra processed foods are correlated with poorer health.
  2. Homo sapiens was not the first human species to venture out of Africa, it was the last.
  3. Researchers claim to be able to reprogram the cells in our eyes to repair retinal degeneration.
  4. Older people are often prescribed statins to keep their risk of cardiovascular disease low. However, statins reduce short-term cognitive performance.
  5. Killifish are animals that have amazing regeneration abilities, being even able to regenerate part of their brain. As they age, this regenerative ability is lost. Killifish, like other animals, tend to accumulate dysfunctional cells with age, called senescent cells. Though having a few senescent cells is not a problem, and may even be necessary, having too many is believed to reduce fitness. Thankfully, we have drugs called senolytics that can eliminate some of the senescent cells. By using these drugs in old killifish, researchers were able to restore some of their amazing regenerative abilities.
  6. There may be a mushroom that is able to increase nerve growth and improve memory, in human beings.
  7. Resistance training might be a viable anti-skin-aging strategy for women (and possibly men).

Fast decoding of base32 strings

We often need to encode binary data into ASCII strings (e.g., email). The standards to do so include base16, base32 and base64.

There are some research papers on fast base64 encoding and decoding: Base64 encoding and decoding at almost the speed of a memory copy and Faster Base64 Encoding and Decoding using AVX2 Instructions.

For the most part, these base64 techniques are applicable to base32. Base32 works in the following manner: you use 8 ASCII characters to encode 5 bytes. Each ASCII character carries 5 bits of information: it can be one of 32 characters. For reference, base64 uses 64 different ASCII characters so each character carries more information. However, base64 requires using both upper-case and lower-case letters as well as other special characters, so it is less portable. Base32 can be case invariant.

There are different variations, but we can consider Base 32 Encoding with Extended Hex Alphabet, which uses the digits 0 to 9 for the values 0 to 9 and the letters A to V for the values from 10 to 31. So each character represents a value between 0 and 31. If required, you can pad the encoding with the ‘=’ character so that its length is divisible by 8 characters. However, that is not always required. Instead, you may simply stop decoding as soon as an out-of-range character is found.

‘0’ 0
‘1’ 1
‘2’ 2
‘4’ 4
‘9’ 9
‘A’ 10
‘V’ 31

A conventional decoder might use branchy code:

if (ch >= '0' && ch <= '9')
  d = ch - '0';
else if (ch >= 'A' && ch <= 'V')
  d = ch - 'A' + 10;
else if (ch >= 'a' && ch <= 'v')
  d = ch - 'a' + 10;
else
  return -1;

Though that’s not going to be fast, it can be convenient. You can do better with tables. For example, you may populate a 256-long table (one for each character byte value) with the values from the table above, and use a special error code when the value is out of range (e.g., 32). That can produce efficient code:

uint64_t r = 0;
for (size_t i = 0; i < 8; i++) {
  uint8_t x = table[*src++];
  if (x > 31) {
    r <<= (5 * (8 - i));
    break;
  }
  r <<= 5;
  r |= x;
}
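For illustration, the whole 8-characters-to-5-bytes decoding step can be written as a self-contained Go sketch (names are mine; the character mapping uses comparisons instead of a table for brevity):

```go
package main

import (
	"errors"
	"fmt"
)

// base32Val maps a character of the extended hex alphabet to its 5-bit
// value, returning -1 for out-of-range characters.
func base32Val(c byte) int {
	switch {
	case c >= '0' && c <= '9':
		return int(c - '0')
	case c >= 'A' && c <= 'V':
		return int(c-'A') + 10
	case c >= 'a' && c <= 'v':
		return int(c-'a') + 10
	}
	return -1
}

// decodeBlock decodes 8 characters into 5 bytes: each character carries
// 5 bits, for 40 bits in total.
func decodeBlock(src []byte) ([5]byte, error) {
	var out [5]byte
	var r uint64
	for i := 0; i < 8; i++ {
		v := base32Val(src[i])
		if v < 0 {
			return out, errors.New("out-of-range character")
		}
		r = r<<5 | uint64(v)
	}
	for i := 0; i < 5; i++ {
		out[i] = byte(r >> (32 - 8*i))
	}
	return out, nil
}

func main() {
	out, err := decodeBlock([]byte("VVVVVVVV")) // every 5-bit group is 31
	fmt.Println(out, err)
}
```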

You can also program a version using SIMD instructions. I am not going to present the code, it is similar to the code described in a base64 paper.

I wrote a benchmark focused on short inputs (32 bytes). My benchmark makes function inlining difficult, thus the function-call overhead and the loading of constants are a non-negligible cost.

I run it on an Ice Lake server, and the code is compiled with GCC11 (targeting an old processor, Haswell). We can use either 128-bit SIMD registers or 256-bit SIMD registers.

technique CPU cycles/string instructions/string
branchy 500 1400
table 43 230
SIMD (128 bits) 15 70
SIMD (256 bits) 13 61

In this instance, the table-based approach is reasonable and is only about three times slower than the SIMD-based approaches. Because I am using small inputs, the 256-bit SIMD code has only marginal benefits, but I expect it would do better over longer strings. The branchy code is terrible performance-wise, but it is more flexible and can easily deal with skipping white space characters and so forth.

My source code is available.

Credit: The work was motivated by the simdzone project by NLnetLabs. The initial base32 decoder implementation was provided by GitHub user @aqrit.

Science and Technology links (July 16 2023)

  1. Most people think that they are more intelligent than average.
  2. Lack of vitamin C may damage the arteries. Make sure you have enough!
  3. A difficult problem in software is caching. Caching is the idea that you keep some values in fast memory. But how do you choose which values to keep? A standard technique is to evict from the cache the least recently used value. Intuitively, you just get rid of the value you have not needed in a long time. However, that’s relatively expensive. Yang et al. find that a simpler technique (first-in first-out) can be just as good: you enter values in a queue and the first value to enter the cache is the first one to be considered for eviction. For best performance, they explain that you should reinsert (rather than evict) values that have been accessed at least once while in the cache, and that you should have a probation period for newly inserted values.
  4. Soybean oil induces obesity, diabetes, insulin resistance, and fatty liver in mice. Unlike other oils (e.g., coconut oil), soybean oils affect gene expression.
  5. Receiving a Nobel Prize might lower your intellectual productivity.
  6. We are currently in the holocene, which started at the end of the last glacial period (about 12,000 years ago). During the first half of the holocene, the Sahara was green, filled with grassland and tropical trees.
  7. Conspiracy theorizing is most prevalent in freer societies (including Australia, Canada, Germany, Japan, the Netherlands and the United Kingdom).
  8. Cancer thrives on sugar. Thus tumors do not grow as much on a low-sugar diet. However, it does not follow that a ketogenic diet is helpful in fighting cancer because the lack of sugar may accelerate frailty and thus hasten death.
  9. Offspring of parents with severe mental illnesses are more at risk for diabetes.
  10. Vitamin D supplements might reduce your risk of myocardial infarction.
  11. If you take away much of the genetic material of a cell, leaving it less fit, it can quickly gain back the fitness by natural selection.
  12. You can tell apart men and women by how they smell.

Recognizing string prefixes with SIMD instructions

Suppose that I give you a long list of string tokens (e.g., “A”, “A6”, “AAAA”, “AFSDB”, “APL”, “CAA”, “CDS”, “CDNSKEY”, “CERT”, “CH”, “CNAME”, “CS”, “CSYNC”, “DHC”, etc.). I give you a pointer inside a much larger string and I ask you whether you are pointing at one of these tokens, and if so, which one. To make things slightly more complicated, you want the token to be followed by a valid separator (e.g., a space, a semi-colon, etc.) and you want to ignore the case (so that “aaaa” matches “AAAA”).

How might you solve this efficiently?

Comparing against each token might do well if you have few of them, but it is clearly a bad idea if you have many (say 70).

A reasonable approach is to do a binary search through the sorted list of tokens. The C function bsearch is well suited. You need a comparison function as part of the implementation. You may use the C function strncasecmp to compare the strings while ignoring the case, and you add a check to make sure that you only return a match (value 0) when the input has a terminating character at the right position.

Then the logarithmic-time (O(log n)) implementation looks like this…

std::string *lookup_symbol(const char *input) {
  return (std::string *)bsearch(input, strings.data(), strings.size(),
    sizeof(std::string), compare);
}

Simple enough.

Another approach is to use a trie. You implement a tree where the first level checks for the first character, the second level for the second character and so forth.

It gets a little bit lengthy, but you can use a script or a library to generate the code. You can use a series of switch/case like so…

switch (s[0]) {
  case 'A': case 'a':
    switch (s[1]) {
      case '6':
        return is_separator(s[2]) ? 1 : -1;
      case 'A': case 'a':
        switch (s[2]) {
          case 'A': case 'a':
            switch (s[3]) {
              case 'A': case 'a':
                return is_separator(s[4]) ? 2 : -1;
              default:
                return -1;
...

The running time complexity depends on the data, but it is at most the length of the longest string in your set. The trie is a tempting solution, but it is branchy: if the processor can predict the upcoming content, it should do well, but if the input is random, you might be unlucky and get poor performance.

We can also use a finite-state machine, which requires a relatively large table, but has really simple execution:

int s = 71;
while (*str && s >= 0) {
  uint8_t tok = char2token[uint8_t(*str)];
  str++;
  s = statetable[32 * s + tok];
}
*token = (uint8_t)s;
return s != 0;

With SIMD instructions, you can write a tight implementation that is effectively branchless: its execution does not depend on the input data.

The code works in this manner:

  1. We load unconditionally 16 bytes in a register.
  2. We first find the location of the first separator, if any. We can do this with vectorized classification. It is a significant cost.
  3. We set to zero all bytes in the register starting from this first separator. We also switch all characters in A-Z to a lower case.
  4. We use a hash function to map the processed bytes to a table containing our tokens in 16-byte blocks. The hash function is designed in such a way that if the input matches one of our tokens, then it is mapped to the table entry holding that token. We can derive the hash function empirically (by trial and error). Computing the hash function is a significant cost so we have to be careful. In this instance, I use a simple function:
    ((((val >> 32) ^ val) & 0xffffffff) * (uint64_t)3523216699) >> 32.
  5. We compare the processed input with the loaded value from the hash function.

The C function written using Intel intrinsic functions is as follows:

bool sse_type(const char *type_string, uint8_t *type) {
  __m128i input = _mm_loadu_si128((__m128i *)type_string);
  __m128i delimiters =
    _mm_setr_epi8(0x00, 0x00, 0x22, 0x00, 0x00, 0x00, 
                  0x00, 0x00, 0x28, 0x09, 0x0a, 0x3b, 
                  0x00, 0x0d, 0x00, 0x00);
  __m128i mask = _mm_setr_epi8(-33, -1, -1, -1, -1, 
                  -1, -1, -1, -1, -33, -1, -1,
                  -1, -1, -1, -1);
  __m128i pattern = _mm_shuffle_epi8(delimiters, input);
  __m128i inputc = _mm_and_si128(input, 
      _mm_shuffle_epi8(mask, input));
  int bitmask = _mm_movemask_epi8(
      _mm_cmpeq_epi8(inputc, pattern));
  uint16_t length = __builtin_ctz(bitmask);
  __m128i zero_mask = _mm_loadu_si128(
       (__m128i *)(zero_masks + 16 - length));
  __m128i inputlc = _mm_or_si128(input, _mm_set1_epi8(0x20));
  input = _mm_andnot_si128(zero_mask, inputlc);
  uint8_t idx = hash((uint64_t)_mm_cvtsi128_si64(input));
  *type = idx;
  __m128i compar = _mm_loadu_si128((__m128i *)buffers[idx]);
  __m128i xorthem = _mm_xor_si128(compar, input);
  return _mm_test_all_zeros(xorthem, xorthem);
}

We expect it to compile to branchfree code, as follows:

sse_type(char const*, unsigned char*):
  movdqu xmm1, XMMWORD PTR [rdi]
  mov edx, 16
  movdqa xmm2, XMMWORD PTR .LC0[rip]
  movdqa xmm0, XMMWORD PTR .LC1[rip]
  pshufb xmm2, xmm1
  pshufb xmm0, xmm1
  pand xmm0, xmm1
  pcmpeqb xmm0, xmm2
  por xmm1, XMMWORD PTR .LC2[rip]
  pmovmskb eax, xmm0
  bsf eax, eax
  cdqe
  sub rdx, rax
  movdqu xmm0, XMMWORD PTR zero_masks[rdx]
  pandn xmm0, xmm1
  movq rax, xmm0
  movq rdx, xmm0
  shr rax, 32
  xor eax, edx
  mov edx, 3523216699
  imul rax, rdx
  shr rax, 32
  mov BYTE PTR [rsi], al
  movzx eax, al
  sal rax, 4
  pxor xmm0, XMMWORD PTR buffers[rax]
  ptest xmm0, xmm0
  sete al
  ret

I wrote a benchmark where we repeatedly try to check for matching tokens, using many thousands random tokens (enough to prevent the processor from having trivial branch prediction). I run it on an Ice Lake server, and the code is compiled with GCC11 (targeting an old processor, Westmere).

technique CPU cycles/string instructions/string
binary search 236 335
trie 71 39
finite state 42 61
SIMD 15 39

In this particular test, the SIMD-based approach is four times faster than the trie despite the fact that it retires as many instructions. The trie struggles with branch mispredictions. The SIMD-based approach has a relatively high number of instructions retired per cycle (2.5). The binary search is disastrous in this case, being more than 10 times slower. The finite-state approach is interesting: it is only three times slower than the SIMD-based approach and significantly faster than the other non-SIMD approaches. It uses nearly twice as many instructions as the trie, but it is nearly twice as fast. However, the finite-state approach requires a relatively large table, larger than the alternatives.

The trie can match the SIMD-based approach when the input is predictable. I can simulate this scenario by repeatedly matching a small number of tokens (say 100), always in the same order. In this easy case, the trie is just as fast.

My code is available. It could be easily ported to ARM NEON or to AVX-512.

Credit: The problem was suggested to me by Jeroen Koekkoek from NLnet Labs. He sketched part of the solution. I want to thank GitHub user @aqrit for their comments.

Note: The problem considered in this blog post is not the recognition of strings from a set. I have a blog post on that other topic: Quickly checking that a string belongs to a small set.

Further reading: Is Prefix Of String In Table? by Nelson and Modern perfect hashing for strings by Muła.

Stealth, not secrecy

The strategy for winning is simple: do good work and tell the world about it. In that order! This implies some level of stealth as you are doing the good work.

If you plan to lose weight, don’t announce it… lose the weight and then do the reveal.

Early feedback frames the problem and might derail you. You can end up feeling stuck in your path and it may increase your risk of failure.

If you want to feel free to let go of bad ideas, do not commit to them publicly.

Or you may feel like you already succeeded and take the “do good work” for granted: it is called hubris and smart people are susceptible to it.

The timing of the first reveal matters because people are lazy. If you reveal something that is flawed, it will be hard to correct this impression later. You are better off waiting a bit longer and making sure you present good work. Furthermore, announcing your next product early can drain interest in your current products. Steve Jobs once said: “It’s really simple, if we tell people what our next product is, they stop buying our current products.”

But the reverse does not apply: the most secretive people are not those who have genuine good work upcoming. For every Apple, you have hundreds of companies that have nothing worth stealing. And they will make sure you never find out.1, 2, 3

“Don’t worry about people stealing your ideas. If your ideas are any good, you’ll have to ram them down people’s throats.” Howard Aiken.

1. Theranos had an insane culture of secrecy. The company lasted 15 years. It was once valued at $10 billion. The company only failed after John Ioannidis famously questioned the wild claims of the company while the rest of the medical community just went along. It turns out that there was no innovation, no technology worth having. The company was soon worth $0.

2. Apple is successful and secretive. But Apple is secretive about features that their competitors already have.

3. In Surprisingly Small: The Effect of Trade Secret Breaches on Firm Performance, Searl and Vivian (2021) find that trade secret breaches have little effect on a firm’s performance. They write:

For policy makers, the implication is the narrative of trade secret theft as a fundamental threat to a country’s economy and innovation (in our case, the US), may simply be rhetoric when contrasted to the market’s understanding of such theft. Therefore, calls for the expansion of trade secrecy protections, such as the expansion of its definition or further criminalisation of trade secret theft, may do less to protect innovation overall and instead expand negative externalities such as increased litigation and constraints on labour mobility.
For managers, the implication is that the benefits of the protection of trade secrets may be overstated. Counterintuitively, the findings suggest managers should not prioritise trade secret protections and cybersecurity if the main goal is protecting shareholders.

 

Packing a string of digits into an integer quickly

Suppose that I give you a short string of digits, containing possibly spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the string matches the ordering of the integers.

We can use the fact that in ASCII, the digits have byte values 0x30, 0x31 and so forth. Thus as a sequence of bytes, the string "20141103 012910" is actually 0x32, 0x30, 0x31… so we must select the least significant 4 bits of each byte, discarding the rest. Intel and AMD processors have a convenient instruction which allows us to select any bits from a word to construct a new word (pext).
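Before reaching for pext, it may help to see the transformation in portable code. The following sketch (my own illustration, not the optimized routine) keeps the low nibble of each digit byte and packs the digits into a 64-bit integer, most significant digit first, skipping any other character:

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>

// Portable illustration: in ASCII, '0' is 0x30, '1' is 0x31, and so on,
// so the low nibble of each digit byte is the digit's value. We pack
// these nibbles into a 64-bit integer, most significant digit first.
uint64_t pack_digits_scalar(const char *s, size_t len) {
  uint64_t result = 0;
  for (size_t i = 0; i < len; i++) {
    unsigned char c = (unsigned char)s[i];
    if (isdigit(c)) {
      result = (result << 4) | (c & 0x0f); // keep the low 4 bits
    }
  }
  return result;
}
```

With the 15-character example above, pack_digits_scalar("20141103 012910", 15) yields 0x20141103012910; the pext version below does the same selection in a couple of instructions.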

A problem remains: Intel and AMD processors are little endian, which means that if I load the string into a register, the first byte becomes the least significant, not the most significant. Thankfully, we can reverse the byte order cheaply: the compiler can even fold the byte swap into the load itself (the movbe instruction).
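To see the byte-order issue concretely, here is a small sketch for a little-endian machine, using the GCC/Clang builtin __builtin_bswap64 (an assumption; MSVC has _byteswap_uint64 instead):

```cpp
#include <cstdint>
#include <cstring>

// On a little-endian machine, a plain 8-byte load puts the first byte of
// the string in the least significant position. Reversing the bytes gives
// the big-endian value where the first character is most significant.
uint64_t load_bigendian(const char *c) {
  uint64_t w;
  memcpy(&w, c, sizeof(w)); // safe unaligned load
  return __builtin_bswap64(w); // GCC/Clang builtin
}
```

Loading "20141103" this way produces 0x3230313431313033: the byte 0x32 (the character '2') is now the most significant.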

In C, the desired function using Intel intrinsics might look like this:

#include <x86intrin.h> // Windows: <intrin.h>
#include <stdint.h>
#include <string.h>

// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char* c) {
  uint64_t part1, part2;
  memcpy(&part1, c, sizeof(uint64_t));
  memcpy(&part2 , c + 7, sizeof(uint64_t));
  part1 = _bswap64(part1);
  part2 = _bswap64(part2);
  part1 = _pext_u64(part1, 0x0f0f0f0f0f0f0f0f);
  part2 = _pext_u64(part2, 0x0f000f0f0f0f0f0f);
  return (part1<<24) | (part2);
}

It compiles to relatively few instructions: apart from the loads and the constants, just two pext, a shift and an or. The memcpy calls in my code translate into 64-bit load instructions. The constant-loading instructions (movabs) are nearly free in practice.

movbe rax, QWORD PTR [rdi]
movbe rdx, QWORD PTR [rdi+7]
movabs rcx, 1085102592571150095
pext rax, rax, rcx
movabs rcx, 1080880467920490255
sal rax, 24
pext rdx, rdx, rcx
or rax, rdx

Prior to the AMD Zen 3 processors, pext had terrible performance on AMD processors. Recent AMD processors have performance on par with Intel: pext has a latency of about 3 cycles and can run once per cycle, so it is about as expensive as a multiplication. Not counting the loads, the above function could complete in about 5 cycles, which is quite fast.

For ARM processors, you can do it with ARM NEON like so: mask out the high nibbles, shuffle the bytes, combine adjacent nibbles with a shift and accumulate, narrow the 16-bit lanes down to 8 bits, and extract the result to a general-purpose register.

#include <arm_neon.h>
// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char *c) {
  const uint8_t *ascii = (const uint8_t *)(c);
  uint8x16_t in = vld1q_u8(ascii);
  // masking the high nibbles,
  in = vandq_u8(in, vmovq_n_u8(0x0f));
  // shuffle the bytes
  const uint8x16_t shuf = {14, 13, 12, 11, 10, 9, 7, 6,
    5, 4, 3, 2, 1, 0, 255, 255};
  in = vqtbl1q_u8(in, shuf);
  // then shift/or
  uint16x8_t ins =
    vsraq_n_u16(vreinterpretq_u16_u8(in),
    vreinterpretq_u16_u8(in), 4);
  // then narrow (16->8),
  uint8x8_t packed = vmovn_u16(ins);
  // extract to general register.
  return vget_lane_u64(vreinterpret_u64_u8(packed), 0);
}

It might compile to something like this:

adrp x8, .LCPI0_0
ldr q1, [x0]
movi v0.16b, #15
ldr q2, [x8, :lo12:.LCPI0_0]
and v0.16b, v1.16b, v0.16b
tbl v0.16b, { v0.16b }, v2.16b
usra v0.8h, v0.8h, #4
xtn v0.8b, v0.8h
fmov x0, d0

My source code is available.

Having fun with string literal suffixes in C++

The C++14 standard introduced the standard string literal suffix s. E.g., the expression "abcde\0\0fgh"s is an std::string with 10 characters. C++11 had already introduced user-defined literal suffixes: you can create your own.
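A brief sketch of why the s suffix matters: it keeps embedded null characters, whereas construction from a plain C string stops at the first null.

```cpp
#include <string>
using namespace std::string_literals;

// The s suffix receives the true length of the literal, so the two
// embedded nulls are preserved: 10 characters in total.
const std::string with_nulls = "abcde\0\0fgh"s;
// A const char* constructor stops at the first null: only "abcde".
const std::string truncated("abcde\0\0fgh");
```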

C++11 also added regular expressions to the C++ language as a standard feature. I wanted to have fun and see whether we could combine these features.

Regular expressions are useful to check whether a given string matches a pattern. For example, the expression \d+ checks that the string is made of one or more digits. Unfortunately, the backslash character needs to be escaped in C++, so the string \d+ may need to be written as "\\d+", or you may use a raw string: a raw string literal starts with R"( and ends with )", so you can write R"(\d+)". For complicated expressions, a raw string might be better.
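The two spellings denote exactly the same three-character string (a backslash, 'd', '+'), as a quick check confirms:

```cpp
#include <string>

// Escaped form and raw form of the same pattern: both are the
// three characters backslash, 'd', '+'.
const std::string escaped = "\\d+";
const std::string raw = R"(\d+)";
```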

A user-defined string literal is a way to specialize a string literal according to your own needs. It is effectively a convenient way to design your own “string types”. You can code it up as:

myclass operator"" _mysuffix(const char *str, size_t len) {
  return myclass(str, len);
}

And once it is defined, instead of writing myclass("mystring", 8), you can write "mystring"_mysuffix.
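As a toy illustration (the suffix name _len is my own invention), here is a user-defined literal that returns the length of the literal, counting embedded nulls, since the compiler passes the true length:

```cpp
#include <cstddef>

// Toy user-defined literal: "mystring"_len evaluates to 8.
// The compiler supplies the exact length, so embedded nulls count.
constexpr size_t operator"" _len(const char *, size_t len) {
  return len;
}
```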

In any case, we would like to have a syntax such as this:

bool is_digit = "\\d+"_re("123");

I can start with a user-defined string suffix:

convenience_matcher operator "" _re(const char *str, size_t) {
  return convenience_matcher(str);
}

I want my convenience_matcher to construct a regular expression instance, and to call the matching function whenever a parameter is passed in parentheses. The following class might work:

#include <regex>
struct convenience_matcher {
  convenience_matcher(const char *str) : re(str) {}
  bool match(const std::string &s) {
    std::smatch base_match;
    return std::regex_match(s, base_match, re);
  }
  bool operator()(const std::string &s) { return match(s); }
  std::regex re;
};

And that is all. The following expressions will then return a Boolean value indicating whether we have the required pattern:

 "\\d+"_re("123") // true
 "\\d+"_re("a23") // false
 R"(\d+)"_re("123") // true
 R"(\d+)"_re("a23") // false

I have posted a complete example. It is just for illustration and I do not recommend using this code for anything serious. I am sure that you can do better!

Parsing time stamps faster with SIMD instructions

In software, it is common to represent time as a time-stamp string. It is usually specified by a time format string. Some standards use the format %Y%m%d%H%M%S meaning that we print the year, the month, the day, the hours, the minutes and the seconds. The current time as I write this blog post would be 20230701205436 as a time stamp in this format. It is convenient because it is short, easy to read and if you sort the strings lexicographically, you also sort them chronologically.

You can generate time stamps using any programming language. In C, the following program will print the current time (universal, not local time):

#include <time.h>
#include <stdio.h>
int main() {
  char buffer[15];
  struct tm timeinfo;
  time_t rawtime;
  time(&rawtime);
  gmtime_r(&rawtime, &timeinfo);
  strftime(buffer, sizeof(buffer), "%Y%m%d%H%M%S", &timeinfo);
  buffer[14] = '\0';
  puts(buffer);
}

We are interested in the problem of parsing these strings. In practice, this means that we want to convert them to an integer representing the number of seconds since the Unix epoch (January 1st 1970). For my purposes, I will consider the time to be an unsigned 32-bit integer, so we can represent times between 1970 and 2106. It is not difficult to switch over to a 64-bit integer or to signed integers.

The way you typically solve this problem is to use something like the C function strptime. Can we do better?

Modern processors have fast instructions that operate on several words at once, called SIMD instructions. We have a block of 14 characters. Let us assume that we can read 16 characters safely, ignoring the content of the leftover characters.

We load the block of digits into a SIMD register and subtract 0x30 (the code-point value of the character ‘0’): all byte values should now be between 0 and 9, inclusively. We also know tighter bounds for some characters: for example, the seconds never exceed 59 (Unix time does not allow 60 seconds), so the tens-of-seconds digit must be between 0 and 5. Similarly, the hours run from 00 to 23, so the tens-of-hours digit must be between 0 and 2. We check these bounds with a saturating subtraction of the maxima: the result of such a subtraction is zero exactly when the value is no larger than its maximum. We then use a special instruction to multiply the first byte of each pair by 10 and add it to the next byte, producing 16-bit values, and we repeat the same bound check on the 16-bit results.

The code might look as follows using Intel intrinsic functions:

__m128i v = _mm_loadu_si128((const __m128i *)date_string);
v = _mm_sub_epi8(v, _mm_set1_epi8(0x30));
__m128i limit =
    _mm_setr_epi8(9, 9, 9, 9, 1, 9, 3, 9, 2, 9, 5, 9, 5, 9, -1, -1);
__m128i abide_by_limits = _mm_subs_epu8(v, limit); // must be all zero
const __m128i weights = _mm_setr_epi8(
    10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 0, 0);
v = _mm_maddubs_epi16(v, weights);
__m128i limit16 =
    _mm_setr_epi16(99, 99, 12, 31, 23, 59, 59, -1);
__m128i abide_by_limits16 = _mm_subs_epu16(v, limit16);
__m128i limits = _mm_or_si128(abide_by_limits16, abide_by_limits);
if (!_mm_test_all_zeros(limits, limits)) {
  return false;
}

It does not get all the parsing done, but at this point, you have the month, day, hours, minutes and seconds as valid binary integer values. The year is parsed in two components (the first two digits and the last two digits).

We can just use standard C code for the result.
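A sketch of what that scalar tail might look like, using the well-known days-from-civil date formula (this is my own illustration of the standard arithmetic, not necessarily the code from the benchmark):

```cpp
#include <cstdint>

// Convert validated date-time components to seconds since the Unix
// epoch, using Howard Hinnant's days_from_civil algorithm for the
// days-since-1970 count. Assumes a valid Gregorian date, 1970-2106.
uint32_t to_unix_time(uint32_t year, uint32_t month, uint32_t day,
                      uint32_t hour, uint32_t minute, uint32_t second) {
  uint32_t y = year - (month <= 2);     // shift the year so March is first
  uint32_t era = y / 400;               // 400-year Gregorian cycle
  uint32_t yoe = y - era * 400;         // year of era, [0, 399]
  uint32_t doy = (153 * (month + (month > 2 ? -3 : 9)) + 2) / 5 + day - 1;
  uint32_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
  uint32_t days = era * 146097 + doe - 719468; // days since 1970-01-01
  return days * 86400 + hour * 3600 + minute * 60 + second;
}
```

For instance, the components of the stamp 20230701205436 give the Unix time 1688244876.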

Is it fast? I wrote a benchmark that I compile using GCC 12 on an Intel Ice Lake Linux server.

                           instructions per stamp   time per stamp
standard C with strptime   700                      46
SIMD approach              65                       7.9

We use about 10 times fewer instructions, and we go 6 times faster. That is not bad, but I suspect it is not nearly optimal.

Importantly, we do full error checking, and abide by the standard.

The source code is available.

Credit: Thanks to Jeroen Koekkoek from NLnetLabs for initial work and for proposing the problem, and to @aqrit for sketching the current code.