Bit Hacking (with Go code)

At a fundamental level, a programmer needs to manipulate bits. Modern processors operate over data by loading in ‘registers’ and not individual bits. Thus a programmer must know how to manipulate the bits within a register. Generally, we can do so while programming with 8-bit, 16-bit, 32-bit and 64-bit integers. For example, suppose that I want to set an individual bit to value 1. Let us pick the bit an index 12 in a 64-bit words. The word with just the bit at index 12 set is 1<<12 : the number 1 shifted to the left 12 times, or 4096. In Go, we format numbers using the fmt.Printf function: we use a string with formatting instructions followed by the values we want to print. We begin a formatting sequence with the letter % which has a special meaning (if one wants to print %, one most use the string %%). It can be followed by the letter b which stands for binary, the letter d (for decimal) or x (for hexadecimal). Sometimes we want to specify the minimal length (in characters) of the output, and we do so by a leading number: e.g, fmt.Printf("%100d", 4096) prints a 100-character string that ends with 4096 and begins with spaces. We can specify zero as a padding character rather than the space by adding it as a prefix (e.g., "%0100d"). In Go, we may print thus the individual bits in a word as in the following example:

package main

import "fmt"

func main() {
    var x uint64 = 1 << 12
    fmt.Printf("%064b", x)
}

Running this program we get a binary string representing 1<<12:

0000000000000000000000000000000000000000000000000001000000000000

The general convention when printing numbers is that the most significant digits are printed first followed by the least significant digits: e.g., we write 1234 when we mean 1000 + 200 + 30 + 4. Similarly, Go prints the most significant bits first, and so the number 1<<12 has 64-12=52 leading zeros followed by a 1 with 11 trailing zeros.

We might find it interesting to revisit how Go represents negative integers. Let us take the 64-bit integer -2. Using two’s complement notation, the number should be represented as the unsigned number (1<<64)-2 which should be a word made entirely one ones, except for the second last bit. We can use the fact that a cast operation in Go (e.g., uint64(x)) preserves the binary representation:

package main

import "fmt"

func main() {
    var x int64 = -2
    fmt.Printf("%064b", uint64(x))
}

This program will print 1111111111111111111111111111111111111111111111111111111111111110 as expected.

Go has some relevant binary operators that we often use to manipulate bits:

&    bitwise AND
|    bitwise OR
^    bitwise XOR
&^   bitwise AND NOT

Furthermore, the symbol ^ is also used to flip all bits a word when used as an unary operation: a ^ b computes the bitwise XOR of a and b whereas ^a flips all bits of a. We can verify that we have a|b == (a^b) | (a&b) == (a^b) + (a&b).

We have other useful identities. For example, given two integers a and b, we have that a+b = (a^b) + 2*(a&b). In the identity 2*(a&b) represents the carries whereas a^b represents the addition without the carries. Consider for example 0b1001 + 0b01001. We have that 0b1 + 0b1 == 0b10 and this is the 2*(a&b) component, whereas 0b1000 + 0b01000 == 0b11000 is captured by a^b. We have that 2*(a|b) = 2*(a&b) + 2*(a^b), thus a+b = (a^b) + 2*(a&b) becomes a+b = 2*(a|b) - (a^b). These relationships are valid whether we consider unsigned or signed integers, since the operations (bitwise logical, addition and subtraction) are identical at the bits level.

Setting, clearing and flipping bits

We know how to create a 64-bit word with just one bit set to 1 (e.g., 1<<12). Conversely, we can also create a word that is made of 1s except for a 0 at bit index 12 by flipping all bits: ^uint64(1<<12). Before flipping all bits of an expression, it is sometimes useful to specify its type (taking uint64 or uint32) so that the result is unambiguous.

We can then use these words to affect an existing word:

  1. If we want to set the 12th bit of word w to one: w |= 1<<12.
  2. If we want to clear (set to zero) the 12th bit of word w: w &^= 1<<12 (which is equivalent to w = w & ^uint64(1<<12)).
  3. If we just want to flip (send zeros to ones and ones to zeros) the 12th bit: w ^= 1<<12.

We may also affect a range of bits. For example, we know that the word (1<<12)-1 has all but the 11 least significant bits set to zeros, and the 11 least significant bits set to ones.

  1. If we want to set the 11 least significant bits of the word w to ones: w |= (1<<12)-1.
  2. If we want to clear (set to zero) the 11 least signficant bits of word w: w &^= (1<<12)-1.
  3. If we want to flip the 11 least signficant bits: w ^= (1<<12)-1.
    The expression (1<<12)-1 is general in the sense that if we want to select the 60 least significant bits, we might do (1<<60)-1. It even works with 64 bits: (1<<64)-1 has all bits set to 1.

We can also generate a word that has an arbitrary range of bits set: the word ((1<<13)-1) ^ ((1<<2)-1) has the bits from index 2 to index 12 (inclusively) set to 1, other bits are set to 0. With such a construction, we can set, clear or flip an arbitrary range of bits within a word, efficiently.

We can set any bit we like in a word. But what about querying the bit sets ? We can check the 12th bit is set in the word u by checking whether w & (1<<12) is non-zero. Indeed, the expression w & (1<<12) has value 1<<12 if the 12th bit is set in w and, otherwise, it has value zero. We can extend such a check: we can verify whether any of the bits from index 2 to index 12 (inclusively) set to 1 by computing w & ((1<<13)-1) ^ ((1<<2)-1). The result is zero if and only if no bit in the specified range is set to one.

Efficient and safe operations over integers

By thinking about values in terms of their bit representation, we can write more efficient code or, equivalent, have a better appreciation for what optimized binary code might look like. Consider the problem of checking if two numbers have the same sign: we want to know whether they are both smaller than zero, or both greater than or equal to zero. A naive implementation might look as follows:

func SlowSameSign(x, y int64) bool {
return ((x < 0) && (y < 0)) || ((x >= 0) && (y >= 0))
}

However, let us think about what distinguishes negative integers from other integers: they have their last bit set. That is, their most significant bit as an unsigned value is one. If we take the exclusive or (xor) of two integers, then the result will have its last bit set to zero if their sign is the same. That is, the result is positive (or zero) if and only if the signs agree. We may therefore prefer the following function to determine if two integers have the same sign:

func SameSign(x, y int64) bool {
    return (x ^ y) >= 0
}

Suppose that we want to check whether x and y differ by at most 1. Maybe x is smaller than y, but it could be larger.

Let us consider the problem of computing the average of two integers. We have the following correct function:

func Average(x, y uint16) uint16 {
    if y > x {
        return (y-x)/2 + x
    } else {
        return (x-y)/2 + y
    }
}

With a better knowledge of the integer representation, we can do better.

We have another relevant identity x == 2*(x>>1) + (x&1). It means that x/2 is within [(x>>1), (x>>1)+1). That is, x>>1 is the greatest integer no larger than x/2. Conversely, we have that (x+(x&1))>>1 is the smallest integer no smaller than x/2.

We have that x+y = (x^y) + 2*(x&y). Hence we have that (x+y)>>1 == ((x^y)>>1) + (x&y) (ignoring overflows in x+y). Hence, ((x^y)>>1) + (x&y) is the greatest integer no larger than (x+y)/2. We also have that x+y = 2*(x|y) - (x^y) or x+y + (x^y)&1= 2*(x|y) - (x^y) + (x^y)&1 and so (x+y+(x^y)&1)>>1 == (x|y) - ((x^y)>>1) (ignoring overflows in x+y+(x^y)&1). It follows that (x|y) - ((x^y)>>1) is the smallest integer no smaller than (x+y)/2. The difference between (x|y) - ((x^y)>>1) and ((x^y)>>1) + (x&y) is (x^y)&1. Hence, we have the following two fast functions:

func FastAverage1(x, y uint16) uint16 {
    return (x|y) - ((x^y)>>1)
}
func FastAverage2(x, y uint16) uint16 {
    return ((x^y)>>1) + (x&y)
}

Though we use the type uint16, it works irrespective of the integer size (uint8, uint16, uint32, uint64) and it also applies to signed integers (int8, int16, int32, int64).

Efficient Unicode processing

In UTF-16, we may have surrogate pairs: the first word in the pair is in the range 0xd800 to 0xdbff whereas the second word is in the range from 0xdc00 to 0xdfff. How may we detect efficiency surrogate pairs? If the values are stored using an uint16 type, then it would seem that we could detect a value part of a surrogate pair with two comparisons: (x>=0xd800) && (x<=0xdfff). However, it may prove more efficient to use the fact that subtractions naturally wrap-around: 0-0xd800==0x2800. Thus x-0xd800 will range between 0 and 0xdfff-0xd800 inclusively whenever we have a value that is part of a surrogate pair. However, any other value will be larger than 0xdfff-0xd800=0x7fff. Thus, a single comparison is needed : (x-0xd800)<=0x7ff.
Once we have determined that we have a value that might correspond to a surrogate pair, we may check that the first value x1 is valid (in the range 0xd800 to 0xdbff) with the condition (x-0xd800)<=0x3ff, and similarly for the second value x2: (x-0xdc00)<=0x3ff. We may then reconstruct the code point as ((x-0xd800)<<10) + x-0xdc00. In practice, you may not need to concern yourself with such an optimization since your compiler might do it for you. Nevertheless, it is important to keep in mind that what might seem like multiple comparisons could actually be implemented as a single one.

Basic SWAR

Modern processors have specialized instructions capable of operating over multiple units of data with a single instruction (called SIMD for Single Instruction Multiple Data). We can do several operations using a single instruction (or few) instructions with a technique called SWAR (SIMD within a register). Typically, we are given a 64-bit word w (uint64) and we want to treat it as a vector of eight 8-bit words (uint8).

Given a byte value (uint8) I can have it replicated over all bytes of a word with a single multiplication: x * uint64(0x0101010101010101). For example, we have 0x12 * uint64(0x0101010101010101) == 0x1212121212121212. This approach can be generalized in various ways. For example, we have that 0x7 * uint64(0x1101011101110101) == 0x7707077707770707.

For convenience, let us define b80 to be the uint64 equal to 0x8080808080808080 and b01 be the uint64 equal to 0x0101010101010101. We can check whether all bytes are smaller than 128. We first replicate the byte value with all but the most significant bit set to zero (0x80 * b01) or b80) and then we compute the bitwise AND with our 64-bit word and check whether the result is zero: (w & b80)) == 0. It might compile to a two or three instructions on a processor.

We can check whether any byte is zero, assuming that we have checked that they are smaller than 128, with an expression such as ((w - b01) & b80) == 0. If we are not sure that they are smaller than zero, we can simply add an operation: (((w - b01)|w) & b80) == 0. Checking that a byte is zero allows us to check whether two words, w1 and w2, have a matching byte value since, when this happens, w1^w2 has a zero byte value.

We can also design more complicated operations if we assume that all byte values are no larger than 128. For example, we may check that all byte values are no larger than a 7-bit value (t) by the following routine: ((w + (0x80 - t) * b01) & b80) == 0. If the value t is a constant, then the multiplication would be evaluated at compile time and it should be barely more expensive than checking whether all bytes are smaller than 128. In Go, we check that no byte value is larger than 77, assuming that all byte values are smaller than 128 by verifying thaat b80 & (w+(128-77) * b01) is zero. Similarly, we can check that all byte values are at least as large a 7-bit t, assuming that they are also all smaller than 128: ((b80 - w) + t * b01) & b80) == 0. We can generalize further. Suppose we want to check that all bytes are at least as large at the 7-bit value a and no larger than the 7-bit value b. It suffices to check that ((w + b80 - a * b01) ^ (w + b80 - b * b01)) & b80 == 0.

Rotating and Reversing Bits

Given a word, we say that we rotate the bits if we shift left or right the bits, while moving back the leftover bits at the beginning. To illustrate the concept, suppose that we are given the 8-bit integer 0b1111000 and we want to rotate it left by 3 bits. The Go language provides a function for this purpose (bits.RotateLeft8 from the math/bits package): we get 0b10000111. In Go, there is no rotate right operation. However, rotating left by 3 bits is the same as rotating right by 5 bits when processing 8-bit integers. Go provide rotation functions for 8-bit, 16-bit, 32-bit and 64-bit integers.

Suppose that you would like to know if two 64-bit words (w1 and w2) have matching byte values, irrespective of the ordering. We know how to check that they have matching ordered byte values efficiently (e.g., (((w1^w2 - b01)|(w1^w2)) & b80) == 0. To compare all bytes with all other bytes, we can repeat the same operation as many times as they are bytes in a word (eight times for 64-bit integers): each time, we rotate one of the words by 8 bits:

(((w1^w2 - b01)|(w1^w2)) & b80) == 0
w1 = bits.RotateLeft64(w1,8)
(((w1^w2 - b01)|(w1^w2)) & b80) == 0
w1 = bits.RotateLeft64(w1,8)
...

We recall that words can be interpreted as little-endian or big-endian depending on whether the first bytes are the least significant or the most significant. Go allows you to reverse the order of the bytes in a 64-bit word with the function bits.ReverseBytes64 from the math/bits package. There are similar functions for 16-bit and 32-bit words. We have that bits.ReverseBytes16(0xcc00) == 0x00cc. Reversing the bytes in a 16-bit word, and rotating by 8 bits, are equivalent operations.

We can also reverse bits. We have that bits.Reverse16(0b1111001101010101) == 0b1010101011001111. Go has functions to reverse bits for 8-bit, 16-bit, 32-bit and 64-bit words. Many processors have fast instructions to reverse the bit orders, and it can be a fast operation.

Fast Bit Counting

It can be useful to count the number of bits set to 1 in a word. Go has fast functions for this purpose in the math/bits package for words having 8 bits, 16 bits, 32 bits and 64 bits. Thus we have that bits.OnesCount16(0b1111001101010101) == 10.

Similarly, we sometimes want to count the number of trailing or leading zeros. The number of trailing zeros is the number of consecutive zero bits appearing in the least significant positions.For example, the word 0b1 has no trailing zero, whereas the word 0b100 has two trailing zeros. When the input is a power of two, the number of trailing zeros is the logarithm in base two. We can use the Go functions bits.TrailingZeros8, bits.TrailingZeros16
and so forth to compute the number of trailing zeros. The number of leading zeros is similar, but we start from the most significant positions. Thus the 8-bit integer 0b10000000 has zero leading zeros,
while the integer 0b00100000 has two leading zeros. We can use the functions bits.LeadingZeros8, bits.LeadingZeros16
and so forth.

While the number of trailing zeros gives directly the logarithm of powers of two, we can use the number of leading zeros to compute the logarithm of any integer, rounded up to the nearest integer. For 32-bit integers, the following function provides the correct result:

func Log2Up(x uint32) int {
    return 31 - bits.LeadingZeros32(x|1)
}

We can also compute other logarithms. Intuitively, this ought to be possible because if log_b is the logarithm in base b, then log_b (x) = \log_2(x)/\log_2(b). In other words, all logarithms are within a constant factor (e.g., 1/log_2(b)).

For example, we might be interested in the number of decimal digits necessary to represent an integer (e.g., the integer 100 requires three digits). The general formula is ceil(log(x+1)) where the logarithm should be taken in base 10. We can show that the following function (designed by an engineer called Kendall Willets) computes the desired number of digits for 32-bit integers:

func DigitCount(x uint32) uint32 {
    var table = []uint64{
        4294967296, 8589934582, 8589934582,
        8589934582, 12884901788, 12884901788,
        12884901788, 17179868184, 17179868184,
        17179868184, 21474826480, 21474826480,
        21474826480, 21474826480, 25769703776,
        25769703776, 25769703776, 30063771072,
        30063771072, 30063771072, 34349738368,
        34349738368, 34349738368, 34349738368,
        38554705664, 38554705664, 38554705664,
        41949672960, 41949672960, 41949672960,
        42949672960, 42949672960}
    return uint32((uint64(x) + table[Log2Up(x)]) >> 32)
}

Though the function is a bit mysterious, its computation mostly involves computing the number of trailing zeros, and using the result to lookup a value in a table. It translates in only a few CPU instructions and is efficient.

Indexing Bits

Given a word, it is sometimes useful to compute the position of the set bits (bits set to 1). For example, given the word 0b11000111, we would like to have the indexes 0, 1, 2, 6, 7 corresponding to the 5 bits with value 1. We can determine efficiently how many indexes we need to produce thanks to the bits.OnesCount functions. The bits.TrailingZeros functions can serve to identify the position of a bit. We may also use the fact that x & (x-1) set to zero the least significant 1-bit of x. The following Go function generates an array of indexes:

func Indexes(x uint64) []int {
    var ind = make([]int, bits.OnesCount64(x))
    pos := 0
    for x != 0 {
        ind[pos] = bits.TrailingZeros64(x)
        x &= x - 1
        pos += 1
    }
    return ind
}

Given 0b11000111, it produces the array 0, 1, 2, 6, 7:

var x = uint64(0b11000111)
for _, v := range Indexes(x) {
    fmt.Println(v)
}

If we want to compute the bits in reverse order (7, 6, 2, 1, 0), we can do so with a bit-reversal function, like so:

for _, v := range Indexes(bits.Reverse64(x)) {
    fmt.Println(63 - v)
}

Conclusion

As a programmer, you may access, set, copy, or move individual bit values efficiently. With some care, you can avoid arithmetic overflows without much of a performance penalty. With SWAR, you can use a single word as if it was made of several subwords. Though most of these operations are only rarely needed, it is important to know that they are available.

Serializing IPs quickly in C++

On the Internet, we often use 32-bit addresses which we serialize as strings such as 192.128.0.1. The string corresponds to the Integer address 0xc0800001 (3229614081 in decimal).

How might you serialize, go from the integer to the string, efficiently in C++?

The simplest code in modern C++ might look as follows:

std::string output = std::to_string(address >> 24);
for (int i = 2; i >= 0; i--) {
  output.append(std::to_string((address >> (i * 8)) % 256) + ".");
}

At least symbolically, it will repeatedly create small strings that are appended to an initial string.

Can we do better? We have new functions in C++ (std::to_chars) which are dedicated to writing quickly to a string buffer. So we might try to allocate a single buffer and write to it using buffers. The result is not pretty:

std::string output(4 * 3 + 3, '\0'); // allocate just one big string
char *point = output.data();
char *point_end = output.data() + output.size();
point = std::to_chars(point, point_end, uint8_t(address >> 24)).ptr;
for (int i = 2; i >= 0; i--) {
 *point++ = '.';
 point = std::to_chars(point, point_end, uint8_t(address >> (i * 8))).ptr;
}
output.resize(point - output.data());

The uglier code might be faster.

I wrote a benchmark where I repeatedly create strings and add them to containers.

On an AMD Rome (Zen2) processor using GCC 11, I get the following numbers.

pretty code 80 ns/string
to_chars 24 ns/string

So about three times faster?

We can go at least twice as fast with a bulkier function but it uses significantly more storage and memory:

  // uses 1025 bytes
  constexpr static const char *lookup =
".0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .10 .11 .12 .13 .14 .15 .16 .17 "
".18 .19 .20 .21 .22 .23 .24 .25 .26 .27 .28 .29 .30 .31 .32 .33 .34 .35 "
".36 .37 .38 .39 .40 .41 .42 .43 .44 .45 .46 .47 .48 .49 .50 .51 .52 .53 "
".54 .55 .56 .57 .58 .59 .60 .61 .62 .63 .64 .65 .66 .67 .68 .69 .70 .71 "
".72 .73 .74 .75 .76 .77 .78 .79 .80 .81 .82 .83 .84 .85 .86 .87 .88 .89 "
".90 .91 .92 .93 .94 .95 .96 .97 .98 .99 "
".100.101.102.103.104.105.106.107.108.109.110.111.112.113.114.115.116."
"117.118.119.120.121.122.123.124.125.126.127.128.129.130.131.132.133.134."
"135.136.137.138.139.140.141.142.143.144.145.146.147.148.149.150.151.152."
"153.154.155.156.157.158.159.160.161.162.163.164.165.166.167.168.169.170."
"171.172.173.174.175.176.177.178.179.180.181.182.183.184.185.186.187.188."
"189.190.191.192.193.194.195.196.197.198.199.200.201.202.203.204.205.206."
"207.208.209.210.211.212.213.214.215.216.217.218.219.220.221.222.223.224."
"225.226.227.228.229.230.231.232.233.234.235.236.237.238.239.240.241.242."
"243.244.245.246.247.248.249.250.251.252.253.254.255";
  std::string output(4 * 3 + 3, '\0');
  char *point = output.data();
  uint8_t by;
  by = address >> 24;
  std::memcpy(point, lookup + by * 4 + 1, 4);
  point += (by < 10) ? 1 : (by < 100 ? 2 : 3);
  by = address >> 16;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  by = address >> 8;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  by = address >> 0;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  output.resize(point - output.data());


In my benchmark, I include a version by Marcin Zukowski that uses even more memory (about 4 times) and is slightly faster.

As always, your results will vary depending on your system and your compiler. However, I recommend against creating small strings and aggregating them together in performance critical code.

Credit: Peter Dimov and Marcin Zukowski have contributed fast versions (see benchmarks). The version with a large table is derived from their work. Ivan-Assen Ivanov contributed ideas on Twitter.

Move or copy your strings? Possible performance impacts

You sometimes want to add a string to an existing data structure. For example, the C++17 template ‘std::optional’ may be used to represent a possible string value. You may copy it there, as this code would often do…

std::string mystring;
std::optional<std::string> myoption;
myoption = mystring;

Or you can move it:

std::string mystring;
std::optional<std::string> myoption;
myoption = std::move(mystring);

In C++, when ‘moving’ a value, the compiler does not need to create a whole new copy of the string. So it is often cheaper.

I wrote a little benchmark to assess the performance difference. It is a single test, but it should illustrate.

Firstly, for relatively long strings (a phrase or a sentence), the move is 5 times to 20 times faster.

copy move
Apple LLVM 14, M2 processor 24 ns/string 1.2 ns/string
GCC 11, Intel Ice Lake 19 ns/string 4 ns/string

Secondly, for short strings (a single word), the move is 1.5 times to 3 times faster but the absolute difference is small (as small as a fraction of a nanosecond). Your main concern should be with long strings.

copy move
Apple LLVM 14, M2 processor 2.0 ns/string 1.2 ns/string
GCC 11, Intel Ice Lake 7 ns/string 2.6 ns/string

My results illustrate that moving your sizeable data structure instead of copying them is beneficial.

But that’s not the fastest approach: the fastest approach is to just hold a pointer. Copying an address is unbeatably fast. A slightly less optimal approach is to use a lightweight object like an std::string_view: copying or creating an std::string_view is cheaper than doing the same with a C++ string.

International domain names: where does https://meßagefactory.ca lead you?

Originally, the domain part of a web address was all ASCII (so no accents, no emojis, no Chinese characters). This was extended a long time ago thanks to something called internationalized domain name (IDN).

Today, in theory, you can use any Unicode character you like as part of a domain name, including emojis. Whether that is wise is something else.

What does the standard says? Given a domain name, we should identify its labels. They are normally separated by dots (.) into labels: www.microsoft.com has three labels. But you may also use other Unicode characters as separators ( ., ., 。, 。). Each label is further processed. If it is all ASCII, then it is left as is. Otherwise, we must convert it to an ASCII code called “punycode” after doing the following according to RFC 3454:

  • Map characters (section 3 of RFC 3454),
  • Normalize (section 4 of RFC 3454),
  • Reject forbidden characters,
  • Optionally reject based on unassigned code points (section 7).

And then you get to the punycode algorithm. There are further conditions to be satisfied, such as the domain name in ASCII cannot exceed 255 bytes.

That’s quite a lot of work. The goal is to transcribe each Unicode domain name into an ASCII domain name. You would hope that it would be a well-defined algorithm: given a Unicode domain name, there should be a unique output.

Let us choose a common non-ASCII character, the letter ß, called Eszett. Let me create a link with this character:

What happens if you click on this link? The result depends on your browser. If you are using Microsoft Edge, Google Chrome or the Brave browser, you may end up at https://messagefactory.ca/. If you are using Safari or Firefox  you may end up at https://xn--meagefactory-m9a.ca. Of course, your results may vary depending on your exact system. Under ios (iPhone), I expect that the Safari behaviour will prevail irrespective of your browser.

Not what I expected.

Year 2022: Scientific progress

The year 2022 is over. As with every year that passes, we have made some scientific progress. I found the following achievements interesting:

Science and technology links (January 15 2023)

    1. For under $600, one can buy a 20-terabyte disk on Amazon. Unless you work professionally in multimedia, it is more storage than you need. However, having much storage it, by itself, of little use if you cannot access it. Thankfully, you can buy a 1-terabyte “disk” for $200 that provides over 6 GB/s of bandwidth. I have a similar disk in my game console. Is this as good as it gets? Researchers show that we can transmit data over a distance at more than a petabit per second. According to some estimates, that is more than the total data size of the books in the library of congress, per second.
    2. Transplanting rejuvenated blood stem cells extends lifespan of aged immunocompromised mice.
    3. Amazon is using drones for deliveries in California and Texas.
    4. People who think themselves as less attractive are more likely willing to wear surgical masks.
    5. Conversations rarely end when they should.
    6. Using legal restrictions, manufacturers are able to prevent their customers from repairing their own products. There may be hope. Farmers in the US will be allowed to repair John Deere tractors.
    7. For most of my life, nuclear power has been viewed with suspicion, and little effort has been done to exploit it further. The tide might be finally turning. The UK government plans to authorize small modular nuclear reactors, and other nuclear-power innovations.
    8. Cancer rates are falling rapidly in both men and women.
    9. There is evidence that vitamin D supplements might help your immune system if your levels are too low. However, you may also require magnesium supplements to benefit from the vitamin D.
    10. In the post-pandemic era, people work fewer hours.
    11. There is evidence that viruses dormant in our genomes can wake up when we are older and harm us.
    12. Research papers and patents are becoming less disruptive over time. However, we are producing many more research papers and patents.
    13. Genetically modified immune cells are used to fight cancer in a patient.
    14. Increasing house insulation may not lead to lower energy usage on the long run. It is not, per se, an argument against better insulation. Yet it suggests that we should plan for increase power production.
    15. In a paper published by Nature, Kleinherenbrink et al. find that global mean sea levels are likely rising according to a linear curve, as opposed to an accelerating curve. The current rate is estimated to be between 3 and 4 millimeters per year. Meanwhile, the most low-lying island nations on the planet are growing.
    16. Antartica might be getting cooler.
    17. People prefer to stay away from promiscuous men.
    18. Cold temperatures harm the antiviral immunity of your nose. Thus cold weather may make you more prone to catching a virus.
    19. Replacing grades with pass/fail scores in courses lead students to make less effort. In my view, it does not imply that we should not adopt pass/fail scores because there are diminish returns to more efforts. E.g., if the kids in a country spend much more time perfecting their knowledge of trigonometry, you may not end up with a more prosperous country. In some sense, intense competition may be a net loss.
    20. Using solar power generation in countries such as Switzerland results in a net energy loss: though the solar panels produce energy, they never recoup the energy investment needed to make them and deploy them.

Care is needed to use C++ std::optional with non-trivial objects

We often have to represent in software a value that might be missing. Different programming languages have abstraction for this purpose.

A recent version of C++ (C++17) introduces std::optional templates. It is kind of neat. You can write code that prints a string, or a warning if no string is available as follows:

void f(std::optional<std::string> s) {
  std::cout << s.value_or("no string") << std::endl;
}

I expect std::optional to be relatively efficient when working with trivial objects (an integer, a std::string_view, etc.). However if you want to avoid copying your objects, you should use std::optional with care.

Let us consider this example:

A f(std::optional<A> z) {
    return z.value_or(A());
}

A g() { 
    A a("message"); 
    auto z = std::optional<A>(a); 
    return f(z); 
}

How many instances of the string class are constructed when call the function g()? It is up to five instances:

  1. At the start of the function we construct one instance of A.
  2. We then copy-construct this instance when passing it to std::optional<A>.
  3. Passing std::optional<A> to the function f involves another copy.
  4. In the value_or construction, we have a default construction of A (which is wasteful work in this instance).
  5. Finally, we copy construct an instance of A when the call to value_or terminates.

It is possible for an optimizing compiler to elide some or all of the superfluous constructions, especially if you can inline the functions, so that the compiler can see the useless work and prune it. However, in general, you may encounter inefficient code.

I do not recommend against using std::optional. There are effective strategies to avoid the copies. Just apply some care if performance is a concern.

Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake

Most systems today rely on Unicode strings. However, we have two popular Unicode formats: UTF-8 and UTF-16. We often need to convert from one format to the other. For example, you might have a database formatted with UTF-16, but you need to produce JSON documents using UTF-8. This conversion is often called ‘transcoding’.

In the last few years, we wrote a specialized library that process Unicode strings, with a focus on performance: the simdutf library. The library is used by JavaScript runtimes (Node JS and bun).

The simdutf library is able to benefit from the latest and most powerful instructions on your processors. In particular, it does well with recent processors with AVX-512 instructions (Intel Ice Lake, Rocket Lake, as well as AMD Zen 4).

I do not yet have a Zen 4 processor, but Velu Erwan was kind of enough to benchmark it for me. A reasonable task is to transcode an Arabic file from UTF-8 to UTF-16: it is typically a non-trivial task because Arabic UTF-8 is a mix of one-byte and two-byte characters that we must convert to two-byte UTF-16 characters (with validation). The steps required (under Linux) are as follows:

git clone https://github.com/simdutf/simdutf && 
cd simdutf && 
cmake -B build && cmake --build build && 
wget –content-disposition https://cutt.ly/d2cIxRx && 
./build/benchmarks/benchmark -F Arabic-Lipsum.utf8.txt -P convert_utf8 

(Ideally, run the last command with privileged access to the performance counters.)

Like Intel, AMD has its own compiler. I did not have access to the Intel compiler for my tests, but Velu has the AMD compiler.

A sensible reference point is the iconv, as it is provided by the runtime library. The AMD processor is running much faster than the Intel core (5.4 GHz vs. 3.4 GHz). We use GCC 12.

transcoder Intel Ice Lake (GCC) AMD Zen 4 (GCC) AMD Zen 4 (AMD compiler)
iconv 0.70 GB/s 0.97 GB/s 0.98 GB/s
simdutf 7.8 GB/s 11 GB/s 12 GB/s

At a glance, the Zen 4 processor is slightly less efficient on a per-cycle basis when running the simdutf AVX-512 code (2.8 instructions/cycle for AMD versus 3.1 instructions/cycle for Intel) but keep in mind that we did not have access to a Zen 4 processor when tuning our code. The efficiency difference is small enough that we can consider the processors roughly on par pending further investigations.

The big difference that the AMD Zen 4 runs at a much higher frequency. If I rely on wikipedia, I do not think that there is an Ice Lake processor that can match the 5.4 GHz. However, some Rocket Lake processors come close.

In our benchmarks, we track the CPU frequency and we get the same measured frequency when running an AVX-512 as when running conventional code (iconv).  Thus AVX-512 can be really advantageous.

These results suggest that AMD Zen 4 is matching Intel Ice Lake in AVX-512 performance. Given that the Zen 4 microarchitecture is the first AMD attempt at supporting AVX-512 commercially, it is a remarkable feat.

Further reading: AMD Zen 4 performance while parsing JSON (phoronix.com).

Note: Raw AMD results are available: GCC 12 and AOCC.

Credit: Velu Erwan got the processor from AMD France. The exact specification is AMD 7950X, 2x16GB DDR5 4800MT reconfig as 5600MT. The UTF-8 to UTF-16 transcoder is largely based on the work of Robert Clausecker.

Emojis in domain names, punycode and performance

Most domain names are encoded using ASCII (e.g., yahoo.com). However, you can register domain names with almost any character in them. For example, there is a web site at 💩.la called poopla. Yet the underlying infrastructure is basically pure ASCII. To make it work, the text of your domain is first translated into ASCII using a special encoding called ‘punycode‘. The poopla web site is actually at https://xn--ls8h.la/.

Punycode is a tricky format. Thankfully, domain names are made of labels (e.g., in microsoft.com, microsoft is a label) and each label can use at most 63 bytes. In total, a domain name cannot exceed 255 bytes, and that is after encoding it to punycode if necessary.

Some time ago, Colm MacCárthaigh suggested that I look at the performance impact of punycode. To answer the question, I quickly implemented a function that does the job. It is a single function without much fanfare. It is roughly derived from the code in the standard, but it looks simpler to me. Importantly, I do not claim that my implementation is particularly fast.

As a dataset, I use all of the examples from the wikipedia entry of punycode.

GCC 11, Intel Ice Lake 75 ns/string
Apple M2, LLVM 14 27 ns/string

The punycode format is relatively complicated, it was not designed for high speed, so there are limits to how fast one can go. Nevertheless, it should be possible to go faster. Yet it is probably ‘fast enough’ and it is unlikely that punycode encoding would ever be a performance bottleneck if only because domain names are almost always pure ASCII and punycode is unneeded.

 

Quickly checking that a string belongs to a small set

Suppose that I give you a set of reference strings (“ftp”, “file”, “http”, “https”, “ws”, “wss”). Given a new string, you want to quickly tell whether it is part of this set.

You might use a regular expression but it is unlikely to be fast in general:

const std::regex txt_regex("https?|ftp|file|wss?");
// later... 
bool match = std::regex_match(v.begin(), v.end(), txt_regex);

A sensible solution might be to create a set and then to ask whether the string is in the set. In C++, a default set type is the unordered_set thus your code might look as follows:

static const std::unordered_set<std::string_view> special_set = {
    "ftp", "file", "http", "https", "ws", "wss"};

bool hash_is_special(std::string_view input) {
  return special_set.find(input) != special_set.end();
}

You can also use a tool like gperf which constructs a perfect-hash function for you.

You might also be more direct about it, and just do several comparisons:

bool direct_is_special(std::string_view input) {
  return (input == "https") || (input == "http") || (input == "ftp") ||
         (input == "file") || (input == "ws") || (input == "wss");
}

I call it a branchy version because of the ‘||’ operator which suggests to the compiler that you want to evaluate the comparisons one by one, exiting once one is true; if you replace the ‘||’ operator with the operator ‘|’ then you the result is ‘branchless’ in the sense that you entice the compiler to evaluate all comparisons.

If you look at how the code gets compiled, you may notice that the compiler is forced to do comparisons and jumps, because it is not allowed to read in the provided string beyond its reported size.

You might be able to do slightly better if you can tell your compiler that the string you receive is ‘padded’ so that you can read eight bytes safely from it. I could not find a very elegant way to do it, but the following code works:

static inline uint64_t string_to_uint64(std::string_view view) {
  uint64_t val;
  std::memcpy(&val, view.data(), sizeof(uint64_t));
  return val;
}

uint32_t string_to_uint32(const char *data) {
  uint32_t val;
  std::memcpy(&val, data, sizeof(uint32_t));
  return val;
}


bool fast_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);
  if ((inputu & 0xffffffffff) == string_to_uint64("https\0\0\0")) {
    return input.size() == 5;
  }
  if ((inputu & 0xffffffff) == string_to_uint64("http\0\0\0\0")) {
    return input.size() == 4;
  }
  if (uint32_t(inputu) == string_to_uint32("file")) {
    return input.size() == 4;
  }
  if ((inputu & 0xffffff) == string_to_uint32("ftp\0")) {
    return input.size() == 3;
  }
  if ((inputu & 0xffffff) == string_to_uint32("wss\0")) {
    return input.size() == 3;
  }
  if ((inputu & 0xffff) == string_to_uint32("ws\0\0")) {
    return input.size() == 2;
  }
  return false;
}

Though I did not do it, you can extend the comparison so that it is case-insensitive (simply AND the input with the bytes 0xdf instead of the bytes 0xff).

You can use a faster approach if you can assume that the input string has been padded with zeros:

  uint64_t inputu = string_to_uint64(input);
  uint64_t https = string_to_uint64("https\0\0\0");
  uint64_t http = string_to_uint64("http\0\0\0\0");
  uint64_t file = string_to_uint64("file\0\0\0\0");
  uint64_t ftp = string_to_uint64("ftp\0\0\0\0\0");
  uint64_t wss = string_to_uint64("wss\0\0\0\0\0");
  uint64_t ws = string_to_uint64("ws\0\0\0\0\0\0");
  if((inputu == https) | (inputu == http)) {
    return true;
  }
  return ((inputu == file) | (inputu == ftp) 
          | (inputu == wss) | (inputu == ws));

Observe how I have selected what I believe are the two most common cases (among URL protocols).

Finally, we must use a hash function to solve the problem with a single comparison:

static const uint8_t shiftxor_table[128] = {
    'w', 's', 0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   'f', 'i', 'l', 'e', 0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   'f', 't', 'p', 0,   0,   0,   0,   0,
    'w', 's', 's', 0,   0,   0,   0,   0,   'h',
    't', 't', 'p', 0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   'h', 't', 't',
    'p', 's', 0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0,   0,
    0,   0};

bool shiftxor_is_special(std::string_view input) {
  uint64_t inputu = string_to_uint64(input);

  return string_to_uint64(
             shiftxor_table +
             (((inputu >> 28) ^ (inputu >> 14)) &
              0x78)) == inputu;
}

If you do not want to assume that the strings are padded (i.e., no cheating), then you can do it the “gperf way” (it is my own code, but it is based on gpref):

std::string_view table_hashnocheat_is_special[] = {"http", "", "https", 
  "ws", "ftp", "wss", "file", ""};

bool hashnocheat_is_special(std::string_view input) {
  if(input.empty()) { return false; }
  int hash_value = (2*input.size() + (unsigned)(input[0])) & 7;
  const std::string_view target = table_craftedhash_is_special[hash_value];
  return (target[0] == input[0]) && (target.substr(1) == input.substr(1));
}

I am sure that there are faster and more clever alternatives!

In any case, how fast are my alternatives? Using GCC 11 on an Intel Ice Lake server, I get the following results:

gperf7.1 ns/string

regex 360 ns/string
std::unordered_map 19 ns/string
direct 16 ns/string
direct (branchy) 13 ns/string
direct (branchless) 18 ns/string
hashnocheat_is_special 7.0 ns/string
fast 2.6 ns/string
faster 1.9 ns/string
hashing (shiftxor_is_special) 1.1 ns/string

On an Apple M2 with LLVM 12, I get similar (but better) results:

regex 450 ns/string
std::unordered_map 15 ns/string
direct (branchy) 7.5 ns/string
direct (branchless) 4.8 ns/string
gperf 8.0 ns/string
hashnocheat_is_special 7.2 ns/string
fast 1.1 ns/string
faster 0.8 ns/string
hashing (shiftxor_is_special) 0.4 ns/string

Care is needed when optimizing such small functions: whether and how the function gets inlined can be critical to the good performance. The results will depend also on the data source and on the compiler.

The hashing method (e.g., shiftxor_is_special) has the benefit of being essentially branch-free which makes its performance having little dependency on the distribution of the input. It is also fastest in these tests.

If you cannot safely pad your strings, I recommend the hashnocheat_is_special function. Having to deal with variable-length strings is significantly slower, but it is sometimes necessary.

My source code is available.

Further reading: I was later informed that my friend Wojciech Muła had looked at the problem earlier in 2022. He did not look at string padding, but for the version with no-cheating (variable-length strings), he comes up with the same conclusion that I do.

Science and Technology links (December 25 2022)

Fast base16 encoding

Given binary data, we often need to encode it as ASCII text. Email and much of the web effectively works in this manner.

A popular format for this purpose is base64. With Muła, we showed that we could achieve excellent speed using vector instructions on commodity processors (2018, 2020). However, base64 is a bit tricky.

A much simpler format is just base16. E.g., you just transcribe each byte into two bytes representing the value in hexadecimal notation. Thus the byte value 1 becomes the two bytes ’01’. The byte value 255 becomes ‘FF’, and so forth. In other words, you use one byte (or one character) per ‘nibble’: a byte is made of two nibbles: the most-significant 4 bits and the least-significant 4 bits.

How could encode base16 quickly? A reasonable approach might be to use a table. You grab one byte from the input and you directly lookup the 2 bytes from the output which you immediately write out:

void encode_scalar(const uint8_t *source, size_t len, char *target) {
  const uint16_t table[] = {
      0x3030, 0x3130, 0x3230, 0x3330, 0x3430, ...
      0x6366, 0x6466, 0x6566, 0x6666};
  for (size_t i = 0; i < len; i++) {
    uint16_t code = table[source[i]];
    ::memcpy(target, &code, 2);
    target += 2;
  }
}

It requires a 512-byte table but that is not concerning.

Could we do better?

Milosz Krajewski wrote some good-looking C# code using vector instructions. I wrote something that should be the equivalent using x64 C++. We have both routines for 128-bit and 256-bit vectors. My code is for demonstration purposes but it is essentially correct.

The core idea is not complicated. You must grab a vector of bytes. Then you must somehow expand it out: each nibble must go into a byte. And then the magic is this: we use the fast vectorized lookup (e.g., the pshufb instruction) to look up each nibble into a 16-byte table containing the letters ‘0’, ‘1’…’a’, …’f’.

Here is the 128-bit code using Intel intrinsics:

  __m128i shuf = _mm_set_epi8('f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6',
                              '5', '4', '3', '2', '1', '0');
  size_t i = 0;
  __m128i maskf = _mm_set1_epi8(0xf);
  for (; i + 16 <= len; i += 16) {
    __m128i input = _mm_loadu_si128((const __m128i *)(source + i));
    __m128i inputbase = _mm_and_si128(maskf, input);
    __m128i inputs4 =
        _mm_and_si128(maskf, _mm_srli_epi16(input, 4));
    __m128i firstpart = _mm_unpacklo_epi8(inputs4, inputbase);
    __m128i output1 = _mm_shuffle_epi8(shuf, firstpart);
    __m128i secondpart = _mm_unpackhi_epi8(inputs4, inputbase);
    __m128i output2 = _mm_shuffle_epi8(shuf, secondpart);
    _mm_storeu_si128((__m128i *)(target), output1);
    target += 16;
    _mm_storeu_si128((__m128i *)(target), output2);
    target += 16;
  }

The 256-bit code is roughly the same, with one extra instruction to shuffle the bytes to compensate for the fact that 256-bit instructions work ‘per lane’ (in units of 128-bit words). My source code is available.

We might therefore expect the 256-bit to be maybe twice as fast? My results on an icelake processor with GCC11:

table lookup 0.9 GB/s
128-bit vectors 6.4 GB/s
256-bit vectors 11 GB/s

We are not quite twice as fast, but close enough. I do not find these speeds very satisfying: I expect that less naive code could be faster.

Milosz gets much poorer results in C#: the 256-bit code is barely faster than the 128-bit code, but he does some relatively complicated computation in the 256-bit code instead of just calling the 256-bit shuffle instruction (vpshufb). (I suspect that he will soon fix this issue if he can.)

Our code would work on ARM as well after minor changes. For AVX-512 or SVE, we may need different approaches. We could add both encoding and decoding.

The size of things in bytes

storing 1 GiB/month on the cloud 0.02$US
web site of my twitter profile (@lemire), HTML alone 296 KiB
web site of my twitter profile (@lemire), all data 3.9 MiB
Google result for ‘Canada’, HTML alone 848 KiB
Google result for ‘Canada’, all data 3.7 MiB
Node JS runtime 164 MiB
Size of the Java (19) runtime 330 MiB
LLVM/clang compiler+runtime 5.5 GiB
Boost (C++) library (source) 609 MiB
Go runtime 471 MiB

Implementing ‘strlen’ using SVE

In C, the length of a string in marked by a 0 byte at the end of the string. Thus to determine the length of the string, one must scan it, looking for the 0 byte. Recent ARM processors have a powerful instruction set (SVE) that is well suited for such problems. It allows you to load large registers at once and to do wide comparisons (comparing many bytes at once).

Yet we do not want to read too much data. If you read beyond the string, you could hit another memory page and trigger a segmentation fault. This could crash your program.

Thankfully, SVE comes with a load instruction that would only fault on the ‘first active element’: as long as the first element you are loading is valid, then there is no fault.

With this in mind, a simple algorithm to compute the length of a C string is as follows:

  1. Load a register.
  2. Compare each byte in it to 0.
  3. If any comparison matches, then locate the match and return the corresponding length.
  4. If not, increment by the register size (given by svcntb()), and repeat.

Using intrinsics, the code looks as follows…

size_t sve_strlen(const char *s) {
  size_t len = 0;
  while (true) {
    svuint8_t input = svldff1_u8(svptrue_b8(), (const uint8_t *)s + len);
    svbool_t matches = svcmpeq_n_u8(svptrue_b8(), input, 0);
    if (svptest_any(svptrue_b8(), matches)) {
      return len + svlastb_u8(svbrka_z(matches, matches), svindex_u8(0, 1));
    }
    len += svcntb();
  }
}

In assembly, the code looks as follows…

mainloop:
        ldff1b  { z0.b }, p0/z, [x10, x8]
        add     x8, x8, x9
        cmpeq   p1.b, p0/z, z0.b, #0
        b.eq    .mainloop

I do not discuss the details of my function, but it assumes implicitly that the underlying register size is no larger than 256 bytes which is guaranteed by the specification.

Benchmarking against your system’s strlen function is difficult methodologically. Nevertheless, we can measure the number of instructions retired for sizeable strings as an indication of the relative speed. Using GCC 11 on a graviton 3 system (Amazon), I get the following metrics:

system’s strlen 0.23 instructions per byte
SVE 0.15 instructions per byte

Though I do not advocate adopting SVE as a replacement for strlen at this time, the potential is interesting, considering that I threw together my implementation in minutes.

My source code is available.

Credit: Thanks to Denis Yaroshevskiy for inviting me to look at non-faulting loads.

Update: It turns out that strlen was one of the examples that ARM used in its slides presenting SVE. At a glance, their implementation looks like mine but with more sophistication.

Checking for the absence of a string, naive AVX-512 edition

Suppose you would like to check that a string is not present in a large document. In C, you might do the following using the standard function strstr:

bool is_present = strstr(mydocument, needle);

It is simple and likely very fast. The implementation is algorithmically sophisticated.

Can you do better?

Recent Intel and AMD processors have instructions that operate on 512-bit registers. So we can compare 64 bytes using a single instruction. The simplest algorithm to search for a string might look as follows…

  1. Load 64 bytes from our input document, compare them against 64 copies of the first character of the target string.
  2. If we find a match, load the second character of the target string, copy it 64 times within a register. Load 64 bytes from our input document, with an offset of one byte.
  3. Repeat as needed for the second, third, and so forth characters…
  4. Then advance in the input by 64 bytes and repeat.

Using Intel intrinsic functions, the algorithm looks as follows:

  
  for (size_t i = 0; ...; i += 64) {
    __m512i comparator = _mm512_set1_epi8(needle[0]);
    __m512i input = _mm512_loadu_si512(in + i);
    __mmask64 matches = _mm512_cmpeq_epi8_mask(comparator, input);
    for (size_t char_index = 1; matches && char_index < needle_len; 
         char_index++) {
      comparator = _mm512_set1_epi8(needle[char_index]);
      input = _mm512_loadu_si512(in + i + char_index);
      matches =
          _kand_mask64(matches, _mm512_cmpeq_epi8_mask(comparator, input));
    }
    if(matches) { return true; }
  }
  return false;

It is about the simplest algorithm I could think of.

We are now going to benchmark it against strstr. To make it interesting, I will pick a string that it designed to be repeatedly ‘almost’ in the input document, except for its last character. It is a worst-case scenario.

I use GCC 11 on an Intel Icelake processor. The source code of my benchmark is available.

number of characters in the string AVX-512 (naive) strstr
2 33 GB/s 13 GB/s
5 22 GB/s 9 GB/s
10 13 GB/s 8 GB/s
14 10 GB/s 7 GB/s

Unsuprisingly, my naive AVX-512 approach scales poorly in this benchmark with the string length. However, it is somewhat pessimistic, I would expect better results with a more realistic use case.

It should be possible to do much better with some more sophistication. However, for short strings, we are already twice as fast as strstr which is encouraging.

What is the memory usage of a small array in C++?

In an earlier blog post, I reported that the memory usage of a small byte array in Java (e.g., an array containing 4 bytes) was about 24 bytes. In other words: allocating small blocks of memory has substantial overhead.

What happens in C++?

To find out, I can try to allocate one million 4-byte arrays and look at the total memory usage of the process. Of course, the memory usage of the process will include some overhead unrelated to the 4-byte arrays, but we expect that such overhead will be relatively small.

From my benchmark, I get the following results…

system memory usage (in bytes)
GCC 8, Linux x86 32 bytes
LLVM 14, Apple aarch64 16 bytes

The results will vary depending on the configuration of your system, on your optimization level, and so forth.

But the lesson is that allocating four bytes (new char[4] or malloc(4)) does not use four bytes of memory… it will generally use much more.

Science and Technology links (December 11 2022)

  1. As we focus on some types of unfortunate discrimination (race, gender), we may become blind to other types of discrimination. For example, tend to discrimate against ugly people, short men, old people, and so on.
  2. Life may have emerged on Earth thanks to ‘aqueous microdroplets’.
  3. Naked mole rats are long-lived mammals. We believe that they experience negligible senescence, meaning that they do not lose fitness with age. One theory to explain they particular biology has to do with the fact that they live in a subterranean setting. Dammann et al. suggest that the fact that they are social creatures might also play a role.
  4. Low-carb diets may help obese individuals who face food addiction symptoms.
  5. Climate change may change how people name their children according to the journal Science.
  6. Offshore wind turbines endanger whales according to a Bloomberg report.
  7. Grip strength (who strongly you can hold on to something with your hands) is a good indication of your biological age.
  8. An mRNA technology might reprogram and rejuvenate your skin (according to a commercial press release).
  9. Many climate models predict that increased CO2 will impact clouds which would cause much of the predicted warming, since CO2 by itself is not very potent. New research published in Nature makes these models implausible. According to the authors, it makes it improbable that the climate is highly sensitive to CO2 levels.
  10. We have been told to avoid meat and dairy products (butter, cheese) because they contain saturated fats that might cause heart diseases. Teicholz (2022) reports that the rigorous evidence on saturated fats, which showed they do not cause heart disease, has long suppressed. The research seems clear: saturated fats have no effect on heart attacks, strokes or cardiovascular mortality, or total mortality.
  11. As you grow older, your cells tend to get larger.
  12. Goklany (2020) estimates that over 60% of our food production is dependent on fossil-fuel technologies. Currently, 12% of the land (excluding Antartica) is devoted to agriculture (cropland). Without fossil fuel, this percentage would need to move to 33%. The process of cropland expansion would harm biodiversity, increase food prices, and increase food insecurity among the poor.

Fast midpoint between two integers without overflow

Let us say that I ask you to find the number I am thinking about between -1000 and 1000, by repeatedly guessing a number. With each guess, I tell you whether your guess is correct, smaller or larger than my number. A binary search algorithm tries to find a value in an interval by repeating finding the midpoint, using smaller and smaller intervals. You might start with 0, then use either -500 or 500 and so forth.

Thus we sometimes need a fast algorithm to find the midpoint in an interval of integers.The following simple routine to find the midpoint is incorrect:

int f(int x, int y) {
  return (x + y)/2;
}

If the integers use a 64-bit two’s complement representation, we could pick 1 for x and 9223372036854775807 for y, and then the result of the function could be a large negative value.

Efficient solutions are provided by Warren in Hacker’s Delight (section 2.5):

int f(int x, int y) { 
  return (x|y) - ((x^y)>>1); 
}

int f(int x, int y) { 
  return ((x^y)>>1) + (x&y); 
}

They provide respectively the smallest value no smaller than (x+y)/2 and the largest value no larger than (x+y)/2. The difference between the two values is (x ^ y) & 1 (credit: Harold Aptroot).

They follow from the following identities: x+y=(x^y)+2*(x&y) and x+y=2*(x|y)-(x^y).

Update: Reader BartekF observes that C++20 added a dedicated function for this problem: std::midpoint.

Optimizing compilers reload vector constants needlessly

Modern processors have powerful vector instructions which allow you to load several values at once, and operate (in one instruction) on all these values. Similarly, they allow you to have vector constants. Thus if you wanted to add some integer (say 10001) to all integers in a large array, you might first load a constant with 8 times the value 10001, then you would load elements from your array, 8 elements by 8 elements, add the vector constant (thus do 8 additions at once), and then store the result. Everything else being equal, this might be 8 times faster.

An optimizing compiler might even do this optimization for you (a process called ‘auto-vectorization). However, for more complex code, you might need to do it manually using “intrinsic” functions (e.g., _mm256_loadu_si256, _mm256_add_epi32, etc.).

Let us consider the simple case I describe, but where we process two arrays at once… using the same constant:

#include <x86intrin.h>
#include <stdint.h>
void process_avx2(const uint32_t *in1, const uint32_t *in2, size_t len) {
  // define the constant, 8 x 10001
  __m256i c = _mm256_set1_epi32(10001);
  const uint32_t *finalin1 = in1 + len;
  const uint32_t *finalin2 = in2 + len;
  for (; in1 + 8 <= finalin1; in1 += 8) {
    // load 8 integers into a 32-byte register
    __m256i x = _mm256_loadu_si256((__m256i *)in1);
    // add the 8 integers just loaded to the 8 constant integers
    x = _mm256_add_epi32(c, x);
    // store the 8 modified integers
    _mm256_storeu_si256((__m256i *)in1, x);
  };
  for (; in2 + 8 <= finalin2; in2 += 8) {
    // load 8 integers into a 32-byte register
    __m256i x = _mm256_loadu_si256((__m256i *)in2);
    // add the 8 integers just loaded to the 8 constant integers
    x = _mm256_add_epi32(c, x);
    // store the 8 modified integers
    _mm256_storeu_si256((__m256i *)in2, x);
  }
}

My expectation, until recently, was that optimizing compilers would  keep the constant in a register, and never load it twice. Why would they?

Yet you can check that GCC loads the constant twice. You will recognize the assembly sequence:

mov          eax, 10001 // load 10001 in a general register
vpbroadcastd ymm1, eax  // broadcast 10001 to all elements

In  this instance, other compilers (like LLVM) do better. However, in other instances, both LLVM and GCC happily load constants more than once. Only the Intel compiler (ICC) seems to be able to avoid this issue with some consistency.

The processor has more than enough vector registers, so it is not a register allocation issue. Of course, there are instances where it is  best to avoid creating the constant, but you can check that even when the compiler ought to know that the constant is always needed, it may still create it twice. AVX-512 has introduced new mask types and they suffer from this effect as well.

Does it matter? In most cases, this effect should have little performance impact. It is almost surely only a few instructions of overhead per function.

It would be interesting to be able to instruct the compiler not to do reload the constants. You might think that the static keyword could help, but with LLVM, static vector variables may be protected by a lock, which probably makes your code even heavier.

 

How big are your SVE registers ? (AWS Graviton)

Amazon has some neat ARM-based systems based on Amazon’s own chips (Graviton). You can access them through Amazon’s web services (AWS). These processors have advanced vector instructions able to process many values at once. These instructions are part of an instruction sets called SVE for Scalable Vector Extension. SVE has a trick: it hides its internal register size from you. Thus, to the question “how many values can it process at once?”, the answer is ‘”it depends”.

Thankfully, you can still write a program to find out. The svcntb intrinsic tells you how many 8-bit integers fits in a full register. Thus the following C++ line should tell you the vector register size in bytes:

std::cout << "detected vector register size (SVE): " 
  << svcntb() << " bytes" << std::endl;

And here is what I get currently on an AWS server:

$ ./svesize
detected vector register size (SVE): 32 bytes

It is hard to find ARM processors with such wide registers (32 bytes) and it is unclear whether future iterations will still have 32 bytes registers.

My source code is available.