We often organize data using fixed blocks of memory. When these blocks are relatively small (e.g., 8 bits, 16 bits, 32 bits, 64 bits), we commonly call them ‘words’.

The notion of ‘word’ is important because processors do not operate over arbitrary data types. For practical reasons, processors expect data to fit in hardware registers having some fixed size (usually 64-bit registers). Most modern processors accommodate 8-bit, 16-bit, 32-bit and 64-bit words with fast instructions. It is typical for the granularity of memory accesses to be no smaller than the ‘byte’ (8 bits), so bytes are, in a sense, the smallest practical words.

Variable-length data structures like strings might be made of a variable number of words. Historically, strings have been made of lists of bytes, but other alternatives are common (e.g., 16-bit or 32-bit words).

The simplest type is probably the Boolean type. A Boolean value can take either the false or the true value. Though a single bit suffices to represent a Boolean value, it is common to use a whole byte (or more).

We can negate a Boolean value: the true value becomes the false value, and conversely. There are also binary operations:

- The result of the OR operation between two Boolean values is false if and only if both inputs are false. The OR operation is often noted `|`. E.g., `1 | 0 == 1`, where we use the convention that the symbol `==` states the equality between two values.
- The result of the AND operation between two Boolean values is true if and only if both inputs are true. The AND operation is often noted `&`. E.g., `1 & 1 == 1`.
- The result of the XOR operation is true if and only if the two inputs differ in value. The XOR operation is often noted `^`. E.g., `1 ^ 1 == 0`.
- The result of the AND NOT operation between two Boolean values is true if and only if the first Boolean value is true and the second one is false.

Integer data types are probably the most widely supported in software and hardware, after the Boolean types. We often represent integers using digits. E.g., the integer 1234 has 4 decimal digits. By extension, we use ‘binary’ digits, called bits, within computers. We often write an integer in binary notation with the `0b` prefix. E.g., the integer `0b10` is two, the integer `0b10110` is equal to `2^1+2^2+2^4` or 22. After the prefix `0b`, we enumerate the bit values, starting with the most significant bit. We may also use the hexadecimal (base-16) notation with the `0x` prefix: in that case, we use 16 different digits in the list `0, 1, 2, 3,..., 9, A, B, C, D, E, F`. These digits have values `0, 1, 2, 3,..., 9, 10, 11, 12, 13, 14, 15`. For digits represented as letters, we may use either lower or upper case. Thus the number `0x5A` is equal to `5 * 16 + 10` or 90 in decimal. The hexadecimal notation is convenient when working with binary values: a single digit represents a 4-bit value, two digits represent an 8-bit value, and so forth.

We might count the number of digits of an integer using the formula `ceil(log(x+1))`, where the logarithm is in the base you are interested in (e.g., base 2) and where `ceil` is the *ceiling* function: `ceil(x)` returns the smallest integer no smaller than `x`. The product between an integer having `d1` digits and an integer having `d2` digits has either `d1+d2-1` digits or `d1+d2` digits. To illustrate, let us consider the product between two integers having three digits each. In base 10, the smallest such product is 100 times 100 or 10,000, so it requires 5 digits. The largest product is 999 times 999 or 998,001, so 6 digits.

For speed or convenience, we might use a fixed number of digits. Given that we work with binary computers, we are likely to use binary digits. We also need a way to represent the sign of a number (negative and positive).

Possibly the simplest number type is the unsigned integer, where we use a fixed number of bits to represent non-negative integers. Most processors support arithmetic operations over unsigned integers. The term ‘unsigned’ in this instance is equivalent to ‘non-negative’: the integers can be zero or positive.

We can operate on binary integers using bitwise logical operations. For example, the bitwise AND between `0b101` and `0b1100` is `0b100`. The bitwise OR is `0b1101`. The bitwise XOR (exclusive OR) is `0b1001`.

The powers of two (1, 2, 4, 8,…) are the only numbers having a single 1-bit in their binary representation (`0b1`, `0b10`, `0b100`, `0b1000`, etc.). The numbers preceding powers of two (1, 3, 7,…) are the numbers made of consecutive 1-bits in the least significant positions (`0b1`, `0b11`, `0b111`, `0b1111`, etc.). A unique characteristic of powers of two is that their bitwise AND with the preceding integer is zero: e.g., 4 AND 3 is zero, 8 AND 7 is zero, and so forth.

In the Go programming language, for example, we have 8-bit, 16-bit, 32-bit and 64-bit unsigned integer types: `uint8`, `uint16`, `uint32`, `uint64`. They can represent all numbers from 0 up to (but not including) 2 to the power of 8, 16, 32 and 64. For example, an 8-bit unsigned integer can represent all integers from 0 up to 255 inclusively.

Because we choose to use a fixed number of bits, we therefore can only represent a range of integers. The result of an arithmetic operation may exceed the range (an *overflow*). For example, 255 plus 2 is 257: though both inputs (255 and 2) can be represented using 8-bit unsigned integers, the result exceeds the range.

Regarding multiplications, the product of two 8-bit unsigned integers is at most 65025, which can be represented by a 16-bit unsigned integer. It is always the case that the product of two `n`-bit integers can be represented using `2n` bits. The converse is untrue: a given `2n`-bit integer is not necessarily the product of two `n`-bit integers. As `n` becomes large, only a small fraction of all `2n`-bit integers can be written as the product of two `n`-bit integers, a result first proved by Erdős.

Typically, arithmetic operations are “modulo” the power of two. That is, everything is as if we did the computation using infinite-precision integers and then we only kept the (positive) remainder of the division by the power of two.

Let us elaborate. Given two integers `a` and `b` (`b` being non-zero), there are unique integers `d` and `r`, where `r` is in `[0,b)`, such that `a = d * b + r`. The integer `r` is the remainder and the integer `d` is the quotient.

Euclid’s division lemma tells us that the quotient and the remainder exist and are unique. We can check uniqueness. Suppose that there is another such pair of integers (`d'` and `r'`), `a = d' * b + r'`. We can check that if `d'` is equal to `d`, then we must have that `r'` is equal to `r`, and conversely, if `r'` is equal to `r`, then `d'` is equal to `d`. Suppose that `r'` is greater than `r` (if not, just reverse the argument). Then, by subtraction, we have that `0 = (d'-d)*b + (r'-r)`. We must have that `r'-r` is in `[0,b)`. If `d'-d` is negative, then we have that `(d-d')*b = (r'-r)`, but that is impossible because `r'-r` is in `[0,b)` whereas `(d-d')*b` is greater than or equal to `b`. A similar argument works when `d'-d` is positive.

In our case, the divisor (`b`) is a power of two. When the numerator (`a`) is positive, the remainder amounts to a selection of the least significant bits. For example, the remainder of the division of 65 (or `0b1000001`) by 64 is 1.

When considering unsigned arithmetic, it often helps to think that we keep only the least significant bits (8, 16, 32 or 64) of the final result. Thus if we take 255 and we add 2, we get 257, but as an 8-bit unsigned integer, we get the number 1. Thus, using integers of the type `uint8`, we have that `255 + 2` is 1 (`255 + 2 == 1`). The power of two itself is zero: 256 is equal to zero as a `uint8` integer. If we subtract two numbers and the value would be negative, we effectively ‘wrap’ around: 10 – 20 in `uint8` arithmetic is the positive remainder of (-10) divided by 256, which is 246. Another way to think of negative numbers is that we can add the power of two (say 256) as many times as needed (since its value is effectively zero) until we get a value that is between 0 and the power of two. Thus if we must evaluate `1-5*250` as an 8-bit integer, we take the result (-1249) and we add 256 as many times as needed: we have that `-1249+5*256` is 31, a number between 0 and 256. Thus `1-5*250` is 31 as an unsigned 8-bit number.

We have that `0-1`, as an 8-bit number, is 255 or `0b11111111`; `0-2` is 254, `0-3` is 253, and so forth. Consider the set of integers…

```
-1024, -1023,..., -513, -512, -511, ..., -1, 0, 1,...255, 256, 257,...
```

As 8-bit unsigned integers, they are mapped to

```
0, 1,..., 255, 0, 1, ..., 255, 0, 1,...255, 0, 1,...
```

Multiplication by a power of two is equivalent to shifting the bits left, possibly losing the leftmost bits. For example, 17 is `0b10001`. Multiplying it by 4, we get `0b1000100` or 68. If we were to multiply 17 by 16, we would get `0b100010000` or, as an 8-bit integer, `0b10000`. That is, as 8-bit unsigned integers, we have that `17 * 16` is 16. Thus we have that `17 * 16 == 1 * 16`.

The product of two non-zero integers may be zero. For example, `16*16` is zero as an 8-bit integer. It happens only when both integers are divisible by two. The product of two odd integers must always be odd.

We say that two numbers are ‘coprime’ if their greatest common divisor is 1. Odd integers are coprime with powers of two. Even integers are never coprime with a power of two.

When multiplying a non-zero integer by an odd integer using finite-bit arithmetic, we never get zero. Thus, for example, `3 * x` as an 8-bit integer is zero if and only if `x` is zero when using fixed-bit unsigned integers. It means that `3 * x` is equal to `3 * y` if and only if `x` and `y` are equal. Thus we have that the following Go code will print out all values from 1 to 255, without repetition:

```
for i := uint8(1); i != 0; i++ {
	fmt.Println(3 * i)
}
```

Multiplying integers by an odd integer permutes them.

If you consider powers of an odd integer, you similarly never get a zero result. However, you may eventually get the power to be one. For example, as an 8-bit unsigned integer, 3 to the power of 64 is 1. This number (64) is sometimes called the ‘order’ of 3. Since this is the smallest exponent such that the result is one, we have that all 63 preceding powers give distinct results. We can show this result as follows. Suppose that 3 raised to the power `p` is equal to 3 raised to the power `q`, and assume without loss of generality that `p>q`; then we have that `3` to the power of `p-q` must be 1, by inspection. And if both `p` and `q` are smaller than 64, then so must be `p-q`, a contradiction. Further, we can check that the powers of an odd integer repeat after the order is reached: we have that 3 to the power 64 is 1, 3 to the power of 65 is 3, 3 to the power of 66 is 9, and so forth. It follows that the order of any odd integer must divide the power of two (e.g., 256).

How large can the order of an odd integer be? We can check that all powers of an odd integer must be odd integers, and there are only 128 distinct odd 8-bit integers. Thus the order of an 8-bit odd integer can be at most 128. Conversely, Euler’s theorem tells us that any odd integer to the power of the number of odd integers (e.g., 3 to the power 128) must be one. Because the values of the powers of an odd integer repeat cyclically after the order is reached, we have that the order of any odd integer must divide 128 for 8-bit unsigned integers. Generally, irrespective of the width in bits of the words, the order of an odd integer must be a power of two.

Given two non-zero unsigned integers, `a` and `b`, we would expect that `a+b>max(a,b)`, but it is only true if there is no overflow. When and only when there is an overflow, we have that `a+b<min(a,b)` using finite-bit unsigned arithmetic. We can check for an overflow with either condition: `a+b<a` or `a+b<b`.

Typically, one of the most expensive operations a computer can do with two integers is to divide them. A division can require several times more cycles than a multiplication, and a multiplication is in turn often many times more expensive than a simple addition or subtraction. However, the division by a power of two and the multiplication by a power of two are inexpensive: we can compute the integer quotient of the division of an unsigned integer by shifting the bits *right*. For example, the integer 7 (0b111) divided by 2 is 0b011 or 3. We can further divide 7 (0b111) by 4 to get 0b001 or 1. The integer remainder is given by selecting the bits that would be shifted out: the remainder of 7 divided by 4 is 7 AND 0b11 or 0b11. The remainder of the division by two is just the least significant bit. Even integers are characterized by having zero as the least significant bit. Similarly, the multiplication by a power of two is just a left shift: the integer 7 (0b111) multiplied by two is 14 (0b1110). More generally, an optimizing compiler may produce efficient code for the computation of the remainder and quotient when the divisor is fixed. Typically, it involves at least a multiplication and a shift.

Given an integer `x`, we say that `y` is its multiplicative inverse if `x * y == 1`. We have that every odd integer has a multiplicative inverse because multiplication by an odd integer creates a permutation of all integers. We can compute this multiplicative inverse using Newton’s method: we start with a guess and, from the guess, we get a better one, and so forth, until we naturally converge to the right value. So we need some formula `f(y)` that we can call repeatedly (`y = f(y)`) until `y` converges. A useful recurrence formula is `f(y) = y * (2 - y * x)`. You can verify that if `y` is the multiplicative inverse of `x`, then `f(y) = y`. Suppose that `y` is not quite the inverse: say `x * y = 1 + z * p` for some odd integer `z` and some power of two `p`. If the power of two is (say) 8, then it tells you that `y` is the multiplicative inverse over the first three bits. We get `x * f(y) = x * y * (2 - y * x) = 2 * (1 + z * p) - (1 + z * p)^2 = 2 + 2 * z * p - (1 + 2 * z * p + z * z * p * p) = 1 - z * z * p * p`. We can see from this result that if `y` is the multiplicative inverse over the first `n` bits, then `f(y)` is the multiplicative inverse over the first `2n` bits. We double the precision each time we call the recurrence formula. It means that we can quickly converge on the inverse.

What should our initial guess for `y` be? If we use 3-bit words, then every odd number is its own inverse. So starting with `y = x` would give us three bits of accuracy, but we can do better: `(3 * x) ^ 2` provides 5 bits of accuracy. The following Go program verifies the claim:

```
package main

import "fmt"

func main() {
	for x := 1; x < 32; x += 2 {
		y := (3 * x) ^ 2
		if (x*y)&0b11111 != 1 {
			fmt.Println("error")
		}
	}
	fmt.Println("Done")
}
```

Observe how we capture the 5 least significant bits using the expression `&0b11111`: it is a bitwise logical AND operation.

Starting from 5 bits, the first call to the recurrence formula gives 10 bits, then 20 bits for the second call, then 40 bits, then 80 bits. So, we need to call our recurrence formula 2 times for 16-bit values, 3 times for 32-bit values and 4 times for 64-bit values. The function `FindInverse64` computes the 64-bit multiplicative inverse of an odd integer:

```
func f64(x, y uint64) uint64 {
	return y * (2 - y*x)
}

func FindInverse64(x uint64) uint64 {
	y := (3 * x) ^ 2 // 5 bits
	y = f64(x, y)    // 10 bits
	y = f64(x, y)    // 20 bits
	y = f64(x, y)    // 40 bits
	y = f64(x, y)    // 80 bits
	return y
}
```

We have that `FindInverse64(271) * 271 == 1`. Importantly, it fails if the provided integer is even.

We can use multiplicative inverses to replace the division by an odd integer with a multiplication. That is, if you precompute `FindInverse64(3)`, then you can compute the division by three for any multiple of three by computing the product: e.g., `FindInverse64(3) * 15 == 5`.

When we store multi-byte values such as unsigned integers in arrays of bytes, we may use one of two conventions: little- and big-endian. The little- and big-endian variants only differ by the byte order: we either start with the least significant bytes (little endian) or with the most significant bytes (big endian). Let us consider the integer 12345. As a hexadecimal value, it is 0x3039. If we store it as two bytes, we may either store it as the byte value 0x30 followed by the byte value 0x39 (big endian), or the reverse (0x39 followed by 0x30). Most modern systems default to the little-endian convention, and there are relatively few big-endian systems. In practice, we rarely have to be concerned with the endianness of our system.

Given unsigned integers, how do we add support for signed integers? At first glance, it is tempting to reserve a bit for the sign. Thus if we have 32 bits, we might use one bit to indicate whether the value is positive or negative, and then we can use 31 bits to store the absolute value of the integer.

Though this sign-bit approach is workable, it has downsides. The first obvious downside is that there are two possible zero values: `+0` and `-0`. The other downside is that it makes signed integers wholly distinct values as compared to unsigned integers: ideally, we would like hardware instructions that operate on unsigned integers to ‘just work’ on signed integers.

Thus modern computers use two’s complement notation to represent signed integers. To simplify the exposition, we consider 8-bit integers. We represent all positive integers up to half the range (127 for 8-bit words) in the same manner, whether using signed or unsigned integers. Only when the most significant bit is set, do we differ: for the signed integers, it is as if the unsigned value derived from all but the most significant bit is subtracted by half the range (128). For example, as an 8-bit signed value, 0b11111111 is -1. Indeed, ignoring the most significant bit, we have 0b1111111 or 127, and subtracting 128, we get -1.

Binary | unsigned | signed |
---|---|---|
0b00000000 | 0 | 0 |
0b00000001 | 1 | 1 |
0b00000010 | 2 | 2 |
0b01111111 | 127 | 127 |
0b10000000 | 128 | -128 |
0b10000001 | 129 | -127 |
0b11111110 | 254 | -2 |
0b11111111 | 255 | -1 |

In Go, you can ‘cast’ unsigned integers to signed integers, and vice versa: Go leaves the binary values unchanged, but it simply reinterprets the value as an unsigned or signed integer. If we execute the following code, we have that `x==z`:

```
x := uint16(52429)
y := int16(x)
z := uint16(y)
```

Conveniently, whether we compute the multiplication, the addition or the subtraction between two values, the result is the same (in binary) whether we interpret the bits as a signed or unsigned value. Thus we can use the same hardware circuits.

A downside of the two’s complement notation is that the smallest negative value (-128 in the 8-bit case) cannot be safely negated. Indeed, the number 128 cannot be represented using 8-bit signed integers. This asymmetry is unavoidable because we have three types of numbers: zero, negative values and positive values. Yet we have an even number of binary values.

Like with unsigned integers, we can shift (right and left) signed integers. The left shift works like for unsigned integers at the bit level. We have that

```
x := int8(1)
(x << 1) == 2
(x << 7) == -128
```

However, the right shift works differently for signed and unsigned integers. For unsigned integers, we shift in zeroes from the left; for signed integers, we shift in zeroes (if the integer is positive or zero) or ones (if the integer is negative). We illustrate this behaviour with an example:

```
x := int8(-1)
(x >> 1) == -1
y := uint8(x)
y == 255
(y >> 1) == 127
```

When a signed integer is positive, dividing by a power of two or shifting right gives the same result (`10/4 == (10>>2)`). However, when the integer is negative, it is only true when the negative integer is divisible by the power of two. When the negative integer is not divisible by the power of two, the shift result is smaller by one than the division, as illustrated by the following code:

```
x := int8(-10)
(x / 4) == -2
(x >> 2) == -3
```

On computers, real numbers are typically approximated by binary floating-point numbers: a fixed-width integer `m` (the *significand*) multiplied by 2 raised to an integer exponent `p`: `m * 2**p`, where `2**p` represents the number two raised to the power `p`. A sign bit is added so that both a positive and a negative zero are available. Most systems today follow the IEEE 754 standard, which means that you can get consistent results across programming languages and operating systems. Hence, it does not matter very much if you implement your software in C++ under Linux whereas someone else implements it in C# under Windows: if you both have recent systems, you can expect identical numerical outcomes when doing basic arithmetic and square-root operations.

A positive *normal* double-precision floating-point number is a binary floating-point number where the 53-bit integer `m` is in the interval `[2**52,2**53)`, while being interpreted as a number in `[1,2)` by virtually dividing it by `2**52`, and where the 11-bit exponent `p` ranges from `-1022` to `+1023`. Thus we can represent all values between `2**-1022` and up to, but not including, `2**1024`. Some values smaller than `2**-1022` can be represented as *subnormal* values: they use a special exponent code which has the value `2**-1022`, and the significand is then interpreted as a value in the interval `[0,1)`.

In Go, a `float64` number can represent all decimal numbers made of a 15-digit significand from approximately `-1.8 * 10**308` to `1.8 * 10**308`. The reverse is not true: it is not sufficient to have 15 digits of precision to distinguish any two floating-point numbers: we may need up to 17 digits.

The `float32` type is similar. It can represent all numbers between `2**-126` up to, but not including, `2**128`, with special handling for some numbers smaller than `2**-126` (subnormals). The `float32` type can represent exactly all decimal numbers made of a 6-digit decimal significand, but 9 digits are needed in general to identify uniquely a number.

Floating-point numbers also include the positive and negative infinity, as well as a special not-a-number value. They are identified by a reserved exponent value.

Numbers are typically serialized as decimal numbers in strings and then parsed back by the receiver. However, it is generally impossible to convert decimal numbers exactly into binary floating-point numbers: the number `0.2` has no exact representation as a binary floating-point number. However, you should expect the system to choose the best possible approximation: `7205759403792794 * 2**-55` as a `float64` number (or about `0.20000000000000001110`). If the initial number was a `float64` (for example), you should expect the exact value to be preserved: it will work as expected in Go.

One of the earliest string standards is ASCII: it was first specified in the early 1960s. The ASCII standard is still popular. Each character is a byte, with the most significant bit set to zero. There are therefore only 128 distinct ASCII characters. It is often sufficient for simple tasks like programming. Unfortunately, the ASCII standard could only ever represent up to 128 characters: far less than needed.

Many diverging standards emerged for representing characters in software. The existence of multiple incompatible formats made the production of interoperable localized software challenging.

Engineers developed Unicode in the late 1980s as an attempt to provide a universal standard. Initially, it was believed that using 16 bits per character would be sufficient, but this belief was wrong. The Unicode standard was extended to include up to 1,114,112 characters. Only a small fraction of all possible characters have been assigned, but more are assigned over time with each Unicode revision. The Unicode standard is an extension of the ASCII standard: the first 128 Unicode characters match the ASCII characters.

Due to the original expectation that Unicode would fit in a 16-bit space, a format based on 16-bit words (UTF-16) was published in 1996. It may use either 16 bits or 32 bits per character. The UTF-16 format was adopted by programming languages such as Java, and became a default under Windows. Unfortunately, UTF-16 is not backward compatible with ASCII at a byte level. An ASCII-compatible format was proposed and formalized in 2003: UTF-8. Over time, UTF-8 became widely used for text interchange formats such as JSON, HTML or XML. Programming languages such as Go, Rust and Swift use UTF-8 by default. Both formats (UTF-8 and UTF-16) require validation: not all arrays of bytes are valid. The UTF-8 format is more expensive to validate.

ASCII characters require one byte with UTF-8 and two bytes with UTF-16. The UTF-16 format can represent all characters, except for the supplemental characters such as emojis, using two bytes. The UTF-8 format uses two bytes for Latin, Hebrew and Arabic alphabets, three bytes for Asiatic characters and 4 bytes for the supplemental characters.

UTF-8 encodes values in sequences of one to four bytes. We refer to the first byte of a sequence as a leading byte; the most significant bits of the leading byte indicate the length of the sequence:

- If the most significant bit is zero, we have a sequence of one byte (ASCII).
- If the three most significant bits are 0b110, we have a two-byte sequence.
- If the four most significant bits are 0b1110, we have a three-byte sequence.
- Finally, if the five most significant bits are 0b11110, we have a four-byte sequence.

All bytes following the leading byte in a sequence are continuation bytes, and they must have 0b10 as their most significant bits. Except for these required most significant bits, the numerical value of the character (between 0 and 1,114,112) is stored by starting with the most significant bits (in the leading byte) followed by the less significant bits (in the continuation bytes).

In the UTF-16 format, characters in 0x0000-0xD7FF and 0xE000-0xFFFF are stored as single 16-bit words. Characters in the range 0x010000 to 0x10FFFF require two 16-bit words called a surrogate pair. The first word in the pair is in the range 0xd800 to 0xdbff whereas the second word is in the range from 0xdc00 to 0xdfff. The character value is made of the 10 least significant bits of the two words, using the second word as least significant, and adding 0x10000 to the result. There are two types of UTF-16 format. In the little-endian variant, each 16-bit word is stored using the least significant bits in the first byte. The reverse is true in the big-endian variant.

When using ASCII, it is relatively easy to access the characters in random order. For UTF-16, it is possible if we assume that there are no supplemental characters; however, since some characters require 4 bytes while others require 2 bytes, it is generally not possible to go directly to a character by its index without scanning the preceding content. UTF-8 is similarly not randomly accessible in general.

Software often depends on the chosen locale: e.g., US English, French Canadian, and so forth. Sorting strings is locale-dependent. It is not generally possible to sort strings without knowing the locale. However, it is possible to sort strings lexicographically as byte sequences (UTF-8) or as 16-bit word sequences (UTF-16). When using UTF-8, the result is then a string sort based on the characters’ numerical value.

Consider the following two lines of C code:

```
printf("Good day professor Jones");
printf("Good day professor Jane");
```

There is redundancy since the prefix “Good day professor” is the same in both cases. To my knowledge, no compiler is likely to trim this redundancy. However, you can get the desired trimming by breaking the strings:

```
printf("Good day professor ");
printf("Jones");
printf("Good day professor ");
printf("Jane");
```

Most compilers will recognize the constant string and store it once in the program. It works even if the constant string “Good day professor” appears in different functions.

Thus the following function may return true:

```
const char * str1 = "dear friend";
const char * str2 = "dear friend";
return str1 == str2;
```

That is, you do not need to manually create constant strings: the compiler recognizes the redundancy (typically).

The same trick fails with extended strings:

```
const char * str1 = "dear friend";
const char * str2 = "dear friend\0f";
return str1 == str2;
```

All compilers I tried return false. They create two C strings even if one is a prefix of the other in the following example…

```
char get1(int k) {
  const char * str = "dear friend";
  return str[k];
}

char get2(int k) {
  const char * str = "dear friend\0f";
  return str[k];
}
```

Unsurprisingly, the “data compression” trick works with arrays. For example, the arrays in these two functions are likely to be compiled to just one array because the compiler recognizes that they are identical:

```
int f(int k) {
  int array[] = {1,2,3,4,5,34432,321323,321321,1,
                 2,3,4,5,34432,321323,321321};
  return array[k];
}

int g(int k) {
  int array[] = {1,2,3,4,5,34432,321323,321321,1,
                 2,3,4,5,34432,321323,321321};
  return array[k+1];
}
```

It may still work if one array is an exact subarray of the other ones with GCC, as in this example:

```
int f(int k) {
  int array[] = {1,2,3,4,5,34432,321323,321321,1,
                 2,3,4,5,34432,321323,321321};
  return array[k];
}

int g(int k) {
  int array[] = {1,2,3,4,5,34432,321323,321321,1,
                 2,3,4,5,34432,321323,321321,1,4};
  return array[k+1];
}
```

You can also pile up several arrays as in the following case where GCC creates just one array:

```
long long get1(int k) {
  long long str[] = {1,2,3};
  return str[k];
}

long long get2(int k) {
  long long str[] = {1,2,3,4};
  return str[k+1];
}

long long get3(int k) {
  long long str[] = {1,2,3,4,5,6};
  return str[k+1];
}

long long get4(int k) {
  long long str[] = {1,2,3,4,5,6,7,8};
  return str[k+1];
}
```

It also works with arrays of pointers, as in the following case:

```
const char * get1(int k) {
  const char * str[] = {"dear friend", "dear sister", "dear brother"};
  return str[k];
}

const char * get2(int k) {
  const char * str[] = {"dear friend", "dear sister", "dear brother"};
  return str[k+1];
}
```

Of course, if you want to make sure to keep your code thin and efficient, you should not blindly rely on the compiler. Nevertheless, it is warranted to be slightly optimistic.

- Attractive female students get better grades. They lose this benefit when courses move online.
- A research paper is much more likely to be highly ranked if the author is famous.
- The USA has many more prisoners than police officers (three prisoners for every police officer), while every other developed country has the reverse ratio.
- Diluting the blood plasma of older human beings rejuvenates them.
- Saturated fat, as found in meat and dairy products, is not associated with bad cardiovascular health. In other words, eating butter does not harm your heart.
- An electric car has reportedly about half the carbon footprint as that of a conventional car.
- India has overtaken the United Kingdom by GDP. The five largest economies are the United States, China, Japan, Germany and India. They are followed by the United Kingdom (6th), France (7th), Canada (8th), Italy (9th) and Brazil (10th). The second richest person in the world is Gautam Adani from India, the richest person being Elon Musk.

my title is "La vie"

becomes

my title is \"La vie\"

A simple routine in C++ to escape a string might look as follows:

```cpp
// 'in' scans the input, 'out' writes the escaped copy; the loop bounds
// were elided in the original sketch ('inend' is assumed here)
for (; in != inend; in++) {
  if ((*in == '\\') || (*in == '"')) { *out++ = '\\'; }
  *out++ = *in;
}
```

Such a character-by-character approach is unlikely to provide the best possible performance on modern hardware.

Recent Intel processors have fast instructions (AVX-512) that are well suited for such problems. I decided to sketch a solution using Intel intrinsic functions. The routine goes as follows:

- I use two constant registers containing 64 copies of the backslash character and 64 copies of the quote character.
- I start a loop by loading 32 bytes from the input.
- I expand these 32 bytes into a 64-byte register, interleaving zero bytes.
- I compare these bytes with the quotes and backslash characters.
- From the resulting mask, I then construct (by shifting and blending) escaped characters.
- I ‘compress’ the result, removing the zero bytes that appear before the unescaped characters.
- I advance the output pointer by the number of written bytes and I continue the loop.

The C++ code roughly looks like this…

```cpp
__m512i solidus = _mm512_set1_epi8('\\');
__m512i quote = _mm512_set1_epi8('"');
for (; in + 32 <= finalin; in += 32) {
  __m256i input = _mm256_loadu_si256((const __m256i *)in);
  __m512i input1 = _mm512_cvtepu8_epi16(input);
  __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus);
  __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote);
  __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote);
  __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa);
  __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1);
  __m512i escaped = _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus);
  _mm512_mask_compressstoreu_epi8(out, to_keep, escaped);
  out += _mm_popcnt_u64(_cvtmask64_u64(to_keep));
}
```

This code can be greatly improved. Nevertheless, it is a good first step. What are the results on an Intel Ice Lake processor using GCC 11 (Linux)? A simple benchmark indicates a 5x performance boost compared to a naive implementation:

| implementation | speed |
|---|---|
| regular code | 0.6 ns/character |
| AVX-512 code | 0.1 ns/character |

It looks quite encouraging! My source code is available. It requires a recent x64 processor with AVX-512 VBMI2 support.

Native Americans knew that the Sacramento Valley could become an inland sea when the rains came. Their storytellers described water filling the valley from the Coast Range to the Sierra.

Thus if you iterate over an array and access elements that are out of bounds, a memory sanitizer will immediately catch the error:

```c
int array[8];
for (int k = 0;; k++) {
  array[k] = 0;
}
```

The sanitizer reports the error, but what if you would like to catch the error and store it in some log? Thankfully, GCC and LLVM sanitizers call a function (`__asan_on_error()`) when an error is encountered, allowing us to log it. Of course, you need to record the state of your program. The following is an example where the state is recorded in a string.

```cpp
#include <iostream>
#include <string>
#include <stdlib.h>

std::string message;

extern "C" {
void __asan_on_error() {
  std::cout << "You caused an error: " << message << std::endl;
}
}

int main() {
  int array[8];
  for (int k = 0;; k++) {
    message = std::string("access at ") + std::to_string(k);
    array[k] = 0;
  }
  return EXIT_SUCCESS;
}
```

The `extern "C"` declaration makes sure that C++ does not mangle the function name.

Running this program with memory sanitizers will print the following:

You caused an error: access at 8

You could also write to a file, if you would like.

```c
#include <stdio.h>
#include <stdlib.h>

int main() {
  printf("hello world\n");
  return EXIT_SUCCESS;
}
```

You can write the equivalent in C++:

```cpp
#include <iostream>
#include <stdlib.h>

int main() {
  std::cout << "hello world" << std::endl;
  return EXIT_SUCCESS;
}
```

In the recently released C++20 standard, we could use `std::format` instead or wrap the stream in a `basic_osyncstream` for thread safety, but the above code is what you’d find in most textbooks today.

How fast do these programs run? You may not care about the performance of these ‘hello world’ programs per se, but many systems rely on small C/C++ programs running specific and small tasks. Sometimes you just want to run a small program to execute a computation, process a small file and so forth.

We can check the running time using a benchmarking tool such as hyperfine. Such tools handle various factors such as shell starting time and so forth.

I do not believe that printing ‘hello world’ itself should be slower or faster in C++ compared to C, at least not significantly. What we are testing by running these programs is the overhead due to the choice of programming language when launching the program. One might argue that in C++, you can use printf (the C function), and that’s correct. You can code in C++ as if you were in C all of the time. It is not unreasonable, but we are interested in the performance when relying on conventional/textbook C++ using the standard C++ library.

Under Linux when using the GNU C++ standard library (libstdc++), we can ask that the C++ library be statically linked with the executable. The result is a much larger binary executable, but it may provide a faster starting time.

Hyperfine tells me that the C++ program relying on the dynamically loaded C++ library takes almost 1 ms more time than the C program.

| language | time |
|---|---|
| C | 0.5 ms |
| C++ (dynamic) | 1.4 ms |
| C++ (static) | 0.8 ms |

My source code and Makefile are available. I get these numbers on Ubuntu 22.04 LTS using an AWS node (Graviton 3).

If these numbers are to be believed, there may be a significant penalty due to textbook C++ code for tiny program executions, under Linux.

Half a millisecond or more of overhead, if it is indeed correct, is a huge penalty for a tiny program like ‘hello world’. And it only happens when I use dynamic loading of the C++ library: the cost is much lower when using a statically linked C++ library.

It seems that loading the C++ library dynamically is adding a significant cost of up to 1 ms. We might check for a few additional confounding factors proposed by my readers.

- The C compiler might not call the `printf` function, and might call the simpler `puts` function instead: we can fool the compiler into calling `printf` with the syntax `printf("hello %s\n", "world")`: it makes no measurable difference in our tests.
- If we compile the C function using a C++ compiler, the problem disappears, as you would hope, and we match the speed of the C program.
- Replacing `"hello world" << std::endl;` with `"hello world\n";` does not seem to affect the performance in these experiments. The C++ program remains much slower.
- Adding `std::ios_base::sync_with_stdio(false);` before using `std::cout` also appears to make no difference. The C++ program remains much slower.

| variant | time |
|---|---|
| C (non-trivial printf) | 0.5 ms |
| C++ (using printf) | 0.5 ms |
| C++ (std::cout replaced by \n) | 1.4 ms |
| C++ (sync_with_stdio set to false) | 1.4 ms |

Thus we have every indication that dynamically loading the C++ standard library takes a lot of time, certainly hundreds of extra microseconds. It may be a one-time cost but if your programs are small, it can dominate the running time. Statically linking the C++ library helps, but it also creates larger binaries. You may reduce the size overhead somewhat with appropriate link-time flags such as `--gc-sections`, but a significant overhead remains in my tests.

**Note:** This blog post has been edited to answer the multiple comments suggesting confounding factors, other than standard library loading, that the original blog post did not consider. I thank my readers for their proposals.

**Appendix 1** We can measure the loading time precisely by prefixing the command with `LD_DEBUG=statistics` (thanks to Grégory Pakosz for the hint). The C++ code requires more cycles. If we use `LD_DEBUG=all` (e.g., `LD_DEBUG=all ./hellocpp`), then we observe that the C++ version does much more work (more version checks, more relocations, many more initializations and finalizers). In the comments, Sam Mason blames dynamic linking: on his machine he gets the following result…

C code that dynamically links to libc takes ~240µs, which goes down to ~150µs when statically linked. A fully dynamic C++ build takes ~800µs, while a fully static C++ build is only ~190µs.

**Appendix 2** We can try to use sampling-based profiling to find out where the program spends its time. Calling perf record/perf report is not terribly useful on my system. Some readers report that their profiling in this manner points the finger at locale initialization. I get a much more useful profile with `valgrind --tool=callgrind command && callgrind_annotate`. The results are consistent with the theory that loading the C++ library dynamically is relatively expensive.

In the 30 (now 35) years that biomedical researchers have worked determinedly to find a cure for Alzheimer’s disease, their counterparts have developed drugs that helped cut deaths from cardiovascular disease by more than half, and cancer drugs able to eliminate tumors that had been incurable. But for Alzheimer’s, not only is there no cure, there is not even a disease-slowing treatment.

Across science, 98 of the 100 most-cited papers published in 2020 to 2021 were related to COVID-19. A large number of scientists received large numbers of citations to their COVID-19 work, often exceeding the citations they had received to all their work during their entire career.

```c
char * string = "3.1416";
char * string_end = string;
double x = strtod(string, &string_end);
if (string_end == string) {
  // you have an error!
}
```

… to something more modern in C++17…

```cpp
std::string st = "3.1416";
double x;
auto [p, ec] = std::from_chars(st.data(), st.data() + st.size(), x);
if (p == st.data()) {
  // you have errors!
}
```

Back when I first reported on this result, only Visual Studio had support for from_chars. The C++ library in GCC 12 now has full support for from_chars. Let us run the benchmark again:

| parser | speed |
|---|---|
| strtod | 270 MB/s |
| from_chars | 1 GB/s |

So it is almost four times faster! The benchmark reads random values in the [0,1] interval.

Internally, GCC 12 adopted the fast_float library.

**Further reading**: Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021.

```cpp
double angle = atan2(y, x);
angle = (int(round(4 * angle / PI + 8)) % 8) * PI / 4;
xout = cos(angle);
yout = sin(angle);
```

If you assume that the unit direction vector is in the first quadrant (both x and y are positive), then there is a direct way to compute the solution. Using 1/sqrt(2) or 0.7071 as the default solution, compare both x and y with sin(3*pi/8), and only switch them to 1 if they are larger than sin(3*pi/8) or to 0 if the other coordinate is larger than sin(3*pi/8). The full code looks as follows:

```cpp
double outx = 0.7071067811865475; // 1/sqrt(2)
double outy = 0.7071067811865475; // 1/sqrt(2)
if (x >= 0.923879532511286) { // sin(3*pi/8)
  outx = 1;
}
if (y >= 0.923879532511286) { // sin(3*pi/8)
  outy = 1;
}
if (y >= 0.923879532511286) { // sin(3*pi/8)
  outx = 0;
}
if (x >= 0.923879532511286) { // sin(3*pi/8)
  outy = 0;
}
```

I write tiny *if* clauses because I hope that the compiler will avoid producing comparisons and jumps which may stress the branch predictor when the branches are hard to predict.

You can generalize the solution for the case where either x or y (or both) are negative by first taking the absolute value, and then restoring the sign at the end:

```cpp
bool xneg = x < 0;
bool yneg = y < 0;
if (xneg) { x = -x; }
if (yneg) { y = -y; }
double outx = 0.7071067811865475; // 1/sqrt(2)
double outy = 0.7071067811865475; // 1/sqrt(2)
if (x >= 0.923879532511286) { // sin(3*pi/8)
  outx = 1;
}
if (y >= 0.923879532511286) { // sin(3*pi/8)
  outy = 1;
}
if (y >= 0.923879532511286) { // sin(3*pi/8)
  outx = 0;
}
if (x >= 0.923879532511286) { // sin(3*pi/8)
  outy = 0;
}
if (xneg) { outx = -outx; }
if (yneg) { outy = -outy; }
```

You can rewrite everything with the ternary operator to entice the compiler to produce branchless code (i.e., code without jumps). The result is more compact.

```cpp
bool xneg = x < 0;
x = xneg ? -x : x;
bool yneg = y < 0;
y = yneg ? -y : y;
double outx = (x >= 0.923879532511286) ? 1 : 0.7071067811865475;
double outy = (y >= 0.923879532511286) ? 1 : 0.7071067811865475;
outx = (y >= 0.923879532511286) ? 0 : outx;
outy = (x >= 0.923879532511286) ? 0 : outy;
outx = xneg ? -outx : outx;
outy = yneg ? -outy : outy;
```

The clang compiler may produce an entirely branchless assembly given this code.

But as pointed out by Samuel Lee in the comments, you can do even better… Instead of capturing the sign with a separate variable, you can just copy the pre-existing sign using a function like copysign (available in C, C#, Java and so forth):

```cpp
double posx = fabs(x);
double posy = fabs(y);
double outx = (posx >= 0.923879532511286) ? 1 : 0.7071067811865475;
double outy = (posy >= 0.923879532511286) ? 1 : 0.7071067811865475;
outx = (posy >= 0.923879532511286) ? 0 : outx;
outy = (posx >= 0.923879532511286) ? 0 : outy;
outx = copysign(outx, x);
outy = copysign(outy, y);
```

I wrote a small benchmark that operates on random inputs. Your results will vary but on my mac laptop with LLVM 12, I get that the direct approach with copysign is 50 times faster than the approach with tan/sin/cos.

| approach | time |
|---|---|
| with tangent | 40 ns/vector |
| fast approach | 1.2 ns/vector |
| fast approach/copysign | 0.8 ns/vector |

- Compared to 1800, we eat less saturated fat and much more processed food and vegetable oils and it does not seem to be good for us:

Saturated fats from animal sources declined while polyunsaturated fats from vegetable oils rose. Non-communicable diseases (NCDs) rose over the twentieth century in parallel with increased consumption of processed foods, including sugar, refined flour and rice, and vegetable oils. Saturated fats from animal sources were inversely correlated with the prevalence of non-communicable diseases.

Kang et al. found that saturated fats reduce your risk of having a stroke:

a higher consumption of dietary saturated fat is associated with a lower risk of stroke, and every 10 g/day increase in saturated fat intake is associated with a 6% relative risk reduction in the rate of stroke.

Saturated fats come from meat and dairy products (e.g., butter). A low-fat diet can significantly increase the risk of coronary heart disease events.

Leroy and Cofnas argue against a reduction of red meat consumption:The IARC’s (2015) claim that red meat is “probably carcinogenic” has never been substantiated. In fact, a risk assessment by Kruger and Zhou (2018) concluded that this is not the case. (…) a meta-analysis of RCTs has shown that meat eating does not lead to deterioration of cardiovascular risk markers (O’Connor et al., 2017). The highest category of meat eating even paralleled a potentially beneficial increase in HDL-C level. Whereas plant-based diets indeed seem to lower total cholesterol and LDL-C in intervention studies, they also increase triglyceride levels and decrease HDL-C (Yokoyama et al., 2017), which are now often regarded as superior markers of cardiovascular risk (Jeppesen et al., 2001). (…) We believe that a large reduction in meat consumption, such as has been advocated by the EAT-Lancet Commission (Willett et al., 2019), could produce serious harm. Meat has long been, and continues to be, a primary source of high-quality nutrition. The theory that it can be replaced with legumes and supplements is mere speculation. While diets high in meat have proved successful over the long history of our species, the benefits of vegetarian diets are far from being established, and its dangers have been largely ignored by those who have endorsed it prematurely on the basis of questionable evidence.

- People dislike research that appears to favour males:

In both studies, both sexes reacted less positively to differences favouring males; in contrast to our earlier research, however, the effect was larger among female participants. Contrary to a widespread expectation, participants did not react less positively to research led by a female. Participants did react less positively, though, to research led by a male when the research reported a male-favouring difference in a highly valued trait. Participants judged male-favouring research to be lower in quality than female-favouring research, apparently in large part because they saw the former as more harmful.

- During the Jurassic era, atmospheric CO2 was very high, forests extended all the way to the North pole. Even so, there were freezing winters:

Forests were present all the way to the Pangean North Pole and into the southern latitudes as far as land extended. Although there may have been other contributing factors, the leading hypothesis is that Earth was in a “greenhouse” state because of very high atmospheric PCO2 (partial pressure of CO2), the highest of the past 420 million years. Despite modeling results indicating freezing winter temperatures at high latitudes, empirical evidence for freezing has been lacking. Here, we provide empirical evidence showing that, despite extraordinary high PCO2, freezing winter temperatures did characterize high Pangean latitudes based on stratigraphically widespread lake ice-rafted debris (L-IRD) in early Mesozoic strata of the Junggar Basin, northwest China. Traditionally, dinosaurs have been viewed as thriving in the warm and equable early Mesozoic climates, but our results indicate that they also endured freezing winters.

- In mice, researchers found that injecting stem-cell-derived “conditioned medium” protects against neurodegeneration:

Neuronal cell death is causal in many neurodegenerative diseases, including age-related loss of memory and dementias (such as Alzheimer’s disease), Parkinson’s disease, strokes, as well as diseases that afflict broad ages, i.e., traumatic brain injury, spinal cord injury, ALS, and spinal muscle atrophy. These diseases are characterized by neuroinflammation and oxidative cell damage, many involve perturbed proteostasis and all are devastating and without a cure. Our work describes a feasible meaningful disease-minimizing treatment for ALS and suggests a clinical capacity for treating a broad class of diseases of neurodegeneration, and excessive cell apoptosis.

- Greenland and the North of Europe were once warmer than they are today: the Medieval Warm Period (950 to 1250) overlaps with the Viking age (800–1300). Bajard et al. (2022) suggest that the Viking were quite adept at adapting their agricultural practices:

(…) the period from The Viking Age to the High Middle Ages was a period of expansion with the Viking diaspora, increasing trade, food and goods production and the establishment of Scandinavian towns. This period also sees a rapid increase in population and settlements, mainly due to a relatively stable warm climate (…) temperature was the main driver of agricultural practices in Southeastern Norway during the Late Antiquity. Direct comparison between the reconstructed temperature variability and palynological data from the same sediment sequence shows that small changes in temperature were synchronous with changes in agricultural practices (…) We conclude that the pre-Viking age society in Southwestern Scandinavia made substantial changes in their way of living to adapt to the climate variability of this period.

The Vikings grew barley in Greenland, a plant that grows normally in a temperate climate. In contrast, agriculture in Greenland today is nearly non-existent due to the harsh climate.

- Ashkenazi Jews often score exceptionally well on intelligence tests, and they achieve extraordinary results in several intellectual pursuits.

Nevertheless, Wikipedia editors deleted the article on Ashkenazi intelligence. Tezuka argues that it is the result of an ideological bias that results in systematic censorship.
- You can rejuvenate old human skin by grafting it onto young mice.
- Tabarrok reminds us that research funding through competitions might result in total waste through rent dissipation…

A scientist who benefits from a 2-million-dollar NIH grant is willing to spend a million dollars of their time working on applications or incur the cost of restricting their research ideas in order to get it. Importantly, even though only one scientist will get the grant, hundreds of scientists are spending resources in competition to get it. So the gains we might be seeing from transferring resources to one researcher are dissipated multiplicatively across all the scientists who spent time and money competing for the grant but didn’t get it. The aggregate time costs to our brightest minds from this application contest system are quantifiably large, possibly entirely offsetting the total scientific value of the research that the funding supports.

- Corporate tax cuts lead to an increase in productivity and research over the long term. Conversely, increases in taxation reduce long-term productivity as well as research and development.

Why might that be? I believe that it has to do with important ‘negative incentives’ that we have introduced. In effect, we have made scientists less productive. We probably did so through several means, but two effects are probably important: the widespread introduction of research competitions and the addition of extrinsic motivations.

- Prior to 1960, there were hardly any formal research funding competitions. Today, by some estimates, it takes about 40 working days to prepare a single new grant application with about 30 working days for a resubmission, and the success rate is often low, which means that for a single successful research grant, hundreds of days might have been spent, purely on the acquisition of funding. This effect is known in economics as rent dissipation. Suppose that I offer to give you $100 to support your research if you enter a competition. How much time are you willing to spend? Maybe you are willing to spend the equivalent of $50, if the success rate is 50%. The net result is that two researchers may each waste $50 in time so that one of them acquires $100 of support. There may be no net gain! Furthermore, if grant applications are valued enough (e.g., needed to get promotion), scientists may be willing to spend even more time than is rational to do so, and the introduction of a new grant competition may in fact reduce the overall research output. You should not underestimate the effect that constant administrative and grant writing might have on a researcher: many graduate students will tell you of their disappointment when encountering high status scientists who cannot seem to do actual research anymore. It can cause a vicious form of accelerated aging. If Albert Einstein had been stuck writing grant applications and reporting on the results from his team, history might have taken a different turn.
- We have massively increased the number and importance of ‘extrinsic motivations’ in science. Broadly speaking, we can distinguish between two types of motivations… intrinsic and extrinsic motivations. Winning prizes or securing prestigious positions are extrinsic motivations. Solving an annoying problem or pursuing a personal quest are intrinsic motivations. We repeatedly find that intrinsic motivations are positively correlated with long-term productivity whereas extrinsic motivations are negatively correlated with long-term productivity (e.g., Horodnic and Zaiţh 2015). In fact, extrinsic motivations even cancel out intrinsic motivations (Wrzesniewski et al., 2014). Extrinsically motivated individuals will focus on superficial gains, as opposed to genuine advances. Of course, the addition of extrinsic motivations may also create a selection effect: the field tends to recruit people who seek prestige for its own sake, as opposed to having a genuine interest in scientific pursuits. Thus creating prestigious prizes, prestigious positions, and prestigious conferences, may end up being detrimental to scientific productivity.

Many people object that the easiest explanation for our stagnation has to do with the fact that most of the easy findings have been covered. However, at all times in history, there were people making this point. In 1894, Michelson said:

While it is never safe to affirm that the future of Physical Science has no marvels in store even more astonishing than those of the past, it seems probable that most of the grand underlying principles have been firmly established and that further advances are to be sought chiefly in the rigorous application of these principles to all the phenomena which come under our notice.

Silverstein in the “The End is Near!”: The Phenomenon of the Declaration of Closure in a Discipline, documents carefully how, historically, many people predicted (wrongly) that their discipline was at an end.

Often we need to convert between the two types. Both ARM and x64 processors can do so with one inexpensive instruction. For example, ARM systems may use the `fcvt` instruction.

The details may differ, but most current processors can convert one number (from float to double, or from double to float) per CPU cycle. The latency is small (e.g., 3 or 4 cycles).

A typical processor might run at 3 GHz, thus we have 3 billion cycles per second, and we can convert 3 billion numbers per second. A 64-bit number uses 8 bytes, so that is a throughput of 24 gigabytes per second.

It is therefore unlikely that the type conversion can be a performance bottleneck, in general. If you would like to measure the speed on your own system: I have written a small C++ benchmark.

Reportedly, the Apple CEO (Steve Jobs) went to see Intel back when Apple was designing the iPhone to ask for a processor deal. Intel turned Apple down. So Apple went with ARM.

Today, we use ARM processors for everything: game consoles (Nintendo Switch), powerful servers (Amazon and Google), mobile phones, embedded devices, and so forth.

Amazon makes available its new ARM-based processors (Graviton 3). These processors have sophisticated SIMD instructions (SIMD stands for Single Instruction Multiple Data) called SVE (Scalable Vector Extensions). With these instructions, we can greatly accelerate software. It is a form of single-core parallelism, as opposed to the parallelism that one gets by using multiple cores for one task. The SIMD parallelism, when it is applicable, is often far more efficient than multicore parallelism.

Amazon’s Graviton 3 appears to have 32-byte registers, since it is based on the ARM Neoverse V1 design. You can fit eight 32-bit integers in one register. Mainstream ARM processors (e.g., the ones that Apple uses) have SIMD instructions too (NEON), but with shorter registers (16 bytes). Having wider registers and instructions capable of operating over these wide registers allows you to reduce the total number of instructions. Executing fewer instructions is a very good way to accelerate code.

To investigate SVE, I looked at a simple problem where you want to remove all negative integers from an array. That is, you read an array containing signed random integers and you want to write out to an output array only the non-negative integers. Normal C code might look as follows:

```c
void remove_negatives_scalar(const int32_t *input, int64_t count, int32_t *output) {
  int64_t j = 0;
  for (int64_t i = 0; i < count; i++) {
    if (input[i] >= 0) {
      output[j++] = input[i];
    }
  }
}
```

Replacing this code with new code that relies on special SVE functions made it go much faster (2.5 times faster). At the time, I suggested that my code was probably not nearly optimal. It processed 32 bytes per loop iteration, using 9 instructions. A sizeable fraction of these 9 instructions have to do with managing the loop, and few do the actual number crunching. A reader, Samuel Lee, proposed to effectively unroll my loop. He predicted much better performance (at least when the array is large enough) due to lower loop overhead. I include his proposed code below.

Using a Graviton 3 processor and GCC 11 on my benchmark, I get the following results:

| | cycles/int | instr./int | instr./cycle |
|---|---|---|---|
| scalar | 9.0 | 6.000 | 0.7 |
| branchless scalar | 1.8 | 8.000 | 4.4 |
| SVE | 0.7 | 1.125 | ~1.6 |
| unrolled SVE | 0.4385 | 0.71962 | ~1.6 |

The new unrolled SVE code uses about 23 instructions to process 128 bytes (or 32 32-bit integers), hence about 0.71875 instructions per integer. That’s about 10 times fewer instructions than scalar code and roughly 4 times faster than scalar code in terms of CPU cycles.

The number of instructions retired per cycle is about the same for the two SVE functions, and it is relatively low, somewhat higher than 1.5 instructions retired per cycle.

Often the argument in favour of SVE is that it does not require special code to finish the tail of the processing. That is, you can process an entire array with SVE instructions, even if its length is not divisible by the register size (here 8 integers). I find Lee’s code interesting because it illustrates that you might actually need to handle the end of a long array differently, for efficiency reasons.

Overall, I think that we can see that SVE works well for the problem at hand (filtering out 32-bit integers).

**Appendix**: Samuel Lee’s code.

```c
void remove_negatives(const int32_t *input, int64_t count, int32_t *output) {
  int64_t j = 0;
  const int32_t* endPtr = input + count;
  const uint64_t vl_u32 = svcntw();

  svbool_t all_mask = svptrue_b32();
  while (input <= endPtr - (4*vl_u32)) {
    svint32_t in0 = svld1_s32(all_mask, input + 0*vl_u32);
    svint32_t in1 = svld1_s32(all_mask, input + 1*vl_u32);
    svint32_t in2 = svld1_s32(all_mask, input + 2*vl_u32);
    svint32_t in3 = svld1_s32(all_mask, input + 3*vl_u32);

    svbool_t pos0 = svcmpge_n_s32(all_mask, in0, 0);
    svbool_t pos1 = svcmpge_n_s32(all_mask, in1, 0);
    svbool_t pos2 = svcmpge_n_s32(all_mask, in2, 0);
    svbool_t pos3 = svcmpge_n_s32(all_mask, in3, 0);

    in0 = svcompact_s32(pos0, in0);
    in1 = svcompact_s32(pos1, in1);
    in2 = svcompact_s32(pos2, in2);
    in3 = svcompact_s32(pos3, in3);

    svst1_s32(all_mask, output + j, in0);
    j += svcntp_b32(all_mask, pos0);
    svst1_s32(all_mask, output + j, in1);
    j += svcntp_b32(all_mask, pos1);
    svst1_s32(all_mask, output + j, in2);
    j += svcntp_b32(all_mask, pos2);
    svst1_s32(all_mask, output + j, in3);
    j += svcntp_b32(all_mask, pos3);

    input += 4*vl_u32;
  }

  int64_t i = 0;
  count = endPtr - input;
  svbool_t while_mask = svwhilelt_b32(i, count);
  do {
    svint32_t in = svld1_s32(while_mask, input + i);
    svbool_t positive = svcmpge_n_s32(while_mask, in, 0);
    svint32_t in_positive = svcompact_s32(positive, in);
    svst1_s32(while_mask, output + j, in_positive);
    i += svcntw();
    j += svcntp_b32(while_mask, positive);
    while_mask = svwhilelt_b32(i, count);
  } while (svptest_any(svptrue_b32(), while_mask));
}
```

Go lacked this notion for a long time, but generics were added in version 1.18. So I took them out for a spin.

In Java, generics work well enough as long as you need “generic” containers (arrays, maps), and as long as you stick with functional idioms. But Java will not let me code the way I would prefer. Here is how I would write a function that sums up numbers:

```java
int sum(int[] v) {
  int summer = 0;
  for (int k = 0; k < v.length; k++) {
    summer += v[k];
  }
  return summer;
}
```

What if I need to support various number types? Then I would like to write the following generic function, but Java won’t let me.

```java
// this Java code won't compile
static <T extends Number> T sum(T[] v) {
  T summer = 0;
  for (int k = 0; k < v.length; k++) {
    summer += v[k];
  }
  return summer;
}
```

Go is not object oriented per se, so you do not have a ‘Number’ class. However, you can create your own generic ‘interfaces’ which serves the same function. So here is how you solve the same problem in Go:

```go
type Number interface {
  uint | int | float32 | float64
}

func sum[T Number](a []T) T {
  var summer T
  for _, v := range a {
    summer += v
  }
  return summer
}
```

So, at least in this one instance, Go generics are more expressive than Java generics. What about performance?

If I apply the above code to an array of integers, I get the following tight loop in assembly:

```asm
pc11:
	MOVQ (AX)(DX*8), SI
	INCQ DX
	ADDQ SI, CX
	CMPQ BX, DX
	JGT  pc11
```

As far as Go is concerned, this is as efficient as it gets.

So far, I am giving an A to Go generics.

At first, assembly code looks daunting, and I discourage you from writing sizeable programs in assembly. However, with a little training, you can learn to count instructions and spot branches. It can help you gain a deeper insight into how your program works. Let me illustrate what you can learn by looking at assembly. Let us consider the following C++ code:

```cpp
long f(int x) {
  long array[] = {1, 2, 3, 4, 5, 6, 7, 8, 999, 10};
  return array[x];
}

long f2(int x) {
  long array[] = {1, 2, 3, 4, 5, 6, 7, 8, 999, 10};
  return array[x + 1];
}
```

This code contains two 80-byte arrays, but they are identical. Is this a source of worry? If you look at the assembly code produced by most compilers, you will find that exactly identical constants are generally ‘compressed’ (just one version is stored). If I compile these two functions with the gcc or clang compilers using the -S flag, I can plainly see the compression because the array occurs just once: (Do not look at all the instructions… just scan the code.)

```asm
	.text
	.file	"f.cpp"
	.globl	_Z1fi                   // -- Begin function _Z1fi
	.p2align	2
	.type	_Z1fi,@function
_Z1fi:                                  // @_Z1fi
	.cfi_startproc
// %bb.0:
	adrp	x8, .L__const._Z2f2i.array
	add	x8, x8, :lo12:.L__const._Z2f2i.array
	ldr	x0, [x8, w0, sxtw #3]
	ret
.Lfunc_end0:
	.size	_Z1fi, .Lfunc_end0-_Z1fi
	.cfi_endproc
                                        // -- End function
	.globl	_Z2f2i                  // -- Begin function _Z2f2i
	.p2align	2
	.type	_Z2f2i,@function
_Z2f2i:                                 // @_Z2f2i
	.cfi_startproc
// %bb.0:
	adrp	x8, .L__const._Z2f2i.array
	add	x8, x8, :lo12:.L__const._Z2f2i.array
	add	x8, x8, w0, sxtw #3
	ldr	x0, [x8, #8]
	ret
.Lfunc_end1:
	.size	_Z2f2i, .Lfunc_end1-_Z2f2i
	.cfi_endproc
                                        // -- End function
	.type	.L__const._Z2f2i.array,@object // @__const._Z2f2i.array
	.section	.rodata,"a",@progbits
	.p2align	3
.L__const._Z2f2i.array:
	.xword	1                       // 0x1
	.xword	2                       // 0x2
	.xword	3                       // 0x3
	.xword	4                       // 0x4
	.xword	5                       // 0x5
	.xword	6                       // 0x6
	.xword	7                       // 0x7
	.xword	8                       // 0x8
	.xword	999                     // 0x3e7
	.xword	10                      // 0xa
	.size	.L__const._Z2f2i.array, 80
	.ident	"Ubuntu clang version 14.0.0-1ubuntu1"
	.section	".note.GNU-stack","",@progbits
	.addrsig
```

However, if you modify the constants even slightly, then this compression typically does not happen (e.g., if you try to append one integer value to one of the arrays, the code will duplicate the arrays in full).
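For instance, suppose we add a hypothetical function `f3` (not part of the original file) whose array differs from the one in `f` by a single element. Compiling with `-S` should now show two separate 80-byte constants in the read-only data section:

```cpp
long f(int x) {
  long array[] = {1, 2, 3, 4, 5, 6, 7, 8, 999, 10};
  return array[x];
}

// Hypothetical variation: one element differs (9 instead of 999),
// so the compiler can no longer merge the two constant arrays and
// typically emits both of them.
long f3(int x) {
  long array[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  return array[x];
}
```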

To assess the performance of a code routine, my first line of attack is always to count instructions. Keeping everything the same, if you can rewrite your code so that it generates fewer instructions, it should be faster. I also like to spot conditional jumps because that is often where your code can suffer, if the branch is hard to predict.

It is easy to convert a whole set of functions to assembly, but it becomes impractical as your projects become larger. Under Linux, the standard ‘debugger’ (`gdb`) is a great tool to look selectively at the assembly code produced by the compiler. Let us consider my previous blog post, Filtering numbers quickly with SVE on Amazon Graviton 3 processors. In that blog post, I present several functions which I have implemented in a short C++ file. To examine the result, I simply load the compiled binary into `gdb`:

```bash
$ gdb ./filter
```

Then I can examine functions… such as the `remove_negatives` function:

```
(gdb) set print asm-demangle
(gdb) disas remove_negatives
Dump of assembler code for function remove_negatives(int const*, long, int*):
   0x00000000000022e4 <+0>:     mov     x4, #0x0                   // #0
   0x00000000000022e8 <+4>:     mov     x3, #0x0                   // #0
   0x00000000000022ec <+8>:     cntw    x6
   0x00000000000022f0 <+12>:    whilelt p0.s, xzr, x1
   0x00000000000022f4 <+16>:    nop
   0x00000000000022f8 <+20>:    ld1w    {z0.s}, p0/z, [x0, x3, lsl #2]
   0x00000000000022fc <+24>:    cmpge   p1.s, p0/z, z0.s, #0
   0x0000000000002300 <+28>:    compact z0.s, p1, z0.s
   0x0000000000002304 <+32>:    st1w    {z0.s}, p0, [x2, x4, lsl #2]
   0x0000000000002308 <+36>:    cntp    x5, p0, p1.s
   0x000000000000230c <+40>:    add     x3, x3, x6
   0x0000000000002310 <+44>:    add     x4, x4, x5
   0x0000000000002314 <+48>:    whilelt p0.s, x3, x1
   0x0000000000002318 <+52>:    b.ne    0x22f8 <remove_negatives(int const*, long, int*)+20>  // b.any
   0x000000000000231c <+56>:    ret
End of assembler dump.
```

At address 52, we conditionally go back to address 20. So we have a total of 9 instructions in our main loop. From my benchmarks (see previous blog post), the function uses 1.125 instructions per 32-bit word, which is consistent with each loop iteration processing eight 32-bit words (9/8 = 1.125).

Another way to assess performance is to look at branches. Let us disassemble `remove_negatives_scalar`, a branchy function:

```
(gdb) disas remove_negatives_scalar
Dump of assembler code for function remove_negatives_scalar(int const*, long, int*):
   0x0000000000002320 <+0>:     cmp     x1, #0x0
   0x0000000000002324 <+4>:     b.le    0x234c <remove_negatives_scalar(int const*, long, int*)+44>
   0x0000000000002328 <+8>:     add     x4, x0, x1, lsl #2
   0x000000000000232c <+12>:    mov     x3, #0x0                   // #0
   0x0000000000002330 <+16>:    ldr     w1, [x0]
   0x0000000000002334 <+20>:    add     x0, x0, #0x4
   0x0000000000002338 <+24>:    tbnz    w1, #31, 0x2344 <remove_negatives_scalar(int const*, long, int*)+36>
   0x000000000000233c <+28>:    str     w1, [x2, x3, lsl #2]
   0x0000000000002340 <+32>:    add     x3, x3, #0x1
   0x0000000000002344 <+36>:    cmp     x4, x0
   0x0000000000002348 <+40>:    b.ne    0x2330 <remove_negatives_scalar(int const*, long, int*)+16>  // b.any
   0x000000000000234c <+44>:    ret
End of assembler dump.
```

We see the branch at address 24 (the `tbnz` instruction): it conditionally jumps over the next two instructions. We had an equivalent ‘branchless’ function called `remove_negatives_scalar_branchless`. Let us see if it is indeed branchless:

```
(gdb) disas remove_negatives_scalar_branchless
Dump of assembler code for function remove_negatives_scalar_branchless(int const*, long, int*):
   0x0000000000002350 <+0>:     cmp     x1, #0x0
   0x0000000000002354 <+4>:     b.le    0x237c <remove_negatives_scalar_branchless(int const*, long, int*)+44>
   0x0000000000002358 <+8>:     add     x4, x0, x1, lsl #2
   0x000000000000235c <+12>:    mov     x3, #0x0                   // #0
   0x0000000000002360 <+16>:    ldr     w1, [x0], #4
   0x0000000000002364 <+20>:    str     w1, [x2, x3, lsl #2]
   0x0000000000002368 <+24>:    eor     x1, x1, #0x80000000
   0x000000000000236c <+28>:    lsr     w1, w1, #31
   0x0000000000002370 <+32>:    add     x3, x3, x1
   0x0000000000002374 <+36>:    cmp     x0, x4
   0x0000000000002378 <+40>:    b.ne    0x2360 <remove_negatives_scalar_branchless(int const*, long, int*)+16>  // b.any
   0x000000000000237c <+44>:    ret
End of assembler dump.
```

Other than the conditional jump produced by the loop (address 40), the code is indeed branchless.

In this particular instance, with one small binary file, it is easy to find the functions I need. What if I load a large binary with many compiled functions?

Let me examine the benchmark binary from the simdutf library. It has many functions, but let us assume that I am looking for a function that might validate UTF-8 inputs. I can use `info functions` to find all functions matching a given pattern.

```
(gdb) info functions validate_utf8
All functions matching regular expression "validate_utf8":

Non-debugging symbols:
0x0000000000012710  event_aggregate simdutf::benchmarks::BenchmarkBase::count_events<simdutf::benchmarks::Benchmark::run_validate_utf8(simdutf::implementation const&, unsigned long)::{lambda()#1}>(simdutf::benchmarks::Benchmark::run_validate_utf8(simdutf::implementation const&, unsigned long)::{lambda()#1}, unsigned long) [clone .constprop.0]
0x0000000000012b54  simdutf::benchmarks::Benchmark::run_validate_utf8(simdutf::implementation const&, unsigned long)
0x0000000000018c90  simdutf::fallback::implementation::validate_utf8(char const*, unsigned long) const
0x000000000001b540  simdutf::arm64::implementation::validate_utf8(char const*, unsigned long) const
0x000000000001cd84  simdutf::validate_utf8(char const*, unsigned long)
0x000000000001d7c0  simdutf::internal::unsupported_implementation::validate_utf8(char const*, unsigned long) const
0x000000000001e090  simdutf::internal::detect_best_supported_implementation_on_first_use::validate_utf8(char const*, unsigned long) const
```

You see that `info functions` gives me both the function name and the function address. I am interested in `simdutf::arm64::implementation::validate_utf8`. At that point, it becomes easier to just refer to the function by its address:

```
(gdb) disas 0x000000000001b540
Dump of assembler code for function simdutf::arm64::implementation::validate_utf8(char const*, unsigned long) const:
   0x000000000001b540 <+0>:     stp     x29, x30, [sp, #-144]!
   0x000000000001b544 <+4>:     adrp    x0, 0xa0000
   0x000000000001b548 <+8>:     cmp     x2, #0x40
   0x000000000001b54c <+12>:    mov     x29, sp
   0x000000000001b550 <+16>:    ldr     x0, [x0, #3880]
   0x000000000001b554 <+20>:    mov     x5, #0x40                  // #64
   0x000000000001b558 <+24>:    movi    v22.4s, #0x0
   0x000000000001b55c <+28>:    csel    x5, x2, x5, cs  // cs = hs, nlast
   0x000000000001b560 <+32>:    ldr     x3, [x0]
   0x000000000001b564 <+36>:    str     x3, [sp, #136]
   0x000000000001b568 <+40>:    mov     x3, #0x0                   // #0
   0x000000000001b56c <+44>:    subs    x5, x5, #0x40
   0x000000000001b570 <+48>:    b.eq    0x1b7b8 <simdutf::arm64::implementation::validate_utf8(char const*, unsigned long) const+632>  // b.none
   0x000000000001b574 <+52>:    adrp    x0, 0x86000
   0x000000000001b578 <+56>:    adrp    x4, 0x86000
   0x000000000001b57c <+60>:    add     x6, x0, #0x2f0
   0x000000000001b580 <+64>:    adrp    x0, 0x86000
   ...
```

I have cut short the output because it is too long. When single functions become large, I find it more convenient to redirect the output to a file which I can process elsewhere.

```bash
gdb -q ./benchmark -ex "set pagination off" -ex "set print asm-demangle" -ex "disas 0x000000000001b540" -ex quit > gdbasm.txt
```

Sometimes I am just interested in doing some basic statistics such as figuring out which instructions are used by the function:

```bash
$ gdb -q ./benchmark -ex "set pagination off" -ex "set print asm-demangle" -ex "disas 0x000000000001b540" -ex quit | awk '{print $3}' | sort | uniq -c | sort -r | head
     32 and
     24 tbl
     24 ext
     18 cmhi
     17 orr
     16 ushr
     16 eor
     14 ldr
     13 mov
     10 movi
```

And we see that the most common instruction in this code is `and`. It reassures me that the code was properly compiled. Researching all of the generated instructions, they all seem like adequate choices given the code that I wrote.

The general lesson is that looking at the generated assembly is not so difficult, and with a little training, it can make you a better programmer.

**Tip**: It helps sometimes to disable pagination (`set pagination off`).

SVE is part of the Single Instruction/Multiple Data paradigm: a single instruction can operate on many values at once. Thus, for example, you may add N integers with N other integers using a single instruction.

What is unique about SVE is that you work with vectors of values, but without knowing specifically how long the vectors are. This is in contrast with conventional SIMD instructions (ARM NEON, x64 SSE, AVX) where the size of the vector is hardcoded. Not only do you write your code without knowing the size of the vector, but even the compiler may not know. This means that the same binary executable could work over different blocks (vectors) of data, depending on the processor. The benefit of this approach is that your code might get magically much more efficient on new processors.

It is a daring proposal. It is possible to write code that would work on one processor but fail on another processor, even though we have the same instruction set.

But is SVE on Graviton 3 processors fast? To test it out, I wrote a small benchmark. Suppose you want to prune out all of the negative integers out of an array. A textbook implementation might look as follows:

```c
void remove_negatives_scalar(const int32_t *input, int64_t count, int32_t *output) {
  int64_t i = 0;
  int64_t j = 0;
  for (; i < count; i++) {
    if (input[i] >= 0) {
      output[j++] = input[i];
    }
  }
}
```

However, the compiler will probably generate a branch, and if your input has a random distribution, the result can be inefficient code. To help matters, you may rewrite your code in a manner that is more likely to generate a branchless binary:

```c
for (; i < count; i++) {
  output[j] = input[i];
  j += (input[i] >= 0);
}
```

Though it looks less efficient (because every input value is written out), such a branchless version is often practically faster.

I ported this last implementation to SVE using ARM intrinsic functions. At each step, we load a vector of integers (`svld1_s32`), we compare them with zero (`svcmpge_n_s32`), we remove the negative values (`svcompact_s32`) and we store the result (`svst1_s32`). During most iterations, we have a full vector of integers. During the last iteration, some values will be missing, but we simply ignore them using the `while_mask` variable, which indicates which integer values are ‘active’. The sequence is done entirely using SVE instructions: there is no need to process the end of the sequence separately, as would be needed with conventional SIMD instruction sets.

```c
#include <arm_sve.h>

void remove_negatives(const int32_t *input, int64_t count, int32_t *output) {
  int64_t i = 0;
  int64_t j = 0;
  svbool_t while_mask = svwhilelt_b32(i, count);
  do {
    svint32_t in = svld1_s32(while_mask, input + i);
    svbool_t positive = svcmpge_n_s32(while_mask, in, 0);
    svint32_t in_positive = svcompact_s32(positive, in);
    svst1_s32(while_mask, output + j, in_positive);
    i += svcntw();
    j += svcntp_b32(while_mask, positive);
    while_mask = svwhilelt_b32(i, count);
  } while (svptest_any(svptrue_b32(), while_mask));
}
```

Using a Graviton 3 processor and GCC 11 on my benchmark, I get the following results:

| | cycles/integer | instructions/integer | instructions/cycle |
|---|---|---|---|
| scalar | 9.0 | 6.000 | 0.7 |
| branchless scalar | 1.8 | 8.000 | 4.4 |
| SVE | 0.7 | 1.125 | 1.6 |

The SVE code uses far fewer instructions. In this particular test, SVE is 2.5 times faster than the best competitor (branchless scalar). Furthermore, it might use even fewer instructions on future processors, as the underlying registers get wider.

Of course, my code is surely suboptimal, but I am pleased that the first SVE benchmark I wrote turns out so well. It suggests that SVE might do well in practice.

**Credit**: Thanks to Robert Clausecker for the related discussion.

Today, the story is much nicer. The powerful processor cores can all sustain many memory requests. They support better *memory-level parallelism*.

To measure the performance of the processor, we use a pointer-chasing scheme where you ask a C program to load a memory address which contains the next memory address, and so forth. If a processor could only sustain a single memory request, such a test would use all available resources. We then modify this test so that we have two interleaved pointer-chasing schemes, and then three, and then four, and so forth. We call each new interleaved pointer-chasing component a ‘lane’.
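As a rough sketch of the idea (a simplification: the actual benchmark chases real memory addresses over much larger arrays), each lane follows its own random cycle through an array, and the loads of distinct lanes are independent of each other:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random cyclic permutation: entry i holds the index of the
// next entry to visit, so following it chases through all n entries.
std::vector<uint32_t> make_chain(size_t n, std::mt19937 &gen) {
  std::vector<uint32_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin(), order.end(), gen);
  std::vector<uint32_t> next(n);
  for (size_t i = 0; i + 1 < n; i++) { next[order[i]] = order[i + 1]; }
  next[order[n - 1]] = order[0]; // close the cycle
  return next;
}

// Chase two independent chains ("lanes") in lockstep. The two loads in
// each iteration do not depend on each other, so a core with good
// memory-level parallelism can overlap their latencies.
std::pair<uint32_t, uint32_t> chase_two_lanes(const std::vector<uint32_t> &a,
                                              const std::vector<uint32_t> &b,
                                              size_t steps) {
  uint32_t pa = 0, pb = 0;
  for (size_t i = 0; i < steps; i++) {
    pa = a[pa];
    pb = b[pb];
  }
  return {pa, pb};
}
```

Timing one lane against two (and more) over arrays far larger than the caches gives the curve described above.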

As you add more lanes, you should see better performance, up to a maximum. The faster the performance goes up as you add lanes, the more memory-level parallelism your processor core has. The best Amazon (AWS) servers come with either Intel Ice Lake or Amazon’s very own Graviton 3. I benchmarked both of them, using a core of each type. The Intel processor has the upper hand in absolute terms: we achieve a 12 GB/s maximal bandwidth compared to 9 GB/s for the Graviton 3. The one-lane latency is 120 ns for the Graviton 3 server versus 90 ns for the Intel processor. The Graviton 3 appears to sustain about 19 simultaneous loads per core against about 25 for the Intel processor.

Thus Intel wins, but the Graviton 3 has nice memory-level parallelism… much better than the older Intel chips (e.g., Skylake) and much better than the early attempts at ARM-based servers.

The source code is available. I am using Ubuntu 22.04 and GCC 11. All machines have small page sizes (4kB). I chose not to tweak the page size for these experiments.

Prices for Graviton 3 are 2.32 $US/hour (64 vCPU) compared to 2.448 $US/hour for Ice Lake. So Graviton 3 appears to be marginally cheaper than the Intel chips.

When I write these posts, comparing one product to another, there is always hate mail afterward. So let me be blunt. I love all chips equally.

If you want to know which system is best for your application: run benchmarks. Comprehensive benchmarks found that Amazon’s ARM hardware could be advantageous for storage-intensive tasks.

**Further reading**: I enjoyed Graviton 3: First Impressions.

In turn, data in software is often organized in data structures having a fixed size (in bytes). We often organize these data structures in arrays. In general, a data structure may reside on more than one cache line. For example, if I put a 5-byte data structure at byte address 127, then it will occupy the last byte of one cache line, and four bytes in the next cache line.

When loading a data structure from memory, a naive model of the cost is the number of cache lines that are accessed. If your data structure spans 32 bytes or 64 bytes, and you have aligned the first element of an array, then you only ever need to access one cache line every time you load a data structure.

What if my data structure spans 5 bytes? Suppose that I packed such structures in an array, using only 5 bytes per instance. If I pick one at random, how many cache lines do I touch? As you might expect, the answer is barely more than 1 cache line on average.

Let us generalize.

Suppose that my data structure spans z bytes. Let g be the greatest common divisor of z and 64. Suppose that you load one instance of the data structure at random from a large array. In general, the expected number of additional cache line accesses is (z – g)/64. The expected total number of cache line accesses is one more: 1 + (z – g)/64. You can check that this works for z = 32: g is then 32, and (z – g)/64 = (32 – 32)/64 = 0.
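Under these assumptions (64-byte cache lines, an aligned packed array), the formula is a one-liner; a quick sketch for illustration:

```cpp
#include <numeric> // std::gcd

// Expected number of cache lines touched when loading one randomly
// chosen z-byte structure from a packed array of such structures,
// assuming 64-byte cache lines and an aligned array start.
double expected_cache_lines(int z) {
  int g = std::gcd(z, 64);      // greatest common divisor of z and 64
  return 1.0 + (z - g) / 64.0;  // 1 + expected extra line accesses
}
// e.g., expected_cache_lines(5) == 1.0625, expected_cache_lines(63) == 1.96875
```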

I created the following table for all data structures no larger than a cache line. The worst-case scenario is a data structure spanning 63 bytes: you then almost always touch two cache lines.

I find it interesting that you get the same expected number of cache line accesses for data structures of sizes 17, 20 and 24. It does not follow that the computational cost of a data structure spanning 24 bytes is the same as that of a data structure spanning 17 bytes. Everything else being identical, a smaller data structure should fare better, as it can fit more easily in CPU cache.

| size of data structure (z) | expected cache line accesses |
|---|---|
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.03125 |
| 4 | 1.0 |
| 5 | 1.0625 |
| 6 | 1.0625 |
| 7 | 1.09375 |
| 8 | 1.0 |
| 9 | 1.125 |
| 10 | 1.125 |
| 11 | 1.15625 |
| 12 | 1.125 |
| 13 | 1.1875 |
| 14 | 1.1875 |
| 15 | 1.21875 |
| 16 | 1.0 |
| 17 | 1.25 |
| 18 | 1.25 |
| 19 | 1.28125 |
| 20 | 1.25 |
| 21 | 1.3125 |
| 22 | 1.3125 |
| 23 | 1.34375 |
| 24 | 1.25 |
| 25 | 1.375 |
| 26 | 1.375 |
| 27 | 1.40625 |
| 28 | 1.375 |
| 29 | 1.4375 |
| 30 | 1.4375 |
| 31 | 1.46875 |
| 32 | 1.0 |
| 33 | 1.5 |
| 34 | 1.5 |
| 35 | 1.53125 |
| 36 | 1.5 |
| 37 | 1.5625 |
| 38 | 1.5625 |
| 39 | 1.59375 |
| 40 | 1.5 |
| 41 | 1.625 |
| 42 | 1.625 |
| 43 | 1.65625 |
| 44 | 1.625 |
| 45 | 1.6875 |
| 46 | 1.6875 |
| 47 | 1.71875 |
| 48 | 1.5 |
| 49 | 1.75 |
| 50 | 1.75 |
| 51 | 1.78125 |
| 52 | 1.75 |
| 53 | 1.8125 |
| 54 | 1.8125 |
| 55 | 1.84375 |
| 56 | 1.75 |
| 57 | 1.875 |
| 58 | 1.875 |
| 59 | 1.90625 |
| 60 | 1.875 |
| 61 | 1.9375 |
| 62 | 1.9375 |
| 63 | 1.96875 |
| 64 | 1.0 |

Thanks to Maximilian Böther for the motivation of this post.

Most modern processors have SIMD instructions. The AVX-512 instructions are wider (more bits per register), but that is not necessarily their main appeal. If you merely take existing SIMD algorithms and apply them to AVX-512, you will probably not benefit as much as you would like. It is true that wider registers are beneficial, but in superscalar processors (processors that can issue several instructions per cycle), the number of instructions you can issue per cycle matters as much if not more. Typically, 512-bit AVX-512 instructions are more expensive and the processor can issue fewer of them per cycle. To fully benefit from AVX-512, you need to carefully design your code. It is made more challenging by the fact that Intel is releasing these instructions progressively: the recent processors have many new powerful AVX-512 instructions that were not initially available. Thus, AVX-512 is not “one thing” but rather a family of instruction sets.

Furthermore, early implementations of the AVX-512 instructions often led to measurable downclocking: the processor would reduce its frequency for a time following the use of these instructions. Thankfully, the latest Intel processors to support AVX-512 (Rocket Lake and Ice Lake) have done away with this systematic frequency throttling. And it is easy to detect these recent processors at runtime.
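A minimal sketch of such runtime detection, assuming GCC or clang (the exact feature strings you should test depend on which instructions your kernel uses; `avx512vbmi` is an Ice-Lake-era extension):

```cpp
// Sketch: use the compiler's CPU feature detection (GCC/clang) to
// decide at runtime whether to dispatch to an AVX-512 kernel.
bool can_use_fast_avx512() {
#if defined(__x86_64__) || defined(__i386__)
  return __builtin_cpu_supports("avx512f")      // AVX-512 foundation
      && __builtin_cpu_supports("avx512vbmi");  // Ice-Lake-era extension
#else
  return false; // not an x86 processor
#endif
}
```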

Amazon’s powerful Intel servers are based on Ice Lake. Thus if you are deploying your software applications to the cloud on powerful servers, you probably have pretty good support for AVX-512 already!

A few years ago, we released a really fast C++ JSON parser called simdjson. It is somewhat unique as a parser in that it relies critically on SIMD instructions. On several metrics, it was and still is the fastest JSON parser, though other interesting competitors have emerged.

Initially, I had written a quick and dirty AVX-512 kernel for simdjson. We never merged it and after a time, I just deleted it. I then forgot about it.

Thanks to contributions from talented Intel engineers (Fangzheng Zhang and Weiqiang Wan) as well as indirect contributions from readers of this blog (Kim Walisch and Jatin Bhateja), we produced a new and shiny AVX-512 kernel. As always, keep in mind that simdjson is the work of many people, a whole community of dozens of contributors. I must express my gratitude to Fangzheng Zhang, who first wrote to me about an AVX-512 port.

We just released it in the latest version of simdjson. It breaks new speed records.

Let us consider an interesting test where you seek to scan a whole file (spanning kilobytes) to find a value corresponding to some identifier. In simdjson, the code is as follows:

```cpp
auto doc = parser.iterate(json);
for (auto tweet : doc.find_field("statuses")) {
  if (uint64_t(tweet.find_field("id")) == find_id) {
    result = tweet.find_field("text");
    return true;
  }
}
return false;
```

On a Tiger Lake processor, with GCC 11, I get a 60% increase in the processing speed, expressed by the number of input bytes processed per second.

| kernel | speed |
|---|---|
| simdjson (512-bit SIMD): new | 7.4 GB/s |
| simdjson (256-bit SIMD): old | 4.6 GB/s |

The speed gain is so large because, in this task, we mostly just read the data and do relatively little secondary processing: we do not create a tree out of the JSON data, and we do not create a data structure.

The simdjson library has a minify function which just strips unnecessary spaces from the input. Maybe surprisingly, we are more than twice as fast as the previous baseline:

| kernel | speed |
|---|---|
| simdjson (512-bit SIMD): new | 12 GB/s |
| simdjson (256-bit SIMD): old | 4.3 GB/s |

Another reasonable benchmark is to fully parse the input into a DOM tree with full validation. Parsing a standard JSON file (`twitter.json`), I get nearly a 30% gain:

| kernel | speed |
|---|---|
| simdjson (512-bit SIMD): new | 3.6 GB/s |
| simdjson (256-bit SIMD): old | 2.8 GB/s |

While 30% may sound unexciting, we are starting from a fast baseline.

Could we do better? Assuredly. There are many AVX-512 instructions that we are not using yet. We do not use ternary Boolean operations (`vpternlog`). We are not using the new powerful shuffle functions (e.g., `vpermt2b`). We have an example of coevolution: better hardware requires new software which, in turn, makes the hardware shine.

Of course, to get these new benefits, you need recent Intel processors with adequate AVX-512 support and, evidently, you also need a relatively recent C++ compiler. Some of the recent laptop-class Intel processors do not support AVX-512, but you should be fine if you rely on AWS and have big Intel nodes.

You can grab our release directly or wait for it to reach one of the standard package managers (MSYS2, conan, vcpkg, brew, debian, FreeBSD, etc.).

It is debatable whether handling exceptions is better than dealing with error codes. I will happily use one or the other.

What I will object to, however, is the use of exceptions for control flow. It is fine to throw an exception when a file cannot be opened, unexpectedly. But you should not use exceptions to branch on the type of a value.

Let me illustrate.

Suppose that my code expects integers to be always positive. I might then have a function that checks such a condition:

```cpp
int get_positive_value(int x) {
  if (x < 0) {
    throw std::runtime_error("it is not positive!");
  }
  return x;
}
```

So far, so good. I am assuming that the exception is normally never thrown. It gets thrown if I have some kind of error.

If I want to sum the absolute values of the integers contained in an array, the following branching code is fine:

```cpp
int sum = 0;
for (int x : a) {
  if (x < 0) {
    sum += -x;
  } else {
    sum += x;
  }
}
```

Unfortunately, I often see solutions abusing exceptions:

```cpp
int sum = 0;
for (int x : a) {
  try {
    sum += get_positive_value(x);
  } catch (...) {
    sum += -x;
  }
}
```

The latter is obviously ugly and hard-to-maintain code. What is more, it can be highly inefficient. To illustrate, I wrote a small benchmark over random arrays containing a few thousand elements. I use the LLVM clang 12 compiler on a Skylake processor. The normal code is 10000 times faster in my tests!

| | time |
|---|---|
| normal code | 0.05 ns/value |
| exception | 500 ns/value |

Your results will differ but it is generally the case that using exceptions for control flow leads to suboptimal performance. And it is ugly too!

In my previous post, *Fast bitset decoding using Intel AVX-512*, I explained how you can use Intel’s new instructions, from the AVX-512 family, to decode bitsets faster. The AVX-512 instructions, as the name implies, often can process 512-bit (or 64-byte) registers.

At least two readers (Kim Walisch and Jatin Bhateja) pointed out that you could do better if you used the very latest AVX-512 instructions available on Intel processors with the Ice Lake or Tiger Lake microarchitectures. These processors support VBMI2 instructions including the `vpcompressb` instruction and its corresponding intrinsics (such as `_mm512_maskz_compress_epi8`). What this instruction does is take a 64-bit word and a 64-byte register, and it outputs (in a packed manner) only the bytes corresponding to set bits in the 64-bit word. Thus if you use as the 64-bit word the value 0b11011 and you provide a 64-byte register with the values 0,1,2,3,4… you will get as a result 0,1,3,4. That is, the instruction effectively does the decoding already, with the caveat that it will only write bytes. In practice, you often want the indexes as 32-bit integers. Thankfully, you can go from packed bytes to packed 32-bit integers easily. One possibility is to extract successive 128-bit subwords (using the `vextracti32x4` instruction or its intrinsic `_mm512_extracti32x4_epi32`), and expand them (using the `vpmovzxbd` instruction or its intrinsic `_mm512_cvtepu8_epi32`). You get the following result:

```cpp
void vbmi2_decoder_cvtepu8(uint32_t *base_ptr, uint32_t &base, uint32_t idx, uint64_t bits) {
  __m512i indexes = _mm512_maskz_compress_epi8(bits,
      _mm512_set_epi32(0x3f3e3d3c, 0x3b3a3938, 0x37363534, 0x33323130,
                       0x2f2e2d2c, 0x2b2a2928, 0x27262524, 0x23222120,
                       0x1f1e1d1c, 0x1b1a1918, 0x17161514, 0x13121110,
                       0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100));
  __m512i t0 = _mm512_cvtepu8_epi32(_mm512_castsi512_si128(indexes));
  __m512i t1 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(indexes, 1));
  __m512i t2 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(indexes, 2));
  __m512i t3 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(indexes, 3));
  __m512i start_index = _mm512_set1_epi32(idx);
  _mm512_storeu_si512(base_ptr + base, _mm512_add_epi32(t0, start_index));
  _mm512_storeu_si512(base_ptr + base + 16, _mm512_add_epi32(t1, start_index));
  _mm512_storeu_si512(base_ptr + base + 32, _mm512_add_epi32(t2, start_index));
  _mm512_storeu_si512(base_ptr + base + 48, _mm512_add_epi32(t3, start_index));
  base += _popcnt64(bits);
}
```

If you try to use this approach unconditionally, you will write 256 bytes of data for each 64-bit word you decode. In practice, if your word contains mostly just zeroes, you will be writing a lot of zeroes.

Branching is bad for performance, but only when it is hard to predict. However, it should be rather easy for the processor to predict whether we have fewer than 16 bits set in the provided word, or fewer than 32 bits, and so forth. So some level of branching is adequate. The following function should do:

```cpp
void vbmi2_decoder_cvtepu8_branchy(uint32_t *base_ptr, uint32_t &base, uint32_t idx, uint64_t bits) {
  if (bits == 0) {
    return;
  }
  __m512i indexes = _mm512_maskz_compress_epi8(bits,
      _mm512_set_epi32(0x3f3e3d3c, 0x3b3a3938, 0x37363534, 0x33323130,
                       0x2f2e2d2c, 0x2b2a2928, 0x27262524, 0x23222120,
                       0x1f1e1d1c, 0x1b1a1918, 0x17161514, 0x13121110,
                       0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100));
  __m512i start_index = _mm512_set1_epi32(idx);
  int count = _popcnt64(bits);
  __m512i t0 = _mm512_cvtepu8_epi32(_mm512_castsi512_si128(indexes));
  _mm512_storeu_si512(base_ptr + base, _mm512_add_epi32(t0, start_index));
  if (count > 16) {
    __m512i t1 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(indexes, 1));
    _mm512_storeu_si512(base_ptr + base + 16, _mm512_add_epi32(t1, start_index));
    if (count > 32) {
      __m512i t2 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(indexes, 2));
      _mm512_storeu_si512(base_ptr + base + 32, _mm512_add_epi32(t2, start_index));
      if (count > 48) {
        __m512i t3 = _mm512_cvtepu8_epi32(_mm512_extracti32x4_epi32(indexes, 3));
        _mm512_storeu_si512(base_ptr + base + 48, _mm512_add_epi32(t3, start_index));
      }
    }
  }
  base += count;
}
```

The results will vary depending on the input data, but I already have a realistic case with moderate density (about 10% of the bits are set) that I am reusing. Using a Tiger Lake processor and GCC 9, I get the following timings per set value, when using a sizeable input:

| | nanoseconds/value |
|---|---|
| basic | 0.95 |
| unrolled (simdjson) | 0.74 |
| AVX-512 (previous post) | 0.57 |
| AVX-512 (new) | 0.29 |

That is a rather remarkable performance, especially considering how we do not need any large table or sophisticated algorithm. All we need are fancy AVX-512 instructions.

You could check the value of each bit, but a better option is to use the fact that processors have fast instructions to compute the number of “trailing zeros”. Given 0b10001100100, this instruction would give you 2. This gives you the first index. Then you need to unset this least significant bit using code such as `word & (word - 1)`.

```c
// trailingzeroes can be implemented with a compiler intrinsic
// such as __builtin_ctzll (GCC/clang)
while (word != 0) {
  result[i] = trailingzeroes(word);
  word = word & (word - 1); // clear the least significant set bit
  i++;
}
```

The problem with this code is that the number of iterations might be hard to predict, thus you might often cause your processor to mispredict the number of branches. A misprediction is expensive on modern processors. You can do better by further unrolling this loop. I describe how in an earlier blog post.
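To give the flavor of that unrolling trick, here is a scalar sketch (a simplification using GCC/clang builtins; the actual simdjson routine differs in its details): the inner loop always runs a fixed four iterations, and only the output pointer advances by the true count, so the number of set bits no longer drives a hard-to-predict loop branch.

```cpp
#include <cstddef>
#include <cstdint>

// Well-defined count of trailing zeros, even for a zero input.
static inline int trailing_zeroes(uint64_t x) {
  return x == 0 ? 64 : __builtin_ctzll(x);
}

// Decode the positions of the set bits in 'word' into 'out'.
// 'out' must have room for a multiple of four entries; padding slots
// are overwritten by later iterations or simply ignored. Returns the
// number of indexes actually written.
size_t decode_bitset_unrolled(uint64_t word, uint32_t *out) {
  uint32_t *start = out;
  while (word != 0) {
    int matches = __builtin_popcountll(word);
    if (matches > 4) { matches = 4; }
    for (int k = 0; k < 4; k++) {   // fixed-trip inner loop
      out[k] = (uint32_t)trailing_zeroes(word);
      word = word & (word - 1);     // clear the least significant set bit
    }
    out += matches;                 // keep only the real entries
  }
  return (size_t)(out - start);
}
```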

Intel’s latest processors have new instruction sets (AVX-512) that are quite powerful. In this instance, they allow us to do the decoding without any branch and with few instructions. The key is the `vpcompressd` instruction and its corresponding C/C++ *Intel* function (`_mm512_mask_compressstoreu_epi32`). What it does is that given up to 16 integers, it only selects the ones corresponding to a bit set in a bitset. Thus given the array 0,1,2,3….16 and given the bitset 0b111010, you would generate the output 1,3,4,6. The function does not tell you how many relevant values are written out, but you can just count the number of ones, and conveniently, we have a fast instruction for that, available through the `_popcnt64` function. So the following code sequence would process 16-bit masks and write them out to a pointer (`base_ptr`).

```cpp
__m512i base_index = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
_mm512_mask_compressstoreu_epi32(base_ptr, mask, base_index);
base_ptr += _popcnt64(mask);
```
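To see what the compress step computes, here is a portable Python sketch that emulates the selection (the `compress` name is illustrative, not Intel’s API):

```python
def compress(values, mask):
    # keep the values whose corresponding mask bit is set,
    # in order, like vpcompressd does
    return [v for i, v in enumerate(values) if (mask >> i) & 1]

# bits 1, 3, 4 and 5 are set in 0b111010
assert compress(list(range(16)), 0b111010) == [1, 3, 4, 5]
# the popcount of the mask tells us how many values were written
assert bin(0b111010).count("1") == 4
```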

The full function which processes 64-bit masks is somewhat longer, but it is essentially just 4 copies of the simple sequence.

```cpp
void avx512_decoder(uint32_t *base_ptr, uint32_t &base,
                    uint32_t idx, uint64_t bits) {
  __m512i start_index = _mm512_set1_epi32(idx);
  __m512i base_index = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
  base_index = _mm512_add_epi32(base_index, start_index);
  uint16_t mask;
  mask = bits & 0xFFFF;
  _mm512_mask_compressstoreu_epi32(base_ptr + base, mask, base_index);
  base += _popcnt64(mask);
  const __m512i constant16 = _mm512_set1_epi32(16);
  base_index = _mm512_add_epi32(base_index, constant16);
  mask = (bits >> 16) & 0xFFFF;
  _mm512_mask_compressstoreu_epi32(base_ptr + base, mask, base_index);
  base += _popcnt64(mask);
  base_index = _mm512_add_epi32(base_index, constant16);
  mask = (bits >> 32) & 0xFFFF;
  _mm512_mask_compressstoreu_epi32(base_ptr + base, mask, base_index);
  base += _popcnt64(mask);
  base_index = _mm512_add_epi32(base_index, constant16);
  mask = bits >> 48;
  _mm512_mask_compressstoreu_epi32(base_ptr + base, mask, base_index);
  base += _popcnt64(mask);
}
```

There is a downside to using AVX-512: for a short time, the processor might reduce its frequency when wide registers (512 bits) are used. You can still use the same instructions on shorter registers (e.g., use `_mm256_mask_compressstoreu_epi32` instead of `_mm512_mask_compressstoreu_epi32`) but in this instance, it doubles the number of instructions.

On a Skylake-X processor with GCC, my benchmark reveals that the new AVX-512 approach is superior even with frequency throttling. Compared to the basic approach above, the AVX-512 approach uses 45% fewer cycles and 33% less time. We report the number of instructions, cycles and nanoseconds per value set in the bitset. The AVX-512 approach generates no branch mispredictions.

| | instructions/value | cycles/value | nanoseconds/value |
|---|---|---|---|
| basic | 9.3 | 4.4 | 1.5 |
| unrolled (simdjson) | 9.9 | 3.6 | 1.2 |
| AVX-512 | 6.2 | 2.4 | 1.0 |

The AVX-512 routine has record-breaking speed. It is also possible that the routine could be improved.

I covered a related problem before: the removal of all spaces from strings. At the time, I concluded that the fastest approach might be to use SIMD instructions coupled with a large lookup table. A SIMD instruction can operate on many words at a time: most commodity processors have instructions able to operate on 16 bytes at a time. Thus, using a single instruction, you can compare 16 consecutive bytes and identify the location of all spaces, for example. Once that is done, you must somehow move the unwanted characters. Most instruction sets do not have instructions for that purpose; however, x64 processors have an instruction that can move bytes around as long as you have a precomputed shuffle mask (`pshufb`). ARM NEON has similar instructions as well. Thus you proceed in the following manner:

- Identify all unwanted characters in a block (e.g., 16 bytes).
- Lookup a shuffle mask in a large table.
- Move the unwanted bytes using the shuffle mask.

Such an approach is fast, but it requires potentially large tables. Indeed, if you load 16 bytes, you need a table with 65536 shuffle masks. Storing such large tables is not very practical.

Recent Intel processors have handy new instructions that do exactly what we want: they prune out unwanted bytes (`vpcompressb`). This requires a recent processor with AVX-512 VBMI2, such as an Ice Lake, Rocket Lake, Alder Lake or Tiger Lake processor. Intel makes it difficult to figure out which features are available on which processor, so you need to do some research to find out whether your favorite Intel processor supports the desired instructions. AMD processors do not support VBMI2.

On top of the new instructions, AVX-512 also allows you to process the data in larger blocks (64 bytes). Using Intel intrinsics, the code is almost readable. I create a register containing only the space byte, and I then iterate over my data, each time loading 64 bytes. I compare them with the space: I only want to keep values that are larger (as byte values) than the space. I then call the compress instruction, which takes out the unwanted bytes. I read at regular intervals (every 64 bytes) but I write a variable number of bytes, so I advance the write pointer by the number of set bits in my mask, which I count using a fast instruction (`popcnt`).

```cpp
__m512i spaces = _mm512_set1_epi8(' ');
size_t i = 0;
for (; i + 63 < howmany; i += 64) {
  __m512i x = _mm512_loadu_si512(bytes + i);
  __mmask64 notwhite = _mm512_cmpgt_epi8_mask(x, spaces);
  _mm512_mask_compressstoreu_epi8(bytes + pos, notwhite, x);
  pos += _popcnt64(notwhite);
}
```
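The SIMD loop is equivalent to the following scalar Python sketch (the `despace` name is illustrative). Note that the `_mm512_cmpgt_epi8_mask` comparison is signed, so we emulate that as well:

```python
def despace(data: bytes) -> bytes:
    # a byte is kept when its *signed* value exceeds that of the space (0x20),
    # mirroring the signed byte comparison in the SIMD code; bytes 0x80 and
    # above are negative when interpreted as signed and are also removed
    def signed(b):
        return b - 256 if b >= 128 else b
    return bytes(b for b in data if signed(b) > 0x20)

assert despace(b"hello world\n") == b"helloworld"
```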

I have updated the despacer library and its benchmark. With a Tiger Lake processor (3.3 GHz) and GCC 9 (Linux), I get the following results:

function | speed |
---|---|

conventional (despace32) | 0.4 GB/s |

SIMD with large lookup (sse42_despace_branchless) | 2.0 GB/s |

AVX-512 (vbmi2_despace) | 8.5 GB/s |

Your results will differ. Nevertheless, we find that AVX-512 is highly useful for this task and the corresponding function surpasses all the others. It is not merely the raw speed: we do not require a lookup table, and the code does not rely on branch prediction, so there are no hard-to-predict branches that may harm your speed in practice.

The result should not surprise us since, for the first time, we almost have direct hardware support for the operation (“pruning unwanted bytes”). The downside is that few processors support the desired instruction set. And it is not clear whether AMD will ever support these fancy instructions.

I should conclude with Linus Torvalds’ take regarding AVX-512:

*I hope AVX-512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on*

I cannot predict what will happen to Intel or AVX-512, but if the past is any indication, specialized and powerful instructions have a bright future.

Soon enough, programmers realized that they needed not only to store files, but also to keep track of the different versions of a given file. It is no accident that we are all familiar with the fact that software is often associated with versions. It is necessary to distinguish the different versions of the computer code in order to keep track of updates.

We might think that after developing a new version of a software, the previous versions could be discarded. However, it is practical to keep a copy of each version of the computer code for several reasons:

- A change in the code that we thought was appropriate may cause problems: we need to be able to go back quickly.
- Sometimes different versions of the computer code are used at the same time and it is not possible for all users to switch to the latest version. If an error is found in a previous version of the computer code, it may be necessary for the programmer to go back and correct the error in an earlier version of the computer code without changing the current code. In this scenario, the evolution of the software is not strictly linear. It is therefore possible to release version 1.0, followed by version 2.0, and then release version 1.1.
- It is sometimes useful to be able to go back in time to study the evolution of the code in order to understand the motivation behind a section of code. For example, a section of code may have been added without much comment to quickly fix a new bug. The attentive programmer will be able to better understand the code by going back and reading the changes in context.
- Computer code is often modified by different programmers working at the same time. In such a social context, it is often useful to be able to quickly determine who made what change and when. For example, if a problem is caused by a segment of code, we may want to question the programmer who last worked on that segment.

Programmers quickly realized that they needed version control systems. The basic functions that a version control system provides are rollback and a history of the changes made. Over time, the concept of version control has spread. There are even several variants intended for the general public, such as Dropbox, where various files, not only computer code, are stored.

The history of software version control tools dates back to the 1970s (Rochkind, 1975). In 1972, Rochkind developed the SCCS (Source Code Control System) at Bell Laboratories. This system made it possible to create, update and track changes in a software project. SCCS remained a reference from the end of the 1970s until the 1980s. One of the constraints of SCCS is that it does not allow collaborative work: only one person can modify a given file at a given time.

In the early 1980s, Tichy proposed the RCS (Revision Control System), which innovated with respect to SCCS by storing only the differences between the different versions of a file in backward order, starting from the latest file. In contrast, SCCS stored differences in forward order starting from the first version. For typical use where we access the latest version, RCS is faster.

In programming, we typically store computer code within text files. Text files most often use ASCII or Unicode (UTF-8 or UTF-16) encoding. Lines are separated by a sequence of special characters that identify the end of a line and the beginning of a new line. Two characters are often used for this purpose: “carriage return” (CR) and “line feed” (LF). In ASCII and UTF-8, these characters are represented with the byte having the value 13 and the byte having the value 10 respectively. In Windows, the sequence is composed of the CR character followed by the LF character, whereas in Linux and macOS, only the LF character is used. In most programming languages, we can represent these two characters with the escape sequences \r and \n respectively. So the string “a\nb\nc” has three lines in most programming languages under Linux or macOS: the lines “a”, “b” and “c”.
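As a quick check in Python, where `\n` is the LF character and `\r\n` is the Windows-style CRLF sequence:

```python
# splitting on the LF character recovers the three lines
assert "a\nb\nc".split("\n") == ["a", "b", "c"]
# a Windows-style string uses CRLF between lines instead
assert "a\r\nb\r\nc".split("\r\n") == ["a", "b", "c"]
```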

When a text file is edited by a programmer, usually only a small fraction of all lines are changed. Some lines may also be inserted or deleted. It is convenient to describe the differences as succinctly as possible by identifying the new lines, the deleted lines and the modified lines.

The calculation of differences between two text files is often done first by breaking the text files into lines. We then treat a text file as a list of lines. Given two versions of the same file, we want to associate as many lines in the first version as possible with an identical line in the second version. We also assume that the order of the lines is not reversed.

We can formalize this problem by looking for the longest common subsequence. Given a list, a subsequence simply takes a part of the list, excluding some elements. For example, (a,b,d) is a subsequence of the list (a,b,c,d,e). Given two lists, we can find a common subsequence, e.g. (a,b,d) is a subsequence of the list (a,b,c,d,e) and the list (z,a,b,d). The longest common subsequence between two lists of text lines represents the list of lines that have not been changed between the two versions of a text file. It might be difficult to solve this problem using brute force. Fortunately, we can compute the longest common subsequence by dynamic programming. Indeed, we can make the following observations.

- If we have two strings with a longest common subsequence of length k, and we append the same character to each of the two strings, the new strings have a longest common subsequence of length k+1.
- If we have two strings of lengths m and n, ending in distinct characters (for example, “abc” and “abd”), then the longest common subsequence of the two strings is the longest common subsequence obtained after removing the last character from one of the two strings. In other words, to determine the length of the longest common subsequence of two strings, we take the maximum of the length obtained after removing the last character from the first string while keeping the second unchanged, and the length obtained after removing the last character from the second string while keeping the first unchanged.

These two observations are sufficient to allow an efficient calculation of the length of the longest common subsequence. It is sufficient to start with strings comprising only the first character and to add progressively the following characters. In this way, one can calculate all the longest common subsequences between the truncated strings. It is then possible to reverse this process to build the longest subsequence starting from the end. If two strings end with the same character, we know that the last character will be part of the longest subsequence. Otherwise, one of the two strings is cut off from its last character, making our choice in such a way as to maximize the length of the longest common subsequence.

The following function illustrates a possible solution to this problem. Given two arrays of strings, the function returns the longest common subsequence. If the first string has length `m` and the second `n`, then the algorithm runs in `O(m*n)` time.

```go
func longest_subsequence(file1, file2 []string) []string {
	m, n := len(file1), len(file2)
	P := make([]uint, (m+1)*(n+1))
	for i := 1; i <= m; i++ {
		for j := 1; j <= n; j++ {
			if file1[i-1] == file2[j-1] {
				P[i*(n+1)+j] = 1 + P[(i-1)*(n+1)+(j-1)]
			} else {
				// max is a built-in function as of Go 1.21
				P[i*(n+1)+j] = max(P[i*(n+1)+(j-1)], P[(i-1)*(n+1)+j])
			}
		}
	}
	longest := P[m*(n+1)+n]
	i, j := m, n
	subsequence := make([]string, longest)
	for k := longest; k > 0; {
		if P[i*(n+1)+j] == P[i*(n+1)+(j-1)] {
			j--
		} else if P[i*(n+1)+j] == P[(i-1)*(n+1)+j] {
			i--
		} else if P[i*(n+1)+j] == 1+P[(i-1)*(n+1)+(j-1)] {
			// the two prefixes end with the same line:
			// it is part of the longest common subsequence
			subsequence[k-1] = file1[i-1]
			k--
			i--
			j--
		}
	}
	return subsequence
}
```

Once the subsequence has been calculated, we can quickly compute a description of the difference between the two text files. Simply move forward in each of the text files, line by line, stopping as soon as you reach a position corresponding to an element of the longest common subsequence. The lines that do not correspond to the subsequence in the first file are considered as having been deleted, while the lines that do not correspond to the subsequence in the second file are considered as having been added. The following function illustrates a possible solution.

```go
func difference(file1, file2 []string) []string {
	subsequence := longest_subsequence(file1, file2)
	i, j, k := 0, 0, 0
	answer := make([]string, 0)
	for i < len(file1) && k < len(file2) {
		if j < len(subsequence) && file1[i] == subsequence[j] &&
			file2[k] == subsequence[j] {
			answer = append(answer, "'"+file2[k]+"'\n")
			i++
			j++
			k++
		} else {
			if j == len(subsequence) || file1[i] != subsequence[j] {
				answer = append(answer, "deleted: '"+file1[i]+"'\n")
				i++
			}
			if j == len(subsequence) || file2[k] != subsequence[j] {
				answer = append(answer, "added: '"+file2[k]+"'\n")
				k++
			}
		}
	}
	for ; i < len(file1); i++ {
		answer = append(answer, "deleted: '"+file1[i]+"'\n")
	}
	for ; k < len(file2); k++ {
		answer = append(answer, "added: '"+file2[k]+"'\n")
	}
	return answer
}
```

The function we propose as an illustration for computing the longest subsequence uses `O(m*n)` memory elements. It is possible to reduce the memory usage of this function and simplify it (Hirschberg, 1975). Several other improvements are possible in practice (Miller and Myers, 1985). We can then represent the changes between the two files in a concise way.
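To illustrate the memory reduction, the *length* of the longest common subsequence only ever needs the previous row of the dynamic-programming table, so `O(n)` memory suffices. A Python sketch (illustrative, not Hirschberg's full algorithm, which also recovers the subsequence itself):

```python
def lcs_length(a, b):
    # keep only the previous row of the dynamic-programming table
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(curr[j - 1], prev[j]))
        prev = curr
    return prev[len(b)]

# the longest common subsequence of these two lists is (a, b, d)
assert lcs_length(list("abcde"), list("zabd")) == 3
```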

Suggested reading: the Wikipedia article on Diff.

Like SCCS, RCS does not allow multiple programmers to work on the same file at the same time. The need to own a file to the exclusion of all other programmers while working on it may have seemed a reasonable constraint at the time, but it can make the work of a team of programmers much more cumbersome.

In 1986, Grune developed the Concurrent Versions System (CVS). Unlike previous systems, CVS allows multiple programmers to work on the same file simultaneously. It also adopts a client-server model that allows a single repository to be present on a network, accessible by multiple programmers simultaneously. The programmer can work on a file locally, but as long as they have not transmitted their version to the server, it remains invisible to the other developers.

The remote server also serves as a de facto backup for the programmers. Even if all the programmers’ computers are destroyed, it is possible to start over with the code on the remote server.

In a version control system, there is usually a single latest version. All programmers make changes to this latest version. However, such a linear approach has its limits. An important innovation that CVS introduced is the concept of a branch. A branch makes it possible to organize sets of versions that can evolve in parallel. In this model, the same file is virtually duplicated. There are then two versions of the file (or more than two) capable of evolving in parallel. By convention, there is usually one main branch that is used by default, accompanied by several secondary branches. Programmers can create new branches whenever they want. Branches can then be merged: if a branch A is divided into two branches (A and B) which are modified, it is then possible to bring all the modifications into a single branch (merging A and B). The branch concept is useful in several contexts:

- Some software development is speculative. For example, a programmer may explore a new approach without being sure that it is viable. In such a case, it may be better to work in a separate branch and merge with the main branch only if successful.
- The main branch may be restricted to certain programmers for security reasons. In such a case, programmers with reduced access may be restricted to separate branches. A programmer with privileged access may then merge the secondary branch after a code inspection.
- A branch can be used to explore a particular bug and its fix.
- A branch can be used to update a previous version of the code. Such a version may be kept up to date because some users depend on that earlier version and want to receive certain fixes. In such a case, the secondary branch may never be integrated with the main branch.

One of the drawbacks of CVS is poor performance when projects include multiple files and multiple versions. In 2000, Subversion (SVN) was proposed as an alternative to CVS that meets the same needs, but with better performance.

CVS and Subversion benefit from a client-server approach, which allows multiple programmers to work simultaneously with the same version repository. Yet programmers often want to be able to use several separate remote repositories.

To meet these needs, various “distributed version control systems” (DVCS) have been developed. The most popular one is probably the Git system developed by Torvalds (2005). Torvalds was trying to solve a problem of managing the Linux source code. Git became the dominant version management tool. It has been adopted by Google, Microsoft, etc. It is free software.

In a distributed model, a programmer who has a local copy of the code can synchronize it with one remote repository or another. They can easily create a new copy of the remote repository on a new server. Such flexibility is considered essential in many complex projects such as the Linux operating system kernel.

Several companies offer Git-based services including GitHub. Founded in 2008, GitHub has tens of millions of users. In 2018, Microsoft acquired GitHub for $7.5 billion.

For CVS and Subversion, there is only one set of software versions. With a distributed approach, multiple sets can coexist on separate servers. The net result is that a software project can evolve differently, under the responsibility of different teams, with possible future reconciliation.

In this sense, Git is distributed. Although many users rely on GitHub (for example), your local copy can be attached to any remote repository, and it can even be attached to multiple remote repositories. The verb “clone” is often used to describe the retrieval of a Git project locally, since what you get is a complete copy of all files, changes, and branches.

If a copy of the project is attached to another remote repository, it is called a fork. We often distinguish between branches and forks. A branch always belongs to the main project. A fork is originally a complete copy of the project, including all branches. It is possible for a fork to rejoin the main project, but it is not essential.

Given a publicly available Git repository, anyone can clone it, start working on it and contribute to it. We can create a new fork. From a fork, we can submit a pull request that invites people to integrate our changes. This allows a form of permissionless innovation. Indeed, it becomes possible to retrieve the code, modify it and propose a new version without ever having to interact directly with the authors.

Systems like CVS and Subversion could become inefficient, taking several minutes to perform certain operations. Git, in contrast, is generally efficient and fast, even for huge projects. Git is robust and does not get “corrupted” easily. However, it is not recommended to use Git for huge files such as multimedia content: Git’s strength lies in text files. It should be noted that the implementation of Git has improved over time and includes sophisticated indexing techniques.

Git is often used on the command line. It is possible to use graphical clients. Services like GitHub make Git a little easier.

The basic logical unit of Git is the `commit`, which is a set of changes to multiple files. A `commit` includes a reference to at least one parent, except for the first `commit`, which has no parent. A single `commit` can be the parent of several children: several branches can be created from a `commit`, and each subsequent `commit` becomes a child of the initial `commit`. Furthermore, when several branches are merged, the resulting `commit` has several parents. In this sense, the `commits` form a directed acyclic graph.

With Git, we want to be able to refer to a `commit` in an easy way, using a unique identifier. That is, we want a short numeric value that corresponds to one and only one `commit`. We could assign each `commit` a version number (1.0, 2.0, etc.). Unfortunately, such an approach is difficult to reconcile with the fact that `commits` do not form a linear chain where each `commit` has one and only one parent. As an alternative, we use a hash function to compute the unique identifier. A hash function takes elements as parameters and computes a numerical value (the hash value). There are several simple hash functions. For example, we can iterate over the bytes contained in a message, starting from a value `h` and computing `h = 31 * h + b` where `b` is the byte value. For example, a message containing bytes 3 and 4 has the hash value `31 * (31 * 0 + 3) + 4 = 97` if we start with `h = 0`. Such a simple approach is effective in some cases, but it allows malicious users to create collisions: it would be possible to create a fake `commit` that has the same hash value and thus create security holes. For this reason, Git uses more sophisticated hash functions (SHA-1, SHA-256) developed by cryptography specialists. Commits are identified using a hash value (for example, the hexadecimal value 921103db8259eb9de72f42db8b939895f5651489) computed from the date and time, the comment left by the programmer, the user’s name, the parents and the nature of the change. In theory, two `commits` could have the same hash value, but this is an unlikely event given the hash functions used by Git. It is not always practical to reference a hexadecimal code. To make things easier, Git lets you identify a commit with a tag (e.g., v1.0.0). The following command will do: `git tag -a v1.0.0 -m "version 1.0.0"`.
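The simple iterated hash can be sketched in Python (the `simple_hash` name is illustrative; this is the naive scheme described above, not what Git uses):

```python
def simple_hash(message: bytes, h: int = 0) -> int:
    # iterate over the bytes, folding each one into the running hash
    for b in message:
        h = 31 * h + b
    return h

# bytes 3 and 4, starting from h = 0: 31 * (31 * 0 + 3) + 4 = 97
assert simple_hash(bytes([3, 4])) == 97
```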

Though tags can be any string, tags often contain sequences of numbers indicating a version. There is no general agreement among programmers on how to assign version numbers. However, tags sometimes take the form of three numbers separated by dots: MAJOR.MINOR.PATCH (for example, 1.2.3). With each new version, 1 is added to at least one of the three numbers. The first number often starts at 1 while the other two start at 0.

- The first number (MAJOR) must be increased when you make major changes to the code. The other two numbers (MINOR and PATCH) are often reset to zero. For example, you can go from version 1.2.3 to version 2.0.0.
- The second number (MINOR) is increased for minor changes (for example, adding a function). When increasing the second number, the first number (MAJOR) is usually kept unchanged and the last number (PATCH) is reset to zero.
- The last number (PATCH) is increased when fixing bugs. The other two numbers are not increased.

There are more precise versions of this convention, such as “semantic versioning”.
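The three bump rules above can be sketched as a small Python helper (the `bump` function is illustrative, not part of any versioning tool):

```python
def bump(version: str, part: str) -> str:
    # part is one of "major", "minor" or "patch"
    major, minor, patch = map(int, version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"  # reset MINOR and PATCH
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # reset PATCH only
    return f"{major}.{minor}.{patch + 1}"

assert bump("1.2.3", "major") == "2.0.0"
assert bump("1.2.3", "minor") == "1.3.0"
assert bump("1.2.3", "patch") == "1.2.4"
```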

With Git, the programmer can have a local copy of the commit graph. They can add new `commits`. In a subsequent step, the programmer must “push” their changes to a remote repository so that the changes become visible to other programmers. The other programmers can fetch the changes using a `pull`.

Git has advanced collaborative features. For example, the `git blame` command lets you know who last modified a given piece of code.

Version control in computing is a sophisticated approach that has benefited from many years of work. It is possible to store multiple versions of the same file at low cost and navigate from one version to another quickly.

If you develop code without using a version control tool like Git or the equivalent, you are bypassing proven practices. It’s likely that if you want to work on complex projects with multiple programmers, your productivity will be much lower without version control.

This 15-digit accuracy rule fails for numbers that are outside the valid range. For example, the number 1e500 is too large and cannot be directly represented using standard 64-bit floating-point numbers. Similarly, 1e-500 is too small and can only be represented as zero.

The range of 64-bit floating-point numbers might be defined as going from 4.94e-324 to 1.8e308 and from -1.8e308 to -4.94e-324, together with exactly 0. However, this range includes subnormal numbers, where the relative accuracy can be poor. For example, the number 5.00000000000000e-324 is best represented as 4.94065645841247e-324, meaning that we get zero digits of accuracy.
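We can verify these edge cases in Python, whose `float` is a standard 64-bit floating-point number:

```python
import math

# 1e500 overflows the 64-bit format: parsing yields infinity
assert math.isinf(float("1e500"))
# 1e-500 underflows: it can only be represented as zero
assert float("1e-500") == 0.0
# 5e-324 lands on the nearest subnormal, 4.94065645841247e-324
assert float("5e-324") == float("4.94065645841247e-324")
```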

For the 15-digit accuracy rule to work, you must remain in the normal range, e.g., from 2.225e−308 to 1.8e308 and from -1.8e308 to -2.225e−308. There are other good reasons to avoid the subnormal range, such as its poor performance and low accuracy.

To summarize, standard floating-point numbers have excellent accuracy (at least 15 digits) as long as you remain in their normal range, which is between 2.225e−308 and 1.8e308 for positive numbers.

The character é is typically represented as the numerical value 233 (or 0xe9 in hexadecimal). Thus in Python, JavaScript and many other programming languages, you get the following:

```python
>>> "\u00e9"
'é'
```

Unfortunately, Unicode does not ensure that there is a unique way to achieve every visual character. For example, you can combine the letter ‘e’ (code point 0x65) with the ‘acute accent’ (code point 0x0301):

```python
>>> "\u0065\u0301"
'é'
```

Unfortunately, in most programming languages, these strings will not be considered to be the same even though they look the same to us:

```python
>>> "\u0065\u0301" == "\u00e9"
False
```

For obvious reasons, this can be a problem within a computer system. What if you are searching a database for names containing the character ‘é’?

The standard solution is to *normalize* your strings. In effect, you transform them so that strings that are semantically equal are written with the same code points. In Python, you may do it as follows:

```python
>>> import unicodedata
>>> unicodedata.normalize('NFC', "\u00e9") == unicodedata.normalize('NFC', "\u0065\u0301")
True
```

There are multiple ways to normalize your strings, and there are nuances.
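For instance, in Python, the NFC form composes code points while the NFD form decomposes them:

```python
import unicodedata

# NFC composes: 'e' + combining acute accent becomes the single code point é
assert unicodedata.normalize("NFC", "\u0065\u0301") == "\u00e9"
# NFD decomposes: the single code point é becomes 'e' + combining acute accent
assert unicodedata.normalize("NFD", "\u00e9") == "\u0065\u0301"
```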

In JavaScript and other programming languages, there are equivalent functions:

```javascript
> "\u0065\u0301".normalize() == "\u00e9".normalize()
true
```

Though you should expect normalization to be efficient, it is unlikely to be computationally free. Thus you should not repeatedly normalize your strings, as I have done. Rather you should probably normalize the strings as they enter your system, so that each string is normalized only once.

Normalization alone does not solve all of your problems, evidently. There are many complicated issues with internationalization, but if you are at least aware of the normalization problem, many perplexing issues are easily explained.

**Further reading**: Internationalization for Turkish: Dotted and Dotless Letter “I”

In an earlier blog post, I presented the simpler problem of converting integers to fixed-digit strings, using exactly 16 characters with leading zeroes as needed. For example, the integer 12345 becomes the string ‘0000000000012345’.

For this problem, the most practical approach might be a tree-based version with a small table. The core idea is to start from the integer, compute an integer representing the 8 most significant decimal digits, and another integer representing the 8 least significant decimal digits. Then we repeat, dividing the two eight-digit integers into four-digit integers, and so forth, until we get to two-digit integers, at which point we use a small table to convert them to a decimal representation. The code in C++ might look as follows:

```cpp
void to_string_tree_table(uint64_t x, char *out) {
  // ASCII representations of 00, 01, ..., 99
  static const char table[200] = {
      0x30, 0x30, 0x30, 0x31, 0x30, 0x32, 0x30, 0x33, 0x30, 0x34,
      0x30, 0x35, 0x30, 0x36, 0x30, 0x37, 0x30, 0x38, 0x30, 0x39,
      0x31, 0x30, 0x31, 0x31, 0x31, 0x32, 0x31, 0x33, 0x31, 0x34,
      0x31, 0x35, 0x31, 0x36, 0x31, 0x37, 0x31, 0x38, 0x31, 0x39,
      0x32, 0x30, 0x32, 0x31, 0x32, 0x32, 0x32, 0x33, 0x32, 0x34,
      0x32, 0x35, 0x32, 0x36, 0x32, 0x37, 0x32, 0x38, 0x32, 0x39,
      0x33, 0x30, 0x33, 0x31, 0x33, 0x32, 0x33, 0x33, 0x33, 0x34,
      0x33, 0x35, 0x33, 0x36, 0x33, 0x37, 0x33, 0x38, 0x33, 0x39,
      0x34, 0x30, 0x34, 0x31, 0x34, 0x32, 0x34, 0x33, 0x34, 0x34,
      0x34, 0x35, 0x34, 0x36, 0x34, 0x37, 0x34, 0x38, 0x34, 0x39,
      0x35, 0x30, 0x35, 0x31, 0x35, 0x32, 0x35, 0x33, 0x35, 0x34,
      0x35, 0x35, 0x35, 0x36, 0x35, 0x37, 0x35, 0x38, 0x35, 0x39,
      0x36, 0x30, 0x36, 0x31, 0x36, 0x32, 0x36, 0x33, 0x36, 0x34,
      0x36, 0x35, 0x36, 0x36, 0x36, 0x37, 0x36, 0x38, 0x36, 0x39,
      0x37, 0x30, 0x37, 0x31, 0x37, 0x32, 0x37, 0x33, 0x37, 0x34,
      0x37, 0x35, 0x37, 0x36, 0x37, 0x37, 0x37, 0x38, 0x37, 0x39,
      0x38, 0x30, 0x38, 0x31, 0x38, 0x32, 0x38, 0x33, 0x38, 0x34,
      0x38, 0x35, 0x38, 0x36, 0x38, 0x37, 0x38, 0x38, 0x38, 0x39,
      0x39, 0x30, 0x39, 0x31, 0x39, 0x32, 0x39, 0x33, 0x39, 0x34,
      0x39, 0x35, 0x39, 0x36, 0x39, 0x37, 0x39, 0x38, 0x39, 0x39,
  };
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;
  uint64_t toptoptop = toptop / 100;
  uint64_t toptopbottom = toptop % 100;
  uint64_t topbottomtop = topbottom / 100;
  uint64_t topbottombottom = topbottom % 100;
  uint64_t bottomtoptop = bottomtop / 100;
  uint64_t bottomtopbottom = bottomtop % 100;
  uint64_t bottombottomtop = bottombottom / 100;
  uint64_t bottombottombottom = bottombottom % 100;
  memcpy(out, &table[2 * toptoptop], 2);
  memcpy(out + 2, &table[2 * toptopbottom], 2);
  memcpy(out + 4, &table[2 * topbottomtop], 2);
  memcpy(out + 6, &table[2 * topbottombottom], 2);
  memcpy(out + 8, &table[2 * bottomtoptop], 2);
  memcpy(out + 10, &table[2 * bottomtopbottom], 2);
  memcpy(out + 12, &table[2 * bottombottomtop], 2);
  memcpy(out + 14, &table[2 * bottombottombottom], 2);
}
```

It compiles down to dozens of instructions.

Could you do better without using a much larger table?

It turns out that you can do much better if you have a recent Intel processor with the appropriate AVX-512 instructions (IFMA, VBMI), as demonstrated by an Internet user called InstLatX64.

We rely on the observation that you can compute directly the quotient and the remainder of the division using a series of multiplications and shifts (Lemire et al. 2019).
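To give a flavor of the scalar version of this trick, here is a sketch in Go (my own illustration, not the AVX-512 code): for a 32-bit dividend and the divisor 10, a single precomputed 64-bit constant provides both the quotient and the remainder.

```go
package main

import (
	"fmt"
	"math/bits"
)

// c is floor(2^64/10) + 1. Multiplying a 32-bit x by c puts the quotient
// x/10 in the high 64 bits of the product; multiplying the low 64 bits by
// 10 puts the remainder x%10 in the high bits (Lemire et al. 2019).
const c = uint64(0x1999999999999999) + 1

func divmod10(x uint32) (q, r uint32) {
	hi, lo := bits.Mul64(c, uint64(x))
	q = uint32(hi)
	rhi, _ := bits.Mul64(lo, 10)
	r = uint32(rhi)
	return q, r
}

func main() {
	q, r := divmod10(12345678)
	fmt.Println(q, r) // 1234567 8
}
```

The AVX-512 routine applies the same multiply-and-shift idea across several lanes at once.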

The code is a bit technical, but remarkably, it does not require a table. And it generates several times fewer instructions. For the sake of simplicity, I merely provide an implementation using Intel intrinsics. You are not expected to follow the code in detail, but you should notice that it is rather short.

```
void to_string_avx512ifma(uint64_t n, char *out) {
  uint64_t n_15_08 = n / 100000000;
  uint64_t n_07_00 = n % 100000000;
  __m512i bcstq_h = _mm512_set1_epi64(n_15_08);
  __m512i bcstq_l = _mm512_set1_epi64(n_07_00);
  __m512i zmmzero = _mm512_castsi128_si512(_mm_cvtsi64_si128(0x1A1A400));
  __m512i zmmTen = _mm512_set1_epi64(10);
  __m512i asciiZero = _mm512_set1_epi64('0');
  __m512i ifma_const = _mm512_setr_epi64(
      0x00000000002af31dc, 0x0000000001ad7f29b, 0x0000000010c6f7a0c,
      0x00000000a7c5ac472, 0x000000068db8bac72, 0x0000004189374bc6b,
      0x0000028f5c28f5c29, 0x0000199999999999a);
  __m512i permb_const = _mm512_castsi128_si512(
      _mm_set_epi8(0x78, 0x70, 0x68, 0x60, 0x58, 0x50, 0x48, 0x40,
                   0x38, 0x30, 0x28, 0x20, 0x18, 0x10, 0x08, 0x00));
  __m512i lowbits_h = _mm512_madd52lo_epu64(zmmzero, bcstq_h, ifma_const);
  __m512i lowbits_l = _mm512_madd52lo_epu64(zmmzero, bcstq_l, ifma_const);
  __m512i highbits_h = _mm512_madd52hi_epu64(asciiZero, zmmTen, lowbits_h);
  __m512i highbits_l = _mm512_madd52hi_epu64(asciiZero, zmmTen, lowbits_l);
  __m512i perm = _mm512_permutex2var_epi8(highbits_h, permb_const, highbits_l);
  __m128i digits_15_0 = _mm512_castsi512_si128(perm);
  _mm_storeu_si128((__m128i *)out, digits_15_0);
}
```

Remarkably, the AVX-512 is 3.5 times faster than the table-based approach:

| function | time per conversion |
|---|---|
| table | 8.8 ns |
| AVX-512 | 2.5 ns |

I use GCC 9 and an Intel Tiger Lake processor (3.30GHz). My benchmarking code is available.

A downside of this nifty approach is that it is (obviously) non-portable. Few Intel processors support these extensions so far, and they remain Intel-only: no AMD or ARM processor can do the same right now. However, the gain might be sufficient that it is worth the effort deploying it in some instances.


```
var data []uint64
var buf *bytes.Buffer = new(bytes.Buffer)
...
err := binary.Write(buf, binary.LittleEndian, data)
```

Until recently, I assumed that the `binary.Write` function did not allocate memory. Unfortunately, it does. The function converts the input array into a new, temporary byte array.

Instead, you can create a small buffer just big enough to hold your 8-byte integer and write that small buffer repeatedly:

```
var item = make([]byte, 8)
for _, x := range data {
	binary.LittleEndian.PutUint64(item, x)
	buf.Write(item)
}
```

Sadly, this might have poor performance on disks or networks where each write/read has a high overhead. To avoid this problem, you can use Go’s buffered writer and write the integers one by one. Internally, Go will allocate a small buffer.

```
writer := bufio.NewWriter(buf)
var item = make([]byte, 8)
for _, x := range data {
	binary.LittleEndian.PutUint64(item, x)
	writer.Write(item)
}
writer.Flush()
```

I wrote a small benchmark that writes an array of 100M integers to memory.

| function | memory usage | time |
|---|---|---|
| binary.Write | 1.5 GB | 1.2 s |
| one-by-one | 0 | 0.87 s |
| buffered one-by-one | 4 kB | 1.2 s |

(Timings will vary depending on your hardware and testing procedure. I used Go 1.16.)

The buffered one-by-one approach is not beneficial with respect to speed in this instance, but it would be more helpful in other cases. In my benchmark, the simple one-by-one approach is fastest and uses the least memory. For small inputs, `binary.Write` would be faster. The ideal function might have a fast path for small arrays, and more careful handling of larger inputs.
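For what it is worth, when the destination is memory, a sketch of such a fast path is to preallocate a byte slice of exactly the right size and encode directly into it (a hypothetical helper, not part of the standard library):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putAllUint64 makes a single allocation sized for the whole input, then
// encodes each value in little-endian order, with no temporary buffers.
func putAllUint64(data []uint64) []byte {
	out := make([]byte, 8*len(data))
	for i, x := range data {
		binary.LittleEndian.PutUint64(out[8*i:], x)
	}
	return out
}

func main() {
	fmt.Printf("% x\n", putAllUint64([]uint64{1, 258}))
}
```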

In a university, professors have extensive freedom regarding course content. As long as you reasonably meet the course objectives, you can do whatever you like. You can pick the textbook you prefer, write your own and so forth.

However, the staff that built our course revision system decided to make it so that every single change should go through all layers of approvals. So if I want to change the title of an assignment, according to this tool, I need the department to approve.

When I first encountered this new tool, I immediately started to work around it. And because I am department chair, I brought my colleagues along for the ride. So we ‘pretend’ to get approval, submitting fake documents. The good thing in a bureaucracy is that most people are too bored to check up on the fine print.

Surprisingly, it appears that no other department has been routing around the damage that is this new tool. I should point out that I am not doing anything illegal or against the rules. I am a good soldier. I just route around the software system. But it makes people uneasy.

And there lies the scary point. People are easily manipulated by computing.

People seem to think that if the software requires some document, then surely the rules require the document in question. That is, human beings believe that the software must be an accurate embodiment of the law.

In some sense, software does the policing. It enforces the rules. But like the actual police, software can go far beyond the law… and most people won’t notice.

An actual policeman can be intimidating, but a policeman is a human being. If they ask for something that does not make sense, you are likely to question them. You are also more likely to think that a policeman could be mistaken. Software is like a deaf policeman. And people want software to be correct.

Suppose you ran a university and you wanted all professors to include a section on religion in all their courses. You could not easily achieve such a result by the means of law. Changing the university regulations to add such a requirement would be difficult at a secular institution. However, if you simply make it that all professors must fill out a section on religion when registering a course, then professors would probably do it without question.

Of course, you can achieve the same result with bureaucracy. You just change the forms and the rules. But it takes much effort. Changing software is comparatively easier. There is no need to document the change very much. There is no need to train the staff.

I think that there is great danger in some of the recent ‘digital ID’ initiatives that various governments are pushing. Suppose, for example, that your driver’s license is on your mobile phone. It seems reasonable, at first, for the government to be able to activate it and deactivate it remotely. You no longer need to go to a government office to get your new driver’s license. However, it now makes it possible for a civil servant to decide that you cannot drive your car on Tuesdays. They do not need a new law, they do not need your consent, they can just switch a flag inside a server.

You may assume then that people would complain loudly, and they may. However, they are much less likely to complain than if a policeman came to their door on Tuesdays to take away their driver’s license. As human beings, we have a bias to accept software enforcement without question.

It can be used for good. For example, the right software can probably help you lose weight. However, software can enable arbitrary enforcement. For crazy people like myself, it will fail. Sadly, not everyone is as crazy as I am.

What the Common CV does do is provide much power to middle-managers who can insert various bureaucratic requirements. You have to use their tool, and they can tailor it administratively without your consent. It is part of an ongoing technocratic invasion.

How did Canadian academics react? Did they revolt? Not at all. In fact, they are embracing it. When I recently had to formally submit my resume as part of a routine internal review process, they asked for my Common CV. That is, instead of fighting against the techno-bureaucratic process, they extend its application to every aspect of their lives, including internal functions. And it is not that everyone enjoys it: in private, many people despise the Common CV.

So why won’t they dissent?

One reason might be that they are demoralized. Why resist the Common CV when every government agency providing funding to professors requires it?

If so, they are confused. We dissent as an appeal to the intelligence of a future day. A dissent today is a message to the future people who will have the power to correct our current mistakes. These messages from today are potential tools in the future. “The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.” (Shaw)

The lack of dissent is hardly new of course. Only a minority of academics questioned the Vietnam war (Schreiber, 1973), and much of the resistance came when it became safe to speak out. The scientists described by Freeman Dyson in The Scientist as Rebel have always been a fringe.

Chomsky lamented this point:

IT IS THE RESPONSIBILITY of intellectuals to speak the truth and to expose lies. This, at least, may seem enough of a truism to pass over without comment. Not so, however. For the modern intellectual, it is not at all obvious. Thus we have Martin Heidegger writing, in a pro-Hitler declaration of 1933, that “truth is the revelation of that which makes a people certain, clear, and strong in its action and knowledge”; it is only this kind of “truth” that one has a responsibility to speak. Americans tend to be more forthright. When Arthur Schlesinger was asked by The New York Times in November, 1965, to explain the contradiction between his published account of the Bay of Pigs incident and the story he had given the press at the time of the attack, he simply remarked that he had lied; and a few days later, he went on to compliment the Times for also having suppressed information on the planned invasion, in “the national interest,” as this term was defined by the group of arrogant and deluded men of whom Schlesinger gives such a flattering portrait in his recent account of the Kennedy Administration. It is of no particular interest that one man is quite happy to lie in behalf of a cause which he knows to be unjust; but it is significant that such events provoke so little response in the intellectual community—for example, no one has said that there is something strange in the offer of a major chair in the humanities to a historian who feels it to be his duty to persuade the world that an American-sponsored invasion of a nearby country is nothing of the sort. And what of the incredible sequence of lies on the part of our government and its spokesmen concerning such matters as negotiations in Vietnam? The facts are known to all who care to know. The press, foreign and domestic, has presented documentation to refute each falsehood as it appears. But the power of the government’s propaganda apparatus is such that the citizen who does not undertake a research project on the subject can hardly hope to confront government pronouncements with fact.

If I have two integers that use 3 digits, say, how many digits will their product have?

Mathematically, we might count the number of digits of an integer using the formula ceil(log(x+1)) where the log is in the base you are interested in. In base 10, the integers with three digits go from 100 to 999, or from 10^{2} to 10^{3}-1, inclusively. For example, to compute the number of digits in base 10, you might use the following Python expression: `ceil(log10(x+1))`. More generally, an integer has *d* digits in base *b* if it is between *b*^{d-1} and *b*^{d}-1, inclusively. By convention, the integer 0 has no digit in this model.

The product between an integer having *d*_{1} digits and an integer having *d*_{2} digits is between *b*^{d1+d2-2} and *b*^{d1+d2}-*b*^{d1}-*b*^{d2}+1 (inclusively). Thus the product has either *d*_{1}+*d*_{2}-1 digits or *d*_{1}+*d*_{2} digits.

To illustrate, let us consider the product between two integers having three digits. In base 10, the smallest product is 100 times 100 or 10,000, so it requires 5 digits. The largest product is 999 times 999 or 998,001 so 6 digits.
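We can check these counts with a simple loop (a small Go sketch; an integer loop sidesteps the floating-point log of the formula above):

```go
package main

import "fmt"

// digits returns the number of base-b digits of x; following the
// convention above, 0 has no digits.
func digits(x, b uint64) int {
	d := 0
	for x > 0 {
		d++
		x /= b
	}
	return d
}

func main() {
	fmt.Println(digits(100*100, 10)) // 5
	fmt.Println(digits(999*999, 10)) // 6
}
```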

Thus if you multiply a 32-bit number with another 32-bit number, you get a number that has at most 64 binary digits. The maximum value will be 2^{64} – 2^{33} + 1.

It seems slightly counter-intuitive that the product of two 32-bit numbers does not span the full range of 64-bit numbers, because it cannot exceed 2^{64} – 2^{33} + 1. A related observation is that any given product may be obtained from several pairs of 32-bit numbers. For example, the product 4 can be achieved by multiplying 1 with 4 or by multiplying 2 with 2. Furthermore, many other 64-bit values cannot be produced from two 32-bit values: e.g., any prime number larger than or equal to 2^{32} and smaller than 2^{64}.
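We can verify the bound directly: as a 64-bit value, 2^{64} – 2^{33} + 1 is 0xFFFFFFFE00000001 (a quick Go check):

```go
package main

import "fmt"

func main() {
	const m = uint64(0xFFFFFFFF) // largest 32-bit value
	p := m * m                   // (2^32-1)^2 = 2^64 - 2^33 + 1, fits in 64 bits
	fmt.Printf("0x%X\n", p)      // 0xFFFFFFFE00000001
}
```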

**Further reading**: Computing the number of digits of an integer even faster

In recent years, we saw a surge of concentration in newspaper and television ownership. However, this was accompanied by a surge of online journalism. The total number of publishers increased, if nothing else.

We can more easily tolerate a single carrier/distributor than a monopolistic publisher. For example, the same delivery service provides me my newspaper as well as a range of competing newspapers. The delivery man does not much care for the content of my newspaper. A few concentrated Internet providers support diverse competing services.

The current giants (Facebook, Twitter and Google) were built initially as neutral distributors. Google was meant to give you access to all of the web’s information. If the search engine is neutral, there is no reason to have more than one. If Twitter welcomes everyone, then there is no reason to have competing services. Newspapers have fact-checking services, but newspaper delivery services do not.

Of course, countries like Russia and China often had competing services, but most of the rest of the world fell back on American-based large corporations for their web infrastructure. Even the Taliban use Twitter.

It has now become clear that Google search results are geared toward favouring some of their own services. Today, we find much demand for services like Facebook and Twitter to more closely vet their content. Effectively, they are becoming publishers. They are no longer neutral. It is undeniable that they now see their role as arbiters of content. They have fact-checking services and they censor individuals.

If my mental model is correct, then we will see the emergence of strong competitors. I do not predict the immediate downfall of Facebook and Twitter. However, much of their high valuation was due to them being considered neutral carriers. The difference in value between a monopoly and a normal player can be significant. People who know more about online marketing than I do also tell me that online advertisement might be overrated. And advertisement on a platform that is no longer universal is less valuable: the pie is shared. Furthermore, I would predict that startups that were dead on arrival ten years ago might be appealing businesses today. Thus, at the margin, it makes it more appealing for a young person to go work for a small web startup.

I should stress that this is merely a model. I do not claim to be right. I am also not providing investment or job advice.

**Further reading**: Stop spending so much time being trolled by billionaire corporations!

In the blog post, Quickly parsing eight digits, I presented a very quick way to parse eight ASCII characters representing an integer (e.g., 12345678) into the corresponding binary value. I want to come back to it and explain it a bit more, to show that it is not magic. This works in most programming languages, but I will stick with C for this blog post.

To recap, the long way is a simple loop:

```
uint32_t parse_eight_digits(const unsigned char *chars) {
  uint32_t x = chars[0] - '0';
  for (size_t j = 1; j < 8; j++)
    x = x * 10 + (chars[j] - '0');
  return x;
}
```

We use the fact that in ASCII, the numbers 0, 1, … are in consecutive order in terms of byte values. The character ‘0’ is 0x30 (or 48 in decimal), the character ‘1’ is 0x31 (49 in decimal) and so forth. At each step in the loop, we multiply the running value by 10 and add the value of the next digit.

It assumes that all characters are in the valid range (from ‘0’ to ‘9’): other code should check that it is the case.
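Such a check can itself be done over all eight bytes at once. The following Go sketch uses a known SWAR test (the kind used in simdjson-style parsers): a byte is an ASCII digit exactly when its high nibble is 3 and adding 6 does not push the high nibble past 3.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// isEightDigits reports whether all eight bytes are ASCII digits.
// For a digit c in 0x30..0x39, the high nibble of c is 3 and the high
// nibble of c+6 is still 3 (0x39+6 = 0x3F); for any other byte, at least
// one of the two conditions fails.
func isEightDigits(chars []byte) bool {
	val := binary.LittleEndian.Uint64(chars)
	return ((val & 0xF0F0F0F0F0F0F0F0) |
		(((val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0) >> 4)) ==
		0x3333333333333333
}

func main() {
	fmt.Println(isEightDigits([]byte("12345678"))) // true
	fmt.Println(isEightDigits([]byte("1234x678"))) // false
}
```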

An optimizing compiler will probably unroll the loop and produce code that might look like this in assembly:

```
movzx eax, byte ptr [rdi]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 1]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 2]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 3]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 4]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 5]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 6]
lea eax, [rcx + 2*rax]
lea eax, [rax + 4*rax]
movzx ecx, byte ptr [rdi + 7]
lea eax, [rcx + 2*rax]
add eax, -533333328
```

Notice how there are many loads, and a whole lot of operations.

We can substantially shorten the resulting code, down to something that looks like the following:

```
imul rax, qword ptr [rdi], 2561
movabs rcx, -1302123111085379632
add rcx, rax
shr rcx, 8
movabs rax, 71777214294589695
and rax, rcx
imul rax, rax, 6553601
shr rax, 16
movabs rcx, 281470681808895
and rcx, rax
movabs rax, 42949672960001
imul rax, rcx
shr rax, 32
```

How do we do it? We use a technique called SWAR, which stands for SIMD within a register. The intuition behind it is that modern computers have 64-bit registers. Processing eight consecutive bytes as eight distinct words, as in the native code above, is inefficient given how wide our registers are.

The first step is to load all eight characters into a 64-bit register. In C, you might do it in this manner:

```
int64_t val;
memcpy(&val, chars, 8);
```

It may look expensive, but most compilers will translate the memcpy call into a single 64-bit load when compiling with optimizations turned on.

Most commodity processors (x86 and, in common configurations, ARM) store values in little-endian order. This means that the first byte you encounter is used as the least significant byte, and so forth.
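For example, on a little-endian machine, loading the string ‘12345678’ yields the value 0x3837363534333231, since ‘1’ is 0x31 and lands in the least significant byte (a small Go sketch of the same load):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// loadEight mimics the memcpy: the first byte of chars becomes the least
// significant byte of the result, as on a little-endian machine.
func loadEight(chars []byte) uint64 {
	return binary.LittleEndian.Uint64(chars)
}

func main() {
	fmt.Printf("0x%x\n", loadEight([]byte("12345678"))) // 0x3837363534333231
}
```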

Then we want to subtract the character ‘`0`‘ (or `0x30` in hexadecimal). We can do it with a single operation:

```
val = val - 0x3030303030303030;
```

So if you had the string ‘`12345678`‘, you will now have the value `0x0807060504030201`.

Next we are going to do a kind of pyramidal computation. We add pairs of successive bytes, then pairs of successive 16-bit values and then pairs of successive 32-bit values.

It goes something like this, suppose that you have the sequence of digit values `b1, b2, b3, b4, b5, b6, b7, b8`. You want to do…

- add pairs of bytes: `10*b1+b2`, `10*b3+b4`, `10*b5+b6`, `10*b7+b8`
- combine first and third sums: `1000000*(10*b1+b2) + 100*(10*b5+b6)`
- combine second and fourth sums: `10*b7+b8 + 10000*(10*b3+b4)`

I will only explain the first step (pairs of bytes) as the other two steps are similar. Consider the least significant two bytes, which hold the value `256*b2 + b1`. We multiply the whole word by 10, we add the word shifted right by 8 bits, and we get `10*b1+b2` in the least significant byte. We can compute four such sums in one operation…

```
val = (val * 10) + (val >> 8);
```

The next two steps are similar:

```
val1 = ((val & 0x000000FF000000FF) * (100 + (1000000ULL << 32))) >> 32;
```

```
val2 = (((val >> 16) & 0x000000FF000000FF) * (1 + (10000ULL << 32))) >> 32;
```

And the overall code looks as follows…

```
uint32_t parse_eight_digits_unrolled(uint64_t val) {
  const uint64_t mask = 0x000000FF000000FF;
  const uint64_t mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
  const uint64_t mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
  val -= 0x3030303030303030;
  val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
  val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
  return val;
}
```

**Appendix**: You can do much the same in C# starting with a `byte` pointer (`byte* chars`):

```
ulong val = Unsafe.ReadUnaligned<ulong>(chars);
const ulong mask = 0x000000FF000000FF;
const ulong mul1 = 0x000F424000000064; // 100 + (1000000ULL << 32)
const ulong mul2 = 0x0000271000000001; // 1 + (10000ULL << 32)
val -= 0x3030303030303030;
val = (val * 10) + (val >> 8); // val = (val * 2561) >> 8;
val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
```
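For reference, the same routine can also be sketched in Go, with `binary.LittleEndian.Uint64` standing in for the memcpy load:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parseEightDigits converts eight ASCII digits to their integer value,
// following the same SWAR steps described above.
func parseEightDigits(chars []byte) uint32 {
	val := binary.LittleEndian.Uint64(chars)
	val -= 0x3030303030303030      // ASCII digits to values 0..9
	val = (val * 10) + (val >> 8)  // pairs of digits in every other byte
	const mask = 0x000000FF000000FF
	const mul1 = 0x000F424000000064 // 100 + (1000000 << 32)
	const mul2 = 0x0000271000000001 // 1 + (10000 << 32)
	val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32
	return uint32(val)
}

func main() {
	fmt.Println(parseEightDigits([]byte("12345678"))) // 12345678
}
```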

What about floating-point numbers? The nuance with floating-point numbers is that they cannot represent all numbers within a continuous range. For example, the real number 1/3 cannot be represented using binary floating-point numbers. So the convention is that given a textual representation, say “1.1e100”, we seek the closest approximation.

Still, are there ranges of numbers that you should not represent using floating-point numbers? That is, are there numbers that you should reject?

It seems that there are two different interpretations:

- My own interpretation is that floating-point types can represent all numbers from -infinity to infinity, inclusively. It means that ‘infinity’ or 1e9999 are indeed “in range”. For 64-bit IEEE floating-point numbers, this means that numbers smaller than 4.94e-324 but greater than 0 can be represented as 0, and that numbers greater than 1.8e308 should be infinity. To recap, all numbers are always in range.
- For 64-bit numbers, another interpretation is that only numbers in the ranges 4.94e-324 to 1.8e308 and -1.8e308 to -4.94e-324, together with exactly 0, are valid. Numbers that are too small (less than 4.94e-324 but greater than 0) or numbers that are larger than 1.8e308 are “out of range”. Common implementations of the strtod function or of the C++ equivalent follow this convention.

This matters because the C++ specification for the `from_chars` functions state that

If the parsed value is not in the range representable by the type of value, value is unmodified and the member ec of the return value is equal to errc::result_out_of_range.

I am not sure programmers have a common understanding of this specification.

In the business world, double-entry bookkeeping is the idea that transactions are recorded in at least two accounts (debit and credit). One of the advantages of double-entry bookkeeping, compared to a more naive approach, is that it allows for some degree of auditing and error finding. If we compare accounting and software programming, we could say that double-entry accounting and its subsequent auditing is equivalent to software testing.

For an accountant, converting a naive accounting system into a double-entry system is a difficult task in general. In many cases, one would have to reconstruct it from scratch. In the same manner, it can be difficult to add tests to a large application that has been developed entirely without testing. And that is why testing should be first on your mind when building serious software.

A hurried or novice programmer can quickly write a routine, compile and run it and be satisfied with the result. A cautious or experienced programmer will know not to assume that the routine is correct.

Common software errors can cause problems ranging from a program that abruptly terminates to database corruption. The consequences can be costly: a software bug caused the explosion of an Ariane 5 rocket in 1996 (Dowson, 1997). The error was caused by the conversion of a floating point number to a signed integer represented with 16 bits. Only small integer values could be represented. Since the value could not be represented, an error was detected and the program stopped because such an error was unexpected. The irony is that the function that triggered the error was not required: it had simply been integrated as a subsystem from an earlier model of the Ariane rocket. In 1996 U.S. dollars, the estimated cost of this error is almost $400 million.

The importance of producing correct software has long been understood. The best scientists and engineers have been trying to do this for decades.

There are several common strategies. For example, if we need to do a complex scientific calculation, then we can ask several independent teams to produce an answer. If all the teams arrive at the same answer, we can then conclude that it is correct. Such redundancy is often used to prevent hardware-related faults (Yeh, 1996). Unfortunately, it is not practical to write multiple versions of your software in general.

Many of the early programmers had advanced mathematical training. They hoped that we could prove that a program is correct. Setting aside hardware failures, we could then be certain that we would not encounter any errors. And indeed, today we have sophisticated software that allows us to sometimes prove that a program is correct.

Let us consider an example of formal verification to illustrate our point. We can use the z3 library from Python (De Moura and Bjørner, 2008). If you are not a Python user, don’t worry: you don’t have to be to follow the example. We can install the necessary library with the command `pip install z3-solver`

or the equivalent. Suppose we want to be sure that the inequality `( 1 + y ) / 2 < y`

holds for all 32-bit integers. We can use the following script:

```
import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) / 2 >= y )
if(s.check() == z3.sat):
    model = s.model()
    print(model)
```

In this example we construct a 32-bit word (*BitVec*) to represent our example. By default, the z3 library interprets the values that can be represented by such a variable as ranging from -2147483648 to 2147483647 (from \(-2^{31}\) to \(2^{31}-1\) inclusive). We enter the inequality opposite to the one we wish to show (`( 1 + y ) / 2 >= y`

). If z3 does not find a counterexample, then we will know that the inequality `( 1 + y ) / 2 < y`

holds.

When running the script, Python displays the integer value 2863038463 which indicates that z3 has found a counterexample. The z3 library always gives a positive integer and it is up to us to interpret it correctly. The number 2147483648 becomes -2147483648, the number 2147483649 becomes -2147483647 and so on. This representation is often called the two’s complement. Thus, the number 2863038463 is in fact interpreted as a negative number. Its exact value is not important: what matters is that our inequality (`( 1 + y ) / 2 < y`

) is incorrect when the variable is negative. We can check this by giving the variable the value -1, we then get `0 < -1`

. When the variable takes the value 0, the inequality is also false (`0<0`

). We can also check that the inequality is false when the variable takes the value 1. So let us add as a condition that the variable is greater than 1 (`s.add( y > 1 )`

):

```
import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) / 2 >= y )
s.add( y > 1 )
if(s.check() == z3.sat):
    model = s.model()
    print(model)
```

Since the latter script does not display anything on the screen when it is executed, we can conclude that the inequality is satisfied as long as the variable is greater than 1.

Since we have shown that the inequality `( 1 + y ) / 2 < y`

is true, perhaps the inequality `( 1 + y ) < 2 * y`

is true too? Let’s try it:

```
import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) >= 2 * y )
s.add( y > 1 )
if(s.check() == z3.sat):
    model = s.model()
    print(model)
```

This script will display 1412098654: its double, 2824197308, is interpreted by z3 as a negative value. To avoid this problem, let’s add a new condition so that the double of the variable can still be interpreted as a positive value:

```
import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) >= 2 * y )
s.add( y > 1 )
s.add( y < 2147483647 // 2 )
if(s.check() == z3.sat):
    model = s.model()
    print(model)
```

This time the result is verified. As you can see, such a formal approach requires a lot of work, even in relatively simple cases. It may have been possible to be more optimistic in the early days of computer science, but by the 1970s, computer scientists like Dijkstra were expressing doubts:

we see automatic program verifiers verifying toy programs and one observes the honest expectation that with faster machines with lots of concurrent processing, the life-size problems will come within reach as well. But, honest as these expectations may be, are they justified? I sometimes wonder… (Dijkstra, 1975)

It is impractical to apply such a mathematical method on a large scale. Errors can take many forms, and not all of these errors can be concisely presented in mathematical form. Even when it is possible, even when we can accurately represent the problem in a mathematical form, there is no reason to believe that a tool like z3 will always be able to find a solution: when problems become difficult, computational times can become very long. An empirical approach is more appropriate in general.

Over time, programmers have come to understand the need to test their software. It is not always necessary to test everything: a prototype or an example can often be provided without further validation. However, any software designed in a professional context and having to fulfill an important function should be at least partially tested. Testing allows us to reduce the probability that we will have to face a disastrous situation.

There are generally two main categories of tests.

- There are unit tests. These are designed to test a particular component of a software program. For example, a unit test can be performed on a single function. Most often, unit tests are automated: the programmer can execute them by pressing a button or typing a command. Unit tests often avoid the acquisition of valuable resources, such as creating large files on a disk or making network connections. Unit testing does not usually involve reconfiguring the operating system.
- Integration tests aim to validate a complete application. They often require access to networks and access to sometimes large amounts of data. Integration tests sometimes require manual intervention and specific knowledge of the application. Integration testing may involve reconfiguring the operating system and installing software. They can also be automated, at least in part. Most often, integration tests are based on unit tests that serve as a foundation.

Unit tests are often part of a continuous integration process (Kaiser et al., 1989). Continuous integration often automatically performs specific tasks including unit testing, backups, applying cryptographic signatures, and so on. Continuous integration can be done at regular intervals, or whenever a change is made to the code.

Unit tests are used to structure and guide software development. Tests can be written before the code itself, in which case we speak of *test-driven development*. Often, tests are written after developing the functions. Tests can be written by programmers other than those who developed the functions. It is sometimes easier for independent developers to provide tests that are capable of uncovering errors because they do not share the same assumptions.

It is possible to integrate tests into functions or an application. For example, an application may run a few tests when it starts. In such a case, the tests will be part of the distributed code. However, it is more common not to publish unit tests. They are a component reserved for programmers and they do not affect the functioning of the application. In particular, they do not pose a security risk and they do not affect the performance of the application.

Experienced programmers often consider tests to be as important as the original code. It is therefore not uncommon to spend half of one’s time on writing tests. The net effect is to substantially reduce the initial speed of writing computer code. Yet this apparent loss of time often saves time in the long run: setting up tests is an investment. Software that is not well tested is often more difficult to update. The presence of tests allows us to make changes or extensions with less uncertainty.

Tests should be readable and simple, and they should run quickly. They should also use little memory.

Unfortunately, it is difficult to define exactly how good tests are, though there are several statistical measures. For example, we can count the lines of code that execute during tests: we then talk about test coverage. A coverage of 100% implies that every line of code is exercised by the tests. In practice, this coverage measure can be a poor indication of test quality.

Consider this example:

```
package main

import (
	"testing"
)

func Average(x, y uint16) uint16 {
	return (x + y) / 2
}

func TestAverage(t *testing.T) {
	if Average(2, 4) != 3 {
		t.Error(Average(2, 4))
	}
}
```

In the Go language, we can run tests with the command `go test`. We have an `Average` function with a corresponding test. In our example, the test runs successfully and the coverage is 100%.

Unfortunately, the `Average` function may not be as correct as we would expect. If we pass the integers 40000 and 40000 as parameters, we would expect the average value of 40000 to be returned. But the sum of 40000 and 40000 cannot be represented with a 16-bit integer (`uint16`): the addition wraps around, so the intermediate result is `(40000+40000)%65536 == 14464`. The function therefore returns 7232, which may be surprising. The following test will fail:

```
func TestAverage(t *testing.T) {
	if Average(40000, 40000) != 40000 {
		t.Error(Average(40000, 40000))
	}
}
```

When it is possible and fast enough, we can test the code more exhaustively, as in this example where we try all possible pairs of inputs:

```
package main

import (
	"testing"
)

func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	} else {
		return (x-y)/2 + y
	}
}

func TestAverage(t *testing.T) {
	for x := 0; x < 65536; x++ {
		for y := 0; y < 65536; y++ {
			m := int(Average(uint16(x), uint16(y)))
			if x < y {
				if m < x || m > y {
					t.Error("error ", x, " ", y)
				}
			} else {
				if m < y || m > x {
					t.Error("error ", x, " ", y)
				}
			}
		}
	}
}
```

In practice, it is rare that we can do exhaustive tests. We can instead use pseudo-random tests. For example, we can generate pseudo-random numbers and use them as parameters. In the case of random tests, it is important to keep them deterministic: each time the test runs, the same values are tested. This can be achieved by providing a fixed *seed* to the random number generator as in this example:

```
package main

import (
	"testing"
	"math/rand"
)

func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	} else {
		return (x-y)/2 + y
	}
}

func TestAverage(t *testing.T) {
	rand.Seed(1234)
	for test := 0; test < 1000; test++ {
		x := rand.Intn(65536)
		y := rand.Intn(65536)
		m := int(Average(uint16(x), uint16(y)))
		if x < y {
			if m < x || m > y {
				t.Error("error ", x, " ", y)
			}
		} else {
			if m < y || m > x {
				t.Error("error ", x, " ", y)
			}
		}
	}
}
```

Tests based on random exploration are part of a strategy often called *fuzzing* (Miller et al., 1990).
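
Recent versions of Go (1.18 and later) support this strategy natively: a fuzz target receives engine-generated inputs, and we check an invariant rather than exact values. A sketch reusing the overflow-safe `Average` function, runnable with `go test -fuzz=Fuzz`:

```go
package main

import "testing"

func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	}
	return (x-y)/2 + y
}

// FuzzAverage receives inputs generated by the fuzzing engine.
// We cannot predict the inputs, so we verify an invariant:
// the average must lie between the two arguments.
func FuzzAverage(f *testing.F) {
	// seed corpus: a few hand-picked starting points
	f.Add(uint16(2), uint16(4))
	f.Add(uint16(40000), uint16(40000))
	f.Fuzz(func(t *testing.T, x, y uint16) {
		m := Average(x, y)
		if (m < x && m < y) || (m > x && m > y) {
			t.Error("average out of range:", x, y, m)
		}
	})
}
```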

We generally distinguish two types of tests. Positive tests aim at verifying that a function or component behaves in an agreed way. Thus, the first test of our `Average` function was a positive test. Negative tests verify that the software behaves correctly even in unexpected situations. We can produce negative tests by providing our functions with random data (*fuzzing*). Our second example can be considered a negative test if the programmer expected small integer values.

Tests should fail when the code is incorrectly modified (Budd et al., 1978). On this basis, we can develop more sophisticated quality measures: we apply random changes to the code and check that such changes often cause tests to fail.
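
To sketch this idea (often called mutation testing), imagine a tool that replaces a subtraction in `Average` with an addition. A test suite that includes the case `Average(40000,40000)` kills this hypothetical mutant, while a suite with only small inputs might not:

```go
package main

import "fmt"

// Average is the correct, overflow-safe implementation.
func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	}
	return (x-y)/2 + y
}

// AverageMutant simulates an automated mutation: the
// subtractions were replaced by additions, reintroducing
// the 16-bit overflow.
func AverageMutant(x, y uint16) uint16 {
	if y > x {
		return (y+x)/2 + x
	}
	return (x+y)/2 + y
}

func main() {
	// A test asserting Average(40000, 40000) == 40000
	// passes on the original and fails on the mutant.
	fmt.Println(Average(40000, 40000) == 40000)       // true
	fmt.Println(AverageMutant(40000, 40000) == 40000) // false
}
```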

Some programmers choose to generate tests automatically from the code. In such a case, a component is executed and the result is captured. For example, in our example of calculating the average, we could have captured the fact that `Average(40000,40000)` has the value 7232. If a subsequent change modifies the result of the operation, the test will fail. Such an approach saves time since the tests are generated automatically, and we can quickly and effortlessly achieve 100% code coverage. On the other hand, such tests can be misleading. In particular, it is possible to capture incorrect behaviour. Furthermore, the objective when writing tests is not so much their number as their quality. The presence of many tests that do not help validate the essential functions of our software can even become harmful. Irrelevant tests can waste programmers’ time in subsequent revisions.

Finally, we review the benefits of testing: tests help us organize our work, they are a measure of quality, they help us document the code, they prevent regressions, they help with debugging, and they can lead to more efficient code.

Designing sophisticated software can take weeks or months of work. Most often, the work will be broken down into separate units. It can be difficult, until you have the final product, to judge the outcome. Writing tests as we develop the software helps to organize the work. For example, a given component can be considered complete when it is written and tested. Without the test writing process, it is more difficult to estimate the progress of a project since an untested component may still be far from being completed.

Tests also show the care that the programmer has put into their work. They make it possible to quickly evaluate the attention given to the various functions and components of a software program: the presence of carefully composed tests can be an indication that the corresponding code is reliable, while the absence of tests for certain functions can serve as a warning.

Some programming languages are quite strict and have a compilation phase that validates the code. Other programming languages (Python, JavaScript) leave more freedom to the programmer. Some programmers consider that tests can help to overcome the limitations of less strict programming languages by imposing on the programmer a rigour that the language does not require.

Software programming should generally be accompanied by clear and complete documentation. In practice, the documentation is often partial, imprecise, erroneous or even non-existent. Tests are therefore often the only technical specification available. Reading tests allows programmers to adjust their expectations of software components and functions. Unlike documentation, tests are usually up-to-date, if they are run regularly, and they are accurate to the extent that they are written in a programming language. Tests can therefore provide good examples of how the code is used.

Even if we want to write high-quality documentation, tests can also play an important role. To illustrate computer code, examples are often used. Each example can be turned into a test. So we can make sure that the examples included in the documentation are reliable. When the code changes, and the examples need to be modified, a procedure to test our examples will remind us to update our documentation. In this way, we avoid the frustrating experience of readers of our documentation finding examples that are no longer functional.
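
Go, for instance, builds this idea into its tooling: `Example` functions are run by `go test`, and their printed output is compared against the `// Output:` comment, so a stale documentation example fails the test suite. A sketch using our `Average` function:

```go
package main

import "fmt"

func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	}
	return (x-y)/2 + y
}

// ExampleAverage doubles as documentation and as a test:
// go test runs it and compares the printed text against
// the Output comment below.
func ExampleAverage() {
	fmt.Println(Average(2, 4))
	// Output: 3
}
```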

Programmers regularly fix flaws in their software. It often happens that the same problem occurs again. The same problem may come back for various reasons: sometimes the original problem has not been completely fixed. Sometimes another change elsewhere in the code causes the error to return. Sometimes the addition of a new feature or a software optimization causes a bug to return, or a new bug to appear. When software acquires a new flaw, it is called a regression. To prevent such regressions, it is important to accompany every bug fix or new feature with a corresponding test. In this way, we can quickly become aware of regressions by running the tests. Ideally, the regression can be identified while the code is being modified, so we avoid it altogether. In order to convert a bug into a simple and effective test, it is useful to reduce it to its simplest form. For example, in our previous example with `Average(40000,40000)`, we can add the detected error as an additional test:

```
package main

import (
	"testing"
)

func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	} else {
		return (x-y)/2 + y
	}
}

func TestAverage(t *testing.T) {
	if Average(2, 4) != 3 {
		t.Error("error 1")
	}
	if Average(40000, 40000) != 40000 {
		t.Error("error 2")
	}
}
```

In practice, the presence of an extensive test suite makes it possible to identify and correct bugs more quickly. This is because testing reduces the extent of errors and provides the programmer with several guarantees. To some extent, the time spent writing tests saves time when errors are found while reducing the number of errors.

Furthermore, an effective strategy to identify and correct a bug involves writing new tests. It can be more efficient in the long run than other debugging strategies, such as stepping through the code. Indeed, after your debugging session is completed, you are left with new unit tests in addition to a corrected bug.

The primary function of tests is to verify that functions and components produce the expected results. However, programmers are increasingly using tests to measure the performance of components. For example, the execution speed of a function, the size of the executable or the memory usage can be measured. It is then possible to detect a loss of performance following a modification of the code. You can compare the performance of your code against a reference code and check for differences using statistical tests.
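
In Go, for instance, performance measurements can live alongside the unit tests: benchmark functions are run with `go test -bench=.` and report the time per call. A sketch for our `Average` function:

```go
package main

import "testing"

func Average(x, y uint16) uint16 {
	if y > x {
		return (y-x)/2 + x
	}
	return (x-y)/2 + y
}

// sink prevents the compiler from optimizing the call away.
var sink uint16

// BenchmarkAverage is run by go test -bench=.; the framework
// picks b.N large enough to produce a stable time-per-call estimate.
func BenchmarkAverage(b *testing.B) {
	for n := 0; n < b.N; n++ {
		sink = Average(uint16(n), 42)
	}
}
```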

All computer systems have flaws. Hardware can fail at any time. And even when the hardware is reliable, it is almost impossible for a programmer to predict all the conditions under which the software will be used. No matter who you are, and no matter how hard you work, your software will not be perfect. Nevertheless, you should at least try to write code that is generally correct: it most often meets the expectations of users.

It is possible to write correct code without writing tests. Nevertheless, the benefits of a test suite are tangible in difficult or large-scale projects. Many experienced programmers will refuse to use a software component that has been built without tests.

The habit of writing tests probably makes you a better programmer. Psychologically, you are more aware of your human limitations if you write tests. When you interact with other programmers and with users, you may be better able to take their feedback into account if you have a test suite.

- James Whittaker, Jason Arbon, Jeff Carollo, How Google Tests Software, Addison-Wesley Professional; 1st edition (March 23 2012)
- Lisa Crispin, Janet Gregory, Agile Testing: A Practical Guide for Testers and Agile Teams, Addison-Wesley Professional; 1st edition (Dec 30 2008)

The following Twitter users contributed ideas: @AntoineGrodin, @dfaranha, @Chuckula1, @EddyEkofo, @interstar, @Danlark1, @blattnerma, @ThuggyPinch, @ecopatz, @rsms, @pdimov2, @edefazio, @punkeel, @metheoryt, @LoCtrl, @richardstartin, @metala, @franck_guillaud, @__Achille__, @a_n__o_n, @atorstling, @tapoueh, @JFSmigielski, @DinisCruz, @jsonvmiller, @nickblack, @ChrisNahr, @ennveearr1, @_vkaku, @kasparthommen, @mathjock, @feO2x, @pshufb, @KishoreBytes, @kspinka, @klinovp, @jukujala, @JaumeTeixi

The N1 core used in the Graviton2 chip had an instruction fetch unit that was 4 to 8 instructions wide and a 4-wide instruction decoder that fed into an 8-wide issue unit, which included two SIMD units, two load/store units, three arithmetic units, and a branch unit. With the Perseus N2 core used in the Graviton3, there is an 8-wide fetch unit that feeds into a 5-wide to 8-wide decode unit, which in turn feeds into a 15-wide issue unit, basically twice as wide as that on the N1 core used in the Graviton2. The vector engines have twice the width (and support for BFloat16 mixed-precision operations), and the load/store, arithmetic, and branch units are all doubled up, too.

It feels like we are witnessing a revolution in processors. Not only are corporations like Apple and Amazon making their own chips… they appear to match Intel in raw power.

When you are reading these numbers from a string, there are distinct functions. In C, you have `strtof` and `strtod`. One parses a string to a `float` and the other function parses it to a `double`.

At a glance, it seems redundant. Why not just parse your string to a `double` value and cast it back to a `float`, if needed?

Of course, that would be slightly more expensive. But, importantly, it also gives incorrect results, in the sense that it is not equivalent to parsing directly to a `float`. In other words, these functions are not equivalent:

```
float parse1(const char * c) {
  char * end;
  return strtod(c, &end);
}

float parse2(const char * c) {
  char * end;
  return strtof(c, &end);
}
```

It is intuitive that if I first parse the number as a `float` and then cast it to a `double`, I will have lost information in the process. Indeed, if I start with the string “3.14159265358979323846264338327950”, parsed as a `float` (32-bit), I get 3.1415927410125732421875. If I parse it as a `double` (64-bit), I get the more accurate result 3.141592653589793115997963468544185161590576171875. The difference is not so small, about 9e-08.

In the other direction, first parsing to a `double` and then casting back to a `float`, I can also lose information, although only a little bit, due to a double rounding effect. To illustrate, suppose that I have the number 1.48 and that I round it in one go to the nearest integer: I get 1. If I round it first to a single decimal (1.5) and then to the nearest integer, I may get 2 under the usual rounding conventions (either round half up, or round half to even). Rounding twice is lossy and not equivalent to a single rounding operation. Importantly, you lose a bit of precision in the sense that you may not get back the closest value.

With floating-point numbers, I get this effect with the string “0.004221370676532388” (for example). You probably cannot tell unless you are a machine, but parsing directly to a `float` is 2e-7 % more accurate.

In most applications, such a small loss of accuracy is not relevant. However, if you ever find yourself having to compare results with another program, you may get inconsistent results. It can make debugging more difficult.

Further reading: Floating-Point Determinism (part of a series)

The frequency of documents containing highly politicized terms has been increasing consistently over the last three decades. The most politicized field is Education & Human Resources. The least are Mathematical & Physical Sciences and Computer & Information Science & Engineering, although even they are significantly more politicized than any field was in 1990. At the same time, abstracts have been becoming more similar to each other over time. Taken together, the results imply that there has been a politicization of scientific funding in the US in recent years and a decrease in the diversity of ideas supported.