Science and Technology links (November 26 2022)

  1. In Molière’s famous play, Tartuffe, the main characters is outwardly pious but fundamentally deceitful. Are people who insist on broadcasting their high virtue better people, or are they more like Tartuffe? Dong et al. (2022) conclude that people who say that they have good values are not necessarily better people in practice, but they are more likely to be hypocrites. That is, Tartuffe is a realistic character. It is worth pointing out that Molière’s play was censored by the king.
  2. The thymus is a small organ which plays a critical role in your immune system by producing T cells. As you get older, your thymus nearly disappears. The net result is that by the time you are 60 years old, you have few available T cells left and your immune system cannot adapt to new diseases as well. Calum Chace reports on a small clinical trial that showed that we can rejuvenate the thymus inexpensively. Unfortunately, though the clinical trial has been completed years ago, nobody seems to care about what is possibly a medical breakthrough in the fight against aging.
  3. When doing exercises to get stronger, it is the extension of the muscles that matters, not the contraction.
  4. Climate change might results in us having more rainbows.
  5. Social scientists are no better at predicting societal change than the rest of us.
  6. In female rats, scientists find that a gene therapy may protect against reproductive aging.
  7. High-speed wireless internet significantly increased teen girls’ severe mental health diagnoses – by 90%.
  8. We can reset the age of a single cell.
  9. Content warnings reduce aesthetic appreciation of visual art.
  10. Highly intelligent people are not more likely to have mental illness.
  11. Psychopaths only do better if others cannot retaliate. They also tend to cooperate less.
  12. Hundreds of thousands of bats are killed per year in countries with many wind turbines.

Making all your integers positive with zigzag encoding

You sometimes feel the need to make all of your integers positive, without losing any information. That is, you want to map all of your integers from ‘signed’ integers (e.g., -1, 1, 3, -3) to ‘unsigned integers’ (e.g., 3,2,6,7). This could be useful if you have a fast function to compress integers that fails to work well for negative integers.

Many programming languages (Go, C, etc.) allow you just ‘cast’ the integer. For example, the following Go code will print out -1, 18446744073709551615, -1 under most systems:

	var x = -1
	var y = uint(x)
	var z = int(y)
	fmt.Println(x, y, z)

That is, you can take a small negative value, interpret it as a large integer, and then ‘recover’ back your small value.

What if you want to have that small values remain small ? Then  a standard approach is to use zigzag encoding. The recipe is as follows:

  • Compute twice the absolute value of your integer.
  • Subtract 1 to the result when the original integer was negative.

Effectively, what you are doing is that all positive integers become even integers (exactly twice as big), and all negative integers become odd integers. We interleave negative and positive integers (odd and even).

original value zigzag value
-20 39
-19 37
-18 35
-17 33
-16 31
-15 29
-14 27
-13 25
-12 23
-11 21
-10 19
-9 17
-8 15
-7 13
-6 11
-5 9
-4 7
-3 5
-2 3
-1 1
0 0
1 2
2 4
3 6
4 8
5 10
6 12
7 14
8 16
9 18
10 20
11 22
12 24
13 26
14 28
15 30
16 32
17 34
18 36
19 38
20 40

In Python, you might implement the encoding and the decoding as follows:

def zigzag_encode(val) :
    if val < 0:
        return - 2 * val  - 1
    return 2 * val

def zigzag_decode(val) :
    if val & 1 == 1 :
        return - val // 2
    return val // 2

The same code in C/C++ might work, but it could be more efficient to use optimized code which assumes that the underlying hardware represents signed integers with two’s complement encoding (which is a safe assumption in 2022 and a requirement in C++20) and that bytes span 8 bits (another safe assumption)…

int fast_decode(unsigned int x) {
    return (x >> 1) ^ (-(x&1));

unsigned int fast_encode(int x) {
    return (2*x) ^ (x >>(sizeof(int) * 8 - 1));

Much the same code will work in Go, Rust, Swift, etc.

What is the size of a byte[] array in Java?

Java allows you to create an array just big enough to contain 4 bytes, like so:

byte[] array = new byte[4];

How much memory does this array take? If you have answered “4 bytes”, you are wrong. A more likely answer is 24 bytes.

I wrote a little Java program that relies on the jamm library to print out some answers, for various array sizes:

size of the array estimated memory usage
0 16 bytes
1 24 bytes
2 24 bytes
3 24 bytes
4 24 bytes
5 24 bytes
6 24 bytes
7 24 bytes
8 24 bytes
9 32 bytes

This is not necessarily the exact memory usage on your system, but it is a reasonable guess.

Further work: A library such as JOL might provide a more accurate measure, according to a reader (Bempel).


Rounding modes: std::from_chars versus strtod/strtof

A recent C++ standard (C++17) introduced new functions to parse floating-point numbers std::from_chars, from strings (e.g., ASCII text) to binary numbers. How should such a function parse values that cannot be represented exactly? The specification states that the resulting value rounded to nearest. This means that 1.0000000000000000001 and 0.999999999999999999 become exactly 1.0.

The C language has its own functions (strtod/strtof). I could not find a reference in the standard as to how it should round, but the source code suggests that the functions round according to the current floating-point rounding mode, as determined by the fegetround() function. One can round toward zero, up, down or to nearest. I wrote a small command-line utility to test it out. And indeed, I get the following results under LLVM and GCC:

1.0000000000000000001 1.00001 1.0 1.0
0.999999999999999999 1.0 0.999999 0.999999

Thus you cannot assume that, in general, std::from_chars will agree with strtod/strtof even for just boring strings such as 0.999999999999999999.

Thankfully, you can check the rounding mode efficiently.

A fast function to check your floating-point rounding mode

For speed, we use finite-precision number types in software. When doing floating-point computations in software, the results are usually not exact. For example, you can have the number 1.0 and the number 2-1000. They can both be represented exactly using standard double-precision IEEE floating-point numbers. However, you cannot represent their sum or difference exactly. So what does a computer do when you ask it to compute (1.0 + 2-1000) or (1.0 – 2-1000) ? It usually outputs the nearest approximation, that 1.0. In other words, it looks at all the numbers that it can represent and picks one that is nearest to the exact value. This works for most basic operations like addition, multiplication, division, subtraction.

However, it is possible to ask the processor to change the rounding mode. In some cases, you can request a specific rounding mode per instruction (e.g., with some AVX-512 instructions), but most times there is one setting valid for all instructions within the current thread. In C/C++, you can change the rounding mode to round upward (fesetround(FE_UPWARD)), downward (fesetround(FE_DOWNWARD)), toward zero (fesetround(FE_TOWARDZERO)). The default is to round to the nearest value (fesetround(FE_TONEAREST)).

Some of your code might be assuming that the rounding is to the nearest value. Maybe you want to check that it is so if you are providing a function for others to use. All you need is the following…

#include <fenv.h>

bool are_we_ok() {
  return fegetround() == FE_TONEAREST;

Unfortunately, this function is not free. It might be a function call to a non-trivial instruction such as stmxcsr on x64 processors. It is fine to call it from time to time, but if you have tiny functions that only cost a few hundred instructions, it may be too expensive.

The C functions strtod()/strtof()  available on Linux contain the following switch/case code:

	bc.rounding = 1;
	switch(fegetround()) {
	  case FE_TOWARDZERO:	bc.rounding = 0; break;
	  case FE_UPWARD:	bc.rounding = 2; break;
	  case FE_DOWNWARD:	bc.rounding = 3;

It means that your software might routinely call the fegetround() function. Thankfully, there is a fast approach requiring just an addition, a subtraction and a comparison. Let x be a small value, much smaller than 1, so that 1+x is very close to 1 and should be rounded to 1. Then it suffices to check that (1+x == 1-x). Indeed, when rounding up, 1+x will differ from 1, it will round to a small number just slightly larger than 1. Similarly, when rounding toward zero or rounding down, 1-x will get rounded to a value just under 1. Only when you round to the nearest value do you get that 1+x == 1-x.

You might think that we simply can do 1.0 + 1e-100 == 1.0 - 1e-100, using a function such as this one…

bool rounds_to_nearest() {
  return (1.0f +  1e-38f == 1.0f - 1e-38f);

Sadly, your optimizing compiler might turn this function into a function which returns true unconditionally, irrespective of the processor’s rounding mode.

As a workaround, the following function should do:

bool rounds_to_nearest() {
  static volatile float fmin = 1e-38;
  return (fmin + 1.0f == 1.0f - fmin);

We declare the small value to be volatile: this asks the C/C++ compiler to load the variable each time it needs to access it. It prevents the compiler from optimizing away the function at compile time. The compiler is forced to recompute the function each time.

You can do slightly better performance-wise because this function loads the value 1e-38 twice. To load it just once, do…

bool rounds_to_nearest() {
  static volatile float fmin = 1e-38;
  float fmini = fmin;
  return (fmini + 1.0f == 1.0f - fmini);

In my tests, on both an Apple ARM platform and on a recent Intel processor with GCC, this fast function is several times faster than a call to fegetround(). It is cheap enough that you can probably call it repeatedly without much care.

Credit: I took the idea from GitHub user @mwalcott3.

Note: Some processors have modes where all operations are done using a higher precision (e.g., 80 bits) so a float may use more than 32 bits internally. This means that you should choose you small factor (e.g. 1e-38) to be really small.

Measuring the memory usage of your C++ program

In C++, we might implement dynamic lists using the vector template. The int-valued constructor of the vector template allocates at least enough memory to store the provided number of elements in a contiguous manner. How much memory does the following code use?

  std::vector<uint8_t> v1(10);
  std::vector<uint8_t> v2(1000000);

The naive answer is 1000010 bytes or slightly less than 1 MB, but if you think a bit about it, you quickly realize that 1000010 bytes might be a lower bound. Indeed, the vector might allocate more memory and there is unavoidably some overhead for the vector instance.

Thankfully, it is easy to measure it. I wrote a little C++ program to measure actual memory usage in terms of allocated pages attributed to the program. We find that we use far more memory (2x or 4x more) than a naive analysis might suggest.

start of the program after the first vector at the end
ARM-based macOS 1.25 MB 1.25 MB 2.25 MB
Intel-based Linux 1.94 MB 1.94 MB 4.35 MB

Interestingly, reserving memory may not use any new memory as pointed by a reader (Martin Leitner-Ankerl). In my tests, adding the following two lines did not change memory usage:

std::vector<uint8_t> v3;

Further reading: Measuring memory usage: virtual versus real memory

Tooling: Ivica Bogosavljevic recommends that Linux users try heaptrack to better understand memory usage. Aleksey Kladov prefers Bytehound.

Modern vector programming with masked loads and stores

When you start a program, it creates a ‘process’ which own its memory. Memory is allocated to a software process in blocks called ‘pages’. These pages might span 4kB, 16kB or more. For a given process, it is safe to read and write within these pages.

In your code, you might allocate a 32-byte array. How much memory does the array require? The answer is that the allocation of the array might require no extra memory because the process had already the room needed in its pages, or else, the array might entice the operating system to grant the process many more pages. Similarly, ‘freeing’ the array does not (generally) reclaim the memory.

In general, the operating system and the processor do not care when your program reads and writes anywhere within the pages allocated to it. These pages are the ‘segment’ that the process owns. When you do access a forbidden page, one that was not allocated to your process, then you normally get a segmentation fault. Most of the time, it means that your program crashes.

Interestingly,  if I allocate a small array and then I read beyond its bounds, the program will often not immediately crash and may even remain correct. Thus you can allocate an array with room for two integers, read three integers from it, and your program might ‘work’. The following function is wrong but it might very well work fine for years… (and then it will mysterious crash)

int f() {
    int* data = new int[2];
    data[0] = 1;
    data[1] = 2;
    int x = data[0];
    int y = data[1];
    int z = data[2];
    delete[] data;
    return x + y;

But why would you ever want to read beyond the bounds of the allocated memory? For performance and/or convenience.

Modern processors have vector instructions designed to work on ‘large registers’. For example, recent Intel processors have 512-bit registers. Such a register can store 16 standard floating-point values. The following code will compute the dot product between two vectors very quickly… (read the comments in the code to follow through)

float dot512fma(float *x1, float *x2, size_t length) {
  // create a vector of 16 32-bit floats (zeroed)
  __m512 sum = _mm512_setzero_ps();
  for (size_t i = 0; i < length; i += 16) {
    // load 16 32-bit floats
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    // load 16 32-bit floats

    __m512 v2 = _mm512_loadu_ps(x2 + i);
    // do sum[0] += v1[i]*v2[i] (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  // reduce: sums all elements
  return _mm512_reduce_add_ps(sum);

There is a problem with this code, however. If length is not a multiple of sixteen, then we might read too much data. This might or might not crash the program, and it may give wrong results.

What might you do? With earlier processors, you had to handle it in an ad hoc manner.

The solution today is to use masked loads, to load just the data that is safe to read in wide register. For AVX-512, it has hardly any overhead, except for the computation of the mask itself. You might code it as follows:

float dot512fma(float *x1, float *x2, size_t length) {
  // create a vector of 16 32-bit floats (zeroed)
  __m512 sum = _mm512_setzero_ps();
  size_t i = 0;
  for (; i + 16 <= length; i+=16) {
    // load 16 32-bit floats
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    // load 16 32-bit floats
    __m512 v2 = _mm512_loadu_ps(x2 + i);
    // do sum[0] += v1[i]*v2[i] (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  if  (i  < length) {
    // load 16 32-bit floats, load only the first length-i floats
    // other floats are automatically set to zero
    __m512 v1 = _mm512_maskz_loadu_ps((1<<(length-i))-1, x1 + i);
    // load 16 32-bit floats, load only the first length-i floats
    __m512 v2 = _mm512_maskz_loadu_ps((1<<(length-i))-1, x2 + i);
    // do sum[0] += v1[i]*v2[i] (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  // reduce: sums all elements
  return _mm512_reduce_add_ps(sum);

As you can see, I have added a final branch which works with a computed mask. The mask is computed using arithmetic. In this case, I use an ugly formula: ((1<<(length-i))-1).

It is also possible to use just one loop, and to update the mask for the final iteration, but it makes the code slightly harder to understand in my view.

In any case, using masks, I achieve high performance: I use fast AVX-512 instructions throughout. It is also convenient: I can write the entire function with the same coding style.

A sensible question is whether these masked loads and stores are really safe with respect to segmentation faults. You can check it by repeatedly writing and loading just beyond the allocated memory, eventually you will cause a segmentation fault. I have written a small C++ test. The following code always crashes at the last line of the loop, where status has value ‘RIGHT_AFTER’.

  uint8_t *data = new uint8_t[1024];
  size_t ps = page_size();
  // round up to the end of the page:
  uintptr_t page_limit = ps - (reinterpret_cast<uintptr_t>(data) % ps) - 1;
  __m128i ones = _mm_set1_epi8(1); // register filled with ones
  for (int z = 0;; z++) {
    status = RIGHT_BEFORE;
    data[z * ps + page_limit] = 1;
    status = AVX512_STORE;
    _mm_mask_storeu_epi8(data + z * ps + page_limit, 1, ones);
    status = AVX512_LOAD;
    __m128i oneandzeroes = _mm_maskz_loadu_epi8(1, data + z * ps + page_limit);
    status = RIGHT_AFTER;
    data[z * ps + page_limit + 1] = 1;

What about ARM processors ? Thankfully, you can do much of the same thing using Amazon’s graviton processors. A dot product might look as follows:

float dotsve(float *x1, float *x2, int64_t length) {
  int64_t i = 0;
  svfloat32_t sum = svdup_n_f32(0);
  while(i + svcntw() <= length) {
    svfloat32_t in1 = svld1_f32(svptrue_b32(), x1 + i);
    svfloat32_t in2 = svld1_f32(svptrue_b32(), x2 + i);
    sum = svmad_f32_m(svptrue_b32(), in1, in2, sum);
    i += svcntw();
  svbool_t while_mask = svwhilelt_b32(i, length);
  do {
    svfloat32_t in1 = svld1_f32(while_mask, x1 + i);
    svfloat32_t in2 = svld1_f32(while_mask, x2 + i);
    sum = svmad_f32_m(svptrue_b32(), in1, in2, sum);
    i += svcntw();
    while_mask = svwhilelt_b32(i, length);
  } while (svptest_any(svptrue_b32(), while_mask));

  return svaddv_f32(svptrue_b32(),sum);

It is the same algorithm. One difference is that SVE has its own intrinsic functions to general masks. Furthermore, while AVX-512 allow you to pick different register sizes, SVE hides away the register sizes, so that your binary code should run irrespective of the register size. SVE has also ‘non-faulting’ loads and other options.

My code is available: both the AVX-512 code and the ARM/SVE2 code (in a separate directory). You may need access to AWS (Amazon) to run the ARM/SVE2 code.


Book Review : Template Metaprogramming with C++

I have spent the last few years programming often in C++. The C++ langage is probably one of the hardest to master. I still learn something new every week. Furthermore, C++ is getting upgrades all the time: C++17 was a great step forward and C++20 brings even more exiting improvments.

In C++, we often use ‘templates’. As the name suggests, they allow us to create C++ functions and classes from a generic recipe. For example, the following template allows us to create functions that sum up two values:

template <class T>
T f(T x, T y) {
    return x + y;

It gets automatically instantiated when you need it. The following function will return the sum of two integers.

int g(int x, int y) {
    return f(x,y);

Templates are very powerful. They allow us to create highly efficient code, because everything happens at compile time: the optimizer can do it is work. With great power, comes great responsibility: templates can be infuriating since they may lead to unreadable error messages, and they can greatly increase the compile time. I use templates sparingly when programming in C++.

I spent the last few weeks reading
Template Metaprogramming with C++ by Marius Băncilă. It is currently 50$ on Amazon.

Though, technically, the book is about advanced ‘template’ techniques, it is much more broad and practical. It is one of the ‘good programming books’. If you are an experienced C++ programmer, you should give it a peek. It is full of practical code.

The book reminded me of the following features.

Explicit instantiation declaration: In my example above, the template f with type int gets compiled as soon as I invoke it. In that particular case, it is not worrisome. However, it can be helpful to pre-compile the functions to reduce compile time in large projects, involving large pieces of code. We typically achieve the desire results in C and C++ by separating function declaration (put in a header file) and function definition (put in a source file). It is somewhat trickier to do with templates. Yet we can do so by putting the following line in the header:

extern template int f(int,int);

Followed by the following in a source file:

template int f(int,int);

And voilà!

Lambda templates. It is cumbersome in C and C++ to define a function. In C++, you can create ‘lambdas’ which are effectively locally defined functions. For example, the following code first defines an ‘addition’ function (called lambda) and then it applies it to add two numbers:

int g(int x, int y) {
    auto lambda = [](int x, int y) -> int { return x + y; };
    return lambda(x, y);

Up until recently, you could not mix templates and lambdas, but now you can with C++20:

int g(int x, int y) {
    auto lambda = []<class T>(T x, T y) -> T { return x + y; };
    return lambda(x, y);

Variable templates. Sometimes you need constants that depend on a type. For example, the number pi might different depending on the precision of the type you are using. That is where variable templates shine:

template <typename T>
constexpr T PI = T(3.1415926535897932384626433832795028841971693993751);

std::cout << PI<double> << std::endl;
std::cout << PI<float> << std::endl;

There is plenty more to explore in Băncilă’s book.

I posted some code examples on GitHub.

Disclosure: I got a free copy of the book from the editor in exchange for a promise that I would do a review if I liked the book. I did not agree beforehand to produce a positive review. I do not get paid if you buy the book.

Science and Technology links (October 16 2022)

  1. Doctors in Israel are toying with polygenic screening: it is a way to make it more likely that your baby will grow up to be healthy.
  2. In 2021, 337 million prescriptions were written for antidepressants in US, according to the New York Times.
  3. Students in private schools do better in India than those attending government-run schools, even in districts where 70% of all students attend private schools.
  4. People who participate in free markets are more moral toward strangers.
  5. In Japan 4000 years ago, the sea level was 2 meters higher than the current level.

The number of comparisons needed to sort a shuffled array: qsort versus std::sort

Given an array of N numbers of type double, the standard way to sort it in C is to invoke the qsort function

qsort(array, N, sizeof(double), compare);

where compare is a function which returns an integer less than, equal to, or greater than zero if the first argument is less than, equal to, or greater than the second. Because it is a C function, it takes in void pointers which we must convert back to actual values. The following is a reasonable implementation of a comparison function:

int compare(const void *a, const void *b) {
  double x = *(double *)a;
  double y = *(double *)b;
  if (x < y) {
    return -1;
  if (x == y) {
    return 0;
  return 1;

Though the function appears to have branches, optimizing compilers can generate binary code without any jumps in this case.

Though the name suggests that qsort might be implemented using the textbook algorithm Quicksort, the actual implementation depends on the standard library.

The standard approach in C++ is similar. The code might look as follows:

    std::sort(array, array + N, compare);

Again, we have a compare function. In C++, the compare function can refer directly to the type, since the std::sort is actually a template function, and not a mere function. It makes that the C++ compiler effectively generates a function for each comparison function you provide. It trades an increase in binary size for a potential increase in performance. A reasonable implementation of the comparison function is as follows:

bool compare(const double x, const double y) {
  return x < y;

The signature of the C++ comparison function is different: we return a Boolean value as opposed to a three-class integer value.

An interesting question is how many comparisons each function makes. Typically, the comparisons are inexpensive and a bad predictor of the performance, but you can imagine cases where comparing your values could be expensive.

The exact number of comparisons depends on the underlying implementation provided by your system. As inputs, I use random arrays.

I choose to count the average number of times that the comparison function is called. Experimentally, I find that the C++ function makes many more calls to the comparison function than the C function (qsort). The C library that comes with GCC (glibc) uses k – 1.2645 comparisons per element on average to sort arrays of size 2k, matching the theoretical average case performance of merge sort.

LLVM 13 (Apple):

N calls to the comparison function (qsort) per input value calls to the comparison function (std::sort) per input value
210 10.04 12.26
211 11.13 13.41
212 12.21 14.54

GCC 9:

N calls to the comparison function (qsort) per input value calls to the comparison function (std::sort) per input value
210 8.74 11.98
211 9.74 13.22
212 10.74 14.40

My code is available.

Further reading.

A review of elementary data types : numbers and strings

Computer programming starts with the organization of the data into data structures. In almost all cases, we work with strings or numbers. It is critical to understand these building blocks to become an expert programmer.


We often organize data using fixed blocks of memory. When these blocks are relatively small (e.g., 8 bits, 16 bits, 32 bits, 64 bits), we commonly call them ‘words’.

The notion of ‘word’ is important because processors do not operate over arbitrary data types. For practical reasons, processors expect data to fit in hardware registers having some fixed size (usually 64-bit registers). Most modern processors accommodate 8-bit, 16-bit, 32-bit and 64-bit words with fast instructions. It is typical to have the granularity of the memory accesses to be no smaller than the ‘byte’ (8 bits) so bytes are, in a sense, the smallest practical words.

Variable-length data structures like strings might be made of a variable number of words. Historically, strings have been made of lists of bytes, but other alternatives are common (e.g., 16-bit or 32-bit words).

Boolean values

The simplest type is probably the Boolean type. A Boolean value can take either the false or the true value. Though a single bit suffices to represent a Boolean value, it is common to use a whole byte (or more).
We can negate a Boolean value: the true value becomes the false value, and conversely. There are also binary operations:

  • The result of the OR operation between two Boolean values is false if and only if both inputs are false. The OR operation is often noted |. E.g., 1 | 0 == 1 where we use the convention that the symbol == states the equality between two values.
  • The result of the AND operation between two Boolean values is true if and only if both inputs are true. The AND operation is often noted &. E.g., 1 & 1 == 1.
  • The result of the XOR operation is true if and only the two inputs differ in value. The XOR operation is often noted ^. E.g., 1 ^ 1 == 0.
  • The result of the AND NOT operation between two Boolean values is true if and only if the first Boolean value is true and the second one is false.


Integer data types are probably the most widely supported in software and hardware, after the Boolean types. We often represent integers using digits. E.g., the integer 1234 has 4 decimal digits. By extension, we use ‘binary’ digits, called bits, within computers. We often write an integer using the binary notation using the 0b prefix. E.g., the integer 0b10 is two, the integer 0b10110 is equal to 2^1+2^2+2^4 or 22. After the prefix 0b, we enumerate the bit values, starting with the most significant bit. We may also use the hexadecimal (base 16) notation with the 0x prefix: in that case, we use 16 different digits in the list 0, 1, 2, 3,..., 9, A, B, C, D, E, F. These digits have values 0, 1, 2, 3,..., 9, 10, 11, 12, 13, 14, 15. For digits represented as letters, we may use either the lower or upper cases. Thus the number 0x5A is equal to 5 * 16 + 10 or 90 in decimal. The hexadecimal notation is convenient when working with binary values: a single digit represents a 4-bit value, two digits represent an 8-bit value, and so forth.

We might count the number of digits of an integer using the formula ceil(log(x+1)) where the logarithm is the in the base you are interested in (e.g., base 2) and where ceil is the ceiling function: ceil(x) returns the smallest integer no smaller than x. The product between an integer having d1 digits and an integer having d2 digits has either d1+d2-1 digits or d1+d2 digits. To illustrate, let us consider the product between two integers having three digits. In base 10, the smallest product is 100 times 100 is 10,000, so it requires 5 digits. The largest product is 999 times 999 or 998,001 so 6 digits.

For speed or convenience, we might use a fixed number of digits. Given that we work with binary computers, we are likely to use binary digits. We also need a way to represent the sign of a number (negative and positive).

Unsigned integers

Possibly the simplest number type is the unsigned integer where we use a fixed number of bits used to represent non-negative integers. Most processors support arithmetic operations over unsigned integers. The term ‘unsigned’ in this instance is equivalent to ‘non-negative’: the integers can be zero or positive.

We can operate on binary integers using bitwise logical operations. For example, the bitwise AND between 0b101 and 0b1100 is 0b100. The bitwise OR is 0b1111. The bitwise XOR (exclusive OR) is 0b1001.

The powers of two (1, 2, 4, 8,…) are the only numbers having a single 1-bit in their binary representation (0b1, 0b10, 0b100, 0b1000, etc.). The numbers preceding powers of two (1,3,7,…) are the numbers made of consecutive 1-bit in the least significant positions (0b1, 0b11, 0b111, 0b1111, etc.). A unique characteristic of powers of two is that their bitwise AND with the preceding integer is zero: e.g., 4 AND 3 is zero, 8 AND 7 is zero, and so forth.

In the Go programming language, for example, we have 8-bit, 16-bit, 32-bit and 64-bit unsigned integer types: uint8, uint16, uint32, uint64. They can represent all numbers from 0 up to (but not including) 2 to the power of 8, 16, 32 and 64. For example, an 8-bit unsigned integer can represent all integers from 0 up to 255 inclusively.

Because we choose to use a fixed number of bits, we therefore can only represent a range of integers. The result of an arithmetic operation may exceed the range (an overflow). For example, 255 plus 2 is 257: though both inputs (255 and 2) can be represented using 8-bit unsigned integers, the result exceeds the range.

Regarding multiplications, the product of two 8-bit unsigned integers is at most 65025 which can be represented by a 16-bit unsigned integer. It is always the case that the product of two n-bit integers can be represented using 2n bits. The converse is untrue: a given 2n-bit integer is not the product of two n-bit integers. As n becomes large, only a small fraction of all 2n-bit integers can be written as the product of two n-bit integers, a result first proved by Erdős.

Typically, arithmetic operations are “modulo” the power of two. That is, everything is as if we did the computation using infinite-precision integers and then we only kept the (positive) remainder of the division by the power of two.

Let us elaborate. Given two integers a and b (b being non-zero), there are unique integers d and r where r is in [0,b) such that a = d * b + r. The integer r is the remainder and the integer d is the quotient.

Euclid’s division lemma tells us that the quotient and the remainder exist and are unique. We can check uniqueness. Suppose that there is another such pair of integers (d' and r'), a = d' * b + r'. We can check that if d' is equal to d, then we must have that r' is equal to r, and conversely, if r' is equal to r, then d' is equal to d. Suppose that r' is greater than r (if not, just reverse the argument). Then, by subtraction, we have that 0 = (d'-d)*b + (r'-r). We must have that r'-r is in [0,b). If d'-d is negative, then we have that (d-d')*b = (r'-r), but that is impossible because r'-r is in [0,b) whereas (d-d')*b is greater or equal than b. A similar argument works when d'-d is positive.

In our case, the divisor (b) is a power of two. When the numerator (a) is positive, then the remainder amounts to a selection of the least significant bits. For example, the remainder of the division of 65 (or 0b1000001) with 64 is 1.

When considering unsigned arithmetic, it often helps to think that we keep only the least significant bits (8, 16, 32 or 64) of the final result. Thus if we take 255 and we add 2, we get 257, but as an 8-bit unsigned integer, we get the number 1. Thus, using integers of the type uint8, we have that 255 + 2 is 1 (255 + 2 == 1). The power of two itself is zero: 256 is equal to zero as an uint8 integer. If we subtract two numbers and the value would be negative, we effectively ‘wrap’ around: 10 – 20 in uint8 arithmetic is the positive remainder of (-10) divided by 256 which is 246. Another way to think of negative numbers is that we can add the power of two (say 256) as many times as needed (size its value is effectively zero) until we get a value that is between 0 and the power of two. Thus if we must evaluate 1-5*250 as an 8-bit integer, we take the result (-1249) and we add 256 as many times as needed: we have that -1249+5*256 is 31, a number between 0 and 256. Thus 1-5*250 is 31 as an unsigned 8-bit number.

We have that 0-1, as an 8-bit number, is 255 or 0b11111111. 0-2 is 254, 0-3 is 253 and so forth. Consider the set of integers…

-1024, -1023,..., -513, -512, -511, ..., -1, 0, 1, ..., 255, 256, 257,... 

As 256-bit integers, they are mapped to

0, 255, ..., 255, 0, 1, ..., 255, 0, 1, ..., 255, 0, 1, ...

Multiplication by a power of two is equivalent to shifting the bits left, possibly losing the leftmost bits. For example, 17 is 0b10001. Multiplying it by 4, we get 0b1000100 or 68. If we were to multiply 17 by 4, we would get 0b100010000 or, as an 8-bit integer, 0b10000. That is, as 8-bit unsigned integers, we have that 17 * 16 is 16. Thus we have that 17 * 16 == 1 * 16.

The product of two non-zero integers may be zero. For example, 16*16 is zero as an 8-bit integer. It happens only when both integers are divisible by two. The product of two odd integers must always be odd.

We say that two numbers are ‘coprime’ if their largest common divisor is 1. Odd integers are coprime with powers of two. Even integers are never coprime with a power of two.

When multiplying a non-zero integer by an odd integer using finite-bit arithmetic, we never get zero. Thus, for example, 3 * x as an 8-bit integer is zero if and only if x is zero when using fixed-bit unsigned integers. It means that 3 * x is equal to 3 * y if and only if x and y are equal. Thus we have that the following Go code will print out all values from 0 to 255, without repetition:

    for i:=uint8(1); i != 0; i++ {

Multiplying integers by an odd integer permutes them.

If you consider powers of an odd integer, you similarly never get a zero result. However, you may eventually get the power to be one. For example, as an 8-bit unsigned integer, 3 to the power of 64 is 1. This number (64) is sometimes called the ‘order’ of 3. Since this is the smallest exponent so that the result is one, we have that all 63 preceding powers give distinct results. We can show this result as follows. Suppose that 3 raised to the p is equal to 3 raised to the power q, and assume without loss of generality that p>q, then we have that 3 to the power of p-q must be 1, by inspection. And if both p and q are smaller than 64, then so must b p-q, a contradiction. Further, we can check that the powers of an odd integer repeat after the order is reached: we have that 3 to the power 64 is 1, 3 to the power of 65 is 3, 3 to the power of 66 is 9, and so forth. It follows that the order of any odd integer must divide the power of two (e.g., 256).

How large can the order of an odd integer be? We can check that all powers of an odd integer must be odd integers and there are only 128 distinct 8-bit integers. Thus the order of an 8-bit odd integer can be at most 128. Conversely, Euler’s theorem tells us that any odd integer to the power of the number of odd integers (e.g., 3 to the power 128) must be one. Because the values of the power of an odd integer repeat cyclicly after the order is reached, we have that the order of any odd integer must divide 128 for 8-bit unsigned integers. Generally, irrespective of the width in bits of the words, the order of an odd integer must be a power of two.

Given two non-zero unsigned integers, a and b, we would expect that a+b>max(a+b) but it is only true if there is no overflow. When and only when there is an overflow, we have that a+b<min(a+b) using finite-bit unsigned arithmetic. We can check for an overflow with either conditions: a+b<a and a+b<b.

Typically, one of the most expensive operations a computer can do with two integers is to divide them. A division can require several times more cycles than a multiplication, and a multiplication is in turn often many times more expensive than a simple addition or subtraction. However, the division by a power of two and the multiplication by a power of two are inexpensive: we can compute the integer quotient of the division of an unsigned integer by shifting the bits right. For example, the integer 7 (0b111) divided by 2 is 0b011 or 3. We can further divide 7 (0b111) by 4 to get 0b001 or 1. The integer remainder is given by selecting the bits that would be shifted out: the remainder of 7 divided by 4 is 7 AND 0b11 or 0b11. The remainder of the division by two is just the least significant bit. Even integers are characterized by having zero as the least significant bit. Similarly, the multiplication by a power of two is just a left shift: the integer 7 (0b111) multiplied by two is 14 (0b1110). More generally, an optimizing compiler may produce efficient code for the computation of the remainder and quotient when the divisor is fixed. Typically, it involves at least a multiplication and a shift.

Given an integer x, we say that y is its multiplicative inverse if x * y == 1. We have that every odd integer has a multiplicative inverse because multiplication by an integer creates a permutation of all integers. We can compute this multiplicative inverse using Newton’s method. That is, we start with a guess and from the guess, we get a better one, and so forth, until we naturally converge to the right value. So we need some formula f(y), so that we can repeatedly call y = f(y) until y converges. A useful recurrence formula is f(y) = y * (2 - y * x). You can verify that if y is the multiplicative inverse of x, then f(y) = y. Suppose that y is not quite the inverse, suppose that x * y = 1 + z * p for some odd integer z and some power of two p. If the power of two is (say) 8, then it tells you that y is the multiplicative inverse over the first three bits. We get x * f(y) = x * y * (2 - y * x) = 2 + 2 * z * p - (1 - 2 * z * p + z * z * p * p) = 1 - z * z * p * p. We can see from this result that if y is the multiplicative inverse over the first n bits, then f(y) is the multiplicative inverse over 2n bits. That is, if y is the inverse “for the first n bits”, then f(y) is the inverse “for the first 2n bits”. We double the precision each time we call the recurrence formula. It means that we can quickly converge on the inverse.

What should our initial guess for y be? If we use 3-bit words, then every number is its inverse. So starting with y = x would give us three bits of accuracy, but we can do better: ( 3 * x ) ^ 2 provides 5 bits of accuracy. The following Go program verifies the claim:

package main

import "fmt"

func main() {
    for x := 1; x < 32; x += 2 {
        y := (3 * x) ^ 2
        if (x*y)&0b11111 != 1 {

Observe how we capture the 5 least significant bits using the expression &0b11111: it is a bitwise logical AND operation.

Starting from 5 bits, the first call to the recurrence formula gives 10 bits, then 20 bits for the second call, then 40 bits, then 80 bits. So, we need to call our recurrence formula 2 times for 16-bit values, 3 times for 32-bit values and 4 times for 64-bit values. The function FindInverse64 computes the 64-bit multiplicative inverse of an odd integer:

func f64(x, y uint64) uint64 {
    return y * (2 - y*x)

func FindInverse64(x uint64) uint64 {
    y := (3 * x) ^ 2 // 5 bits
    y = f64(x, y)    // 10 bits
    y = f64(x, y)    // 20 bits
    y = f64(x, y)    // 40 bits
    y = f64(x, y)    // 80 bits
    return y

We have that FindInverse64(271) * 271 == 1. Importantly, it fails if the provided integer is even.

We can use multiplicative inverses to replace the division by an odd integer with a multiplication. That is, if you precompute FindInverse64(3), then you can compute the division by three for any multiple of three by computing the product: e.g., FindInverse64(3) * 15 == 5.

When we store multi-byte values such as unsigned integers in arrays of bytes, we may use one of two conventions: little- and big-endian. The little- and big-endian variants only differ by the byte order: we either start with the least significant bytes (little endian) or by the most significant bytes (big endian). Let us consider the integer 12345. An an hexadecimal value, it is 0x3039. If we store it as two bytes, we may either store it as the byte value 0x30 followed by the byte value 0x39 (big endian), or by the reverse (0x39 followed by 0x30). Most modern systems default on the little-endian convention, and there are relatively few big-endian systems. In practice, we rarely have to be concerned with the endianness of our system.

Signed integers and two’s complement

Given unsigned integers, how do we add support for signed integers? At first glance, it is tempting to reserve a bit for the sign. Thus if we have 32 bits, we might use one bit to indicate whether the value is positive or negative, and then we can use 31 bits to store the absolute value of the integer.

Though this sign-bit approach is workable, it has downsides. The first obvious downside is that there are two possible zero values: +0 and -0. The other downside is that it makes signed integers wholly distinct values as compared to unsigned integers: ideally, we would like hardware instructions that operate on unsigned integers to ‘just work’ on signed integers.

Thus modern computers use two’s complement notation to represent signed integers. To simplify the exposition, we consider 8-bit integers. We represent all positive integers up to half the range (127 for 8-bit words) in the same manner, whether using signed or unsigned integers. Only when the most significant bit is set, do we differ: for the signed integers, it is as if the unsigned value derived from all but the most significant bit is subtracted by half the range (128). For example, as an 8-bit signed value, 0b11111111 is -1. Indeed, ignoring the most significant bit, we have 0b1111111 or 127, and subtracting 128, we get -1.

Binary unsigned signed
0b00000000 0 0
0b00000001 1 1
0b00000010 2 2
0b01111111 127 127
0b10000000 128 -128
0b10000001 129 -127
0b11111110 254 -2
0b11111111 255 -1

In Go, you can ‘cast’ unsigned integers to signed integers, and vice versa: Go leaves the binary values unchanged, but it simply reinterprets the value as unsigned and signed integers. If we execute the following code, we have that x==z:

    x := uint16(52429)
    y := int16(x)
    z := uint16(y)

Conveniently, whether we compute the multiplication, the addition or the subtraction between two values, the result is the same (in binary) whether we interpret the bits as a signed or unsigned value. Thus we can use the same hardware circuits.

A downside of the two’s complement notation is that the smallest negative value (-128 in the 8-bit case) cannot be safely negated. Indeed, the number 128 cannot be represented using 8-bit signed integers. This asymmetry is unavoidable because we have three types of numbers: zero, negative values and positive values. Yet we have an even number of binary values.

Like with unsigned integers, we can shift (right and left) signed integers. The left shift works like for unsigned integers at the bit level. We have that

x := int8(1)
(x << 1) == 2
(x << 7) == -128

However, right shift works differently for signed and unsigned integers. For unsigned integers, we shift in zeroes from the left; for signed integers, we either shift in zeroes (if the integer is positive or zero) or ones (if the integer and negatives). We illustrate this behaviour with an example:

    x := int8(-1)
    (x >> 1) == -1
    y := uint8(x)
    y == 255
    (y >> 1) == 127

When a signed integer is positive, then dividing by a power of two or shifting right has the same result (10/4 == (10>>2)). However, when the integer is negative, it is only true when the negative integer is divisible by the power of two. When the negative integer is not divisible by the power of two, then the shift is smaller by one than the division, as illustrated by the following code:

    x := int8(-10)
    (x / 4) == -2
    (x >> 2) == -3

Floating-point numbers

On computers, real numbers are typically approximated by binary floating-point numbers: a fixed-width integer m (the significand) multiplied by 2 raised to an integer exponent p: m * 2**p where 2**p represents the number two raised to the power p. A signed bit is added so that both a positive and negative zero are available. Most systems today follow the IEEE 754 standard which means that you can get consistent results across programming languages and operating systems. Hence, it does not matter very much if you implement your software in C++ under Linux whereas someone else implements it in C# under Windows: if you both have recent systems, you can expect identical numerical outcomes when doing basic arithmetic and square-root operations.

A positive normal double-precision floating-point number is a binary floating-point number where the 53-bit integer m is in the interval [2**52,2**53) while being interpreted as a number in [1,2) by virtually dividing it by 2**52, and where the 11-bit exponent p ranges from -1022 to +1023. Thus we can represent all values between 2**-1022 and up to but not including 2**1024. Some values smaller than 2**-1022 can be represented as subnormal values: they use a special exponent code which has the value 2**-1022 and the significand is then interpreted as a value in the interval [0,1).

In Go, a float64 number can represent all decimal numbers made of a 15-digit significand from approximately -1.8 * 10**308 to 1.8 *10**308. The reverse is not true: it is not sufficient to have 15 digits of precision to distinguish any two floating-point numbers: we may need up to 17 digits.

The float32 type is similar. It can represent all numbers between 2**-126 up to, but not including, 2**128; with special handling for some numbers smaller than 2**-126 (subnormals). The float32 type can represent exactly all decimal numbers made of a 6-digit decimal significand but 9 digits are needed in general to identify uniquely a number.

Floating-point numbers also include the positive and negative infinity, as well as a special not-a-number value. They are identified by a reserved exponent value.

Numbers are typically serialized as decimal numbers in strings and then parsed back by the receiver. However, it is generally impossible to convert decimal numbers into binary floating-point numbers: the number 0.2 has no exact representation as a binary floating-point number. However, you should expect the system to choose the best possible approximation: 7205759403792794 * 2**-55 as a float64 number (or about 0.20000000000000001110). If the initial number was a float64 (for example), you should expect the exact value to be preserved: it will work as expected in Go.


One of the earliest string standards is ASCII: it was first specified in the early 1960s. The ASCII standard is still popular. Each character is a byte, with the most significant bit set to zero. There are therefore only 128 distinct ASCII characters. It is often sufficient for simple tasks like programming. Unfortunately, the ASCII standard could only ever represent up to 128 characters: far less than needed.

Many diverging standards emerged for representing characters in software. The existence of multiple incompatible formats made the production of interoperable localized software challenging.

Engineers developed Unicode in the late 1980s as an attempt to provide a universal standard. Initially, it was believed that using 16 bits per character would be sufficient, but this belief was wrong. The Unicode standard was extended to include up to 1,114,112 characters. Only a small fraction of all possible characters have been assigned, but more are assigned over time with each Unicode revision. The Unicode standard is an extension of the ASCII standard: the first 128 Unicode characters match the ASCII characters.

Due to the original expectation that Unicode would fit in 16-bit space, a format based on 16-bit words (UTF-16) format was published in 1996. It may use either 16-bit or 32-bit per character. The UTF-16 format was adopted by programming languages such as Java, and became a default under Windows. Unfortunately, UTF-16 is not backward compatible with ASCII at a byte level. An ASCII-compatible format was proposed and formalized in 2003: UTF-8. Over time, UTF-8 became widely used for text interchange formats such as JSON, HTML or XML. Programming languages such as Go, Rust and Swift use UTF-8 by default. Both formats (UTF-8 and UTF-16) require validation: not all arrays of bytes are valid. The UTF-8 format is more expensive to validate.

ASCII characters require one byte with UTF-8 and two bytes with UTF-16. The UTF-16 format can represent all characters, except for the supplemental characters such as emojis, using two bytes. The UTF-8 format uses two bytes for Latin, Hebrew and Arabic alphabets, three bytes for Asiatic characters and 4 bytes for the supplemental characters.

UTF-8 encodes values in sequences of one to four bytes. We refer to the first byte of a sequence as a leading byte; the most significant bits of the leading byte indicates the length of the sequence:

  • If the most significant bit is zero, we have a sequence of one byte (ASCII).
  • If the three most significant bits are 0b110, we have a two-byte sequence.
  • If the four most significant bits are 0b1110, we have a three-byte sequence.
  • Finally, if the five most significant bits are 0b11110, we have a four-byte sequence.
    All bytes following the leading byte in a sequence are continuation bytes, and they must have 0b10 as their most significant bits. Except for the required most significant bits, the numerical value of the character (between 0 to 1,114,112) is stored by starting with the most significant bits (in the leading byte) followed by the less significant bits in the other continuation bytes.

In the UTF-16 format, characters in 0x0000-0xD7FF and 0xE000-0xFFFF are stored as single 16-bit words. Characters in the range 0x010000 to 0x10FFFF require two 16-bit words called a surrogate pair. The first word in the pair is in the range 0xd800 to 0xdbff whereas the second word is in the range from 0xdc00 to 0xdfff. The character value is made of the 10 least significant bits of the two words, using the second word as least significant, and adding 0x10000 to the result. There are two types of UTF-16 format. In the little-endian variant, each 16-bit word is stored using the least significant bits in the first byte. The reverse is true in the big-endian variant.

When using ASCII, it is relatively easy to access the characters in random order. For UTF-16, it is possible if we assume that there are no supplemental characters, but since some characters might require 4 bytes while other 2 bytes, it is not possible to go directly to a character by its index without accessing the previous content. The UTF-8 is similarly not randomly accessible in general.

Software often depends on the chosen locale: e.g., US English, French Canadian, and so forth. Sorting strings is locale-dependent. It is not generally possible to sort strings without knowing the locale. However, it is possible to sort strings lexicographically as byte sequences (UTF-8) or as 16-bit word sequences (UTF-16). When using UTF-8, the result is then a string sort based on the characters’ numerical value.

Optimizing compilers deduplicate strings and arrays

When programming, it can be wasteful to store the same constant data again and again. You use more memory, you access more data. Thankfully, your optimizing compiler may help.

Consider the following two lines of C code:

    printf("Good day professor Jones");
    printf("Good day professor Jane");

There is redundancy since the prefix “Good day professor” is the same in both cases. To my knowledge, no compiler is likely to trim this redundancy. However, you can get the desired trimming by breaking the strings:

    printf("Good day professor ");

    printf("Good day professor ");

Most compilers will recognize the constant string and store it once in the program. It works even if the constant string “Good day professor” appears in different functions.

Thus the following function may return true:

    const char * str1 = "dear friend";
    const char * str2 = "dear friend";
    return str1 == str2;

That is, you do not need to manually create constant strings: the compiler recognizes the redundancy (typically).

The same trick fails with extended strings:

    const char * str1 = "dear friend";
    const char * str2 = "dear friend\0f";
    return str1 == str2;

All compilers I tried return false. They create two C strings even if one is a prefix of the other in the following example…

char get1(int k) {
    const char * str = "dear friend";
    return str[k];

char get2(int k) {
    const char * str = "dear friend\0f";
    return str[k];

Unsurprisingly, the “data compression” trick works with arrays. For example, the arrays in these two functions are likely to be compiled to just one array because the compiler recognizes that they are identical:

int f(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    return array[k];

int g(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    return array[k+1];

It may still work if one array is an exact subarray of the other ones with GCC, as in this example:

int f(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    return array[k];

int g(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    return array[k+1];

You can also pile up several arrays as in the following case where GCC creates just one array:

long long get1(int k) {
    long long str[] = {1,2,3};
    return str[k];

long long get2(int k) {
    long long str[] = {1,2,3,4};
    return str[k+1];

long long get3(int k) {
    long long str[] = {1,2,3,4,5,6};
    return str[k+1];

long long get4(int k) {
    long long str[] = {1,2,3,4,5,6,7,8};
    return str[k+1];

It also works with arrays of pointers, as in the following case:

const char * get1(int k) {
    const char * str[] = {"dear friend", "dear sister", "dear brother"};
    return str[k];

const char * get2(int k) {
    const char * str[] = {"dear friend", "dear sister", "dear brother"};
    return str[k+1];

Of course, if you want to make sure to keep your code thin and efficient, you should not blindly rely on the compiler. Nevertheless, it is warranted to be slightly optimistic.

Science and Technology links (September 16 2022)

Escaping strings faster with AVX-512

When programming, we often have to ‘escape’ strings. A standard way to do it is to insert the backslash character (\) before some characters such as the double quote. For example, the string

my title is "La vie"


my title is \"La vie\"

A simple routine in C++ to escape a string might look as follows:

  for (...) {
    if ((*in == '\\') || (*in == '"')) {
      *out++ = '\\';
    *out++ = *in;

Such a character-by-character approach is unlikely to provide the best possible performance on modern hardware.

Recent Intel processors have fast instructions (AVX-512) that are well suited for such problems. I decided to sketch a solution using Intel intrinsic functions. The routine goes as follows:

  1. I use two constant registers containing 64 copies of the backslash character and 64 copies of the quote characters.
  2. I start a loop by loading 32 bytes from the input.
  3. I expands these 32 bytes into a 64 byte register, interleaving zero bytes.
  4. I compare these bytes with the quotes and backslash characters.
  5. From the resulting mask, I then construct (by shifting and blending) escaped characters.
  6. I ‘compress’ the result, removing the zero bytes that appear before the unescaped characters.
  7. I advance the output pointer by the number of written bytes and I continue the loop.

The C++ code roughly looks like this…

  __m512i solidus = _mm512_set1_epi8('\\');
  __m512i quote = _mm512_set1_epi8('"');
  for (; in + 32 <= finalin; in += 32) {
    __m256i input = _mm256_loadu_si256(in);
    __m512i input1 = _mm512_cvtepu8_epi16(input);
    __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus);
    __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote);
    __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote);
    __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa);
    __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1);
    __m512i escaped =
        _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus);
    _mm512_mask_compressstoreu_epi8(out, to_keep, escaped);
    out += _mm_popcnt_u64(_cvtmask64_u64(to_keep));

This code can be greatly improved. Nevertheless, it is a good first step. What are the results an Intel icelake processor using GCC 11 (Linux) ? A simple benchmark indicates a 5x performance boost compared to a naive implementation:

regular code 0.6 ns/character
AVX-512 code 0.1 ns/character

It looks quite encouraging ! My source code is available. I require a recent x64 processor with AVX-512 VBMI2 support.

Science and Technology links (September 12 2022)

  1. A standard dataset in artificial-intelligence research has ten percent of its images mislabeled. Yet state-of-the-art algorithms achieve better-than-90% classification on the same dataset. (Credit: Leo Boytsov)
  2. Despite the reeducation camps and the massive trauma, families that did well before the Chinese socialist revolution are still doing well. In other words, it is difficult and maybe impossible to do away with the competitiveness of some families.
  3. Exercise appears to enhance the hippocampus (in mice) so that exercise might enhance spatial learning.
  4. Lead is a toxic substance that might have accounted for a significant fraction of violent crimes. The more careful use of lead explains in part the reduction in violent crimes over time.
  5. Watching a video at 2x the speed can double your learning speed.
  6. A man who received a heart transplant from a pig died two months later.
  7. Mental speed does not decrease with age as we expected: no mental speed decline is observed before the age of 60.
  8. Marmots appear not to age while they are hibernating.
  9. It is often stated that twins raised apart have a striking similarity in intelligence. And twins raised apart are similar in many ways (e.g., personality), but their intelligence might be less closely related than we thought.
  10. The widespread availability of guns makes lynching less likely.
  11. Vitamin D supplements do not appear to protect the elderly against fractures.
  12. The largest flood in the history of California happened in 1862. The flood followed a 20-year-long drought and the valleys became seas for a short time. Such a flood appears to be a relatively common occurrence though there has been no comparable flood since then. Native Americans apparently knew of these floods:

    Native Americans knew that the Sacramento Valley could become an inland sea when the rains came. Their storytellers described water filling the valley from the Coast Range to the Sierra.

Catching sanitizer errors programmatically

The C and C++ languages offer little protection against programmer errors. Errors do not always show up where you expect. You can silently corrupt the content of your memory. It can make bugs difficult to track. To solve this problem, I am a big fan of programming in C and C++ using sanitizers. They slow your program, but the check that memory accesses are safe, for example.

Thus if you iterate over an array and access elements that are out of bounds, a memory sanitizer will immediately catch the error:

  int array[8];
  for(int k = 0;; k++) {
    array[k] = 0;

The sanitizer reports the error, but what if you would like to catch the error and store it in some log? Thankfully, GCC and LLVM sanitizers call a function (__asan_on_error()) when when an error is encounter, allowing us to log it. Of course, you need to record the state of your program. The following is an example where the state is recorded in a string.

#include <iostream>
#include <string>
#include <stdlib.h>

std::string message;

extern "C" {
void __asan_on_error() {
  std::cout << "You caused an error: " << message << std::endl;

int main() {
  int array[8];
  for(int k = 0;; k++) {
    message = std::string("access at ") + std::to_string(k);
    array[k] = 0;
  return EXIT_SUCCESS;

The extern expression makes sure that C++ does not mangle the function name.

Running this program with memory sanitizers will print the following:

You caused an error: access at 8

You could also write to a file, if you would like.

“Hello world” is slower in C++ than in C (Linux)

A simple C program might print ‘hello world’ on screen:

#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("hello world\n");
    return EXIT_SUCCESS;

You can write the equivalent in C++:

#include <iostream>
#include <stdlib.h>

int main() {
    std::cout << "hello world" << std::endl;
    return EXIT_SUCCESS;

In the recently released C++20 standard, we could use std::format instead or wrap the stream in a basic_osyncstream for thread safety, but the above code is what you’d find in most textbooks today.

How fast do these programs run? You may not care about the performance of these ‘hello world’ programs per se, but many systems rely on small C/C++ programs running specific and small tasks. Sometimes you just want to run a small program to execute a computation, process a small file and so forth.

We can check the running time using a benchmarking tool such as hyperfine. Such tools handle various factors such as shell starting time and so forth.

I do not believe that printing ‘hello world’ itself should be slower or faster in C++ compared to C, at least not significantly. What we are testing by running these programs is the overhead due to the choice of programming language when launching the program. One might argue that in C++, you can use printf (the C function), and that’s correct. You can code in C++ as if you were in C all of the time. It is not unreasonable, but we are interested in the performance when relying on conventional/textbook C++ using the standard C++ library.

Under Linux when using the standard C++ library (libstdc++), we can ask that the standard C++ be linked with the executable. The result is a much larger binary executable, but it may provide faster starting time.

Hyperfine tells me that the C++ program relying on the dynamically loaded C++ library takes almost 1 ms more time than the C program.

C 0.5 ms
C++ (dynamic) 1.4 ms
C++ (static) 0.8 ms

My source code and Makefile are available. I get these numbers on Ubuntu 22.04 LTS using an AWS node (Graviton 3).

If these numbers are to be believed, there may a significant penalty due to textbook C++ code for tiny program executions, under Linux.

Half a millisecond or more of overhead, if it is indeed correct, is a huge penalty for a tiny program like ‘hello workd’. And it only happens when I use dynamic loading of the C++ library: the cost is much less when using a statically linked C++ library.

It seems that loading the C++ library dynamically is adding a significant cost of up to 1 ms. We might check for a few additional confounding factors proposed by my readers.

  1. The C compiler might not call the printf function, and might call the simpler puts function instead: we can fool the compiler into calling printf with the syntax printf("hello %s\n", "world"): it makes no measurable difference in our tests.
  2. If we compile the C function using a C++ compiler, the problem disappears, as you would hope, and we match the speed of the C program.
  3. Replacing  "hello world" << std::endl; with "hello world\n"; does not seem to affect the performance in these experiments. The C++ program remains much slower.
  4. Adding std::ios_base::sync_with_stdio(false); before using std::cout also appears to make no difference. The C++ program remains much slower.
C (non trivial printf) 0.5 ms
C++ (using printf) 0.5 ms
C++ (std::cout replaced by \n) 1.4 ms
C++ (sync_with_stdio set to false) 1.4 ms

Thus we have every indication that dynamically loading the C++ standard library takes a lot time, certainly hundreds of extra microseconds. It may be a one-time cost but if your programs are small, it can dominate the running time. Statically linking the C++ library helps, but it also creates larger binaries. You may reduce somewhat the size overhead with appropriate link-time flags such as --gc-sections, but a significant overhead remains in my tests.

Note: This blog post has been edited to answer the multiple comments suggesting confounding factors, other than standard library loading, that the original blog post did not consider. I thank my readers for their proposals.

Appendix 1 We can measure precisely the loading time by preceding the execution of the function by LD_DEBUG=statistics (thanks to Grégory Pakosz for the hint). The C++ code requires more cycles. If we use LD_DEBUG=all (e.g., LD_DEBUG=all ./hellocpp), then we observe that the C++ version does much more work (more versions checks, more relocations, many more initializations and finalizers). In the comments, Sam Mason blames dynamic linking: on his machine he gets the following result…

C code that dynamically links to libc takes ~240µs, which goes down to ~150µs when statically linked. A fully dynamic C++ build takes ~800µs, while a fully static C++ build is only ~190µs.

Appendix 2 We can try to use sampling-based profiling to find out where the programs speeds its time. Calling perf record/perf report is not terribly useful on my system. Some readers report that their profiling points the finger at locale initialization in this manner. I get a much more useful profile with valgrind --tool=callgrind command && callgrind_annotate. The results are consistent with the theory that loading the C++ library dynamically is relatively expensive.

Science and Technology links (August 7 2022)

  1. Increase in computing performance explain up to 94% of the performance improvements in field such as weather prediction, protein folding, and oil exploration: information technology is a a driver of long-term performance improvement across society. If we stop improving our computing, the consequences could be dire.
  2. The coral cover of the Great Barrier Reef has reached its highest level since the Australian Institute of Marine Science (AIMS) began monitoring 36 years ago.
  3. The leading Alzheimer’s theory, the amyloid hypothesis, which motivated years of research, was built in part on fraud. The author of the fraud is professor Sylvain Lesné. His team appeared to have composed figures by piecing together parts of photos from different experiments. Effectively, they “photoshopped” their scientific papers. Not just one or two images, but at least 70. Note that all clinical trials of drugs developed (at high cost) on the amyloid hypothesis have failed. The total cost is in the billions of dollars. Meanwhile, competing theories and therapies have been sidelined by the compelling amyloid hypothesis. It will be interesting to watch what kind of penalty, if any, Lesné receives for his actions. You may read the testimonies of researchers whose work was made difficult because they did not believe in the amyloid hypothesis. E.g., The maddening saga of how an Alzheimer’s ‘cabal’ thwarted progress toward a cure for decades (from 2019) or the story of Rachael Neve. The net result? No progress despite massive funding:

    In the 30 (now 35) years that biomedical researchers have worked determinedly to find a cure for Alzheimer’s disease, their counterparts have developed drugs that helped cut deaths from cardiovascular disease by more than half, and cancer drugs able to eliminate tumors that had been incurable. But for Alzheimer’s, not only is there no cure, there is not even a disease-slowing treatment.

    We have two decades of Alzheimer’s research based partly on deliberate fraud that may have cost millions of lives.

  4. For years, we have been told that depression was caused by a chemical imbalance in the brain, specifically low serotonin levels. Antidepressants were initially proposed to work by rectifying the serotonin abnormality. A new paper published by Nature shows not only that depressed people do not have lower serotonin levels but, moreover, long-term antidepressant use may lead to lower serotonin levels.
  5. Science has been ‘covided’:

    Across science, 98 of the 100 most-cited papers published in 2020 to 2021 were related to COVID-19. A large number of scientists received large numbers of citations to their COVID-19 work, often exceeding the citations they had received to all their work during their entire career.

  6. The Earth is getting greener due to increase CO2 presence in the atmosphere and this is predicted to lead to substantial cooling according to an article in Nature.
  7. The Sahara was green from 14,000 to 5,000 years ago. Some researchers believe that it became a desert due to the sudden cooling of the Atlantic Ocean.

Comparing strtod with from_chars (GCC 12)

A reader (Richard Ebeling) invited me to revisit an older blog post: Parsing floats in C++: benchmarking strtod vs. from_chars. Back then I reported that switching from strtod to from_chars in C++ to parse numbers could lead to a speed increase (by 20%). The code is much the same, we go from…

char * string = "3.1416";
char * string_end = string;
double x = strtod(string, &string_end);
if(string_end == string) { 
  //you have an error!

… to something more modern in C++17…

std::string st = "3.1416";
double x; 
auto [p, ec] = std::from_chars(, + st.size(), x);
if (p == {
      //you have errors!

Back when I first reported on this result, only Visual Studio had support for from_chars. The C++ library in GCC 12 now has full support for from_chars. Let us run the benchmark again:

strtod 270 MB/s
from_chars 1 GB/s

So it is almost four times faster! The benchmark reads random values in the [0,1] interval.

Internally, GCC 12 adopted the fast_float library.

Further reading: Number Parsing at a Gigabyte per Second, Software: Pratice and Experience 51 (8), 2021.

Round a direction vector to an 8-way compass

Modern game controllers can point in a wide range of directions. Game designers sometimes want to convert the joystick direction to get 8-directional movement. A typical solution offered is to compute the angle, round it up and then compute back the direction vector.

  double angle = atan2(y, x);
  angle = (int(round(4 * angle / PI + 8)) % 8) * PI / 4;
  xout = cos(angle);
  yout = sin(angle);

If you assume that the unit direction vector is in the first quadrant (both x and y are positive), then there is a direct way to compute the solution. Using 1/sqrt(2) or 0.7071 as the default solution, compare both x and y with sin(3*pi/8), and only switch them to 1 if they are larger than sin(3*pi/8) or to 0 if the other coordinate is larger than sin(3*pi/8). The full code looks as follows:

  double outx = 0.7071067811865475; // 1/sqrt(2)
  double outy = 0.7071067811865475;// 1/sqrt(2)
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outx = 1;
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outy = 1;
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outx = 0;
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outy = 0;
  if (xneg) {
    outx = -outx;

I write tiny if clauses because I hope that the compile will avoid producing comparisons and jumps which may stress the branch predictor when the branches are hard to predict.
You can generalize the solution for the case where either x or y (or both) are negative by first taking the absolute value, and then restoring the sign at the end:

  bool xneg = x < 0;
  bool yneg = y < 0;
  if (xneg) {
    x = -x;
  if (yneg) {
    y = -y;
  double outx = 0.7071067811865475; // 1/sqrt(2)
  double outy = 0.7071067811865475;// 1/sqrt(2)
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outx = 1;
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outy = 1;
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outx = 0;
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outy = 0;
  if (xneg) {
    outx = -outx;
  if (yneg) {
    outy = -outy;

You can rewrite everything with the ternary operator to entice the compiler to produce branchless code (i.e., code without jumps). The result is more compact.

  bool xneg = x < 0;
  x = xneg ? -x : x;
  bool yneg = y < 0;
  y = yneg ? -y : y;
  double outx = (x >= 0.923879532511286) ? 1 
    : 0.7071067811865475;
  double outy = (y >= 0.923879532511286) ? 1 
    : 0.7071067811865475;
  outx = (y >= 0.923879532511286) ? 0 : outx;
  outy = (x >= 0.923879532511286) ? 0 : outy;
  outx = xneg ? -outx : outx;
  outy = yneg ? -outy : outy;

The clang compiler may produce an entirely branchless assembly given this code.

But as pointed out by Samuel Lee in the comments, you can do even better… Instead of capturing the sign with a separate variable, you can just copy the pre-existing sign using a function like copysign (available in C, C#, Java and so forth):

 double outx = fabs(x);
 double outy = fabs(y);
 outx = (outx >= 0.923879532511286) ? 1 
   : 0.7071067811865475;
 outy = (outy >= 0.923879532511286) ? 1 
   : 0.7071067811865475;
 outx = (posy >= 0.923879532511286) ? 0 : outx;
 outy = (posx >= 0.923879532511286) ? 0 : outy;
 outx = copysign(outx, x);
 outy = copysign(outy, y);

I wrote a small benchmark that operates on random inputs. Your results will vary but on my mac laptop with LLVM 12, I get that the direct approach with copysign is 50 times faster than the approach with tan/sin/cos.

with tangent 40 ns/vector
fast approach 1.2 ns/vector
fast approach/copysign 0.8 ns/vector