Are your memory-bound benchmarking timings normally distributed?

When optimizing software, we routinely measure the time that a given function or task takes. Our goal is to decide whether the new code is more or less efficient. The typical assumption is that the timings follow a normal distribution, and so we should report the average time. If the average time goes up, our new code is less efficient.

I believe that the normality assumption is frequently violated in software benchmarks. I recently gave a talk on precise and accurate benchmarking (video, slides) at a “Benchmarking in the Data Center” workshop where I made this point in detail.

Why should it matter? If you have a normal distribution, you can mathematically increase the accuracy of the measure by taking more samples: your error should go down as one over the square root of the number of measures.

If you have ever tried to increase the number of samples in the hope of getting a more accurate result, you may have been severely disappointed. That is because the timings often more closely resemble a log normal distribution:

A log normal distribution is asymmetrical: the mean is relatively close to the minimum, and there is a long tail… as you take more and more samples, you may find more and more large values.

You can often show that you do not have a normal distribution because you find 4-sigma, 5-sigma or even 13-sigma events: you measure values that are far above the average relative to your estimated standard deviation. In a normal distribution, it is exceedingly improbable to land many standard deviations away from the mean. However, it happens much more easily with a log normal.

Of course, real data is messy. I am not claiming that your timings precisely follow a log normal distribution: it is merely a model. Nevertheless, it suggests that reporting the average and the standard error is inadequate. I like to measure both the minimum and the average. You can then use the distance between the average and the minimum as an easy-to-measure error metric: if the average is far from the minimum, then your real-world performance could differ drastically from the minimum. The minimum value is akin to the frictionless model in physics: it is the performance you get after taking out all the hard-to-model features of the problem.

People object that they want the real performance, but that is often an ill-posed problem: it is like asking how fast a real ball rolls down a real slope, with friction and air resistance. You can measure it, but a ball rolling down a slope does not have inherent, natural friction and air resistance… those emerge from your specific use case. In other words, a given piece of software code does not have a natural performance distribution independent of your specific use case. However, it is possible to measure the best performance that a function might have on a given CPU: the function does not have an intrinsic performance distribution, but it has a well defined minimum time of execution.

The minimum is easier to measure accurately than the average. I routinely achieve a precision of 1% or better in practice. That is, if I rerun the same benchmark another day on the same machine, I can be almost certain that the result won’t vary by much more than 1%. The average is slightly more difficult to nail down. You can verify that it is so with a model: if you generate log-normally distributed samples, you will find it easier to determine the minimum than the average with high accuracy.
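
Here is a toy model of that effect (my own construction, not the code from the talk): it assumes that a timing is a hard floor plus heavy-tailed lognormally distributed noise. Across reruns, the batch minimum comes out far more stable than the batch average in this model.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Repeatedly take 30 "timings" (floor + lognormal noise) and check which
// estimator, minimum or average, fluctuates less from one rerun to the next.
int main() {
  std::mt19937 gen(1234);
  std::lognormal_distribution<double> noise(0.0, 1.0); // arbitrary parameters
  const size_t batches = 1000, samples = 30;
  std::vector<double> minima, averages;
  for (size_t b = 0; b < batches; b++) {
    double mn = 1e300, sum = 0;
    for (size_t s = 0; s < samples; s++) {
      double t = 10.0 + noise(gen); // hard floor of 10 units, plus noise
      mn = std::min(mn, t);
      sum += t;
    }
    minima.push_back(mn);
    averages.push_back(sum / samples);
  }
  // relative spread of an estimator: (max - min) / min over all batches
  auto spread = [](const std::vector<double> &v) {
    auto [lo, hi] = std::minmax_element(v.begin(), v.end());
    return (*hi - *lo) / *lo;
  };
  printf("spread of the batch minima  : %.3f\n", spread(minima));
  printf("spread of the batch averages: %.3f\n", spread(averages));
}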

What about the median? Let us imagine that I take 30 samples from a lognormal distribution. I repeat many times, each time measuring the average, the median and the minimum. Both the median and the average have a greater relative standard error than the minimum. In practice, I also find that the median is both harder to compute (more overhead) and not as precise as the minimum.

I find that under some conditions (fixed data structures/data layout, few system calls, single-core processing, no tiny function, and no JIT compiler) the minimum time elapsed is a good measure.

For my talk, I used a compute-bound routine: the performance of my function was not bound by the speed of the RAM, of the network or of the disk. I took data that was in CPU cache and I generated more data in CPU cache. For these types of problems, I find that timings often resemble a log normal.

What about other types of tasks? What about memory-bound tasks?

I took a memory benchmark that I like to use. It consists of a large array spanning hundreds of megabytes, and the software must jump from location to location in it, incapable of predicting the next jump before completing the current read. It is often called a “pointer chasing” routine. I can interleave several pointer-chasing routines so that I have several loads in flight at any time: I call this the number of lanes. The more lanes I have, the more “bandwidth limited” I become; the fewer the lanes, the more “latency limited” I become. I am never compute bound in these tests, meaning that the number of instructions retired per cycle is always low. The sketch right after this paragraph illustrates the idea.
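
Here is a rough, minimal sketch (an illustration only, not the benchmark code used in the post):

#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Each slot of a large array stores the index of the next slot to visit, so
// the processor cannot begin a load before the previous load completes.
uint32_t chase(const std::vector<uint32_t> &next, size_t steps) {
  uint32_t pos = 0;
  for (size_t i = 0; i < steps; i++) {
    pos = next[pos]; // each load depends on the previous one
  }
  return pos;
}

// Two lanes: two independent chains are in flight at once, so the processor
// can overlap their memory latencies.
uint32_t chase2(const std::vector<uint32_t> &next, size_t steps) {
  uint32_t pos1 = 0, pos2 = 1;
  for (size_t i = 0; i < steps; i++) {
    pos1 = next[pos1];
    pos2 = next[pos2];
  }
  return pos1 + pos2;
}

int main() {
  const size_t n = 100'000'000; // about 400 MB of 32-bit indexes
  std::vector<uint32_t> next(n);
  std::iota(next.begin(), next.end(), 0);
  std::mt19937_64 gen(1234);
  for (size_t i = n - 1; i > 0; i--) { // Sattolo's algorithm: one big cycle
    size_t j = std::uniform_int_distribution<size_t>(0, i - 1)(gen);
    std::swap(next[i], next[j]);
  }
  printf("%u %u\n", chase(next, 10'000'000), chase2(next, 10'000'000));
}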

For each fixed number of lanes, I run the benchmark 30 times on an Amazon Intel c6i node running Ubuntu 22. The elapsed time varies from over 3 seconds per run (for one lane) down to about 1/6 s (for 30 lanes). I compute the mean, the maximum and the minimum, and I estimate the standard deviation. I then compute the bottom sigma as the gap (in number of standard deviations) between the minimum and the average, and the top sigma as the gap between the average and the maximum. If the distribution were normal, I should have roughly the same number of sigmas on either side (min and max) and I should not exceed 3 sigmas. I find that one side (the maximum) easily exceeds 3 sigmas, so it is not a normal distribution. It is also clearly not symmetrical.

Yet my measures are relatively precise: looking at the relative distance between the minimum and the mean, I get a tiny margin, often under 1%.

It is interesting that the more lanes I have, the more accurate the results. Intuitively, this is not entirely surprising: more lanes break up the data dependencies, so one bad step has less impact on the whole processing.

Thus, for a wide range of performance-related timings, you should not assume that you have a normal distribution without checking first! Computing the distance between the maximum and the mean divided by the standard deviation is a useful indicator. I personally find that a log normal distribution is a better model for my timings, at a high level.

Further reading: Log-normal Distributions across the Sciences: Keys and Clues. David Gilbertson has a related post: The mean misleads: why the minimum is the true measure of a function’s run time.

What are we going to do about ChatGPT?

Generative artificial intelligence, and in particular ChatGPT, has taken the world by storm. Some intellectuals are proposing we create a worldwide ban on advanced AI research, using military operations if needed. Many leading researchers and pundits are proposing a six-month ‘pause’, during which time we could not train more advanced intelligences.

The six-month pause is effectively unenforceable. We did lock down nearly the entire world to suppress covid. Except that, as the partygate scandal showed, high ranking politicians were happy to sidestep the rules. Do we seriously think that the NSA and the governments of the world are going to obey the pause?

Covid also showed that what starts as a short pause can be renewed again and again, until it becomes unsustainable. Philosophers and computer scientists have spent the last fifty years doing research on artificial-intelligence safety: do you think that we are due for a conceptual breakthrough in the next six months? We have had rogue artificial intelligences in mainstream culture for a long, long time: HAL 9000 in 2001: A Space Odyssey (1968), Mike in The Moon is a Harsh Mistress (1966). I grew up watching Lost in Space (1965), where an intelligent robot commonly causes harm. Six months won’t cut it for a new philosophical breakthrough. Shelley’s Frankenstein was published in 1818. A case can be made that the danger is not the monster itself, but rather how human beings deal with it: in some sense, it is Frankenstein himself who causes the tragedy, not his monster.

Going nuclear and banning advanced AI research for good is flat out unfeasible: it would almost certainly require going to war. You might object that we successfully stopped the proliferation of nuclear weapons… except that it is a bad analogy: what is suggested here is a complete worldwide ban. We never tried to ban nuclear weapons; we only limited them to a few powerful countries. If all you want is to keep advanced AI out of the hands of countries like North Korea, that can be arranged by cutting them off from the Internet and preventing them from importing computer equipment. But to stop Americans from doing artificial-intelligence research? When we tried to ban gain-of-function research on viruses, it moved to China.

Currently, only large corporations and governments can produce something akin to ChatGPT. However, it seems possible that the cost of training and running generative artificial-intelligence models might fall in the coming years. If it does, we might end up facing a choice between banning computing entirely or accepting powerful new artificial-intelligence software.

This being said, you can stall technological progress. It is even fairly easy. We basically stopped improving nuclear reactors decades ago by adding layers of regulations.

So while a worldwide ban or a pause are unlikely, it is possible to significantly harm artificial-intelligence research in parts of the world. We could require people who operate powerful computers to seek licenses. We could stop funding artificial-intelligence research. Would it work in China? Doubtful, but if we were to stop artificial intelligence just in North America and Europe, it would harm progress.

Let us be generous: maybe we could stop artificial intelligence. We might freeze it in its tracks. Should we?

We have a case of the one-sided bet fallacy. The one-sided bet fallacy goes as follows: a lottery ticket costs effectively nothing, maybe a dollar. However, you can win big (millions of dollars). Thus you should buy a lottery ticket. It is a fallacy because the cost of the ticket is not negligible; you only made it negligible in your head.

How is it an instance of one-sided bet fallacy? We assume that a pause is free, but that the risk is immense. Thus a pause seems like a good deal.

Human civilization is not stationary; it never was. That was the mistake Malthus made when he predicted that human beings were bound to constantly starve. His logic was mathematical: given more food, people reproduce exponentially, until they exhaust the food supply.

In most of the world, we use less land for farming, and yet we produce more than enough food for a growing population. Nearly everywhere but in Africa, South America and South Asia, we have more and more land occupied by forest. In the recent past, worldwide poverty has been going down, except maybe during the covid era.

For all we know, our near or medium term salvation could require more advanced artificial intelligence. Until we know more, we cannot tell whether the pause is more dangerous than continued progress.

But shouldn’t we listen to the experts? Firstly, let us remember that an expert is someone who has experience. Nobody has experience with the breakthrough that is current generative artificial intelligence. Nobody knows how it will impact our society. Before we gain some experts, we need experience. Secondly, we should listen to all the knowledgeable people, not just those who tell scary stories. Yann LeCun is as much an expert in artificial intelligence as anyone, and he is entirely against the pause.

Am I saying that we should do nothing? Do not fall for the Politician’s syllogism:

  1. We must do something.
  2. This is something.
  3. Therefore, we must do this.

It is another fallacy that you must do something, anything. What you must do when facing the unknown is to follow the OODA loop: observe–orient–decide–act. Notice how action comes last? Notice how the decision comes after you have observed?

In my estimation, we are dealing with a disruptive technology. Historically, disruptive technologies have nearly wiped out entire industries and entire occupations. Generative artificial intelligence will disrupt education: it is already doing so. But new technologies also bring tremendous benefits. McKinsey tells us that artificial intelligence might amount to 1.2 percent of additional economic growth per year. If so, that is an enormous contribution to society: turning it down is not an option if we want to get people out of poverty. I expect that we will see massive productivity gains in some industries, where a small team becomes able to do what required a large team before. The transformation is never immediate; it often takes longer than you would expect. But it will be there, soon.

C++20: consteval and constexpr functions

Optimizing compilers try to push as much of the computation as possible to compile time. In modern C++, you can declare a function ‘constexpr’, stating explicitly that the function may be executed at compile time.

The constexpr qualifier is not magical. There may not be any practical difference between an inline function and a constexpr function, as in this example:

inline int f(int x) {
  return x+1;
}

constexpr int fc(int x) {
  return x+1;
}

In both cases, the compiler may compute the result at compile time.

The fast_float C++ library allows you to parse strings into floating-point numbers at compile time if your compiler supports C++20. Thus you can write the following function, which returns either the parsed floating-point value or the number -1.0 in case of error:

constexpr double parse(std::string_view input) {
  double result;
  auto answer = fast_float::from_chars(input.data(),
    input.data()+input.size(), result);
  if(answer.ec != std::errc()) { return -1.0; }
  return result;
}

So what happens if you put parse("3.1416") in your code? Unfortunately, even at the highest optimization levels, you cannot be certain that the function will be computed at compile time.
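
There is a workaround while keeping the function constexpr: a constexpr variable must be initialized with a constant expression, so the following sketch forces compile-time evaluation (or else a compilation error). The trick must be repeated at each call site, however:

// forcing compile-time evaluation of a constexpr function: a constexpr
// variable requires a constant expression as its initializer
constexpr double pi = parse("3.1416");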

Thankfully, C++20 has another specifier, ‘consteval’, which ensures that the function is evaluated at compile time:

consteval double parse(std::string_view input) {
  double result;
  auto answer = fast_float::from_chars(input.data(),
    input.data()+input.size(), result);
  if(answer.ec != std::errc()) { return -1.0; }
  return result;
}

What happens if the argument of the function cannot be determined at compile time? You get a compilation error.
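
For instance, the following sketch should be rejected by the compiler, because the argument is only known at runtime:

double f3(std::string_view input) {
  return parse(input); // compilation error: input is not a constant expression
}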

With the consteval qualifier, the following two functions should be practically equivalent:

double f1() {
  return parse("3.1416");
}

double f2() {
  return 3.1416;
}

You can try it out with fast_float if you’d like.

Can GPT pass my programming courses?

Professor Caplan reports that ChatGPT, the new artificial intelligence that is all the rage, can pass his economics midterm and get an A.

What about computer science? I submitted to GPT one problem I offer my students…

Write a Java program that calculates the sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers that are divisible by three and also omitting numbers whose hundredth digit is 2 or 3 (e.g. 1201 or 3313).

Bing/GPT solves the problem with ease:

public class SumOfNumbers {
  public static void main(String[] args) {
    int sum = 0;
    for (int i = 1; i <= 10000; i++) {
      if (i % 3 != 0 && (i / 100) % 10 != 2 && (i / 100) % 10 != 3) {
        sum += i;
      }
    }
    System.out.println("The sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers that are divisible by three and also omitting numbers whose hundredth digit is 2 or 3 is: " + sum);
  }
}

Maybe it somehow knew just this particular answer. What if I tweak the question a bit?

Write a Java program that calculates the sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers where the hundredth digit is divisible by 3.

Bing/GPT gets it almost correct:

public class SumOfNumbers {
  public static void main(String[] args) {
    int sum = 0;
    for (int i = 1; i <= 10000; i++) {
      if ((i / 100) % 3 != 0) {
        sum += i;
      }
    }
    System.out.println("The sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers where the hundredth digit is divisible by 3 is: " + sum);
  }
}

What about other questions I often ask… such as…

Write a Java function which can invert the elements in an array of integers.

Bing/GPT gets it right, complete with a detailed explanation…

public static void invertArray(int[] arr) {
  int n = arr.length;
  for (int i = 0; i < n / 2; i++) {
    int temp = arr[i];
    arr[i] = arr[n - i - 1];
    arr[n - i - 1] = temp;
  }
}

What about database work? Here is a basic question I ask…

What can you say about the attributes A and B if A→B and B→A?

Bing/GPT tells me that…

If A→B and B→A, then A and B are said to be functionally dependent on each other. This means that the value of A uniquely determines the value of B, and vice versa. In other words, if you know the value of A, you can determine the value of B, and if you know the value of B, you can determine the value of A

This is correct.

What about a more technical database question?

Bing/GPT answers…

SELECT COUNT(*) FROM table_name WHERE A = B

Unfortunately, this ability extends to the exams themselves: Bing/GPT gets much of them right. In the current “introduction to programming” exam, there was only one question that GPT got obviously wrong, and even then I would have awarded some points ‘for effort’.

Bing/GPT can pass my introductory computer science courses.

I further asked it to solve a problem using the C++ simdjson library. We are definitely well beyond introductory courses here. GPT did well enough: its answer is almost correct.

Then I asked it to produce its own SIMD routine in C++ to count the number of characters equal to ‘.’, and it did so… It is almost correct…

Bing/GPT can also solve mathematical problems, in French.

Counting cycles and instructions on ARM-based Apple systems

In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how we could have access to “performance counters” to count how many cycles and instructions a given piece of code took on ARM-based mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors and my current laptop runs on the M2 processor. Sadly, my original code only works for the M1 processor.

Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works both under Linux and on ARM-based Macs. The code has benefited from contributions from Wojciech Muła and John Keiser.

For the most part, you set up a global event_collector instance, and then you surround the code you want to benchmark with collector.start() and collector.end(), pushing the results into an event_aggregate:

#include "performancecounters/event_counter.h"

event_collector collector;

void function(); // the function to benchmark

void f() {
  const size_t repeat = 1000; // number of measurement rounds
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
    collector.start();
    function(); // benchmark this function
    event_count allocate_count = collector.end();
    aggregate << allocate_count;
  }
}

And then you can query the aggregate to get the average or best performance counters; a small reporting sketch follows the list:

aggregate.elapsed_ns() // average time in ns
aggregate.instructions() // average # of instructions
aggregate.cycles() // average # of cycles
aggregate.best.elapsed_ns() // time in ns (best round)
aggregate.best.instructions() // # of instructions (best round)
aggregate.best.cycles() // # of cycles (best round)
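
For instance, a reporting helper might look as follows (a sketch: I assume these accessors return floating-point values, which is how I use them in my benchmarks):

#include <cstdio>

void report(const event_aggregate &aggregate) {
  printf("average: %8.1f ns %8.1f cycles %8.1f instructions\n",
         aggregate.elapsed_ns(), aggregate.cycles(), aggregate.instructions());
  printf("best   : %8.1f ns %8.1f cycles %8.1f instructions\n",
         aggregate.best.elapsed_ns(), aggregate.best.cycles(),
         aggregate.best.instructions());
}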

I updated my original benchmark which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
                           strtod     33.10 ns/float    428.06 instructions/float
                                      75.32 cycles/float
                                       5.68 instructions/cycle
                        fastfloat      9.53 ns/float    193.78 instructions/float
                                      27.24 cycles/float
                                       7.11 instructions/cycle

The code is freely available for research purposes.

Runtime asserts are not free

When writing software in C and C++, it is common to add C asserts to check that some conditions are satisfied at runtime. Often, it is a simple comparison between two values.

In many instances, these asserts are effectively free as far as performance goes, but not always. If they appear within a tight loop, they may impact the performance.

One might object that you can choose to only enable assertions for the debug version of your binary… but this choice is subject to debate. Compilers like GCC or clang (LLVM) do not deactivate asserts when compiling with optimizations. Some package maintainers require all asserts to remain in the release binary.

What do I mean by expensive? Let us consider an example. Suppose I want to copy an array. I might use the following code:

for (size_t i = 0; i < N; i++) {
  x1[i] = x2[i];
}

Suppose that I know that all my values are smaller than some threshold, but I want to be double sure. So during the copy, I might add a check:

for (size_t i = 0; i < N; i++) {
  assert(x2[i] < RAND_MAX);
  x1[i] = x2[i];
}

It is an inexpensive check. But how much does it cost me? I wrote a small benchmark which I run on an M2 processor after compiling the code with clang (LLVM 14). I get the following result:

simple loop 0.3 ns/word
loop with assert 0.9 ns/word

So adding the assert multiplies the running time by a factor of three in this instance. Your results will vary, but you should not be surprised if the version with asserts is significantly slower.

So asserts are not free. To make matters more complicated, some projects refuse to rely on libraries that terminate on errors. So having asserts in your library may disqualify it for some uses.

You will find calls to asserts in my code. I do not argue that you should never use them. But spreading asserts in performance critical code might be unwise.
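
One possible compromise is a sketch like the following (my own macro, not a standard facility): keep ordinary asserts, but guard the ones in hot loops behind a dedicated macro that compiles away unless the build explicitly enables it.

#include <cassert>
#include <cstddef>
#include <cstdlib>

// hypothetical macro: checks in hot loops vanish unless the build
// defines ENABLE_HOT_ASSERTS
#ifdef ENABLE_HOT_ASSERTS
#define HOT_ASSERT(cond) assert(cond)
#else
#define HOT_ASSERT(cond) ((void)0)
#endif

void checked_copy(const int *x2, int *x1, size_t N) {
  for (size_t i = 0; i < N; i++) {
    HOT_ASSERT(x2[i] < RAND_MAX); // costs nothing unless explicitly enabled
    x1[i] = x2[i];
  }
}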

Precision, recall and why you shouldn’t crank up the warnings to 11

Recently, the code hosting site GitHub widely deployed a tool called CodeQL with rather aggressive settings. It does static analysis on the code and attempts to flag problems. I use the phrase “static analysis” to refer to an analysis that does not run the code. Static analysis is limited: it can identify a range of actual bugs, but it also tends to catch false positives: code patterns that it thinks are bugs but aren’t.

Several Intel engineers proposed code to add AVX-512 support to a library I help support. We got scary warnings from CodeQL.

CodeQL complained that we take as input a pointer to 8-byte words and treat it as if it were a pointer to 64-byte words. If you work with AVX-512 and are providing optimized replacements for existing functions, such code is standard. And no compiler that I know of, even at the most extreme settings, will ever issue a warning, let alone a scary “High severity Check Failure”.

On its own, this is merely a small annoyance that I can ignore. However, I fear that it is part of a larger trend where people come to rely more and more on overbearing static analysis to judge code quality. The more warnings, the better, they think.

And indeed, surely, the more warnings a linter/checker can generate, the better it is?

No. It is incorrect for several reasons:

  1. Warnings and false errors may consume considerable programmer time. You may say ‘ignore them’ but even if you do, others will be exposed to the same warnings and they will be tempted to either try to fix your code, or to report the issue to you. Thus, unless you work strictly alone or in a closed group, it is difficult to escape the time sink.
  2. Training young programmers to avoid non-issues may make them less productive. The two most important features of software are (in order) correctness (whether it does what you say it does) and performance (whether it is efficient). Fixing shallow warnings is easy work, but it often contributes neither to correctness (i.e., it does not fix bugs) nor to performance (it does not make the code any faster). You may feel productive, and it may look like you are changing much code, but what are you gaining?
  3. Modifying code to fix a non-issue has a non-zero chance of introducing a real bug. If you have code that has been running in production for a long time, without error… trying to “fix it” (when it is not broken) may actually break it. You should be conservative about fixing code without strong evidence that there is a real issue. Your default behaviour should be to refuse to change the code unless you can see the benefits. There are exceptions but almost all code changes should either fix a real bug, introduce a new feature or improve the performance.
  4. When programming, you need to clear your mental space. Distractions are bad. They make you dumber. So your programming environment should not have unnecessary features.

Let us use some mathematics. Suppose that my code has bugs, and that a static checker has some probability of catching a bug each time it issues a warning. In my experience, this probability can be low… but the exact percentage is not important to the big picture. Let me use a reasonable model: given B bugs per 1000 lines, the probability that a warning has caught a bug follows a logistic function, say 1/(1+exp(10-B)). So if I have 10 bugs per 1000 lines of code, then each warning has a 50% probability of being useful. This is quite optimistic.

The recall is the fraction of the bugs that I have caught. If I have 20 bugs per 1000 lines in my code, then issuing a million warnings will almost ensure that all bugs are caught. But the human beings reviewing the warnings would need to do a lot of work.

So given B, how many warnings should I issue? Of course, in the real world I do not know B, and I do not know that the usefulness of the warnings follows a logistic function, but humour me.

A reasonable answer is that we want to maximize the F-score: the harmonic mean between the precision and the recall.

I hastily coded a model in Python where I vary the number of warnings. The recall always increases while the precision always falls. The F-score rises and then falls: having no warnings is terrible, but having too many is just as bad. With a small number of warnings, you maximize the F-score.
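
Here is a sketch in the same spirit (my own reconstruction in C++, not the original Python): B bugs are present, and the number of true positives among W warnings grows with diminishing returns. Precision falls as W grows, recall rises, and the F-score peaks somewhere in between.

#include <cmath>
#include <cstdio>

int main() {
  const double B = 10.0; // bugs per 1000 lines (an assumed value)
  for (int W = 10; W <= 100; W += 10) {
    double true_positives = B * (1.0 - std::exp(-0.05 * W));
    double precision = true_positives / W;
    double recall = true_positives / B;
    // F-score: harmonic mean of precision and recall
    double f_score = 2 * precision * recall / (precision + recall);
    printf("warnings=%3d precision=%.2f recall=%.2f F-score=%.2f\n", W,
           precision, recall, f_score);
  }
}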

A more intuitive description of the issue is that the more warnings you produce, the more likely you are to waste programmer time. You are also more likely to catch bugs. One is negative, one is positive. There is a trade-off. When there is a trade-off, you need to seek the sweet middle point.

The trend toward an ever increasing number of warnings does not improve productivity. In fact, at the margin, disabling the warnings entirely might be just as productive as having the warnings: the analysis has zero value.

I hope that it is not a symptom of a larger trend where programming becomes bureaucratic. Software programming is one of the key industries where productivity gains have been fantastic and where we have been able to innovate at great speed.

Science and Technology links (March 11 2023)

  1. In 1500, China was the largest economy in the world, followed by India and France. The USA did not exist yet. In 1700, 4% of human beings lived in France. In the mid-18th century, there were 25 million inhabitants in France and 5.5 million in England. Yet France had slipped to sixth place as an economic power worldwide, while the United Kingdom moved up: France was beginning a sustained decline. So what happened to France? It appears that decades before the French revolution, France became less religious, and that was the driver of a large decrease in fertility.
  2. Men who do physical labor have higher sperm counts and testosterone levels.
  3. You can take human brain cells, put them in a rat, and these cells will integrate into the rat’s brain.
  4. The top 1 per cent in income score slightly worse on cognitive ability than those in the income strata right below them according to Keuschnigg et al. (2023).
  5. Both hemispheres (south and north) absorb the same amount of energy from the Sun. According to Hadas et al., the presence of more ocean causes the southern hemisphere to absorb more sunlight; however, it also makes it stormier, increasing the cloud albedo and balancing out the difference in reflected sunlight. In effect, the atmosphere balances out the greater forcing.
  6. Half of elderly American men take statins. In patients suffering from Alzheimer’s, statins worsen cognition: after stopping statins, the patients improved.
  7. Though earlier research concluded that happiness increases with income only up to a point, Killingsworth et al. (2023) find there is a robust linear-log relationship between income and happiness, meaning that each time you multiply your income, your happiness goes up a level. Note that this is a relationship, not necessarily a causal effect.
  8. Sea levels were about 1.5 meters higher 6000 years ago.
  9. High-intensity acute exercise enhances memory.
  10. According to the CDC, 13% of all Americans take antidepressants at any given time. About 15% of patients taking antidepressants have an effect that is above the placebo effect. In other words, most people take antidepressants for no benefit. But antidepressants may not be without risks. Antidepressants are associated with dementia: they may put you more at risk for Alzheimer’s.
  11. Despite substantial reductions in anthropogenic CO2 emissions in early 2020, the annual atmospheric CO2 growth rate did not decline, according to Laughner et al. (2021). They write:

    The lack of clear declines in the atmospheric growth rates of CO2 and CH4, despite large reductions in human activity, reflect carbon-cycle feedbacks in air–sea carbon exchange, large interannual variability in the land carbon sink, and the chemical lifetime of CH4. These feedbacks foreshadow similar challenges to intentional mitigation.

  12. By giving the blood of a young mouse to an old mouse, we appear to be able to at least partially rejuvenate the brain of the old mouse.
  13. The gas H2S increased longevity in worms.
  14. The energy density of Li-ion batteries has increased by less than 3% in the last 25 years according to Li (2019).
  15. There were giant penguins during the Paleocene. They might have weighed 150 kg, far more than most human beings. The climate during the Paleocene was generally tropical and the poles were ice-free.
  16. Young men sometimes outperform young women in standardized mathematical tests, while the opposite is true in some other domains. The gap is larger in gender-equal countries.
  17. There are still compulsory sterilization laws in the USA: the State may legally require your sterilization.
  18. We used to think that the earliest human beings (homo sapiens) migrated out of Africa 120,000 years ago. However, fossils as old as 177,000 years have since been found outside of Africa. The oldest homo sapiens fossils were found in Morocco (315,000 years old), though there are some doubts as to whether they belong to homo sapiens.
  19. It appears that ChatGPT is biased against conservative politicians. One might expect the system to express the biases of its creators.

Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor

Programmers sometimes need to trim, or remove, characters such as spaces from strings. It can be a surprisingly expensive task. In C++, the following function is efficient:

size_t trimspaces(const char *s, size_t len, char *out) {
  char * init_out{out};
  for(size_t i = 0; i < len; i++) {
    *out = s[i];
    out += (s[i] != ' ');
  }
  return out - init_out;
}

Basically, we write every input character to the output, but we only advance the output pointer when the character is not a space.

Amazon makes available new ARM-based systems relying on their Graviton 3 processors. These processors support advanced instructions called “SVE” (Scalable Vector Extension). One very nice family of SVE instructions can ‘compact’ values (effectively, remove unwanted values). I put it to good use when filtering out integer values from arrays.

Unfortunately, the SVE compact instructions cannot be directly applied to the problem of pruning spaces from strings, because they only operate on larger words (e.g., 32-bit words). Fortunately, it is possible to load bytes directly into 32-bit values, so that each byte value occupies 32 bits in memory, using intrinsic functions such as svld1sb_u32. As you would expect, you can also do the reverse, and take an array of 32-bit values and automatically convert it to a byte array (e.g., using svst1b_u32).

Thus I can take my byte array (a string), load it into temporary 32-bit vectors, prune these vectors, and then store the result as a byte array, back into a string. The following C++ code is a reasonable implementation of this idea:

size_t sve_trimspaces(const char *s, size_t len, char *out) {
  uint8_t *out8 = reinterpret_cast<uint8_t *>(out);
  size_t i = 0;
  for (; i + svcntw() <= len; i += svcntw()) {
    svuint32_t input = svld1sb_u32(svptrue_b32(), (const int8_t *)s + i);
    svbool_t matches = svcmpne_n_u32(svptrue_b32(), input, 32); // 32 is ' '
    svuint32_t compressed = svcompact_u32(matches, input);
    svst1b_u32(svptrue_b32(), out8, compressed);
    out8 += svcntp_b32(svptrue_b32(), matches);
  }
  if (i < len) { // process the remaining tail with a partial mask
    svbool_t read_mask = svwhilelt_b32(i, len);
    svuint32_t input = svld1sb_u32(read_mask, (const int8_t *)s + i);
    svbool_t matches = svcmpne_n_u32(read_mask, input, 32);
    svuint32_t compressed = svcompact_u32(matches, input);
    svst1b_u32(read_mask, out8, compressed);
    out8 += svcntp_b32(read_mask, matches);
  }
  return out8 - reinterpret_cast<uint8_t *>(out);
}

Is it faster? Using GCC 12 on a Graviton 3 system, I find that the SVE approach is 3.6 times faster and that it uses 6 times fewer instructions. The SVE code is not six times faster because it retires fewer instructions per cycle. My code is available.

conventional code 1.8 cycles/byte 7 instructions/byte
SVE code 0.5 cycles/byte 1.1 instructions/byte

You can achieve a similar result with ARM NEON, but you require large tables to emulate the compression instructions. Large tables lead to code bloat which, at the margin, is not desirable.

Float-parsing benchmark: Regular Visual Studio, ClangCL and Linux GCC

Windows users have choices when it comes to C++ programming. You may choose to stick with the regular Visual Studio. If you prefer, Microsoft makes available ClangCL which couples the LLVM compiler (commonly used by Apple) with the Visual Studio backend. Further, under Windows, you may easily build software for Linux using the Windows Subsystem for Linux.

Programmers often need to convert a string (e.g., 312.11) into a binary floating-point number. It is a common task when doing data science.

I wrote a small set of benchmark programs to measure this speed. To build and run them, I use an up-to-date Visual Studio (2022) with the latest ClangCL component, and the latest version of CMake.

After downloading it on your machine, you may build it with CMake. Under Linux, you may do:

cmake -B build && cmake --build build

And then you can execute like so:

./build/benchmarks/benchmark -f data/canada.txt

The Microsoft Visual Studio usage is similar except that you must specify the build type (e.g., Release):

cmake -B build && cmake --build build --config Release

For ClangCL, it is almost identical, except that you need to add -T ClangCL:

cmake -B build -T ClangCL && cmake --build build --config Release

Under Windows, the binary is not produced at build/benchmarks/benchmark but rather at build/benchmarks/Release/benchmark.exe, but the commands are otherwise the same. I use the default CMake flags corresponding to the Release build.

I run the benchmark on a laptop with Tiger Lake Intel processor (i7-11370 @ 3.3 GHz). It is not an ideal machine for benchmarking so I indicate the error margin on the speed measurements.

Among other libraries, my benchmark puts the fast_float library to the test: it is an implementation of the C++17 from_chars function for floats and doubles. Microsoft has its own implementation of this function which I include in the mix.
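
For reference, here is a minimal usage sketch of the function under test, C++17 std::from_chars for doubles; fast_float::from_chars offers the same interface. The fallback parameter is my own addition for illustration.

#include <charconv>
#include <string_view>
#include <system_error>

double parse_double(std::string_view s, double fallback) {
  double value;
  auto result = std::from_chars(s.data(), s.data() + s.size(), value);
  return (result.ec == std::errc()) ? value : fallback; // fallback on error
}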

My benchmarking code remains identical (in C++) when I switch system or compiler: I simply recompile it.

Visual Studio std::from_chars 87 MB/s (+/- 20 %)
Visual Studio fast_float 285 MB/s (+/- 24 %)
ClangCL fast_float 460 MB/s (+/- 36 %)
Linux GCC11 fast_float 1060 MB/s (+/- 28 %)

We can observe that there are large performance differences. All the tests run on the same machine. The Linux build runs under the Windows Subsystem for Linux, and you do not expect the subsystem to run computations faster than Windows itself. The benchmark does not involve disk access. The benchmark is not allocating new memory.

It is not the first time that I notice that the Visual Studio compiler provides disappointing performance. I have yet to read a good explanation for why that is. Some people blame inlining, but I have yet to find a scenario where a release build from Visual Studio had poor inlining.

I am not ruling out methodological problems but I almost systematically find lower performance under Visual Studio when doing C++ microbenchmarks, despite using different benchmarking methodologies and libraries (e.g., Google Benchmark). It means that if there is a methodological issue, it is deeper than a mere programming typo.

ARM vs Intel on Amazon’s cloud: A URL Parsing Benchmark

Twitter user opdroid1234 remarked that they are getting more performance out of the ARM nodes than out of the Intel nodes on Amazon’s cloud (AWS).

I found previously that the Graviton 3 processors had less bandwidth than comparable Intel systems. However, I have not done much in terms of raw compute power.

The Intel processors have the crazily good AVX-512 instructions: ARM processors have nothing close except for dedicated accelerators. But what about more boring computing?

We wrote a fast URL parser in C++. It does not do anything beyond portable C++: no assembly language, no explicit SIMD instructions, etc.

Can the ARM processors parse URLs faster?

I am going to compare the following node types:

  • c6i.large: Intel Ice Lake (0.085 US$/hour)
  • c7g.large: Amazon Graviton 3 (0.0725 US$/hour)

I am using Ubuntu 22.04 on both nodes. I make sure that cmake, ICU and GNU G++ are installed.

I run the following routine:

  • git clone https://github.com/ada-url/ada
  • cd ada
  • cmake -B build -D ADA_BENCHMARKS=ON
  • cmake --build build
  • ./build/benchmarks/bench --benchmark_filter=Ada

The results are that the ARM processor is indeed slightly faster:

Intel Ice Lake 364 ns/url
Graviton 3 320 ns/url

The Graviton 3 processor is about 15% faster. It is not the 20% to 30% that opdroid1234 reports, but the Graviton 3 nodes are also slightly cheaper.

Please note that (1) I am presenting just one data point, I encourage you to run your own benchmarks (2) I am sure that opdroid1234 is being entirely truthful (3) I love all processors (Intel, ARM) equally (4) I am not claiming that ARM is better than Intel or AMD.

Note: I do not own stock in ARM, Intel or Amazon. I do not work for any of these companies.

Further reading: Optimized PyTorch 2.0 inference with AWS Graviton processors

Regular Visual Studio versus ClangCL

If you are programming in C++ using Microsoft tools, you can use the traditional Visual Studio compiler. Or you can use LLVM as a front-end (ClangCL).

Let us compare their performance characteristics with a fast string transcoding library (simdutf). I use an up-to-date Visual Studio (2022) with the latest ClangCL component (based on LLVM 15). For building the library, we use the latest version of CMake. I will abstain from parallelizing the build: I use default settings. Hardware-wise, I use a Microsoft Surface Laptop Studio: it has a Tiger Lake Intel processor (i7-11370 @ 3.3 GHz).

After grabbing the simdutf library from GitHub, I prepare the build directory for standard Visual Studio:

> cmake -B buildvc

I do the same for ClangCL:

> cmake -B buildclangcl -T ClangCL

You may also build directly with LLVM:

> cmake -B buildfastclang  -D CMAKE_LINKER="lld"   -D CMAKE_CXX_COMPILER=clang++  -D CMAKE_C_COMPILER=clang -D CMAKE_RC_COMPILER=llvm-rc

For each build directory, I can build in Debug mode (--config Debug) or in Release mode (--config Release) with commands such as

> cmake --build buildvc --config Debug

The project builds an extensive test suite by default. I often rely on my Apple MacBook, and I build a lot of software using Amazon (AWS) nodes. I use an AWS c6i.large node (Intel Ice Lake running at 3.5 GHz, 2 vCPUs).

The simdutf library and its testing suite build in a reasonable time as illustrated by the following table (Release builds). For comparison purposes, I also build the library using ‘WSL’ on the Microsoft laptop (Windows Subsystem for Linux).

macbook air ARM M2 processor LLVM 14 25 s
AWS/Linux Intel Ice Lake processor GCC 11 54 s
AWS/Linux Intel Ice Lake processor LLVM 14 54 s
WSL (Microsoft Laptop) Intel Tiger Lake processor GCC 11 1 min*
WSL (Microsoft Laptop) Intel Tiger Lake processor LLVM 14 1 min*

On Intel processors, we build multiple kernels to support the various families of processors. On a 64-bit ARM processor, we only build one kernel. Thus the performance of the AWS/Linux system and the macbook is somewhat comparable.

Let us switch back to Windows and build the library.

Debug Release
Visual Studio (default) 2 min 2 min 15 s
ClangCL 2 min 51 s 3 min 12 s
Windows LLVM (direct with lld) 2 min 2 min 4 s

Let us run an execution benchmark. We pick a UTF-8 Arabic file that we load in memory and that we transcode to UTF-16 using a fast AVX-512 algorithm. (The exact command is benchmark -P convert_utf8_to_utf16+icelake -F Arabic-Lipsum.utf8.txt).

Debug Release
Visual Studio (default) 0.789 GB/s 4.2 GB/s
ClangCL 0.360 GB/s 5.9 GB/s
WSL GCC 11 (omitted) 6.3 GB/s
WSL LLVM 14 (omitted) 5.9 GB/s
AWS Server (GCC) (omitted) 8.2 GB/s
AWS Server (clang) (omitted) 7.7 GB/s

I draw the following tentative conclusions:

  1. There may be a significant performance difference between Debug and Release code (e.g., between a 5x and a 15x difference).
  2. Compiling your Windows software with ClangCL may lead to better performance (in Release mode). In my test, I get a 40% speedup with ClangCL. However, compiling with ClangCL takes much longer. I have recommended that Windows users build their libraries with ClangCL and I maintain this recommendation.
  3. In Debug mode, the regular Visual Studio produces more performant code and it compiles faster than ClangCL.

Thus it might make sense to use the regular Visual Studio compiler in Debug mode as it builds fast and offers other benefits while testing the code, and then ClangCL in Release mode for the performance.

You may bypass ClangCL under Windows and build directly with clang with the LLVM linker for faster builds. However, I have not verified that you get the same speed.

No matter what I do, however, I seem to be getting slower builds under Windows than I expect. I am not exactly sure why building the code takes so much longer under Windows. It is not my hardware, since building with Linux under Windows, on the same laptop, is fast.

Of course, parallelizing the build is the most obvious way to speed it up: appending -- -m to my CMake build commands would help. I deliberately avoided parallel builds in this experiment.

WSL Update. Reader Quan Anh Mai asked me whether I had made a copy of the source files in the Windows Subsystem for Linux drive, and I had not. Doing so multiplied the speed by a factor of three. I included the better timings.

Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)

Computers represent strings using bytes. Most often, we use the Unicode standard to represent characters in bytes. The universal format to exchange strings online is called UTF-8. It can represent over a million characters while retaining compatibility with the ancient ASCII format.

Though most of our software stack has moved to Unicode strings, there are still older standards like Latin 1 used for European strings. Thus we often need to convert Latin 1 strings to UTF-8. It is useful to first compute the size (in bytes) of the eventual UTF-8 strings.

Thankfully, it is an easy problem: it suffices to count two output bytes for each input byte that has its most significant bit set, and one output byte for the other bytes. A relatively simple C++ function suffices:

size_t scalar_utf8_length(const char * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i<len; i++) {
    if((c[i]>>7)) { answer++;}
  }
  return answer + len;
}

To go faster, we may want to use fancy “SIMD” instructions: instructions that process several bytes at once. Your compiler is likely to already autovectorize the simple function above: at compile time, it will use some SIMD instructions. However, you can try to hand-code your own version.

We have several instruction sets to choose from, but let us pick the AVX2 instruction set, available on most x64 processors today. AVX2 has a fast movemask instruction that can extract the most significant bit of each byte, and there is a fast instruction to count the resulting bits (popcount). Thus the following routine should do well (credit to Anna Henningsen):

size_t avx2_utf8_length_basic(const uint8_t *str, size_t len) {
  // each input byte contributes at least one output byte
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i;
  for (i = 0; i + sizeof(__m256i) <= len; i += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
    // count the bytes with their most significant bit set
    answer += __builtin_popcount(_mm256_movemask_epi8(input));
  }
  return answer + scalar_utf8_length((const char *)(str + i), len - i);
}

Can you do better? On Intel processors, both the movemask and the population-count instructions are fast, but they have some latency: it takes several cycles for them to execute. They may also have additional execution constraints. Part of the latency and of the constraints will not improve with instruction sets like AVX-512, because the approach requires moving from SIMD registers to general-purpose registers at every iteration. It will similarly be a bit of a challenge to port this routine to ARM processors.

Thus we would like to rely on cheaper instructions, and stay in SIMD registers until the end. Even if it does not improve the speed of the AVX code, it may work better algorithmically with other instruction sets.

For this purpose, we can borrow a recipe from the paper Faster Population Counts Using AVX2 Instructions (Computer Journal 61 (1), 2018). The idea is to quickly extract the bits and add them to a counter that stays within a SIMD register, only extracting the values at the end.

The code is slightly more complicated because we have an inner loop. Within the inner loop, we use 8-bit counters, only moving to 64-bit counters at the end of the inner loop. To ensure that there is no overflow, the inner loop can only run for 255 iterations. The code looks as follows…

size_t avx2_utf8_length_mkl(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i = 0;
  __m256i four_64bits = _mm256_setzero_si256();
  while (i + sizeof(__m256i) <= len) {
    __m256i runner = _mm256_setzero_si256(); // 32 8-bit counters
    size_t iterations = (len - i) / sizeof(__m256i);
    if (iterations > 255) { iterations = 255; } // avoid 8-bit overflow
    size_t max_i = i + iterations * sizeof(__m256i) - sizeof(__m256i);
    for (; i <= max_i; i += sizeof(__m256i)) {
      __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
      // subtracting -1 adds 1 for each byte with its most significant bit set
      runner = _mm256_sub_epi8(
        runner, _mm256_cmpgt_epi8(_mm256_setzero_si256(), input));
    }
    // horizontally add the 8-bit counters into four 64-bit counters
    four_64bits = _mm256_add_epi64(four_64bits,
      _mm256_sad_epu8(runner, _mm256_setzero_si256()));
  }
  answer += _mm256_extract_epi64(four_64bits, 0) +
    _mm256_extract_epi64(four_64bits, 1) +
    _mm256_extract_epi64(four_64bits, 2) +
    _mm256_extract_epi64(four_64bits, 3);
  return answer + scalar_utf8_length((const char *)(str + i), len - i);
}

We can also further unroll the inner loop to save a bit on the number of instructions.

I wrote a small benchmark with 8kB random inputs, with an AMD Rome (Zen2) server and GCC 11 (-O3 -march=native). The results will vary depending on your input, your compiler and your processor.

function cycles/byte instructions/byte instructions/cycle
scalar (no autovec) 0.89 3.3 3.8
scalar (w. autovec) 0.56 0.71 1.27
AVX2 (movemask) 0.055 0.15 2.60
AVX2 (in-SIMD) 0.039 0.15 3.90
AVX2 (in-SIMD/unrolled) 0.028 0.07 2.40

So the fastest method in my test is over 30 times faster than the purely scalar version. If I allow the scalar version to be ‘autovectorized’ by the compiler, it gets about 50% faster.

It is an interesting scenario because the number of instructions retired per cycle varies so much. The slightly more complex “in-SIMD” function does better than the movemask function because it manages to retire more instructions per cycle in these tests. The unrolled version is fast and requires few instructions per byte, but it has a relatively low number of instructions retired per cycle.

My code is available. It should be possible to further tune this code. You need privileged access to run the benchmark because I rely on performance counters.

Further work: it is not necessary to use SIMD instructions explicitly; you can use SWAR instead, as a compromise between portability and performance. A sketch follows.
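
Here is a minimal SWAR sketch (my own adaptation, relying on the GCC/clang builtin __builtin_popcountll): it processes eight bytes at a time with ordinary 64-bit operations, counting the bytes whose most significant bit is set.

#include <cstddef>
#include <cstdint>
#include <cstring>

size_t swar_utf8_length(const char *c, size_t len) {
  size_t answer = len;
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, c + i, 8);   // safe unaligned load
    word &= 0x8080808080808080ULL;  // keep the top bit of each byte
    answer += size_t(__builtin_popcountll(word));
  }
  for (; i < len; i++) {            // scalar tail
    answer += uint8_t(c[i]) >> 7;
  }
  return answer;
}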

Follow-up: I have a NEON version of this post.

Science and Technology links (February 12 2023)

  1. Kenny finds that the returns on education are declining. Rich countries are spending more on education, with comparatively weaker test results. It costs more than ever to train a PhD student, and it takes ever longer for them to complete their studies. There are 25 times more crop researchers today than in the 1970s, but the rate of progress remains constant. Researchers are better at producing patents, but the quality of these patents may be decreasing. Furthermore, there aren’t ever more people to educate: 2012 was “peak child”, the year in history when we had the most births. We are now in decline. Fertility levels are below replacement levels in most high-income countries.
  2. Exercise prevents muscle aging at the gene-expression level.
  3. 28% of people in Japan are 65 years old or older.
  4. People who undergo colonoscopies reduce their risk of death by 0.01% in absolute terms.
  5. Jupiter has 92 known moons while Saturn has 83; however, these counts increase over time. Most of these moons are tiny objects.
  6. Homo erectus could still be found 100,000 years ago in Indonesia. They may have mated with Denisovans, who themselves cohabited with modern human beings.
  7. Living in the country is not necessarily cheaper than living in a city.
  8. Though it is often believed that the second world war led to increased industrial output in the USA, the opposite is true.
  9. The city of Venice has been ‘sinking’ underwater for centuries. However, in the last few years, the trend has been reversed: Venice is gaining land.
  10. Both left-wing and right-wing individuals can hold authoritarian preferences. Left-wing authoritarianism is associated with the belief in a COVID disease threat, in an ecological threat, or in a Trump threat.

Bit Hacking (with Go code)

At a fundamental level, a programmer needs to manipulate bits. Modern processors operate over data by loading it into ‘registers’, not by loading individual bits. Thus a programmer must know how to manipulate the bits within a register. Generally, we can do so while programming with 8-bit, 16-bit, 32-bit and 64-bit integers. For example, suppose that I want to set an individual bit to the value 1. Let us pick the bit at index 12 in a 64-bit word. The word with just the bit at index 12 set is 1<<12: the number 1 shifted to the left 12 times, or 4096.

In Go, we format numbers using the fmt.Printf function: we use a string with formatting instructions followed by the values we want to print. We begin a formatting sequence with the letter % which has a special meaning (if one wants to print %, one must use the string %%). It can be followed by the letter b which stands for binary, the letter d (for decimal) or x (for hexadecimal). Sometimes we want to specify the minimal length (in characters) of the output, and we do so by a leading number: e.g., fmt.Printf("%100d", 4096) prints a 100-character string that ends with 4096 and begins with spaces. We can specify zero as the padding character rather than the space by adding it as a prefix (e.g., "%0100d"). Thus, in Go, we may print the individual bits in a word as in the following example:

package main

import "fmt"

func main() {
    var x uint64 = 1 << 12
    fmt.Printf("%064b", x)
}

Running this program we get a binary string representing 1<<12:

0000000000000000000000000000000000000000000000000001000000000000

The general convention when printing numbers is that the most significant digits are printed first followed by the least significant digits: e.g., we write 1234 when we mean 1000 + 200 + 30 + 4. Similarly, Go prints the most significant bits first, and so the number 1<<12 has 64-13=51 leading zeros followed by a 1 with 12 trailing zeros.

We might find it interesting to revisit how Go represents negative integers. Let us take the 64-bit integer -2. Using two’s complement notation, the number is represented as the unsigned number (1<<64)-2, a word made entirely of ones except for the least significant bit. We can use the fact that a cast operation in Go (e.g., uint64(x)) preserves the binary representation:

package main

import "fmt"

func main() {
    var x int64 = -2
    fmt.Printf("%064b", uint64(x))
}

This program will print 1111111111111111111111111111111111111111111111111111111111111110 as expected.

Go has some relevant binary operators that we often use to manipulate bits:

&    bitwise AND
|    bitwise OR
^    bitwise XOR
&^   bitwise AND NOT

Furthermore, the symbol ^ is also used to flip all bits of a word when used as a unary operation: a ^ b computes the bitwise XOR of a and b, whereas ^a flips all bits of a. We can verify that a|b == (a^b) | (a&b) == (a^b) + (a&b).

We have other useful identities. For example, given two integers a and b, we have that a+b == (a^b) + 2*(a&b). In this identity, 2*(a&b) represents the carries whereas a^b represents the addition without the carries. Consider for example 0b1001 + 0b0011. The bottom bits generate a carry: 0b1 + 0b1 == 0b10, and this is the 2*(a&b) component, whereas the bits that differ between the two numbers add without a carry and are captured by a^b == 0b1010; the total is 0b1010 + 0b0010 == 0b1100. We have that 2*(a|b) == 2*(a&b) + 2*(a^b), thus a+b == (a^b) + 2*(a&b) becomes a+b == 2*(a|b) - (a^b). These relationships are valid whether we consider unsigned or signed integers, since the operations (bitwise logical, addition and subtraction) are identical at the bit level.

Setting, clearing and flipping bits

We know how to create a 64-bit word with just one bit set to 1 (e.g., 1<<12). Conversely, we can also create a word that is made of 1s except for a 0 at bit index 12 by flipping all bits: ^uint64(1<<12). Before flipping all bits of an expression, it is sometimes useful to specify its type (taking uint64 or uint32) so that the result is unambiguous.

We can then use these words to affect an existing word:

  1. If we want to set the 12th bit of word w to one: w |= 1<<12.
  2. If we want to clear (set to zero) the 12th bit of word w: w &^= 1<<12 (which is equivalent to w = w & ^uint64(1<<12)).
  3. If we just want to flip (send zeros to ones and ones to zeros) the 12th bit: w ^= 1<<12.

We may also affect a range of bits. For example, we know that the word (1<<12)-1 has its 12 least significant bits set to ones, and all other bits set to zeros.

  1. If we want to set the 12 least significant bits of the word w to ones: w |= (1<<12)-1.
  2. If we want to clear (set to zero) the 12 least significant bits of word w: w &^= (1<<12)-1.
  3. If we want to flip the 12 least significant bits: w ^= (1<<12)-1.
    The expression (1<<12)-1 is general in the sense that if we want to select the 60 least significant bits, we might use (1<<60)-1. It even works with 64 bits: (1<<64)-1 has all bits set to 1.

We can also generate a word that has an arbitrary range of bits set: the word ((1<<13)-1) ^ ((1<<2)-1) has the bits from index 2 to index 12 (inclusively) set to 1, other bits are set to 0. With such a construction, we can set, clear or flip an arbitrary range of bits within a word, efficiently.

We can set any bit we like in a word. But what about querying bits? We can check whether the 12th bit is set in the word w by checking whether w & (1<<12) is non-zero. Indeed, the expression w & (1<<12) has value 1<<12 if the 12th bit is set in w and, otherwise, it has value zero. We can extend such a check: we can verify whether any of the bits from index 2 to index 12 (inclusively) is set to 1 by computing w & (((1<<13)-1) ^ ((1<<2)-1)). The result is zero if and only if no bit in the specified range is set to one. Note the parentheses around the mask: in Go, the & operator binds more tightly than ^.
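The following sketch (with arbitrary bit indexes) illustrates setting, clearing, flipping and querying bits:

package main

import "fmt"

func main() {
    var w uint64
    w |= 1 << 12       // set the bit at index 12
    w ^= 1 << 12       // flip it back to zero
    w |= (1 << 12) - 1 // set the 12 least significant bits
    w &^= (1 << 6) - 1 // clear the 6 least significant bits
    fmt.Printf("%064b\n", w)
    // query: is any bit with index 2 to 12 (inclusively) set?
    mask := uint64((1<<13)-1) ^ ((1 << 2) - 1)
    fmt.Println(w&mask != 0) // true: bits 6 to 11 are set
}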

Efficient and safe operations over integers

By thinking about values in terms of their bit representation, we can write more efficient code or, equivalently, have a better appreciation for what optimized binary code might look like. Consider the problem of checking if two numbers have the same sign: we want to know whether they are both smaller than zero, or both greater than or equal to zero. A naive implementation might look as follows:

func SlowSameSign(x, y int64) bool {
    return ((x < 0) && (y < 0)) || ((x >= 0) && (y >= 0))
}

However, let us think about what distinguishes negative integers from other integers: they have their most significant bit set. If we take the exclusive or (xor) of two integers, then the result has its most significant bit set to zero exactly when the two signs are identical. That is, the result is positive (or zero) if and only if the signs agree. We may therefore prefer the following function to determine if two integers have the same sign:

func SameSign(x, y int64) bool {
    return (x ^ y) >= 0
}

Suppose that we want to check whether x and y differ by at most 1. Maybe x is smaller than y, but it could be larger. A naive check requires two comparisons. However, when x and y differ by at most 1, the difference x-y is one of -1, 0 or 1, so x-y+1 is one of 0, 1 or 2: a single unsigned comparison, uint64(x-y+1) <= 2, suffices (ignoring overflows near the extremes of the range).
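A minimal sketch of this check (the function name is mine):

func WithinOne(x, y int64) bool {
    // x-y+1 is one of 0, 1 or 2 exactly when x and y differ by at most 1;
    // the unsigned cast turns negative values into huge values, failing
    // the test (this ignores overflows near the extremes of int64)
    return uint64(x-y+1) <= 2
}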

Let us consider the problem of computing the average of two integers. We have the following correct function:

func Average(x, y uint16) uint16 {
    if y > x {
        return (y-x)/2 + x
    } else {
        return (x-y)/2 + y
    }
}

With a better knowledge of the integer representation, we can do better.

We have another relevant identity x == 2*(x>>1) + (x&1). It means that x/2 is within [(x>>1), (x>>1)+1). That is, x>>1 is the greatest integer no larger than x/2. Conversely, we have that (x+(x&1))>>1 is the smallest integer no smaller than x/2.

We have that x+y == (x^y) + 2*(x&y). Hence, (x+y)>>1 == ((x^y)>>1) + (x&y) (ignoring any overflow in x+y). Hence, ((x^y)>>1) + (x&y) is the greatest integer no larger than (x+y)/2. We also have that x+y == 2*(x|y) - (x^y), or equivalently x+y + ((x^y)&1) == 2*(x|y) - (x^y) + ((x^y)&1), and so (x+y+((x^y)&1))>>1 == (x|y) - ((x^y)>>1) (again ignoring overflows). It follows that (x|y) - ((x^y)>>1) is the smallest integer no smaller than (x+y)/2. The difference between (x|y) - ((x^y)>>1) and ((x^y)>>1) + (x&y) is (x^y)&1. Hence, we have the following two fast functions:

func FastAverage1(x, y uint16) uint16 {
    return (x|y) - ((x^y)>>1)
}
func FastAverage2(x, y uint16) uint16 {
    return ((x^y)>>1) + (x&y)
}

Though we use the type uint16, it works irrespective of the integer size (uint8, uint16, uint32, uint64) and it also applies to signed integers (int8, int16, int32, int64).
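To see why these expressions matter, note that the naive formula (x+y)/2 can overflow, which is why the branchy Average above is careful; the fast functions also avoid the overflow. A quick sketch:

package main

import "fmt"

func main() {
    var x, y uint16 = 60000, 60000
    fmt.Println((x + y) / 2)              // 27232: x+y wraps around
    fmt.Println((x | y) - ((x ^ y) >> 1)) // 60000, as expected
    fmt.Println(((x ^ y) >> 1) + (x & y)) // 60000, as expected
}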

Efficient Unicode processing

In UTF-16, we may have surrogate pairs: the first word in the pair is in the range 0xd800 to 0xdbff whereas the second word is in the range 0xdc00 to 0xdfff. How may we efficiently detect surrogate pairs? If the values are stored using a uint16 type, then it would seem that we could detect a value that is part of a surrogate pair with two comparisons: (x>=0xd800) && (x<=0xdfff). However, it may prove more efficient to use the fact that subtractions naturally wrap around: 0-0xd800==0x2800. Thus x-0xd800 ranges between 0 and 0xdfff-0xd800==0x7ff whenever x is part of a surrogate pair, whereas any other value yields a result larger than 0x7ff. Thus, a single comparison is needed: (x-0xd800)<=0x7ff.
Once we have determined that we have a value that might correspond to a surrogate pair, we may check that the first value x1 is valid (in the range 0xd800 to 0xdbff) with the condition (x1-0xd800)<=0x3ff, and similarly for the second value x2: (x2-0xdc00)<=0x3ff. We may then reconstruct the code point as (1<<16) + ((x1-0xd800)<<10) + (x2-0xdc00). In practice, you may not need to concern yourself with such an optimization since your compiler might do it for you. Nevertheless, it is important to keep in mind that what might seem like multiple comparisons could actually be implemented as a single one.
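Here is a sketch of these checks in Go; the function names are mine:

// IsSurrogate returns true if x is part of a surrogate pair.
func IsSurrogate(x uint16) bool {
    return x-0xd800 <= 0x7ff // subtraction wraps around for small x
}

// DecodeSurrogatePair reconstructs a code point from a valid pair:
// x1 must be in 0xd800-0xdbff and x2 in 0xdc00-0xdfff.
func DecodeSurrogatePair(x1, x2 uint16) uint32 {
    return (1 << 16) + uint32(x1-0xd800)<<10 + uint32(x2-0xdc00)
}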

Basic SWAR

Modern processors have specialized instructions capable of operating over multiple units of data with a single instruction (called SIMD for Single Instruction Multiple Data). We can achieve a similar effect with a technique called SWAR (SIMD within a register): we do several operations with a single (or a few) conventional instructions. Typically, we are given a 64-bit word w (uint64) and we want to treat it as a vector of eight 8-bit words (uint8).

Given a byte value (uint8), we can replicate it over all bytes of a word with a single multiplication: x * uint64(0x0101010101010101). For example, we have 0x12 * uint64(0x0101010101010101) == 0x1212121212121212. This approach can be generalized in various ways. For example, we have that 0x7 * uint64(0x1101011101110101) == 0x7707077707770707.

For convenience, let us define b80 to be the uint64 value 0x8080808080808080 and b01 to be the uint64 value 0x0101010101010101. We can check whether all bytes are smaller than 128: we take the word where each byte has only its most significant bit set (0x80 * b01, which is b80), compute the bitwise AND with our 64-bit word, and check whether the result is zero: (w & b80) == 0. It might compile to two or three instructions on a processor.

We can check that no byte is zero, assuming that all bytes are smaller than 128, with the expression ((w - b01) & b80) == 0: a zero byte borrows during the subtraction and gets its most significant bit set. If we are not sure that all bytes are smaller than 128, we need an extra operation: ((w - b01) &^ w & b80) == 0. Checking whether a byte is zero allows us to check whether two words, w1 and w2, have a matching byte value since, when this happens, w1^w2 has a zero byte value.

We can also design more complicated operations if we assume that all byte values are smaller than 128. For example, we may check that all byte values are smaller than a 7-bit value t by the following routine: ((w + (0x80 - t) * b01) & b80) == 0. A byte x gets its most significant bit set exactly when x + 0x80 - t >= 0x80, that is, when x >= t. If the value t is a constant, then the multiplication is evaluated at compile time and the check should be barely more expensive than checking whether all bytes are smaller than 128. In Go, we check that all byte values are smaller than 77, assuming that they are smaller than 128, by verifying that b80 & (w+(128-77) * b01) is zero. Similarly, we can check that all byte values are larger than a 7-bit value t, assuming that they are also all smaller than 128: (((b80 - w) + t * b01) & b80) == 0. We can generalize further. Suppose we want to check that all bytes are at least as large as the 7-bit value a and no larger than the 7-bit value b. A byte x satisfies a <= x <= b exactly when x + 0x80 - a has its most significant bit set while x + 0x80 - (b+1) does not, so it suffices to check that (((w + b80 - a * b01) ^ (w + b80 - (b+1) * b01)) & b80) == b80.
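A short sketch exercising these SWAR checks (the constant names follow the text; the sample values are arbitrary):

package main

import "fmt"

const (
    b01 = uint64(0x0101010101010101)
    b80 = uint64(0x8080808080808080)
)

func main() {
    w := uint64(0x12) * b01              // replicate 0x12 over eight bytes
    fmt.Printf("%016x\n", w)             // 1212121212121212
    fmt.Println(w&b80 == 0)              // true: all bytes smaller than 128
    fmt.Println((w-b01)&b80 == 0)        // true: no byte is zero
    fmt.Println((w-b01)&^w&b80 == 0)     // true: same check, without assumptions
    // are all bytes at least 0x10 and at most 0x20?
    lo, hi := uint64(0x10), uint64(0x20)
    fmt.Println(((w+b80-lo*b01)^(w+b80-(hi+1)*b01))&b80 == b80) // true
}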

Rotating and Reversing Bits

Given a word, we say that we rotate the bits if we shift left or right the bits, while moving back the leftover bits at the beginning. To illustrate the concept, suppose that we are given the 8-bit integer 0b11110000 and we want to rotate it left by 3 bits. The Go language provides a function for this purpose (bits.RotateLeft8 from the math/bits package): we get 0b10000111. In Go, there is no separate rotate-right function, but passing a negative count to bits.RotateLeft8 rotates right; equivalently, rotating left by 3 bits is the same as rotating right by 5 bits when processing 8-bit integers. Go provides rotation functions for 8-bit, 16-bit, 32-bit and 64-bit integers.
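For example:

package main

import (
    "fmt"
    "math/bits"
)

func main() {
    x := uint8(0b11110000)
    fmt.Printf("%08b\n", bits.RotateLeft8(x, 3))  // 10000111
    fmt.Printf("%08b\n", bits.RotateLeft8(x, -5)) // 10000111: rotate right by 5
}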

Suppose that you would like to know if two 64-bit words (w1 and w2) have a matching byte value, irrespective of the position. We know how to detect a matching byte value at the same position efficiently: when it happens, w1^w2 has a zero byte, which we can detect with (((w1^w2) - b01) &^ (w1^w2) & b80) != 0. To compare all bytes with all other bytes, we can repeat the same operation as many times as there are bytes in a word (eight times for 64-bit integers): each time, we rotate one of the words by 8 bits:

(((w1^w2) - b01) &^ (w1^w2) & b80) != 0
w1 = bits.RotateLeft64(w1,8)
(((w1^w2) - b01) &^ (w1^w2) & b80) != 0
w1 = bits.RotateLeft64(w1,8)
...
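Putting it together as a function (a sketch; the name is mine):

// HasMatchingByteValues returns true if some byte value occurs in
// both w1 and w2, at any byte position.
func HasMatchingByteValues(w1, w2 uint64) bool {
    const b01 = uint64(0x0101010101010101)
    const b80 = uint64(0x8080808080808080)
    for i := 0; i < 8; i++ {
        x := w1 ^ w2
        if (x-b01)&^x&b80 != 0 { // some byte of x is zero: a match
            return true
        }
        w1 = bits.RotateLeft64(w1, 8)
    }
    return false
}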

We recall that words can be interpreted as little-endian or big-endian depending on whether the first bytes are the least significant or the most significant. Go allows you to reverse the order of the bytes in a 64-bit word with the function bits.ReverseBytes64 from the math/bits package. There are similar functions for 16-bit and 32-bit words. We have that bits.ReverseBytes16(0xcc00) == 0x00cc. Reversing the bytes in a 16-bit word, and rotating by 8 bits, are equivalent operations.

We can also reverse bits. We have that bits.Reverse16(0b1111001101010101) == 0b1010101011001111. Go has functions to reverse bits for 8-bit, 16-bit, 32-bit and 64-bit words. Many processors have dedicated instructions to reverse the bit order, so it can be a fast operation.

Fast Bit Counting

It can be useful to count the number of bits set to 1 in a word. Go has fast functions for this purpose in the math/bits package for words having 8 bits, 16 bits, 32 bits and 64 bits. Thus we have that bits.OnesCount16(0b1111001101010101) == 10.

Similarly, we sometimes want to count the number of trailing or leading zeros. The number of trailing zeros is the number of consecutive zero bits appearing in the least significant positions. For example, the word 0b1 has no trailing zero, whereas the word 0b100 has two trailing zeros. When the input is a power of two, the number of trailing zeros is the logarithm in base two. We can use the Go functions bits.TrailingZeros8, bits.TrailingZeros16 and so forth to compute the number of trailing zeros. The number of leading zeros is similar, but we start from the most significant positions. Thus the 8-bit integer 0b10000000 has zero leading zeros, while the integer 0b00100000 has two leading zeros. We can use the functions bits.LeadingZeros8, bits.LeadingZeros16 and so forth.

While the number of trailing zeros gives directly the logarithm of powers of two, we can use the number of leading zeros to compute the logarithm of any integer, rounded down to the nearest integer. For 32-bit integers, the following function provides the correct result:

func Log2Up(x uint32) int {
    return 31 - bits.LeadingZeros32(x|1)
}
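A quick check of the rounding behaviour (the x|1 in the function also protects against a zero input):

fmt.Println(Log2Up(1))  // 0
fmt.Println(Log2Up(32)) // 5
fmt.Println(Log2Up(63)) // 5: rounded down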

We can also compute other logarithms. Intuitively, this ought to be possible because if log_b is the logarithm in base b, then log_b(x) = log_2(x)/log_2(b). In other words, all logarithms are within a constant factor of each other (e.g., 1/log_2(b)).

For example, we might be interested in the number of decimal digits necessary to represent an integer (e.g., the integer 100 requires three digits). The general formula is ceil(log(x+1)) where the logarithm should be taken in base 10. We can show that the following function (designed by an engineer called Kendall Willets) computes the desired number of digits for 32-bit integers:

func DigitCount(x uint32) uint32 {
    var table = []uint64{
        4294967296, 8589934582, 8589934582,
        8589934582, 12884901788, 12884901788,
        12884901788, 17179868184, 17179868184,
        17179868184, 21474826480, 21474826480,
        21474826480, 21474826480, 25769703776,
        25769703776, 25769703776, 30063771072,
        30063771072, 30063771072, 34349738368,
        34349738368, 34349738368, 34349738368,
        38554705664, 38554705664, 38554705664,
        41949672960, 41949672960, 41949672960,
        42949672960, 42949672960}
    return uint32((uint64(x) + table[Log2Up(x)]) >> 32)
}

Though the function is a bit mysterious, its computation mostly involves taking the integer logarithm (via the number of leading zeros) and using the result to look up a value in a table. It translates into only a few CPU instructions and is efficient.
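As a quick sanity check:

fmt.Println(DigitCount(9))    // 1
fmt.Println(DigitCount(100))  // 3
fmt.Println(DigitCount(999))  // 3
fmt.Println(DigitCount(1000)) // 4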

Indexing Bits

Given a word, it is sometimes useful to compute the position of the set bits (bits set to 1). For example, given the word 0b11000111, we would like to have the indexes 0, 1, 2, 6, 7 corresponding to the 5 bits with value 1. We can determine efficiently how many indexes we need to produce thanks to the bits.OnesCount functions. The bits.TrailingZeros functions can serve to identify the position of a bit. We may also use the fact that x & (x-1) sets to zero the least significant 1-bit of x. The following Go function generates an array of indexes:

func Indexes(x uint64) []int {
    var ind = make([]int, bits.OnesCount64(x))
    pos := 0
    for x != 0 {
        ind[pos] = bits.TrailingZeros64(x)
        x &= x - 1
        pos += 1
    }
    return ind
}

Given 0b11000111, it produces the array 0, 1, 2, 6, 7:

var x = uint64(0b11000111)
for _, v := range Indexes(x) {
    fmt.Println(v)
}

If we want to compute the bits in reverse order (7, 6, 2, 1, 0), we can do so with a bit-reversal function, like so:

for _, v := range Indexes(bits.Reverse64(x)) {
    fmt.Println(63 - v)
}

Conclusion

As a programmer, you may access, set, copy, or move individual bit values efficiently. With some care, you can avoid arithmetic overflows without much of a performance penalty. With SWAR, you can use a single word as if it were made of several subwords. Though most of these operations are only rarely needed, it is important to know that they are available.

Serializing IPs quickly in C++

On the Internet, we often use 32-bit addresses which we serialize as strings such as 192.128.0.1. The string corresponds to the integer address 0xc0800001 (3229614081 in decimal).

How might you serialize efficiently in C++, going from the integer to the string?

The simplest code in modern C++ might look as follows:

std::string output = std::to_string(address >> 24);
for (int i = 2; i >= 0; i--) {
  output.append("." + std::to_string((address >> (i * 8)) % 256));
}

At least symbolically, it will repeatedly create small strings that are appended to an initial string.

Can we do better? We have newer functions in C++ (std::to_chars, from the <charconv> header) which are dedicated to writing numbers quickly to a character buffer. So we might try to allocate a single string and write to it directly. The result is not pretty:

std::string output(4 * 3 + 3, '\0'); // allocate just one big string
char *point = output.data();
char *point_end = output.data() + output.size();
point = std::to_chars(point, point_end, uint8_t(address >> 24)).ptr;
for (int i = 2; i >= 0; i--) {
  *point++ = '.';
  point = std::to_chars(point, point_end, uint8_t(address >> (i * 8))).ptr;
}
output.resize(point - output.data());

The uglier code might be faster.

I wrote a benchmark where I repeatedly create strings and add them to containers.

On an AMD Rome (Zen2) processor using GCC 11, I get the following numbers.

pretty code  80 ns/string
to_chars     24 ns/string

So about three times faster?

We can go at least twice as fast with a bulkier function, but it requires significantly more code and memory:

  // uses 1025 bytes
  constexpr static const char *lookup =
".0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .10 .11 .12 .13 .14 .15 .16 .17 "
".18 .19 .20 .21 .22 .23 .24 .25 .26 .27 .28 .29 .30 .31 .32 .33 .34 .35 "
".36 .37 .38 .39 .40 .41 .42 .43 .44 .45 .46 .47 .48 .49 .50 .51 .52 .53 "
".54 .55 .56 .57 .58 .59 .60 .61 .62 .63 .64 .65 .66 .67 .68 .69 .70 .71 "
".72 .73 .74 .75 .76 .77 .78 .79 .80 .81 .82 .83 .84 .85 .86 .87 .88 .89 "
".90 .91 .92 .93 .94 .95 .96 .97 .98 .99 "
".100.101.102.103.104.105.106.107.108.109.110.111.112.113.114.115.116."
"117.118.119.120.121.122.123.124.125.126.127.128.129.130.131.132.133.134."
"135.136.137.138.139.140.141.142.143.144.145.146.147.148.149.150.151.152."
"153.154.155.156.157.158.159.160.161.162.163.164.165.166.167.168.169.170."
"171.172.173.174.175.176.177.178.179.180.181.182.183.184.185.186.187.188."
"189.190.191.192.193.194.195.196.197.198.199.200.201.202.203.204.205.206."
"207.208.209.210.211.212.213.214.215.216.217.218.219.220.221.222.223.224."
"225.226.227.228.229.230.231.232.233.234.235.236.237.238.239.240.241.242."
"243.244.245.246.247.248.249.250.251.252.253.254.255";
  std::string output(4 * 3 + 3, '\0');
  char *point = output.data();
  uint8_t by;
  by = address >> 24;
  std::memcpy(point, lookup + by * 4 + 1, 4);
  point += (by < 10) ? 1 : (by < 100 ? 2 : 3);
  by = address >> 16;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  by = address >> 8;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  by = address >> 0;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  output.resize(point - output.data());


In my benchmark, I include a version by Marcin Zukowski that uses even more memory (about 4 times) and is slightly faster.

As always, your results will vary depending on your system and your compiler. However, I recommend against creating many small strings and aggregating them together in performance-critical code.

Credit: Peter Dimov and Marcin Zukowski have contributed fast versions (see benchmarks). The version with a large table is derived from their work. Ivan-Assen Ivanov contributed ideas on Twitter.

Move or copy your strings? Possible performance impacts

You sometimes want to add a string to an existing data structure. For example, the C++17 template ‘std::optional’ may be used to represent a possible string value. You may copy it there, as this code would often do…

std::string mystring;
std::optional<std::string> myoption;
myoption = mystring;

Or you can move it:

std::string mystring;
std::optional<std::string> myoption;
myoption = std::move(mystring);

In C++, when ‘moving’ a value, the compiler does not need to create a whole new copy of the string. So it is often cheaper.

I wrote a little benchmark to assess the performance difference. It is a single test, but it illustrates the point.

Firstly, for relatively long strings (a phrase or a sentence), the move is 5 times to 20 times faster.

                             copy           move
Apple LLVM 14, M2 processor  24 ns/string   1.2 ns/string
GCC 11, Intel Ice Lake       19 ns/string   4 ns/string

Secondly, for short strings (a single word), the move is 1.5 times to 3 times faster but the absolute difference is small (as small as a fraction of a nanosecond). Your main concern should be with long strings.

                             copy           move
Apple LLVM 14, M2 processor  2.0 ns/string  1.2 ns/string
GCC 11, Intel Ice Lake       7 ns/string    2.6 ns/string

My results illustrate that moving your sizeable data structures instead of copying them is beneficial.

But that’s not the fastest approach: the fastest approach is to just hold a pointer. Copying an address is unbeatably fast. A slightly less optimal approach is to use a lightweight object like an std::string_view: copying or creating an std::string_view is cheaper than doing the same with a C++ string.

International domain names: where does https://meßagefactory.ca lead you?

Originally, the domain part of a web address was all ASCII (so no accents, no emojis, no Chinese characters). This was extended a long time ago thanks to something called internationalized domain name (IDN).

Today, in theory, you can use any Unicode character you like as part of a domain name, including emojis. Whether that is wise is something else.

What does the standard say? Given a domain name, we should identify its labels. They are normally separated by dots (.): www.microsoft.com has three labels. But you may also use other Unicode characters as separators (the full stop U+002E, the ideographic full stop U+3002, the fullwidth full stop U+FF0E and the halfwidth ideographic full stop U+FF61). Each label is further processed. If it is all ASCII, then it is left as is. Otherwise, we must convert it to an ASCII form called “punycode” after doing the following according to RFC 3454:

  • Map characters (section 3 of RFC 3454),
  • Normalize (section 4 of RFC 3454),
  • Reject forbidden characters,
  • Optionally reject based on unassigned code points (section 7).

And then you get to the punycode algorithm. There are further conditions to be satisfied, such as the domain name in ASCII cannot exceed 255 bytes.

That’s quite a lot of work. The goal is to transcribe each Unicode domain name into an ASCII domain name. You would hope that it would be a well-defined algorithm: given a Unicode domain name, there should be a unique output.

Let us choose a common non-ASCII character, the letter ß, called Eszett. Let me create a link with this character: https://meßagefactory.ca.

What happens if you click on this link? The result depends on your browser. If you are using Microsoft Edge, Google Chrome or the Brave browser, you may end up at https://messagefactory.ca/. If you are using Safari or Firefox, you may end up at https://xn--meagefactory-m9a.ca. Of course, your results may vary depending on your exact system. Under iOS (iPhone), I expect that the Safari behaviour will prevail irrespective of your browser.

Not what I expected.

Update: We wrote our own library to process international domain names according to the standard: it is called idna and is part of the ada-url project. Our library produces https://xn--meagefactory-m9a.ca, which is the non-transitional (and now correct) answer.
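If you want to experiment from Go rather than from our C++ library, the golang.org/x/net/idna package implements similar processing; a sketch (I expect the raw Punycode conversion below to match our answer, though behaviour can vary with the chosen profile and package version):

package main

import (
    "fmt"

    "golang.org/x/net/idna"
)

func main() {
    // Punycode-encode the non-ASCII label, without transitional mapping.
    ascii, err := idna.ToASCII("meßagefactory.ca")
    if err != nil {
        panic(err)
    }
    fmt.Println(ascii) // expected: xn--meagefactory-m9a.ca
}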

Year 2022: Scientific progress

The year 2022 is over. As with every year that passes, we have made some scientific progress. I found the following achievements interesting:

Science and technology links (January 15 2023)

    1. For under $600, one can buy a 20-terabyte disk on Amazon. Unless you work professionally in multimedia, it is more storage than you need. However, having much storage is, by itself, of little use if you cannot access it. Thankfully, you can buy a 1-terabyte “disk” for $200 that provides over 6 GB/s of bandwidth. I have a similar disk in my game console. Is this as good as it gets? Researchers show that we can transmit data over a distance at more than a petabit per second. According to some estimates, that is more than the total data size of the books in the Library of Congress, per second.
    2. Transplanting rejuvenated blood stem cells extends lifespan of aged immunocompromised mice.
    3. Amazon is using drones for deliveries in California and Texas.
    4. People who think of themselves as less attractive are more willing to wear surgical masks.
    5. Conversations rarely end when they should.
    6. Using legal restrictions, manufacturers are able to prevent their customers from repairing their own products. There may be hope. Farmers in the US will be allowed to repair John Deere tractors.
    7. For most of my life, nuclear power has been viewed with suspicion, and little effort has been made to exploit it further. The tide might finally be turning. The UK government plans to authorize small modular nuclear reactors, and other nuclear-power innovations.
    8. Cancer rates are falling rapidly in both men and women.
    9. There is evidence that vitamin D supplements might help your immune system if your levels are too low. However, you may also require magnesium supplements to benefit from the vitamin D.
    10. In the post-pandemic era, people work fewer hours.
    11. There is evidence that viruses dormant in our genomes can wake up when we are older and harm us.
    12. Research papers and patents are becoming less disruptive over time. However, we are producing many more research papers and patents.
    13. Genetically modified immune cells are used to fight cancer in a patient.
    14. Increasing house insulation may not lead to lower energy usage in the long run. It is not, per se, an argument against better insulation. Yet it suggests that we should plan for increased power production.
    15. In a paper published by Nature, Kleinherenbrink et al. find that global mean sea levels are likely rising according to a linear curve, as opposed to an accelerating curve. The current rate is estimated to be between 3 and 4 millimeters per year. Meanwhile, the lowest-lying island nations on the planet are growing.
    16. Antarctica might be getting cooler.
    17. People prefer to stay away from promiscuous men.
    18. Cold temperatures harm the antiviral immunity of your nose. Thus cold weather may make you more prone to catching a virus.
    19. Replacing grades with pass/fail scores in courses leads students to put in less effort. In my view, it does not imply that we should not adopt pass/fail scores, because there are diminishing returns to more effort. E.g., if the kids in a country spend much more time perfecting their knowledge of trigonometry, you may not end up with a more prosperous country. In some sense, intense competition may be a net loss.
    20. Using solar power generation in countries such as Switzerland results in a net energy loss: though the solar panels produce energy, they never recoup the energy investment needed to make them and deploy them.