Can GPT pass my programming courses?

Professor Caplan reports that ChatGPT, the new artificial intelligence that is all the rage, can pass his economics midterm and get an A.

What about computer science? I submitted to GPT one of the problems I offer my students…

Write a Java program that calculates the sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers that are divisible by three and also omitting numbers whose hundreds digit is 2 or 3 (e.g., 1201 or 3313).

Bing/GPT solves the problem with ease:

public class SumOfNumbers {
  public static void main(String[] args) {
    int sum = 0;
    for (int i = 1; i <= 10000; i++) {
      if (i % 3 != 0 && (i / 100) % 10 != 2 && (i / 100) % 10 != 3) {
        sum += i;
      }
    }
    System.out.println("The sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers that are divisible by three and also omitting numbers whose hundredth digit is 2 or 3 is: " + sum);
  }
}

Maybe it somehow knew just this particular answer. What if I tweak the question a bit?

Write a Java program that calculates the sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers where the hundreds digit is divisible by 3.

Bing/GPT gets it almost correct:

public class SumOfNumbers {
  public static void main(String[] args) {
    int sum = 0;
    for (int i = 1; i <= 10000; i++) {
      if ((i / 100) % 3 != 0) {
        sum += i;
      }
    }
    System.out.println("The sum of numbers from 1 to 10,000 (including 1 and 10,000) but omitting numbers where the hundredth digit is divisible by 3 is: " + sum);
  }
}

(Almost, because the test (i / 100) % 3 looks at more than the hundreds digit: the correct check would use ((i / 100) % 10) % 3, and a hundreds digit of 0 is also divisible by three.)

What about other questions I often ask… such as…

Write a Java function which can invert the elements in an array of integers.

Bing/GPT gets it right, complete with a detailed explanation…

public static void invertArray(int[] arr) {
  int n = arr.length;
  for (int i = 0; i < n / 2; i++) {
    int temp = arr[i];
    arr[i] = arr[n - i - 1];
    arr[n - i - 1] = temp;
  }
}

What about database work? Here is a basic question I ask…

What can you say about the attributes A and B if A→B and B→A?

Bing/GPT tells me that…

If A→B and B→A, then A and B are said to be functionally dependent on each other. This means that the value of A uniquely determines the value of B, and vice versa. In other words, if you know the value of A, you can determine the value of B, and if you know the value of B, you can determine the value of A

This is correct.

What about a more technical database question?

Bing/GPT answers…

SELECT COUNT(*) FROM table_name WHERE A = B

Unfortunately, its competence extends to the exams themselves. Bing/GPT gets much of it right. In the current “introduction to programming” exam, there was only one question that GPT got obviously wrong, and even then I would have awarded some points ‘for effort’.

Bing/GPT can pass my introductory computer science courses.

I further asked it to solve a problem using the C++ simdjson library. We are definitely well beyond introductory courses. GPT did well enough: its answer is almost correct.

Then I asked it to produce its own SIMD routine in C++ to count the number of characters equal to ‘.’, and it did so. Again, the result is almost correct.

Bing/GPT can also solve mathematical problems, in French.

Counting cycles and instructions on ARM-based Apple systems

In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how we could have access to “performance counters” to count how many cycles and instructions a given piece of code takes on ARM-based Mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors, and my current laptop runs on the M2 processor. Sadly, my original code only works for the M1 processor.

Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. I have further extended my original code so that it works both under Linux and on ARM-based Macs. The code has benefited from contributions from Wojciech Muła and John Keiser.

For the most part, you set up a global event_collector instance, and then you surround the code you want to benchmark with collector.start() and collector.end(), pushing the results into an event_aggregate:

#include "performancecounters/event_counter.h"

event_collector collector;

void f() {
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
   collector.start();
   function(); // benchmark this function
   event_count allocate_count = collector.end();
   aggregate << allocate_count;
  }
}

And then you can query the aggregate to get the average or best performance counters:

aggregate.elapsed_ns() // average time in ns
aggregate.instructions() // average # of instructions
aggregate.cycles() // average # of cycles
aggregate.best.elapsed_ns() // time in ns (best round)
aggregate.best.instructions() // # of instructions (best round)
aggregate.best.cycles() // # of cycles (best round)

I updated my original benchmark which records the cost of parsing floating-point numbers, comparing the fast_float library against the C function strtod:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
                           strtod     33.10 ns/float    428.06 instructions/float
                                      75.32 cycles/float
                                       5.68 instructions/cycle
                        fastfloat      9.53 ns/float    193.78 instructions/float
                                      27.24 cycles/float
                                       7.11 instructions/cycle

The code is freely available for research purposes.

Runtime asserts are not free

When writing software in C and C++, it is common to add C asserts to check that some conditions are satisfied at runtime. Often, it is a simple comparison between two values.

In many instances, these asserts are effectively free as far as performance goes, but not always. If they appear within a tight loop, they may impact the performance.

One might object that you can choose to only enable assertions in the debug version of your binary… but this choice is subject to debate. Compilers like GCC or clang (LLVM) do not deactivate asserts merely because you compile with optimizations: you must define NDEBUG explicitly. Some package maintainers require all asserts to remain in the release binary.
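
To be concrete, the assert macro is controlled by the NDEBUG preprocessor symbol, not by the optimization level. A minimal illustration (the file name demo.cpp is hypothetical):

#include <cassert>

int main() {
  int x = 2 + 2;
  // Built with 'c++ -O3 demo.cpp', this check still runs at runtime.
  // Built with 'c++ -O3 -DNDEBUG demo.cpp', it expands to nothing.
  assert(x == 4);
  return 0;
}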

What do I mean by expensive? Let us consider an example. Suppose I want to copy an array. I might use the following code:

for (size_t i = 0; i < N; i++) {
  x1[i] = x2[i];
}

Suppose that I know that all my values are smaller than some threshold, but I want to be double sure. So during the copy, I might add a check:

for (size_t i = 0; i < N; i++) {
  assert(x2[i] < RAND_MAX);
  x1[i] = x2[i];
}

It is an inexpensive check. But how much does it cost me? I wrote a small benchmark which I run on an M2 processor after compiling the code with clang (LLVM 14). I get the following result:

simple loop 0.3 ns/word
loop with assert 0.9 ns/word

So adding the assert multiplies the running time by a factor of three in this instance. Your results will vary, but you should not be surprised if the version with asserts is significantly slower.

So asserts are not free. To make matters more complicated, some projects refuse to rely on a library that terminates on errors. So having asserts in your library may disqualify it for some uses.

You will find calls to assert in my code. I do not argue that you should never use them. But spreading asserts in performance-critical code might be unwise.

Precision, recall and why you shouldn’t crank up the warnings to 11

Recently, the code hosting site GitHub widely deployed a tool called CodeQL with rather aggressive settings. It does static analysis on the code and attempts to flag problems. I use the phrase “static analysis” to refer to an analysis that does not run the code. Static analysis is limited: it can identify a range of actual bugs, but it also tends to produce false positives: code patterns that it thinks are bugs but aren’t.

Several Intel engineers proposed code to add AVX-512 support to a library I help support. We got scary “High severity” warnings from CodeQL.

CodeQL complains that we take as input a pointer to 8-byte words and treat it as if it were a pointer to 64-byte words. If you work with AVX-512 and are providing optimized replacements for existing functions, such code is standard. And no compiler that I know of, even at the most extreme settings, will ever issue a warning here, let alone a scary “High severity Check Failure”.

On its own, this is merely a small annoyance that I can ignore. However, I fear that it is part of a larger trend where people come to rely more and more on overbearing static analysis to judge code quality. The more warnings, the better, they think.

And indeed, surely, the more warnings a linter/checker can generate, the better it is?

No. This belief is incorrect, for several reasons:

  1. Warnings and false errors may consume considerable programmer time. You may say ‘ignore them’ but even if you do, others will be exposed to the same warnings and they will be tempted to either try to fix your code, or to report the issue to you. Thus, unless you work strictly alone or in a closed group, it is difficult to escape the time sink.
  2. Training young programmers to avoid non-issues may make them less productive. The two most important features of software are (in order): correctness (whether it does what you say it does) and performance (whether it is efficient). Fixing shallow warnings is easy work, but it often contributes neither to correctness (i.e., it does not fix bugs) nor to performance. You may feel productive, and it may look like you are changing much code, but what are you gaining?
  3. Modifying code to fix a non-issue has a non-zero chance of introducing a real bug. If you have code that has been running in production for a long time, without error… trying to “fix it” (when it is not broken) may actually break it. You should be conservative about fixing code without strong evidence that there is a real issue. Your default behaviour should be to refuse to change the code unless you can see the benefits. There are exceptions but almost all code changes should either fix a real bug, introduce a new feature or improve the performance.
  4. When programming, you need to clear your mental space. Distractions are bad. They make you dumber. So your programming environment should not have unnecessary features.

Let us use some mathematics. Suppose that my code has bugs, and that a static checker has some probability of catching a bug each time it issues a warning. In my experience, this probability can be low… but the exact percentage is not important to the big picture. Let me use a reasonable model. Given B bugs per 1000 lines, suppose that the probability that a warning has caught a bug follows a logistic function, say 1/(1+exp(10 − B)). So if I have 10 bugs per 1000 lines of code, then each warning has a 50% probability of being useful. This model is quite optimistic.

The recall is the fraction of the bugs that I have caught. If I have 20 bugs per 1000 lines of code, then having a million warnings will almost ensure that all bugs are caught. But human beings would need to do a lot of work to process a million warnings.

So given B, how many warnings should I issue? Of course, in the real world I do not know B, and I do not know that the usefulness of the warnings follows a logistic function, but humour me.

A reasonable answer is that we want to maximize the F-score: the harmonic mean between the precision (the fraction of warnings that point at real bugs) and the recall, that is, F = 2PR/(P+R).

I hastily coded a model in Python, where I vary the number of warnings. The recall always increases while the precision always falls. The F-score follows a unimodal curve: having no warnings is terrible, but having too many is just as bad. With a relatively small number of warnings, you can maximize the F-score.
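
For illustration, here is a minimal recreation of that kind of model, in C++ rather than Python; the specific assumptions (a fixed per-warning hit rate given by the logistic function, a cap of B caught bugs, no duplicate catches) are mine:

#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
  const double B = 10.0; // bugs per 1000 lines of code
  const double q = 1.0 / (1.0 + std::exp(10.0 - B)); // per-warning hit rate
  double best_f = 0;
  int best_w = 0;
  for (int warnings = 1; warnings <= 200; warnings++) {
    double caught = std::min(B, warnings * q); // expected true positives
    double precision = caught / warnings;
    double recall = caught / B;
    double f = 2 * precision * recall / (precision + recall);
    if (f > best_f) { best_f = f; best_w = warnings; }
  }
  std::printf("best F-score %.2f with %d warnings\n", best_f, best_w);
  return 0;
}

In this toy model, the F-score peaks at a modest number of warnings and declines afterward.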

A more intuitive description of the issue is that the more warnings you produce, the more likely you are to waste programmer time. You are also more likely to catch bugs. One effect is negative, the other is positive. There is a trade-off, and when there is a trade-off, you need to seek the sweet spot in the middle.

The trend toward an ever increasing number of warnings does not improve productivity. In fact, at the margin, disabling the warnings entirely might be just as productive as having the warnings: the analysis has zero value.

I hope that it is not a symptom of a larger trend where programming becomes bureaucratic. Software programming is one of the key industries where productivity has been fantastic and where we have been able to innovate at great speed.

Science and Technology links (March 11 2023)

  1. In 1500, China was the largest economy in the world, followed by India and France. The USA did not exist yet. In 1700, 4% of human beings lived in France. In the mid-18th century, there were 25 million inhabitants in France and 5.5 million in England. Yet France had slipped to sixth place as an economic power worldwide, while the United Kingdom moved up: France was beginning a sustained decline. So what happened to France? It appears that decades before the French Revolution, France became less religious, and that was the driver of a large decrease in fertility.
  2. Men who do physical labor have higher sperm counts and testosterone levels.
  3. You can take human brain cells, put them in a rat, and these cells will integrate into the rat’s brain.
  4. The top 1 per cent in income score slightly worse on cognitive ability than those in the income strata right below them according to Keuschnigg et al. (2023).
  5. Both hemispheres (south and north) absorb the same amount of energy from the Sun. According to Hadas et al., the presence of more ocean causes the southern hemisphere to absorb more sunlight; however, it also makes it stormier, increasing cloud albedo and balancing out the difference in reflected sunlight. In effect, the atmosphere balances out the greater forcing.
  6. Half of elderly American men take statins. In patients suffering from Alzheimer’s, statins worsen cognition: after stopping statins, the patients improved.
  7. Though earlier research concluded that happiness increases with income only up to a point, Killingsworth et al. (2023) find there is a robust linear-log relationship between income and happiness, meaning that each time you multiply your income, your happiness goes up a level. Note that this is a relationship, not necessarily a causal effect.
  8. Sea levels were about 1.5 meters higher 6000 years ago.
  9. High-intensity acute exercise enhances memory.
  10. According to the CDC, 13% of all Americans take antidepressants at any given time. About 15% of patients taking antidepressants have an effect that is above the placebo effect. In other words, most people take antidepressants for no benefit. But antidepressants may not be without risks. Antidepressants are associated with dementia: they may put you more at risk for Alzheimer’s.
  11. Despite substantial reductions in anthropogenic CO2 emissions in early 2020, the annual atmospheric CO2 growth rate did not decline, according to Laughner et al. (2021). They write:

    The lack of clear declines in the atmospheric growth rates of CO2 and CH4, despite large reductions in human activity, reflect carbon-cycle feedbacks in air–sea carbon exchange, large interannual variability in the land carbon sink, and the chemical lifetime of CH4. These feedbacks foreshadow similar challenges to intentional mitigation.

  12. By using the blood of a young mouse in a old mouse, we appear to be able to at least partially rejuvenate the brain of the old mice.
  13. The gas H2S increased longevity in worms.
  14. The energy density of Li-ion batteries has increased by less than 3% in the last 25 years according to Li (2019).
  15. There were giant penguins during the Paleocene. They might have weighed 150 kg, more than most human beings. The climate during the Paleocene was generally tropical and the poles were ice-free.
  16. Young men sometimes outperform young women in standardized mathematical tests, while the opposite is true in some other domains. The gap is larger in gender-equal countries.
  17. There are still compulsory sterilization laws in the USA: the State may legally require your sterilization.
  18. We used to think that the earliest human beings (Homo sapiens) migrated out of Africa 120,000 years ago. However, fossils as old as 177,000 years have been found outside Africa. The oldest fossils were found in Morocco (315,000 years old), though there are some doubts as to whether they are from Homo sapiens.
  19. It appears that ChatGPT is biased against conservative politicians. One might expect the system to express the biases of its creators.

Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor

Programmers sometimes need to trim, or remove, characters such as spaces from strings. It can be a surprisingly expensive task. In C/C++, the following function is efficient:

size_t trimspaces(const char *s, size_t len, char *out) {
  char * init_out{out};
  for(size_t i = 0; i < len; i++) {
    *out = s[i];
    out += (s[i] != ' ');
  }
  return out - init_out;
}

Basically, we write out every input character, but we only advance the output pointer when the character is not a space.
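
As a quick sanity check, we can exercise the function like so (this test harness is mine, not part of the benchmark):

#include <cstddef>
#include <cstdio>
#include <cstring>

size_t trimspaces(const char *s, size_t len, char *out); // defined above

int main() {
  const char input[] = "1 2  3";
  char buffer[sizeof(input)] = {};
  size_t n = trimspaces(input, std::strlen(input), buffer);
  std::printf("'%.*s' (%zu bytes)\n", (int)n, buffer, n); // prints '123' (3 bytes)
  return 0;
}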

Amazon makes available new ARM-based systems relying on their Graviton 3 processors. These processors support advanced instructions called “SVE” (Scalable Vector Extension). One very nice family of SVE instructions serves to ‘compact’ values (effectively, remove unwanted values). I put it to good use when filtering out integer values from arrays.

Unfortunately, the SVE compact instructions cannot be directly applied to the problem of pruning spaces from strings, because they only operate on larger words (e.g., 32-bit words). Fortunately, it is possible to load bytes directly into 32-bit values, so that each byte value occupies 32 bits in a vector register, using intrinsic functions such as svld1sb_u32. As you would expect, you can also do the reverse, and take an array of 32-bit values and automatically convert it to a byte array (e.g., using svst1b_u32).

Thus I can take my byte array (a string), load it into temporary 32-bit vectors, prune these vectors, and then store the result as a byte array, back into a string. The following C++ code is a reasonable implementation of this idea:

size_t sve_trimspaces(const char *s, size_t len, char *out) {
  uint8_t *out8 = reinterpret_cast<uint8_t *>(out);
  size_t i = 0;
  for (; i + svcntw() <= len; i += svcntw()) {
    svuint32_t input = svld1sb_u32(svptrue_b32(), (const int8_t *)s + i);
    svbool_t matches = svcmpne_n_u32(svptrue_b32(), input, 32); // 32 == ' '
    svuint32_t compressed = svcompact_u32(matches, input);
    svst1b_u32(svptrue_b32(), out8, compressed);
    out8 += svcntp_b32(svptrue_b32(), matches);
  }
  if (i < len) { // handle the remaining bytes with a partial mask
    svbool_t read_mask = svwhilelt_b32(i, len);
    svuint32_t input = svld1sb_u32(read_mask, (const int8_t *)s + i);
    svbool_t matches = svcmpne_n_u32(read_mask, input, 32);
    svuint32_t compressed = svcompact_u32(matches, input);
    svst1b_u32(read_mask, out8, compressed);
    out8 += svcntp_b32(read_mask, matches);
  }
  return out8 - reinterpret_cast<uint8_t *>(out);
}

Is it faster? Using GCC 12 on a Graviton 3, I get that the SVE approach is 3.6 times faster and it uses 6 times fewer instructions. The SVE code is not six times faster, because it is retiring fewer instructions per cycle. My code is available.

conventional code 1.8 cycles/byte 7 instructions/byte
SVE code 0.5 cycles/byte 1.1 instructions/byte

You can achieve a similar result with ARM NEON, but it requires large tables to emulate the compaction instructions. Large tables lead to code bloat which, at the margin, is undesirable.

Float-parsing benchmark: Regular Visual Studio, ClangCL and Linux GCC

Windows users have choices when it comes to C++ programming. You may choose to stick with the regular Visual Studio compiler. If you prefer, Microsoft makes available ClangCL, which couples the clang (LLVM) compiler, commonly used by Apple, with the Visual Studio toolchain. Further, under Windows, you may easily build software for Linux using the Windows Subsystem for Linux.

Programmers often need to convert a string (e.g., “312.11”) into a binary floating-point number. It is a common task when doing data science.

I wrote a small set of benchmark programs to measure the speed. To run them, I use an up-to-date Visual Studio (2022) with the latest ClangCL component, and the latest version of CMake.

After downloading the benchmark on your machine, you may build it with CMake. Under Linux, you may do:

cmake -B build && cmake --build build

And then you can execute like so:

./build/benchmarks/benchmark -f data/canada.txt

The Microsoft Visual Studio usage is similar except that you must specify the build type (e.g., Release):

cmake -B build && cmake --build build --config Release

For ClangCL, it is almost identical, except that you need to add -T ClangCL:

cmake -B build -T ClangCL && cmake --build build --config Release

Under Windows, the binary is not produced at build/benchmarks/benchmark but rather at build/benchmarks/Release/benchmark.exe, but the commands are otherwise the same. I use the default CMake flags corresponding to the Release build.

I run the benchmark on a laptop with a Tiger Lake Intel processor (i7-11370 @ 3.3 GHz). It is not an ideal machine for benchmarking, so I indicate the error margin on the speed measurements.

Among other libraries, my benchmark puts the fast_float library to the test: it is an implementation of the C++17 from_chars function for floats and doubles. Microsoft has its own implementation of this function which I include in the mix.
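
Both implementations expose essentially the same interface. For reference, a minimal call looks as follows (the input value and error handling are my own illustration):

#include <charconv>
#include <cstdio>
#include <cstring>

int main() {
  const char *input = "312.11";
  double value = 0;
  auto result = std::from_chars(input, input + std::strlen(input), value);
  if (result.ec == std::errc()) {
    std::printf("parsed %f\n", value);
  }
  // The fast_float call is analogous: fast_float::from_chars(first, last, value).
  return 0;
}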

My benchmarking code remains identical (in C++) when I switch system or compiler: I simply recompile it.

Visual Studio std::from_chars 87 MB/s (+/- 20 %)
Visual Studio fast_float 285 MB/s (+/- 24 %)
ClangCL fast_float 460 MB/s (+/- 36 %)
Linux GCC11 fast_float 1060 MB/s (+/- 28 %)

We can observe that there are large performance differences. All the tests run on the same machine. The Linux build runs under the Windows Subsystem for Linux, and you do not expect the subsystem to run computations faster than Windows itself. The benchmark does not involve disk access. The benchmark is not allocating new memory.

It is not the first time that I have noticed that the Visual Studio compiler provides disappointing performance. I have yet to read a good explanation for why that is. Some people blame inlining, but I have yet to find a scenario where a release build from Visual Studio had poor inlining.

I am not ruling out methodological problems but I almost systematically find lower performance under Visual Studio when doing C++ microbenchmarks, despite using different benchmarking methodologies and libraries (e.g., Google Benchmark). It means that if there is a methodological issue, it is deeper than a mere programming typo.

ARM vs Intel on Amazon’s cloud: A URL Parsing Benchmark

Twitter user opdroid1234 remarked that they are getting more performance out of the ARM nodes than out of the Intel nodes on Amazon’s cloud (AWS).

I found previously that the Graviton 3 processors had less bandwidth than comparable Intel systems. However, I have not done much testing in terms of raw compute power.

The Intel processors have the crazily good AVX-512 instructions: ARM processors have nothing close except for dedicated accelerators. But what about more boring computing?

We wrote a fast URL parser in C++ (the ada library). It does not do anything beyond portable C++: no assembly language, no explicit SIMD instructions, etc.

Can the ARM processors parse URLs faster?

I am going to compare the following node types:

  • c6i.large: Intel Ice Lake (0.085 US$/hour)
  • c7g.large: Amazon Graviton 3 (0.0725 US$/hour)

I am using Ubuntu 22.04 on both nodes. I make sure that cmake, ICU and GNU G++ are installed.

I run the following routine:

  • git clone https://github.com/ada-url/ada
  • cd ada
  • cmake -B build -D ADA_BENCHMARKS=ON
  • cmake --build build
  • ./build/benchmarks/bench --benchmark_filter=Ada

The results are that the ARM processor is indeed slightly faster:

Intel Ice Lake 364 ns/url
Graviton 3 320 ns/url

The Graviton 3 processor is about 15% faster. It is not the 20% to 30% that opdroid1234 reports, but the Graviton 3 nodes are also slightly cheaper.

Please note that (1) I am presenting just one data point, I encourage you to run your own benchmarks (2) I am sure that opdroid1234 is being entirely truthful (3) I love all processors (Intel, ARM) equally (4) I am not claiming that ARM is better than Intel or AMD.

Note: I do not own stock in ARM, Intel or Amazon. I do not work for any of these companies.

Further reading: Optimized PyTorch 2.0 inference with AWS Graviton processors

Regular Visual Studio versus ClangCL

If you are programming in C++ using Microsoft tools, you can use the traditional Visual Studio compiler. Or you can use the clang (LLVM) front end instead (ClangCL).

Let us compare their performance characteristics with a fast string transcoding library (simdutf). I use an up-to-date Visual Studio (2022) with the latest ClangCL component (based on LLVM 15). For building the library, we use the latest version of CMake. I will abstain from parallelizing the build: I use default settings. Hardware-wise, I use a Microsoft Surface Laptop Studio: it has a Tiger Lake Intel processor (i7-11370 @ 3.3 GHz).

After grabbing the simdutf library from GitHub, I prepare the build directory for standard Visual Studio:

> cmake -B buildvc

I do the same for ClangCL:

> cmake -B buildclangcl -T ClangCL

You may also build directly with LLVM:

> cmake -B buildfastclang -D CMAKE_LINKER="lld" -D CMAKE_CXX_COMPILER=clang++ -D CMAKE_C_COMPILER=clang -D CMAKE_RC_COMPILER=llvm-rc

For each build directory, I can build in Debug mode (--config Debug) or in Release mode (--config Release) with commands such as

> cmake --build buildvc --config Debug

The project builds an extensive test suite by default. I often rely on my Apple MacBook, and I build a lot of software using Amazon (AWS) nodes. I use an AWS c6i.large node (Intel Ice Lake running at 3.5 GHz, 2 vCPUs).

The simdutf library and its testing suite build in a reasonable time as illustrated by the following table (Release builds). For comparison purposes, I also build the library using ‘WSL’ on the Microsoft laptop (Windows Subsystem for Linux).

MacBook Air ARM M2 processor LLVM 14 25 s
AWS/Linux Intel Ice Lake processor GCC 11 54 s
AWS/Linux Intel Ice Lake processor LLVM 14 54 s
WSL (Microsoft laptop) Intel Tiger Lake processor GCC 11 1 min*
WSL (Microsoft laptop) Intel Tiger Lake processor LLVM 14 1 min*

(*: timings measured after copying the source files to the WSL drive; see the update at the end of this post.)

On Intel processors, we build multiple kernels to support the various families of processors. On a 64-bit ARM processor, we only build one kernel. Thus the performance of the AWS/Linux system and that of the MacBook are somewhat comparable.

Let us switch back to Windows and build the library.

Debug Release
Visual Studio (default) 2 min 2 min 15 s
ClangCL 2 min 51 s 3 min 12 s
Windows LLVM (direct with lld) 2 min 2 min 4 s

Let us run an execution benchmark. We pick a UTF-8 Arabic file that we load in memory and that we transcode to UTF-16 using a fast AVX-512 algorithm. (The exact command is benchmark -P convert_utf8_to_utf16+icelake -F Arabic-Lipsum.utf8.txt.)

Debug Release
Visual Studio (default) 0.789 GB/s 4.2 GB/s
ClangCL 0.360 GB/s 5.9 GB/s
WSL GCC 11 (omitted) 6.3 GB/s
WSL LLVM 14 (omitted) 5.9 GB/s
AWS Server (GCC) (omitted) 8.2 GB/s
AWS Server (clang) (omitted) 7.7 GB/s

I draw the following tentative conclusions:

  1. There may be a significant performance difference between Debug and Release code (e.g., between a 5x and a 15x difference).
  2. Compiling your Windows software with ClangCL may lead to better performance (in Release mode). In my test, I get a 40% speedup with ClangCL. However, compiling with ClangCL takes much longer. I have recommended that Windows users build their libraries with ClangCL and I maintain this recommendation.
  3. In Debug mode, the regular Visual Studio produces more performant code and it compiles faster than ClangCL.

Thus it might make sense to use the regular Visual Studio compiler in Debug mode as it builds fast and offers other benefits while testing the code, and then ClangCL in Release mode for the performance.

You may bypass ClangCL under Windows and build directly with clang with the LLVM linker for faster builds. However, I have not verified that you get the same speed.

No matter what I do, however, I seem to be getting slower builds under Windows than I expect. I am not exactly sure why building the code takes so much longer under Windows. It is not my hardware, since building with Linux under Windows, on the same laptop, is fast.

Of course, parallelizing the build is the most obvious way to speed it up. Appending -- -m to my CMake command, to let MSBuild run parallel jobs, could help. I deliberately avoided parallel builds in this experiment.

WSL Update. Reader Quan Anh Mai asked me whether I had made a copy of the source files in the Windows Subsystem for Linux drive, and I had not. Doing so multiplied the speed by a factor of three. I included the better timings.

Computing the UTF-8 size of a Latin 1 string quickly (AVX edition)

Computers represent strings using bytes. Most often, we use the Unicode standard to represent characters in bytes. The universal format to exchange strings online is called UTF-8. It can represent over a million characters while retaining compatibility with the ancient ASCII format.

Though most of our software stack has moved to Unicode strings, there are still older standards like Latin 1 used for European strings. Thus we often need to convert Latin 1 strings to UTF-8. JavaScript strings are often stored as either Latin 1, UTF-8 or UTF-16 internally. It is useful to first compute the size (in bytes) of the eventual UTF-8 strings.

Thankfully, it is an easy problem: it suffices to count two output bytes for each input byte having the most significant bit set, and one output byte for each other byte. A relatively simple C++ function suffices:

size_t scalar_utf8_length(const char * c, size_t len) {
  size_t answer = 0;
  for(size_t i = 0; i<len; i++) {
    if((c[i]>>7)) { answer++;}
  }
  return answer + len;
}

To go faster, we may want to use fancy “SIMD” instructions: instructions that process several bytes at once. Your compiler is likely to already do some autovectorization with the simple C++ function: at compile time, it will use some SIMD instructions. However, you can try to hand-code your own version.

We have several instruction sets to choose from, but let us pick the AVX2 instruction set, available on most x64 processors today. AVX2 has a fast instruction that extracts the most significant bit of each byte (movemask), and processors can quickly count the set bits (popcount). Thus the following routine should do well (credit to Anna Henningsen):

size_t avx2_utf8_length_basic(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i;
  for (i = 0; i + sizeof(__m256i) <= len; i += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
    answer += __builtin_popcount(_mm256_movemask_epi8(input));
  }
  return answer + scalar_utf8_length((const char *)(str + i), len - i);
}

Can you do better? On Intel processors, both the ‘movemask’ and population-count instructions are fast, but they have some latency: it takes several cycles for them to execute. They may also have additional execution constraints. Part of the latency and of the constraints is not going to get better with instruction sets like AVX-512, because this approach requires moving from SIMD registers to general registers at every iteration. It will similarly be a bit of a challenge to port this routine to ARM processors.

Thus we would like to rely on cheaper instructions, and stay in SIMD registers until the end. Even if it does not improve the speed of the AVX code, it may work better algorithmically with other instruction sets.

For this purpose, we can borrow a recipe from the paper Faster Population Counts Using AVX2 Instructions (Computer Journal 61 (1), 2018). The idea is to quickly extract the bits and add them to a counter that stays within a SIMD register, only extracting the totals at the end.

The code is slightly more complicated because we have an inner loop. Within the inner loop, we use 8-bit counters, only moving to 64-bit counters at the end of the inner loop. To ensure that there is no overflow, the inner loop can only run for 255 iterations. The code looks as follows…

size_t avx2_utf8_length_mkl(const uint8_t *str, size_t len) {
  size_t answer = len / sizeof(__m256i) * sizeof(__m256i);
  size_t i = 0;
  __m256i four_64bits = _mm256_setzero_si256();
  while (i + sizeof(__m256i) <= len) {
    __m256i runner = _mm256_setzero_si256();
    // cap the inner loop at 255 iterations so the 8-bit counters cannot overflow
    size_t iterations = (len - i) / sizeof(__m256i);
    if (iterations > 255) { iterations = 255; }
    size_t max_i = i + iterations * sizeof(__m256i) - sizeof(__m256i);
    for (; i <= max_i; i += sizeof(__m256i)) {
      __m256i input = _mm256_loadu_si256((const __m256i *)(str + i));
      runner = _mm256_sub_epi8(
          runner, _mm256_cmpgt_epi8(_mm256_setzero_si256(), input));
    }
    four_64bits = _mm256_add_epi64(
        four_64bits, _mm256_sad_epu8(runner, _mm256_setzero_si256()));
  }
  answer += _mm256_extract_epi64(four_64bits, 0) +
            _mm256_extract_epi64(four_64bits, 1) +
            _mm256_extract_epi64(four_64bits, 2) +
            _mm256_extract_epi64(four_64bits, 3);
  return answer + scalar_utf8_length((const char *)(str + i), len - i);
}

We can also further unroll the inner loop to save a bit on the number of instructions.

I wrote a small benchmark with 8 kB random inputs, running on an AMD Rome (Zen2) server with GCC 11 (-O3 -march=native). The results will vary depending on your input, your compiler and your processor.

function cycles/byte instructions/byte instructions/cycle
scalar (no autovec) 0.89 3.3 3.8
scalar (w. autovec) 0.56 0.71 1.27
AVX2 (movemask) 0.055 0.15 2.60
AVX2 (in-SIMD) 0.039 0.15 3.90
AVX2 (in-SIMD/unrolled) 0.028 0.07 2.40

So the fastest method in my test is over 30 times faster than the purely scalar version. If I allow the scalar version to be ‘autovectorized’ by the compiler, it gets about 50% faster.

It is an interesting scenario because the number of instructions retired per cycle varies so much. The slightly more complex “in-SIMD” function does better than the ‘movemask’ function because it manages to retire more instructions per cycle in these tests. The unrolled version is fast and requires few instructions per byte, but it retires relatively few instructions per cycle.

My code is available. It should be possible to further tune this code. You need privileged access to run the benchmark because I rely on performance counters.

Further work: It is not necessary to use SIMD instructions explicitly; you can use SWAR instead for a compromise between portability and performance, as sketched below.
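
To illustrate, here is a minimal SWAR sketch of my own (not the code benchmarked above): it processes eight bytes at a time in a 64-bit word, keeping only the most significant bit of each byte and counting these bits with a population count (here the GCC/clang builtin).

#include <cstddef>
#include <cstdint>
#include <cstring>

size_t swar_utf8_length(const char *c, size_t len) {
  size_t answer = len;
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, c + i, 8);       // safe unaligned load of 8 bytes
    word &= 0x8080808080808080ULL;      // keep only the top bit of each byte
    answer += (size_t)__builtin_popcountll(word); // one extra output byte per non-ASCII byte
  }
  for (; i < len; i++) {                // handle the remaining bytes
    answer += (uint8_t)c[i] >> 7;
  }
  return answer;
}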

Follow-up: I have a NEON version of this post.

Remark. Whenever I mention Latin 1, some people are prone to remark that browsers treat HTML pages declared as Latin 1 and ASCII as windows-1252. That is because modern web browsers do not support Latin 1 and ASCII in HTML. Even so, you should not use Latin 1, ASCII or even windows-1252 for your web pages. I recommend using Unicode (UTF-8). However, if you code in Python, Go or Node.js, and you declare a string as Latin 1, it should be Latin 1, not windows-1252. It is a bug to confuse Latin 1, ASCII and windows-1252. They are different formats.


Science and Technology links (February 12 2023)

  1. Kenny finds that the returns to education are declining. Rich countries are spending more on education, with comparatively weaker test results. It costs more than ever to train a PhD student, and it takes ever longer for them to complete their studies. There are 25 times more crop researchers today than in the 1970s, but the rate of progress remains constant. Researchers are better at producing patents, but the quality of these patents may be decreasing. Furthermore, there aren’t ever more people to educate: 2012 was “peak child”, the year in history when we had the most births. We are now in decline. Fertility levels are below replacement levels in most high-income countries.
  2. Exercise prevents muscle aging at the gene-expression level.
  3. 28% of people in Japan are 65 years old or older.
  4. People who undergo colonoscopies reduce their risk of death by 0.01% in absolute terms.
  5. Jupiter has 92 known moons while Saturn has 83; however, these numbers increase over time as we discover more. Most of these moons are tiny objects.
  6. Homo erectus could still be found 100,000 years ago in Indonesia. They may have mated with Denisovans, who themselves cohabited with modern human beings.
  7. Living in the country is not necessarily cheaper than living in a city.
  8. Though it is often believed that the second world war led to increased industrial output in the USA, the opposite is true.
  9. The city of Venice has been ‘sinking’ underwater for centuries. However, in the last few years, the trend has been reversed: Venice is gaining land.
  10. Both left-wing and right-wing individuals can hold authoritarian preferences. Left-wing authoritarianism is associated with the belief in a COVID disease threat, in an ecological threat or in a Trump threat.

Bit Hacking (with Go code)

At a fundamental level, a programmer needs to manipulate bits. Modern processors operate over data by loading it into ‘registers’, not as individual bits. Thus a programmer must know how to manipulate the bits within a register. Generally, we can do so while programming with 8-bit, 16-bit, 32-bit and 64-bit integers. For example, suppose that I want to set an individual bit to the value 1. Let us pick the bit at index 12 in a 64-bit word. The word with just the bit at index 12 set is 1<<12: the number 1 shifted to the left 12 times, or 4096.

In Go, we format numbers using the fmt.Printf function: we use a string with formatting instructions followed by the values we want to print. We begin a formatting sequence with the letter % which has a special meaning (if one wants to print %, one must use the string %%). It can be followed by the letter b which stands for binary, the letter d (for decimal) or x (for hexadecimal). Sometimes we want to specify the minimal length (in characters) of the output, and we do so with a leading number: e.g., fmt.Printf("%100d", 4096) prints a 100-character string that ends with 4096 and begins with spaces. We can specify zero as the padding character rather than the space by adding it as a prefix (e.g., "%0100d"). Thus, in Go, we may print the individual bits in a word as in the following example:

package main

import "fmt"

func main() {
    var x uint64 = 1 << 12
    fmt.Printf("%064b", x)
}

Running this program we get a binary string representing 1<<12:

0000000000000000000000000000000000000000000000000001000000000000

The general convention when printing numbers is that the most significant digits are printed first followed by the least significant digits: e.g., we write 1234 when we mean 1000 + 200 + 30 + 4. Similarly, Go prints the most significant bits first, and so the number 1<<12 has 64-13=51 leading zeros followed by a 1 with 12 trailing zeros.

We might find it interesting to revisit how Go represents negative integers. Let us take the 64-bit integer -2. Using two’s complement notation, the number is represented as the unsigned number (1<<64)-2, which is a word made entirely of ones, except for the least significant bit. We can use the fact that a cast operation in Go (e.g., uint64(x)) preserves the binary representation:

package main

import "fmt"

func main() {
    var x int64 = -2
    fmt.Printf("%064b", uint64(x))
}

This program will print 1111111111111111111111111111111111111111111111111111111111111110 as expected.

Go has some relevant binary operators that we often use to manipulate bits:

&    bitwise AND
|    bitwise OR
^    bitwise XOR
&^   bitwise AND NOT

Furthermore, the symbol ^ also serves to flip all the bits of a word when used as a unary operation: a ^ b computes the bitwise XOR of a and b, whereas ^a flips all bits of a. We can verify that a|b == (a^b) | (a&b) == (a^b) + (a&b).

We have other useful identities. For example, given two integers a and b, we have that a+b == (a^b) + 2*(a&b). In this identity, 2*(a&b) represents the carries whereas a^b represents the addition without the carries. Consider for example 0b1001 + 0b0101 (9 + 5): the bits set in both words (a&b == 0b0001) generate carries, contributing 2*(a&b) == 0b0010, whereas the bits set in only one of the two words (a^b == 0b1100) add up without carrying; the total is 0b1100 + 0b0010 == 0b1110 (14). We have that 2*(a|b) == 2*(a&b) + 2*(a^b), thus a+b == (a^b) + 2*(a&b) becomes a+b == 2*(a|b) - (a^b). These relationships are valid whether we consider unsigned or signed integers, since the operations (bitwise logical, addition and subtraction) are identical at the bit level.

Setting, clearing and flipping bits

We know how to create a 64-bit word with just one bit set to 1 (e.g., 1<<12). Conversely, we can also create a word that is made of 1s except for a 0 at bit index 12 by flipping all bits: ^uint64(1<<12). Before flipping all the bits of an expression, it is sometimes useful to specify its type (e.g., uint64 or uint32) so that the result is unambiguous.

We can then use these words to affect an existing word:

  1. If we want to set the 12th bit of word w to one: w |= 1<<12.
  2. If we want to clear (set to zero) the 12th bit of word w: w &^= 1<<12 (which is equivalent to w = w & ^uint64(1<<12)).
  3. If we just want to flip (send zeros to ones and ones to zeros) the 12th bit: w ^= 1<<12.

We may also affect a range of bits. For example, we know that the word (1<<12)-1 has its 12 least significant bits set to ones, and all other bits set to zeros.

  1. If we want to set the 12 least significant bits of the word w to ones: w |= (1<<12)-1.
  2. If we want to clear (set to zero) the 12 least significant bits of word w: w &^= (1<<12)-1.
  3. If we want to flip the 12 least significant bits: w ^= (1<<12)-1.
    The expression (1<<12)-1 is general in the sense that if we want to select the 60 least significant bits, we might do (1<<60)-1. It even works with 64 bits: (1<<64)-1 has all bits set to 1.

We can also generate a word that has an arbitrary range of bits set: the word ((1<<13)-1) ^ ((1<<2)-1) has the bits from index 2 to index 12 (inclusively) set to 1, other bits are set to 0. With such a construction, we can set, clear or flip an arbitrary range of bits within a word, efficiently.

We can set any bit we like in a word. But what about querying bits? We can check whether the 12th bit is set in the word w by checking whether w & (1<<12) is non-zero. Indeed, the expression w & (1<<12) has value 1<<12 if the 12th bit is set in w and, otherwise, it has value zero. We can extend such a check: we can verify whether any of the bits from index 2 to index 12 (inclusively) is set to 1 by computing w & (((1<<13)-1) ^ ((1<<2)-1)). The result is zero if and only if no bit in the specified range is set to one.

Efficient and safe operations over integers

By thinking about values in terms of their bit representation, we can write more efficient code or, equivalently, have a better appreciation for what optimized binary code might look like. Consider the problem of checking whether two numbers have the same sign: we want to know whether they are both smaller than zero, or both greater than or equal to zero. A naive implementation might look as follows:

func SlowSameSign(x, y int64) bool {
    return ((x < 0) && (y < 0)) || ((x >= 0) && (y >= 0))
}

However, let us think about what distinguishes negative integers from other integers: their most significant bit is set (as unsigned values, their top bit is one). If we take the exclusive or (xor) of two integers, then the result has its most significant bit set to zero if and only if their signs agree. That is, the result is non-negative if and only if the signs agree. We may therefore prefer the following function to determine whether two integers have the same sign:

func SameSign(x, y int64) bool {
    return (x ^ y) >= 0
}

Suppose that we want to check whether x and y differ by at most 1. Maybe x is smaller than y, but it could be larger. We can again exploit the fact that unsigned arithmetic wraps around: for 64-bit integers, uint64(x-y)+1 <= 2 precisely when x and y differ by at most 1 (keeping in mind that the subtraction itself may wrap around for extreme values).

Let us consider the problem of computing the average of two integers. We have the following correct function:

func Average(x, y uint16) uint16 {
    if y > x {
        return (y-x)/2 + x
    } else {
        return (x-y)/2 + y
    }
}

With a better knowledge of the integer representation, we can do better.

We have another relevant identity x == 2*(x>>1) + (x&1). It means that x/2 is within [(x>>1), (x>>1)+1). That is, x>>1 is the greatest integer no larger than x/2. Conversely, we have that (x+(x&1))>>1 is the smallest integer no smaller than x/2.

We have that x+y == (x^y) + 2*(x&y). Hence, (x+y)>>1 == ((x^y)>>1) + (x&y) (ignoring overflows in x+y). It follows that ((x^y)>>1) + (x&y) is the greatest integer no larger than (x+y)/2. We also have that x+y == 2*(x|y) - (x^y), or x+y + ((x^y)&1) == 2*(x|y) - (x^y) + ((x^y)&1), and so (x+y+((x^y)&1))>>1 == (x|y) - ((x^y)>>1) (ignoring overflows in x+y+((x^y)&1)). It follows that (x|y) - ((x^y)>>1) is the smallest integer no smaller than (x+y)/2. The difference between (x|y) - ((x^y)>>1) and ((x^y)>>1) + (x&y) is (x^y)&1. Hence, we have the following two fast functions:

func FastAverage1(x, y uint16) uint16 {
    return (x|y) - ((x^y)>>1)
}
func FastAverage2(x, y uint16) uint16 {
    return ((x^y)>>1) + (x&y)
}

Though we use the type uint16, it works irrespective of the integer size (uint8, uint16, uint32, uint64) and it also applies to signed integers (int8, int16, int32, int64).

Efficient Unicode processing

In UTF-16, we may have surrogate pairs: the first word in the pair is in the range 0xd800 to 0xdbff whereas the second word is in the range 0xdc00 to 0xdfff. How may we detect surrogate pairs efficiently? If the values are stored using a uint16 type, then it would seem that we could detect a value that is part of a surrogate pair with two comparisons: (x>=0xd800) && (x<=0xdfff). However, it may prove more efficient to use the fact that subtractions naturally wrap around: as uint16 values, 0-0xd800 == 0x2800. Thus x-0xd800 will range between 0 and 0xdfff-0xd800 (0x7ff) inclusively whenever x is part of a surrogate pair, whereas any other value will produce a result larger than 0x7ff. Thus, a single comparison is needed: (x-0xd800)<=0x7ff.
Once we have determined that we have a value that might correspond to a surrogate pair, we may check that the first value x1 is valid (in the range 0xd800 to 0xdbff) with the condition (x1-0xd800)<=0x3ff, and similarly for the second value x2: (x2-0xdc00)<=0x3ff. We may then reconstruct the code point as (1<<16) + ((x1-0xd800)<<10) + (x2-0xdc00). In practice, you may not need to concern yourself with such an optimization since your compiler might do it for you. Nevertheless, it is important to keep in mind that what might seem like multiple comparisons could actually be implemented as a single one.

Basic SWAR

Modern processors have specialized instructions capable of operating over multiple units of data with a single instruction (called SIMD, for Single Instruction, Multiple Data). We can do several operations using a single instruction (or a few instructions) with a technique called SWAR (SIMD Within A Register). Typically, we are given a 64-bit word w (uint64) and we want to treat it as a vector of eight 8-bit words (uint8).

Given a byte value (uint8) x, we can have it replicated over all eight bytes of a word with a single multiplication: x * uint64(0x0101010101010101). For example, we have 0x12 * uint64(0x0101010101010101) == 0x1212121212121212. This approach can be generalized in various ways. For example, we have that 0x7 * uint64(0x1101011101110101) == 0x7707077707770707.

For convenience, let us define b80 to be the uint64 value 0x8080808080808080 and b01 to be the uint64 value 0x0101010101010101. We can check whether all bytes of a word are smaller than 128: we compute the bitwise AND between our 64-bit word and b80 (which is 0x80 * b01, the word where only the most significant bit of each byte is set) and check whether the result is zero: (w & b80) == 0. This might compile to two or three instructions on a processor.

We can check whether any byte is zero, assuming that we have already checked that all bytes are smaller than 128: the expression ((w - b01) & b80) is nonzero if and only if at least one byte is zero. If we are not sure that the bytes are smaller than 128, we can add an AND NOT operation: ((w - b01) &^ w) & b80 is nonzero if and only if at least one byte is zero. Checking whether a byte is zero allows us to check whether two words, w1 and w2, have a matching byte value since, when this happens, w1^w2 has a zero byte value.

We can also design more complicated operations if we assume that all byte values are smaller than 128. For example, we may check that all byte values are less than a 7-bit value t with the expression ((w + (0x80 - t) * b01) & b80) == 0. If the value t is a constant, then the multiplication is evaluated at compile time, and the check should be barely more expensive than checking whether all bytes are smaller than 128. In Go, we check that all byte values are less than 77 (assuming that they are all smaller than 128) by verifying that b80 & (w + (128-77) * b01) is zero. Similarly, we can check that all byte values are greater than a 7-bit value t, assuming that they are also all smaller than 128: (((b80 - w) + t * b01) & b80) == 0. We can generalize further. Suppose we want to check that all bytes are at least as large as the 7-bit value a while being less than the 7-bit value b. It suffices to check that ((w + b80 - a * b01) ^ (w + b80 - b * b01)) & b80 == b80.

Rotating and Reversing Bits

Given a word, we say that we rotate the bits if we shift them left or right, while moving the leftover bits back to the other end. To illustrate the concept, suppose that we are given the 8-bit integer 0b11110000 and we want to rotate it left by 3 bits. The Go language provides a function for this purpose (bits.RotateLeft8 from the math/bits package): we get 0b10000111. In Go, there is no rotate-right operation. However, rotating left by 3 bits is the same as rotating right by 5 bits when processing 8-bit integers. Go provides rotation functions for 8-bit, 16-bit, 32-bit and 64-bit integers.

Suppose that you would like to know if two 64-bit words (w1 and w2) have matching byte values, irrespective of the ordering. We know how to check whether they have matching byte values at the same positions efficiently: the expression ((w1^w2 - b01) &^ (w1^w2)) & b80 is nonzero if and only if some byte of w1 equals the corresponding byte of w2. To compare all bytes with all other bytes, we can repeat the same test as many times as there are bytes in a word (eight times for 64-bit integers), rotating one of the words by 8 bits between tests:

((w1^w2 - b01) &^ (w1^w2)) & b80 // nonzero when a byte of w1 matches the corresponding byte of w2
w1 = bits.RotateLeft64(w1, 8)
((w1^w2 - b01) &^ (w1^w2)) & b80
w1 = bits.RotateLeft64(w1, 8)
...

We recall that words can be interpreted as little-endian or big-endian depending on whether the first bytes are the least significant or the most significant. Go allows you to reverse the order of the bytes in a 64-bit word with the function bits.ReverseBytes64 from the math/bits package. There are similar functions for 16-bit and 32-bit words. We have that bits.ReverseBytes16(0xcc00) == 0x00cc. Reversing the bytes in a 16-bit word, and rotating by 8 bits, are equivalent operations.

We can also reverse bits. We have that bits.Reverse16(0b1111001101010101) == 0b1010101011001111. Go has functions to reverse bits for 8-bit, 16-bit, 32-bit and 64-bit words. Many processors have fast instructions to reverse the bit orders, and it can be a fast operation.

Fast Bit Counting

It can be useful to count the number of bits set to 1 in a word. Go has fast functions for this purpose in the math/bits package for words having 8 bits, 16 bits, 32 bits and 64 bits. Thus we have that bits.OnesCount16(0b1111001101010101) == 10.

Similarly, we sometimes want to count the number of trailing or leading zeros. The number of trailing zeros is the number of consecutive zero bits appearing in the least significant positions. For example, the word 0b1 has no trailing zero, whereas the word 0b100 has two trailing zeros. When the input is a power of two, the number of trailing zeros is the logarithm in base two. We can use the Go functions bits.TrailingZeros8, bits.TrailingZeros16 and so forth to compute the number of trailing zeros. The number of leading zeros is similar, but we start from the most significant positions. Thus the 8-bit integer 0b10000000 has zero leading zeros, while the integer 0b00100000 has two leading zeros. We can use the functions bits.LeadingZeros8, bits.LeadingZeros16 and so forth.

While the number of trailing zeros directly gives the logarithm of powers of two, we can use the number of leading zeros to compute the logarithm of any integer, rounded down to the nearest integer. For 32-bit integers, the following function provides the correct result:

func Log2Up(x uint32) int {
    return 31 - bits.LeadingZeros32(x|1)
}

We can also compute other logarithms. Intuitively, this ought to be possible because if log_b is the logarithm in base b, then log_b(x) == log_2(x)/log_2(b). In other words, all logarithms are identical up to a constant factor (e.g., 1/log_2(b)).

For example, we might be interested in the number of decimal digits necessary to represent an integer (e.g., the integer 100 requires three digits). The general formula is ceil(log(x+1)) where the logarithm should be taken in base 10. We can show that the following function (designed by an engineer called Kendall Willets) computes the desired number of digits for 32-bit integers:

func DigitCount(x uint32) uint32 {
    var table = []uint64{
        4294967296, 8589934582, 8589934582,
        8589934582, 12884901788, 12884901788,
        12884901788, 17179868184, 17179868184,
        17179868184, 21474826480, 21474826480,
        21474826480, 21474826480, 25769703776,
        25769703776, 25769703776, 30063771072,
        30063771072, 30063771072, 34349738368,
        34349738368, 34349738368, 34349738368,
        38554705664, 38554705664, 38554705664,
        41949672960, 41949672960, 41949672960,
        42949672960, 42949672960}
    return uint32((uint64(x) + table[Log2Up(x)]) >> 32)
}

Though the function is a bit mysterious, its computation mostly involves computing the number of leading zeros, and using the result to look up a value in a table. It translates into only a few CPU instructions and is efficient.

Indexing Bits

Given a word, it is sometimes useful to compute the position of the set bits (bits set to 1). For example, given the word 0b11000111, we would like to have the indexes 0, 1, 2, 6, 7 corresponding to the 5 bits with value 1. We can determine efficiently how many indexes we need to produce thanks to the bits.OnesCount functions. The bits.TrailingZeros functions can serve to identify the position of a bit. We may also use the fact that x & (x-1) sets to zero the least significant 1-bit of x. The following Go function generates an array of indexes:

func Indexes(x uint64) []int {
    var ind = make([]int, bits.OnesCount64(x))
    pos := 0
    for x != 0 {
        ind[pos] = bits.TrailingZeros64(x)
        x &= x - 1
        pos += 1
    }
    return ind
}

Given 0b11000111, it produces the array 0, 1, 2, 6, 7:

var x = uint64(0b11000111)
for _, v := range Indexes(x) {
    fmt.Println(v)
}

If we want to compute the bits in reverse order (7, 6, 2, 1, 0), we can do so with a bit-reversal function, like so:

for _, v := range Indexes(bits.Reverse64(x)) {
    fmt.Println(63 - v)
}

Conclusion

As a programmer, you may access, set, copy, or move individual bit values efficiently. With some care, you can avoid arithmetic overflows without much of a performance penalty. With SWAR, you can use a single word as if it was made of several subwords. Though most of these operations are only rarely needed, it is important to know that they are available.

Serializing IPs quickly in C++

On the Internet, we often use 32-bit addresses which we serialize as strings such as 192.128.0.1. The string corresponds to the integer address 0xc0800001 (3229614081 in decimal).
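The correspondence is simple: each of the four numbers is one byte of the 32-bit integer. In C++ (a minimal sketch):

uint32_t address = (uint32_t(192) << 24) | (uint32_t(128) << 16)
                 | (uint32_t(0) << 8)    | uint32_t(1);
// address == 0xc0800001 == 3229614081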

How might you go from the integer to the string, that is, serialize the address, efficiently in C++?

The simplest code in modern C++ might look as follows:

// assuming a 32-bit unsigned integer 'address'
std::string output = std::to_string(address >> 24);
for (int i = 2; i >= 0; i--) {
  output.append("." + std::to_string((address >> (i * 8)) % 256));
}

At least symbolically, it will repeatedly create small strings that are appended to an initial string.

Can we do better? C++ has newer functions (std::to_chars, from the <charconv> header) dedicated to writing numbers quickly to a character buffer. So we might try to allocate a single string and write to it directly. The result is not pretty:

std::string output(4 * 3 + 3, '\0'); // allocate just one big string
char *point = output.data();
char *point_end = output.data() + output.size();
point = std::to_chars(point, point_end, uint8_t(address >> 24)).ptr;
for (int i = 2; i >= 0; i--) {
  *point++ = '.';
  point = std::to_chars(point, point_end, uint8_t(address >> (i * 8))).ptr;
}
output.resize(point - output.data());

The uglier code might be faster.

I wrote a benchmark where I repeatedly create strings and add them to containers.

On an AMD Rome (Zen2) processor using GCC 11, I get the following numbers.

pretty code   80 ns/string
to_chars      24 ns/string

So about three times faster?

We can go at least twice as fast with a bulkier function, but it uses significantly more memory (a lookup table of about one kilobyte):

  // uses 1025 bytes
  constexpr static const char *lookup =
".0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .10 .11 .12 .13 .14 .15 .16 .17 "
".18 .19 .20 .21 .22 .23 .24 .25 .26 .27 .28 .29 .30 .31 .32 .33 .34 .35 "
".36 .37 .38 .39 .40 .41 .42 .43 .44 .45 .46 .47 .48 .49 .50 .51 .52 .53 "
".54 .55 .56 .57 .58 .59 .60 .61 .62 .63 .64 .65 .66 .67 .68 .69 .70 .71 "
".72 .73 .74 .75 .76 .77 .78 .79 .80 .81 .82 .83 .84 .85 .86 .87 .88 .89 "
".90 .91 .92 .93 .94 .95 .96 .97 .98 .99 "
".100.101.102.103.104.105.106.107.108.109.110.111.112.113.114.115.116."
"117.118.119.120.121.122.123.124.125.126.127.128.129.130.131.132.133.134."
"135.136.137.138.139.140.141.142.143.144.145.146.147.148.149.150.151.152."
"153.154.155.156.157.158.159.160.161.162.163.164.165.166.167.168.169.170."
"171.172.173.174.175.176.177.178.179.180.181.182.183.184.185.186.187.188."
"189.190.191.192.193.194.195.196.197.198.199.200.201.202.203.204.205.206."
"207.208.209.210.211.212.213.214.215.216.217.218.219.220.221.222.223.224."
"225.226.227.228.229.230.231.232.233.234.235.236.237.238.239.240.241.242."
"243.244.245.246.247.248.249.250.251.252.253.254.255";
  std::string output(4 * 3 + 3, '\0'); // maximal length: "255.255.255.255"
  char *point = output.data();
  uint8_t by;
  by = address >> 24;
  // each table entry spans 4 bytes: '.', the number, then padding
  std::memcpy(point, lookup + by * 4 + 1, 4); // skip the dot for the first octet
  point += (by < 10) ? 1 : (by < 100 ? 2 : 3);
  by = address >> 16;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  by = address >> 8;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  by = address >> 0;
  std::memcpy(point, lookup + by * 4, 4);
  point += (by < 10) ? 2 : (by < 100 ? 3 : 4);
  output.resize(point - output.data()); // trim the padding


In my benchmark, I include a version by Marcin Zukowski that uses even more memory (about four times as much) and is slightly faster.

As always, your results will vary depending on your system and your compiler. However, I recommend against creating small strings and then aggregating them in performance-critical code.

Credit: Peter Dimov and Marcin Zukowski have contributed fast versions (see benchmarks). The version with a large table is derived from their work. Ivan-Assen Ivanov contributed ideas on Twitter.

Move or copy your strings? Possible performance impacts

You sometimes want to add a string to an existing data structure. For example, the C++17 template ‘std::optional’ may be used to represent a possible string value. You may copy it there, as this code would often do…

std::string mystring;
std::optional<std::string> myoption;
myoption = mystring;

Or you can move it:

std::string mystring;
std::optional<std::string> myoption;
myoption = std::move(mystring);

In C++, when ‘moving’ a value, the compiler does not need to create a whole new copy of the string: it can typically just transfer ownership of the internal buffer. So a move is often much cheaper.
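One caveat worth noting (a minimal sketch): after the move, the source string is left in a valid but unspecified state.

std::string mystring = "a sentence too long for the small-string buffer";
std::optional<std::string> myoption = std::move(mystring);
// mystring is now valid but unspecified:
// assign to it or destroy it, but do not rely on its content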

I wrote a little benchmark to assess the performance difference. It is a single test, but it should illustrate the magnitude of the effect.

Firstly, for relatively long strings (a phrase or a sentence), the move is 5 times to 20 times faster.

                              copy            move
Apple LLVM 14, M2 processor   24 ns/string    1.2 ns/string
GCC 11, Intel Ice Lake        19 ns/string    4 ns/string

Secondly, for short strings (a single word), the move is 1.5 times to 3 times faster but the absolute difference is small (as small as a fraction of a nanosecond). Your main concern should be with long strings.

                              copy            move
Apple LLVM 14, M2 processor   2.0 ns/string   1.2 ns/string
GCC 11, Intel Ice Lake        7 ns/string     2.6 ns/string

My results illustrate that moving your sizeable data structures instead of copying them is beneficial.

But that’s not the fastest approach: the fastest approach is to just hold a pointer, since copying an address is unbeatably fast. A slightly less optimal approach is to use a lightweight object like an std::string_view: copying or creating an std::string_view is cheaper than doing the same with a C++ string.
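For example (a sketch; keep in mind that a view does not own its data):

std::string s = "a long string stored exactly once";
std::string_view v = s; // no allocation: just a pointer and a length
std::string_view w = v; // copying a view is trivially cheap
// caveat: v and w dangle if s is destroyed or modified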

International domain names: where does https://meßagefactory.ca lead you?

Originally, the domain part of a web address was all ASCII (so no accents, no emojis, no Chinese characters). This was extended a long time ago thanks to something called internationalized domain names (IDN).

Today, in theory, you can use any Unicode character you like as part of a domain name, including emojis. Whether that is wise is something else.

What does the standard say? Given a domain name, we should first identify its labels. They are normally separated by dots (.): www.microsoft.com has three labels. But a few other Unicode characters may also serve as label separators: the fullwidth full stop ．(U+FF0E), the ideographic full stop 。(U+3002) and the halfwidth ideographic full stop ｡(U+FF61). Each label is further processed. If it is all ASCII, then it is left as is. Otherwise, we must convert it to an ASCII encoding called “punycode” after doing the following according to RFC 3454:

  • Map characters (section 3 of RFC 3454),
  • Normalize (section 4 of RFC 3454),
  • Reject forbidden characters,
  • Optionally reject based on unassigned code points (section 7).

And then you get to the punycode algorithm itself. There are further conditions to be satisfied; for example, the domain name in ASCII form cannot exceed 255 bytes.
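Putting the pieces together, the per-label flow looks roughly like this (a rough C++ sketch; is_ascii, nameprep and punycode_encode are hypothetical helpers, not a conforming implementation):

std::string to_ascii_label(const std::string &label) {
  if (is_ascii(label)) {
    return label; // ASCII labels pass through unchanged
  }
  // map, normalize and reject forbidden characters (RFC 3454)...
  std::string prepared = nameprep(label);
  // ...then apply the punycode algorithm and add the ACE prefix
  return "xn--" + punycode_encode(prepared);
}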

That’s quite a lot of work. The goal is to transcribe each Unicode domain name into an ASCII domain name. You would hope that it would be a well-defined algorithm: given a Unicode domain name, there should be a unique output.

Let us choose a common non-ASCII character, the letter ß, called Eszett. Let me create a link with this character: https://meßagefactory.ca

What happens if you click on this link? The result depends on your browser. If you are using Microsoft Edge, Google Chrome or the Brave browser, you may end up at https://messagefactory.ca/. If you are using Safari or Firefox, you may end up at https://xn--meagefactory-m9a.ca. Of course, your results may vary depending on your exact system. Under iOS (iPhone), I expect that the Safari behaviour will prevail irrespective of your browser.

Not what I expected.

Update: We wrote our own library to process international domain names according to the standard: it is called idna and is part of the ada-url project. Our library produces https://xn--meagefactory-m9a.ca, which is the non-transitional (and now correct) answer.

Year 2022: Scientific progress

The year 2022 is over. As with every year that passes, we have made some scientific progress. I found the following achievements interesting:

Science and technology links (January 15 2023)

    1. For under $600, one can buy a 20-terabyte disk on Amazon. Unless you work professionally in multimedia, it is more storage than you need. However, having much storage is, by itself, of little use if you cannot access it. Thankfully, you can buy a 1-terabyte “disk” for $200 that provides over 6 GB/s of bandwidth. I have a similar disk in my game console. Is this as good as it gets? Researchers show that we can transmit data over a distance at more than a petabit per second. According to some estimates, that is more than the total data size of the books in the Library of Congress, per second.
    2. Transplanting rejuvenated blood stem cells extends lifespan of aged immunocompromised mice.
    3. Amazon is using drones for deliveries in California and Texas.
    4. People who think of themselves as less attractive are more likely to wear surgical masks.
    5. Conversations rarely end when they should.
    6. Using legal restrictions, manufacturers are able to prevent their customers from repairing their own products. There may be hope. Farmers in the US will be allowed to repair John Deere tractors.
    7. For most of my life, nuclear power has been viewed with suspicion, and little effort has been made to exploit it further. The tide might finally be turning. The UK government plans to authorize small modular nuclear reactors, among other nuclear-power innovations.
    8. Cancer rates are falling rapidly in both men and women.
    9. There is evidence that vitamin D supplements might help your immune system if your levels are too low. However, you may also require magnesium supplements to benefit from the vitamin D.
    10. In the post-pandemic era, people work fewer hours.
    11. There is evidence that viruses dormant in our genomes can wake up when we are older and harm us.
    12. Research papers and patents are becoming less disruptive over time. However, we are producing many more research papers and patents.
    13. Genetically modified immune cells are used to fight cancer in a patient.
    14. Increasing house insulation may not lead to lower energy usage in the long run. This is not, per se, an argument against better insulation. Yet it suggests that we should plan for increased power production.
    15. In a paper published by Nature, Kleinherenbrink et al. find that global mean sea levels are likely rising according to a linear curve, as opposed to an accelerating curve. The current rate is estimated to be between 3 and 4 millimeters per year. Meanwhile, the lowest-lying island nations on the planet are growing.
    16. Antarctica might be getting cooler.
    17. People prefer to stay away from promiscuous men.
    18. Cold temperatures harm the antiviral immunity of your nose. Thus cold weather may make you more prone to catching a virus.
    19. Replacing grades with pass/fail scores in courses leads students to put in less effort. In my view, it does not imply that we should not adopt pass/fail scores, because there are diminishing returns to more effort. E.g., if the kids in a country spend much more time perfecting their knowledge of trigonometry, you may not end up with a more prosperous country. In some sense, intense competition may be a net loss.
    20. Using solar power generation in countries such as Switzerland results in a net energy loss: though the solar panels produce energy, they never recoup the energy investment needed to make them and deploy them.

Care is needed to use C++ std::optional with non-trivial objects

We often have to represent in software a value that might be missing. Different programming languages have abstractions for this purpose.

A recent version of C++ (C++17) introduced the std::optional template. It is kind of neat. You can write code that prints a string, or a warning if no string is available, as follows:

void f(std::optional<std::string> s) {
  std::cout << s.value_or("no string") << std::endl;
}
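For example, one might call it as follows (both calls compile because std::optional<std::string> converts from these arguments):

f("hello");      // prints: hello
f(std::nullopt); // prints: no string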

I expect std::optional to be relatively efficient when working with trivial objects (an integer, a std::string_view, etc.). However, if you want to avoid copying your objects, you should use std::optional with care.

Let us consider an example built around some class A that owns a string.
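For illustration, assume a minimal class A along these lines (a hypothetical sketch; the exact class definition is not shown here):

class A {
public:
    A() = default;
    explicit A(std::string s) : content(std::move(s)) {}
private:
    std::string content; // non-trivial member: copying A copies the string
};

Now the example itself: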

A f(std::optional<A> z) {
    return z.value_or(A());
}

A g() { 
    A a("message"); 
    auto z = std::optional<A>(a); 
    return f(z); 
}

How many instances of the class A are constructed when we call the function g()? Up to five instances:

  1. At the start of the function we construct one instance of A.
  2. We then copy-construct this instance when passing it to std::optional<A>.
  3. Passing std::optional<A> to the function f involves another copy.
  4. In the value_or construction, we have a default construction of A (which is wasteful work in this instance).
  5. Finally, we copy construct an instance of A when the call to value_or terminates.

It is possible for an optimizing compiler to elide some or all of the superfluous constructions, especially if you can inline the functions, so that the compiler can see the useless work and prune it. However, in general, you may encounter inefficient code.

I do not recommend against using std::optional. There are effective strategies to avoid the copies, as the sketch below suggests. Just apply some care if performance is a concern.
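For instance, moves can remove most of the copies (a sketch, reusing the hypothetical class A from above):

A f(std::optional<A> z) {
    return std::move(z).value_or(A()); // moves out of the optional
}

A g() {
    A a("message");
    auto z = std::optional<A>(std::move(a)); // move instead of copy
    return f(std::move(z));                  // move into the parameter
}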

Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake

Most systems today rely on Unicode strings. However, we have two popular Unicode formats: UTF-8 and UTF-16. We often need to convert from one format to the other. For example, you might have a database formatted with UTF-16, but you need to produce JSON documents using UTF-8. This conversion is often called ‘transcoding’.

In the last few years, we wrote a specialized library that processes Unicode strings, with a focus on performance: the simdutf library. The library is used by JavaScript runtimes (Node.js and Bun).

The simdutf library is able to benefit from the latest and most powerful instructions on your processors. In particular, it does well with recent processors with AVX-512 instructions (Intel Ice Lake, Rocket Lake, as well as AMD Zen 4).
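As an illustration, here is how one might call the library to transcode UTF-8 to UTF-16 (a sketch based on my understanding of the simdutf public API):

#include "simdutf.h"

std::u16string to_utf16(const std::string &utf8) {
  size_t expected = simdutf::utf16_length_from_utf8(utf8.data(), utf8.size());
  std::u16string utf16(expected, u'\0');
  size_t written = simdutf::convert_utf8_to_utf16le(utf8.data(), utf8.size(),
                                                    utf16.data());
  if (written == 0) { return {}; } // the input was invalid UTF-8
  utf16.resize(written);
  return utf16;
}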

I do not yet have a Zen 4 processor, but Velu Erwan was kind enough to benchmark it for me. A reasonable task is to transcode an Arabic file from UTF-8 to UTF-16: it is typically a non-trivial task because Arabic UTF-8 is a mix of one-byte and two-byte characters that we must convert to two-byte UTF-16 characters (with validation). The steps required (under Linux) are as follows:

git clone https://github.com/simdutf/simdutf &&
cd simdutf &&
cmake -B build && cmake --build build &&
wget --content-disposition https://cutt.ly/d2cIxRx &&
./build/benchmarks/benchmark -F Arabic-Lipsum.utf8.txt -P convert_utf8

(Ideally, run the last command with privileged access to the performance counters.)

Like Intel, AMD has its own compiler. I did not have access to the Intel compiler for my tests, but Velu has the AMD compiler.

A sensible reference point is the iconv function, as it is provided by the runtime library. The AMD processor is running much faster than the Intel core (5.4 GHz vs. 3.4 GHz). We use GCC 12.

transcoder   Intel Ice Lake (GCC)   AMD Zen 4 (GCC)   AMD Zen 4 (AMD compiler)
iconv        0.70 GB/s              0.97 GB/s         0.98 GB/s
simdutf      7.8 GB/s               11 GB/s           12 GB/s

At a glance, the Zen 4 processor is slightly less efficient on a per-cycle basis when running the simdutf AVX-512 code (2.8 instructions/cycle for AMD versus 3.1 instructions/cycle for Intel) but keep in mind that we did not have access to a Zen 4 processor when tuning our code. The efficiency difference is small enough that we can consider the processors roughly on par pending further investigations.

The big difference is that the AMD Zen 4 runs at a much higher frequency. If I rely on Wikipedia, I do not think that there is an Ice Lake processor that can match 5.4 GHz. However, some Rocket Lake processors come close.

In our benchmarks, we track the CPU frequency, and we measure the same frequency when running AVX-512 code as when running conventional code (iconv). Thus AVX-512 can be really advantageous.

These results suggest that AMD Zen 4 is matching Intel Ice Lake in AVX-512 performance. Given that the Zen 4 microarchitecture is the first AMD attempt at supporting AVX-512 commercially, it is a remarkable feat.

Further reading: AMD Zen 4 performance while parsing JSON (phoronix.com).

Note: Raw AMD results are available: GCC 12 and AOCC.

Credit: Velu Erwan got the processor from AMD France. The exact specification is AMD 7950X, 2x16GB DDR5 4800MT reconfig as 5600MT. The UTF-8 to UTF-16 transcoder is largely based on the work of Robert Clausecker.

Emojis in domain names, punycode and performance

Most domain names are encoded using ASCII (e.g., yahoo.com). However, you can register domain names with almost any character in them. For example, there is a web site at 💩.la, called poopla. Yet the underlying infrastructure is basically pure ASCII. To make it work, the text of your domain is first translated into ASCII using a special encoding called ‘punycode’. The poopla web site is actually at https://xn--ls8h.la/.

Punycode is a tricky format. Thankfully, domain names are made of labels (e.g., in microsoft.com, microsoft is a label) and each label can use at most 63 bytes. In total, a domain name cannot exceed 255 bytes, and that is after encoding it to punycode if necessary.
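These length limits are easy to check once the name is in ASCII form. A minimal sketch (a hypothetical helper, ignoring the special case of a trailing root dot):

bool valid_domain_lengths(std::string_view domain) {
  if (domain.size() > 255) { return false; } // total length limit
  size_t label_start = 0;
  for (;;) {
    size_t dot = domain.find('.', label_start);
    size_t label_end = (dot == std::string_view::npos) ? domain.size() : dot;
    size_t label_size = label_end - label_start;
    if (label_size == 0 || label_size > 63) { return false; } // per-label limit
    if (dot == std::string_view::npos) { return true; }
    label_start = dot + 1;
  }
}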

Some time ago, Colm MacCárthaigh suggested that I look at the performance impact of punycode. To answer the question, I quickly implemented a function that does the job. It is a single function without much fanfare. It is roughly derived from the code in the standard, but it looks simpler to me. Importantly, I do not claim that my implementation is particularly fast.

As a dataset, I use all of the examples from the wikipedia entry of punycode.

GCC 11, Intel Ice Lake 75 ns/string
Apple M2, LLVM 14 27 ns/string

The punycode format is relatively complicated and it was not designed for high speed, so there are limits to how fast one can go. Nevertheless, it should be possible to go faster. Yet it is probably ‘fast enough’: it is unlikely that punycode encoding would ever be a performance bottleneck, if only because domain names are almost always pure ASCII and punycode is then unneeded.

Update: We wrote our own library to process international domain names according to the standard: it is called idna and is part of the ada-url project. It turns out that punycode encoding and decoding is a small task within a much larger and more complex set of problems. It is indeed probably not a performance bottleneck by itself.