Optimizing compilers deduplicate strings and arrays

When programming, it can be wasteful to store the same constant data again and again: you use more memory and you access more data. Thankfully, your optimizing compiler may help.

Consider the following two lines of C code:

    printf("Good day professor Jones");
    printf("Good day professor Jane");

There is redundancy since the prefix “Good day professor” is the same in both cases. To my knowledge, no compiler will trim this redundancy for you. However, you can get the desired trimming by breaking up the strings:

    printf("Good day professor ");
    printf("Jones");

    printf("Good day professor ");
    printf("Jane");

Most compilers will recognize the constant string and store it once in the program. It works even if the constant string “Good day professor” appears in different functions.

Thus the following function may return true:

    const char * str1 = "dear friend";
    const char * str2 = "dear friend";
    return str1 == str2;

That is, you do not need to manually share identical constant strings: the compiler typically recognizes the redundancy for you.

The same trick fails with extended strings:

    const char * str1 = "dear friend";
    const char * str2 = "dear friend\0f";
    return str1 == str2;

All the compilers I tried return false: they create two distinct C strings even though one is a prefix of the other. The same duplication happens in the following example…

char get1(int k) {
    const char * str = "dear friend";
    return str[k];
}

char get2(int k) {
    const char * str = "dear friend\0f";
    return str[k];
}

Unsurprisingly, the “data compression” trick works with arrays. For example, the arrays in these two functions are likely to be compiled to just one array because the compiler recognizes that they are identical:

int f(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321};
    return array[k];
}


int g(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321};
    return array[k+1];
}

With GCC, it may still work when one array is a prefix of the other, as in this example:

int f(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321};
    return array[k];
}


int g(int k) {
    int array[] = {1,2,3,4,5,34432,321323,321321,1,
    2,3,4,5,34432,321323,321321,1,4};
    return array[k+1];
}

You can also pile up several arrays as in the following case where GCC creates just one array:

long long get1(int k) {
    long long str[] = {1,2,3};
    return str[k];
}

long long get2(int k) {
    long long str[] = {1,2,3,4};
    return str[k+1];
}

long long get3(int k) {
    long long str[] = {1,2,3,4,5,6};
    return str[k+1];
}

long long get4(int k) {
    long long str[] = {1,2,3,4,5,6,7,8};
    return str[k+1];
}

It also works with arrays of pointers, as in the following case:

const char * get1(int k) {
    const char * str[] = {"dear friend", "dear sister", "dear brother"};
    return str[k];
}

const char * get2(int k) {
    const char * str[] = {"dear friend", "dear sister", "dear brother"};
    return str[k+1];
}

Of course, if you want to make sure to keep your code thin and efficient, you should not blindly rely on the compiler. Nevertheless, it is warranted to be slightly optimistic.

Science and Technology links (September 16 2022)

Escaping strings faster with AVX-512

When programming, we often have to ‘escape’ strings. A standard way to do it is to insert the backslash character (\) before some characters such as the double quote. For example, the string

my title is "La vie"

becomes

my title is \"La vie\"

A simple routine in C++ to escape a string might look as follows:

  for (...) {
    if ((*in == '\\') || (*in == '"')) {
      *out++ = '\\';
    }
    *out++ = *in;
  }

Such a character-by-character approach is unlikely to provide the best possible performance on modern hardware.

Recent Intel processors have fast instructions (AVX-512) that are well suited for such problems. I decided to sketch a solution using Intel intrinsic functions. The routine goes as follows:

  1. I use two constant registers containing 64 copies of the backslash character and 64 copies of the quote characters.
  2. I start a loop by loading 32 bytes from the input.
  3. I expand these 32 bytes into a 64-byte register, interleaving zero bytes.
  4. I compare these bytes with the quotes and backslash characters.
  5. From the resulting mask, I then construct (by shifting and blending) escaped characters.
  6. I ‘compress’ the result, removing the zero bytes that appear before the unescaped characters.
  7. I advance the output pointer by the number of written bytes and I continue the loop.

The C++ code roughly looks like this…

  __m512i solidus = _mm512_set1_epi8('\\');
  __m512i quote = _mm512_set1_epi8('"');
  for (; in + 32 <= finalin; in += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)in); // load 32 input bytes
    __m512i input1 = _mm512_cvtepu8_epi16(input);
    __mmask64 is_solidus = _mm512_cmpeq_epi8_mask(input1, solidus);
    __mmask64 is_quote = _mm512_cmpeq_epi8_mask(input1, quote);
    __mmask64 is_quote_or_solidus = _kor_mask64(is_solidus, is_quote);
    __mmask64 to_keep = _kor_mask64(is_quote_or_solidus, 0xaaaaaaaaaaaaaaaa);
    __m512i shifted_input1 = _mm512_bslli_epi128(input1, 1);
    __m512i escaped =
        _mm512_mask_blend_epi8(is_quote_or_solidus, shifted_input1, solidus);
    _mm512_mask_compressstoreu_epi8(out, to_keep, escaped);
    out += _mm_popcnt_u64(_cvtmask64_u64(to_keep));
  }

This code can be greatly improved. Nevertheless, it is a good first step. What are the results on an Intel Ice Lake processor using GCC 11 (Linux)? A simple benchmark indicates a 5x performance boost compared to a naive implementation:

regular code 0.6 ns/character
AVX-512 code 0.1 ns/character

It looks quite encouraging! My source code is available. The code requires a recent x64 processor with AVX-512 VBMI2 support.

Science and Technology links (September 12 2022)

  1. A standard dataset in artificial-intelligence research has ten percent of its images mislabeled. Yet state-of-the-art algorithms achieve better-than-90% classification on the same dataset. (Credit: Leo Boytsov)
  2. Despite the reeducation camps and the massive trauma, families that did well before the Chinese socialist revolution are still doing well. In other words, it is difficult and maybe impossible to do away with the competitiveness of some families.
  3. Exercise appears to enhance the hippocampus (in mice) so that exercise might enhance spatial learning.
  4. Lead is a toxic substance that might have accounted for a significant fraction of violent crimes. The more careful use of lead explains in part the reduction in violent crimes over time.
  5. Watching a video at 2x the speed can double your learning speed.
  6. A man who received a heart transplant from a pig died two months later.
  7. Mental speed does not decrease with age as we expected: no mental speed decline is observed before the age of 60.
  8. Marmots appear not to age while they are hibernating.
  9. It is often stated that twins raised apart have a striking similarity in intelligence. And twins raised apart are similar in many ways (e.g., personality), but their intelligence might be less closely related than we thought.
  10. The widespread availability of guns makes lynching less likely.
  11. Vitamin D supplements do not appear to protect the elderly against fractures.
  12. The largest flood in the history of California happened in 1862. The flood followed a 20-year-long drought and the valleys became seas for a short time. Such a flood appears to be a relatively common occurrence though there has been no comparable flood since then. Native Americans apparently knew of these floods:

    Native Americans knew that the Sacramento Valley could become an inland sea when the rains came. Their storytellers described water filling the valley from the Coast Range to the Sierra.

Catching sanitizer errors programmatically

The C and C++ languages offer little protection against programmer errors. Errors do not always show up where you expect. You can silently corrupt the content of your memory. It can make bugs difficult to track. To solve this problem, I am a big fan of programming in C and C++ using sanitizers. They slow down your program, but they check, for example, that memory accesses are safe.

Thus if you iterate over an array and access elements that are out of bounds, a memory sanitizer will immediately catch the error:

  int array[8];
  for(int k = 0;; k++) {
    array[k] = 0;
  }

The sanitizer reports the error, but what if you would like to catch the error and store it in some log? Thankfully, GCC and LLVM sanitizers call a function (__asan_on_error()) when an error is encountered, allowing us to log it. Of course, you need to record the state of your program. The following is an example where the state is recorded in a string.

#include <iostream>
#include <string>
#include <stdlib.h>

std::string message;

extern "C" {
void __asan_on_error() {
  std::cout << "You caused an error: " << message << std::endl;
}
}


int main() {
  int array[8];
  for(int k = 0;; k++) {
    message = std::string("access at ") + std::to_string(k);
    array[k] = 0;
  }
  return EXIT_SUCCESS;
}

The extern "C" expression makes sure that C++ does not mangle the function name.
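
For reference, the program must be built with the sanitizer enabled. With GCC or LLVM clang, a build command might look as follows (the file name is hypothetical):

g++ -fsanitize=address -g myprogram.cpp -o myprogram
./myprogram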

Running this program with the address sanitizer enabled will print the following:

You caused an error: access at 8

You could also write to a file, if you would like.

“Hello world” is slower in C++ than in C (Linux)

A simple C program might print ‘hello world’ on screen:

#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("hello world\n");
    return EXIT_SUCCESS;
}

You can write the equivalent in C++:

#include <iostream>
#include <stdlib.h>

int main() {
    std::cout << "hello world" << std::endl;
    return EXIT_SUCCESS;
}

In the recently released C++20 standard, we could use std::format instead or wrap the stream in a basic_osyncstream for thread safety, but the above code is what you’d find in most textbooks today.

How fast do these programs run? You may not care about the performance of these ‘hello world’ programs per se, but many systems rely on small C/C++ programs running specific and small tasks. Sometimes you just want to run a small program to execute a computation, process a small file and so forth.

We can check the running time using a benchmarking tool such as hyperfine. Such tools handle various factors such as shell starting time and so forth.
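
For instance, assuming the two binaries are called hello (C) and hellocpp (C++), the measurement is a one-liner:

hyperfine ./hello ./hellocpp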

I do not believe that printing ‘hello world’ itself should be slower or faster in C++ compared to C, at least not significantly. What we are testing by running these programs is the overhead due to the choice of programming language when launching the program. One might argue that in C++, you can use printf (the C function), and that’s correct. You can code in C++ as if you were in C all of the time. It is not unreasonable, but we are interested in the performance when relying on conventional/textbook C++ using the standard C++ library.

Under Linux, when using the GNU C++ standard library (libstdc++), we can ask that the C++ library be statically linked with the executable. The result is a much larger binary, but it may provide a faster starting time.
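
With GCC, the relevant flag is -static-libstdc++; a build command might look as follows (the file names are hypothetical):

g++ -O2 hellocpp.cpp -static-libstdc++ -o hellocpp_static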

Hyperfine tells me that the C++ program relying on the dynamically loaded C++ library takes almost 1 ms more time than the C program.

C 0.5 ms
C++ (dynamic) 1.4 ms
C++ (static) 0.8 ms

My source code and Makefile are available. I get these numbers on Ubuntu 22.04 LTS using an AWS node (Graviton 3).

If these numbers are to be believed, there may be a significant penalty due to textbook C++ code for tiny program executions, under Linux.

Half a millisecond or more of overhead, if it is indeed correct, is a huge penalty for a tiny program like ‘hello world’. And it only happens when I use dynamic loading of the C++ library: the cost is much less when using a statically linked C++ library.

It seems that loading the C++ library dynamically is adding a significant cost of up to 1 ms. We might check for a few additional confounding factors proposed by my readers.

  1. The C compiler might not call the printf function, and might call the simpler puts function instead: we can fool the compiler into calling printf with the syntax printf("hello %s\n", "world"): it makes no measurable difference in our tests.
  2. If we compile the C function using a C++ compiler, the problem disappears, as you would hope, and we match the speed of the C program.
  3. Replacing  "hello world" << std::endl; with "hello world\n"; does not seem to affect the performance in these experiments. The C++ program remains much slower.
  4. Adding std::ios_base::sync_with_stdio(false); before using std::cout also appears to make no difference. The C++ program remains much slower.
C (non trivial printf) 0.5 ms
C++ (using printf) 0.5 ms
C++ (std::cout replaced by \n) 1.4 ms
C++ (sync_with_stdio set to false) 1.4 ms

Thus we have every indication that dynamically loading the C++ standard library takes a lot of time, certainly hundreds of extra microseconds. It may be a one-time cost, but if your programs are small, it can dominate the running time. Statically linking the C++ library helps, but it also creates larger binaries. You may reduce the size overhead somewhat with appropriate link-time flags such as --gc-sections, but a significant overhead remains in my tests.

Note: This blog post has been edited to answer the multiple comments suggesting confounding factors, other than standard library loading, that the original blog post did not consider. I thank my readers for their proposals.

Appendix 1 We can measure the loading time precisely by preceding the execution of the program with LD_DEBUG=statistics (thanks to Grégory Pakosz for the hint). The C++ code requires more cycles. If we use LD_DEBUG=all (e.g., LD_DEBUG=all ./hellocpp), then we observe that the C++ version does much more work (more version checks, more relocations, many more initializations and finalizers). In the comments, Sam Mason blames dynamic linking: on his machine he gets the following result…

C code that dynamically links to libc takes ~240µs, which goes down to ~150µs when statically linked. A fully dynamic C++ build takes ~800µs, while a fully static C++ build is only ~190µs.

Appendix 2 We can try to use sampling-based profiling to find out where the programs spend their time. Calling perf record/perf report is not terribly useful on my system. Some readers report that their profiling points the finger at locale initialization in this manner. I get a much more useful profile with valgrind --tool=callgrind command && callgrind_annotate. The results are consistent with the theory that loading the C++ library dynamically is relatively expensive.

Science and Technology links (August 7 2022)

  1. Increases in computing performance explain up to 94% of the performance improvements in fields such as weather prediction, protein folding, and oil exploration: information technology is a driver of long-term performance improvement across society. If we stop improving our computing, the consequences could be dire.
  2. The coral cover of the Great Barrier Reef has reached its highest level since the Australian Institute of Marine Science (AIMS) began monitoring 36 years ago.
  3. The leading Alzheimer’s theory, the amyloid hypothesis, which motivated years of research, was built in part on fraud. The author of the fraud is professor Sylvain Lesné. His team appeared to have composed figures by piecing together parts of photos from different experiments. Effectively, they “photoshopped” their scientific papers. Not just one or two images, but at least 70. Note that all clinical trials of drugs developed (at high cost) on the amyloid hypothesis have failed. The total cost is in the billions of dollars. Meanwhile, competing theories and therapies have been sidelined by the compelling amyloid hypothesis. It will be interesting to watch what kind of penalty, if any, Lesné receives for his actions. You may read the testimonies of researchers whose work was made difficult because they did not believe in the amyloid hypothesis. E.g., The maddening saga of how an Alzheimer’s ‘cabal’ thwarted progress toward a cure for decades (from 2019) or the story of Rachael Neve. The net result? No progress despite massive funding:

    In the 30 (now 35) years that biomedical researchers have worked determinedly to find a cure for Alzheimer’s disease, their counterparts have developed drugs that helped cut deaths from cardiovascular disease by more than half, and cancer drugs able to eliminate tumors that had been incurable. But for Alzheimer’s, not only is there no cure, there is not even a disease-slowing treatment.

    We have two decades of Alzheimer’s research based partly on deliberate fraud that may have cost millions of lives.

  4. For years, we have been told that depression was caused by a chemical imbalance in the brain, specifically low serotonin levels. Antidepressants were initially proposed to work by rectifying the serotonin abnormality. A new paper published by Nature shows not only that depressed people do not have lower serotonin levels but, moreover, long-term antidepressant use may lead to lower serotonin levels.
  5. Science has been ‘covided’:

    Across science, 98 of the 100 most-cited papers published in 2020 to 2021 were related to COVID-19. A large number of scientists received large numbers of citations to their COVID-19 work, often exceeding the citations they had received to all their work during their entire career.

  6. The Earth is getting greener due to increased CO2 in the atmosphere, and this is predicted to lead to substantial cooling, according to an article in Nature.
  7. The Sahara was green from 14,000 to 5,000 years ago. Some researchers believe that it became a desert due to the sudden cooling of the Atlantic Ocean.

Comparing strtod with from_chars (GCC 12)

A reader (Richard Ebeling) invited me to revisit an older blog post: Parsing floats in C++: benchmarking strtod vs. from_chars. Back then I reported that switching from strtod to from_chars in C++ to parse numbers could lead to a speed increase (by 20%). The code is much the same, we go from…

#include <stdlib.h> // strtod

const char * string = "3.1416";
char * string_end = NULL;
double x = strtod(string, &string_end);
if(string_end == string) { 
  //you have an error!
}

… to something more modern in C++17…

#include <charconv> // std::from_chars (C++17)
#include <string>

std::string st = "3.1416";
double x; 
auto [p, ec] = std::from_chars(st.data(), st.data() + st.size(), x);
if (p == st.data()) {
      //you have errors!
}

Back when I first reported on this result, only Visual Studio had support for from_chars. The C++ library in GCC 12 now has full support for from_chars. Let us run the benchmark again:

strtod 270 MB/s
from_chars 1 GB/s

So it is almost four times faster! The benchmark reads random values in the [0,1] interval.

Internally, GCC 12 adopted the fast_float library.

Further reading: Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021.

Round a direction vector to an 8-way compass

Modern game controllers can point in a wide range of directions. Game designers sometimes want to convert the joystick direction to get 8-directional movement. A typical solution is to compute the angle, round it to the nearest multiple of 45 degrees, and then compute back the direction vector.

  double angle = atan2(y, x); // angle in (-PI, PI]
  angle = (int(round(4 * angle / PI + 8)) % 8) * PI / 4; // nearest multiple of PI/4
  xout = cos(angle);
  yout = sin(angle);

If you assume that the unit direction vector is in the first quadrant (both x and y are positive), then there is a direct way to compute the solution. Using 1/sqrt(2) or 0.7071 as the default solution, compare both x and y with sin(3*pi/8), and only switch them to 1 if they are larger than sin(3*pi/8) or to 0 if the other coordinate is larger than sin(3*pi/8). The full code looks as follows:

  double outx = 0.7071067811865475; // 1/sqrt(2)
  double outy = 0.7071067811865475; // 1/sqrt(2)
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outx = 1;
  }
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outy = 1;
  }
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outx = 0;
  }
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outy = 0;
  }

I write tiny if clauses because I hope that the compiler will avoid producing conditional jumps, which may stress the branch predictor when the branches are hard to predict.
You can generalize the solution for the case where either x or y (or both) are negative by first taking the absolute value, and then restoring the sign at the end:

  bool xneg = x < 0;
  bool yneg = y < 0;
  if (xneg) {
    x = -x;
  }
  if (yneg) {
    y = -y;
  }
  double outx = 0.7071067811865475; // 1/sqrt(2)
  double outy = 0.7071067811865475;// 1/sqrt(2)
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outx = 1;
  }
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outy = 1;
  }
  if (y >= 0.923879532511286) { // sin(3*pi/8)
    outx = 0;
  }
  if (x >= 0.923879532511286) { // sin(3*pi/8)
    outy = 0;
  }
  if (xneg) {
    outx = -outx;
  }
  if (yneg) {
    outy = -outy;
  }

You can rewrite everything with the ternary operator to entice the compiler to produce branchless code (i.e., code without jumps). The result is more compact.

  bool xneg = x < 0;
  x = xneg ? -x : x;
  bool yneg = y < 0;
  y = yneg ? -y : y;
  double outx = (x >= 0.923879532511286) ? 1 
    : 0.7071067811865475;
  double outy = (y >= 0.923879532511286) ? 1 
    : 0.7071067811865475;
  outx = (y >= 0.923879532511286) ? 0 : outx;
  outy = (x >= 0.923879532511286) ? 0 : outy;
  outx = xneg ? -outx : outx;
  outy = yneg ? -outy : outy;

The clang compiler may produce an entirely branchless assembly given this code.

But as pointed out by Samuel Lee in the comments, you can do even better… Instead of capturing the sign with a separate variable, you can just copy the pre-existing sign using a function like copysign (available in C, C#, Java and so forth):

  double posx = fabs(x);
  double posy = fabs(y);
  double outx = (posx >= 0.923879532511286) ? 1 
    : 0.7071067811865475;
  double outy = (posy >= 0.923879532511286) ? 1 
    : 0.7071067811865475;
  outx = (posy >= 0.923879532511286) ? 0 : outx;
  outy = (posx >= 0.923879532511286) ? 0 : outy;
  outx = copysign(outx, x);
  outy = copysign(outy, y);

I wrote a small benchmark that operates on random inputs. Your results will vary, but on my mac laptop with LLVM 12, I find that the direct approach with copysign is about 50 times faster than the approach based on atan2/sin/cos.

with tangent 40 ns/vector
fast approach 1.2 ns/vector
fast approach/copysign 0.8 ns/vector

Science and Technology links (July 23 2022)

    1. Compared to 1800, we eat less saturated fat and much more processed food and vegetable oils, and it does not seem to be good for us:

      Saturated fats from animal sources declined while polyunsaturated fats from vegetable oils rose. Non-communicable diseases (NCDs) rose over the twentieth century in parallel with increased consumption of processed foods, including sugar, refined flour and rice, and vegetable oils. Saturated fats from animal sources were inversely correlated with the prevalence of non-communicable diseases.

      Kang et al. found that saturated fats reduce your risk of having a stroke:

      a higher consumption of dietary saturated fat is associated with a lower risk of stroke, and every 10 g/day increase in saturated fat intake is associated with a 6% relative risk reduction in the rate of stroke.

      Saturated fats come from meat and dairy products (e.g., butter). A low-fat diet can significantly increase the risk of coronary heart disease events.
      Leroy and Cofnas argue against a reduction of red meat consumption:

      The IARC’s (2015) claim that red meat is “probably carcinogenic” has never been substantiated. In fact, a risk assessment by Kruger and Zhou (2018) concluded that this is not the case. (…) a meta-analysis of RCTs has shown that meat eating does not lead to deterioration of cardiovascular risk markers (O’Connor et al., 2017). The highest category of meat eating even paralleled a potentially beneficial increase in HDL-C level. Whereas plant-based diets indeed seem to lower total cholesterol and LDL-C in intervention studies, they also increase triglyceride levels and decrease HDL-C (Yokoyama et al., 2017), which are now often regarded as superior markers of cardiovascular risk (Jeppesen et al., 2001). (…) We believe that a large reduction in meat consumption, such as has been advocated by the EAT-Lancet Commission (Willett et al., 2019), could produce serious harm. Meat has long been, and continues to be, a primary source of high-quality nutrition. The theory that it can be replaced with legumes and supplements is mere speculation. While diets high in meat have proved successful over the long history of our species, the benefits of vegetarian diets are far from being established, and its dangers have been largely ignored by those who have endorsed it prematurely on the basis of questionable evidence.

    2. People dislike research that appears to favour males:

      In both studies, both sexes reacted less positively to differences favouring males; in contrast to our earlier research, however, the effect was larger among female participants. Contrary to a widespread expectation, participants did not react less positively to research led by a female. Participants did react less positively, though, to research led by a male when the research reported a male-favouring difference in a highly valued trait. Participants judged male-favouring research to be lower in quality than female-favouring research, apparently in large part because they saw the former as more harmful.

    3. During the Jurassic era, atmospheric CO2 was very high, forests extended all the way to the North pole. Even so, there were freezing winters:

      Forests were present all the way to the Pangean North Pole and into the southern latitudes as far as land extended. Although there may have been other contributing factors, the leading hypothesis is that Earth was in a “greenhouse” state because of very high atmospheric PCO2 (partial pressure of CO2), the highest of the past 420 million years. Despite modeling results indicating freezing winter temperatures at high latitudes, empirical evidence for freezing has been lacking. Here, we provide empirical evidence showing that, despite extraordinary high PCO2, freezing winter temperatures did characterize high Pangean latitudes based on stratigraphically widespread lake ice-rafted debris (L-IRD) in early Mesozoic strata of the Junggar Basin, northwest China. Traditionally, dinosaurs have been viewed as thriving in the warm and equable early Mesozoic climates, but our results indicate that they also endured freezing winters.

    4. In mice, researchers found that injecting stem-cell-derived “conditioned medium” protects against neurodegeneration:

      Neuronal cell death is causal in many neurodegenerative diseases, including age-related loss of memory and dementias (such as Alzheimer’s disease), Parkinson’s disease, strokes, as well as diseases that afflict broad ages, i.e., traumatic brain injury, spinal cord injury, ALS, and spinal muscle atrophy. These diseases are characterized by neuroinflammation and oxidative cell damage, many involve perturbed proteostasis and all are devastating and without a cure. Our work describes a feasible meaningful disease-minimizing treatment for ALS and suggests a clinical capacity for treating a broad class of diseases of neurodegeneration, and excessive cell apoptosis.

    5. Greenland and the North of Europe were once warmer than they are today: the Medieval Warm Period (950 to 1250) overlaps with the Viking age (800–1300). Bajard et al. (2022) suggest that the Vikings were quite adept at adapting their agricultural practices:

      (…) the period from The Viking Age to the High Middle Ages was a period of expansion with the Viking diaspora, increasing trade, food and goods production and the establishment of Scandinavian towns. This period also sees a rapid increase in population and settlements, mainly due to a relatively stable warm climate (…) temperature was the main driver of agricultural practices in Southeastern Norway during the Late Antiquity. Direct comparison between the reconstructed temperature variability and palynological data from the same sediment sequence shows that small changes in temperature were synchronous with changes in agricultural practices (…) We conclude that the pre-Viking age society in Southwestern Scandinavia made substantial changes in their way of living to adapt to the climate variability of this period.

      The Vikings grew barley in Greenland, a plant that grows normally in a temperate climate. In contrast, agriculture in Greenland today is nearly non-existent due to the harsh climate.

    6. Ashkenazi Jews often score exceptionally well on intelligence tests, and they achieve extraordinary results in several intellectual pursuits.
      Nevertheless, Wikipedia editors deleted the article on Ashkenazi intelligence. Tezuka argues that it is the result of an ideological bias that results in systematic censorship.
    7. You can rejuvenate old human skin by grafting it onto young mice.
    8. Tabarrok reminds us that research funding through competitions might result in total waste through rent dissipation…

      A scientist who benefits from a 2-million-dollar NIH grant is willing to spend a million dollars of their time working on applications or incur the cost of restricting their research ideas in order to get it. Importantly, even though only one scientist will get the grant, hundreds of scientists are spending resources in competition to get it. So the gains we might be seeing from transferring resources to one researcher are dissipated multiplicatively across all the scientists who spent time and money competing for the grant but didn’t get it. The aggregate time costs to our brightest minds from this application contest system are quantifiably large, possibly entirely offsetting the total scientific value of the research that the funding supports.

    9. Corporate tax cuts lead to an increase in productivity and research over the long term. Conversely, increases in taxation reduce long-term productivity as well as research and development.

Negative incentives in academic research

In the first half of the XXth century, there were relatively few scientists, and these scientists were generally not lavishly funded. Yet it has been convincingly argued that these scientists were massively more productive. We face a major replication crisis where important results in fields such as psychology and medicine cannot be reproduced independently. Academic researchers routinely fail to even report the results of clinical trials. We have entered a ‘dark age’ where we are mostly stagnant. It is not that there is no progress per se, but progress is slow, uncommon and expensive.

Why might that be? I believe that it has to do with important ‘negative incentives’ that we have introduced. In effect, we have made scientists less productive. We probably did so through several means, but two effects are probably important: the widespread introduction of research competitions and the addition of extrinsic motivations.

  1.  Prior to 1960, there were hardly any formal research funding competitions. Today, by some estimates, it takes about 40 working days to prepare a single new grant application, with about 30 working days for a resubmission, and the success rate is often low, which means that for a single successful research grant, hundreds of days might have been spent purely on the acquisition of funding. This effect is known in economics as rent dissipation. Suppose that I offer to give you $100 to support your research if you enter a competition. How much time are you willing to spend? Maybe you are willing to spend the equivalent of $50, if the success rate is 50%. The net result is that two researchers may each waste $50 in time so that one of them acquires $100 of support. There may be no net gain! Furthermore, if grant applications are valued enough (e.g., needed to get promoted), scientists may be willing to spend even more time than is rational to do so, and the introduction of a new grant competition may in fact reduce the overall research output. You should not underestimate the effect that constant administrative work and grant writing might have on a researcher: many graduate students will tell you of their disappointment when encountering high-status scientists who cannot seem to do actual research anymore. It can cause a vicious form of accelerated aging. If Albert Einstein had been stuck writing grant applications and reporting on the results from his team, history might have taken a different turn.
  2. We have massively increased the number and importance of ‘extrinsic motivations’ in science. Broadly speaking, we can distinguish between two types of motivations… intrinsic and extrinsic motivations. Winning prizes or securing prestigious positions are extrinsic motivations. Solving an annoying problem or pursuing a personal quest are intrinsic motivations. We repeatedly find that intrinsic motivations are positively correlated with long-term productivity whereas extrinsic motivations are negatively correlated with long-term productivity (e.g., Horodnic and Zaiţh 2015). In fact, extrinsic motivations even cancel out intrinsic motivations (Wrzesniewski et al., 2014). Extrinsically motivated individuals will focus on superficial gains, as opposed to genuine advances. Of course, the addition of extrinsic motivations may also create a selection effect: the field tends to recruit people who seek prestige for its own sake, as opposed to having a genuine interest in scientific pursuits. Thus creating prestigious prizes, prestigious positions, and prestigious conferences, may end up being detrimental to scientific productivity.

Many people object that the easiest explanation for our stagnation is that most of the easy discoveries have already been made. However, at all times in history, there were people making this point. In 1894, Michelson said:

While it is never safe to affirm that the future of Physical Science has no marvels in store even more astonishing than those of the past, it seems probable that most of the grand underlying principles have been firmly established and that further advances are to be sought chiefly in the rigorous application of these principles to all the phenomena which come under our notice.

Silverstein, in “The End is Near!”: The Phenomenon of the Declaration of Closure in a Discipline, documents carefully how, historically, many people predicted (wrongly) that their discipline was at an end.

How quickly can you convert floats to doubles (and back)?

Many programming languages have two binary floating-point types: float (32-bit) and double (64-bit). It reflects the fact that most general-purpose processors support both data types natively.

Often we need to convert between the two types. Both ARM and x64 processors can do it with one inexpensive instruction. For example, ARM systems may use the fcvt instruction.

The details may differ, but most current processors can convert one number (from float to double, or from double to float) per CPU cycle. The latency is small (e.g., 3 or 4 cycles).

A typical processor might run at 3 GHz, meaning 3 billion cycles per second, so we can convert 3 billion numbers per second. A 64-bit number uses 8 bytes, so that is a throughput of 24 gigabytes per second.

It is therefore unlikely that the type conversion can be a performance bottleneck, in general. If you would like to measure the speed on your own system: I have written a small C++ benchmark.
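
The conversion itself is just a cast; a minimal sketch of the kind of loop one might time (an illustration, not the actual benchmark) looks like this:

#include <cstddef>

// Convert a buffer of floats to doubles. Compilers typically emit one
// conversion instruction per element (e.g., fcvt on ARM, cvtss2sd on x64),
// or a vectorized equivalent.
void widen(const float *in, double *out, size_t n) {
  for (size_t i = 0; i < n; i++) {
    out[i] = static_cast<double>(in[i]);
  }
}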

Filtering numbers faster with SVE on Graviton 3 processors

Processors come, roughly, in two large families: x64 processors from Intel and AMD, and ARM processors from Apple, Samsung, and many other vendors. For a long time, ARM processors occupied mostly the market of embedded processors (the computer running your fridge at home) with the ‘big processors’ being almost exclusively the domain of x64 processors.

Reportedly, the Apple CEO (Steve Jobs) went to see Intel back when Apple was designing the iPhone to ask for a processor deal. Intel turned Apple down. So Apple went with ARM.

Today, we use ARM processors for everything: game consoles (Nintendo Switch), powerful servers (Amazon and Google), mobile phones, embedded devices, and so forth.

Amazon makes available its new ARM-based processors (Graviton 3). These processors have sophisticated SIMD instructions (SIMD stands for Single Instruction Multiple Data) called SVE (Scalable Vector Extensions). With these instructions, we can greatly accelerate software. It is a form of single-core parallelism, as opposed to the parallelism that one gets by using multiple cores for one task. The SIMD parallelism, when it is applicable, is often far more efficient than multicore parallelism.

Amazon’s Graviton 3 appears to have 32-byte registers, since it is based on the ARM Neoverse V1 design. You can fit eight 32-bit integers in one register. Mainstream ARM processors (e.g., Apple’s) have SIMD instructions too (NEON), but with shorter registers (16 bytes). Having wider registers, and instructions capable of operating over these wide registers, allows you to reduce the total number of instructions. Executing fewer instructions is a very good way to accelerate code.

To investigate SVE, I looked at a simple problem where you want to remove all negative integers from an array. That is, you read an array containing signed random integers and you want to write out to an output array only the integers that are not negative. Normal C code might look as follows:

void remove_negatives_scalar(const int32_t *input, 
        int64_t count, int32_t *output) {
  int64_t i = 0;
  int64_t j = 0;
  for(; i < count; i++) {
    if(input[i] >= 0) {
      output[j++] = input[i];
    }
  }
}

 

Replacing this code with new code that relies on special SVE functions made it go much faster (2.5 times faster). At the time, I suggested that my code was probably not nearly optimal. It processed 32 bytes per loop iteration, using 9 instructions. A sizeable fraction of these 9 instructions have to do with managing the loop, and few do the actual number crunching.  A reader, Samuel Lee, proposed to effectively unroll my loop. He predicted much better performance (at least when the array is large enough) due to lower loop overhead. I include his proposed code below.

Using a graviton 3 processor and GCC 11 on my benchmark, I get the following results:

cycles/int instr./int instr./cycle
scalar 9.0 6.000 0.7
branchless scalar 1.8 8.000 4.4
SVE 0.7 1.125 ~1.6
unrolled SVE 0.4385 0.71962 ~1.6

The new unrolled SVE code uses about 23 instructions to process 128 bytes (or 32 32-bit integers), hence about 0.71875 instructions per integer. That’s about 10 times fewer instructions than the branchless scalar code and roughly 4 times faster in terms of CPU cycles.

The number of instructions retired per cycle is about the same for the two SVE functions, and it is relatively low, somewhat higher than 1.5 instructions retired per cycle.

Often the argument in favour of SVE is that it does not require special code to finish the tail of the processing. That is, you can process an entire array with SVE instructions, even if its length is not divisible by the register size (here 8 integers). I find Lee’s code interesting because it illustrates that you might actually need to handle the end of a long array differently, for efficiency reasons.

Overall, I think that we can see that SVE works well for the problem at hand (filtering out 32-bit integers).

Appendix: Samuel Lee’s code.

void remove_negatives(const int32_t *input, int64_t count, int32_t *output) {
  int64_t j = 0;
  const int32_t* endPtr = input + count;
  const uint64_t vl_u32 = svcntw();

  svbool_t all_mask = svptrue_b32();
  while(input <= endPtr - (4*vl_u32))
  {
      svint32_t in0 = svld1_s32(all_mask, input + 0*vl_u32);
      svint32_t in1 = svld1_s32(all_mask, input + 1*vl_u32);
      svint32_t in2 = svld1_s32(all_mask, input + 2*vl_u32);
      svint32_t in3 = svld1_s32(all_mask, input + 3*vl_u32);

      svbool_t pos0 = svcmpge_n_s32(all_mask, in0, 0);
      svbool_t pos1 = svcmpge_n_s32(all_mask, in1, 0);
      svbool_t pos2 = svcmpge_n_s32(all_mask, in2, 0);
      svbool_t pos3 = svcmpge_n_s32(all_mask, in3, 0);

      in0 = svcompact_s32(pos0, in0);
      in1 = svcompact_s32(pos1, in1);
      in2 = svcompact_s32(pos2, in2);
      in3 = svcompact_s32(pos3, in3);

      svst1_s32(all_mask, output + j, in0);
      j += svcntp_b32(all_mask, pos0);
      svst1_s32(all_mask, output + j, in1);
      j += svcntp_b32(all_mask, pos1);
      svst1_s32(all_mask, output + j, in2);
      j += svcntp_b32(all_mask, pos2);
      svst1_s32(all_mask, output + j, in3);
      j += svcntp_b32(all_mask, pos3);

      input += 4*vl_u32;
  }

  int64_t i = 0;
  count = endPtr - input;

  svbool_t while_mask = svwhilelt_b32(i, count);
  do {
    svint32_t in = svld1_s32(while_mask, input + i);
    svbool_t positive = svcmpge_n_s32(while_mask, in, 0);
    svint32_t in_positive = svcompact_s32(positive, in);
    svst1_s32(while_mask, output + j, in_positive);
    i += svcntw();
    j += svcntp_b32(while_mask, positive);
    while_mask = svwhilelt_b32(i, count);
  } while (svptest_any(svptrue_b32(), while_mask));
}

Go generics are not bad

When programming, we often need to write ‘generic’ functions where the exact data type is not important. For example, you might want to write a simple function that sums up numbers.

Go lacked this notion until recently, but it was recently added (as of version 1.18). So I took it out for a spin.

In Java, generics work well enough as long as you need “generic” containers (arrays, maps), and as long as you stick with functional idioms. But Java will not let me code the way I would prefer. Here is how I would write a function that sums up numbers:

    int sum(int[] v) {
        int summer = 0;
        for(int k = 0; k < v.length; k++) {
            summer += v[k];
        }
        return summer;
    }

What if I need to support various number types? Then I would like to write the following generic function, but Java won’t let me.

    // this Java code won't compile
    static <T extends Number>  T sum(T[] v) {
        T summer = 0;
        for(int k = 0; k < v.length; k++) {
            summer += v[k];
        }
        return summer;
    }

Go is not object-oriented per se, so you do not have a ‘Number’ class. However, you can create your own generic ‘interfaces’, which serve the same purpose. So here is how you solve the same problem in Go:

type Number interface {
  uint | int | float32 | float64
}


func sum[T Number](a []T) T {
    var summer T
    for _, v := range a {
        summer += v
    }
    return summer
}

So, at least in this one instance, Go generics are more expressive than Java generics. What about performance?

If I apply the above code to an array of integers, I get the following tight loop in assembly:

pc11:
        MOVQ    (AX)(DX*8), SI
        INCQ    DX
        ADDQ    SI, CX
        CMPQ    BX, DX
        JGT     pc11

As far as Go is concerned, this is as efficient as it gets.

So far, I am giving an A to Go generics.

Looking at assembly code with gdb

Most of us write code using higher-level languages (Go, C++), but if you want to understand the code that matters to your processor, you need to look at the ‘assembly’ version of your code. Assembly is just a series of instructions.

At first, assembly code looks daunting, and I discourage you from writing sizeable programs in assembly. However, with a little training, you can learn to count instructions and spot branches. It can help you gain a deeper insight into how your program works. Let me illustrate what you can learn by looking at assembly. Let us consider the following C++ code:

long f(int x) {
    long array[] = {1,2,3,4,5,6,7,8,999,10};
    return array[x];
}

long f2(int x) {
    long array[] = {1,2,3,4,5,6,7,8,999,10};
    return array[x+1];
}

This code contains two 80-byte arrays, but they are identical. Is this a source of worry? If you look at the assembly code produced by most compilers, you will find that exactly identical constants are generally ‘compressed’ (just one version is stored). If I compile these two functions with GCC or clang using the -S flag, I can plainly see the compression because the array occurs just once. (Do not look at all the instructions… just scan the code.)

	.text
	.file	"f.cpp"
	.globl	_Z1fi                           // -- Begin function _Z1fi
	.p2align	2
	.type	_Z1fi,@function
_Z1fi:                                  // @_Z1fi
	.cfi_startproc
// %bb.0:
	adrp	x8, .L__const._Z2f2i.array
	add	x8, x8, :lo12:.L__const._Z2f2i.array
	ldr	x0, [x8, w0, sxtw #3]
	ret
.Lfunc_end0:
	.size	_Z1fi, .Lfunc_end0-_Z1fi
	.cfi_endproc
                                        // -- End function
	.globl	_Z2f2i                          // -- Begin function _Z2f2i
	.p2align	2
	.type	_Z2f2i,@function
_Z2f2i:                                 // @_Z2f2i
	.cfi_startproc
// %bb.0:
	adrp	x8, .L__const._Z2f2i.array
	add	x8, x8, :lo12:.L__const._Z2f2i.array
	add	x8, x8, w0, sxtw #3
	ldr	x0, [x8, #8]
	ret
.Lfunc_end1:
	.size	_Z2f2i, .Lfunc_end1-_Z2f2i
	.cfi_endproc
                                        // -- End function
	.type	.L__const._Z2f2i.array,@object  // @__const._Z2f2i.array
	.section	.rodata,"a",@progbits
	.p2align	3
.L__const._Z2f2i.array:
	.xword	1                               // 0x1
	.xword	2                               // 0x2
	.xword	3                               // 0x3
	.xword	4                               // 0x4
	.xword	5                               // 0x5
	.xword	6                               // 0x6
	.xword	7                               // 0x7
	.xword	8                               // 0x8
	.xword	999                             // 0x3e7
	.xword	10                              // 0xa
	.size	.L__const._Z2f2i.array, 80

	.ident	"Ubuntu clang version 14.0.0-1ubuntu1"
	.section	".note.GNU-stack","",@progbits
	.addrsig

However, if you modify the constants even slightly, then this compression typically does not happen (e.g., if you append one integer value to one of the arrays, the arrays will be duplicated in full).
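
For instance (a hypothetical variation on the functions above), appending a single value to one of the arrays should be enough to prevent the sharing:

long f3(int x) {
    long array[] = {1,2,3,4,5,6,7,8,999,10,11}; // one extra element
    return array[x+1];
}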

To assess the performance of a code routine, my first line of attack is always to count instructions. Keeping everything the same, if you can rewrite your code so that it generates fewer instructions, it should be faster. I also like to spot conditional jumps because that is often where your code can suffer, if the branch is hard to predict.

It is easy to convert a whole set of functions to assembly, but it becomes impractical as your projects grow larger. Under Linux, the standard ‘debugger’ (gdb) is a great tool to look selectively at the assembly code produced by the compiler. Let us consider my previous blog post, Filtering numbers quickly with SVE on Amazon Graviton 3 processors. In that blog post, I present several functions which I have implemented in a short C++ file. To examine the result, I simply load the compiled binary into gdb:

$ gdb ./filter

Then I can examine functions… such as the remove_negatives function:

(gdb) set print asm-demangle
(gdb) disas remove_negatives
Dump of assembler code for function remove_negatives(int const*, long, int*):
   0x00000000000022e4 <+0>:	mov	x4, #0x0                   	// #0
   0x00000000000022e8 <+4>:	mov	x3, #0x0                   	// #0
   0x00000000000022ec <+8>:	cntw	x6
   0x00000000000022f0 <+12>:	whilelt	p0.s, xzr, x1
   0x00000000000022f4 <+16>:	nop
   0x00000000000022f8 <+20>:	ld1w	{z0.s}, p0/z, [x0, x3, lsl #2]
   0x00000000000022fc <+24>:	cmpge	p1.s, p0/z, z0.s, #0
   0x0000000000002300 <+28>:	compact	z0.s, p1, z0.s
   0x0000000000002304 <+32>:	st1w	{z0.s}, p0, [x2, x4, lsl #2]
   0x0000000000002308 <+36>:	cntp	x5, p0, p1.s
   0x000000000000230c <+40>:	add	x3, x3, x6
   0x0000000000002310 <+44>:	add	x4, x4, x5
   0x0000000000002314 <+48>:	whilelt	p0.s, x3, x1
   0x0000000000002318 <+52>:	b.ne	0x22f8 <remove_negatives(int const*, long, int*)+20>  // b.any
   0x000000000000231c <+56>:	ret
End of assembler dump.

At address 52, we conditionally go back to address 20. So we have a total of 9 instructions in our main loop. From my benchmarks (see the previous blog post), the function uses 1.125 instructions per 32-bit word, which is consistent with each loop iteration processing 8 32-bit words (9/8 = 1.125).

Another way to assess performance is to look at branches. Let us disassemble remove_negatives_scalar, a branchy function:

(gdb) disas remove_negatives_scalar
Dump of assembler code for function remove_negatives_scalar(int const*, long, int*):
   0x0000000000002320 <+0>:	cmp	x1, #0x0
   0x0000000000002324 <+4>:	b.le	0x234c <remove_negatives_scalar(int const*, long, int*)+44>
   0x0000000000002328 <+8>:	add	x4, x0, x1, lsl #2
   0x000000000000232c <+12>:	mov	x3, #0x0                   	// #0
   0x0000000000002330 <+16>:	ldr	w1, [x0]
   0x0000000000002334 <+20>:	add	x0, x0, #0x4
   0x0000000000002338 <+24>:	tbnz	w1, #31, 0x2344 <remove_negatives_scalar(int const*, long, int*)+36>
   0x000000000000233c <+28>:	str	w1, [x2, x3, lsl #2]
   0x0000000000002340 <+32>:	add	x3, x3, #0x1
   0x0000000000002344 <+36>:	cmp	x4, x0
   0x0000000000002348 <+40>:	b.ne	0x2330 <remove_negatives_scalar(int const*, long, int*)+16>  // b.any
   0x000000000000234c <+44>:	ret
End of assembler dump.

We see the branch at address 24 (instruction tbnz), it conditionally jumps over the next two instructions. We had an equivalent ‘branchless’ function called remove_negatives_scalar_branchless. Let us see if it is indeed branchless:

(gdb) disas remove_negatives_scalar_branchless
Dump of assembler code for function remove_negatives_scalar_branchless(int const*, long, int*):
   0x0000000000002350 <+0>:	cmp	x1, #0x0
   0x0000000000002354 <+4>:	b.le	0x237c <remove_negatives_scalar_branchless(int const*, long, int*)+44>
   0x0000000000002358 <+8>:	add	x4, x0, x1, lsl #2
   0x000000000000235c <+12>:	mov	x3, #0x0                   	// #0
   0x0000000000002360 <+16>:	ldr	w1, [x0], #4
   0x0000000000002364 <+20>:	str	w1, [x2, x3, lsl #2]
   0x0000000000002368 <+24>:	eor	x1, x1, #0x80000000
   0x000000000000236c <+28>:	lsr	w1, w1, #31
   0x0000000000002370 <+32>:	add	x3, x3, x1
   0x0000000000002374 <+36>:	cmp	x0, x4
   0x0000000000002378 <+40>:	b.ne	0x2360 <remove_negatives_scalar_branchless(int const*, long, int*)+16>  // b.any
   0x000000000000237c <+44>:	ret
End of assembler dump.
(gdb)

Other than the conditional jump produced by the loop (address 40), the code is indeed branchless.

In this particular instance, with one small binary file, it is easy to find the functions I need. What if I load a large binary with many compiled functions?

Let me examine the benchmark binary from the simdutf library. It has many functions, but let us assume that I am looking for a function that might validate UTF-8 inputs. I can use info functions to find all functions matching a given pattern.

(gdb) info functions validate_utf8
All functions matching regular expression "validate_utf8":

Non-debugging symbols:
0x0000000000012710  event_aggregate simdutf::benchmarks::BenchmarkBase::count_events<simdutf::benchmarks::Benchmark::run_validate_utf8(simdutf::implementation const&, unsigned long)::{lambda()#1}>(simdutf::benchmarks::Benchmark::run_validate_utf8(simdutf::implementation const&, unsigned long)::{lambda()#1}, unsigned long) [clone .constprop.0]
0x0000000000012b54  simdutf::benchmarks::Benchmark::run_validate_utf8(simdutf::implementation const&, unsigned long)
0x0000000000018c90  simdutf::fallback::implementation::validate_utf8(char const*, unsigned long) const
0x000000000001b540  simdutf::arm64::implementation::validate_utf8(char const*, unsigned long) const
0x000000000001cd84  simdutf::validate_utf8(char const*, unsigned long)
0x000000000001d7c0  simdutf::internal::unsupported_implementation::validate_utf8(char const*, unsigned long) const
0x000000000001e090  simdutf::internal::detect_best_supported_implementation_on_first_use::validate_utf8(char const*, unsigned long) const

You see that the info functions gives me both the function name as well as the function address. I am interested in simdutf::arm64::implementation::validate_utf8. At that point, it becomes easier to just refer to the function by its address:

(gdb) disas 0x000000000001b540
Dump of assembler code for function simdutf::arm64::implementation::validate_utf8(char const*, unsigned long) const:
   0x000000000001b540 <+0>:	stp	x29, x30, [sp, #-144]!
   0x000000000001b544 <+4>:	adrp	x0, 0xa0000
   0x000000000001b548 <+8>:	cmp	x2, #0x40
   0x000000000001b54c <+12>:	mov	x29, sp
   0x000000000001b550 <+16>:	ldr	x0, [x0, #3880]
   0x000000000001b554 <+20>:	mov	x5, #0x40                  	// #64
   0x000000000001b558 <+24>:	movi	v22.4s, #0x0
   0x000000000001b55c <+28>:	csel	x5, x2, x5, cs  // cs = hs, nlast
   0x000000000001b560 <+32>:	ldr	x3, [x0]
   0x000000000001b564 <+36>:	str	x3, [sp, #136]
   0x000000000001b568 <+40>:	mov	x3, #0x0                   	// #0
   0x000000000001b56c <+44>:	subs	x5, x5, #0x40
   0x000000000001b570 <+48>:	b.eq	0x1b7b8 <simdutf::arm64::implementation::validate_utf8(char const*, unsigned long) const+632>  // b.none
   0x000000000001b574 <+52>:	adrp	x0, 0x86000
   0x000000000001b578 <+56>:	adrp	x4, 0x86000
   0x000000000001b57c <+60>:	add	x6, x0, #0x2f0
   0x000000000001b580 <+64>:	adrp	x0, 0x86000
...

I have cut short the output because it is too long. When single functions become large, I find it more convenient to redirect the output to a file which I can process elsewhere.

gdb -q ./benchmark -ex "set pagination off" -ex "set print asm-demangle" -ex "disas 0x000000000001b540" -ex quit > gdbasm.txt

Sometimes I am just interested in doing some basic statistics such as figuring out which instructions are used by the function:

$ gdb -q ./benchmark -ex "set pagination off" -ex "set print asm-demangle" -ex "disas 0x000000000001b540" -ex quit | awk '{print $3}' | sort |uniq -c | sort -r | head
     32 and
     24 tbl
     24 ext
     18 cmhi
     17 orr
     16 ushr
     16 eor
     14 ldr
     13 mov
     10 movi

And we see that the most common instruction in this code is and. It reassures me that the code was properly compiled. I can do some research on all the generated instructions and they all seem like adequate choices given the code that I produce.

The general lesson is that looking at the generated assembly is not so difficult and, with a little training, it can make you a better programmer.

Tip: It helps sometimes to disable pagination (set pagination off).

Filtering numbers quickly with SVE on Amazon Graviton 3 processors

I have had access to Amazon’s latest ARM processors (graviton 3) for a few weeks. To my knowledge, these are the first widely available processors supporting Scalable Vector Extension (SVE).

SVE is part of the Single Instruction/Multiple Data paradigm: a single instruction can operate on many values at once. Thus, for example, you may add N integers with N other integers using a single instruction.

What is unique about SVE is that you work with vectors of values, but without knowing specifically how long the vectors are. This is in contrast with conventional SIMD instructions (ARM NEON, x64 SSE, AVX) where the size of the vector is hardcoded. Not only do you write your code without knowing the size of the vector, but even the compiler may not know. This means that the same binary executable could work over different blocks (vectors) of data, depending on the processor. The benefit of this approach is that your code might get magically much more efficient on new processors.
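
For instance, the number of 32-bit lanes per vector is a run-time quantity. A minimal probe (an illustration, not part of my benchmark) might just print it; it must be compiled with SVE support enabled:

#include <arm_sve.h>
#include <cstdio>

int main() {
  // svcntw() returns how many 32-bit lanes fit in one SVE register
  // on the current processor.
  printf("32-bit lanes per SVE register: %llu\n",
         (unsigned long long)svcntw());
  return 0;
}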

It is a daring proposal. It is possible to write code that would work on one processor but fail on another processor, even though we have the same instruction set.

But is SVE on graviton 3 processors fast? To test it out, I wrote a small benchmark. Suppose you want to prune all of the negative integers out of an array. A textbook implementation might look as follows:

void remove_negatives_scalar(const int32_t *input, 
        int64_t count, int32_t *output) {
  int64_t i = 0;
  int64_t j = 0;
  for(; i < count; i++) {
    if(input[i] >= 0) {
      output[j++] = input[i];
    }
  }
}

However, the compiler will probably generate a branch and if your input has a random distribution, this could be inefficient code. To help matters, you may rewrite your code in a manner that is more likely to generate a branchless binary:

  for(; i < count; i++) {
    output[j] = input[i];
    j += (input[i] >= 0);
  }

Though it looks less efficient (because every input value is written out), such a branchless version is often faster in practice.

I ported this last implementation to SVE using ARM intrinsic functions. At each step, we load a vector of integers (svld1_s32), we compare them with zero (svcmpge_n_s32), we remove the negative values (svcompact_s32) and we store the result (svst1_s32). During most iterations, we have a full vector of integers. During the last iteration, some values will be missing, but we simply ignore them with the while_mask variable, which indicates which integer values are ‘active’. The whole sequence is done using SVE instructions: there is no need to process the end of the input separately, as would be needed with conventional SIMD instruction sets.

#include <arm_sve.h>
void remove_negatives(const int32_t *input, int64_t count, int32_t *output) {
  int64_t i = 0;
  int64_t j = 0;
  svbool_t while_mask = svwhilelt_b32(i, count);
  do {
    svint32_t in = svld1_s32(while_mask, input + i);
    svbool_t positive = svcmpge_n_s32(while_mask, in, 0);
    svint32_t in_positive = svcompact_s32(positive, in);
    svst1_s32(while_mask, output + j, in_positive);
    i += svcntw();
    j += svcntp_b32(while_mask, positive);
    while_mask = svwhilelt_b32(i, count);
  } while (svptest_any(svptrue_b32(), while_mask));
}

Using a graviton 3 processor and GCC 11 on my benchmark, I get the following results:

                   cycles/integer   instructions/integer   instructions/cycle
scalar             9.0              6.000                  0.7
branchless scalar  1.8              8.000                  4.4
SVE                0.7              1.125                  1.6

The SVE code uses far fewer instructions. In this particular test, SVE is 2.5 times faster than the best competitor (branchless scalar). Furthermore, it might use even fewer instructions on future processors, as the underlying registers get wider.

Of course, my code is surely suboptimal, but I am pleased that the first SVE benchmark I wrote turns out so well. It suggests that SVE might do well in practice.

Credit: Thanks to Robert Clausecker for the related discussion.

Memory-level parallelism: Intel Ice Lake versus Amazon Graviton 3

One of the most expensive operations in a processor and memory system is a random memory access. If you try to read a value in memory, it can take tens of nanoseconds on average, or more. If you are waiting on the memory content for further action, your processor is effectively stalled. While our processors are generally getting faster, memory latency has not improved quickly, and the latency can even be higher on some of the most expensive processors. For this reason, a modern processor core can issue multiple memory requests at a given time. That is, the processor tries to load one memory element, keeps going, and can issue another load (even though the previous load is not completed), and so forth. Not long ago, Intel processor cores could support about 10 independent memory requests at a time. I benchmarked some small ARM cores that could barely issue 4 memory requests.

Today, the story is much nicer. The powerful processor cores can all sustain many memory requests. They support better memory-level parallelism.

To measure the performance of the processor, we use a pointer-chasing scheme where you ask a C program to load a memory address which contains the next memory address, and so forth. If a processor could only sustain a single memory request, such a test would use all available resources. We then modify this test so that we have two interleaved pointer-chasing schemes, then three, then four, and so forth. We call each new interleaved pointer-chasing component a ‘lane’.
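To illustrate, a one-lane and a two-lane pointer chase might look like the following sketch (simplified for exposition, not the actual benchmark code); each array entry stores the index of the next entry to visit:

#include <cstddef>
#include <cstdint>

// Single lane: each load depends on the previous one, so the core
// can only have one outstanding memory request for this chain.
uint64_t chase_one_lane(const uint64_t *next, uint64_t start, size_t steps) {
  uint64_t p = start;
  for (size_t i = 0; i < steps; i++) { p = next[p]; }
  return p;
}

// Two lanes: the two chains are independent, so the processor can
// overlap their cache misses (memory-level parallelism).
uint64_t chase_two_lanes(const uint64_t *next, uint64_t s1, uint64_t s2, size_t steps) {
  uint64_t p1 = s1, p2 = s2;
  for (size_t i = 0; i < steps; i++) { p1 = next[p1]; p2 = next[p2]; }
  return p1 ^ p2;
}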

As you add more lanes, you should see better performance, up to a maximum. The faster the performance goes up as you add lanes, the more memory-level parallelism your processor core has. The best Amazon (AWS) servers come with either Intel Ice Lake or Amazon’s very own Graviton 3. I benchmarked both of them, using one core of each type. The Intel processor has the upper hand in absolute terms: we achieve a maximal bandwidth of 12 GB/s compared to 9 GB/s for the Graviton 3. The one-lane latency is 120 ns for the Graviton 3 server versus 90 ns for the Intel processor. The Graviton 3 appears to sustain about 19 simultaneous loads per core against about 25 for the Intel processor.

Thus Intel wins, but the Graviton 3 has nice memory-level parallelism… much better than the older Intel chips (e.g., Skylake) and much better than the early attempts at ARM-based servers.

The source code is available. I am using Ubuntu 22.04 and GCC 11. All machines have small page sizes (4kB). I chose not to tweak the page size for these experiments.

Prices for Graviton 3 are 2.32 $US/hour (64 vCPU) compared to 2.448 $US/hour for Ice Lake. So Graviton 3 appears to be marginally cheaper (by about 5%) than the Intel chips.

When I write these posts, comparing one product to another, there is always hate mail afterward. So let me be blunt. I love all chips equally.

If you want to know which system is best for your application: run benchmarks. Comprehensive benchmarks found that Amazon’s ARM hardware could be advantageous for storage-intensive tasks.

Further reading: I enjoyed Graviton 3: First Impressions.

Data structure size and cache-line accesses

On many systems, memory is accessed in fixed blocks called “cache lines”. On Intel systems, the cache line spans 64 bytes. That is, if you access memory at byte address 64, 65… up to 127… it is all on the same cache line. The next cache line starts at address 128, and so forth.

In turn, data in software is often organized in data structures having a fixed size (in bytes). We often organize these data structures in arrays. In general, a data structure may reside on more than one cache line. For example, if I put a 5-byte data structure at byte address 127, then it will occupy the last byte of one cache line, and four bytes in the next cache line.
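The number of cache lines touched by a structure of a given size at a given byte address can be computed directly. Here is a small sketch, assuming 64-byte lines:

#include <cstddef>
#include <cstdint>

// Number of 64-byte cache lines spanned by [address, address + size).
size_t cache_lines_touched(uint64_t address, size_t size) {
  uint64_t first_line = address / 64;
  uint64_t last_line = (address + size - 1) / 64;
  return static_cast<size_t>(last_line - first_line + 1);
}
// e.g., cache_lines_touched(127, 5) == 2: one byte on the first line,
// four bytes on the next one.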

When loading a data structure from memory, a naive model of the cost is the number of cache lines that are accessed. If your data structure spans 32 bytes or 64 bytes, and you have aligned the first element of an array, then you only ever need to access one cache line every time you load a data structure.

What if my data structure spans 5 bytes? Suppose that I pack such structures in an array, using only 5 bytes per instance. If I pick one at random, how many cache lines do I touch? As you might expect, the answer is barely more than 1 cache line on average.

Let us generalize.

Suppose that my data structure spans z bytes. Let g be the greatest common divisor of z and 64. Suppose that you load one instance of the data structure at random from a large packed array. In general, the expected number of additional cache line accesses is (z – g)/64, so the expected total number of cache line accesses is one more: 1 + (z – g)/64. You can check that it works for z = 32: g is then 32, and (z – g)/64 = (32 – 32)/64 = 0.
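Here is a short sketch that evaluates the formula and checks it against a brute-force average over the distinct starting offsets of a tightly packed array (assuming 64-byte cache lines and C++17 for std::gcd):

#include <cstdio>
#include <numeric>  // std::gcd (C++17)

// Expected number of cache lines accessed when loading one structure of
// z bytes picked at random from a tightly packed array (64-byte lines).
double expected_lines(int z) {
  int g = std::gcd(z, 64);
  return 1.0 + double(z - g) / 64.0;
}

// Brute-force check: average over the 64/g distinct starting offsets
// that a packed array of z-byte structures can produce.
double expected_lines_bruteforce(int z) {
  int g = std::gcd(z, 64);
  double total = 0;
  int count = 0;
  for (int offset = 0; offset < 64; offset += g) {
    total += (offset + z - 1) / 64 - offset / 64 + 1; // lines touched
    count++;
  }
  return total / count;
}

int main() {
  for (int z = 1; z <= 64; z++) {
    printf("%d %g %g\n", z, expected_lines(z), expected_lines_bruteforce(z));
  }
}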

I created the following table for all data structures no larger than a cache line. The worst-case scenario is a data structure spanning 63 bytes: you then almost always touch two cache lines.

I find it interesting that you have the same expected number of cache line accesses for data structures of size 17, 20 and 24. It does not follow that the computational cost of a data structure spanning 24 bytes is the same as that of a data structure spanning 17 bytes. Everything else being identical, a smaller data structure should fare better, as it can fit more easily in CPU cache.

size of data structure (z)   expected cache line accesses
1 1.0
2 1.0
3 1.03125
4 1.0
5 1.0625
6 1.0625
7 1.09375
8 1.0
9 1.125
10 1.125
11 1.15625
12 1.125
13 1.1875
14 1.1875
15 1.21875
16 1.0
17 1.25
18 1.25
19 1.28125
20 1.25
21 1.3125
22 1.3125
23 1.34375
24 1.25
25 1.375
26 1.375
27 1.40625
28 1.375
29 1.4375
30 1.4375
31 1.46875
32 1.0
33 1.5
34 1.5
35 1.53125
36 1.5
37 1.5625
38 1.5625
39 1.59375
40 1.5
41 1.625
42 1.625
43 1.65625
44 1.625
45 1.6875
46 1.6875
47 1.71875
48 1.5
49 1.75
50 1.75
51 1.78125
52 1.75
53 1.8125
54 1.8125
55 1.84375
56 1.75
57 1.875
58 1.875
59 1.90625
60 1.875
61 1.9375
62 1.9375
63 1.96875
64 1.0

Thanks to Maximilian Böther for the motivation of this post.

Parsing JSON faster with Intel AVX-512

Many recent Intel processors benefit from a new family of instructions called AVX-512. These instructions operate over wide registers (up to 512 bits) and follow the Single instruction, multiple data (SIMD) paradigm. These new AVX-512 instructions allow you to break some speed records, such as decoding base64 data at the speed of a memory copy.

Most modern processors have SIMD instructions. The AVX-512 instructions are wider (more bits per register), but that is not necessarily their main appeal. If you merely take existing SIMD algorithms and apply them to AVX-512, you will probably not benefit as much as you would like. It is true that wider registers are beneficial, but in superscalar processors (processors that can issue several instructions per cycle), the number of instructions you can issue per cycle matters as much if not more. Typically, 512-bit AVX-512 instructions are more expensive and the processor can issue fewer of them per cycle. To fully benefit from AVX-512, you need to carefully design your code. It is made more challenging by the fact that Intel is releasing these instructions progressively: the recent processors have many new powerful AVX-512 instructions that were not initially available. Thus, AVX-512 is not “one thing” but rather a family of instruction sets.

Furthermore, early implementations of the AVX-512 instructions often led to measurable downclocking: the processor would reduce its frequency for a time following the use of these instructions. Thankfully, the latest Intel processors to support AVX-512 (Rocket Lake and Ice Lake) have done away with this systematic frequency throttling. And it is easy to detect these recent processors at runtime.
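One way to do such runtime detection is with compiler built-ins. The sketch below is illustrative (x64 targets, GCC or Clang; the exact set of feature names supported depends on your compiler version): it checks for AVX512VBMI2, a feature that only the newer cores such as Ice Lake implement.

#include <cstdio>

// Sketch: decide at runtime whether an AVX-512 code path is worth enabling.
// AVX512VBMI2 only exists on the newer cores (Ice Lake and later), which also
// avoid the heavy frequency throttling of earlier AVX-512 implementations.
bool prefer_avx512() {
  return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vbmi2");
}

int main() {
  printf(prefer_avx512() ? "AVX-512 path\n" : "fallback path\n");
}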

Amazon’s powerful Intel servers are based on Ice Lake. Thus if you are deploying your software applications to the cloud on powerful servers, you probably have pretty good support for AVX-512 already!

A few years ago, we released a really fast C++ JSON parser called simdjson. It is somewhat unique as a parser in that it relies critically on SIMD instructions. On several metrics, it was and still is the fastest JSON parser, though other interesting competitors have emerged.

Initially, I had written a quick and dirty AVX-512 kernel for simdjson. We never merged it and after a time, I just deleted it. I then forgot about it.

Thanks to contributions from talented Intel engineers (Fangzheng Zhang and Weiqiang Wan) as well as indirect contributions from readers of this blog (Kim Walisch and Jatin Bhateja), we produced a new and shiny AVX-512 kernel. As always, keep in mind that simdjson is the work of many people, a whole community of dozens of contributors. I must express my gratitude to Fangzheng Zhang, who first wrote to me about an AVX-512 port.

We just released this new kernel as part of the latest version of simdjson. It breaks new speed records.

Let us consider an interesting test where you seek to scan a whole file (spanning kilobytes) to find a value corresponding to some identifier. In simdjson, the code is as follows:

   auto doc = parser.iterate(json);    
   for (auto tweet : doc.find_field("statuses")) {
      if (uint64_t(tweet.find_field("id")) == find_id) {
        result = tweet.find_field("text");
        return true;
      }
    }
    return false;

On a Tiger Lake processor, with GCC 11, I get a 60% increase in the processing speed, expressed by the number of input bytes processed per second.

simdjson (512-bit SIMD): new 7.4 GB/s
simdjson (256-bit SIMD): old 4.6 GB/s

The speed gain is so large because in this task we mostly just read the data and do relatively little secondary processing: we do not build a tree out of the JSON data, and we do not create a data structure.

The simdjson library has a minify function which just strips unnecessary spaces from the input. Maybe surprisingly, we are more than twice as fast as the previous baseline:

simdjson (512-bit SIMD): new 12 GB/s
simdjson (256-bit SIMD): old 4.3 GB/s
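For reference, calling the minify function looks roughly like the following; this mirrors the usage example in the simdjson documentation, but treat it as an illustrative sketch rather than a reference:

#include <simdjson.h>
#include <cstdio>
#include <cstring>
#include <memory>

// Strip insignificant whitespace from a JSON buffer.
int main() {
  const char *json = "[ 1, 2,   3, 4 ] ";
  size_t length = std::strlen(json);
  std::unique_ptr<char[]> buffer{new char[length]}; // output cannot be longer than the input
  size_t new_length{};                              // receives the minified length
  auto error = simdjson::minify(json, length, buffer.get(), new_length);
  if (error) { return 1; }
  printf("%.*s\n", int(new_length), buffer.get()); // prints [1,2,3,4]
}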

Another reasonable benchmark is to fully parse the input into a DOM tree with full validation. Parsing a standard JSON file (twitter.json), I get nearly a 30% gain:

simdjson (512-bit SIMD): new 3.6 GB/s
simdjson (256-bit SIMD): old 2.8 GB/s

While 30% may sound unexciting, we are starting from a fast baseline.
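For context, the full DOM parse boils down to something like the following sketch, adapted from simdjson’s documentation (errors are reported via exceptions here; the file name is just the standard test file):

#include <simdjson.h>
#include <iostream>

// Fully parse (and validate) a JSON file into a DOM tree.
int main() {
  simdjson::dom::parser parser;
  simdjson::dom::element tweets = parser.load("twitter.json");
  std::cout << tweets["search_metadata"]["count"] << " results." << std::endl;
}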

Could we do better? Assuredly. There are many AVX-512 instructions that we are not using yet. We do not use ternary Boolean operations (vpternlog). We are not using the new powerful shuffle functions (e.g., vpermt2b). We have an example of coevolution: better hardware requires new software which, in turn, makes the hardware shine.
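To give a sense of what vpternlog offers, here is an illustrative snippet of mine (not simdjson code): a single _mm512_ternarylogic_epi32 call replaces a two-instruction AND/OR sequence, with the immediate 0xEA encoding the truth table of (a AND b) OR c.

#include <immintrin.h>

// Requires AVX-512F (compile with -mavx512f). The immediate is the 8-entry
// truth table of the desired function over bits (a, b, c); 0xEA is (a & b) | c.
__m512i and_then_or(__m512i a, __m512i b, __m512i c) {
  return _mm512_ternarylogic_epi32(a, b, c, 0xEA);
}

// Equivalent two-instruction version, for comparison:
__m512i and_then_or_plain(__m512i a, __m512i b, __m512i c) {
  return _mm512_or_si512(_mm512_and_si512(a, b), c);
}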

Of course, to get these new benefits, you need recent Intel processors with adequate AVX-512 support and, evidently, you also need a relatively recent C++ compiler. Some recent laptop-class Intel processors do not support AVX-512, but you should be fine if you rely on AWS and have big Intel nodes.

You can grab our release directly or wait for it to reach one of the standard package managers (MSYS2, conan, vcpkg, brew, debian, FreeBSD, etc.).

Avoid exception throwing in performance-sensitive code

There are various ways in software to handle error conditions. In C or Go, one returns error codes. Other programming languages like C++ or Java prefer to throw exceptions. One benefit of using exceptions is that it keeps your code mostly clean since the error-handling code is often separate.

It is debatable whether handling exceptions is better than dealing with error codes. I will happily use one or the other.

What I will object to, however, is the use of exceptions for control flow. It is fine to throw an exception when a file cannot be opened, unexpectedly. But you should not use exceptions to branch on the type of a value.

Let me illustrate.

Suppose that my code expects integers to be always positive. I might then have a function that checks such a condition:

#include <stdexcept>

int get_positive_value(int x) {
    if(x < 0) { throw std::runtime_error("it is not positive!"); }
    return x;
}

So far, so good. I am assuming that the exception is normally never thrown. It gets thrown if I have some kind of error.

If I want to sum the absolute values of the integers contained in an array, the following branching code is fine:

    int sum = 0;
    for (int x : a) {
        if(x < 0) {
            sum += -x;
        } else {
            sum += x;
        }
    }

Unfortunately, I often see solutions abusing exceptions:

    int sum = 0;
    for (int x : a) {
        try {
            sum += get_positive_value(x);
        } catch (...) {
            sum += -x;
        }
    }

The latter is obviously ugly and hard-to-maintain code. But what is more, it can be highly inefficient. To illustrate, I wrote a small benchmark over random arrays containing a few thousand elements. I use the LLVM clang 12 compiler on a Skylake processor. The normal code is 10000 times faster in my tests!

normal code 0.05 ns/value
exception 500 ns/value
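A minimal sketch of the kind of timing harness I have in mind follows (not my exact benchmark: the array size, value distribution and single-shot timing are illustrative; a serious benchmark would repeat the measurements):

#include <chrono>
#include <cstdio>
#include <random>
#include <stdexcept>
#include <vector>

int main() {
  // A few thousand random values, roughly half of them negative.
  std::vector<int> a(10000);
  std::mt19937 gen(1234);
  std::uniform_int_distribution<int> dist(-1000, 1000);
  for (int &x : a) { x = dist(gen); }

  auto start = std::chrono::steady_clock::now();
  long long sum = 0;
  for (int x : a) { sum += (x < 0) ? -x : x; } // normal branching code
  auto finish = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(finish - start).count();
  printf("normal:    %8.2f ns/value (sum=%lld)\n", ns / a.size(), sum);

  start = std::chrono::steady_clock::now();
  long long sum2 = 0;
  for (int x : a) {
    try {
      if (x < 0) { throw std::runtime_error("it is not positive!"); }
      sum2 += x;
    } catch (...) {
      sum2 += -x; // exception abused for control flow
    }
  }
  finish = std::chrono::steady_clock::now();
  ns = std::chrono::duration<double, std::nano>(finish - start).count();
  printf("exception: %8.2f ns/value (sum=%lld)\n", ns / a.size(), sum2);
}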

Your results will differ but it is generally the case that using exceptions for control flow leads to suboptimal performance. And it is ugly too!