How expensive is integer-overflow trapping in C++?

Integers in programming languages have a valid range but arithmetic operations can result in values that exceed such ranges. For example, adding two large integers can result in an integer that cannot be represented in the integer type. We often refer to such error conditions as overflows.

In a programming language like Swift, an overflow will result in the program aborting its execution. The rationale is that once an arithmetic operation has failed, everything else the program might be doing is suspect and you are better off aborting the program. Most other programming languages are not so cautious. For example, a Rust program compiled in release mode will not abort by default.

In C++, most compilers will simply ignore the overflow. However, popular compilers give you choices. When using GCC and clang, you can specify that integer overflows should result in a program crash (abort) using the -ftrapv flag.
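If you prefer not to instrument every operation with -ftrapv, GCC and clang also expose checked-arithmetic intrinsics. The sketch below is my own illustration, not part of the original benchmark: it aborts on overflow much like -ftrapv would, but only for the additions you choose to check.

```cpp
#include <cstdint>
#include <cstdlib>

// Checked addition using the __builtin_add_overflow intrinsic (GCC/clang).
// Returns the sum, or aborts if the signed addition overflows, mimicking
// what -ftrapv does for every signed operation.
int64_t checked_add(int64_t a, int64_t b) {
  int64_t result;
  if (__builtin_add_overflow(a, b, &result)) {
    abort(); // overflow: everything else the program might do is suspect
  }
  return result;
}
```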

I was curious about the performance implications so I wrote a small program that simply adds all of the values in a large array. The answer I sought turns out to depend critically on the choice of compiler:

              GCC 9         clang 9
no trapping   0.17 ns/int   0.11 ns/int
trapping      2.1 ns/int    0.32 ns/int
slowdown      12 x          3 x

With no trapping, the clang compiler beats GCC (0.11 vs. 0.17) by a 50% margin but this should not preoccupy us too much: it is a single microbenchmark.

What is a lot more significant is that enabling overflow trapping in GCC incurs an order of magnitude slowdown. Though it is only one microbenchmark, the size of the result suggests that we should be concerned. Looking at the assembly, I find that the clang compiler generates sensible code on x64 processors, with simple jumps added to detect the overflow. Meanwhile, GCC seems to call poorly optimized runtime library functions.

Overall, this one test does establish that checking for overflows can be expensive.

Credit: This blog post was motivated by an email by Stefan Kanthak.

Science and Technology links (September 19th 2020)

  1. A large city dating back 4,300 years has been discovered in China. It predates the Chinese civilization. At its center was a wide pyramid supporting a 20-acre palace. Little is known about the people living there other than the fact that they had relatively advanced technology.
  2. Genetic testing of dead warriors from 3000 years ago in Germany reveals that they could not drink milk. It is unclear when Europeans began to drink milk.
  3. Video games might be especially beneficial to boys, as they improve their literacy.
  4. Anti‐inflammatory treatment rescues age-related memory deficits in mice.
  5. People today live several years longer than people did thirty years ago. The core of the improvement is due to a reduction of deaths from heart disease. Meanwhile, drug overdoses are more lethal than they were.

Parsing floats in C++: benchmarking strtod vs. from_chars

Programmers often need to convert a string into a floating-point number. For example, you might get the string “3.1416” and you would like to get the resulting value as a floating-point type.

In C/C++, the standard way is the strtod function:

const char * string = "3.1416";
char * string_end = nullptr;
double x = strtod(string, &string_end);
if (string_end == string) {
  // you have an error!
}

Unfortunately, the strtod function is locale-sensitive, like many string processing functions. It means that it may behave differently depending on the system where you run the code. However, all runtime libraries appear to have locale-specific functions like strtod_l or _strtod_l (Windows). You can use these locale-specific functions to specify the default locale and thus get functions that behave the same on all systems.

One nice feature of the strtod function is that it gives you back a pointer at the end of the parsed number. In this manner, you can efficiently parse a sequence of comma-separated numbers, for example.
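As a sketch of this pattern, with a hypothetical helper of my own, you can advance through a comma-separated list by resuming each call at the end pointer reported by the previous one:

```cpp
#include <cstdlib>
#include <vector>

// Parse a comma-separated list of numbers, resuming after each value at the
// end pointer that strtod reports.
std::vector<double> parse_csv_numbers(const char *s) {
  std::vector<double> values;
  char *end = nullptr;
  for (;;) {
    double x = std::strtod(s, &end);
    if (end == s) { break; } // nothing parsed: stop
    values.push_back(x);
    s = end;
    if (*s != ',') { break; } // no comma: end of the list
    s++; // skip the comma
  }
  return values;
}
```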

You can also use C++ streams but they are typically not geared toward performance.

In C++17, we got a nicer option: the from_chars function. It works similarly:

std::string st = "3.1416";
double x;
auto [p, ec] = std::from_chars(st.data(), st.data() + st.size(), x);
if (ec != std::errc()) {
  // you have an error!
}

The great thing about from_chars is that it is, by definition, locale-independent. Furthermore, it is standardized, so it should be available everywhere. Unfortunately, to my knowledge, the only C++ compiler to come with a fully implemented from_chars function is Visual Studio 2019. And you need to have a recent release.

Indeed, though C++17 has been out for some time, and many compilers claim to fully support it, the runtime libraries need to catch up! You can get independent from_chars implementations, of course, but it would be nice if runtime libraries supported C++17 out of the box.

Still, we should celebrate that Microsoft is doing the right thing and eagerly supporting the latest standards.

I heard good things about the Visual Studio from_chars. And among other things, that it is very fast.

I wanted to know how much faster it was. So I wrote a benchmark! I generate a large number of random values in the [0,1] interval and I record how long it takes to parse them back.

strtod (C) 120 MB/s
from_chars (C++17) 140 MB/s

You do get a nice 20% performance boost, at no cost whatsoever. Brilliant.

Good students find questions, not answers

It is often believed that learning is a simple matter of collecting answers. I suspect that “learning mechanistically how to answer the questions” is a great way for weak students to pass courses, and for smart students to ace courses. However, I believe that if you really want to learn the material, you have to learn to ask good questions and stop focusing on answers.

If you ask experienced professors, they will probably admit that some students who do very well on tests (A+ all the way) make uninspiring graduate students, while less stellar undergraduate students can end up being superb graduate students.

Part of the difference is that in graduate school, you have to ask questions first. A researcher’s job is to come up with new questions, not necessarily new answers. That is because once you have a good question, the answer can often be derived mechanistically.

It is somewhat counterintuitive if you have been trained in conventional schools. For example, you see mathematics as a set of “obvious questions” followed by “non-obvious answers”. And you think that, surely, the hard part was coming up with the answers.

This leads you to think that the way forward is to work really hard on really complicated answers to simple questions. But where will the questions come from?

What is a good question?

  1. A question should be at the frontier of the domain of knowledge as you understand it. To really master the material, you have to go beyond the surface and understand its limits.
  2. It should trigger interest on the part of the recipient. Your question should be interesting, not just to you, but to the person receiving the question. You may wonder why this matters: it does because it sets a psychological bar, ruling out tempting but shallow questions.

A good question is usually prepared, not improvised. Most people need a few minutes to come up with a good question. But there are some quick recipes that often deliver good questions…

  • Questions about definitions… It is common that a speaker will use terms without defining them clearly. If you are unsure about how you would define a term, it might make for a great question.
  • “How do you know” questions are often great. If the expert tells you that programming in a given manner reduces memory usage, you may ask “How would I know?”

The real reason for asking good questions is that the process itself will force you to learn. That is, if you are interested in learning, in really learning, you will not focus on coming up with answers, rather you will focus on coming up with questions.

You can see it play out if you pay attention. Go to an event where “expert learners” are trying to up their game. For example, a good workshop with people at the top of their game. If the event is well managed, you will observe that people are asking each other good questions.

Experienced professors do it effortlessly. The best engineers are great at it.

As an experiment, next time you encounter an expert in a field that you do not yet master, focus on trying to come up with good, non-obvious questions. I bet that you will learn a lot.

Science and Technology links (September 5th 2020)

  1. Single cells are able to navigate complex mazes. E.g., it works with mouse pancreatic cancer cells.
  2. Body builders and athletes sometimes take a supplement called AKG. It was found to prolong life in worms. Researchers have now found that AKG massively delays frailty in mice: old mice that took AKG look better and are stronger. The mice also live a bit longer.
  3. Hair dyes applied at home are not associated with greater cancer risks.
  4. Plaque in the arteries is correlated with cardiovascular diseases (heart attacks) and tends to accumulate with age. Researchers have determined that a daily supplement of something called IPE could reverse the amount of plaque in a relatively small clinical trial.
  5. A company called NuScale got a new nuclear reactor design approved in the US. It took over two million pages of documentation. These are new, small and very safe reactors. They should become commercially available in a few years.
  6. People who suffer from diabetes are at greater risk of death from diseases like COVID-19. Researchers found that those taking metformin (a cheap and popular anti-diabetes drug) are protected from COVID-19.
  7. In mice, both caloric restriction (eating less) and a drug called rapamycin prolong female fertility into old age.

Sentinels can be faster

A standard trick in programming is to use “sentinel values”. These are special values that represent metadata efficiently.

The C language represents strings as sequences of characters terminated with the null character. The null character is a sentinel that marks the end of the string. E.g., my name (“Lemire”) would be represented as the string “Lemire\0” in memory when using the C language, where I represent the null character as ‘\0’. The length of my name (6 characters) is never stored explicitly.

If you are a long-time C programmer, you probably think that the null sentinel is an “obvious” concept but newcomers to programming often find it counterintuitive. It is a non-trivial idea.

The null sentinel is “space efficient” in C. It is almost surely the case that other programming languages have more memory overhead per string.

Yet I remember reading that null-terminated strings were a design flaw in the C language from the point of view of performance. Indeed, you frequently need to know the length of a string and when you do, you need to scan all of the string to find the null sentinel and compute the string length. Other ancient programming languages like Pascal had string values with length prefixes. Today, I expect most programming languages to have steered clear of null-terminated strings.
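The performance point is easy to see in code. With a sentinel, computing the length is a linear scan, whereas reading a length prefix is a single load; the following toy function, a sketch of my own, is essentially what strlen must do:

```cpp
#include <cstddef>

// With null-terminated strings, the length is not stored: you must scan
// until you hit the sentinel, so the cost grows with the string length.
size_t c_string_length(const char *s) {
  const char *p = s;
  while (*p != '\0') { p++; } // look for the sentinel
  return size_t(p - s);
}
```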

But the use of sentinels can also lead to computationally efficient code, and you do not need null characters per se. Let me illustrate.

Suppose that you are given a string and that you want to skip all of the leading spaces. If you are only given the starting point of the string and either its length or its ending point, then you might be stuck with code like so:

const char * skip_leading_spaces(const char * start, const char * end) {
  while((start != end) && (is_space(*start))) {
    start++;
  }
  return start;
}

For each character, you have to check that you are still within the string, and you have to check whether it is a space. You end up doing two comparisons! The C++ programming language has a strong bias in favour of such constructions. The popular functional programming idioms are often built on top of similar code.

Of course, you could get smarter. For example, you could branch on the string size to try to avoid one of the checks at each character. But if the string length is hard to predict, you may end up with small benefits and complicated code.

What if you know that the string is either null-terminated, or that it contains at least one non-space character that can serve as a de facto sentinel? Then you can use simpler code where you only have to load a new character and check that it is a space:

const char * skip_leading_spaces(const char * start) {
  while(is_space(*start)) {
    start++;
  }
  return start;
}

You end up with half the number of comparisons!

Is it faster?

It can be a lot faster. I wrote a little benchmark where I generate random strings having a random number of leading spaces. The sentinel approach is 40% faster in my tests.

Of course, the results depend critically on your data and on the exact problem you are trying to solve. Yet I think it is worth keeping sentinels in mind when you need to write high performance code, and you are dealing with irregular data.

Science and Technology links (August 29th 2020)

  1. In children, higher video game time is positively associated with cognition (i.e., kids who play more video games are smarter). Note that it does not follow that playing video games makes you smarter; it could be that smarter kids are more interested in video games.
  2. I always assumed that the introduction of antibiotics after world war 2 would be visible in the longevity statistics. But neither the introduction of vaccines nor the widespread introduction of antibiotics is visible in the longevity curves as a sharp upward discontinuity. Instead, we see a boring monotonic curve: people live longer and longer without any specific boost due to the introduction of a given technology. In fact, researchers tell us that antibiotics did not have a dramatic effect on the mortality of infectious diseases. I have never encountered a good explanation regarding our ever increasing longevity.
  3. Loeb discusses the effect of ever increasing longevity. He points out that, for the very old, there is no cake large enough to fit one candle per year on a birthday cake. Instead, he suggests, we should use a number of candles proportional to the logarithm of the age. So at year 2, you would get one candle, you might get 2 at year 4, 3 at year 8, 4 at year 16, 5 at year 32, 6 at year 64 and 7 at year 128. It scales much better.
  4. A single injection of a human gene believed to boost muscles and strength in old mice did just that: the old mice got stronger. Evidently, it suggests that the same could be done in elderly human beings.
  5. Computer speed has historically increased due to Moore’s law: processors acquired more and more transistors and got faster clock speeds. Some people have been pessimistic about the future. In There’s plenty of room at the Top, Leiserson et al. point out that we can do much to keep improving the performance of our computers. They conjecture that it will require more specific techniques, such as a better integration of software and hardware. If true, this would be in contrast with historical trends where we have increasingly abstracted out the hardware within the software, and tended to rely on generic hardware designs. It may help vertically integrated vendors like Apple while hurting horizontal players like Intel.

Science and Technology links (August 9th 2020)

  1. The BBC reports that diversity and anti-bias training is of little use and may even be counterproductive if the goal is to reduce biases:

    “The effect of bias training is very weak if you look at the long run,” says Kalev. “A company is better off doing nothing than mandatory diversity training.”

    Some research warns that such training may spur more racism by enticing people to think in terms of “races”.

  2. Our ancestors frequently died of smallpox, and it did not affect just kids or the poor:

    [Louis XIV] was a man who almost died of smallpox when he was 9 years old and lost nearly all of his legitimate heirs — his son, a grandson and a great-grandson — along with his younger brother, another grandson and a great-grandson, to smallpox. Eventually, he was succeeded by his second great-grandson, who became Louis XV and died (you guessed it) of smallpox.

    In America, smallpox is usually associated with the decimation of Native Americans, but Europeans were not immune to the disease. As late as the 18th century, for example, smallpox killed about 400,000 Europeans annually. The overall mortality rate was 20% to 60%. Among infants, it was more than 80% and was one of the reasons for the low overall life expectancy of 20 to 30 years. The disease was eradicated in 1980. Today, we don’t think of smallpox any more than we think of the bubonic plague, which, in five short years, killed almost one-third of all Europeans in the 14th century.

  3. Eating well is associated with high greenhouse gas emissions:

    After adjustment for energy intake, high-nutritional-quality diets had significantly higher greenhouse gas emissions (+9% and +22% for men and women, respectively) than did low-nutritional-quality diets.

  4. It seems that the extended ice age our ancestors survived was caused by several volcanic eruptions. This was only a few thousand years ago.
  5. In the UK, coal use has fallen to levels so low that it compares with pre-industrial levels.
  6. Though the population in the Western world is aging quickly, the number of people affected by dementia (e.g., Alzheimer’s) is falling. We do not know why.
  7. We are reportedly renaming genes to avoid the limitations of Microsoft Excel. (This should be an object of shame for Microsoft.)
  8. Mammals like human beings have a limited ability to recover from brain damage. In particular, we are not very good at producing new neurons. However, recent work shows that other cells in the brain can become neuron stem cells (neurons able to produce other neurons) in response to injury.

Performance tip: constructing many non-trivial objects is slow

I started programming professionally when Java came out and right about when C++ was the “hot new thing”. Following the then-current fashion, I looked down at C and Pascal programming.

I fully embraced object-oriented programming. Everything had to be represented as an object. It was the right thing to do. Like many young (and not so young) programmers, I suffered from cargo cult programming. There was a new fashion and its advocates were convincing. I became a staunch advocate myself.

When you program in C, allocating and de-allocating dynamic resources is hard work. Thus you tend to give much consideration to how and when you are going to allocate resources. More “modern” languages like Java and C++ have freed us from such considerations. Largely, the trend has continued, with a few isolated exceptions.

Over time, I came to recognize one vicious performance anti-pattern that comes with object-oriented programming: the widespread use of small but non-trivial objects. Objects that do a non-trivial amount of work at the beginning or end of their life are “non-trivial”. For example, arrays and strings with sizes determined at runtime are non-trivial.

The net result is that in many instances, with a few lines of code, you can generate thousands or millions of tiny non-trivial objects. For example, in JavaScript, the JSON.parse function takes an input JSON string (a single string), and maps it into lots and lots of tiny objects. Another example is the programmer who loads a file, line by line, storing each line into a new string instance, and then splits the line into dynamically allocated substrings. These programming strategies are fine as long as they are not used in performance sensitive code.

Let us run a little experiment using a fast programming language (C++). Starting from a long random string, I am going to randomly generate 100,000 small substrings, each with a random length of less than 16 bytes, using non-trivial objects (std::string). For comparison, I am going to create 100,000 identical small strings using trivial objects (C++17 std::string_view). A std::string_view is basically just a pointer into a memory region managed elsewhere.
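To make the distinction concrete, here is a minimal sketch, with hypothetical helper names of my own, of what each representation does when you construct it from a region of memory:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Non-trivial: std::string owns its bytes, so construction may allocate
// and must copy the characters.
std::string make_owning(const char *data, size_t len) {
  return std::string(data, len);
}

// Trivial: std::string_view is just a pointer and a length into memory
// owned by someone else; construction copies nothing.
std::string_view make_view(const char *data, size_t len) {
  return std::string_view(data, len);
}
```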

How quickly can I copy these strings? I published the source code of a benchmark.

I am expressing the speed in volume per second (GB/s) where the volume is given by the sum of all of the small strings. Your results will vary, but here is what I get under GNU GCC 9 (Linux) with an AMD Rome processor:

non-trivial objects (std::string) 0.8 GB/s
trivial objects (std::string_view) 15 GB/s

The std::string implementation is likely optimized for handling small strings. If you repeat the same experiment while storing your data in a std::vector instance, you may get catastrophic performance.

Though 0.8 GB/s may sound fast, keep in mind that it is the speed to do something trivial (copying the strings). If even the trivial parts of your software are slow, there is no hope that your whole system is going to be fast.

Furthermore, do not expect higher-level languages like Python or JavaScript to do better. The C++ performance is likely a sensible upper bound on how well the approach works.

Recommendation: In performance-sensitive code, you should avoid creating or copying thousands or millions of non-trivial objects.

Science and Technology links (August 1st 2020)

  1. In Japan, a large dam is being constructed almost entirely by robots.
  2. Naked mole rats are mammals that do not age in the sense that their fitness and mortality follow a flat curve, and not a Gompertz curve. Senescent cells are large dysfunctional cells that we tend to accumulate as we age. Researchers report that naked mole rats have few senescent cells. It may be because their senescent cells are less resilient than those of mice.
  3. Mice consuming a little bit of alcohol regularly live longer and are less likely to become obese.
  4. Cable TV is dying and the COVID-19 pandemic does appear to be helping. AT&T lost 900,000 cable subscribers in the first quarter of 2020 while Comcast lost 477,000 cable subscribers in the second quarter.
  5. Yeast cells age in one of two distinct, mutually exclusive ways.
  6. Smallpox was killer:

    In America, smallpox is usually associated with the decimation of Native Americans, but Europeans were not immune to the disease. As late as the 18th century, for example, smallpox killed about 400,000 Europeans annually. The overall mortality rate was 20% to 60%. Among infants, it was more than 80% and was one of the reasons for the low overall life expectancy of 20 to 30 years.

Science and Technology links (July 25th 2020)

  1. I was taught that human beings only arrived in America recently (15,000 years ago). It turns out that this is wrong. There were human beings in America 30,000 years ago. (Source: Nature)
  2. Quantum tunneling is not instantaneous, contrary to what you were told.
  3. The U.S. Air Force is planning to introduce a new automated combat aircraft. They call it Skyborg.
  4. The inexpensive supplement glucosamine seems to be associated with lower risks of death.
  5. The Roman empire thrived under warm conditions:

    This record comparison consistently shows the Roman as the warmest period of the last 2000 years, about 2 °C warmer than average values for the late centuries for the Sicily and Western Mediterranean regions. After the Roman Period a general cooling trend developed in the region with several minor oscillations. We hypothesis the potential link between this Roman Climatic Optimum and the expansion and subsequent decline of the Roman Empire.

    (Source: Nature)

    It is estimated that current global warming trends could lead to a global rise in temperature by more than 2 °C though less than 3 °C.

Avoid character-by-character processing when performance matters

When processing strings, it is tempting to view them as arrays of characters (or bytes) and to process them as such.

Suppose that you would like to determine whether a string is ASCII. In ASCII, every character must be a byte value smaller than 128. A fine C++17 approach to check that a string is ASCII might look as follows.

bool is_ascii_branchy(const std::string_view v) {
   for (size_t i = 0; i < v.size(); i++) {
     if (uint8_t(v[i]) >= 128) {
       return false;
     }
   }
   return true;
}

It is important to consider the logic of this code. What you are telling the compiler is to access all characters in sequence, check whether each is an ASCII character, and bail out if not. Thus if the string starts with a non-ASCII character, only the first character needs to be read.

It might be high performance code if you are expecting the strings to mostly start with non-ASCII characters. But if you are expecting the string to be almost always ASCII, then this code is not going to be optimal.

You might complain that the compiler should be able to optimize it for you, and it will, but only within the constraints of the code you provided. Compilers are typically not in the business of redesigning your algorithms.

If you are expecting ASCII inputs, then you should just run through the string  using as few steps as possible. The following code relies on the fact that our processors can process 64-bit blocks using single instructions:

bool is_ascii_branchless(const std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  for(; i + 8 <= v.size(); i+=8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  for(; i < v.size(); i++) {
      running |= v[i];
  }
  return (running & 0x8080808080808080) == 0;  
}

It is an optimistic function: if you encounter a non-ASCII character early on, you will end up doing a lot of needless work if the string is long.

You could try a hybrid between the two. You read 8 characters, check whether they are ASCII, and bail out if they aren’t.

bool is_ascii_hybrid(const std::string_view v) {

  size_t i = 0;
  for(; i + 8 <= v.size(); i+=8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    if((payload & 0x8080808080808080) != 0) return false;
  }
  for(; i < v.size(); i++) {
      if((v[i] & 0x80) != 0) return false;
  }
  return true;
}

How do these functions compare? I wrote a quick benchmark with short ASCII strings (fewer than 128 characters). I find that the character-by-character function runs at about half the speed. Your results may vary; feel free to run my benchmark on your machine with your compiler.

character-by-character 2.0 GB/s
optimistic 3.5 GB/s
hybrid 3.4 GB/s

With some work, you can probably go much faster but be mindful of the fact that I deliberately chose a benchmark with small, fragmented, strings.

Downloading files faster by tweaking headers

I was given a puzzle recently. Someone was parsing JSON files downloaded from the network from a bioinformatics URI. One JSON library was twice as fast as the other one.

Unless you are on a private high-speed network, the time required to parse a file will always be small compared to the time required to download a file. Maybe people at Google have secret high-speed options, but most of us have to make do with speeds below 1 GB/s.

So how could it be?

One explanation might have to do with how the client (such as curl) and the web server negotiate the transmission format. Even if the actual data is JSON, what is transmitted is often in compressed form. Thankfully, you can tell your client to request some encoding. In my particular case, out of all of the encodings I tried, gzip was much faster. The reason seems clear enough: when I requested gzip, I got 82 KB back, instead of 766 KB.

curl -H 'Accept-Encoding: gzip' $URL       0.5 s   82 KB
curl -H 'Accept-Encoding: deflate' $URL    1.0 s   766 KB
curl -H 'Accept-Encoding: br' $URL         1.0 s   766 KB
curl -H 'Accept-Encoding: identity' $URL   1.0 s   766 KB
curl -H 'Accept-Encoding: compress' $URL   1.0 s   766 KB
curl -H 'Accept-Encoding: *' $URL          1.0 s   766 KB

Sure enough, if you look at the downloaded file, it has 766 KB, but if you gzip it, you get back 82 KB.

What I find interesting is that my favorite tools (wget and curl) do not request gzip by default. At least in this instance, it would be much faster. The curl tool takes the --compressed flag to make life easier.

Of course, the point is moot if the data is already in compressed form on the server.

The cost of runtime dispatch

For high-performance software, you sometimes need different functions depending on what the hardware supports: some functions for advanced processors, others for legacy processors.

When you compile the code, the compiler does not yet know which code path will be taken. At runtime, when you start the program, the right function is chosen. This process is called runtime dispatch. Standard libraries will apply runtime dispatch without you having to do any work. However, if you write your own fancy code, you may need to apply runtime dispatching.

On Intel and AMD systems, you can do so by querying the processor and selecting, based on its answer, among the various functions you have compiled. Under Visual Studio, you can use the __cpuid function while GNU GCC has __get_cpuid.

How expensive is this step?

Of course, the answer depends on your exact system, but can we get some idea? I wrote a small C++ benchmark and my estimate is between 100 ns and 150 ns. So it is several hundred cycles.

Though it is inexpensive in absolute terms, if the function being dispatched is itself cheap and called repeatedly, it may not be viable to pay this price on each call. So you can simply do the query once, and then point at the right function for all follow-up calls.

Your only mild concern should be concurrency: what if several threads call the same function for the first time at once? In a language like C++, it is unsafe to have several threads modify the same variable. Thankfully, it is a simple matter of requiring that the change and queries be done atomically. On Intel and AMD processors, atomic accesses are often effectively free.
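Putting these pieces together, a common sketch of mine, with placeholder functions standing in for the real CPU query and the real kernels, caches the chosen function in an atomic pointer: the first call queries the processor, and every later call goes straight through the cached pointer.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

using sum_fn = uint64_t (*)(const uint64_t *, size_t);

// Fallback implementation for legacy processors.
static uint64_t sum_generic(const uint64_t *data, size_t n) {
  uint64_t total = 0;
  for (size_t i = 0; i < n; i++) { total += data[i]; }
  return total;
}

// Stand-in for a function compiled for advanced processors.
static uint64_t sum_fancy(const uint64_t *data, size_t n) {
  return sum_generic(data, n);
}

// Placeholder for a real __cpuid/__get_cpuid query.
static bool cpu_supports_fancy() { return false; }

static uint64_t sum_dispatch(const uint64_t *data, size_t n);

// The atomic pointer initially targets the dispatcher itself.
static std::atomic<sum_fn> active_sum{sum_dispatch};

static uint64_t sum_dispatch(const uint64_t *data, size_t n) {
  sum_fn chosen = cpu_supports_fancy() ? sum_fancy : sum_generic;
  active_sum.store(chosen, std::memory_order_relaxed); // pay the query once
  return chosen(data, n);
}

uint64_t sum(const uint64_t *data, size_t n) {
  return active_sum.load(std::memory_order_relaxed)(data, n);
}
```

If several threads hit the dispatcher at once, they all store the same pointer, so the race is benign.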

Science is the belief in the ignorance of experts

Science is the belief in the ignorance of experts said Richard Feynman. Feynman had a Nobel prize in physics. He was a remarkable educator: his lecture notes are still popular. He foresaw nanotechnology and quantum computing. He is credited with identifying the cause of the Space Shuttle Challenger disaster. There is a beautiful talk by his daughter and there are many interviews with his sister who became a physicist herself. I read Feynman as a young adult and it has shaped my worldview ever since.

Let me put Feynman’s quote in context:

Science alone of all the subjects contains within itself the lesson of the danger of belief in the infallibility of the greatest teachers in the preceding generation (…) When someone says, “Science teaches such and such,” he is using the word incorrectly. Science doesn’t teach anything; experience teaches it. If they say to you, “Science has shown such and such,” you might ask, “How does science show it? How did the scientists find out? How? What? Where?” It should not be “science has shown” but “this experiment, this effect, has shown.” And you have as much right as anyone else, upon hearing about the experiments–but be patient and listen to all the evidence–to judge whether a sensible conclusion has been arrived at. (…) The experts who are leading you may be wrong. (…) I think we live in an unscientific age in which almost all the buffeting of communications and television-words, books, and so on-are unscientific. As a result, there is a considerable amount of intellectual tyranny in the name of science. (…) Science alone of all the subjects contains within itself the lesson of the danger of belief in the infallibility of the greatest teachers of the preceding generation.

Many people, including people with a PhD and a career in research, miss Feynman’s point. Let me decompose it.

  1. How do we know anything? The most common, age-old, approach is to acquire knowledge from a figure of authority or an expert. Your teacher, your mother, your boss. But these experts are routinely wrong, often in unexpected ways:
    • Back in 1955, all textbooks told you that human beings have 24 pairs of chromosomes, even though we have only 23 pairs.
    • Before the 1960s, plate tectonics was often ridiculed.
    • The Nobel prize recipient Paul Krugman predicted that the Internet would have a small effect (“The growth of the Internet will slow drastically (…) By 2005, it will become clear that the Internet’s impact on the economy has been no greater than the fax machine’s.”)
    • Heinrich Hertz claimed that radio waves were of no use whatsoever.
  2. Since the beginning of time, we have also given this process of knowledge-from-expertise physical forms. In modern day civilization, we have the peer-reviewed article, the professor at their university, the government scientist in a laboratory coat. In many ways, these people play the same social role as the tribe elder or the priest. The peer-reviewed article is like a sacred text.
  3. This pre-existing knowledge and its transmission is often called “science”. In such a context, anything can be a science: political science, social science, religious science, and so forth. But whether the knowledge is true or false, it may have little to do with science. It may even be that these institutions that pretend to be about science are unscientific. The fundamental defining characteristic of science, the one that Feynman explicitly identifies, is that we do not decide whether something is true or false based on authority but rather based on experience. If someone tells you that there are 24 pairs of chromosomes, you have a duty to ask “how do they know?”, “how would I find out?”.
  4. For science… It does not matter whether you are young or old. You can be rich or poor. You can be schooled or not. But you must listen, learn and be patient. In effect, you need to be a constructive skeptic. And you must question your own ideas with even more effort than you question other ideas.
    Paul Graham puts it well:

    To be a successful scientist it’s not enough just to be right. You have to be right when everyone else is wrong. Conventional-minded people can’t do that.

Interestingly, once you have appreciated science’s true nature as defined by Feynman, you see that it is generally applicable and not limited to Physics. In fact, Feynman clearly believed that the idea of science was applicable to education, for example.

Unfortunately, Feynman was correct. We live in an unscientific civilization. Science survives in niches.

Here are a few unscientific behaviours that are frustratingly common:

  1. Teach science as a large set of facts and techniques. I claim that hardly any science at all gets taught in high school. You tell kids that matter is made of atoms, which are made of a small nucleus surrounded by electrons at different energy levels. That was taught to one of my sons in high school. Then I asked “how do the scientists know this?” Blank stare. Science is not learning about the charge of the electron by reading a book. If that were all it was, it would be no different from pre-scientific knowledge. No. Science begins when you ask yourself how we can verify what is in the book. How should you teach science? You should essentially teach people to read or listen with constructive skepticism. If you are going to teach science, you must present it in its historical context: you must stress how our current knowledge is the accumulation of course corrections. You must stress how it is a work in progress.
  2. Receive science as a set of facts. We are sometimes given advice and told that it comes from “the science”, as if that settled anything. That is often no different than being told that the knowledge comes from “the high priest himself”. It is certainly a power play, but it is not science. When someone tells you that science says XYZ, you should always ask “How do people know that XYZ is true?” And as an answer, you cannot just repeat their arguments: you must think for yourself. If you never come up with unexpected questions, you are not a scientist.
  3. “I defer to the experts” (said with pride). The defining saying of science should be that nobody knows anything. Deferring to the expert is the easy pre-scientific path, but doubting the expert is the way of science. Observe that there is a difference between listening to the experts (something Feynman advocated) and deferring to the experts. If your doctor tells you to take some pills, do listen! If you don’t understand, do be patient. Don’t throw out the expert advice. But don’t be ruled by it either!

Science may sound irrational when you spell it out as a doctrine of doubt. If you follow the path of science, you are going to be playing with ideas that are either objectively wrong, or socially wrong, at a much higher rate than if you just followed the experts. But, for scientists, genuine scientists, the goal is not to be right as often as possible. There is no contradiction between “being a good scientist” and “being wrong”. The goal is often to find what Peter Thiel might call “secret truths”. Everybody knew, at the beginning of the XXth century, that it would take decades or centuries before one could fly in an airplane. You could make such a statement with full confidence, even after the Wright brothers had made their first demonstration. And many people of authority were doing just that. To uncover secret truths, you have to train yourself to ask more questions, to carry more doubts.

Given that science is fundamentally subversive, how could it emerge and survive? I believe that scientific thinking is part of a broader ideology that gives its bearers an evolutionary edge. Simply put, societies that make room for science have an edge. They build and deploy better technology. They adapt more quickly to change.

Science and Technology links (July 11th 2020)

  1. Some upcoming Mercedes cars will have augmented reality head-up displays.
  2. Intel’s new standard for high-speed cables (Thunderbolt) supports 3 GB/s bandwidth. (This is very fast: internal disks usually cannot sustain such speeds.)
  3. Enrichment activities have no cognitive benefits in kids.
  4. Aspirin may help prevent cancer metastasis. It also seems to help prevent cancer recurrence.
  5. Exercise is beneficial. It helps keep the brain young. Could you benefit from exercise without exercising at all? Plasma transfer might help:

    Exercise has a broad range of beneficial healthful effects. Horowitz et al. tested whether the beneficial effects of exercise on neurogenesis in the brain and improved cognition in aged mice could be transferred in plasma (blood without its cellular components) from one mouse to another (see the Perspective by Ansere and Freeman). Indeed, aged mice that received plasma from young or old mice that had exercised showed beneficial effects in their brains without hitting the treadmill. The authors identified glycosylphosphatidylinositol-specific phospholipase D1 as a factor in plasma that might, in part, mediate this favorable effect.

GNU GCC on x86 does not round floating-point divisions to the nearest value

I know that floating-point arithmetic is a bit crazy on modern computers. For example, floating-point numbers are not associative:

0.1+(0.2+0.3) == 0.599999999999999978
(0.1+0.2)+0.3 == 0.600000000000000089

But, at least, this is fairly consistent in my experience. You should simply not assume fancy properties like associativity to work in the real world.

Today I stumbled on a fun puzzle. Consider the following:

double x = 50178230318.0;
double y = 100000000000.0;
double ratio = x/y;

If God did exist, the variable ratio would be 0.50178230318 and the story would end there. Unfortunately, there is no floating-point number that is exactly 0.50178230318. Instead it falls between the floating-point number 0.501782303179999944 and the floating-point number 0.501782303180000055.

It is important to be a bit more precise. The 64-bit floating-point standard represents numbers as a 53-bit mantissa multiplied by a power of two. So let us spell it out the way the computer sees it:

0.501782303179999944 == 4519653187245114  * 2 ** -53
0.501782303180000055 == 4519653187245115  * 2 ** -53

We have to pick the mantissa 4519653187245114 or the mantissa 4519653187245115. There is no way to represent exactly anything that falls in-between using 64-bit floating-point numbers. So where does 0.50178230318 fall exactly? We have approximately…

0.50178230318 = 4519653187245114.50011795456 * 2 ** -53

So the number is best approximated by the largest of the two values (0.501782303180000055).

Goldberg in What every computer scientist should know about floating-point arithmetic tells us that rounding must be to the nearest value:

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). (…) One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds.

Python gets it right:

>>> "%18.18f" % (50178230318.0/100000000000.0)
'0.501782303180000055'

JavaScript gets it right:

> 0.50178230318 == 0.501782303180000055
true
> 0.50178230318 == 0.501782303179999944
false

So the story would end there, right?

Let us spin up the GNU GCC 7 compiler for x86 systems and run the following C/C++ program:

#include <stdio.h>
int main() {
  double x = 50178230318.0;
  double y = 100000000000.0;
  double ratio = x/y;
  printf("x/y  = %18.18f\n", ratio);
}

Can you predict the result?

$ g++ -o round round.cpp
$ ./round
x/y  = 0.501782303179999944

So GNU GCC actually picks the smaller and more distant value, instead of the nearest value.

You may doubt me, so I have created a docker-based test.

You might think that it is a bug that should be reported, right?

There are dozens if not hundreds of similar reports to the GNU GCC team. They are being flagged as invalid.

Let me recap: the GNU GCC compiler may round the result of a division between two floating-point numbers to a value that is not the nearest. And it is not considered a bug.

The explanation is that the compiler first rounds to nearest using 80 bits and then rounds again (this is called double rounding). This is what fancy numerical folks call FLT_EVAL_METHOD = 2.

However, the value of FLT_EVAL_METHOD remains at 2 even if you add optimization flags such as -O2, and yet the result will change. The explanation is that the optimizer figures out the solution at compile-time and does so ignoring the FLT_EVAL_METHOD value. Why it is allowed to do so is beyond me.

Maybe you think it does not matter. After all, the value is going to be close, right? However, if you are an experienced programmer, you know the value of having deterministic code: you run the code and you always get the same results. If the results depend, some of the time, on your exact compiler flags, it makes your life much more difficult.

You can also try to pass GNU GCC the flags -msse -mfpmath=sse, as experts recommend, but as my script demonstrates, it does not solve the issue (and then you get FLT_EVAL_METHOD = -1). You need to also add an appropriate target (e.g., -msse -mfpmath=sse -march=pentium4). In effect, when using GNU GCC, you cannot get away from specifying a target. The flags -msse -mfpmath=sse alone will silently fail to help you.

Some people have recommended using other flags to switch the compiler into pc64 mode (-pc64). While that would fix this particular example, it does not fix the general problem of floating-point accuracy. It will just create new problems.

If you are confused as to why all of this could be possible without any of it being a bug, welcome to the club. My conclusion is that you should probably never compile C/C++ using GNU GCC for a generic x86 target. It is broken.

You can examine the assembly on godbolt.

Note: Please run my tests in the specified docker images so that you get the exact same configuration as I do.

Science and Technology links (June 20th 2020)

  1. UCLA researchers have achieved widespread rejuvenation in old mice through blood plasma dilution, a relatively simple process.

    (…) these results establish broad tissues rejuvenation by a single replacement of old blood plasma with physiologic fluid: muscle repair was improved, fibrosis was attenuated, and inhibition of myogenic proliferation was switched to enhancement; liver adiposity and fibrosis were reduced; and hippocampal neurogenesis was increased.(…) These findings are most consistent with the conclusion that the age-altered systemic milieu inhibits the health and repair of multiple tissues in the old mice, and also exerts a dominant progeric effect on the young partners in parabiosis or blood exchange.

    They plan to conduct clinical trials in human beings “soon”.

  2. It used to be that universities would happily pay large sums to private publishers like Elsevier for access to the research articles. The prestigious MIT joins the ranks of the universities who are challenging Elsevier.
  3. Medical doctors follow clinical practice guidelines. In turn, the producers of these guidelines are often funded by the industry, and they fail to disclose it.
  4. The upcoming Sony PlayStation 5 will have a disk with a bandwidth of over 5 GB/s. For comparison, good Apple laptops typically achieve only about 2 GB/s, and older conventional disks are 10 to 20 times slower.

    Our already-low tolerance for slow and unresponsive applications and web sites will fall. Loading screens, loading bars, and similar “make the user wait” strategies will become more and more annoying. We will come to expect application updates to occur in the blink of an eye.

    Programmers used to blame disk and network performance, but these excuses will not hold in the near future. More and more, poor performance will be due to poor software engineering. I gave a talk recently on the topic: data engineering at the speed of your disk (slides).

Update: Someone objected that disks with 6 Gb/s bandwidth are already commonplace and have been inexpensive for many years. That is true, but 6 Gb/s is only 0.75 GB/s: roughly seven times slower than 5 GB/s, and more in practice once you account for encoding overhead. Notice that ‘b’ stands for bit whereas ‘B’ stands for byte.

Computational overhead due to Docker under macOS

For my programming work, I tend to assume that I have a Linux environment. That is true whether I am under Windows, under macOS or under a genuine Linux.

How do I emulate Linux wherever I go? I use docker containers. Effectively, the docker container gives me a small subsystem where everything is “as if” I was under Linux.

Containers are a trade-off. They offer a nice sandbox where your code can run, isolated from the rest of your system. However, they also have lower performance: disk and network access is slower. I expect that it is true wherever you run your containers.

However, part of your computing workload might be entirely computational. If you are training a model or filtering data, you may not be allocating memory, writing to disk or accessing the network. In such cases, my tests suggest that you have pretty much the same performance whether you are running your tasks inside a container, or outside of the container… as long as your host is Linux.

When running docker under Windows or macOS, docker must rely on a virtual machine. Under Windows, it may use VirtualBox or other solutions, depending on your configuration, whereas it appears to use HyperKit under macOS. These virtual machines are highly efficient, but they still carry an overhead.

Let me benchmark a simple Go program that just repeatedly computes random numbers and compares them with the value 0. It prints out the result at the end.

package main

import (
        "fmt"
        "math/rand"
)

func main() {
        counter := 0
        for n := 0; n < 1000000000; n++ {
                if rand.Int63() == 0 {
                        counter += 1
                }
        }
        fmt.Printf("number of zeros: %d \n", counter)
}

It is deliberately simple. I am going to use Go 1.14 throughout.

Under macOS, I get that my program takes 11.7 s to run.

$ go build -o myprogram
$ time ./myprogram
number of zeros: 0

real	0m11.911s
user	0m11.660s
sys	0m0.034s

I am ignoring the “sys” time since I only want the computational time (“user”).

Let me run the same program after starting a docker container (from an ubuntu 20.10 image):

$ go build -o mydockerprogram
$ time ./mydockerprogram
number of zeros: 0

real	0m12.038s
user	0m12.026s
sys	0m0.025s

So my program now takes 12 s, about 3% longer. Observe that my methodology is not fool-proof: I do not know that this 3% slowdown is entirely due to docker. However, it gives an upper bound on the overhead.

Let me do something fun. I am going to start the container and run my program in the container, and then shut it down.

$ time run 'bash -c "time ./mydockerprogram"'
number of zeros: 0

real	0m12.012s
user	0m12.003s
sys	0m0.008s

real	0m12.545s
user	0m0.086s
sys	0m0.041s

It now takes 0.5 s longer. That is the time it takes for me to start a container, do nothing, and then shut it down. Doing it in this manner takes 8% longer than running the program natively under macOS.

Of course, if you run many small jobs, the 0.5 s is going to hurt you. It may come to dominate the running time.

If you want to squeeze every ounce of computational performance out your machine, it is likely that you should avoid the docker overhead under macOS. A 3% overhead may prove to be unacceptable. However, for developing and benchmarking your code, it may well be an acceptable trade-off.

Reusing a thread in C++ for better performance

In a previous post, I measured the time necessary to start a thread, execute a small job and return.

auto mythread = std::thread([] { counter++; });
mythread.join();

The answer is thousands of nanoseconds. Importantly, that is the time as measured by the main thread. That is, sending the query, and getting back the result, takes thousands of nanoseconds and thousands of cycles. The work in my case is just incrementing a counter: any task more involved will increase the overall cost. The C++ standard API also provides an async function to call one function and return: it is practically equivalent to starting a new thread and joining it, as I just did.

Creating a new thread each time is fine if you have a large task that needs to run for milliseconds. However, if you have tiny tasks, it won’t do.

What else could you do? Instead of creating a thread each time, you could create a single thread. This thread loops and periodically sleeps, waiting to be notified that there is work to be done. I am using the C++11 standard approach.

  std::thread thread = std::thread([this] {
    while (!exiting) {
      std::unique_lock<std::mutex> lock(locking_mutex);
      cond_var.wait(lock, [this]{return has_work||exiting;});
      if (exiting) {
        break;
      }
      counter++;
      has_work = false;
      lock.unlock();
      cond_var.notify_all();
    }
  });

It should be faster and overall more efficient. You should expect gains ranging from 2x to 5x. If you use a C++ library with thread pools and/or workers, it is likely to adopt such an approach, albeit with more functionality and generality. However, the operating system is in charge of waking up the thread and may not do so immediately, so it is not likely to be the fastest approach.

What else could you do? You could avoid system dependencies as much as possible and just loop on an atomic variable. The downside of the tight-loop (spinlock) approach is that your thread might fully use the processor while it waits. However, you should expect it to get to work much more quickly.

  std::thread thread = std::thread([this] {
    thread_started.store(true);
    while (true) {
      while (!has_work.load()) {
        if (exiting.load()) {
          return;
        }
      }
      counter++;
      has_work.store(false);
    }
  });

The results will depend crucially on your processor and on your operating system. Let me report the rough numbers I get with an Intel-based Linux box and GNU GCC 8.

new thread each time 9,000 ns
async call 9,000 ns
worker with mutexes 5,000 ns
worker with spinlock 100 ns

My source code is available.