Performance tip: constructing many non-trivial objects is slow

I started programming professionally when Java came out and right about when C++ was the “hot new thing”. Following the then-current fashion, I looked down at C and Pascal programming.

I fully embraced object-oriented programming. Everything had to be represented as an object. It was the right thing to do. Like many young (and not so young) programmers, I suffered from cargo cult programming. There was a new fashion and its advocates were convincing. I became a staunch advocate myself.

When you program in C, allocating and de-allocating dynamic resources is hard work. Thus you tend to give much consideration to how and when you are going to allocate resources. More “modern” languages like Java and C++ have freed us from such considerations. Largely, the trend has continued with a few isolated exceptions.

Over time, I came to recognize one vicious performance anti-pattern that comes with object-oriented programming: the widespread use of small but non-trivial objects. Objects that do a non-trivial amount of work at the beginning or end of their life are “non-trivial”. For example, arrays and strings with sizes determined at runtime are non-trivial.

The net result is that in many instances, with a few lines of code, you can generate thousands or millions of tiny non-trivial objects. For example, in JavaScript, the JSON.parse function takes an input JSON string (a single string), and maps it into lots and lots of tiny objects. Another example is the programmer who loads a file, line by line, storing each line into a new string instance, and then splits the line into dynamically allocated substrings. These programming strategies are fine as long as they are not used in performance sensitive code.

Let us run a little experiment using a fast programming language (C++). Starting from a long random string, I am going to randomly generate 100,000 small substrings, each having a random length of less than 16 bytes, using non-trivial objects (std::string). For comparison, I am going to create 100,000 identical small strings using trivial objects (C++17 std::string_view). A std::string_view is basically just a pointer into a memory region managed elsewhere.

How quickly can I copy these strings? I published the source code of a benchmark.
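
Here is a simplified sketch of what such a benchmark might look like. It is not the published benchmark: the constants and the timing approach are my own, and an optimizing compiler may skew the copy timings.

#include <chrono>
#include <cstdio>
#include <random>
#include <string>
#include <string_view>
#include <vector>

int main() {
  std::mt19937_64 rng(1234);
  std::string source(1 << 20, ' ');
  for (char &c : source) { c = (char)('a' + (rng() % 26)); }
  const size_t count = 100000;
  std::vector<std::string> heavy;       // non-trivial objects
  std::vector<std::string_view> light;  // trivial objects
  size_t volume = 0;
  for (size_t i = 0; i < count; i++) {
    size_t len = rng() % 16;
    size_t pos = rng() % (source.size() - 16);
    heavy.emplace_back(source.data() + pos, len); // allocates and copies bytes
    light.emplace_back(source.data() + pos, len); // stores a pointer and a length
    volume += len;
  }
  auto seconds = [](auto &&f) {
    auto start = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  };
  double t1 = seconds([&] { std::vector<std::string> copy(heavy); (void)copy; });
  double t2 = seconds([&] { std::vector<std::string_view> copy(light); (void)copy; });
  printf("std::string      %.2f GB/s\n", volume / t1 / 1e9);
  printf("std::string_view %.2f GB/s\n", volume / t2 / 1e9);
}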

I am expressing the speed in volume per second (GB/s) where the volume is given by the sum of all of the small strings. Your results will vary, but here is what I get under GNU GCC 9 (Linux) with an AMD Rome processor:

non-trivial objects (std::string) 0.8 GB/s
trivial objects (std::string_view) 15 GB/s

The std::string implementation likely benefits from a small-string optimization. If you repeat the same experiment while storing your data in std::vector instances, you may get catastrophic performance.

Though 0.8 GB/s may sound fast, keep in mind that it is the speed at which we do something trivial (copying the strings). If even the trivial parts of your software are slow, there is no hope that your whole system is going to be fast.

Furthermore, do not expect higher-level languages like Python or JavaScript to do better. The C++ performance is likely a sensible upper bound on how well the approach works.

Recommendation: In performance-sensitive code, you should avoid creating or copying thousands or millions of non-trivial objects.

Science and Technology links (August 1st 2020)

  1. In Japan, a large dam is being constructed almost entirely by robots.
  2. Naked mole rats are mammals that do not age in the sense that their fitness and mortality follow a flat curve, and not a Gompertz curve. Senescent cells are large dysfunctional cells that we tend to accumulate as we age. Researchers report that naked mole rats have few senescent cells. It may be because their senescent cells are less resilient than those of mice.
  3. Mice consuming a little bit of alcohol regularly live longer and are less likely to become obese.
  4. Cable TV is dying and the COVID-19 pandemic does not appear to be helping. AT&T lost 900,000 cable subscribers in the first quarter of 2020 while Comcast lost 477,000 cable subscribers in the second quarter.
  5. Yeast cells age in one of two distinct, mutually exclusive ways.
  6. Smallpox was a killer:

    In America, smallpox is usually associated with the decimation of Native Americans, but Europeans were not immune to the disease. As late as the 18th century, for example, smallpox killed about 400,000 Europeans annually. The overall mortality rate was 20% to 60%. Among infants, it was more than 80% and was one of the reasons for the low overall life expectancy of 20 to 30 years.

Science and Technology links (July 25th 2020)

  1. I was taught that human beings only arrived in the Americas recently (15,000 years ago). It turns out that this is wrong. There were human beings in America 30,000 years ago. (Source: Nature)
  2. Quantum tunneling is not instantaneous, contrary to what you may have been told.
  3. The U.S. Air Force is planning to introduce a new automated combat aircraft. They call it Skyborg.
  4. The inexpensive supplement glucosamine seems to be associated with lower risks of death.
  5. The Roman empire thrived under warm conditions:

    This record comparison consistently shows the Roman as the warmest period of the last 2000 years, about 2 °C warmer than average values for the late centuries for the Sicily and Western Mediterranean regions. After the Roman Period a general cooling trend developed in the region with several minor oscillations. We hypothesize the potential link between this Roman Climatic Optimum and the expansion and subsequent decline of the Roman Empire.

    (Source: Nature)

    It is estimated that current global warming trends could lead to a global rise in temperature by more than 2 °C though less than 3 °C.

Avoid character-by-character processing when performance matters

When processing strings, it is tempting to view them as arrays of characters (or bytes) and to process them as such.

Suppose that you would like to determine whether a string is ASCII. In ASCII, every character must be a byte value smaller than 128. A fine C++17 approach to check that a string is ASCII might look as follows.

bool is_ascii_branchy(const std::string_view v) {
   for (size_t i = 0; i < v.size(); i++) {
     if (uint8_t(v[i]) >= 128) {
       return false;
     }
   }
   return true;
}

It is important to consider the logic of this code. What you are telling the compiler is to access the characters in sequence, check whether each one is an ASCII character, and bail out as soon as one is not. Thus if the string starts with a non-ASCII character, only the first character should be read.

It might be high performance code if you are expecting the strings to mostly start with non-ASCII characters. But if you are expecting the string to be almost always ASCII, then this code is not going to be optimal.

You might complain that the compiler should be able to optimize it for you, and it will, but only within the constraints of the code you provided. Compilers are typically not in the business of redesigning your algorithms.

If you are expecting ASCII inputs, then you should just run through the string  using as few steps as possible. The following code relies on the fact that our processors can process 64-bit blocks using single instructions:

bool is_ascii_branchless(const std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  for(; i + 8 <= v.size(); i+=8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  for(; i < v.size(); i++) {
      running |= v[i];
  }
  return (running & 0x8080808080808080) == 0;  
}

It is an optimistic function: if you encounter a non-ASCII character early on, you will end up doing a lot of needless work if the string is long.

You could try a hybrid between the two. You read 8 characters, check whether they are ASCII, and bail out if they aren’t.

bool is_ascii_hybrid(const std::string_view v) {

  size_t i = 0;
  for(; i + 8 <= v.size(); i+=8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    if((payload & 0x8080808080808080) != 0) return false;
  }
  for(; i < v.size(); i++) {
      if((v[i] & 0x80) != 0) return false;
  }
  return true;
}
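
Not the benchmark itself, but a quick sanity check: a small driver along these lines (assuming the three functions above are defined in the same file) can confirm that the variants agree.

#include <cstdint>
#include <cstdio>
#include <cstring>      // headers also needed by the functions above
#include <string_view>

int main() {
  const char *tests[] = {"hello world", "caf\xc3\xa9 au lait", "0123456789abcdef", ""};
  for (const char *s : tests) {
    std::string_view v(s);
    printf("%-20s branchy=%d branchless=%d hybrid=%d\n", s,
           is_ascii_branchy(v), is_ascii_branchless(v), is_ascii_hybrid(v));
  }
}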

How do these functions compare? I wrote a quick benchmark with short ASCII strings (fewer than 128 characters). I get that the character-by-character approach runs at about half the speed. Your results will vary; feel free to run my benchmark on your machine with your compiler.

character-by-character 2.0 GB/s
optimistic 3.5 GB/s
hybrid 3.4 GB/s

With some work, you can probably go much faster, but be mindful of the fact that I deliberately chose a benchmark with small, fragmented strings.

Downloading files faster by tweaking headers

I was given a puzzle recently. Someone was parsing JSON files downloaded over the network from a bioinformatics URI. One JSON library was twice as fast as the other one.

Unless you are on a private high-speed network, the time required to parse a file will always be small compared to the time required to download a file. Maybe people at Google have secret high-speed options, but most of us have to make do with speeds below 1 GB/s.

So how could it be?

One explanation might have to do with how the client (such as curl) and the web server negotiate the transmission format. Even if the actual data is JSON, what is transmitted is often in compressed form. Thankfully, you can tell your client to request a specific encoding. In my particular case, out of all the encodings I tried, gzip was much faster. The reason seems clear enough: when I requested gzip, I got 82 KB back, instead of 766 KB.

curl -H 'Accept-Encoding: gzip' $URL 0.5 s 82 KB
curl -H 'Accept-Encoding: deflate' $URL 1.0 s 766 KB
curl -H 'Accept-Encoding: br' $URL 1.0 s 766 KB
curl -H 'Accept-Encoding: identity' $URL 1.0 s 766 KB
curl -H 'Accept-Encoding: compress' $URL 1.0 s 766 KB
curl -H 'Accept-Encoding: *' $URL 1.0 s 766 KB

Sure enough, if you look at the downloaded file, it is 766 KB, but if you gzip it, you get back 82 KB.

What I find interesting is that my favorite tools (wget and curl) do not request gzip by default. At least in this instance, it would be much faster. The curl tool takes the --compressed flag to make life easier.
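
For instance, the following asks the server for whatever compressed encodings curl supports and decompresses the result transparently (the URL stands for whatever resource you are fetching):

curl --compressed -o data.json $URL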

Of course, the point is moot if the data is already in compressed form on the server.

The cost of runtime dispatch

For high-performance software, you sometimes need different functions depending on what the hardware supports. You might write several versions of a function: some for advanced processors, others for legacy processors.

When you compile the code, the compiler does not yet know which code path will be taken. At runtime, when you start the program, the right function is chosen. This process is called runtime dispatch. Standard libraries will apply runtime dispatch without you having to do any work. However, if you write your own fancy code, you may need to apply runtime dispatching.

On Intel and AMD systems, you can do so by querying the processor and then selecting, based on its answer, among the various functions you have compiled. Under Visual Studio, you can use the __cpuid function while GNU GCC has __get_cpuid.

How expensive is this step?

Of course, the answer depends on your exact system but can we get some idea? I wrote a small C++ benchmark and my estimate is between 100 ns and 150 ns. So it is several hundreds of cycles.

Though this query is inexpensive, if you are repeatedly calling an inexpensive function, it may not be viable to pay the price each time. So you can simply do the query once, and then point at the right function for all follow-up calls.

Your only mild concern should be concurrency: what if several threads call the same function for the first time at once? In a language like C++, it is unsafe to have several threads modify the same non-atomic variable. Thankfully, it is a simple matter of requiring that the changes and queries be done atomically. On Intel and AMD processors, atomic accesses are often effectively free.
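
As an illustration, here is a minimal sketch of such a dispatch in C++. The processor query is a placeholder standing in for __get_cpuid or __cpuid; the atomic function pointer starts out pointing at a dispatcher that does the query once and then replaces itself.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Two stand-in implementations of the same function: in real code, one would
// use advanced instructions and the other would be a fallback.
static uint64_t sum_fallback(const uint64_t *data, size_t n) {
  uint64_t total = 0;
  for (size_t i = 0; i < n; i++) { total += data[i]; }
  return total;
}
static uint64_t sum_fancy(const uint64_t *data, size_t n) {
  return sum_fallback(data, n); // placeholder for an AVX2/NEON version
}
// Placeholder for a real query built on __get_cpuid (GCC) or __cpuid (Visual Studio).
static bool processor_supports_fancy() { return true; }

// The dispatcher runs the query once, then atomically replaces the pointer.
static uint64_t sum_dispatch(const uint64_t *data, size_t n);
static std::atomic<uint64_t (*)(const uint64_t *, size_t)> sum_impl{sum_dispatch};

static uint64_t sum_dispatch(const uint64_t *data, size_t n) {
  auto chosen = processor_supports_fancy() ? sum_fancy : sum_fallback;
  sum_impl.store(chosen, std::memory_order_relaxed); // every thread picks the same value
  return chosen(data, n);
}

uint64_t sum(const uint64_t *data, size_t n) {
  return sum_impl.load(std::memory_order_relaxed)(data, n);
}

int main() {
  uint64_t data[] = {1, 2, 3, 4};
  printf("%llu\n", (unsigned long long)sum(data, 4));
}

Relaxed atomics are enough here because every thread that runs the query arrives at the same answer, so it does not matter which thread wins the race to store the pointer.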

Science is the belief in the ignorance of experts

“Science is the belief in the ignorance of experts,” said Richard Feynman. Feynman had a Nobel prize in physics. He was a remarkable educator: his lecture notes are still popular. He foresaw nanotechnology and quantum computing. He is credited with identifying the cause of the Space Shuttle Challenger disaster. There is a beautiful talk by his daughter and there are many interviews with his sister, who became a physicist herself. I read Feynman as a young adult and it has shaped my worldview ever since.

Let me put Feynman’s quote in context:

Science alone of all the subjects contains within itself the lesson of the danger of belief in the infallibility of the greatest teachers in the preceding generation (…) When someone says, “Science teaches such and such,” he is using the word incorrectly. Science doesn’t teach anything; experience teaches it. If they say to you, “Science has shown such and such,” you might ask, “How does science show it? How did the scientists find out? How? What? Where?” It should not be “science has shown” but “this experiment, this effect, has shown.” And you have as much right as anyone else, upon hearing about the experiments–but be patient and listen to all the evidence–to judge whether a sensible conclusion has been arrived at. (…) The experts who are leading you may be wrong. (…) I think we live in an unscientific age in which almost all the buffeting of communications and television-words, books, and so on-are unscientific. As a result, there is a considerable amount of intellectual tyranny in the name of science. (…) Science alone of all the subjects contains within itself the lesson of the danger of belief in the infallibility of the greatest teachers of the preceding generation.

Many people, including people with a PhD and a career in research, miss Feynman’s point. Let me decompose it.

  1. How do we know anything? The most common, age-old, approach is to acquire knowledge from a figure of authority or an expert. Your teacher, your mother, your boss. But these experts are routinely wrong, often in unexpected ways:
    • Back in 1955, all textbooks told you that human beings have 24 pairs of chromosomes, even though we have 23 pairs.
    • Before the 1960s, plate tectonics was often ridiculed.
    • The Nobel prize recipient Paul Krugman predicted that the Internet would have a small effect (“The growth of the Internet will slow drastically (…) By 2005, it will become clear that the Internet’s impact on the economy has been no greater than the fax machine’s.”)
    • Heinrich Hertz claimed that radio waves were of no use whatsoever.
  2. Since the beginning of time, we have also given this process of knowledge-from-expertise physical forms. In modern day civilization, we have the peer-reviewed article, the professor at their university, the government scientist in a laboratory coat. In many ways, these people play the same social role as the tribe elder or the priest. The peer-reviewed article is like a sacred text.
  3. This pre-existing knowledge and its transmission is often called “science”. In such a context, anything can be a science: political science, social science, religious science, and so forth. But whether the knowledge is true or false, it may have little to do with science. It may even be that these institutions that pretend to be about science are unscientific. The fundamental defining characteristic of science, the one that Feynman explicitly identifies, is that we do not decide whether something is true or false based on authority but rather based on experience. If someone tells you that there are 24 pairs of chromosomes, you have a duty to ask “how do they know?”, “how would I find out?”.
  4. For science… It does not matter whether you are young or old. You can be rich or poor. You can be schooled or not. But you must listen, learn and be patient. In effect, you need to be a constructive skeptic. And you must question your own ideas with even more effort than you question other ideas.
    Paul Graham puts it well:

    To be a successful scientist it’s not enough just to be right. You have to be right when everyone else is wrong. Conventional-minded people can’t do that.

Interestingly, once you have appreciated science’s true nature as defined by Feynman, you see that it is generally applicable and not limited to Physics. In fact, Feynman clearly believed that the idea of science was applicable to education, for example.

Unfortunately, Feynman was correct. We live in an unscientific civilization. Science survives in niches.

Here are a few unscientific behaviours that are frustratingly common:

  1. Teach science as a large set of facts and techniques. I claim that hardly any science at all gets taught in high school. You tell kids that matter is made of atoms, which are made of a small nucleus surrounded by electrons at different energy levels. That was taught to one of my sons in high school. Then I asked “how do the scientists know this?” Blank stare. Science is not learning about the charge of the electron by reading a book. If that is all it were, it would be no different than pre-scientific knowledge. No. Science begins when you ask yourself how we can verify what is in the book. How should you teach science? You should essentially teach people to read or listen with constructive skepticism. If you are going to teach science, you must give it in a historical context: you must stress how our current knowledge is the accumulation of course corrections. You must stress how it is a work in progress.
  2. Receive science as a set of facts. We are sometimes given advice and told that it comes from “the science”, as if it settled anything. That is often no different than being told that the knowledge comes from “the high priest himself”. It is certainly a power play, but it is not science. When someone tells you that science is saying XYZ, you should always ask “How do people know that XYZ is true?” And as an answer, you cannot just be repeating their arguments: you must think for yourself. If you never come up with unexpected questions, you are not a scientist.
  3. “I defer to the experts” (said with pride). The defining saying of science should be that nobody knows anything. Deferring to the expert is the easy pre-scientific path, but doubting the expert is the way of science. Observe that there is a difference between listening to the experts (something Feynman advocated) and deferring to the experts. If your doctor tells you to take some pills, do listen! If you don’t understand, do be patient. Don’t throw out the expert advice. But don’t be ruled by it either!

Science may sound irrational when you spell it out as a doctrine of doubt. If you follow the path of science, you are going to be playing with ideas that are either objectively wrong, or socially wrong, at a much higher rate than if you just followed the experts. But, for genuine scientists, the goal is not to be right as often as possible. There is no contradiction between “being a good scientist” and “being wrong”. The goal is often to find what Peter Thiel might call “secret truths”. Everybody knew, at the beginning of the XXth century, that it would take decades or centuries before one could fly in an airplane. You could make such a statement with full confidence, even after the Wright brothers had made their first demonstration. And many people of authority were doing just that. To uncover secret truths, you have to train yourself to ask more questions, to carry more doubts.

Given that science is fundamentally subversive, how could it emerge and survive? I believe that scientific thinking is part of a broader ideology that gives its bearers an evolutionary edge. Simply put, societies that make room for science have an edge. They build and deploy better technology. They adapt more quickly to change.

Science and Technology links (July 11th 2020)

  1. Some upcoming Mercedes cars will have augmented reality head-up displays.
  2. Intel’s new standard for high-speed cables (thunderbolt) supports 3 GB/s bandwidth. (This is very fast: internal disks usually cannot sustain such speeds.)
  3. Enrichment activities have no cognitive benefits in kids.
  4. Aspirin may help prevent cancer metastasis. It also seems to help prevent cancer recurrence.
  5. Exercise is beneficial. It helps keep the brain young. Could you benefit from exercise without exercising at all? Plasma transfer might help:

    Exercise has a broad range of beneficial healthful effects. Horowitz et al. tested whether the beneficial effects of exercise on neurogenesis in the brain and improved cognition in aged mice could be transferred in plasma (blood without its cellular components) from one mouse to another (see the Perspective by Ansere and Freeman). Indeed, aged mice that received plasma from young or old mice that had exercised showed beneficial effects in their brains without hitting the treadmill. The authors identified glycosylphosphatidylinositol-specific phospholipase D1 as a factor in plasma that might, in part, mediate this favorable effect.

GNU GCC on x86 does not round floating-point divisions to the nearest value

I know that floating-point arithmetic is a bit crazy on modern computers. For example, floating-point addition is not associative:

0.1+(0.2+0.3) == 0.599999999999999978
(0.1+0.2)+0.3 == 0.600000000000000089

But, at least, this is fairly consistent in my experience. You should simply not assume fancy properties like associativity to work in the real world.

Today I stumbled on a fun puzzle. Consider the following:

double x = 50178230318.0;
double y = 100000000000.0;
double ratio = x/y;

If God did exist, the variable ratio would be 0.50178230318 and the story would end there. Unfortunately, there is no floating-point number that is exactly 0.50178230318. Instead it falls between the floating-point number 0.501782303179999944 and the floating-point number 0.501782303180000055.

It is important to be a bit more precise. The 64-bit floating-point standard represents numbers using a 53-bit mantissa multiplied by a power of two. So let us spell it out the way the computer sees it:

0.501782303179999944 == 4519653187245114  * 2 ** -53
0.501782303180000055 == 4519653187245115  * 2 ** -53

We have to pick the mantissa 4519653187245114 or the mantissa 4519653187245115. There is no way to represent exactly anything that falls in-between using 64-bit floating-point numbers. So where does 0.50178230318 fall exactly? We have approximately…

0.50178230318 = 4519653187245114.50011795456 * 2 ** -53

So the number is best approximated by the largest of the two values (0.501782303180000055).

Goldberg in What every computer scientist should know about floating-point arithmetic tells us that rounding must be to the nearest value:

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). (…) One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds.

Python gets it right:

>>> "%18.18f" % (50178230318.0/100000000000.0)
'0.501782303180000055'

JavaScript gets it right:

> 0.50178230318 == 0.501782303180000055
true
> 0.50178230318 == 0.501782303179999944
false

So the story would end there, right?

Let us spin up the GNU GCC 7 compiler for x86 systems and run the following C/C++ program:

#include <stdio.h>
int main() {
  double x = 50178230318.0;
  double y = 100000000000.0;
  double ratio = x/y;
  printf("x/y  = %18.18f\n", ratio);
}

Can you predict the result?

$ g++ -o round round.cpp
$ ./round
x/y  = 0.501782303179999944

So GNU GCC actually picks the smallest and furthest value, instead of the nearest value.

You may doubt me so I have created a docker-based test.

You might think that it is a bug that should be reported, right?

There are dozens if not hundreds of similar reports to the GNU GCC team. They are being flagged as invalid.

Let me recap: the GNU GCC compiler may round the result of a division between two floating-point numbers to a value that is not the nearest. And it is not considered a bug.

The explanation is that the compiler first rounds to nearest using 80 bits and then rounds again (this is called double rounding). This is what fancy numerical folks call FLT_EVAL_METHOD = 2.

However, the value of FLT_EVAL_METHOD remains at 2 even if you add optimization flags such as -O2, and yet the result will change. The explanation is that the optimizer figures out the solution at compile-time and does so ignoring the FLT_EVAL_METHOD value. Why it is allowed to do so is beyond me.
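
If you want to see both the reported mode and a result that is actually computed at runtime, a small test along these lines may help; the volatile qualifier is just one way to discourage the compiler from doing the division at compile time.

#include <float.h>
#include <stdio.h>

int main() {
  printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
  volatile double x = 50178230318.0; // volatile: force the division to happen at runtime
  double y = 100000000000.0;
  double ratio = x / y;
  printf("x/y = %18.18f\n", ratio);
}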

Maybe you think it does not matter. After all, the value is going to be close, right? However, if you are an experienced programmer, you know the value of having deterministic code. You run the code and you always get the same results. If the results depend, some of the time, on your exact compiler flags, it makes your life much more difficult.

You can also try to pass GNU GCC the flags -msse -mfpmath=sse, as experts recommend, but as my script demonstrates, it does not solve the issue (and then you get FLT_EVAL_METHOD = -1). You need to also add an appropriate target (e.g., -msse -mfpmath=sse -march=pentium4). In effect, when using GNU GCC, you cannot get away from specifying a target. The flags -msse -mfpmath=sse alone will silently fail to help you.

Some people have recommended using other flags to switch the compiler into pc64 mode (-mpc64). While it would fix this particular example, it does not fix the general problem of floating-point accuracy. It will just create new problems.

If you are confused as to why all of this could be possible without any of it being a bug, welcome to the club. My conclusion is that you should probably never compile C/C++ using GNU GCC for a generic x86 target. It is broken.

You can examine the assembly on godbolt.

Note: Please run my tests in the specified docker images so that you get the exact same configuration as I do.

Science and Technology links (June 20th 2020)

  1. UCLA researchers have achieved widespread rejuvenation in old mice through blood plasma dilution, a relatively simple process.

    (…) these results establish broad tissues rejuvenation by a single replacement of old blood plasma with physiologic fluid: muscle repair was improved, fibrosis was attenuated, and inhibition of myogenic proliferation was switched to enhancement; liver adiposity and fibrosis were reduced; and hippocampal neurogenesis was increased.(…) These findings are most consistent with the conclusion that the age-altered systemic milieu inhibits the health and repair of multiple tissues in the old mice, and also exerts a dominant progeric effect on the young partners in parabiosis or blood exchange.

    They plan to conduct clinical trials in human beings “soon”.

  2. It used to be that universities would happily pay large sums to private publishers like Elsevier for access to the research articles. The prestigious MIT joins the ranks of the universities who are challenging Elsevier.
  3. Medical doctors follow clinical practice guidelines. In turn, the producers of these guidelines are often funded by the industry, and they fail to disclose it.
  4. The upcoming Sony PlayStation 5 will have a disk with a bandwidth of over 5 GB/s. For comparison, good Apple laptops typically achieve only about 2 GB/s, and older conventional disks are 10 to 20 times slower.

    Our already-low tolerance for slow and unresponsive applications and web sites will fall. Loading screens, loading bars, and similar “make the user wait” strategies will become more and more annoying. We will come to expect application updates to occur in the blink of an eye.

    Programmers used to blame disk and network performance, but these excuses will not hold in the near future. More and more, poor performance will be due to poor software engineering. I gave a talk recently on the topic: data engineering at the speed of your disk (slides).

Update: Someone objected that disks with 6 Gb/s bandwidth are already commonplace and have been inexpensive for many years. That is true, but 6 Gb/s is less than 1 GB/s, several times slower than 5 GB/s. Notice that ‘b’ stands for bit whereas ‘B’ stands for byte.

Computational overhead due to Docker under macOS

For my programming work, I tend to assume that I have a Linux environment. That is true whether I am under Windows, under macOS or under a genuine Linux.

How do I emulate Linux wherever I go? I use docker containers. Effectively, the docker container gives me a small subsystem where everything is “as if” I was under Linux.

Containers are a trade-off. They offer a nice sandbox where your code can run, isolated from the rest of your system. However they also have lower performance. Disk and network access is slower. I expect that it is true wherever you run your containers.

However, part of your computing workload might be entirely computational. If you are training a model or filtering data, you may not be allocating memory, writing to disk or accessing the network. In such cases, my tests suggest that you have pretty much the same performance whether you are running your tasks inside a container, or outside of the container… as long as your host is Linux.

When running docker under Windows or macOS, docker must rely on a virtual machine. Under Windows, it may use VirtualBox or other solutions, depending on your configuration, whereas it appears to use Hyperkit under macOS. These virtual machines are highly efficient, but they still carry an overhead.

Let me benchmark a simple Go program that just repeatedly computes random numbers and compares them with the value 0. It prints out the result at the end.

package main

import (
        "fmt"
        "math/rand"
)

func main() {
        counter := 0
        for n := 0; n < 1000000000; n++ {
                if rand.Int63() == 0 {
                        counter += 1
                }
        }
        fmt.Printf("number of zeros: %d \n", counter)
}

It is deliberately simple. I am going to use Go 1.14 throughout.

Under macOS, I get that my program takes 11.7 s to run.

$ go build -o myprogram
$ time ./myprogram
number of zeros: 0

real	0m11.911s
user	0m11.660s
sys	0m0.034s

I am ignoring the “sys” time since I only want the computational time (“user”).

Let me run the same program after starting a docker container (from an ubuntu 20.10 image):

$ go build -o mydockerprogram
$ time ./mydockerprogram
number of zeros: 0

real	0m12.038s
user	0m12.026s
sys	0m0.025s

So my program now takes 12 s, so 3% longer. Observe that my methodology is not fool-proof: I do not know that this 3% slowdown is due to the overhead incurred by docker. However, it bounds the overhead.

Let me do something fun. I am going to start the container and run my program in the container, and then shut it down.

$ time run 'bash -c "time ./mydockerprogram"'
number of zeros: 0

real	0m12.012s
user	0m12.003s
sys	0m0.008s

real	0m12.545s
user	0m0.086s
sys	0m0.041s

It now takes 0.5 s longer. That is the time it takes for me to start a container, do nothing, and then shut it down. Doing it in this manner takes 8% longer than running the program natively under macOS.

Of course, if you run many small jobs, the 0.5 s is going to hurt you. It may come to dominate the running time.

If you want to squeeze every ounce of computational performance out of your machine, it is likely that you should avoid the docker overhead under macOS. A 3% overhead may prove to be unacceptable. However, for developing and benchmarking your code, it may well be an acceptable trade-off.

Reusing a thread in C++ for better performance

In a previous post, I measured the time necessary to start a thread, execute a small job and return.

auto mythread = std::thread([] { counter++; });
mythread.join();

The answer is thousands of nanoseconds. Importantly, that is the time as measured by the main thread. That is, sending the query, and getting back the result, takes thousands of nanoseconds and thousands of cycles. The work in my case is just incrementing a counter: any task more involved will increase the overall cost. The C++ standard API also provides an async function to call one function and return: it is practically equivalent to starting a new thread and joining it, as I just did.

Creating a new thread each time is fine if you have a large task that needs to run for milliseconds. However, if you have tiny tasks, it won’t do.

What else could you do? Instead of creating a thread each time, you could create a single thread. This thread loops and periodically sleeps, waiting to be notified that there is work to be done. I am using the C++11 standard approach.

  std::thread thread = std::thread([this] {
    while (!exiting) {
      std::unique_lock<std::mutex> lock(locking_mutex);
      cond_var.wait(lock, [this]{return has_work||exiting;});
      if (exiting) {
        break;
      }
      counter++;
      has_work = false;
      lock.unlock();
      cond_var.notify_all();
    }
  });

It should be faster and overall more efficient. You should expect gains ranging from 2x to 5x. If you use a C++ library with thread pools and/or workers, it is likely to adopt such an approach, albeit with more functionality and generality. However, the operating system is in charge of waking up the thread and may not do so immediately so it is not likely to be the fastest approach.
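
For context, here is one way the surrounding class might be organized. It is a hedged sketch: the member names match the fragment above, but the rest (the work submission and shutdown logic) is my own glue, not the author's benchmark code.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

struct Worker {
  std::mutex locking_mutex;
  std::condition_variable cond_var;
  bool has_work = false;
  bool exiting = false;
  int counter = 0;
  std::thread thread;

  Worker() {
    thread = std::thread([this] {
      while (true) {
        std::unique_lock<std::mutex> lock(locking_mutex);
        cond_var.wait(lock, [this] { return has_work || exiting; });
        if (exiting) { break; }
        counter++;          // the "work"
        has_work = false;
        lock.unlock();
        cond_var.notify_all();
      }
    });
  }

  // submit one unit of work and wait for the worker to complete it
  void work() {
    std::unique_lock<std::mutex> lock(locking_mutex);
    has_work = true;
    cond_var.notify_all();
    cond_var.wait(lock, [this] { return !has_work; });
  }

  ~Worker() {
    {
      std::unique_lock<std::mutex> lock(locking_mutex);
      exiting = true;
      cond_var.notify_all();
    }
    thread.join();
  }
};

int main() {
  Worker w;
  for (int i = 0; i < 10; i++) { w.work(); }
  printf("counter = %d\n", w.counter);
}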

What else could you do? You could avoid system dependencies as much as possible and just loop on an atomic variable. The downside of this tight-loop (spinlock) approach is that your thread might fully use the processor while it waits. However, you should expect it to get to work much more quickly.

  std::thread thread = std::thread([this] {
    thread_started.store(true);
    while (true) {
      while (!has_work.load()) {
        if (exiting.load()) {
          return;
        }
      }
      counter++;
      has_work.store(false);
    }
  });

The results will depend crucially on your processor and on your operating system. Let me report the rough numbers I get with an Intel-based Linux box and GNU GCC 8.

new thread each time 9,000 ns
async call 9,000 ns
worker with mutexes 5,000 ns
worker with spinlock 100 ns

My source code is available.

Science and Technology links (June 6th 2020)

  1. A small number of people are responsible for a disproportionate number of inventions and innovations. Why are these people different? Using neuroimaging techniques, scientists find that superior individuals may not have distinct processes per se, but rather they use common processes differently, in a more economical fashion:

    (…) extraordinary creative ability is not the outcome of a unique set of neurocognitive processes; rather, it is associated with the same neural mechanisms that support ordinary creativity, but to a different degree (…). Indeed, our findings would support the argument that similar creative outcomes (…) come about with a less extensive recruitment of brain networks shown to contribute to creative thought (…), which we speculate may allow eminent creators to pursue concurrently, for example, multiple lines of creative thought.

    This suggests that most of us with a healthy brain could potentially become highly creative thinkers. It would seem important to determine whether that is true.

  2. The average female mammal lives about 20% longer than the corresponding male. In human beings, women have only an 8% advantage (men have significantly shorter lives). This difference is not attributed to the riskier behavior of males. Rather, a credible explanation is that males have a weaker immune system.
  3. Sea urchins can regenerate appendages throughout their lives. They come in different subspecies with long and short lives. The end of our chromosomes contains repeated sequences called telomeres. These telomeres get shorter with every cell division, unless they are replenished (e.g., via telomerase). It is often theorized that aging is characterized or explained by telomere shortening. However, in sea urchins, the telomeres do not get shorter with age because of constant telomerase activity. And yet, there are short-lived (i.e. aging) sea urchins.
  4. Scientists have created hair-bearing human skin from stem cells. Though the authors do not stress this possibility, it seems obvious that it could lead to therapies for treating hair loss in human beings:

    Moreover, we show that skin organoids form planar hair-bearing skin when grafted onto nude mice. Together, our results demonstrate that nearly complete skin can self-assemble in vitro and be used to reconstitute skin in vivo.

    This was not tested in human beings. Yet some doctors are optimistic:

    This achievement places us closer to generating a limitless supply of hair follicles that can be transplanted to the scalps of people who have thinning or no hair.

  5. The sun creates skin damage over time and contributes to a particular form of skin aging that is quite visible in some older people. The damage goes deep in the skin and is therefore challenging. Scientists have used stem cells to attempt to reverse the damage. In at least some (human) patients, the results can be characterized as an extensive reversal of the damage deep in the skin.
  6. Our cells need a compound called NAD+ to produce energy. As we age, we tend to have less and less NAD+. A particular disease called mitochondrial myopathy leads to NAD+ deficiency. Scientists found that niacin (an inexpensive supplement) was an efficient NAD+ booster in these patients. (It does not mean that you should take niacin.)

How Innovation Works (book review)


I read How Innovation Works by Matt Ridley in a few hours. It is a delicious book.

Ridley distinguishes invention from innovation. The inventor creates something new, the innovator applies the novelty to change the world. Jeff Bezos innovated with his Amazon site, but he may not be much of an inventor. Scientists often invent new “things” but they often fail to innovate.

Innovation is often quite positive and Ridley attributes to it much of the large gains in wealth and well-being that we have known. So what makes innovation possible, or what causes innovation?

Ridley does not know exactly. However, he identifies some conditions that favour innovation:

  • You need some freedom, the freedom to try new enterprises, the freedom to apply new ideas.
  • You need some wealth. Necessity does not reliably drive innovation according to Ridley.
  • You need the ability to make mistakes and learn quickly from them. Innovation does not happen overnight from the brain of a genius. It is a deeply iterative process.

Thus we get little innovation in the nuclear industry because it is difficult to get approval for a new nuclear power plant. And it is difficult to experiment. However, we get a lot of innovation on the Web where anyone can try to offer a new service and where it is easy to iterate quickly.

What is the role of government in innovation? It is often believed that government drives innovation. However, it does not match the historical record. Until sometime in the XXth century, governments did not have any idea that they could promote innovation. Yet innovation did occur. Furthermore, governments decided to adopt a laissez-faire policy with respect to the Web, which enabled massive innovation.

When you look at how much difficulty the state has to even embrace innovation, you cannot believe that it is also the source of our innovation. Ridley, like myself, is pessimistic regarding government interventions like patent protections. We get more innovation when people are free to iterate and reuse ideas. Furthermore, focusing the innovators on patents takes time and money away from innovation.

I am also skeptical of the ability of research grants to spur innovation, even when said grants are tied to industry collaboration. I do think that governments can play a positive role, besides protecting free enterprise and free markets: governments can issue prizes and contracts. For example, NASA may not be able to innovate and get us in space cheaply. However, NASA can offer contracts to space companies. Governments could similarly encourage progress in medicine by giving prizes to the first innovators to reach certain milestones. For example, we could give 10 billion dollars to the first team to stop cognitive decline in Alzheimer’s patients, at a reasonable cost. I get the feeling that Ridley would agree.

Research grants tend to favour the incumbents. Research prizes are more likely to reward innovators.

It is true that the innovators are not rewarded for nearly all of the wealth that their innovations generate… but innovations are frequently a team sport. We may think that only the Wright brothers invented the aeroplane and made it work, leading to all the marvellous innovations… but many people had a hand in their work. They corresponded with enthusiasts who were experimenting with gliders. They had an employee design a very light and powerful engine. It is not clear that, by trying to ensure that innovators get a larger share of the benefits, we get more innovation; and this assumes that innovators are always fairly identified. Is the first person to patent an idea the sole innovator?

Where will innovation happen in the near future? The answer that seems obvious today, and the one that Ridley goes to, is China. China seems uniquely able to try new things as far as technology and industry go. Ridley is not pleased by this likely outcome given that China lacks democracy. Instead, Ridley believes that we could renew our habit of creating innovation elsewhere, if only we chose to.

The Go compiler needs to be smarter

One of my favorite languages is the Go language. I love its simplicity. It is popular and useful in a cloud setting. Many popular tools are written in Go, and for good reasons.

I gave a talk on Go last year and I was asked for a criticism of Go. I do not mind that Go lacks exceptions or generics. These features are often overrated.

However, as charming as Go might be, I find that its compiler is not on par with what I expect from other programming languages. It could be excused when Go was young and immature. But I think that the Go folks now need to tackle these issues.

My first beef with the compiler is that it is shy about inlining. Inlining is the process by which you bring  a function into another function, bypassing the need for a function call. Even when the function call is inexpensive, inlining often brings many benefits.

Go improved its inlining strategies over time, but I would still describe it as “overly shy”.

Let me consider the following example where I first define a function which sums the elements in an array, and then I call this function on an array I just defined:

func sum(x  []uint64) uint64 {
    var sum = uint64(0)
    for _, v := range x {
        sum += v
    }
    return  sum
}


func fun() uint64 {
    x := []uint64{10, 20, 30}
    return sum(x)
}

Whether you use Rust, Swift, C, C++… you expect a good optimizing compiler to basically inline the call to the ‘sum’ function and then to figure out that the answer can be determined at compile time and to optimize the ‘fun’ function to something trivial.

Not so in Go. It will construct the array and then call the ‘sum’ function.

In practice, it means that if you want good performance in Go, you often have to manually inline your functions. And I mean: fully inline. You have to write a really explicit function if you want the Go compiler to optimize the computation away, like so:

func fun3() uint64 { 
    x := [3]uint64{10001, 21, 31}
    return x[0] + x[1] + x[2] 
}

My second concern with the Go language is that it has no real concept of a runtime constant. That is, you have compile-time constants, but if you have a variable that is set once in the life of your program and never changes, Go will still treat it as if it could change. The compiler does not help you.

Let us take an example. Go has added nice functions that give you access to fast processor instructions. For example, most x64 processors have a popcnt instruction that gives you the number of 1-bits in a 64-bit word. It used to be that the only way to access this instruction in Go was by writing assembly. That has been resolved. So let us put this code into action:

import "math/bits"

func silly() int {
    return  bits.OnesCount64(1) + bits.OnesCount64(2)
}
    

This function should return 2 since both values provided (1 and 2) have exactly one bit set. I bet that most C/C++ compilers can figure that one out. But we may excuse Go for not getting there.

Go needs to check, before using the popcnt instruction, that the processor supports it. When you start Go, it queries the processor and fills a variable with this knowledge. This could be done at compile-time but then your binary would crash or worse when run on a processor that does not support popcnt.

In a language with just-in-time compilation like Java or C#, the processor is detected at compile-time so no check is needed. In less fanciful languages like C or C++, the programmer needs to check what the processor supports themselves.

I can excuse Go for checking that popcnt is supported each and every time that the ‘silly’ function is called. But that is not what Go does. Go checks it twice:

        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc115
        movl    $1, AX
        popcntq AX, AX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc85
        movl    $2, CX
        popcntq CX, CX

That is because the compiler does not trust, or cannot determine, that the variable ‘runtime.x86HasPOPCNT’ is a runtime constant.

Some people will object that such checks are inexpensive. I think that this view should be challenged:

  • As is apparent in the assembly code I provide, you might be doubling or at least increasing by 50% the number of instructions required. A comparison and a jump is cheap, but so is popcnt (some processors can retire two popcnt per cycle!). Increasing the number of instructions makes code slower.
  • It is true that the branch/jump is likely to be correctly predicted by the processor. This makes the guarding code much cheaper than a branch that could sometimes be mispredicted. But that does not mean that you are not getting hurt:

    Even when it is impossible to remove all branches, reducing the number of branches “almost always taken” or “almost never taken” may help the processor better predict the remaining branches.  (…) A possible simplified explanation for this phenomenon is that processors use the history of recent branches to predict future branches. Uninformative branches may reduce the ability of the processors to make good predictions.

Go’s saving grace is that it makes it easy to integrate assembly code into your code base. So you can write your performance-critical code in C, compile it, and use the result in your Go project. That is how we do it in roaring, for example. People have ported the really fast Stream VByte encoding and the very fast simdjson parser to Go, again by using assembly. It works.

However, it leaves the bulk of the Go software running at a fraction of the performance it could reach with a great optimizing compiler.

Appendix: Compiling Go with gccgo solves these particular problems. However, reportedly, the overall performance of gccgo is worse.

Science and Technology links (May 30th 2020)

  1. We might soon be able to buy memory cards with speeds nearing 4 GB/s. For comparison, an expensive and recent MacBook currently has a disk with a 2 GB/s bandwidth. The PlayStation 5 should have a 5 GB/s bandwidth.
  2. Human industry has boosted the amount of CO2 in the atmosphere. This has two predictable outcomes: slightly higher global temperature (CO2 has a mild greenhouse effect) and higher plant productivity (CO2 acts as a fertilizer). The CO2 fertilization effect is strong: a 30% gain from 1900 in photosynthesis efficiency. Moreover, higher plant productivity translates into more CO2 capture and thus it tends to reduce the quantity of CO2. Haverd et al. report that we may have underestimated the carbon sink effect of CO2 fertilization.
  3. The successful e-commerce firm, Shopify, will allow most of its employees to work remotely in the future.
  4. Human beings may have hunted mammoths by chasing them into predetermined traps.
  5. There is a theory that sending your kids to a more selective school helps them because being exposed to high-achieving peers raises their level. But it seems that this peer effect is a myth. In other words, paying a lot of money to send your kids to an exclusive school is probably a waste. (It does not imply that sending your kids to a dysfunctional school is harmless.)
  6. We should apply with care the principle that extraordinary claims require extraordinary evidence. Indeed, this principle can be used to reject results that violate human consensus and slow the progress of science. Indeed, scientific progress is often characterized by a change in the consensus as we go from one realization to another.
  7. We can prevent age-related bone losses in mice by tweaking the content of their blood plasma.
  8. Many recent advances in artificial intelligence do not hold up to scrutiny. This is why you will often hear me dismiss novelty with the phrase “originality is overrated”.

    Kolter says researchers are more motivated to produce a new algorithm and tweak it until it’s state-of-the-art than to tune an existing one. The latter can appear less novel, he notes, making it “much harder to get a paper from.”

    The net result is that researchers tend to overrate novelty and originality. In practice, you often get better results by selecting time-tested approaches and ignoring the hype.

    So, how should you read research, knowing that much of it won’t stand up to scrutiny?

    1. Do not dismiss older research merely because it is older. Do the opposite: focus your energies on older work still in use.
    2. Instead of picking up papers one by one, try to find the underlying themes. In effect, dismiss each individual paper, and instead focus on the recurring themes and effects. If an idea only appears in one paper, it probably can be discarded. If it appears again and again and proves useful, it might be worth knowing.

Mapping an interval of integers to the whole 64-bit range, fairly?

In my blog post A fast alternative to the modulo reduction, I described how one might map 64-bit values to an interval of integers (say from 0 to N) with minimal bias and without using an expensive division. All one needs to do is to compute x * N ÷ 2^64 where ‘÷’ is the integer division. A division by a power of two is just a shift. Such an approach is simple and efficient.
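
For reference, that forward map is often written as follows in C/C++, using the 128-bit integer extension available in GCC and Clang:

#include <stdint.h>

// maps a 64-bit value x down to the interval [0, N) without a division
uint64_t reduce(uint64_t x, uint64_t N) {
  return (uint64_t)(((__uint128_t)x * N) >> 64);
}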

Let us consider the opposite problem.

Suppose that I give you integers in the range from 0 (inclusively) to some value N (exclusively). We want to map these values to integers spanning the whole 64-bit range. Obviously, since we only have N distinct values to start with, we cannot expect our function to cover all possible 64-bit values. Nevertheless, we would like to be “as fair as possible”. Let us use only integers.

Let 2^64 ÷ N be the integer division of 2^64 by N. Let 2^64 % N be the remainder of the integer division. Because N is a constant, these are constants as well.

As a first attempt, I might try to map integer x using the function (2^64 ÷ N) * x. It works beautifully when N is a power of two, but, in general, it is a tad crude. In particular, the result of this multiplication can never exceed 2^64 – (2^64 % N), whereas it starts at value 0 when x is 0, so it is biased toward smaller values whenever (2^64 % N) is non-zero (which is always true when N is not a power of two).

Let 2^64 % N be the remainder of the integer division. To compensate, we need to add a value that can reach up to 2^64 % N. Mapping integers in the interval from 0 to N to integers in the interval from 0 to 2^64 % N cannot be done with a simple multiplication. We do not want to use divisions in general because they are expensive. However, we can use a shift, that is, a division by a power of two. So let us look for a map of the form (2^64 ÷ N) * x + (u * x) ÷ 2^64 for some unknown value u. We know that x can never exceed N, but that when x reaches N, the value of (u * x) ÷ 2^64 should be close to 2^64 % N. So we set (u * N) ÷ 2^64 to 2^64 % N and solve for u, which gives us that u must be at least (2^64 % N) * 2^64 ÷ N.

The values (2^64 ÷ N) and (2^64 % N) * 2^64 ÷ N need to be computed just once: let us call them m and n respectively. We finally get the function m * x + (n * x) ÷ 2^64. In Python, the code might look as follows…

# N (the size of the input interval) is assumed to be defined beforehand
def high(x, y):
    # high 64 bits of the 128-bit product x*y
    return (x * y >> 64) % 2**64

def inv(x):
    n = ((2**64) % N) * 2**64 // N
    m = 2**64 // N
    return m * (x + 1) + high(x, n)
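
For completeness, here is a hedged C/C++ sketch of the map described in the text (m * x + (n * x) ÷ 2^64), again relying on the __uint128_t extension. Note that the Python snippet above adds 1 to x in the first term, while this sketch follows the formula as stated in the prose.

#include <stdint.h>
#include <stdio.h>

// high 64 bits of a full 64x64-bit product
static uint64_t high64(uint64_t x, uint64_t y) {
  return (uint64_t)(((__uint128_t)x * y) >> 64);
}

// maps x in [0, N) to the whole 64-bit range: m * x + (n * x) / 2^64
static uint64_t spread(uint64_t x, uint64_t N) {
  __uint128_t two64 = (__uint128_t)1 << 64;
  uint64_t m = (uint64_t)(two64 / N);                  // 2^64 / N
  uint64_t r = (uint64_t)(two64 % N);                  // 2^64 % N
  uint64_t n = (uint64_t)(((__uint128_t)r << 64) / N); // (2^64 % N) * 2^64 / N
  return m * x + high64(n, x);
}

int main() {
  uint64_t N = 1000000;
  printf("%llu %llu %llu\n", (unsigned long long)spread(0, N),
         (unsigned long long)spread(N / 2, N), (unsigned long long)spread(N - 1, N));
}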

How good is this formula? To test it out, I can test how well it can invert the approach that goes in the other direction. That is, if I plug x * N ÷ 2^64, I would hope to get back x, or something very close, for suitable values of N. That is, I want m * (x * N ÷ 2^64) + (n * (x * N ÷ 2^64)) ÷ 2^64 to be almost x. In general, I cannot hope to find x exactly because some information was lost in the process. Yet I expect to have a lower bound on the error of ceil(2^64 / N). I benchmark my little routine using a probabilistic test and the results are promising. In the worst case, I can be orders of magnitude larger than the lower bound, but most of the time, my estimated error is lower than the lower bound. This suggests that though my approach is not quite suitable to get back x, with a little bit of extra mathematics and a few more instructions, I could probably make it work exactly within a strict error bound.

Programming inside a container

I have a small collection of servers, laptops and desktops. My servers were purchased and configured at different times. By design, they have different hardware and software configurations. I have processors from AMD, Intel, Ampere and Rockchip. I have a wide range of Linux distributions, both old and new. I also mostly manage everything myself, with some help from our lab technician for the initial setup.

The net result is that I sometimes end up with very interesting systems that are saddled with old Linux distributions. Reinstalling Linux and keeping it safe and secure is hard. Furthermore, even if I update my Linux distributions carefully, I may end up with a different Linux distribution than my collaborators, with a different compiler and so forth. Installing multiple different compilers on the same Linux distribution is time consuming.

So what can you do instead?

I could run virtual machines. With something like VirtualBox, you can run Linux inside Windows or inside macOS. It is beautiful. But it is also slow and computationally expensive.

You can switch to containers, and Docker specifically, which have much less overhead. Docker is a ubiquitous tool in cloud computing. As a gross characterization, Docker allows you to run Linux inside Linux. It is a sandbox, but a sandbox that runs almost directly on the host. Unlike virtual machines, my tests show that on computationally intensive tasks, Docker containers run at “native speed” (bare metal). There are reports that system interaction is slower. Network connections and disk access are slower. For my purposes, it is fine.

If you must, you can also run Docker under macOS and Windows, though there will then be more overhead, I expect.

The idea of a container approach is to always start from a pristine state. So you define the configuration that your database server needs to have, and you launch it, in this precise state each time. This makes your infrastructure predictable.

It is not as perfect as it sounds. You still critically depend on the quality of the container you start from. Various hacks can be necessary if you need two applications with different requirements to run together in the same image.

Still, containers work well enough that they are basically sustaining our civilization: many cloud-based applications are built on containers one way or another.

Containers were built to deploy software into production. Programming inside containers is not directly supported: you will not find much documentation about it and there is simply no business model around it. What do I mean by “programming inside containers”? I mean that I’d like to start a C programming project, decide that I will use the Linux Ubuntu 16.10 distribution and that I will compile and run my code under Linux Ubuntu 16.10, even though my server might be running a totally different Linux distribution (or might be under macOS).

The first problem is that your disk and the disk of the image built from the container are distinct. A running image does not have free access to the underlying server (the host). Remember that it is a sandbox.

So you can do all of your work inside the image. However, remember that the point of container technology is to always start from a pristine state. If you load up an image, do some work, and leave… your work is gone. Images are immutable by design. It is a great thing: you cannot easily mess up an image by tinkering with it accidentally.

You can, after doing some work inside an image, take a snapshot of the new state, commit it and create a new image from which you would start again. It is complicated and not practical.

What else could you do? What you can do instead is keep the image stateless, as images are meant to be. The image will only contain the compiler and build tools. There is no reason to change any of these tools. You will have all of your code in a directory, as you would normally do. To run and compile code, you will enter the image and run your commands. You can bind the repository from the host disk to the image as you enter it.

This works much better, but there are glitches if you are issuing your docker command lines directly:

  1. Depending on how Docker is configured on your machine, you may find that you are unable to read or write to the disk bound to the image from within the image. A quick fix is to run the image with privileged access, but that is normally frowned upon (and unnecessary).
  2. The files that you create or modify from within the Docker image will appear on the host disk, often with strange file permissions. For example, maybe all of the files are owned by the root user. I had a research assistant who had a good workaround: he ran Linux as root all the time. I do not recommend such a strategy.

These glitches come from the strange way in which Docker deals with permissions and security. Contrary to what you may have read, it is not a simple matter of setting user and group identifiers: that may be sufficient on some systems, but not on systems using Security-Enhanced Linux, which require additional care.
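
To make the discussion concrete, a typical manual invocation might look like the following; the image name is just an example, the -u option maps your user so that created files are not owned by root, and on SELinux systems you may additionally need the :z volume option:

docker run --rm -it -v "$PWD":/project -w /project -u $(id -u):$(id -g) ubuntu:20.04 bash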

And finally, you need to remember lots of complicated commands. If you are anything like me, you would rather not have to think about Docker. You want to focus all of your attention on the code.

So the solution is to use a little script. In my case I use a bash script. You can find it on GitHub. It handles messy commands and file permissions for you.

For years, I tried to avoid having to rely on a script, but relying on one is simply unavoidable if you want to work productively.

Basically, I copy two files at the root of the directory where I want to work (Dockerfile and run), and then I type:

./run bash

And that is all. I am now in a subshell, inside the host directory. I can run programs, compile them. I have complete access to a recent Ubuntu distribution. This works even under the ARM-based servers that I have.

The run script can take other commands as well, so I can use it as part of other scripts.

Science and Technology links (May 16th 2020)

  1. Most of our processors, whether in our PCs or mobile phones, are 64-bit processors. In the case of your PC, it has been so for a couple of decades. Unfortunately, we have been stuck with 32-bit operating systems for a long time. Apple has stopped supporting 32-bit applications in the most recent version of its macOS operating system and 32-bit applications have stopped running on the iPhone quite some time ago. Microsoft will now stop providing 32-bit versions of its operating system.
  2. Moths transport pollen at night and play an important role as pollinators.
  3. If you transplant fecal microbes from old to young rats, you age the young rats. We have solid evidence that blood transfusions from old mammals to young mammals will age the young recipient (Nature, 2016). The other direction, evidently more useful, is achievable in vitro, but harder to achieve in actual organisms. In what might be a breakthrough, Horvath et al. report system-wide rejuvenation in rats using plasma (blood) transfusion: the treatment more than halved the epigenetic ages of blood, heart, and liver tissue. Details are missing and we should reserve judgement. Horvath is a well regarded scientist from UCLA.

Encoding binary in ASCII very fast

In software, we typically work with binary values. That is, we have arbitrary streams of bytes. To encode these arbitrary stream of bytes in standard formats like email, HTML, XML, JSON, we often need to convert them to a standard format like base64. You can encode and decode base64 very quickly.

But what if you do not care for standards and just want to go fast and have simple code? Maybe all you care about is that the string is ASCII. That is, it must be a stream of bytes with the most significant bit of each byte set to zero. In such a case, you want to convert any 7-byte value  into an 8-byte value, and back.

I can do it in about five instructions (not counting stores and moves) both ways in standard C: five instructions to encode, and five instructions to decode. It is less than one instruction per byte. I could not convince myself that my solution is optimal.

// converts any value in [0, 2**56) into a value that
// could pass as ASCII (every 8th bit is zero)
// can be inverted with convert_from_ascii
uint64_t convert_to_ascii(uint64_t x) {
  return ((0x2040810204081 * (x & 0x80808080808080)) 
         & 0xff00000000000000) +
         (x & 0x7f7f7f7f7f7f7f);
}
// converts any 8 ASCII chars into an integer in [0, 2**56),
// this inverts convert_to_ascii
uint64_t convert_from_ascii(uint64_t x) {
  return ((0x102040810204080 * (x >> 56)) 
         & 0x8080808080808080) +
         (x & 0xffffffffffffff);
}
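
As a quick illustration, a round-trip check might look like this (assuming the two functions above are in the same file):

#include <stdint.h>
#include <stdio.h>

int main() {
  uint64_t original = 0x00ddccbbaa998877ULL; // any value below 2**56 works
  uint64_t ascii = convert_to_ascii(original);
  uint64_t back = convert_from_ascii(ascii);
  printf("%016llx -> %016llx -> %016llx\n", (unsigned long long)original,
         (unsigned long long)ascii, (unsigned long long)back);
  return back == original ? 0 : 1;
}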

Under recent Intel processors, the pdep/pext instructions may serve the same purpose. However, they are slow under AMD processors and there is no counterpart under ARM.