Building better software with better tools: sanitizers versus valgrind

We often have to write code using permissive programming languages like C and C++. They tend to generate hard-to-debug problems that can crash your applications. Thankfully, many compilers offer “sanitizers”. I discussed them in my post No more leaks with sanitize flags in gcc and clang. I strongly encourage the use of sanitizers: I think they are the modern way to write C and C++. When people describe how impossibly difficult it is to build good software in C and C++, they often have in mind old-school, bare-metal C and C++ where the code does all sorts of mysterious things without any protection. They then feel compelled to run their code in a debugger and step through it manually. You should not write code this way! Get some tools! Sanitizers can catch undefined behaviour, memory leaks, buffer overflows, data races, and so forth.

Sanitizers are a game changer: they bring C++ closer to languages like Java and Rust. They do not bring safety to production, since you probably do not want to use sanitizers in your production or release builds, but as you are building and testing your code, they help you catch potential issues right away.

A competing solution that people often use is a great tool called “valgrind”. It is a general-purpose tool that checks your software as it runs. Under Windows, you have related programs like Application Verifier and WinDbg.

I believe you should almost always use sanitizers when they are available. Here is a comparison between tools like valgrind and sanitizers.

    1. With the caveat that valgrind needs support for all the instructions your software is using, valgrind can run pretty much any software, even when you do not have the source code. Sanitizers work at the compiler level, so you need the source code. Thus if you need to debug a closed source library, sanitizers are unhelpful.
    2. Sanitizers can catch problems that valgrind will not catch. For example, they will catch undesirable undefined behaviour: code that may work right now but may not work if you use a different compiler or a different processor. They can catch unsafe memory accesses that will look safe to valgrind.
    3. Sanitizers are more precise. You can often turn specific sanitizers on or off for specific functions.
    4. If your compiler has sanitizers, you can run your tests with the sanitizers on simply by turning on some flags (see the small example after this list).
    5. Valgrind is slow. Like debuggers, it often does not scale. If you are working over large data sets, it might take a really long time. People often dismiss “execution time”, and it is easy to do if you work on toy problems, but performance is an essential quality-of-life attribute. I do not think you can run valgrind in a simulated production setting. However, you can compile your code with sanitizers and emulate a production setting. Sure, your throughput is going to be impacted, but the effect is not large: code with sanitizers is not 10x slower, valgrind is.
    6. Sanitizers are relatively new and so the support is sometimes missing.
      • For example, under macOS, Apple does not yet ship a compiler that can detect memory leaks; you need to install your own compiler.
      • Even if you compile your code with debug symbols, it is common for the sanitizers to report errors without proper links back to the source code; you often need to fiddle with the system configuration.
      • Under Linux, when using GNU GCC, I have found it necessary to use the gold linker to get good results (-fuse-ld=gold): the default linker frequently gives me errors when I try to use sanitizers.
      • The “memory sanitizer” that checks that you do not read from uninitialized memory is not available under GNU GCC, and under LLVM it requires you to manually replace the C++ standard library and possibly recompile all of your software with the sanitizer enabled (including all dependencies) if you want to avoid false positives.
      • And Visual Studio has some of its own sanitizers, but it is largely behind LLVM. Better sanitizers may be coming to Visual Studio 2019.
      • Furthermore, you cannot freely use all possible sanitizers at once.
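
To illustrate the point about compiler flags (item 4 above), here is a minimal sketch (my own example, not from the original post): a tiny C++ program with a heap buffer overflow that the address sanitizer reports at run time, something ordinary testing might miss.

  // compile with gcc or clang, e.g.:
  //   c++ -g -O1 -fsanitize=address -fno-omit-frame-pointer overflow.cpp
  int main() {
    int *values = new int[8];
    values[8] = 1; // out-of-bounds write, one past the end of the allocation
    delete[] values;
    return 0; // running the binary prints a heap-buffer-overflow report
  }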

So, sadly, there are cases when sanitizers are just not available to you. Yet I think it is a safe bet that all competitive C/C++ compilers will soon have powerful sanitizers.

Bitset decoding on Apple’s A12

In my post Really fast bitset decoding for “average” densities, I reported on our work accelerating the decoding of bitsets. E.g., given a 64-bit register, you want to find the location of every 1-bit. So given 0b110011, you would want to get 0, 1, 4, 5. We want to do this operation with many such registers. When the content of the register is hard to predict, you can be almost certain to get a mispredicted branch with every new register to be decoded. On modern processors with deep and wide pipelines, these mispredictions can become a bottleneck.

On recent x64 processors, we find that it is beneficial to decode in bulk: e.g., assume that there are at least 4 set bits and decode them without checking whether there are four of them. The code might look as follows:

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);
    i+=4;
  }

So we are trading branches for a few more instructions. If the branches are easy to predict, that is a bad trade, but if the branches are hard to predict, it can be beneficial.

We consider the scenario where the input data contains neither very many nor very few 1-bits, and where their distribution is hard to predict. With our approach, we can get the cost down to less than 3.5 cycles per 1-bit decoded on recent Intel processors. To achieve this kind of speed, we retire nearly 2.5 instructions per cycle.

What about ARM processors? Sadly, I have not yet been able to reliably measure the same kind of high speed. The Linux-based ARM systems I have access to seem to be limited to a quasi-scalar execution mode, never retiring much more than 1.5 instructions per cycle. Either these ARM processors are not powerful enough, or else I am not benchmarking properly.

An additional difficulty is that ARM processors do not have a fast 64-bit population-count instruction (to determine the number of 1-bits per register). Instead, you must use an instruction which finds the number of 1-bits per byte, and sum these counts up using another instruction. So while one instruction suffices on an Intel processor, at least two (or more) instructions are necessary on ARM, and the cost and total latency are higher. Similarly, ARM processors lack a “trailing zero” instruction: you have to reverse the bit order and use a “leading zero” instruction instead. So maybe ARM processors are just fundamentally at a disadvantage on this task compared to Intel processors. But I am not convinced that this is the case. If I look at the instruction counts, they seem similar between the ARM and Intel code. That is, while ARM makes you work harder to compute some operations, Intel has its own limitations. It may all average out.
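
To make this concrete, here is a sketch (my own, using compiler builtins) of the two primitives. On x64, each maps to a single instruction (tzcnt and popcnt); on 64-bit ARM, compilers typically emit an rbit followed by a clz for the trailing-zero count, and a vector cnt followed by an addv reduction for the population count.

  #include <stdint.h>

  static inline int trailingzeroes(uint64_t word) {
    return __builtin_ctzll(word); // caller must ensure word != 0
  }

  static inline int bits_set(uint64_t word) {
    return __builtin_popcountll(word);
  }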

So I’d like to be certain that I have a powerful ARM processor to give ARM a fighting chance. Thankfully I do have many powerful ARM processors… I have one in my iPhone for example. Trouble is, I cannot instrument it and install Linux on it. I cannot easily use any compiler I’d like. Still, I can run benchmarks and record the time elapsed. All I need to do is write a little mobile application. I record the nanoseconds per set bit. It seems that the Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5 to get the number of cycles.

               nanoseconds per 1-bit   cycles per 1-bit
conventional   1.7 ns                  4.125 cycles
fast           1.2 ns                  3 cycles

If these numbers can be trusted, then the Apple A12 might possibly be more efficient than an Intel Skylake processor (3.5 cycles vs. 3 cycles). Given that Apple’s A12 is reputed to have a really wide pipeline, this sounds credible.

Using Apple’s Instruments tool, I got that the fast decoding approach runs at 3.7 instructions per cycle.

My code is available. If you have a Mac and an iPhone, you should be able to reproduce my results.

Update: The latest version of my code measures the clock speed as part of the benchmark.

Setting up a ROCKPro64 (powerful single-card computer)

A few months ago, I ordered a ROCKPro64. If you are familiar with the Raspberry Pi, then it is a bit of the same idea… an inexpensive computer that comes in the form of a single card. The ROCKPro64 differs from the Raspberry Pi in that it is much closer in power to a normal PC. You could make a decent laptop out of it. It has enough memory to do useful work and a decent 6-core processor (two ARM Cortex-A72 cores and four ARM Cortex-A53 cores). I bought the following components:

  • ROCKPro64 4GB Single Board Computer ($80)
  • ROCKPro64 aluminium casing ($15) 
  • 64GB eMMC module ($35)
  • USB adapter for the eMMC Module ($5)
  • ROCKPro64 power supply ($13)

I also had an ethernet cable at home. I connected the ethernet cable to my iMac, which is connected to the Internet via Wi-Fi, and I configured macOS to enable Internet sharing via the (previously unused) ethernet port. You can probably connect the ROCKPro64 to the Internet over Wi-Fi, but I always prefer the reliability of ethernet cables. So I connected the ROCKPro64 to the Internet via this ethernet cable. I did not plug anything else into it.

I wanted to install Linux on the ROCKPro64. At first, I went to the Ubuntu site and grabbed a release there, but it was a bad idea: it does not work. I finally figured out that you have to download Linux releases tailored to the hardware. So I got the latest version of Debian for the ROCKPro64 from GitHub. I prefer Ubuntu, but Debian is good too. Perhaps importantly, I used a release that was specific to the ROCKPro64 (with rockpro64 in the name).

You then need to get the operating system onto the eMMC module. The eMMC module is a bit like an SD card, but you cannot plug it into your computer directly. However, you can plug it into the USB adapter you just bought. I did so.

In theory, you could run the ROCKPro64 out of an SD card. I do not like to work with SD cards: they are slow and unreliable. I am hoping to get better performance and durability out of the eMMC module.

I downloaded a piece of software called “etcher”. After launching it, it asked which image I wanted to use; I selected the Linux image file I had downloaded (exact name: stretch-minimal-rockpro64-0.7.9-1067-arm64.img.xz). Then it asked for the destination drive, so I plugged in my USB adapter. I ignored macOS warnings about the content being unreadable and just hit the “flash” button in etcher. I waited about five minutes.

When etcher told me everything was fine, I removed the eMMC module from the adapter and put it on the ROCKPro64 (there is a dedicated area on the board). I then plugged the power cord into the ROCKPro64. The network adapter lights turned on and, after a short time, a white LED near the reset button came on.

I went on my iMac and in a terminal window, I typed “arp -a”. There was the following line among others:

? (192.168.2.2) at 5a:59:0:de:6b:4e on bridge100 ifscope [bridge]

The default username and password are both rock64, so I used ssh to connect to the board:

$ ssh rock64@192.168.2.2
rock64@192.168.2.2's password:
                _                     __   _  _
 _ __ ___   ___| | ___ __  _ __ ___  / /_ | || |
| '__/ _ \ / __| |/ / '_ \| '__/ _ \| '_ \| || |_
| | | (_) | (__|   <| |_) | | | (_) | (_) |__   _|
|_|  \___/ \___|_|\_\ .__/|_|  \___/ \___/   |_|
                    |_|
Linux rockpro64 4.4.132-1075-rockchip-ayufan-ga83beded8524 #1 SMP Thu Jul 26 08:22:22 UTC 2018 aarch64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
rock64@rockpro64:~$

After playing with the machine a bit, I wanted to shut it down. I think you want to type “systemctl poweroff”.

Notice how I am not connecting a mouse, a monitor or a keyboard to it. For what I want to do with it, I do not need any of that.

I find it inconvenient to remember the IP address of the machine. To be able to log in as “ssh rock64@rockpro64.local”, just type the following:

sudo apt-get update
sudo apt-get install avahi-daemon avahi-dnsconfd avahi-discover avahi-utils libnss-mdns
sudo service avahi-daemon start

Throw in the ‘ssh-copy-id’ command and you can log in without typing a password.
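
For example (assuming the avahi setup above so that the rockpro64.local name resolves, and that you already have an ssh key pair):

ssh-copy-id rock64@rockpro64.local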

The modern way to run software on Linux is to use containers (e.g., docker). You can almost follow line-by-line instructions found online, with the caveat that whenever they write “amd64”, you need to substitute “arm64”. Also, I find it handy to add myself to the ‘docker’ group to avoid having to run docker as root:

sudo usermod -a -G docker myusername
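
For instance, after logging out and back in so that the group change takes effect, you can check that containers run properly; arm64v8/ubuntu is one of the official arm64 image repositories on Docker Hub:

docker run --rm arm64v8/ubuntu uname -m   # prints aarch64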

Science and Technology links (May 11th 2019)

  1. Bone marrow transplanted from young mice to very old (almost dying) mice extended the life of the old mice by 30%. The authors conclude that bone-marrow transplantation affects the intrinsic aging mechanism.
  2. Artery calcification is an easily diagnosed condition which predicts cardiovascular diseases. In animal studies, vitamin K2 supplement reduced artery calcification. There is an ongoing clinical trial to test whether the same occurs in human beings. The study should conclude later this year.
  3. Ovarian tissues can be removed and frozen, and then reinserted into a woman’s body, allowing her to become pregnant. Though it is not simple, it seems to work.

Almost picking N distinct numbers at random

In Picking N distinct numbers at random: how to do it fast?, I describe how to quickly pick N distinct integer values at random from a range of integer values. It comes down to using either a bitset/bitmap or a hash set.

The bitset approach is very fast when you need to pick many integer values out of a small range. The hash set approach is very fast if you need to pick very few values, irrespective of the range of values.

What about the middle ground? What if you need to pick lots of integer values from an even greater range?

Because N is large, you may not care to get exactly N values. That is, if you need to pick 100 million integer values at random, it might be fine to pick 99,999,999 integer values.

What can you do?

  1. Fill an array with N randomly generated integer values (using a uniform distribution).
  2. Sort the array.
  3. Remove duplicates.
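
A minimal sketch of these three steps in C++ (my own illustration, not the benchmarked code):

  #include <algorithm>
  #include <cstdint>
  #include <random>
  #include <vector>

  // Returns up to N distinct values in [0, R): duplicates are removed,
  // so the result may hold slightly fewer than N values.
  std::vector<uint64_t> pick_sorted(uint64_t N, uint64_t R, std::mt19937_64 &rng) {
    std::uniform_int_distribution<uint64_t> dist(0, R - 1);
    std::vector<uint64_t> answer(N);
    for (auto &x : answer) x = dist(rng);      // 1. fill with random values
    std::sort(answer.begin(), answer.end());   // 2. sort
    answer.erase(std::unique(answer.begin(), answer.end()), answer.end()); // 3. dedupe
    return answer;
  }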

That is pretty good, but the sort function could be expensive if N is large: it is O(N log N), after all.

Assuming that there are no duplicates, can we model this process using probabilities? What is the distribution of the first (smallest) value? We have N values picked out of a large range, so the probability that any given value has been picked is N over the size of the range. Thus the gap before the first picked value follows, approximately, a geometric distribution. Once you have found the first value, you can repeat the same reasoning, except that we now have N-1 values over a somewhat restricted range (because we generate the values in sorted order).

  1. Generate a value over the range R using a geometric distribution with probability N/R.
  2. Append the value to my output array.
  3. Reduce the range R with the constraint that all future values must be larger than the last value appended to the output array.
  4. Decrement N by one.
  5. Repeat until N is 0.

You can use the fact that we can cheaply generate numbers according to a geometric distribution:

floor(log(random_number_in_0_1()) /log(1-p));

All you need is a way to generate random numbers in the unit interval [0,1] but that is easy. In C++ and many other programming languages, you have builtin support for geometric distributions.

The net result is an O(N) algorithm to pick N values at random over a range.

There is a catch, however. My model is not quite correct. For one thing, we do not quite have a geometric distribution: it is only valid if the range is very, very large. This manifests itself by the fact that the values I generate may exceed the range (a geometric distribution is unbounded). We can patch things up by stopping the algorithm once a value exceeds the range or some other anomaly occurs.
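
Here is a sketch of the whole approach in C++ (my own illustration, with the early-exit patch; it assumes N is much smaller than R so that the success probability stays below one):

  #include <cstdint>
  #include <random>
  #include <vector>

  // Generates roughly N distinct values in [0, R), already in sorted order.
  std::vector<uint64_t> pick_geometric(uint64_t N, uint64_t R, std::mt19937_64 &rng) {
    std::vector<uint64_t> answer;
    answer.reserve(N);
    uint64_t current = 0; // all future values must be at least 'current'
    while (N > 0 && current < R) {
      // the gap to the next picked value follows a geometric distribution
      std::geometric_distribution<uint64_t> gap(double(N) / double(R - current));
      current += gap(rng);
      if (current >= R) break; // the distribution is unbounded: stop at the range
      answer.push_back(current);
      current += 1; // future values must be strictly larger
      N -= 1;
    }
    return answer;
  }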

So I ran a benchmark where I have to pick 100,000,000 values among all integers smaller than 40,000,000,000. I get that the time per value generated is about half using the geometric-distribution approach:

sort-based 170 ns
geometric 80 ns

For larger arrays, I can achieve 3x to 4x gains but then my software runs out of memory.

My code is available.

What else could you do? Another practical approach would be to divide up the range into many small subranges and to use the fact that the number of values within each subrange follows a binomial distribution (which can be approximated by a normal distribution). This turns one big “pick many values from a large range” problem into several small “pick a few values from a small range” problems. For each small problem, you can afford the sort-based approach, since sorting small arrays is fast. A sketch follows.
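
Here is my own interpretation of this divide-and-conquer idea (not benchmarked in the post):

  #include <algorithm>
  #include <cstdint>
  #include <random>
  #include <vector>

  // Splits [0, R) into subranges of width 'width', draws how many of the
  // remaining values fall in each subrange from a binomial distribution,
  // then solves each small problem with the sort-based approach.
  std::vector<uint64_t> pick_by_subrange(uint64_t N, uint64_t R, uint64_t width,
                                         std::mt19937_64 &rng) {
    std::vector<uint64_t> answer;
    answer.reserve(N);
    uint64_t remaining = N;
    for (uint64_t start = 0; start < R && remaining > 0; start += width) {
      uint64_t w = std::min(width, R - start);
      std::binomial_distribution<uint64_t> count(remaining, double(w) / double(R - start));
      uint64_t k = count(rng);
      std::vector<uint64_t> small(k);
      std::uniform_int_distribution<uint64_t> dist(start, start + w - 1);
      for (auto &x : small) x = dist(rng);
      std::sort(small.begin(), small.end());
      small.erase(std::unique(small.begin(), small.end()), small.end());
      answer.insert(answer.end(), small.begin(), small.end());
      remaining -= k;
    }
    return answer;
  }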

Science and Technology links (May 4th 2019)

  1. It is often believed that colleges help class mobility… if you were born poor, college can help you rise up. So is there more class mobility when more people go to college? Maybe not:

    (…) researchers have characterized a college degree as a great equalizer leveling the playing field, and proposed that expanding higher education would promote mobility. This line of reasoning rests on the implicit assumption that the relatively high mobility observed among college graduates reflects a causal effect of college completion on intergenerational mobility, an assumption that has rarely been rigorously evaluated (…) I find that once selection processes are adjusted for, intergenerational income mobility among college graduates is very close to that among non-graduates.

  2. How many people does it take to form a group? The answer appears to be “at least 5”.
  3. The productivity of a computer science professor is not sensitive to where they got their PhD, but it is correlated with their place of employment.
  4. If you have ever taken management classes, you know about Maslow’s pyramid of human needs. However, Maslow never proposed such a pyramid.
  5. Twenty years ago, Microsoft introduced a mouse that did away with the mechanical ball, replacing it with a digital camera and lights.
  6. Rats were given olive oil, sunflower oil or fish oil. Rats consuming sunflower oil had a shorter lifespan.
  7. Seed oils are a source of Omega 6. It appears that keeping your consumption of Omega 6 low (compared to your Omega 3 intake) is important to reduce cardiovascular events.
  8. Shifting our energy production from coal to gas would allow us to meet our climate stabilization targets.
  9. Many studies find that red meat is associated with higher cancer rates. It is not known whether it is a mere correlation or a causal factor. However, if you narrow down your investigation to Asia, the correlation disappears. The association between red meat and cancer is apparently specific to the Western civilization.
  10. It appears that 60% of all bird species come from Australia.
  11. A new Alzheimer’s-like disease has been identified (it is called “LATE”).

Really fast bitset decoding for “average” densities

Suppose I give you a word and you need to determine the location of the 1-bits. For example, given the word 0b100011001, you would like to get 0,3,4,8.

You could check the value of each bit, but that would take too long. A better approach is to use the fact that modern processors have fast instructions to count the number of “trailing zeros” (on x64 processors, you have tzcnt). Given 0b100011001, this instruction would give you 0. Then if you set this first bit to zero (getting 0b100011000), the trailing-zero instruction gives you 3, and so forth. Conveniently enough, many processors can set the least significant 1-bit to zero using a single instruction (blsr); you can implement the desired operation in most programming languages like C as a bitwise AND: word & (word - 1).

Thus, the following loop should suffice and it is quite efficient…

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    i++;
  }

How efficient is it exactly?

To answer this question, we first need to better define the problem. If the words you are receiving have few 1-bits (say fewer than one 1-bit per 64-bit word), then you are in the sparse regime, and it becomes important to detect zero inputs quickly, for example. If half of your bits are set, you are in the dense regime, and it is best handled using vectorization and lookup tables.

But what do you do when your input data is neither really sparse (that is, you almost never have zero inputs) nor really dense (that is, most of your bits are still set to zero)? In such cases, the fact that the instructions in your loop are efficient does not help you as much as you would like, because you have another problem: almost every word will result in at least one mispredicted branch. That is, your processor has a hard time predicting when the loop will stop. This prevents your processor from doing a good job retiring instructions.

You can try to have fewer branches at the expense of more instructions:

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);
    i+=4;
  }

The downside of this approach is that you need an extra step to count how many 1-bits there are in each word. Thankfully, it is a cheap operation that can be resolved with a single instruction (popcnt) on x64 processors.
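
For instance, here is one way the unrolled loop can be driven (a sketch of the idea, not the exact simdjson code): a single popcount gives the true number of indexes, and the extra entries written by the last unrolled iteration are simply ignored, so the output buffer needs a few scratch slots.

  #include <stdint.h>
  #include <x86intrin.h> // _tzcnt_u64 is well defined even for a zero input (returns 64);
                         // compile with -mbmi or -march=native

  // 'result' must have room for up to 3 scratch entries beyond the true count.
  int decode(uint32_t *result, uint64_t word) {
    int count = __builtin_popcountll(word); // a single popcnt instruction on x64
    int i = 0;
    while (word != 0) {
      result[i] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
      result[i + 1] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
      result[i + 2] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
      result[i + 3] = (uint32_t)_tzcnt_u64(word);
      word = word & (word - 1);
      i += 4;
    }
    return count; // number of valid entries written to 'result'
  }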

This ‘unrolled’ approach can avoid more than half of the mispredicted branches, at the expense of a few fast instructions. It results in a substantial reduction in the number of CPU cycles elapsed (GNU GCC 8, Skylake processor):

               cycles / 1-bit   instructions / 1-bit   branch misses / word
conventional   4.7              8.2                    0.68
fast           3.4              8.2                    0.41

So we save about 1.3 cycles per 1-bit with the fast approach. Can the mispredicted branches explain this gain? There are about 6 bits set per input word, so the number of mispredicted branches per 1-bit is either 0.15 or 0.065. If you multiply these fractions by 15 cycles (on the assumption that each mispredicted branch costs 15 cycles), you get 2.25 cycles and 1 cycle, a difference of 1.25 cycles. It does seem credible that the mispredicted branches are an important factor.

I offer my source code; it runs under Linux.

We use this decoding approach in simdjson.

How close are we to the optimal scenario? We are using one instruction per 1-bit to count the number of trailing zeros, one instruction to zero the least significant 1-bit, one instruction to advance a pointer where we write, and one store instruction. Let us say about 5 instructions. We are getting 9.8 instructions. So we probably cannot reduce the instruction count by more than a factor of two without using a different algorithmic approach.

Still, I expect that further gains are possible, maybe you can go faster by a factor of two or so.

Further reading: Parsing Gigabytes of JSON per Second and Bits to indexes in BMI2 and AVX-512.

Credit: Joint work with Geoff Langdale. He has a blog.

Science and Technology links (April 27th 2019)

  1. Women who use oral contraceptives have a harder time recognizing emotions of others.
  2. Worldwide, livestock has ten times the mass of wild animals and nearly twice the mass of human beings. Fishes have more than ten times the mass of human beings.
  3. Using software, we can map brain activity to speech so that listeners can easily identify words (source: Nature).
  4. Aging is associated with a reduction in tissue NAD levels. In turn, this is believed to be associated with physical decline. It is not entirely clear what makes the NAD level decline, but it seems that it might be caused by the accumulation of senescent cells. As we age, we tend to accumulate these non-functional and slightly harmful “dead” cells called “senescent cells”. We now have drugs called senolytics that can remove some of these senescent cells, with at least one ongoing clinical trial.

Speeding up a random-access function?

A common problem in software performance is that you are essentially limited by memory access. Let us consider such a function where you write at random locations in a big array.

 for (size_t i = 0; i < N; i++) {
    // hash(i) is a randomized hash function value, i.e., a random position in bigarray
    bigarray[hash(i)] = i;
 }

This is a good model for how one might construct a large hash table, for example.

It can be terribly slow if the big array is really large because each and every access is likely to be an expensive cache miss.

Can you speed up this function?

It is difficult, but you might be able to accelerate it a bit, or maybe more than a bit. However, it will involve doing extra work.

Here is a strategy which works, if you do it just right. Divide your big array into regions. For each region, create a stack (a buffer). Instead of writing directly to the big array, when you are given a hash value, locate the corresponding stack and append the hash value, along with the value to be written, to it. Then, later, go through the stacks and apply the buffered writes to the big array.

 
 // buffer could be, e.g., a std::vector of std::vector<std::pair<size_t, size_t>>,
 // with one inner vector per region of 'bucketsize' entries of bigarray
 for (size_t i = 0; i < N; i++) {
    size_t loc = hash(i);
    buffer[loc / bucketsize].push_back({loc, i});
 }
 for (auto &b : buffer) {        // each buffer covers one region of bigarray
   for (auto [loc, val] : b) {   // replay the buffered writes, region by region
     bigarray[loc] = val;
   }
 }

It should be clear that this second strategy is likely to save some expensive cache misses. Indeed, during the first phase, we append to only a few stacks: the top of each stack is likely to be in cache because we have few stacks. Then when you unwind the stacks, you are writing in random order, but within a small region of the big array.

This is a standard “external memory” algorithm: people used to design a lot of these algorithms when everything was on disk and when disks were really slow.

So how well do I do? Here are my results on a Skylake processor using GNU GCC 8 (with -O3 -march=native, THP set to madvise).

           cycles/access   instructions/access
standard   57              13
buffered   45              36

So while the buffered version I coded uses nearly three times as many instructions, and while it needs to allocate a large buffer, it still comes out on top.

My code is available.

The shopper’s dilemma: wait for new technology or buy now?

Technology is accelerating. It took less than a decade for the smartphone to go from being used by 1% of the population to being used by almost everyone. Television took longer. The telephone took even longer.

Anyone who has been in the market for a technology product knows about what I call the “shopper’s dilemma”. Should you buy the current iPhone or wait another six months for an even better iPhone?

It sounds like a form of interest. You either take $1000 to buy the current model, or hold on to your $1000 and buy a much better model in six months. However, there is no free lunch: by waiting you lose access to the current product for six months.

The shopper’s dilemma also applies more broadly.

Consider medical therapies. You could have eye surgery today for a good improvement in your eyesight, or wait five years for much better surgery giving you great eyesight. Should you wait or should you take whatever is available today? If you are sick and badly in need of treatment, there is no choice. But sometimes you can afford to wait.

The shopper’s dilemma becomes increasingly more challenging as technology accelerates. Its effect is more and more important. The variance also increases: some progress comes suddenly while unexpected setbacks delay long-promised breakthroughs.

How do different people behave when faced with this dilemma?

  1. Not everyone is aware of the rate of progress. Some people are pessimistic. These people will tend to favour buying now. They are betting against the future. Somewhat ironically, this means that if you work in marketing, you should probably avoid the topic of “progress”.
  2. Technophiles, or people who follow technology closely, should favour delaying their acquisitions. They are betting on the future being better. I conjecture that they might be more likely to delay purchases or therapies.

It seems that I have a testable conjecture. It should be easy to test.