Science and Technology links (June 15th 2019)

Science and Technology links (June 7th 2019)

Nearly Divisionless Random Integer Generation On Various Systems

It is common in software to need random integers within a range of values. For example, you may need to pick an object at random in an array. Random shuffling algorithms require such random integers.

Typically, you generate a regular integers (say a 64-bit integer). From these integers, you find a way to get integers without a range. O’Neill showed that this apparently unimportant operation can be more expensive than generating the original random integers.

Naively, you could take the random integer and compute the remainder of the division by the size of the interval. It works because the remainder of the division by D is always smaller than D. Yet it introduces a statistical bias. You can do better without being slower. The conventional techniques supported in Java and Go require at least one or two integer division per integer generated. Unfortunately, division instructions are relatively expensive.

If you are willing to suffer a slight statistical bias, you can generate a floating-point values in [0,1) and multiply the result by the size of the interval. It avoids the division but introduces other overhead.

There is a better approach that requires no division most of the time:

uint64_t nearlydivisionless ( uint64_t s ) {
  uint64_t x = random64 () ;
  __uint128_t m = ( __uint128_t ) x * ( __uint128_t ) s;
  uint64_t l = ( uint64_t ) m;
  if (l < s) {
    uint64_t t = -s % s;
    while (l < t) {
      x = random64 () ;
      m = ( __uint128_t ) x * ( __uint128_t ) s;
      l = ( uint64_t ) m;
  return m >> 64;

Let us suppose that I shuffle an array of size 1000. I generate 64-bit integers (or floating-point numbers) which I convert to indexes. I use C++ but I reimplement the algorithm used in Java. Let me look at the number of nanoseconds per input key being shuffled on a Skylake processor (Intel):

Java’s approach 7.30 ns
Floating point approach (biased) 6.23 ns
Nearly division-free 1.91 ns

So the division-free approach is much better.

Is this result robust to the hardware? Let us try on Cannon Lake processor where the division instruction is faster…

Java’s approach 3.75 ns
Floating point approach (biased) 8.24 ns
Nearly division-free 2.53 ns

The division-free approach is less beneficial because of the faster division, but the gains are still there.

What about an ARM processor? Let us try on an Ampere Skylark.

Java’s approach 20.67 ns
Floating point approach (biased) 14.73 ns
Nearly division-free 8.24 ns

Again, the division-free approach wins.

Practically speaking, avoiding integer divisions is a good way to generate faster code.

I make my code available. To ease reproducibility, I have used docker containers: you should be able to reproduce my results.

Further reading: Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019

Science and Technology links (June 1st 2019)

  1. The DeepMind engineers built an artificial intelligence (a software program) that can learn to play 3D shooter games at super-human levels.
  2. Researchers have found a way to grow large quantities of cells that can generate blood; it works for mice.
  3. Europe passed a law to protect privacy online (GDPR). It comes with annoying warnings regarding cookies. This law had a significant negative effect on emerging European technology firms (startups). It has also increased Google’s monopoly power at the expense of smaller players.
  4. When adjusting for the unreliability of solar and wind, existing coal will be cheaper than new solar and wind in all the major regions until at least 2040“.
  5. There has a been a massive drop in the number of books borrowed from libraries by college students. Myself, I switched to ebooks a decade ago and my office contains a single book: Knuth’s book.
  6. Even old people suffering from Alzheimer’s make new neurons (brain cells).

Science and Technology links (May 25th 2019)

  1. Oculus released the Quest, its latest VR goggles. It requires no wiring, no PC. It supports six degrees of freedom, so it is “true VR”. They cost less than $600. The reviews are critical. The goggles still weight too much for long term use and the software is much too limited. I have ordered a pair which I should get in a couple of weeks. I expect that it would work with our GapMinder VR.Oculus is owned by Facebook and there are privacy concerns that come with the Oculus software.
  2. Though many jobs do not legally require a college education, it is increasingly the case that employers require a college degree. It might be irrational according to a Harvard-Business-School report:

    While a majority of employers pay between 11% and 30% more for college graduates, many employers also report that non-graduates with experience perform nearly or equally well on critical dimensions like time to reach full productivity, time to promotion, level of productivity, or amount of oversight required. Moreover, employers incur significant indirect costs. Seeking college graduates makes many middle- skills jobs harder to fill, and once hired, college graduates demonstrate higher turnover rates and lower engagement levels. A systemic view of the total economics of hiring college graduates shows that companies should be extraordinarily cautious before raising credential requirements for middle- skill positions and should not gravitate toward college graduates based only on a vague notion that it might improve the quality of their workforce.

    Requiring college degrees is a policy that harms some communities:

    Degree inflation particularly hurts populations with college graduation rates lower than the national average, such as Blacks and Hispanics, age 25 years and older.

    Apple’s CEO Tim Cook stated that half of Apple’s US employment last year was made up of people who did not have four-year degrees. Cook stresses that there is a mismatch between what colleges teach and assess, and the actual business needs.

    Some of the best software engineers I have worked with did not have a college degree or even any college. I have done some of my best work with these people.

    My impression is that the leading software companies put little weight on even a computer-science degree in the sense that they have applicants undergo extensive technical testing. Such testing would be foreign to lawyers, nurses or medical doctors.

    I still think that if you are going to go to college, studying computer science is a wise move. But I would also not put a degree as a requirement when hiring if I could as an employer.

  3. Many of my environmentally conscious colleagues attempt to offset the footprint of their international travel and conferences by the purchase of carbon credits. The general idea is that you can keep on flying many times a year, you can keep on organizing international events, because you give money so that people in Brazil plant trees, forgo farming and oil in favor of solar panels, and so forth. It is like magic! But if an idea is incredibly convenient, maybe it is just not true… Song does an excellent job of arguing against carbon credits.
  4. Between 40% to 50% of your income is due to your genes.
  5. According to an article in Nature, blocking the CD22 protein with antibodies rejuvenates the brain of a mouse. There is hope that it might be applicable to human beings as well.
  6. Google software can translate your speech using your own voice.
  7. Xenon gas can protect the brain and neurons after a brain injury, in mice.
  8. Smartphones can detect who is carrying them by their gait.

Measuring the system clock frequency using loops (Intel and ARM)

In my previous post, Bitset decoding on Apple’s A12, I showed that Apple’s latest ARM-based processor can decode set bits out of a stream of words using 3 cycles per set bit. This compares favourably to Intel processors since I never could get one of them to do better than 3.5 cycles per set bit on the same benchmark. It is also more than twice as efficient as an ARM-based server on a per cycle basis.

In my initial blog post on the Apple A12, I complained that I was merely speculating regarding the clock speed. That is, I read on Wikipedia that the A12 could run at 2.5GHz and I assumed that this was the actual speed.

It was immediately pointed out to me by Vlad Krasnov of Cloudflare, and later by Travis Downs, that I can measure the clock speed, if only indirectly, by writing a small loop in assembly.

There are many ways to achieve the desired goal. On Intel x64 processor, you can write a tight loop that decrements a counter by one with each iteration. At best, this can run in one subtraction per cycle and Intel processors are good enough to achieve this rate:

; initialize 'counter' with the desired number
dec counter ; decrement counter
jnz label ; goes to label if counter is not zero

In fact, Intel processors fuse the two instruction (dec and jnz) into a single microinstruction and they have special optimizations for tight loops.

You can write a function that runs in almost exactly “x” cycles for any “x” large enough. Of course, it could take a bit longer but if “x” is small enough and you try enough times, you can get a good result.

What about ARM processors? You can apply the same recipe, though you have to use different instructions.

; initialize 'counter' with the desired number
subs counter, counter, #1 ; decrement counter
bne label ; goes to label if counter is not zero

In my initial version, I had unrolled these loops. The intuition was that I could bring the overhead of the cost down to zero. However, the code just looked ugly. Yet Travis Downs and Steve Canon insisted that the unrolled loop was the way to do. So I do it two ways: as an unrolled loop, and as a tight loop.

My code is available. I also provide a script which can run the executable several times. Keep in mind that the frequency of your cores can change, and it takes some time before an unsolicited processor gets back to its highest frequency.

The tight loop is pretty, but the price I pay is that I must account for the cost of the loop. Empirically, I found that the ARM Skylark as well as the ARM Cortex A72 take two cycles per iteration. The Apple A12 takes a single cycle per iteration. It appears that the other ARM processors might not be able to take a branch every cycle, so they may be unable to support a one-cycle-per-iteration loop.

Of course, you can measure the CPU frequency with performance counters, so my little program is probably useless on systems like Linux. However, you can also grab my updated ios application, it will tell you about your processor speed on your iPhone. On the Apple A12, I do not need to do anything sophisticated to measure the frequency as I consistently get 2.5 GHz with a simple tight loop after I have stressed the processor enough.

Credit: I thank Vlad Krasnov (Cloudflare), Travis Downs and Steve Canon (Apple) for their valuable inputs.

Science and Technology links (May 18th 2019)

  1. Though depression has genetic components, the previous studies that identified “depression genes” are probably all bogus. They are the results of poorly conducted research, using underpowered studies to reach untenable conclusions.
  2. Many species of salmons die shortly after procreation, a bit like annual plants. Both can be viewed as an example of programmed aging and death. It turns out that if salmons gets infected by a larval parasite, the salmon will go on to live much older. (Credit: P. D. Mangan)
  3. The blood plasma of young and old mammals differ. We believe that old mammals have higher levels of “aging agents” in their blood. It appears that VCAM1 is one such agent in the blood plasma of old mice. Removing VCAM1 has a rejuvenating effect (in mice):

    Systemic administration of anti-VCAM1 antibody or genetic ablation of Vcam1 in BECs counteracts the detrimental effects of plasma from aged individuals on young brains and reverses aging aspects, including microglial reactivity and cognitive deficits, in the brains of aged mice. Together, these findings establish brain endothelial VCAM1 at the blood–brain barrier as a possible target to treat age-related neurodegeneration.

    (Source: Nature)

  4. Fibromyalgia, a terrible and slightly mysterious disease, might be closely related to diabetes.
  5. You are almost 50 times more likely to undergo harmful treatments after prostate cancer test than to have your life extended:

    1410 men would need to be screened and 48 additional cases of prostate cancer would need to be treated to prevent one death from prostate cancer

  6. In a large colorectal cancer study (over 40,000 people) that went on for 18 years, they found that 26.8% of the people offered screening died while 26.7% of the people not offered the screening died.

Building better software with better tools: sanitizers versus valgrind

We often have to write code using  permissive programming languages like C and C++. They tend to generate hard-to-debug problems that can crash your applications. Thankfully, many compilers offer “sanitizers”. I discussed them in my post No more leaks with sanitize flags in gcc and clang. I strongly encourage the use of sanitizers as I think it is the modern way to write C and C++. When many people describe how impossibly difficult it is to build good software in C and C++, they often think about old-school bare metal C and C++ where the code do all sorts of mysterious things without any protection. Then they feel compelled to run their code in a debugger and to manually run through it. You should not write code this way! Get some tools! Sanitizers can catch undefined behaviour, memory leaks, buffer overflows, data races, and so forth.

Sanitizers are a game changer, they bring C++ closer to languages like Java and Rust. They do not bring the safety to production, since you probably do not want to use sanitizers in production or release, but as you are building and testing your code, they help you a great deal catch potential issues right away.

A competitive solution that people often use is a great tool called “valgrind“. It is a general-purpose tool that checks your software as it runs. Under Windows, you have related programs like Application Verifier and WinDbg.

I believe you should almost always use sanitizers when they are available. Here is a comparison between tools like valgrind and sanitizers.

    1. With the caveat that valgrind needs support for all the instructions your software is using, valgrind can run pretty much any software, even when you do not have the source code. Sanitizers work at the compiler level, so you need the source code. Thus if you need to debug a closed source library, sanitizers are unhelpful.
    2. Sanitizers can catch problems that valgrind will not catch. For example, it will catch undesirable undefined behaviour: code that may work right now but may not work if you use a different compiler or a different processor. They can catch unsafe memory accesses that will look safe to valgrind.
    3. Sanitizers are more precise. You often can turn on or off specific sanitizers for specific functions.
    4. If your compiler has sanitizers, you can run your tests with the sanitizers on simply by turning on some flags.
    5. Valgrind is slow. Like debuggers, it often does not scale. If you are working over large data sets, it might take a really long time. People often dismiss “execution time”, and it is easy to do if you work on toy problems, but performance is an essential quality-of-life attribute. I do not think you can run valgrind in a simulated production setting. However, you can compile your code with sanitizers and emulate a production setting. Sure, your throughput is going to be impacted, but the effect is not large. Code with sanitizers is not 10x slower, valgrind is.
    6. Sanitizers are relatively new and so the support is sometimes missing.
      • For example, under macOS, Apple does not yet ship a compiler that can detect memory leaks, you need to install your own compiler.
      • Even if you compile your code with debug symbols, it is common for the sanitizers to report the errors without proper links to the source code, you often need to fiddle with the system configuration.
      • Under Linux, when using GNU GCC, I have found it necessary to use the gold linker to get good results (-fuse-ld=gold): the default link frequently gives me errors when I try to use sanitizers.
      • The “memory sanitizer” that check that you do not read from uninitialized inputs is not available under GNU GCC and under LLVM requires you to manually replace the C++ standard library and possibly recompile all of your software with the sanitizer enabled (including all dependencies) if you want to avoid false positives.
      • And Visual Studio has some of its own sanitizers, but it is largely behind LLVM. Better sanitizers may be coming to Visual Studio 2019.
      • Furthermore, you cannot freely use all possible sanitizers at once.

So, sadly, there are cases when sanitizers are just not available to you. Yet I think it is a safe bet that all competitive C/C++ compilers will soon have powerful sanitizers.

Bitset decoding on Apple’s A12

In my post Really fast bitset decoding for “average” densities, I reported on our work accelerating the decoding of bitsets. E.g., given a 64-bit register, you want to find the location of every 1-bit. So given 0b110011, you would want to get 0, 1, 4, 5. We want to do this operation with many such registers. When the content of the register is hard to predict, you can be almost certain to get a mispredicted branch with every new register to be decoded. On modern processors with deep and wide pipelines, these mispredictions can become a bottleneck.

On recent x64 processors, we find that it is beneficial to decode in bulk: e.g., assume that there are at least 4 set bits, decode them without checking whether there are four of them. The code might look as follow:

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);

So we are trading branches for a few more instructions. If the branches are easy to predict, that is a bad trade, but if the branches are hard to predict, it can be beneficial.

We consider the scenario where the input data contains neither very many nor very few 1-bit, and where their distribution is hard to predict. With our approach, we can get it down to less than 3.5 cycles per 1-bit decoded on recent Intel processors. To achieve this kind of speed, we retire nearly 2.5 instructions per cycle.

What about ARM processors? Sadly, I have not yet been able to reliably measure the same can of high speed. The Linux-based ARM systems I have seem to be limited to a quasi-scalar execution mode, retiring never much more than 1.5 instructions per cycle. Either these ARM processors are not powerful enough, or else I am not benchmarking properly.

An additional difficulty is that ARM processors do not have a fast 64-bit population-count instruction (to determine the number of 1-bit per register). Instead, you must use an instruction which finds the number of 1-bit per byte, and sum that up using another instruction. So while one instruction suffices on an Intel processor, at least two (or more) instructions are necessary, and so the cost and total latency is higher. Similarly ARM processors lack a “trailing zero” instruction: you have to reverse the bit order and use a “leading zero” instruction. So maybe ARM processors are just fundamentally at a disadvantage on this task compared to Intel processors. But I am not convinced that it is the case. If I look at the instructions counts, they seem to be similar between ARM and Intel code. That is, while ARM makes you work harder to compute some operations, Intel has its own limitations. It may all average out.

So I’d like to be certain that I have a powerful ARM processor to give ARM a fighting chance. Thankfully I do have many powerful ARM processors… I have one in my iPhone for example. Trouble is, I cannot instrument it and install Linux on it. I cannot easily use any compiler I’d like. Still, I can run benchmarks and record the time elapsed. All I need to do is write a little mobile application. I record the nanoseconds per set bit. It seems that the Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5 to get the number of cycles.

conventional 1.7 ns 4.125 cycles
fast 1.2 ns 3 cycles

If these numbers can be trusted, then the Apple A12 might possibly be more efficient than an Intel Skylake processor (3.5 cycles vs. 3 cycles). Given that Apple’s A12 is reputed to have a really wide pipeline, this sounds credible.

Using Apple’s Instruments tool, I got that the fast decoding approach runs at 3.7 instructions per cycle.

My code is available. If you have a Mac and an iPhone, you should be able to reproduce my results.

Update: The latest version of my code measures the clock speed as part of the benchmark.

Setting up a ROCKPro64 (powerful single-card computer)

A few months ago, I ordered ROCKPro64. If you are familiar with the Raspberry Pi, then it is a bit of the same… an inexpensive computer that comes in the form of a single card. The ROCKPro64 differs from the Raspberry Pi in that it is much closer in power to a normal PC. You make a decent laptop out of it. It has enough memory to do useful work and a decent 6-core processor  (dual ARM Cortex A72 and quad ARM Cortex A53). I bought the following components:

  • ROCKPro64 4GB Single Board Computer ($80)
  • ROCKPro64 aluminium casing ($15) 
  • 64GB eMMC module ($35)
  • USB adapter for the eMMC Module ($5)
  • ROCKPro64 power supply ($13)

I also had an ethernet cable at home. I connected the ethernet cable to my iMac, which is connected to the Internet via Wifi, and I configured macOS to enable Internet sharing via the (previously unused) ethernet port. You can probably connect the ROCKPro64 to the Internet by wifi, but I always prefer the reliability of ethernet cables. So I connected the ROCKPro64 to the Internet via this ethernet  cable. I did not plug anything else into it.

I wanted to install Linux on the ROCKPro64. At first, I went to Ubuntu, grabbed a release there, but it was a bad idea. It does not work. I finally figured out that you have to download Linux releases tailored to the hardware. So I got the latest version of Debian for the ROCKPro64 for GitHub. I prefer Ubuntu, but debian is good too. Maybe importantly, I used a release that was specific to the ROCKPro64 (with rockpro64 in the name).

You then need to get the operating system on the eMMC module. The eMMC module is a bit like an SD card, but you can’t plug it into you computer. However, you can plug it in the USB adapter you just bought. I did so.

In theory, you could run the ROCKPro64 out of an SD card. I do not like to work with SD cards: they are slow and unreliable. I am hoping to get better performance and durability out of the eMMC module.

I downloaded a piece of software called “etcher“. After launching it, it asked which image I wanted to use, I selected the Linux image file I had downloaded (exact name: stretch-minimal-rockpro64-0.7.9-1067-arm64.img.xz). Then it asked for the destination drive, so I plug in my USB adapter. I ignored macOS warnings about the content being unreadable and I just hit the “flash” button in etcher. I waited about five minutes.

When etcher told me everything was fine, removed the eMMC module and put it on the ROCKPro64 (there is a dedicated area on the board). I then plugin my power cord to the ROCKPro64. The network adapter lights turned on and after a short time a white LED light near the reset button came on.

I went on my iMac and in a terminal window, I typed “arp -a”. There was the following line among others:

? ( at 5a:59:0:de:6b:4e on bridge100 ifscope [bridge]

The password and identifiers are rock64, so I used ssh to connect to board:

$ ssh [email protected]
[email protected]’s password:
_ __ _ _
_ __ ___ ___| | ___ __ _ __ ___ / /_ | || |
| ‘__/ _ \ / __| |/ / ‘_ \| ‘__/ _ \| ‘_ \| || |_
| | | (_) | (__| <| |_) | | | (_) | (_) |__ _|
|_| \___/ \___|_|\_\ .__/|_| \___/ \___/ |_|
Linux rockpro64 4.4.132-1075-rockchip-ayufan-ga83beded8524 #1 SMP Thu Jul 26 08:22:22 UTC 2018 aarch64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
[email protected]:~$

After playing with the machine a bit, I wanted to shut it down. I think you want to type “systemctl poweroff”.

Notice how I am not connecting a mouse, a monitor or a keyboard to it. For what I want to do with it, I do not need any of that.

I find it inconvenient to remember the IP address of the machine. To be able to log in as “ssh [email protected]”, just type the following:

sudo apt-get update
sudo apt-get install avahi-daemon avahi-dnsconfd avahi-discover avahi-utils libnss-mdns
sudo service avahi-daemon start

Throw in the ‘ssh-copy-id’ command and you can log in without typing a password.

The modern way to run software on Linux is to use containers (e.g., docker). You can almost follow line-by-line instructions found online with the caveat that whenever they write “amd64”, your need to substitute “arm64”. Also I find it handy to add myself to the ‘docker’ group to avoid having to run docker as root:

sudo usermod -a -G docker myusername