Measuring the system clock frequency using loops (Intel and ARM)

In my previous post, Bitset decoding on Apple’s A12, I showed that Apple’s latest ARM-based processor can decode set bits out of a stream of words using 3 cycles per set bit. This compares favourably to Intel processors since I never could get one of them to do better than 3.5 cycles per set bit on the same benchmark. It is also more than twice as efficient as an ARM-based server on a per cycle basis.

In my initial blog post on the Apple A12, I complained that I was merely speculating regarding the clock speed. That is, I read on Wikipedia that the A12 could run at 2.5GHz and I assumed that this was the actual speed.

It was immediately pointed out to me by Vlad Krasnov of Cloudflare, and later by Travis Downs, that I can measure the clock speed, if only indirectly, by writing a small loop in assembly.

There are many ways to achieve the desired goal. On an Intel x64 processor, you can write a tight loop that decrements a counter by one with each iteration. At best, this runs in one subtraction per cycle, and Intel processors are good enough to achieve this rate:

; initialize 'counter' with the desired number
label:
dec counter ; decrement counter
jnz label ; goes to label if counter is not zero

In fact, Intel processors fuse the two instructions (dec and jnz) into a single micro-operation, and they have special optimizations for tight loops.

You can write a function that runs in almost exactly “x” cycles for any “x” large enough. Of course, a given run could take a bit longer, but if you keep the run short enough and try enough times, you can get a good result.

What about ARM processors? You can apply the same recipe, though you have to use different instructions.

; initialize 'counter' with the desired number
label:
subs counter, counter, #1 ; decrement counter
bne label ; goes to label if counter is not zero
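
For illustration, the timing side might look as follows in C++. This is a minimal sketch of my own: count_loop stands for one of the assembly loops above, compiled separately, and the name is hypothetical.

  #include <algorithm>
  #include <chrono>
  #include <cstdint>
  #include <cstdio>

  extern "C" void count_loop(uint64_t iterations); // one of the assembly loops above

  int main() {
    const uint64_t iterations = 1000000000; // ideally one cycle per iteration
    double best_ghz = 0;
    for (int trial = 0; trial < 10; trial++) { // repeat, keep the best run
      auto start = std::chrono::steady_clock::now();
      count_loop(iterations);
      auto stop = std::chrono::steady_clock::now();
      double ns = std::chrono::duration<double, std::nano>(stop - start).count();
      // iterations per nanosecond is GHz when the loop runs at one cycle
      // per iteration; scale by the per-iteration cost otherwise.
      best_ghz = std::max(best_ghz, iterations / ns);
    }
    printf("~%.2f GHz\n", best_ghz);
    return 0;
  }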

In my initial version, I had unrolled these loops. The intuition was that I could bring the loop overhead down to almost zero. However, the code just looked ugly. Yet Travis Downs and Steve Canon insisted that the unrolled loop was the way to go. So I do it two ways: as an unrolled loop, and as a tight loop.

My code is available. I also provide a script which can run the executable several times. Keep in mind that the frequency of your cores can change, and it takes some time before a processor that has been idle gets back to its highest frequency.

The tight loop is pretty, but the price I pay is that I must account for the cost of the loop itself. Empirically, I found that the ARM Skylark as well as the ARM Cortex A72 take two cycles per iteration, whereas the Apple A12 takes a single cycle per iteration. It appears that the other ARM processors might not be able to take a branch every cycle, so they may be unable to sustain a one-cycle-per-iteration loop.

Of course, you can measure the CPU frequency with performance counters, so my little program is probably useless on systems like Linux. However, you can also grab my updated iOS application, which will tell you about your processor speed on your iPhone. On the Apple A12, I do not need to do anything sophisticated to measure the frequency: I consistently get 2.5 GHz with a simple tight loop after I have stressed the processor enough.

Credit: I thank Vlad Krasnov (Cloudflare), Travis Downs and Steve Canon (Apple) for their valuable inputs.

Science and Technology links (May 18th 2019)

  1. Though depression has genetic components, the previous studies that identified “depression genes” are probably all bogus. They are the results of poorly conducted research, using underpowered studies to reach untenable conclusions.
  2. Many species of salmon die shortly after procreating, a bit like annual plants. Both can be viewed as examples of programmed aging and death. It turns out that if a salmon gets infected by a larval parasite, it will go on to live much longer. (Credit: P. D. Mangan)
  3. The blood plasma of young and old mammals differs. We believe that old mammals have higher levels of “aging agents” in their blood. It appears that VCAM1 is one such agent in the blood plasma of old mice. Removing VCAM1 has a rejuvenating effect (in mice):

    Systemic administration of anti-VCAM1 antibody or genetic ablation of Vcam1 in BECs counteracts the detrimental effects of plasma from aged individuals on young brains and reverses aging aspects, including microglial reactivity and cognitive deficits, in the brains of aged mice. Together, these findings establish brain endothelial VCAM1 at the blood–brain barrier as a possible target to treat age-related neurodegeneration.

    (Source: Nature)

  4. Fibromyalgia, a terrible and slightly mysterious disease, might be closely related to diabetes.
  5. You are almost 50 times more likely to undergo a harmful treatment after a prostate cancer test than to have your life extended:

    1410 men would need to be screened and 48 additional cases of prostate cancer would need to be treated to prevent one death from prostate cancer

  6. In a large colorectal cancer study (over 40,000 people) that went on for 18 years, they found that 26.8% of the people offered screening died while 26.7% of the people not offered the screening died.

Building better software with better tools: sanitizers versus valgrind

We often have to write code using permissive programming languages like C and C++. They tend to generate hard-to-debug problems that can crash your applications. Thankfully, many compilers offer “sanitizers”. I discussed them in my post No more leaks with sanitize flags in gcc and clang. I strongly encourage the use of sanitizers: I think they are the modern way to write C and C++. When people describe how impossibly difficult it is to build good software in C and C++, they often have in mind old-school bare-metal C and C++ where the code does all sorts of mysterious things without any protection. Then they feel compelled to run their code in a debugger and step through it manually. You should not write code this way! Get some tools! Sanitizers can catch undefined behaviour, memory leaks, buffer overflows, data races, and so forth.
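
For a taste, consider this small program with an out-of-bounds read. It is a minimal illustration of my own: compiled with the AddressSanitizer flag, it aborts at runtime with a report pointing at the faulty line.

  // buggy.cpp -- build with: c++ -fsanitize=address -g buggy.cpp
  #include <cstdio>

  int main() {
    int array[8];
    for (int i = 0; i < 8; i++) array[i] = i;
    printf("%d\n", array[8]); // out-of-bounds read: asan reports a stack-buffer-overflow
    return 0;
  }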

Sanitizers are a game changer: they bring C++ closer to languages like Java and Rust. They do not bring safety to production, since you probably do not want to run sanitizers in production or release builds, but as you build and test your code, they help you catch potential issues right away.

A competing solution that people often use is a great tool called “valgrind”. It is a general-purpose tool that checks your software as it runs. Under Windows, you have related programs like Application Verifier and WinDbg.

I believe you should almost always use sanitizers when they are available. Here is a comparison between tools like valgrind and sanitizers.

    1. With the caveat that valgrind needs support for all the instructions your software is using, valgrind can run pretty much any software, even when you do not have the source code. Sanitizers work at the compiler level, so you need the source code. Thus, if you need to debug a closed-source library, sanitizers are unhelpful.
    2. Sanitizers can catch problems that valgrind will not catch. For example, they catch undesirable undefined behaviour: code that may work right now but may not work if you use a different compiler or a different processor. They can catch unsafe memory accesses that will look safe to valgrind.
    3. Sanitizers are more precise: you can often turn specific sanitizers on or off for specific functions (see the short example after this list).
    4. If your compiler has sanitizers, you can run your tests with the sanitizers on simply by turning on some flags.
    5. Valgrind is slow. Like debuggers, it often does not scale: if you are working over large data sets, it might take a really long time. People often dismiss “execution time”, which is easy to do when you work on toy problems, but performance is an essential quality-of-life attribute. I do not think you can run valgrind in a simulated production setting. However, you can compile your code with sanitizers and emulate a production setting: sure, your throughput is going to be affected, but the effect is not large. Code with sanitizers is not 10x slower; valgrind is.
    6. Sanitizers are relatively new and so the support is sometimes missing.
      • For example, under macOS, Apple does not yet ship a compiler that can detect memory leaks; you need to install your own compiler.
      • Even if you compile your code with debug symbols, it is common for the sanitizers to report errors without proper links to the source code; you often need to fiddle with the system configuration.
      • Under Linux, when using GNU GCC, I have found it necessary to use the gold linker to get good results (-fuse-ld=gold): the default linker frequently gives me errors when I try to use sanitizers.
      • The “memory sanitizer” that checks that you do not read from uninitialized memory is not available under GNU GCC; under LLVM, it requires you to manually replace the C++ standard library and possibly recompile all of your software with the sanitizer enabled (including all dependencies) if you want to avoid false positives.
      • Visual Studio has some sanitizers of its own, but it is largely behind LLVM. Better sanitizers may be coming to Visual Studio 2019.
      • Furthermore, you cannot freely use all possible sanitizers at once.
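
Regarding point 3, per-function control can look like this (a sketch using clang syntax; the function itself is hypothetical):

  // Exempt one function from AddressSanitizer checks (clang syntax).
  __attribute__((no_sanitize("address")))
  void low_level_trick(void) {
    // deliberately unchecked memory manipulation goes here
  }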

So, sadly, there are cases when sanitizers are just not available to you. Yet I think it is a safe bet that all competitive C/C++ compilers will soon have powerful sanitizers.

Bitset decoding on Apple’s A12

In my post Really fast bitset decoding for “average” densities, I reported on our work accelerating the decoding of bitsets. E.g., given a 64-bit register, you want to find the location of every 1-bit. So given 0b110011, you would want to get 0, 1, 4, 5. We want to do this operation with many such registers. When the content of the register is hard to predict, you can be almost certain to get a mispredicted branch with every new register to be decoded. On modern processors with deep and wide pipelines, these mispredictions can become a bottleneck.

On recent x64 processors, we find that it is beneficial to decode in bulk: e.g., assume that there are at least 4 set bits and decode them without checking whether there are four of them. The code might look as follows:

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);
    i+=4;
  }

So we are trading branches for a few more instructions. If the branches are easy to predict, that is a bad trade, but if the branches are hard to predict, it can be beneficial.

We consider the scenario where the input data contains neither very many nor very few 1-bits, and where their distribution is hard to predict. With our approach, we can get the cost down to less than 3.5 cycles per 1-bit decoded on recent Intel processors. To achieve this kind of speed, we retire nearly 2.5 instructions per cycle.

What about ARM processors? Sadly, I have not yet been able to reliably measure the same kind of high speed. The Linux-based ARM systems I have access to seem to be limited to a quasi-scalar execution mode, never retiring much more than 1.5 instructions per cycle. Either these ARM processors are not powerful enough, or else I am not benchmarking properly.

An additional difficulty is that ARM processors do not have a fast 64-bit population-count instruction (to determine the number of 1-bits per register). Instead, you must use an instruction which finds the number of 1-bits per byte, and sum that up using another instruction. So while one instruction suffices on an Intel processor, at least two (or more) instructions are necessary on ARM, and so the cost and total latency are higher. Similarly, ARM processors lack a “trailing zero” instruction: you have to reverse the bit order and use a “leading zero” instruction. So maybe ARM processors are just fundamentally at a disadvantage on this task compared to Intel processors. But I am not convinced that this is the case. If I look at the instruction counts, they seem to be similar between the ARM and Intel code. That is, while ARM makes you work harder to compute some operations, Intel has its own limitations. It may all average out.
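
To make the two-step computation concrete, here is a sketch of a trailing-zero function for ARM using clang builtins (__builtin_bitreverse64 is clang-specific and compiles to rbit):

  #include <stdint.h>

  // Trailing zeros via "reverse bits, count leading zeros" (rbit + clz).
  // Assumes word != 0, as in the decoding loops.
  static inline int trailingzeroes(uint64_t word) {
    return __builtin_clzll(__builtin_bitreverse64(word));
  }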

So I’d like to be certain that I have a powerful ARM processor to give ARM a fighting chance. Thankfully, I do have many powerful ARM processors… I have one in my iPhone, for example. Trouble is, I cannot instrument it or install Linux on it. I cannot easily use any compiler I’d like. Still, I can run benchmarks and record the time elapsed. All I need to do is write a little mobile application. I record the nanoseconds per set bit. It seems that the Apple A12 in my iPhone is limited to 2.5 GHz, so I multiply the result by 2.5 to get the number of cycles.

               time per set bit   cycles per set bit
conventional   1.7 ns             4.125 cycles
fast           1.2 ns             3 cycles

If these numbers can be trusted, then the Apple A12 might possibly be more efficient than an Intel Skylake processor (3.5 cycles vs. 3 cycles). Given that Apple’s A12 is reputed to have a really wide pipeline, this sounds credible.

Using Apple’s Instruments tool, I got that the fast decoding approach runs at 3.7 instructions per cycle.

My code is available. If you have a Mac and an iPhone, you should be able to reproduce my results.

Update: The latest version of my code measures the clock speed as part of the benchmark.

Setting up a ROCKPro64 (powerful single-card computer)

A few months ago, I ordered a ROCKPro64. If you are familiar with the Raspberry Pi, then it is a bit of the same… an inexpensive computer that comes in the form of a single card. The ROCKPro64 differs from the Raspberry Pi in that it is much closer in power to a normal PC. You could make a decent laptop out of it. It has enough memory to do useful work and a decent 6-core processor (dual ARM Cortex A72 and quad ARM Cortex A53). I bought the following components:

  • ROCKPro64 4GB Single Board Computer ($80)
  • ROCKPro64 aluminium casing ($15) 
  • 64GB eMMC module ($35)
  • USB adapter for the eMMC Module ($5)
  • ROCKPro64 power supply ($13)

I also had an ethernet cable at home. I connected the ethernet cable to my iMac, which is connected to the Internet via Wi-Fi, and I configured macOS to enable Internet sharing via the (previously unused) ethernet port. You can probably connect the ROCKPro64 to the Internet by Wi-Fi, but I always prefer the reliability of ethernet cables. So I connected the ROCKPro64 to the Internet via this ethernet cable. I did not plug anything else into it.

I wanted to install Linux on the ROCKPro64. At first, I went to Ubuntu and grabbed a release there, but it was a bad idea: it does not work. I finally figured out that you have to download Linux releases tailored to the hardware. So I got the latest version of Debian for the ROCKPro64 from GitHub. I prefer Ubuntu, but Debian is good too. Importantly, I used a release that was specific to the ROCKPro64 (with rockpro64 in the name).

You then need to get the operating system onto the eMMC module. The eMMC module is a bit like an SD card, but you can’t plug it into your computer. However, you can plug it into the USB adapter you just bought. I did so.

In theory, you could run the ROCKPro64 from an SD card, but I do not like to work with SD cards: they are slow and unreliable. I am hoping to get better performance and durability out of the eMMC module.

I downloaded a piece of software called “etcher”. After launching it, it asked which image I wanted to use; I selected the Linux image file I had downloaded (exact name: stretch-minimal-rockpro64-0.7.9-1067-arm64.img.xz). Then it asked for the destination drive, so I plugged in my USB adapter. I ignored macOS warnings about the content being unreadable and just hit the “flash” button in etcher. I waited about five minutes.

When etcher told me everything was fine, I removed the eMMC module and put it on the ROCKPro64 (there is a dedicated area on the board). I then plugged the power cord into the ROCKPro64. The network adapter lights turned on and, after a short time, a white LED near the reset button came on.

I went on my iMac and in a terminal window, I typed “arp -a”. There was the following line among others:

? (192.168.2.2) at 5a:59:0:de:6b:4e on bridge100 ifscope [bridge]

The default username and password are both rock64, so I used ssh to connect to the board:

$ ssh rock64@192.168.2.2
rock64@192.168.2.2’s password:
(ASCII-art banner spelling “rockpro64”)
Linux rockpro64 4.4.132-1075-rockchip-ayufan-ga83beded8524 #1 SMP Thu Jul 26 08:22:22 UTC 2018 aarch64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
rock64@rockpro64:~$

After playing with the machine a bit, I wanted to shut it down. I think you want to type “systemctl poweroff”.

Notice how I am not connecting a mouse, a monitor or a keyboard to it. For what I want to do with it, I do not need any of that.

I find it inconvenient to remember the IP address of the machine. To be able to log in as “ssh rock64@rockpro64.local”, just type the following:

sudo apt-get update
sudo apt-get install avahi-daemon avahi-dnsconfd avahi-discover avahi-utils libnss-mdns
sudo service avahi-daemon start

Throw in the ‘ssh-copy-id’ command and you can log in without typing a password.

The modern way to run software on Linux is to use containers (e.g., docker). You can almost follow line-by-line instructions found online, with the caveat that whenever they write “amd64”, you need to substitute “arm64”. Also, I find it handy to add myself to the ‘docker’ group to avoid having to run docker as root:

sudo usermod -a -G docker myusername

Science and Technology links (May 11th 2019)

  1. Bone marrow transplanted from young mice to very old (almost dying) mice extended the life of the old mice by 30%. The authors conclude that bone-marrow transplantation affects the intrinsic aging mechanism.
  2. Artery calcification is an easily diagnosed condition which predicts cardiovascular diseases. In animal studies, vitamin K2 supplement reduced artery calcification. There is an ongoing clinical trial to test whether the same occurs in human beings. The study should conclude later this year.
  3. Ovarian tissue can be removed and frozen, and then reinserted into a woman’s body, allowing her to become pregnant. Though it is not simple, it seems to work.

Almost picking N distinct numbers at random

In Picking N distinct numbers at random: how to do it fast?, I describe how to quickly pick N distinct integer values at random from a range of integer values. It comes down to using either a bitset/bitmap or a hash set.

The bitset approach is very fast when you need to pick many integer values out of a small range. The hash set approach is very fast if you need to pick very few values, irrespective of the range of values.

What about the middle ground? What if you need to pick lots of integer values from an even greater range?

Because N is large, you may not care to get exactly N values. That is, if you need to pick 100 million integer values at random, it might be fine to pick 99,999,999 integer values.

What can you do?

  1. Fill an array with N randomly generated integer values (using a uniform distribution).
  2. Sort the array.
  3. Remove duplicates.

That is pretty good, but the sort function could be expensive if N is large: it is O(N log N), after all.

Assuming that there are no duplicates, can we model this using probabilities? What is the distribution of the first value? We have N values picked out of a large range, so the probability that any given value has been picked is N over the range. We recognize the geometric distribution: the gap before the first picked value follows it. Once you have found the first value, you can repeat this same reasoning, except that we now have N-1 values over a somewhat restricted range (because we generate them in sorted order).

  1. Generate a value over the range R using a geometric distribution with probability N/R.
  2. Append the value to my output array.
  3. Reduce the range R with the constraint that all future values must be larger than the last value appended to the output array.
  4. Decrement N by one.
  5. Repeat until N is 0.

You can use the fact that we can cheaply generate numbers according to a geometric distribution:

floor(log(random_number_in_0_1()) /log(1-p));

All you need is a way to generate random numbers in the unit interval [0,1], but that is easy. In C++ and many other programming languages, you have built-in support for geometric distributions.

The net result is an O(N) algorithm to pick N values at random over a range.

There is a catch, however: my model is not quite correct. For one thing, we do not quite have a geometric distribution: the approximation is only valid if the range is very, very large. This manifests itself in the fact that the values I generate may exceed the range (a geometric distribution is unbounded). We can patch things up by stopping the algorithm once a value exceeds the range or some other anomaly occurs.
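
Here is a minimal C++ sketch of the whole procedure, including the stopping patch; the function name is mine, and it assumes N is much smaller than the remaining range (you could also use std::geometric_distribution instead of the manual formula):

  #include <cmath>
  #include <cstdint>
  #include <random>
  #include <vector>

  // Generate roughly N sorted, distinct values in [0, R);
  // may return slightly fewer than N values, as discussed above.
  std::vector<uint64_t> pick_almost_n(uint64_t N, uint64_t R) {
    std::vector<uint64_t> out;
    out.reserve(N);
    std::mt19937_64 gen(std::random_device{}());
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    uint64_t current = 0; // all prior positions have been decided
    while (N > 0 && current < R) {
      double p = double(N) / double(R - current);
      double u = 1.0 - unit(gen); // u in (0, 1] so that log(u) is finite
      // geometric skip: floor(log(u) / log(1 - p))
      double skip = std::floor(std::log(u) / std::log1p(-p));
      if (skip >= double(R - current)) break; // unbounded distribution: stop
      current += uint64_t(skip);
      out.push_back(current);
      current++;
      N--;
    }
    return out;
  }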

So I ran a benchmark where I have to pick 100,000,000 values among all integers smaller than 40,000,000,000. The time per value generated is roughly halved with the geometric-distribution approach:

             time per value
sort-based   170 ns
geometric    80 ns

For larger arrays, I can achieve 3x to 4x gains but then my software runs out of memory.

My code is available.

What else could you do? Another practical approach would be to divide up the range into many small subranges and to use the fact that the number of values within each subrange follows a binomial distribution (which can be approximated by a normal distribution). This gives a divide-and-conquer approach: instead of one “pick many values in a large range” problem, we would have several small “pick few values in a small range” problems. For each small problem, you can afford a sort-based approach, since sorting small arrays is fast.

Science and Technology links (May 4th 2019)

  1. It is often believed that colleges help class mobility… if you were born poor, college can help you rise up. So is there more class mobility when more people go to college? Maybe not:

    (…) researchers have characterized a college degree as a great equalizer leveling the playing field, and proposed that expanding higher education would promote mobility. This line of reasoning rests on the implicit assumption that the relatively high mobility observed among college graduates reflects a causal effect of college completion on intergenerational mobility, an assumption that has rarely been rigorously evaluated (…) I find that once selection processes are adjusted for, intergenerational income mobility among college graduates is very close to that among non-graduates.

  2. How many people does it take to form a group? The answer appears to be “at least 5”.
  3. The productivity of computer science professors is not sensitive to where they got their PhD, but it is correlated with their place of employment.
  4. If you have ever taken management classes, you know about Maslow’s pyramid of human needs. However, Maslow never proposed such a pyramid.
  5. Twenty years ago, Microsoft introduced a mouse that did away with the mechanical ball, replacing it with a digital camera and lights.
  6. Rats were given olive oil, sunflower oil or fish oil. Rats consuming sunflower oil have a shorter lifespan.
  7. Seed oils are a source of Omega 6. It appears that keeping your consumption of Omega 6 low (compared to your Omega 3 intake) is important to reduce cardiovascular events.
  8. Shifting our energy production from coal to gas would allow us to meet our climate stabilization targets.
  9. Many studies find that red meat is associated with higher cancer rates. It is not known whether this is a mere correlation or a causal factor. However, if you narrow the investigation to Asia, the correlation disappears. The association between red meat and cancer is apparently specific to Western civilization.
  10. It appears that 60% of all bird species come from Australia.
  11. A new Alzheimer’s-like disease has been identified (it is called “LATE”).

Really fast bitset decoding for “average” densities

Suppose I give you a word and you need to determine the location of the 1-bits. For example, given the word 0b100011001, you would like to get 0,3,4,8.

You could check the value of each bit, but that would take too long. A better approach is to use the fact that modern processors have fast instructions to count the number of “trailing zeros” (on x64 processors, you have tzcnt). Given 0b100011001, this instruction would give you 0. Then if you set this first bit to zero (getting 0b100011000), the trailing-zero instruction gives you 3, and so forth. Conveniently enough, many processors can set the least significant 1-bit to zero using a single instruction (blsr); in most programming languages like C, you can implement the desired operation as a bitwise AND: word & (word - 1).

Thus, the following loop should suffice and it is quite efficient…

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    i++;
  }
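
For reference, trailingzeroes can be a thin wrapper over a compiler builtin; here is a sketch (with GCC or clang, it compiles down to tzcnt when targeting recent x64 processors, e.g. with -mbmi):

  #include <stdint.h>

  // Count trailing zeros; word != 0 is guaranteed by the loop condition.
  static inline int trailingzeroes(uint64_t word) {
    return __builtin_ctzll(word);
  }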

How efficient is it exactly?

To answer this question, we first need to better define the problem. If the words you are receiving have few 1-bits (say less than one 1-bit per 64-bit word), then you are in the sparse regime, and it becomes important to quickly detect zero inputs, for example. If half of your bits are set, you are in the dense regime, which is best handled using vectorization and lookup tables.

But what do you do when your input data is neither really sparse (that is, you almost never have zero inputs) nor really dense (that is, most of your bits are still set to zero)? In such cases, the fact that the instructions in your loop are efficient does not help you as much as you’d like because you have another problem: almost every word will result in at least one mispredicted branch. That is, your processor has a hard time predicting when the loop will stop. This prevents your processor from doing a good job retiring instructions.

You can try to have fewer branches at the expense of more instructions:

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);
    i+=4;
  }

The downside of this approach is that you need an extra step to count how many 1-bits there are in your words. Thankfully, it is a cheap operation that can be resolved with a single instruction on x64 processors.
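
That step might look like this (a sketch; with GCC or clang, __builtin_popcountll compiles to the popcnt instruction when targeting recent x64 processors):

  #include <stdint.h>

  // Count the 1-bits so we know how many of the (up to four) indexes
  // written per unrolled iteration are valid.
  static inline int bits_set(uint64_t word) {
    return __builtin_popcountll(word);
  }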

This ‘unrolled’ approach can avoid more than half of the mispredicted branches, at the expense of a few fast instructions. It results in a substantial reduction in the number of CPU cycles elapsed (GNU GCC 8, Skylake processor):

               cycles / 1-bit   instructions / 1-bit   branch misses / word
conventional   4.7              8.2                    0.68
fast           3.4              8.2                    0.41

So we save about 1.3 cycles per 1-bit with the fast approach. Can the mispredicted branches explain this gain? There are about 6 bits set per input word, so the number of mispredicted branches per 1-bit is either 0.15 or 0.065. If you multiply these fractions by 15 cycles (on the assumption that each mispredicted branch costs 15 cycles), you get 2.25 cycles and 1 cycle, a difference of 1.25 cycles. It does seem credible that the mispredicted branches are an important factor.

I offer my source code; it runs under Linux.

We use this decoding approach in simdjson.

How close are we to the optimal scenario? We are using one instruction per 1-bit to count the number of trailing zeros, one instruction to zero the least significant 1-bit, one instruction to advance a pointer where we write, and one store instruction. Let us say about 5 instructions. We are getting 9.8 instructions. So we probably cannot reduce the instruction count by more than a factor of two without using a different algorithmic approach.

Still, I expect that further gains are possible, maybe you can go faster by a factor of two or so.

Further reading: Parsing Gigabytes of JSON per Second and Bits to indexes in BMI2 and AVX-512.

Credit: Joint work with Geoff Langdale. He has a blog.

Science and Technology links (April 27th 2019)

  1. Women who use oral contraceptives have a harder time recognizing emotions of others.
  2. Worldwide, livestock has ten times the mass of wild animals and nearly twice the mass of human beings. Fish have more than ten times the mass of human beings.
  3. Using software, we can map brain activity to speech so that listeners can easily identify words (source: Nature).
  4. Aging is associated with a reduction in tissue NAD levels. In turn, this is believed to be associated with physical decline. It is not entirely clear what makes the NAD level decline, but it seems that it might be caused by the accumulation of senescent cells. As we age, we tend to accumulate these non-functional and slightly harmful “dead” cells called “senescent cells”. We now have drugs called senolytics that can remove some of these senescent cells, with at least one ongoing clinical trial.