How long should you work on a problem ?

Lev Reyzin says that working too long on a problem might be unproductive:

I, personally, have diminishing (or negative?) returns to my creative work as I explicitly work on a problem past some amount of time. I often have insights coming to me out of nowhere while I’m relaxing or enjoying hobbies on nights or weekends.

Whenever one considers innovative endeavors and their productivity, one must consider that innovation is fundamentally wasteful. The problem with innovation is not about how to get as much of it for as little of a cost as possible. It is to get innovation at all. By optimizing your production function, you risk losing all of it. I sooner blame someone for his publication list being too long than being too short, said Dijkstra.

My view is that we tend to underestimate “intellectual latency”. There is a delay between the time you approach a new idea and the time you have fully considered it.

Thus our brains are not unlike computers. Your processor might be able to run at 4 GHz and be able to retire 4 instructions per cycle… but very rarely are you able to reach such a throughput while working on a single task. The productive intellectuals that I know tend to work on a few ideas at once. Maybe they are writing a book while building a piece of software and writing an essay.

So you should not focus on one unique task in the hope of finishing it faster. You may complete it slightly faster if you omit everything else but the sum total of your productivity might be much lower.

There is also a social component to human cognition. If you hold on to a problem for very long, working tirelessly on it, you may well deprive yourself of the input of others. You should do go work and then quickly invite others to improve on your work. No matter how smart you think you are, you cannot come close to the superior ingenuity of the open world.

Energy and sanity are essential ingredients of sustain intellectual productivity. Hammering at a single problem for a long time is both maddening and energy limiting. Our brains are wired to like learning about new ideas. Your brain wants to be free to explore.

And finally, the most important reason to limit the amount of work you invest on a single task is that it is a poor strategy even if you can do it with all the energy and intelligence in the world. Sadly, most of what you do is utterly useless. You are like an investor in a stock market where almost all stocks are losers. Putting all your money on one stock would ensure your ruin. You cannot know, at any given time, what will prove useful. Maybe going outside and playing with your son sounds like a waste of your potential right now, but it might be the one step that puts your life on a trajectory of greatness. You want to live your life diversely, touching many people, trying many things. Learn to cook. Make cocktails. Dance. Go to the theatre. Play video games. Write assembly code. Craft a novel.

Many years ago I started to blog. I also started publishing my software as open source in a manner that could be useful to others. I started posting my research papers as PDFs that anyone could download. None of these decisions seemed wise at first. They took time away from “important problems”. I was ridiculed at one point or another for all of them. Yet these three decisions ended up being extremely beneficial to me.

Further reading: Peer-reviewed papers are getting increasingly boring

Science and Technology links (June 12th 2021)

    1. We completed the sequencing of the human genome.
    2. AstraZeneca’s drug Lynparza cut combined risk of recurrence of breast cancer or death by 42% among women in study.
    3. Glycine and N-acetylcysteine supplementation improves muscle strength and cognition.
    4. We found Egypt’s ancient capital. It has been lost of 3,400 years. It is unclear why it was abandoned.
    5. We estimate that over 6 million Americans have Alzheimer’s. Currently, Alzheimer’s is effectively incurable and no drug is known to reverse or halt it. Once you have Alzheimer’s, you start an irreversible and drastic cognitive decline. The USA has approved the first Alzheimer’s new drug in 20 years. The American government decided to approve the new drug, aducanumab, even though it is unclear whether it genuinely stops Alzheimer’s. It clears the brain from accumulated proteins and might slow the cognitive decline, but that latter claim is uncertain. The approval is controversial as the company producing aducanumab stands to make a lot of money while, possibly, providing no value whatsoever to the patient (and the drug might even have negative side-effects). Yet by deploying the drug today, we stand to learn much about its potential benefits and, if you are affected by Alzheimer’s, you may feel that you do not have much to lose.
    6. Trials begin on lozenge that rebuilds tooth enamel.
    7. The Google folks founded an anti-aging company called Calico a few years ago. One of the star employee is Cynthia Kenyon who is famous for showing that aging is malleable. Their latest paper suggests that we might be able to rejuvenate individual cells within the body safely.
    8. The incoming new disks (SSD) have a sequential read speed of up to 14 GB/s (PCIe 5.0).
    9. We are curing the blinds: “researchers added light-sensitive proteins to the man’s retina, giving him a blurry view of objects”. (Source: New York Times)
    10. You might think that government research grants are given on merit. Maybe not always. Applicants who shared both a home and a host organization with one panellist received a grant 40% more often than average. (Source: Nature)
    11. The Roman Empire thrived under a climate that was much warmer. The Empire declined when the temperature got colder:

      This record comparison consistently shows the Roman as the warmest period of the last 2 kyr, about 2 °C warmer than average values for the late centuries for the Sicily and Western Mediterranean regions. After the Roman Period a general cooling trend developed in the region with several minor oscillations. We hypothesis the potential link between this Roman Climatic Optimum and the expansion and subsequent decline of the Roman Empire.

      (Source: Nature)

    12. Reportedly, China is increasing its coal power capacity at a rate of 21% per year. Its yearly increase alone is six time Germanyʼs entire coal-fired capacity.

Computing the number of digits of an integer even faster

I my previous blog post, I documented how one might proceed to compute the number of digits of an integer quickly. E.g., given the integer 999, you want 3 but given the integer 1000, you want 4. It is effectively the integer logarithm in base 10.

On computers, you can quickly compute the integer logarithm in base 2, and it follows that you can move from one to the other rather quickly. You just need a correction which you can implement with a table. A very good solution found in references such as Hacker’s Delight is as follows:

    static uint32_t table[] = {9, 99, 999, 9999, 99999,
    999999, 9999999, 99999999, 999999999};
    int y = (9 * int_log2(x)) >> 5;
    y += x > table[y];
    return y + 1;

Except for the computation of the integer logarithm, it involves a multiplication by 9, a shift, a conditional move, a table lookup and an increment. Can you do even better? You might! Kendall Willets found an even more economical solution.

int fast_digit_count(uint32_t x) {
  static uint64_t table[] = {
      4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
      12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
      21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
      25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
      34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
      38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
      42949672960, 42949672960};
  return (x + table[int_log2(x)]) >> 32;

If I omit the computation of the integer logarithm in base 2, it requires just a table lookup, an addition and a shift:

add     rax, qword ptr [8*rcx + table]
shr     rax, 32

The table contains the numbers ceil(log10(2j)) * 232 + 232 – 10ceil(log10(2j)) for j from 2 to 30, and then just ceil(log10(2j)) for j = 31 and j = 32. The first value is 232 .

My implementation of Kendall’s solution is available.

Using modern C++, you can compute the table using constant expressions.

Further reading: Josh Bleecher Snyder has a blog post on this topic which tells the whole story.

Computing the number of digits of an integer quickly

Suppose I give you an integer. How many decimal digits would you need to write it out? The number ‘100’ takes 3 digits whereas the number ’99’ requires only two.

You are effectively trying to compute the integer logarithm in base 10 of the number. I say ‘integer logarithm’ because you need to round up to the nearest integer.

Computers represent numbers in binary form, so it is easy to count the logarithm in base two. In C using GCC or clang, you can do so as follows using a counting leading zeroes function:

int int_log2(uint32_t x) { return 31 - __builtin_clz(x|1); }

Though it looks ugly, it is efficient. Most optimizing compilers on most systems will turn this into a single instruction.

How do you convert the logarithm in base 2 into the logarithm in base 10? From elementary mathematics, we have that log10 (x) = log2(x) / log2(10). So all you need is to divide by log2(10)… or get close enough. You do not want to actually divide, especially not by a floating-point value, so you want mutiply and shift instead. Multiplying and shifting is a standard technique to emulate the division.

You can get pretty close to a division by log2(10) if you multiply by 9 and then divide by 32 (2 to the power of 5). The division by a power of two is just a shift. (I initially used a division by a much larger power but readers corrected me.)

Unfortunately, that is not quite good enough because we do not actually have the logarithm in base 2, but rather a truncated version of it. Thus you may need to do a off-by-one correction. The following code works:

    static uint32_t table[] = {9, 99, 999, 9999, 99999, 
    999999, 9999999, 99999999, 999999999};
    int y = (9 * int_log2(x)) >> 5;
    y += x > table[y];
    return y + 1;

It might compile to the following assembly:

        or      eax, 1
        bsr     eax, eax
        lea     eax, [rax + 8*rax]
        shr     eax, 5
        cmp     dword ptr [4*rax + table], edi
        adc     eax, 0
        add     eax, 1

Loading from the table probably incurs multiple cycles of latency (e.g., 3 or 4). The x64 bsr instruction has also a long latency of 3 or 4 cycles. My code is available.

You can port this function to Java as follows if you assume that the number is non-negative:

        int l2 = 31 - Integer.numberOfLeadingZeros(num|1);
        int ans = ((9*l2)>>>5);
        if (num > table[ans]) { ans += 1; }
        return ans + 1;

I wrote this blog post to answer a question by Chris Seaton on Twitter. After writing it up, I found that the always-brilliant Travis Downs had proposed a similar solution with a table lookup. I believe he requires a larger table. Robert Clausecker once posted a solution that might be close to what Travis has a in mind.

Furthermore, if the number of digits is predictable, then you can write code with branches and get better results in some cases. However, you should be concerned with the fact that a single branch miss can cost you 15 cycles and  tens of instructions.

Update: This a followup to this blog post… Computing the number of digits of an integer even faster

Further reading: Converting binary integers to ASCII character and Integer log 10 of an unsigned integer — SIMD version

Note: Victor Zverovich stated on Twitter than the fmt C++ library relies on a similar approach. Pete Cawley showed that you could achieve the same result that I got initially by multiplying by 77 and then shifting by 8, instead of my initially larger constants. He implemented his solution for LuaJIt. Giulietti pointed out to me by email that almost exactly the same routine appears in Hacker’s Delight at the end of chapter 11.

All models are wrong

All models are wrong, but some are useful is a common saying in statistics. It does not merely apply to statistics, however. It is general observation. Box (1976) wrote an  influential article on the topic. He says that you make progress by successively making models followed by experiments, followed by more models, and then more experiments. And so forth. Importantly, a more sophisticated model might be worse and lead to mediocrity.

Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.

Kelly echoed this sentiment in his essay on thinkism:

Without conducting experiments, building prototypes, having failures, and engaging in reality, an intelligence can have thoughts but not results. It cannot think its way to solving the world’s problems.

Once you take for granted that all models are incorrect, the next question to ask is which models are useful, says Box:

Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules. For such a model there is no need to ask the question “Is the model true?”. If “truth” is to be the “whole truth” the answer must be “No”. The only question of interest is “Is the model illuminating and useful?”.

How do you know if something is useful? By testing it out in the real world!

Why would all models be incorrect? Part of the answer has to do with the fact that pure logic, pure mathematics only works locally. It does not scale. It does not mean that pure logic is ‘bad’, only that its application is limited.

Programmers and other system designers are ‘complexity managers’. If you are working with very strict rules in a limited domain, you can make pure logic prevail. A programmer can prove that a given function is correct. However, at scale, all software, all laws, all processes, all theories, all constitutions are incorrect. You assess whether they are useful. You check that they are correct in the way you care about. You cannot run a country or a large business with logic alone. In a sense, pure logic does not scale. Too many people underestimate the forces that push us toward common law and away from top-down edicts.

If you are a programmer, you should therefore not seek to make your software flawless and perfect. You may end up with worse software in the process. It may become overengineered. To get good software, test it out in practice. See how well it meets the needs of your clients. Then revise it, again and again.

If you are doing research, you should not work from models alone. Rather, you should start from a model, test it out in a meaningful manner, refine it again (based on your experience) and so forth, in a virtuous circle.

Evidently, running a business has to follow the same paradigm. All business plans are wrong, some are useful. Start with a plan, but revise it at the end of the first day.

In whatever you do, focus on what works. Try to test out your ideas as much as possible. Adjust your models frequently.

Never, never accept an untested model no matter how smart and clever its authors are.

Credit: I am grateful to Peter Turney for links and ideas.

Further reading:  Gödel’s loophole.

Science and Technology links (May 22nd 2021)

  1. Most computer chips today in flagship phones and computers use a process based on a 5 nm or larger resolution. Finer resolutions usually translate into lower energy usage and lower heat production. Given that many of our systems are limited by heat or power, finer resolutions lead to higher performance. The large Taiwanese chip maker (TSMC) announced a breakthrough that might allow much finer resolutions (down to 1 nm). IBM recently reported a similar but less impressive breakthrough. It is unclear whether the American giant, Intel, is keeping up.
  2. Good science is reproducible. If other researchers follow whatever you describe in your research article, they should get the same results. That is, what you report should be an objective truth and not the side-effect of your beliefs or of plain luck. Unfornately, we rarely try to reproduce results. When we do, it is common to be unable to reproduce the results from a peer-reviewed research papers. The system is honor-based: we trust that people do their best to check their own results. What happens when mistakes happen? Over time, other researchers will find out. Unfortunately, reporting such failures is typically difficult. Nobody likes to make ennemies and the burden of the proof is always on you when you want to denounce other people’s research. It is so common that we have a name for the effect: the replication crisis. The reproduction crisis has attracted more and more attention because it is becoming an existential threat: if a system produces research that cannot be trusted, the whole institution might fall. We see the reproduction crisis in psychology, cancer research and machine learning. Researchers now report that unreproducible research can be cited 100 times more than reproducible research. It suggests that people who produce unreproducible research might have an advantage in their careers and that they might go up the ranks faster.
  3. Recent PCs and tablets store data on solid-state drives (SSDs) that can be remarkably fast. The latest Sony PlayStation has an SSD with a bandwidth exceeding 5 GB/s. Conventional (spinning) disks have lagged behind with a bandwidth of about 200 MB/s. However, conventional disks can be much larger. It seems, however, that conventional disks might be getting faster. The hard drive maker Seagate has been selling conventional disks that have a bandwidth of 500 MB/s.
  4. As you age, you accumulate cells that are dysfunctional and should otherwise die, they are called senescent cells. We are currently developing therapies to remove them. Martinez-Zamudio et al. report that a large fraction of some cells in the immune system of older human beings are senescent (e.g., 64%). Clearing these senescent cells could have a drastic effect. We shall soon know.
  5. There are 50 billion birds and 1.6 billion sparrows.
  6. Computer scientists train software neural networks (for artificial intelligence) using backpropagation. It seems that people believe that such a mechanism (backpropagation) is not likely to exist in biology. Furthermore, people seem to believe that in biological brain, learning is “local” (at the level of the synapse). Recently, researchers have shown that we can train software neural networks using another technique that is ‘biologically plausible’ called zero-divergence inference learning. The implicit assumption is that these software systems are thus a plausible model for biological brains. It is unclear to me whether that’s a valid scientific claim: is it falsifiable?
  7. Ancient Romans used lead for everything. It appears that Roman children suffered from lead poisoning and had a related high mortality rate.
  8. Knight et al. found strong evidence to support the hypothesis that vitamin D could help prevent breast cancer. Taking walks outside in the sun while not entirely covered provides your body with vitamin D.
  9. Mammal hearts do not regenerate very well. Hence, if your heart is damaged, it may never repair itself. It appears that some specific stem cells can survive when grafted to living hearts and induce regeneration.
  10. Persistent short sleep duration (6 hours or less) is associated with a 30% increased dementia risk. (Note that this finding does not imply that people sleeping a lot are in good health. It also does not imply that you are sick if you are sleeping little.)
  11. Researchers have rejuvenated the blood cells of old mice using a form of vitamin B3.
  12. There are 65 animal species that can laugh.
  13. Between the years 1300 and 1400, the area near Greenland became relatively free from ice, as it was effectively exporting its ice to subartic regions. It seems to match the beginning of the Little Ice Age, a time when, at least in Europe, cold temperatures prevailed.

Counting the number of matching characters in two ASCII strings

Suppose that you give me two ASCII strings having the same number of characters. I wish to compute efficiently the number of matching characters (same position, same character). E.g., the strings ‘012c’ and ‘021c’ have two matching characters (‘0’ and ‘c’).

The conventional approach in C would look as follow:

uint64_t standard_matching_bytes(char * c1, char * c2, size_t n) {
    size_t count = 0;
    size_t i = 0;
    for(; i < n; i++) {
        if(c1[i]  == c2[i]) { count++; }
    return count;

There is nothing wrong with this code. An optimizing compiler can auto-vectorize this code so that it will do far fewer than one instruction per byte, given long enough strings.

However, it does appear that the routine looks at every character, one by one. So it looks like you are loading two values, then you are comparing and then incrementing a counter, for each character. So it might compile to over 5 instructions per character (prior to auto-vectorization).

What you can do instead is load the data in blocks of 8 bytes, into 64-bit integers as in the following code. Do not be mislead by the apparently expensive memcpy calls: an optimizing compiler will turn these function calls into a single load instruction.

uint64_t matching_bytes(char * c1, char * c2, size_t n) {
    size_t count = 0;
    size_t i = 0;
    uint64_t x, y;
    for(; i + sizeof(uint64_t) <= n; i+= sizeof(uint64_t)) {
      memcpy(&x, c1 + i, sizeof(uint64_t) );
      memcpy(&y, c2 + i, sizeof(uint64_t) );
      count += matching_bytes_in_word(x,y);
    for(; i < n; i++) {
        if(c1[i]  == c2[i]) { count++; }
    return count;

So we just need a function that can compare two 64-bit integers and find how many matching bytes there are. Thankfully there are fairly standard techniques to do so such as the following. (I borrowed part of the routine from Wojciech Muła.)

uint64_t matching_bytes_in_word(uint64_t x, uint64_t y) {
  uint64_t xor_xy = x ^ y;
  const uint64_t t0 = (~xor_xy & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
  const uint64_t t1 = (~xor_xy & 0x8080808080808080llu);
  uint64_t zeros = t0 & t1;
  return ((zeros >> 7) * 0x0101010101010101ULL) >> 56;

With this routine, you can bring down the instruction count to about 2 per character, including all the overhead and the data loading. It is strictly better than what you could with character-by-character processing by a factor of two (for long strings).

Though I seem to restrict the problem to ASCII inputs, my code actually counts the number of matching bytes. If you know that the input is ASCII, you can further optimize the routine.

I leave it as an exercise for the reader to write a function that counts the number of matching characters within a range, or to determine whether all characters in a given range match.

The proper way to solve this problem is with SIMD instructions, and most optimizing compilers should do that for you starting from a simple loop. However, if it is not possible and you have relatively long strings, then the approach I described could be beneficial.

My source code is available.

Converting binary integers to ASCII characters: Apple M1 vs AMD Zen2

Programmers often need to write integers as characters. Thus given the 32-bit value 1234, you might need a function that writes the characters 1234. We can use the fact that the ASCII numeral characters are in sequence in the ASCII table: ‘0’+0 is ‘0’, ‘0’+1 is ‘1’ and so forth. With this optimization in mind, the standard integer-to-string algorithm looks as follow:

while(n >= 10)
  p = n / 10
  r = n % 10
  write '0' + r
  n = p
write '0' + n

This algorithm writes the digits in reverse. So actual C/C++ code will write a pointer that you decrement (and not increment):

  while (n >= 10) {
    const p = n / 10;
    r = n % 10;
    n = p;
    *c-- = '0' + r;
  *c-- = '0' + n;

You can bound the size of the string (10 characters for 32-bit integers, 20 characters for 64-bit integers). If you have signed integers, you can detect the sign initially and make the integer value non-negative, write out the digits and finish with the sign character if needed. If you know that your strings are long, you can do better by writing out the characters two at a time using lookup tables.

How fast is this function ? It is going to take dozens of instructions and CPU cycles. But where is the bottleneck?

If you look at the main loop, and pay only attention to the critical data dependency, you divide your numerator by 10, then you check its value, and so forth. So your performance is bounded by the speed at which you can divide the numerator by 10.

The division instruction is relatively slow, but most compilers will convert it into a multiplication and a shift. It implies that the whole loop has a latency of about 5 cycles if you count three cycles for the multiplication and one cycle for the shift, with one cycle for the loop overhead. Of course, the function must also compute the remainder and write out the result, but their cost is maybe less important. It is not that these operations are themselves free: computing the remainder is more expensive than computing the quotient. However, we may get them almost for free because they are on a critical data dependency path.

How correct is this analysis? How likely is it that you are just bounded by the division by 10? The wider your processor, the more instructions it can retire per cycle, the more true you’d expect this analysis to be. Our commodity processors are already quite wide. Conventional Intel/AMD processors can retire about 4 instructions per cycle. The Apple M1 processor can retire up to 8 instructions per cycle.

To test it out, let us add a function which only writes out the most significant digit.

  while (n >= 10) {
    n /= 10;
  *c = '0' + char(n);

Here is the number of nanoseconds required per integer on average according to a benchmark I wrote. The benchmark is designed to measure the latency.

function Apple M1 clang 12 AMD Zen2 gcc 10
fake itoa 11.6 ns/int 10.9 ns/int
real itoa 12.1 ns/int 12.0 ns/int

According to these numbers, my analysis seems correct on both processors. The numbers are a bit closer in the case of the Apple M1 processor, but my analysis is not sufficiently fine to ensure that this difference is significant.

Hence, at least in this instance, your best chance of speeding up this function is either by dividing by 10 faster (in latency) or else by reducing the number of iterations (by processing the data in large chunks). The latter is already found in production code.

In the comments, Travis Downs remarks that you can also try to break the chain of dependencies (e.g., by dividing the task in two).

Further reading: Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries, Software: Practice and Experience 49 (6), 2019

Science and Technology links (May 15th 2021)

  1. There were rainforests near the south pole 90 million years ago.
  2. Though commercial exchanges are typically win-win for both the buyer and the seller, people tend to view the buyer as more likely to be taken advantage of.
  3. People with low self-esteem are more likely to blame the political system for their personal problems.
  4. Moscona find compelling evidence that the introduction of patents for plants in 1985 was followed by increased innovation.  This suggests that government interventions in agriculture can help increase productivity and entice further research and development.
  5. Ahmed et al. find that employers favour female candidates. Men are especially well advised to stay away from female-dominated fields:

    Male applicants were about half as likely as female applicants to receive a positive employer response in female-dominated occupations.

    Female applicants do not suffer from such discrimination according to this study. Note that this new study only supports earlier findings. For example, Williams and Ceci find that academic female applicants have an enormous advantage over academic male applicants:

    Contrary to prevailing assumptions, men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

    We also view managers as less moral when they fire women according to Reynolds et al.

  6. Working from home may not reduce output, but it seems to reduce productivity in the sense that workers need more hours to get the same work done. Importantly, this remains true even after accounting for the reduction in commute time. That is, it would appear that though people do not have to commute, they reinvest all of that time, and more, into their work. It applies to both people with children and people without children though the negative effect is more pronounced with people having children. (It is only a single study so it should be taken with some skepticism.)
  7. The gut of infants is free from microbial colonization before birth.
  8. Worldwide, since 2000, we have gained the equivalent of France is forest ground.
  9. We can transplant fresh ovaries in mice, and we might soon be able to do so in women.
  10. The amount of insulin in your blood increases as you age, robustly, irrespective of other factors. In layman’s terms, you are getting more and more diabetic over time.
  11. Scientists create early embryos that are part human, part monkey.


Constructing arrays of Boolean values in Java

It is not uncommon that we need to represent an array of Boolean (true or false) values. There are multiple ways to do it.

The most natural way could be to construct an array of booleans (the native Java type). It is likely that when stored in an array, Java uses a byte per value.

boolean[] array = new boolean[listSize];
for(int k = 0; k < listSize; k++) {
  array[k] = ((k & 1) == 0) ? true : false;

You may also use a byte type:

byte[] array = new byte[listSize];
for(int k = 0; k < listSize; k++) {
  array[k] = ((k & 1) == 0) ? (byte)1 : (byte)0;

You can get more creative and you could do it using an array of strings:

String[] array = new String[listSize];
for(int k = 0; k < listSize; k++) {
  array[k] = ((k & 1) == 0) ? "Found" : "NotFound";

In theory, Java could optimize the array so that it requires only one bit per entry. In practice, each reference to a string value will use either 32 bits or 64 bits. The string values themselves use extra memory, but Java is probably smart enough not to store multiple times in memory the string “Found”. It might store it just once.

And then you can do it using a BitSet, effectively using about a bit per value:

BitSet bitset = new BitSet(listSize);
for(int k = 0; k < listSize; k++) {
  if((k & 1) == 0) { bitset.set(k); }

The BitSet has tremendous performance advantages: low memory usage, fancy algorithms that benefit from word-level parallelism, and so forth.

Typically, you do not just construct such an array, you also use it. But let us say that I just want to construct it as fast as possible, how do these techniques differ? I am going to use 64K array with OpenJDK 8 on an Apple M1 processor.

My source code is available. In my benchmark, the content of the arrays is  known at compile time which is an optimistic case (the compiler could just precompute the results!). My results are as follow:

boolean 23 us
byte 23 us
String 60 us
BitSet 50 us

You may divide by 65536 to get the cost in nanoseconds per entry. You may further divide by 3.2GHz to get the number of cycles per entry.

We should not be surprised that the boolean (and byte) arrays are fastest. It may require just one instruction to set the value. The BitSet is about 3 times slower due to bit manipulations. It will also use 8 times less memory.

I was pleasantly surprised by the performance of the String approach. It will use between 4 and 8 times more memory than the simple array, and thus 32 to 64 times more memory than the BitSet approach, but it is reasonably competitive with the BitSet approach. But we should not be surprised. The string values are known at compile-time. Storing a reference to a string should be more computationally expensive than storing a byte value. These numbers tell us that Java can and will optimize String assignments.

I would still disapprove strongly of the use of String instances to store Boolean values. Java may not be able to always optimize away the computational overhead of class instances.

Furthermore, if you do not care about the extra functionality of the BitSet class, with its dense representation, then an array of boolean values (the native type) is probably quite sane.

Science and Technology links (May 1st 2021)

  1. Growing your own food could lower your carbon footprint by 3-5%.
  2. In recent years, we have acquired the ability to measure biological age: your chronological age does not necessarily match your biological since some people age faster. We measure biological aging with gene expression. Researchers found that an eight-week program of diet, exercise, and meditation could reduce biological age by two years. The study was small and short.
  3. You may have heard that younger and older people are happier, with middle-aged people reporting less happiness. Kratz suggests that such studies might not have been methodologically sound.
  4. Our brain is not good at producing new neurons. However, we have a rather abundant supply of another brain cell, astrocytes. Researchers took astrocytes and converted them into fully functional, integrated neurons (in mice). (Source: Nature)
  5. A beer might produce between 200,000 and 2 million bubbles before going flat.
  6. We now have the technology to edit your genes, or the genes your babies. CRISPR-Cas9 allows us to edit individual genes in a cell. But editing genes might often be unnecessary for medical purposes: it might suffice to silence or express the gene. Researchers have come up with a new technique called CRISPRoff which might do just that.
  7. The USA is approving the use of drones over people and at night.
  8. Omega 3 supplements may lower inflammation and boost repair mechanisms. (Source: Nature)
  9. You may finally have an effective vaccine against Malaria.
  10. It appears that bimekizumab might be a remarkably effective drug against psoriasis or other related diseases.
  11. New anti-depressants appear to increase suicide risks among teenagers.
  12. When seeking venture capital, female and Asians entrepreneurs may have slightly more luck with investors.
  13. As your cells divide, it is believed that small mutations are introduced. Recent research suggests that even cells that never divide may mutate.
  14. Adult mammals heal from injuries by forming scars. A new drug may prevent scars. It works in mice.
  15. Worms live longer when they have less food. It appears that smelling food is enough to make this effect go away. (Source: Nature)
  16. Reducing your blood pressure is always a good thing.

Ideal divisors: when a division compiles down to just a multiplication

The division instruction is one of the most expensive instruction in your CPU. Thus optimizing compilers often compile divisions by known constants down to a multiplication followed by a shift. However, in some lucky cases, the compiler does not even need a shift. I call the corresponding divisors ideal. For the math. geeks, they are related to Fermat numbers.

For 32-bit unsigned integers, we have two such divisors (641 and 6700417). For 64-bit unsigned integers, we have two different ones (274177 and 67280421310721). They are factors for 232 + 1 and 264 + 1 respectively. They are prime numbers.

So you have that

n/274177 = ( n * 67280421310721 ) >> 64


n/67280421310721 = ( n * 274177 ) >> 64.

In these expressions, the multiplication is the full multiplication (to a 128-bit result). It looks like there is still a ‘shift’ by 64 bits, but the ‘shift’ disappears in practice after compilation.

Of course, not all compilers may be able to pull this trick, but many do. Here is the assembly code produced by GCC when compiling n/274177 and n/67280421310721 respectively for an x64 target.

        movabs  rdx, 67280421310721
        mov     rax, rdi
        mul     rdx
        mov     rax, rdx
        mov     rax, rdi
        mov     edx, 274177
        mul     rdx
        mov     rax, rdx

You get similar results with ARM. It looks like ARM works hard to build the constant, but it is mostly a distraction again.

        mov     x1, 53505
        movk    x1, 0xf19c, lsl 16
        movk    x1, 0x3d30, lsl 32
        umulh   x0, x0, x1
        mov     x1, 12033
        movk    x1, 0x4, lsl 16
        umulh   x0, x0, x1

What about remainders?

What a good compiler will do  is to first compute the quotient, and then do a multiplication and a subtraction to derive the remainder. It is the general strategy. Thus, maybe surprisingly, it is more expensive to compute a remainder than a quotient in many cases!

You can do a bit better in some cases. There is a trick from our Faster Remainder by Direct Computation paper that compilers do not know about. You can compute the remainder directly, using exactly two multiplications (and a few move instructions):

n % 274177 = (uint64_t( n * 67280421310721 ) * 274177) >> 64


n % 67280421310721 = (uint64_t( n * 274177 ) * 67280421310721) >> 64.

In other words, the following two C++ functions are strictly equivalent:

// computes n % 274177
uint64_t div1(uint64_t n) {
    return n % 274177;

// computes n % 274177
uint64_t div2(uint64_t n) {
    return (uint64_t( n * 67280421310721 ) 
              * __uint128_t(274177)) >> 64;

Though the second function is more verbose and uglier, it will typically compile to more efficient code involving just two multiplications, back to back. It may seem a lot but it is likely better than what the compiler will do.

In any case, if you are asked to pick a prime number and you expect to have to divide by it, you might consider these ideal divisors.

Further reading. Integer Division by Constants: Optimal Bounds

Some useful regular expressions for programmers

In my blog post, My programming setup, I stressed how important regular expressions are to my programming activities.

Regular expressions can look intimidating and outright ugly. However, they should not be underestimated.

Someone asked for examples of regular expressions that I rely upon. Here a few.

  1. It is commonly considered a faux pas to include ‘trailing white space’ in code. That is, your lines should end with the line-return control characters and nothing else. In a regular expression, the end of the string (or line) is marked by the ‘$’ symbol, and a white-space can be indicated with ‘\s’, and a sequence of one or more white space is ‘\s+’. Thus if I search for ‘\s+$‘, I will locate all offending lines.
  2. It is often best to avoid non-ASCII characters in source code. Indeed, in some cases, there is no standard way to tell the compiler about your character encoding, so non-ASCII might trigger problems. To check all non-ASCII characters, you may do [^\x00-\x7F].
  3. Sometimes you insert too many spaces between a variable or an operator. Multiple spaces are fine at the start of a line, since they can be used for indentation, but other repeated spaces are usually in error. You can check for them with the expression \b\s{2,}. The \b indicate a word boundary.
  4. I use spaces to indent my code, but I always use an even number of spaces (2, 4, 8, etc.). Yet I might get it wrong and insert an odd number of spaces in some places. To detect these cases, I use the expression ^(\s\s)*\s[^\s]. To delete the extra space, I can select it with look-ahead and look-behind expressions such as (?<=^(\s\s)*)\s(?=[^\s]).
  5. I do not want a space after the opening parenthesis nor before the closing parenthesis. I can check for such a case with (\(\s|\s\)). If I want to remove the spaces, I can detect them with a look-behind expression such as (?<=\()\s.
  6. Suppose that I want to identify all instances of a variable, I can search for \bmyname\b. By using word boundaries, I ensure that I do not catch instances of the string inside other functions or variable names. Similarly, if I want to select all variable that end with some expression, I can do it with an expression like \b\w*myname\b.

The great thing with regular expressions is how widely applicable they are.

Many of my examples have to do with code reformatting. Some people wonder why I do not simply use code reformatters. I do use such tools all of the time, but they are not always a good option. If you are going to work with other people who have other preferences regarding code formatting, you do not want to trigger hundreds of formatting changes just to contribute a new function. It is a major faux pas to do so. Hence you often need to keep your reformatting in check.

A trichotomy of intellectual activity

I like to separate intellectual work among three categories:

  1. Emulation: the reproduction or direct application of existing ideas. Most academic work and maybe most business work falls in this category. You seek the best ideas and you reproduce them, sometimes with minor adaptations. As argued convincing  in Zero To One by Peter Thiel, it represents the bulk of what might pass as entrepreneurial activity. In the Social Leap, von Hippel argues that our brain are so large in large part because of the need to deal with our social reality, with its incentives to follow the herd cognitively. We have a strong tendency to emulate, and it is probably a great trait. One an idea starts spreading, it keeps on spreading. Without emulation, good ideas would not spread. We have even constructed entire institutions to support emulation: schools and universities. Kuhn might have called emulation “normal science”.
  2. Free inquiry is when you set aside what people are doing and you go on your own, trying to ask new questions, find new tools, or apply tools in a new way. You deliberately avoid the taken path. We have orders of magnitude more scholars and researchers than a century ago, but who believes that we have free inquiry than in the Einstein era? In Science Is Getting Less Bang for Its Buck, Collison and Nielsen argue that science has slowed enormously per dollar or hour spent. They would have to acknowledge that the number of research papers and patents has continued on its course, growing exponentially over time. If science is slowing but the output is continuing, then it suggests that most of the work has become emulation. While our institutions like to take credit for “out of the blue” innovations, the case that, for example, the CERN is responsible for the invention of the Web is shaky at best. Rather, our institutions are good at taking credit. There is clear evidence that some societies and cultures are better at free inquiry than others. For example, Jews represent less than 0.2% of the world’s population but they have received 40% of the Nobel prizes in economics. This suggests that the rest of humanity could stand to learn a thing or two about how to have fresh ideas.
  3. Transfer: bringing abstract ideas in the real world. You may think you know how you would design a COVID-19 vaccine from a DNA data dump, but actually doing it is transfer. You take existing mature ideas and you turn them into a new product or service. While many people assume that once an idea has matured in the abstract, bringing it to bear is easy. Yet transfer is a difficult process. Kealey and Nelson found that ninety per cent of new technology arises from the industrial development of existing technology, not from academic science. Though we had remarkable success with COVID-19 vaccines, we must recognize that they were developed under special circumstances with massive investments in transfer. There are not many more therapies approved by the government every year than there were in the 1950s. In fact, there is even a decline in the number of new therapies for the worst (killer) diseases. The very fact that an emergency (COVID-19) enabled us to act much faster than would have been otherwise possible suggests that much of the ongoing research (cancer, aging, heart disease) is probably happening at a far slower rate than it could if we cared enough. It now seems possible that the technology used to produce a COVID-19 vaccine in weeks could produce cancer vaccines. Why did we need a pandemic to find out that about this great technology and what it can deliver? Meanwhile you learn that archaic paper records submitted by fax hold up real-time COVID-19 data in hospitals: it is 2021 and our hospitals are still unable to send data over the Internet. Many organizations only adopt new ideas and new technologies by emulation: they move once everyone has decided to move.

My expectation is that human beings consistently over-invest in emulation and underinvest in both free inquiry and transfer. Emulation is far easier to manage and scale than free inquiry and transfer. The benefits are most obvious. However, I expect we could evolve faster if we treated more problems the way we treat COVID-19: as true “mission critical” problems.

Science and Technology links (April 17th 2021)

    1. Moderna built their COVID 19 vaccine without having the virus on site. They viewed it as a software problem.
    2. Human and mice with red hair have elevated pain thresholds.
    3. Tumors (cancer) consume high levels of sugar. You would think that it means that cancer cells consume a lot of sugar, but it appears that it is not case. Non-cancer cells within tumors are responsible for the high intake in sugar.
    4. Exergames are physical activities that also stress our intelligence. It appears that exergames might have tangible cognitive benefits.
    5. Women are entering menopause at an ever older age. They gained a year and half since since 1960.
    6. About 2.5 billion tyrannosaurus rex lived during the 2.4 million years that the species existed.
    7. Medical doctors tend to be pessimistic and to overestimate your probability of disease, which may lead to unnecessary treatments.

How fast can you sort arrays of integers in Java?

Programming languages come with sorting functions by default. We can often do much better. For example, Downs has showed that radix sort can greatly surpass default sort functions in C++. Radix sort is you friend if you want to sort large arrays of integers.

What about Java? Richard Startin and Gareth Andrew Lloyd have been working hard to improve the sorting function used inside the RoaringBitmap library. Though we use a custom radix sort function, it is not difficult to make it more generic, so that it can sort any array of integers. I came up with the following code:

public static void radixSort(int[] data) {
  int[] copy = new int[data.length];
  int[] level0 = new int[257];
  int[] level1 = new int[257];
  int[] level2 = new int[257];
  int[] level3 = new int[257];
  for (int value : data) {
    value -= Integer.MIN_VALUE;
    level0[(value & 0xFF) + 1] ++;
    level1[((value >>> 8) & 0xFF) + 1] ++;
    level2[((value >>> 16) & 0xFF) + 1] ++;
    level3[((value >>> 24) & 0xFF) + 1] ++;
  for (int i = 1; i < level0.length; ++i) {
    level0[i] += level0[i - 1];
    level1[i] += level1[i - 1];
    level2[i] += level2[i - 1];
    level3[i] += level3[i - 1];
  for (int value : data) {
    copy[level0[(value - Integer.MIN_VALUE) & 0xFF]++] = value;
  for (int value : copy) {
    data[level1[((value - Integer.MIN_VALUE)>>>8) & 0xFF]++] 
       = value;
  for (int value : data) {
    copy[level2[((value - Integer.MIN_VALUE)>>>16) & 0xFF]++] 
       = value;
  for (int value : copy) {
    data[level3[((value - Integer.MIN_VALUE)>>>24) & 0xFF]++] 
      = value;

It is about as unsophisticated as it looks. We compute four histograms, one per byte in an integer: Java stores integers using 4-byte words. Then we do 4 passes through the data. We could make it more sophisticated by examining the histogram: if the higher-level histograms are trivial, we can skip some passes. We could extend it to Java longs though we would then need 4 extra passes. It is also possible to generalize to floating-point numbers.

The strange subtraction with MIN_VALUE are to accommodate the fact that Java has signed integers (positive and negative) under a two complement’s format.

Let us compare it against the default Arrays.sort function in Java. We want to sort 1 million integers, generated uniformly at random. Using Java 8 on an Apple M1 processor, we get that RadixSort is ten times faster than Arrays.sort.

Arrays.sort 60 ms
RadixSort 5 ms

There are some caveats. The radix sort function is likely to use more memory. Furthermore, the results are sensitive to the input data (both its size and its distribution). Nevertheless, for some systems, radix sort can be a net win.

My code is available.

My programming setup

As my GitHub profile indicates, I program almost every single working day of the year. I program in many different languages such C++, C, Go, Java, JavaScript, Python, R, Swift, Rust, C#; even though I do not master all of these languages. Some of the projects I work on can be considered “advanced programming”. Some of my code is used in production.

I was recently asked how I build code.

I do not use a debugger in my main work. I will use debuggers exceptionally for hacking code that I do not understand. Before you object, keep in mind that I am in good company: Linus Torvalds, Brian W. Kernighan, Rob Pike and Guido van Rossum have stated that they do not rely primarily (or at all) on debuggers. Stepping through code is a slow, tiring process.

My primary environnement these days is Visual Studio Code. It is great for the following reasons:

  1. It is portable. I can switch from macOS to Windows and back, and it keeps working.
  2. It is fast. I know that’s a ridiculous statement to make considering that it is written in JavaScript, but it is just very smooth.
  3. It is simple. It is only a text editor after all.
  4. It allows me to program in all of the languages I care about.
  5. It is a state-of-the-art editor. I am never frustrated by a missing feature. I very rarely encounter bugs.
  6. I use fancy regular expressions all of the time and Visual Studio Code supports every expression I can throw to it.
  7. It has a a great terminal application embedded. It is better than most of the terminal applications provided by operating systems (especially under Windows).
  8. It integrates very nicely with ssh. A lot of my work involving hacking code that resides on a server (often linux). With Visual Studio Code, I can basically edit and run code on any system where ssh lives. It is so good that I sometimes forget that I am connected to a remote server.

That is pretty much all you would see on my screen when I am programming: Visual Studio Code. I do a lot of my work in the terminal. I use git all of the time for version control. The Go and Rust tools work nicely in a shell. For C++ and C, I use CMake or simple Makefiles, again mostly with command lines. You can invoke CMake command lines under Windows too, and have the Visual Studio compiler build your code. I find that C# has really nice support for command line builds and tests. Obviously, Java works well with gradle or maven. JavaScript has node and npm. I use Swift though not for build applications so I can rely on the Swift Package Manager. My go-to scripting language is Python. From the command-line results, if there are errors, Visual Studio Code often allows me to click and get directly at the offending line.

I often program inside Docker containers and that works really well with a terminal and an editor. Docker has been a game changer for me: I am no longer limited by whatever tools I can install on my host system.

I stay away from IDEs. People keep in mind that IDEs are not at all recent. Like many people, I started programming using IDEs. I learned to program seriously with Turbo Pascal, and then I used Visual Basic, Delphi, Visual Studio, Eclipse and so forth. I am not against IDEs per se and I will happily spin up Visual Studio, Xcode or IntelliJ but I do not want my workflow to depend on a specific tool and I like to script everything away. I realize that many people want to be able to press a button that builds and test their code, but I prefer to type ctest or go test or cargo test or npm test. Importantly, I want my code to work for other people no matter what their setup is: I find it offensive to require that collaborators use the same graphical tools that I do. Furthermore, my experience has been that though learning to use the command-line tools is initially harder, it tends to pay off in the long term via better maintainability, more automation, and a deeper knowledge of the tools.

Importantly, I am not “locked in” with Visual Studio Code. I can switch to any other text editor in an instant. And I sometimes do. If I need to change a file remotely, I might often use vim.

I never use code completion. If the editor provides it, I disable it. However, I often spend a lot of time reading a project’s documentation. If the documentation is missing, I might read the source code instead.

How do I deal with bugs? As I just stated, I will almost never run through the code execution. Instead I will use one of the following strategies:

  1. In an unsafe language like C++, you can get really nasty problems if you have illegal memory accesses. If I am getting strange bugs, I might run my code with sanitizers. It is a game changer: it makes C and C++ usable languages. There are various limitations to sanitizers and they do tend to make the compile-run cycle longer, so I do not use them routinely. In fact, I build most of my code in release mode; I rarely compile in debug mode.
  2. I write a lot of tests. In fact, most of the code that I write consists of tests. Once you have extensive tests, you usually can narrow down the bugs to one piece of code. I will then just read it over carefully. If I just sit quietly and study a small enough piece of code, I can often eventually understand and fix the bug. I am probably not a better programmer than you are, but you do not write tests, then I can be certain that your code has more bugs.
  3. I can make an obsessive use of continuous integration tests. Running all tests, on all platforms, with every single code change is great. In this manner, I avoid accumulating several bugs.
  4. I write documentation. Often, merely explaining (in the code or elsewhere) what a function should do is enough to figure out the bugs.
  5. I might use logging in hard cases. That is, I print out how the data is being processed. In sufficiently complex code, it pays to insert logging instructions describing the execution of the code, these logging instructions are enabled or disabled at compile time. These logs are rarely about the state of specific variables or location in the code. Rather, they present a semantically consistent log of how the data is getting processed. In my experience, it is only needed in specific cases to study what I might call “meta-bugs”: all of your functions are bug-free, but they interact poorly due to bad assumptions.

My work is often focused on performance. I sometimes examine the assembly output from the compiler, but rarely so. You can get most of the performance with simple programming techniques. However, you need to run benchmarks. I might not know your systems better than you do, but if you never write benchmarks, then my code is probably faster.

I also do not tend to rely very much on static analysis. I do not pay much attention to compiler warnings. I also shy away from “grand theories” of programming. I will happily use functional programming or object-oriented-programming, but I will never get stuck in one approach.

I would describe my approach to programming as being mostly empirical. I write code, then I test it then I benchmark it. This works in all programming languages, from C to JavaScript.

Remark: This is how I work but not a prescription on how other people should work. It is fine that you prefer to work differently, I encourage diversity. It is not fine to act as if I am lecturing you to work a certain way.

Science and Technology links (March 27th 2021)

    1. Scientists, including climate-science researchers, often travel to faraway places for conferences. Attending a live conference is time consuming and expensive. The cost is relative: attending a $3000 conference in Hawaii is cheap for the Harvard student, but a considerably higher expense for people elsewhere in the world. The conference is also out of reach for people who need to care for young children. People often claim that the environmental cost of the travel can be offset but there is no hard evidence that it actually works. The pandemic has forced scientists to suspend their travel. Should they resume? Niner and Wassermann write:

      Avoiding international travel and associated bureaucracy, time and expense could overcome many of the historic injustices preventing many from participating in and benefiting from international conferences, and also avoid the emissions associated with international air travel. However, prior to 2020, there has been resistance to moving these events online because of the perception that the value of conferences cannot be cultivated online. (…) we conclude that holding international conferences online (..) is a significant improvement in the capacity of conferences to meet the moral imperatives of the conservation community by addressing the climate crisis and some of the systemic injustices within the field.

    2. We found a largely preserved basket that is 10,000-year old. This was shortly after the last ice age.
    3. Wearable devices can help people lose weight.
    4. Disk drives should exceed 100 TB by 2030. 100 TB is enough to store everything your eyes can see during a decade.
    5. There is such a thing as too much exercise.
    6. Students in many countries do not gain critical thinking skills in college.
    7. Your nervous system is protected by myelin. As you age, your myelin regenerate more and more poorly. It appears that this might be reversible.
    8. There are more human twins than ever.
    9. A 70-year-old Albatross has given birth. It is generally difficult to tell how old a bird is, but this particular Albatross was tagged many years ago. Like many other animal species, we believe that many sea birds experience negligible senescence which means that they do not become less fertile or more frail with time.
    10. Though you keep the same genes all your life, your genes are marked by methylation and that can change over time. In other words, your DNA is not “stateless”, it can change. Some genes can thus become silent while others can become active. Using machine learning, we can use your methylation state to determine your biological age. Researchers have reported that methylation can further predict your clinical outcome when affected by SARS-CoV-2 (COVID 19). In time, this could be used to quickly identify people most at risk for some diseases. Meanwhile, losing weight appears to lower your biological age as defined by methylation. You might become biologically younger by about a year if you lose a lot of weight.
    11. When doing resistance training (lifting weights), I have often been told to lift to failure (until you no longer can lift). It seems that it might be poor advice. Instead, you may want to tailor your training to maximize volume (total effort).
    12. A low-carbohydrate diet might help you keep your muscle mass as you grow older. It does in mice. Further, vegans may have poorer bone health as they age.

Counting cycles and instructions on the Apple M1 processor

When benchmarking software, we often start by measuring the time elapsed. If you are benchmarking data bandwidth or latency, it is the right measure. However, if you are benchmarking computational tasks where you avoid  disk and network accesses and where you only access a few pages of memory, then the time elapsed is often not ideal because it can vary too much from run to run and it provides too little information.

Most processors will adjust their frequency in response to power and thermal constraints. Thus you should generally avoid using a laptop. Yet even if you can get stable measures, it is hard to reason about your code from a time measurement. Processors operate in cycles, retiring instructions. They have branches, and sometimes they mispredict these branches. These are the measures you want!

You can, of course, translate the time in CPU cycles if you know the CPU frequency. But it might be harder than it sounds because even without physical constraints, processors can vary their frequency during a test. You can measure the CPU frequency using predictable loops. It is a little bit awkward.

Most people then go to a graphical tool like Intel VTune or Apple Instruments. These are powerful tools that can provide fancy graphical displays, run samples, record precise instruction counts and so forth. They also tend to work across a wide range of programming languages.

These graphical tools use the fact that processor vendors include “performance counters” in their silicon. You can tell precisely how many instructions were executed between two points in time.

Sadly, these tools can be difficult to tailor to your needs and to automate. Thankfully, the Linux kernel exposes performance counters, on most processors. Thus if you write code for Linux, you can rather easily query the performance counters for yourself. Thus you can put markers in your code and find out how many instructions or cycles were spent between these markers. We often refer to such code as being “instrumented”. It requires you to modify your code and it will not work in all programming languages, but it is precise and flexible. It even works under Docker if you are into containers. You may need privileged  access to use the counters. Surely you can also access the performance counters from your own program under Windows, but I never found any documentation nor any example.

My main laptop these days is a new Apple macbook with an M1 processor. This ARM processor is remarkable. In many ways, it is more advanced that comparable Intel processors. Sadly, until recently, I did not know how to instrument my code for the Apple M1.

Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, it is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.

To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.

I recommend measuring both minimal counters as well as the average counters. When the average is close to the minimum, you usually have reliable results. The maximum is less relevant in computational benchmarks. Observe that measures taken during a benchmark are not normally distributed: they are better described as following a log-normal distribution.

The core of the benchmark looks like the following C++ code:

  performance_counters agg_min{1e300};
  performance_counters agg_avg{0.0};
  for (size_t i = 0; i < repeat; i++) {
    performance_counters start = get_counters();
    performance_counters end = get_counters();
    performance_counters diff = end - start;
    agg_min = agg_min.min(diff);
    agg_avg += diff;
  agg_avg /= repeat;

Afterward, it is simply a matter of printing the results. I decided to benchmark floating-point number parsers in C++. I get the following output:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB 
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
    strtod    375.92 instructions/float (+/- 0.0 %)
                75.62 cycles/float (+/- 0.1 %)
                4.97 instructions/cycle
                88.93 branches/float (+/- 0.0 %)
                0.6083 mis. branches/float

fastfloat    162.01 instructions/float (+/- 0.0 %)
                22.01 cycles/float (+/- 0.0 %)
                7.36 instructions/cycle
                38.00 branches/float (+/- 0.0 %)
                0.0001 mis. branches/float

As you can see, I get the average  number of instructions, branches and mispredicted branches for every floating-point number. I also get the number of instructions retired per cycle. It appears that on this benchmark, the Apple M1 processor gets close to 8 instructions retired per cycle when parsing numbers with the fast_float library. That is a score far higher than anything possible on an Intel processor.

You should note how precise the results are: the minimum and the average number of cycles are almost identical. It is quite uncommon in my experience to get such consistent numbers on a laptop. But these Apple M1 systems seem to show remarkably little variation. It suggests that there is little in the way of thermal constraints. I usually avoid benchmarking on laptops, but I make an exception with these laptops.

You should be mindful that the ‘number of instructions’ is an ill-defined measure. Processors can fuse or split instructions, and they can decide to count or not the number of speculative instructions. In my particular benchmark, there are few mispredicted branches so that the difference between speculative instructions and actually retired instructions is not important.

To my knowledge, none of this performance-counter access is documented by Apple. Thus my code should be viewed with suspicion. It is possible that these numbers are not what I take them to be. However, the numbers are generally credible.

My source code is available.

Note: Though my code only works properly under the Apple M1 processor, I believe it could be fixed to support Intel processors.

Apple’s M1 processor and the full 128-bit integer product

If I multiply two 64-bit integers (having values in [0, 264)), the product requires 128 bits. Intel and AMD processors (x64) can compute the full (128-bit) product of two 64-bit integers using a single instruction (mul). ARM processors, such as those found in your mobile phone, require two instructions to achieve the same result: mul computes the least significant 64 bits while mulh computes the most significant 64 bits.

I believe that it has often meant that computing the full 128-bit product was more expensive, everything else being equal, on ARM processors than on x64 (Intel) processors. However, the instruction set does not have to determine the performance. For example, ARM processors can recognize that I am calling both instructions (mul  and mulh) and process them more efficiently. Or they may simply have very inexpensive multipliers.

To explore the problem, let us pick two pseudo-random number generators, splitmix and wyhash:

uint64_t splitmix() {
    uint64_t z = (state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
uint64_t wyhash() {
    state += 0x60bee2bee120fc15ull;
    __uint128_t tmp = (__uint128_t)(state)*0xa3b195354a39b70dull;
    uint64_t m1 = (tmp >> 64) ^ tmp;
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ull;
    return (tmp >> 64) ^ tmp;

As I reported earlier, wyhash should almost always be faster on an Intel or AMD processor as it is only two multiplications with an addition whereas the splitmix function is made of two multiplications with several other operations. However, wyhash requires two full multiplications whereas splitmix requires only two 64-bit products. If the two full multiplications in wyhash are equivalent two four multiplications, then wyhash becomes more expensive.

I wrote a small C++ benchmark to measure the time (in nanoseconds) that it takes to compute a random value using Apple’s new M1 processor (ARM, 3.2 GHz). The compiler is Apple clang version 12 which comes by default on the new Apple Silicon laptops.

Apple M1
wyhash 0.60 ns/value
splitmix 0.85 ns/value

My test suggests that it takes a bit under 3 cycles to generate a number with splitmix and a bit under 2 cycles to generate a number with wyhash. The wyhash generator is much faster than splitmix on the Apple M1 processor (by about 40% to 50%) which suggests that Apple Silicon is efficient at computing the full 128-bit product of two 64-bit integers. It would be nicer to be able to report the number of instructions and cycles, but I do not know how to instrument code under macOS for such measures.

Further reading: Apple M1 Microarchitecture Research by Dougall Johnson

Credit: Maynard Handley suggested this blog post.

Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

Update 2: An interested reader provided me with the means to instrument the code. The precise number of cycles per value is a bit over 2.8 for splitmix and exactly 2 for wyhash. Please see my repository for the corresponding code.

Appendix. Some readers are demanding assembly outputs. The splitmix function compiles to 9 assembly instructions

LBB7_2:                                 ; =>This Inner Loop Header: Depth=1
	eor	x12, x9, x9, lsr #30
	mul	x12, x12, x10
	eor	x12, x12, x12, lsr #27
	mul	x12, x12, x11
	eor	x12, x12, x12, lsr #31
	str	x12, [x0], #8
	add	x9, x9, x8
	subs	x1, x1, #1              ; =1	LBB7_2

while the wyhash function compiles to 10 assembly instructions

LBB8_2:                                 ; =>This Inner Loop Header: Depth=1
	mul	x12, x9, x10
	umulh	x13, x9, x10
	eor	x12, x13, x12
	mul	x13, x12, x11
	umulh	x12, x12, x11
	eor	x12, x12, x13
	str	x12, [x0], #8
	add	x9, x9, x8
	subs	x1, x1, #1              ; =1	LBB8_2