Science and Technology links (June 6th 2020)

  1. A small number of people are responsible for a disproportionate number of inventions and innovations. Why are these people different? Using neuroimaging techniques, scientists find that superior individuals may not have distinct processes per se, but rather they use common processes differently, in a more economical fashion:

    (…) extraordinary creative ability is not the outcome of a unique set of neurocognitive processes; rather, it is associated with the same neural mechanisms that support ordinary creativity, but to a different degree (…). Indeed, our findings would support the argument that similar creative outcomes (…) come about with a less extensive recruitment of brain networks shown to contribute to creative thought (…), which we speculate may allow eminent creators to pursue concurrently, for example, multiple lines of creative thought.

    This suggests that most of us with a healthy brain could potentially become highly creative thinkers. It would seem important to determine whether that is true.

  2. The average female mammal lives about 20% longer than the corresponding male. In human beings, women have only an 8% advantage (men have significantly shorter lives). This difference is not attributed to the riskier behavior of males. Rather, a credible explanation is that males have a weaker immune system.
  3. Sea urchins can regenerate appendages throughout their lives. They come in different subspecies, some short-lived and some long-lived. The ends of our chromosomes contain repeated sequences called telomeres. These telomeres get shorter with every cell division, unless they are replenished (e.g., via telomerase). It is often theorized that aging is characterized or explained by telomere shortening. However, in sea urchins, the telomeres do not get shorter over time because of constant telomerase activity. And yet, there are short-lived (i.e., aging) sea urchins.
  4. Scientists have created hair-bearing human skin from stem cells. Though the authors do not stress this possibility, it seems obvious that it could lead to therapies for treating hair loss in human beings:

    Moreover, we show that skin organoids form planar hair-bearing skin when grafted onto nude mice. Together, our results demonstrate that nearly complete skin can self-assemble in vitro and be used to reconstitute skin in vivo.

    This was not tested in human beings. Yet some doctors are optimistic:

    This achievement places us closer to generating a limitless supply of hair follicles that can be transplanted to the scalps of people who have thinning or no hair.

  5. The sun creates skin damage over time and contributes to a particular form of skin aging that is quite visible in some older people. The damage goes deep into the skin, which makes it challenging to treat. Scientists have used stem cells to attempt to reverse the damage. In at least some (human) patients, the results can be characterized as an extensive reversal of the damage, deep in the skin.
  6. Our cells need a compound called NAD+ to produce energy. As we age, we tend to have less and less NAD+. A particular disease called mitochondrial myopathy leads to NAD+ deficiency. Scientists found that niacin (an inexpensive supplement) was an efficient NAD+ booster in these patients. (It does not mean that you should take niacin.)

How Innovation Works (book review)

I read How Innovation Works by Matt Ridley in a few hours. It is a delicious book.

Ridley distinguishes invention from innovation. The inventor creates something new; the innovator applies the novelty to change the world. Jeff Bezos innovated with his Amazon site, but he may not be much of an inventor. Scientists often invent new “things”, but they frequently fail to innovate.

Innovation is often quite positive: Ridley credits it for much of the large gains in wealth and well-being that we have enjoyed. So what makes innovation possible, or what causes innovation?

Ridley does not know exactly. However, he identifies some conditions that favour innovation:

  • You need some freedom, the freedom to try new enterprises, the freedom to apply new ideas.
  • You need some wealth. Necessity does not reliably drive innovation according to Ridley.
  • You need the ability to make mistakes and learn quickly from them. Innovation does not happen overnight from the brain of a genius. It is a deeply iterative process.

Thus we get little innovation in the nuclear industry because it is difficult to get approval for a new nuclear power plant. And it is difficult to experiment. However, we get a lot of innovation on the Web where anyone can try to offer a new service and where it is easy to iterate quickly.

What is the role of government in innovation? It is often believed that government drives innovation. However, this belief does not match the historical record. Until sometime in the twentieth century, governments did not have any idea that they could promote innovation. Yet innovation did occur. Furthermore, governments decided to adopt a laissez-faire policy with respect to the Web, which enabled massive innovation.

When you see how much difficulty the state has even embracing innovation, it is hard to believe that it is also the source of our innovation. Ridley, like myself, is pessimistic regarding government interventions like patent protections. We get more innovation when people are free to iterate and reuse ideas. Furthermore, focusing innovators on patents takes time and money away from innovation itself.

I am also skeptical of the ability of research grants to spur innovation, even when said grants are tied to industry collaboration. I do think that governments can play a positive role, besides protecting free enterprise and free markets: governments can issue prizes and contracts. For example, NASA may not be able to innovate and get us into space cheaply. However, NASA can offer contracts to space companies. Governments could similarly encourage progress in medicine by giving prizes to the first innovators to reach certain milestones. For example, we could give 10 billion dollars to the first team to stop cognitive decline in Alzheimer’s patients, at a reasonable cost. I get the feeling that Ridley would agree.

Research grants tend to favour the incumbents. Research prizes are more likely to reward innovators.

It is true that innovators capture only a small fraction of the wealth that their innovations generate… but innovation is frequently a team sport. We may think that the Wright brothers alone invented the aeroplane and made it work, leading to all the marvellous innovations that followed… but many people had a hand in their work. They corresponded with enthusiasts who were experimenting with gliders. They had an employee design a very light and powerful engine. It is not clear that we would get more innovation by ensuring that innovators capture a larger share of the benefits, and such schemes assume that the innovators can always be fairly identified. Is the first person to patent an idea the sole innovator?

Where will innovation happen in the near future? The answer that seems obvious today, and the one that Ridley settles on, is China. China seems uniquely able to try new things as far as technology and industry go. Ridley is not pleased by this likely outcome given that China lacks democracy. Instead, Ridley believes that we could renew our habit of innovating elsewhere, if only we chose to.

The Go compiler needs to be smarter

One of my favorite languages is the Go language. I love its simplicity. It is popular and useful in a cloud setting. Many popular tools are written in Go, and for good reasons.

I gave a talk on Go last year and I was asked for a criticism of Go. I do not mind that Go lacks exceptions or generics. These features are often overrated.

However, as charming as Go might be, I find that its compiler is not on par with what I expect from other programming languages. It could be excused when Go was young and immature. But I think that the Go folks now need to tackle these issues.

My first beef with the compiler is that it is shy about inlining. Inlining is the process by which the compiler brings the body of a function into its caller, bypassing the need for a function call. Even when the function call is inexpensive, inlining often brings many benefits.

Go improved its inlining strategies over time, but I would still describe it as “overly shy”.

Let me consider the following example where I first define a function which sums the elements in an array, and then I call this function on an array I just defined:

func sum(x []uint64) uint64 {
    var sum = uint64(0)
    for _, v := range x {
        sum += v
    }
    return sum
}

func fun() uint64 {
    x := []uint64{10, 20, 30}
    return sum(x)
}

Whether you use Rust, Swift, C, C++… you expect a good optimizing compiler to basically inline the call to the ‘sum’ function and then to figure out that the answer can be determined at compile time and to optimize the ‘fun’ function to something trivial.

Not so in Go. It will construct the array and then call the ‘sum’ function.

In practice, it means that if you want good performance in Go, you often have to manually inline your functions. And I mean: fully inline. You have to write a really explicit function if you want the Go compiler to optimize the computation away, like so:

func fun3() uint64 {
    x := [3]uint64{10001, 21, 31}
    return x[0] + x[1] + x[2]
}

My second concern with the Go language is that it has no real concept of a runtime constant. That is, you have compile-time constants, but if you have a variable that is set once in the life of your program and never changes, Go will still treat it as if it could change. The compiler does not help you.

Let us take an example. Go has added nice functions that give you access to fast processor instructions. For example, most x64 processors have a popcnt instruction that gives you the number of 1-bits in a 64-bit word. It used to be that the only way to access this instruction in Go was by writing assembly, but that has been resolved. So let us put this code into action:

import "math/bits"

func silly() int {
    return  bits.OnesCount64(1) + bits.OnesCount64(2)

This function should return 2 since both values provided (1 and 2) have exactly one bit set. I bet that most C/C++ compilers can figure that one out. But we may excuse Go for not getting there.
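
For comparison, GCC and Clang fold the equivalent C++ down to a constant at compile time (a quick sketch using a compiler builtin):

int silly() {
  // both builtin calls are evaluated at compile time: this becomes 'return 2'
  return __builtin_popcountll(1) + __builtin_popcountll(2);
}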

Go needs to check, before using the popcnt instruction, that the processor supports it. When you start Go, it queries the processor and fills a variable with this knowledge. This could be done at compile-time but then your binary would crash or worse when run on a processor that does not support popcnt.

In a language with just-in-time compilation like Java or C#, the processor is detected when the code is compiled (at runtime), so no check is needed. In less fanciful languages like C or C++, programmers need to check for themselves what the processor supports.

I can excuse Go for checking that popcnt is supported each and every time that the ‘silly’ function is called. But that is not what Go does. Go checks it twice:

        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc115
        movl    $1, AX
        popcntq AX, AX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     silly_pc85
        movl    $2, CX
        popcntq CX, CX

That is because the compiler does not trust, or cannot determine, that the variable ‘runtime.x86HasPOPCNT’ is a runtime constant.

Some people will object that such checks are inexpensive. I think that this view should be challenged:

  • As is apparent in the assembly code I provide, you might be doubling or at least increasing by 50% the number of instructions required. A comparison and a jump are cheap, but so is popcnt (some processors can retire two popcnt instructions per cycle!). Increasing the number of instructions makes code slower.
  • It is true that the branch/jump is likely to be correctly predicted by the processor. This makes the guarding code much cheaper than a branch that could sometimes be mispredicted. But that does not mean that you are not getting hurt:

    Even when it is impossible to remove all branches, reducing the number of branches “almost always taken” or “almost never taken” may help the processor better predict the remaining branches.  (…) A possible simplified explanation for this phenomenon is that processors use the history of recent branches to predict future branches. Uninformative branches may reduce the ability of the processors to make good predictions.

Go’s saving grace is that it makes it easy to integrate assembly code into your code base. So you can write your performance-critical code in C, compile it, and use the resulting assembly in your Go project. That is how we do it in roaring, for example. People have ported the really fast Stream VByte encoding and the very fast simdjson parser to Go, again by using assembly. It works.

However, it leaves the bulk of the Go software running at a fraction of the performance it could reach with a great optimizing compiler.

Appendix: Compiling Go with gccgo solves these particular problems. However, reportedly, the overall performance of gccgo is worse.

Science and Technology links (May 30th 2020)

  1. We might soon be able to buy memory cards with speeds nearing 4 GB/s. For comparison, an expensive and recent MacBook currently has a disk with a 2 GB/s bandwidth. The PlayStation 5 should have a 5 GB/s bandwidth.
  2. Human industry has boosted the amount of CO2 in the atmosphere. This has two predictable outcomes: slightly higher global temperatures (CO2 has a mild greenhouse effect) and higher plant productivity (CO2 acts as a fertilizer). The CO2 fertilization effect is strong: a 30% gain since 1900 in photosynthesis efficiency. Moreover, higher plant productivity translates into more CO2 capture, which tends to reduce the quantity of CO2 in the atmosphere. Haverd et al. report that we may have underestimated the carbon sink effect of CO2 fertilization.
  3. The successful e-commerce firm, Shopify, will allow most of its employees to work remotely in the future.
  4. Human beings may have hunted mammoths by chasing them into predetermined traps.
  5. There is a theory that sending your kids to a more selective school helps them because exposure to high-achieving peers raises their level. But it seems that this peer effect is a myth. In other words, paying a lot of money to send your kids to an exclusive school is probably a waste. (It does not imply that sending your kids to a dysfunctional school is harmless.)
  6. We should apply with care the principle that extraordinary claims require extraordinary evidence. Indeed, this principle can be used to reject results that violate the current consensus and thus slow the progress of science. After all, scientific progress is often characterized by a change in the consensus as we go from one realization to another.
  7. We can prevent age-related bone losses in mice by tweaking the content of their blood plasma.
  8. Many recent advances in artificial intelligence do not hold up to scrutiny. This is why you will often hear me dismiss novelty with the phrase “originality is overrated”.

    Kolter says researchers are more motivated to produce a new algorithm and tweak it until it’s state-of-the-art than to tune an existing one. The latter can appear less novel, he notes, making it “much harder to get a paper from.”

    The net result is that researchers tend to overrate novelty and originality. In practice, you often get better results by selecting time-tested approaches and ignoring the hype.

    So, how should we read research, knowing that much of it won’t stand up to scrutiny?

    1. Do not dismiss older research merely because it is older. Do the opposite: focus your energies on older work still in use.
    2. Instead of picking up papers one by one, try to find the underlying themes. In effect, dismiss each individual paper, and instead focus on the recurring themes and effects. If an idea only appears in one paper, it probably can be discarded. If it appears again and again and proves useful, it might be worth knowing.

Mapping an interval of integers to the whole 64-bit range, fairly?

In my blog post A fast alternative to the modulo reduction, I described how one might map 64-bit values to an interval of integers (say from 0 to N) with minimal bias and without using an expensive division. All one needs to do is to compute x * N ÷ 2^64 where ‘÷’ is the integer division. A division by a power of two is just a shift. Such an approach is simple and efficient.
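
For reference, this forward map can be implemented as follows, assuming a compiler (GCC, Clang) that supports the __uint128_t extension:

#include <stdint.h>

// map a 64-bit value x to the interval [0, N) with minimal bias
uint64_t reduce(uint64_t x, uint64_t N) {
  return (uint64_t)(((__uint128_t)x * N) >> 64);
}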

Let us consider the opposite problem.

Suppose that I give you integers in the range from 0 (inclusively) to some value N (exclusively). We want to map these values to integers spanning the whole 64-bit range. Obviously, since we only have N distinct values to start with, we cannot expect our function to cover all possible 64-bit values. Nevertheless, we would like to be “as fair as possible”. Let us use only integers.

Let 2^64 ÷ N be the integer division of 2^64 by N. Let 2^64 % N be the remainder of the integer division. Because N is a constant, these are constants as well.

As a first attempt, I might try to map integer x using the function (2^64 ÷ N) * x. It works beautifully when N is a power of two, but, in general, it is a tad crude. In particular, the result of this multiplication can never exceed 2^64 – (2^64 % N), whereas it starts at value 0 when x is 0, so it is biased toward smaller values whenever 2^64 % N is non-zero (which is always true when N is not a power of two).

To compensate, we need to add a value that can be as large as 2^64 % N. Mapping integers in the interval from 0 to N to integers in the interval from 0 to 2^64 % N cannot be done with a simple multiplication. We do not want to use divisions in general because they are expensive. However, we can use a shift, that is, a division by a power of two. So let us look for a map of the form (2^64 ÷ N) * x + (u * x) ÷ 2^64 for some unknown value u. We know that x can never exceed N, but when x reaches N, the value of (u * x) ÷ 2^64 should be close to 2^64 % N. So we set (u * N) ÷ 2^64 to 2^64 % N and solve for u, which gives us that u must be at least (2^64 % N) * 2^64 ÷ N.

The values 2^64 ÷ N and (2^64 % N) * 2^64 ÷ N need to be computed just once: let us call them m and n respectively. We finally get the function m * x + (n * x) ÷ 2^64. In Python, the code might look as follows…

N = 95 # the size of the input interval [0, N); any example value will do

def high(x, y):
    return (x * y >> 64) % 2**64

def inv(x):
    n = ((2**64) % N) * 2**64 // N
    m = 2**64 // N
    return m * (x + 1) + high(x, n)

How good is this formula? To test it out, I can check how well it inverts the approach that goes in the other direction. That is, if I plug in x * N ÷ 2^64, I would hope to get back x, or something very close, for suitable values of N. That is, I want m * (x * N ÷ 2^64) + (n * (x * N ÷ 2^64)) ÷ 2^64 to be almost x. In general, I cannot hope to find x exactly because some information was lost in the process. Yet I expect a lower bound on the error of ceil(2^64 / N). I benchmarked my little routine using a probabilistic test and the results are promising. In the worst case, the error can be orders of magnitude larger than the lower bound, but most of the time it is close to the lower bound. This suggests that though my approach is not quite suitable to get back x, with a little bit of extra mathematics and a few more instructions, I could probably make it work exactly within a strict error bound.

Programming inside a container

I have a small collection of servers, laptops and desktops. My servers were purchased and configured at different times. By design, they have different hardware and software configurations. I have processors from AMD, Intel, Ampere and Rockchip. I have a wide range of Linux distributions, both old and new. I also mostly manage everything myself, with some help from our lab technician for the initial setup.

The net result is that I sometimes end up with very interesting systems that are saddled with old Linux distributions. Reinstalling Linux and keeping it safe and secure is hard. Furthermore, even if I update my Linux distributions carefully, I may end up with a different Linux distribution than my collaborators, with a different compiler and so forth. Installing multiple different compilers on the same Linux distribution is time consuming.

So what can you do instead?

I could run virtual machines. With something like VirtualBox, you can run Linux inside Windows or inside macOS. It is beautiful. But it is also slow and computationally expensive.

You can switch to containers, and Docker specifically, which have much less overhead. Docker is a ubiquitous tool in cloud computing. As a gross characterization, Docker allows you to run Linux inside Linux. It is a sandbox, but a sandbox that runs almost directly on the host. Unlike virtual machines, my tests show that on computationally intensive tasks, Docker containers run at “native speed” (bare metal). There are reports that system interaction is slower: network connections and disk access are slower. For my purposes, it is fine.

If you must, you can also run Docker under macOS and Windows, though there will then be more overhead, I expect.

The idea of a container approach is to always start from a pristine state. So you define the configuration that your database server needs to have, and you launch it, in this precise state each time. This makes your infrastructure predictable.

It is not as perfect as it sounds. You still critically depend on the quality of the container you start from. Various hacks can be necessary if you need two applications with different requirements to run together in the same image.

Still, containers work well enough that they are basically sustaining our civilization: many cloud-based applications are built on containers one way or another.

Containers were built to deploy software into production. Programming inside containers is not directly supported: you will not find much documentation about it and there is simply no business model around it. What do I mean by “programming inside containers”? I mean that I’d like to start a C programming project, decide that I will use the Linux Ubuntu 16.10 distribution, and then compile and run my code under Linux Ubuntu 16.10, even though my server might be running a totally different Linux distribution (or might be under macOS).

The first problem is that your disk and the disk of the image built from the container are distinct. A running image does not have free access to the underlying server (the host). Remember that it is a sandbox.

So you can do all of your work inside the image. However, remember that the point of container technology is to always start from a pristine state. If you load up an image, do some work, and leave… your work is gone. Images are immutable by design. It is a great thing: you cannot easily mess up an image by tinkering with it accidentally.

You can, after doing some work inside an image, take a snapshot of the new state, commit it and create a new image from which you would start again. It is complicated and not practical.

What else could you do? What you can do instead is keep the image stateless, as images are meant to be. The image will only contain the compiler and build tools. There is no reason to change any of these tools. You will have all of your code in a directory, as you would normally do. To run and compile code, you will enter the image and run your commands. You can bind the repository from the host disk to the image just as you enter it.
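
For instance, a typical bind-mount invocation might look like this (a sketch: the image name and paths are illustrative):

docker run --rm -it -v "$PWD":/workspace -w /workspace ubuntu:16.10 bash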

This works much better, but there are glitches if you issue your docker command lines directly:

  1. Depending on how Docker is configured on your machine, you may find that you are unable to read or write, from within the image, to the disk bound to it. A quick fix is to run the image with privileged access, but that is normally frowned upon (and unnecessary).
  2. The files that you create or modify from within the Docker image will appear on the host disk, often with strange file permissions. For example, maybe all of the files are owned by the root user. I had a research assistant who had a workaround: he ran Linux as root all the time. I do not recommend such a strategy.

These glitches come from the strange way in which Docker deals with permissions and security. Contrary to what you may have read, it is not a simple matter of setting user and group identifiers: it may be sufficient on some systems but not on systems supporting Security-Enhanced Linux which require additional care.

And finally, you need to remember lots of complicated commands. If you are anything like me, you would rather not have to think about Docker. You want to focus all of your attention on the code.

So the solution is to use a little script. In my case I use a bash script. You can find it on GitHub. It handles messy commands and file permissions for you.

For years, I tried to avoid having to rely on a script, but such a script is simply unavoidable if you want to work productively.

Basically, I copy two files at the root of the directory where I want to work (Dockerfile and run), and then I type:

./run bash

And that is all. I am now in a subshell, inside the host directory. I can run programs, compile them. I have complete access to a recent Ubuntu distribution. This works even under the ARM-based servers that I have.

The run script can take other commands as well, so I can use it as part of other scripts.

Science and Technology links (May 16th 2020)

  1. Most of our processors, whether in our PCs or mobile phones, are 64-bit processors. In the case of your PC, it has been so for a couple of decades. Unfortunately, we have been stuck with 32-bit operating systems for a long time. Apple has stopped supporting 32-bit applications in the most recent version of its macOS operating system and 32-bit applications have stopped running on the iPhone quite some time ago. Microsoft will now stop providing 32-bit versions of its operating system.
  2. Moths transport pollen at night and play an important role as pollinators.
  3. If you transplant fecal microbes from old to young rats, you age the young rats. We have solid evidence that blood transfusions from old mammals to young mammals will age the young recipient (Nature, 2016). The other direction, evidently more useful, is achievable in vitro, but harder to achieve in actual organisms. In what might be a breakthrough, Horvath et al. report system-wide rejuvenation in rats using plasma (blood) transfusion: the treatment more than halved the epigenetic ages of blood, heart, and liver tissue. Details are missing and we should reserve judgement. Horvath is a well-regarded scientist from UCLA.

Encoding binary in ASCII very fast

In software, we typically work with binary values. That is, we have arbitrary streams of bytes. To encode these arbitrary streams of bytes in standard formats like email, HTML, XML and JSON, we often need to convert them to a standard format like base64. You can encode and decode base64 very quickly.

But what if you do not care for standards and just want to go fast and have simple code? Maybe all you care about is that the string is ASCII. That is, it must be a stream of bytes with the most significant bit of each byte set to zero. In such a case, you want to convert any 7-byte value  into an 8-byte value, and back.

I can do it in about five instructions (not counting stores and moves) both ways in standard C: five instructions to encode, and five instructions to decode. It is less than one instruction per byte. I could not convince myself that my solution is optimal.

// converts any value in [0, 2**56) into a value that
// could pass as ASCII (every 8th bit is zero)
// can be inverted with convert_from_ascii
uint64_t convert_to_ascii(uint64_t x) {
  return ((0x2040810204081 * (x & 0x80808080808080))
         & 0xff00000000000000) +
         (x & 0x7f7f7f7f7f7f7f);
}

// converts any 8 ASCII chars into an integer in [0, 2**56),
// this inverts convert_to_ascii
uint64_t convert_from_ascii(uint64_t x) {
  return ((0x102040810204080 * (x >> 56))
         & 0x8080808080808080) +
         (x & 0xffffffffffffff);
}

Under recent Intel processors, the pdep/pext instructions may serve the same purpose. However, they are slow under AMD processors and there is no counterpart under ARM.
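
For illustration, a pdep/pext version might look as follows (function names are mine); note that it uses a different, but equally reversible, bit layout than the functions above:

#include <stdint.h>
#include <immintrin.h> // BMI2 instructions; compile with -mbmi2

// spread 56 bits over the low 7 bits of each of the 8 bytes, and back
uint64_t to_ascii_bmi2(uint64_t x) {
  return _pdep_u64(x, 0x7f7f7f7f7f7f7f7fULL);
}
uint64_t from_ascii_bmi2(uint64_t x) {
  return _pext_u64(x, 0x7f7f7f7f7f7f7f7fULL);
}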

Science and Technology links (May 2nd 2020)

  1. As we age, we tend to produce less and less NAD+, an essential chemical compound for our bodies. We can restore youthful levels of NAD+ by using freely available supplements such as NMN. It is believed that such supplements have an anti-aging or rejuvenating effect. Some scientists believe that restoring NAD+ might make the power plants (mitochondria) of your cells youthful again. A recent paper shows that we can reverse cognitive aging in mice using NMN. The treated mice were more agile, comparable to young mice. The paper suggests that it might be beneficial for some neurodegenerative diseases. (Note: I caution you that many things that work in mice may not work in human beings.)
  2. While we reject at a visceral level discrimination based on ethnicity, sexual orientation or gender, we are collectively happy to discriminate against older people. Nobody would think it acceptable to fire women when they become pregnant, but it is acceptable to terminate someone’s employment because they are too old. Professor Paul Ewart from Oxford University fought his employer in court because he was forced to leave his post. He is now 73 years old. I do not know Ewart, but he was a co-author of over 10 papers in 2019 and he still held research grants, so I take it that he is still active in research. Ewart wrote:

    It cannot be right to dismiss active physicists, or indeed any productive academic, at some arbitrary age. It is traumatic to be forcibly retired, especially when one’s work is still in full swing and there are new ideas to be explored. The “emeritus” status offered by Oxford, instead of full employment, is of no use to experimental scientists who need a research team and principal-investigator status to apply for their own research funding. It is simply ageism that underlies many of the arguments used to justify mandatory retirement. Age is often used as a proxy for competence and this lazy stereotype feeds off the myth that scientists have their best ideas when they are young. Indeed, studies have shown that a scientist’s most impactful work can occur at any stage in their career. (…) It is ageism that sees a 40-year-old as “filling” a post whereas a 65-year-old is “blocking” a post. Apart from providing the dignity and fulfilment of employment, there are general imperatives for people to work longer, including the growing pension burden and increases in life expectancy. Recent studies by the World Health Organization also highlight the physical and mental health benefits of working longer. (…) Ageism is endemic in our society and attitudes persist that would be recognized as shameful if they related to race, religion or sexual orientation. The University of Oxford’s claim that dismissing older academics is necessary to maintain its high standards is simply ageism, implying that academic performance deteriorates with age. If retirement policies are to be truly evidence-based, they need to be justified by reasoned estimates of proportionality that are consistent with the available data.

    Ewart showed, with both a model and hard data, that forcing older people to go only had marginal benefits for younger people and gender diversity. It can be fair for an employer to review the performance of its staff and to let people go when they fall below a productivity threshold. You may object that it is warranted to let 73-year-olds go since they are, maybe, less productive than younger folks. But on the same account, would we let young female professors who choose to have young children go if we could establish, statistically, that female professors with young children are less productive? Of course we would not. Our policies against older people are rooted in prejudice. On the basis of reason, we should encourage older people who are productive to keep working as long as possible. So what happened? Ewart went to court and won. How did Oxford respond? They are considering an appeal. No doubt, they will have to argue that forcing older productive people to leave is good…

  3. Economists tell us that our economies do not grow as fast as they once did and thus, we are getting richer (or less poor) at a slower rate than what might have been expected a few decades ago. There is debate regarding these numbers since it is hard to tell how much Facebook, Zoom and the iPhone XR would be worth to someone in 1980. What might be causing this stagnation? If we believe that the problem lies with economics, then financial and fiscal tools might be the solution. Ramey argues for a different cause: she suggests that we might be in a technological lull. If she is correct, then the fix might be to encourage more technological innovation rather than jump at fiscal or financial interventions.
  4. Most of the cells in our bodies can only divide so many times. This is known as the Hayflick limit and was once argued to fundamentally limit the lifespan of animals. However, we now know that stem cells, the cells that produce our specialized cells, do not normally face such a limit. The Hayflick limit comes from the fact that with each division, the tail ends of our chromosomes (the telomeres) get shorter. However, stem cells benefit from telomerase and thus may replenish their telomeres. Yet there are diseases where stem cells lose this ability to restore their telomeres. Researchers have shown that we can restore telomere lengths in such cases, in mice.
  5. As we age, we accumulate “senescent cells”. They are large and dysfunctional cells that have often reached the Hayflick limit. They should normally die, but some of them stick around. Removing them is believed to form the basis for a rejuvenation therapy. The difficulty is to find effective compounds that kill senescent cells but nothing else. A paper in Nature presents a new approach that is apparently safe and selective, in both human beings and mice. In old mice, the therapy restored physical function and reduced age-related inflammation.

For case-insensitive string comparisons, avoid char-by-char functions

Sometimes we need to compare strings in a case-insensitive manner. For example, you might want ‘abc’ and ‘ABC’ to be considered equal. It is a well-defined problem for ASCII strings. In C/C++, there are basically two common approaches. You can do whole string comparisons:

bool isequal = (strncasecmp(string1, string2, N) == 0);

Or you can do character-by-character comparisons, mapping each and every character to a lower-case version and comparing that:

bool isequal{true};
for (size_t i = 0; i < N; i++) {
    if (tolower(string1[i]) != tolower(string2[i])) {
        isequal = false;
    }
}

Intuitively, the second version is worse because it requires more code. We might also expect it to be slower. How much slower? I wrote a quick benchmark to test it out:

            Linux/GNU GCC   macOS/LLVM
strncasecmp 0.15 ns/byte    1 ns/byte
tolower     4.5 ns/byte     4.0 ns/byte

The first column of results was obtained with GNU GCC under Linux; the second on a different machine running macOS.

So for sizeable strings, the character-by-character approach might be 4 to 40 times slower! Results will vary depending on your standard library and the time of day. However, in all my tests, strncasecmp is always substantially faster. (Note: Microsoft provides similar functions under different names, see _strnicmp for example.)

Could you go faster? I tried rolling my own and it runs at about 0.3 ns/byte. So it is faster than the competition under macOS, but slower under Linux. I suspect that the standard library under Linux must rely on a vectorized implementation which might explain how it beats me by a factor of two.
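
For illustration, here is a simplified per-character sketch of such a hand-rolled routine (names are mine); a competitive version would process several bytes per iteration:

#include <stdbool.h>
#include <stddef.h>

// locale-free ASCII lower-casing: only remaps 'A'..'Z'
static inline char ascii_lower(char c) {
  return (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
}

bool ascii_equal_nocase(const char *s1, const char *s2, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (ascii_lower(s1[i]) != ascii_lower(s2[i])) {
      return false;
    }
  }
  return true;
}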

I bet that if we use vectorization, we can beat the standard libraries.

My code is available.

Sampling efficiently from groups

Suppose that you have to sample a student at random in a school. However, you cannot go into a classroom and just pick a student. All you are allowed to do is to pick a classroom, and then let the teacher pick a student at random. The student is then removed from the classroom. You may then have to pick a student at random, again, but with a classroom that is now missing a student. Importantly, I care about the order in which the students were picked. But I also have a limited memory: there are too many students, and I cannot remember them individually. I can barely remember how many students I have per classroom.

If the classrooms contain a variable number of students, you cannot just pick classrooms at random. If you do so, students in classrooms with lots of students are going to be less likely to be picked.

So instead, you can do it as follows. If there are N students in total, pick a number x between 1 and N. If the number x is smaller than or equal to the number of students in the first classroom, pick the first classroom and stop. Otherwise, if the number is smaller than or equal to the number of students in the first two classrooms, pick the second classroom, and so forth. When you remove the student, you have to account for the fact that the total (N) is reduced and that the number of students in the classroom is updated.

    while (N > 0) {
        int y = r.nextInt(N); // random integer in [0, N)
        int runningsum = 0;
        int z = 0;
        for (; z < runninghisto.length; z++) {
            runningsum += runninghisto[z];
            if (y < runningsum) { break; }
        }
        // found z: remove one student from classroom z
        runninghisto[z] -= 1;
        N -= 1;
    }

If you have many classrooms, this algorithm is inefficient because each time you pick a classroom, you may need to iterate over all of the classroom counts. Suppose you have K classrooms: you may need on the order of N K operations to pick all N students.

Importantly, the problem is dynamic: the number of students in each classroom is not fixed. You cannot simply precompute some kind of map once and for all.

Can you do better than the naive algorithm? I can craft a “logarithmic” algorithm, one where the number of operations per pick grows like the logarithm of the number of classrooms. I believe that my algorithm is efficient in practice.

So divide your classrooms into two sets, the first holding R students and the second holding S students, so that R + S = N. Pick an integer in [0, N). If it is smaller than R, pick a classroom in the first set and decrement R. If it is larger than or equal to R, then go with the second set of classrooms. We have halved the size of the problem, and we can repeat the same trick recursively: divide the two subsets of classrooms, and so forth. Instead of merely maintaining the total number of students over all of the classrooms, you keep track of the counts in a tree-like manner.

We can operationalize this strategy efficiently in software:

  • Build an array containing the number of students in each classroom.
  • Replace every second count by the sum of two counts: instead of storing the number of students in the second classroom, store the number of students in the first two classrooms.
  • Replace every fourth count by the sum of four counts: where the count of the fourth classroom was, store the sum of the number of students in the first four classrooms.
  • And so forth.

In Java, it might look like this:

int level = 0;
for (; (2 << level) <= runninghisto.length; level++) {
    for (int z = 0;
         z + (2 << level) <= runninghisto.length;
         z += (2 << level)) {
        // accumulate the count of the preceding block into this block
        runninghisto[z + (2 << level) - 1]
            += runninghisto[z + (1 << level) - 1];
    }
}
int maxlevel = level - 1; // height of the implicit tree

You effectively have a tree of prefixes which you can then use to find the right classroom without visiting all of the classroom counts. You visit a tree with a height that is the logarithm of the number of classrooms. At each level, you may have to decrement a counter.

        while (N > 0) {
            int y = r.nextInt(N); // random integer in [0, N)
            // select a classroom in logarithmic time
            level = maxlevel;
            int offset = 0;
            for (; level >= 0; level -= 1) {
                // number of students in the left half of the current range
                int left = runninghisto[offset + (1 << level) - 1];
                if (y < left) {
                    // the student is in the left half: decrement its count
                    runninghisto[offset + (1 << level) - 1] -= 1;
                } else {
                    // otherwise, skip the left half and search the right half
                    y -= left;
                    offset += (1 << level);
                }
            }
            // found the classroom at index 'offset'
            N -= 1;
        }

I have written a Java demonstration of this idea. For simplicity, I assume that the number of classrooms is a power of two, but this limitation could be lifted by doing a bit more work. Instead of N K operations, you can now do only N log K operations.

Importantly, the second approach uses no more memory than the first one. All you need is enough memory to maintain the histogram.

My Java code includes a benchmark.

Update. If you are willing to use batch processing and a little bit of extra memory, you might be able to do even better in some cases. Go back to the original algorithm. Instead of picking one integer in [0, N), pick C of them, for some number C. Then find out in which classroom each of these C integers fits, as before, decrementing the counts. If you pick the C integers in sorted order, a simple sequential scan should be enough. For C small compared to N, you can pick the C integers in sorted order in linear time using effective reservoir sampling. Then you can shuffle the result again using Knuth’s random shuffle, in linear time. So you get to do about N K / C operations. (Credit: Travis Downs and Richard Startin.)

Further reading: Hierarchical bin buffering: Online local moments for dynamic external memory arrays, ACM Transactions on Algorithms.

Follow-up: Sampling Whisky by Richard Startin

Science and Technology links (April 25th 2020)

  1. People’s muscles tend to become weaker with age, a process called sarcopenia. It appears that eating more fruits and vegetables is associated with far lower risks of sarcopenia. Unfortunately, it is merely an association: it does not follow that if you eat more fruits and vegetables, you are at lower risk for sarcopenia. We just do not know.
  2. In human beings, males are more diverse than females with respect to many attributes. Just look at height: it is common that, in a classroom, both the shortest and the tallest individuals are boys. In chimpanzees, researchers report that males have greater brain-structure diversity.
  3. Scientists have created “healing patches” for hearts. They greatly accelerate heart repair following a heart attack, in mice and pigs. The patches are made out of pig heart tissue, with the pig cells removed. They are believed to be safe and they can be frozen for later use.
  4. Many people lose some of their hair over time. About half of all men go bald over time. We do not quite understand the process. New research suggests that hair loss may result from the loss of stem cells in the skin. Stem cells are “progenitor cells” able to produce new specialized cells. The research suggests that stem cell therapies could be beneficial to fight hair loss.
  5. It is commonly believed that evolution made us as fit as possible. In this model, aging is the inevitable result of entropy. I do not believe this entropic model. In simulations, researchers have shown that a population of worms might be fitter if its individuals age and die. This should not surprise us: annual plants have evolved out of perennial plants. Your garden flowers can often easily be kept alive indefinitely even though they are annuals. Your plants age and die on schedule because it is advantageous for the species. Animals of a similar size frequently have vastly different lifespans… naked mole rats are rodents that are not subject to Gompertz curves (they are effectively ageless).

Rounding integers to even, efficiently

When dividing a numerator n by a divisor d, most programming languages round “down”. It means that 1/2 is 0. Mathematicians will insist that 1/2 is one half and claim that you are really computing floor(1/2). But let me think like a programmer. So 3/2 is 1.

If you always want to round up, you can instead compute (n + d – 1)/d. If I want to apply it to n = 3 and d = 2, I compute (3 + 2 – 1)/2 = 2.

We sometimes want to round the value to the nearest integer. So 1.4 becomes 1, 1.6 becomes 2. But what if you have 1.5 (the result of 3/2)? You can either round up or round down.

  • You can round “up” when hitting the midpoint with (n + d/2)/d.
  • You can round “down” when hitting the midpoint with (n + (d – 1)/2)/d (see the code sketch below).
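
For positive integers, these rounded divisions can be coded as follows (a small sketch collecting the formulas above):

uint64_t div_round_up(uint64_t n, uint64_t d) {
  return (n + d - 1) / d;
}
uint64_t div_nearest_ties_up(uint64_t n, uint64_t d) {
  return (n + d / 2) / d;
}
uint64_t div_nearest_ties_down(uint64_t n, uint64_t d) {
  return (n + (d - 1) / 2) / d;
}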

But there is a third common way to round. You want to round to even. It means that when you hit the midpoint, you round to the nearest even integer. So 1.5 becomes 2, 2.5 becomes 2, 3.5 becomes 4 and so forth. I am sure that it has been worked out before, but I could not find an example so I rolled my own:

uint64_t divide_round_even(uint64_t n, uint64_t d) {
  uint64_t off = n + d / 2;
  uint64_t roundup = off / d;
  bool ismultiple = ((off % d) == 0);
  bool iseven = (d & 1) == 0;
  return (ismultiple && iseven) ? roundup - (roundup & 1) : roundup;
}

Though there is a comparison and what appears like a branch, I expect most compilers to produce essentially branchless code. The result should be about five instructions on most hardware. I expect that it will be faster than converting the result to a floating-point number, rounding it up and converting the resulting floating-point number back.

I have posted working source code.

Why does it work?

Firstly, you never have to worry about hitting the midpoint if the divisor d is odd: in that case, n/d can never fall exactly halfway between two integers. Thus, starting from the round-up approach (n + d/2)/d, we only need to correct the result. When n is equal to d + d/2, we need to round up; when it is equal to 2d + d/2, we need to round down; and so forth. So we only need to remove 1 from (n + d/2)/d when n is equal to 2kd + d/2 for some integer k. But when n is equal to 2kd + d/2, I have that n + d/2 is equal to (2k + 1)d. That is, the quotient is odd.

Even better: You can make it explicitly branchless (credit: Falk Hüffner):

uint64_t divide_round_even_branchless(uint64_t n, uint64_t d) {
  uint64_t off = n + d / 2;
  uint64_t roundup = off / d;
  uint64_t ismultiple = ((off % d) == 0);
  return (d | (roundup ^ ismultiple)) & roundup;
}

Nitpicking: You may be concerned with overflows. Indeed, if both n and d are large, it is possible for n + d/2 to exceed the allowable range. However, the above functions are used in the Linux kernel. And you can make them more robust at the expense of a little bit more work.

Credit: This blog post was motivated by an email exchange with Colin Bartlett.

Science and Technology links (April 11th 2020)

  1. Greenland sharks reach their sexual maturity when they are 150 years old and they live hundreds of years. Some sharks alive today were born in the 16th century.
  2. Only 5% of the general population rates themselves as below average in intelligence. And they overestimate the intelligence of their mate even more.
  3. Fresh soil has a distinctive smell which may be caused by bacteria that are trying to attract specific animals.
  4. Earth’s mantle is kept warm due to radioactivity. So we are basically living on a warm radioactive rock.
  5. Employees who work from home can be noticeably more productive. My own estimate is that I am about 20% more productive when working from home. However, social interactions are easier in person.
  6. We had the coldest Arctic winter since 1979. It caused the first genuine ozone hole in the Arctic. The ozone layer protects the Earth from radiation.
  7. According to a recent research paper in Nature, fermented foods (cheese, butter, alcoholic beverages, yogurt) as well as fibers may have anti-obesity properties due to the production of short-chain fatty acids (acetate, propionate, and butyrate). Propionate is also used as a food preservative, besides being produced by our own gut bacteria when we digest fibers, or by bacteria in the process of fermentation. There is much evidence that propionate may be helpful for obesity: Lin et al. (2012), Chambers et al. (2019), Arora et al. (2011). In contrast, a widely reported previous work by Tirosh et al. suggests that propionate leads to weight gain in mice and might promote obesity and metabolic abnormalities. The press reported the idea that propionate could make us fat by presenting it as a food preservative.

    My take? The science is probably “complicated” and we do not have all the answers. This being said, while it is true that there may be propionate in your bread and pizza, to prevent it from moulding, and while it is also true that pizza and bread may make you fat, I’d worry a lot more about the calories in bread and pizza rather than the effect of propionate itself.

  8. The current pandemic might be the start of a revolution in peer review.

Multiplying backward for profit

Most programming languages have integer types with arithmetic operations like multiplications, additions and so forth. Our main processors support 64-bit integers, which means that you can deal with rather large integers. However, you cannot represent everything with a 64-bit integer. What if you need to multiply 5^100 by 37? Programming languages like Python or JavaScript will gracefully handle it. In programming languages like C/C++, you will need to use a library. In Java, you will have to use special classes from the standard library.

You will almost never need to deal with integers that do not fit in 64 bits, unless you are doing cryptography or some esoteric numerical analysis. Nevertheless, it does happen that you need to work with very large integers.

How does the product of 5^100 by 37 get computed if processors are limited to 64-bit words? Underneath, your software represents such large integers using multiple 64-bit words. The standard approach computes the product starting from the least significant bits. Thus if you are multiplying an integer that requires a single machine word (w) with an integer that requires n machine words, you will use n multiplications, starting with a multiplication between the word w and the least significant word of the other integer, going up to the most significant words. The code in C/C++ might look like so:

// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[0]); // 64-bit x 64-bit => 128-bit
output[0] = p.low;   // least significant 64 bits
uint64_t r = p.high; // most significant 64 bits
for (i = 1; i < n; i++) {
  p = full_multiplication(w, b[i]);
  p.low += r; // add r, with carry
  if (p.low < r) p.high++;
  output[i] = p.low;
  r = p.high;
}
output[n] = r;

It is akin to how we are taught to multiply in school: start with the least significant digits, and go up. I am assuming that all integers are positive: it is not difficult to generalize the problem to include signed integers.

So far, it is quite boring. So let me add a twist.

You can also multiply backward, starting from the most significant words. Though you might think it would be less efficient, you can still do the same product using the same n multiplications. The code is going to be a bit more complicated because you have to carry the overflow that you may encounter in the less significant words upward. Nevertheless, you can implement it in C/C++ using only a few extra lines of code. According to my hasty tests, it is only marginally slower (by about 20%).

// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[n - 1]); // 64-bit x 64-bit => 128-bit
output[n - 1] = p.low; // least significant 64 bits of this product
output[n] = p.high;    // most significant 64 bits
for (i = n - 2; i >= 0; i--) {
  p = full_multiplication(w, b[i]);
  output[i] = p.low;
  // check for overflow when adding the high part to the word above
  bool overflow = (output[i + 1] + p.high < output[i + 1]);
  output[i + 1] += p.high;
  for (size_t j = i + 2; overflow; j++) {
    output[j]++; // propagate the carry
    overflow = (output[j] == 0);
  }
}

Why would computing the multiplication backward ever be useful?

Suppose that you are not interested in the full product. Maybe you just need to check whether the result in within some bounds, or maybe you just need a good approximation of the product. Then starting from the most significant bits could be helpful if you can stop the computation after you have enough significant bits.

It turns out that you can do so, efficiently. You can compute the most significant k words using no more than an expected k + 0.5 multiplications. Furthermore, if you are careful, you can later resume the computation and complete it.

At each step, after doing k multiplications and computing k + 1 words, these k + 1 most significant words possibly underestimate the true k + 1 most significant words, because we have not added the carry from the less significant multiplications. However, we can bound the value of the carry: it is strictly less than w. To see that it must be so, let r be the number of remaining words in the multiword integer that we have not yet multiplied by w. The maximal value of these r words is 2^(64r) – 1. So we are going to add, at most, (2^(64r) – 1) * w to our partially computed integer: the overflow (carry) above these r words is at most ((2^(64r) – 1) * w) / 2^(64r), a value strictly less than w.

This means that if we add w – 1 to the least significant of the k + 1 words and it does not overflow, then we know that the k most significant words are exact, and we can stop. This might happen, roughly, half the time, assuming that we are dealing with random inputs. When it does overflow, you can just continue and compute one more product. If, again, adding w – 1 to the least significant word you computed does not create an overflow, you can stop, confident that the k + 1 most significant words are exact.

However, you then gain another, more powerful stopping condition: if the second least significant word is not exactly 2^64 – 1, then you can also stop, confident that the k most significant words are exact, because adding w to the least significant word can, at most, translate into a carry of +1 to the second least significant word. Because it is quite unlikely that you will end up with exactly the value 2^64 – 1, we know that, probabilistically, you will not need more than k + 1 multiplications to compute exactly the k most significant words. And quite often, you will be able to stop after k multiplications.
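
The first stopping test is cheap to implement. A sketch, where ‘low’ is the least significant word computed so far:

#include <stdbool.h>
#include <stdint.h>

// the carry from the words not yet multiplied is strictly less than w,
// so if adding w - 1 to 'low' does not wrap around, the upper words are exact
static inline bool can_stop(uint64_t low, uint64_t w) {
  return low + (w - 1) >= low;
}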

My code implementing these ideas is a bit too long for a blog post, but I have published it as its own GitHub repository, so you can check it out for yourself. It comes complete with tests and benchmarks.

I have restricted my presentation to the case where at least one integer fits in a single word. Generalizing the problem to the case where you have two multiword integers is a fun problem. If you have ideas about it, I’d love to chat.

Implementation notes: The code is mostly standard C++. The main difficulty is to compute the full multiplication, which takes two 64-bit words and generates two 64-bit words representing the 128-bit product. To my knowledge, there is no standard way to do it in C/C++, but most compilers offer you some way to do it. At the CPU level, computing the full product is always supported (64-bit ARM and x64) with efficient instructions.
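
For instance, with GCC or Clang, one can implement full_multiplication using the __uint128_t extension (a sketch; the struct name is mine):

#include <stdint.h>

typedef struct {
  uint64_t low;  // least significant 64 bits
  uint64_t high; // most significant 64 bits
} value128;

value128 full_multiplication(uint64_t a, uint64_t b) {
  __uint128_t product = (__uint128_t)a * b;
  value128 answer = {(uint64_t)product, (uint64_t)(product >> 64)};
  return answer;
}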

Credit: This work is inspired by notes and code from Michael Eisel.

We released simdjson 0.3: the fastest JSON parser in the world is even better!

Last year (2019), we released the simdjson library. It is a C++ library available under a liberal license (Apache) that can parse JSON documents very fast. How fast? We reach and exceed 3 gigabytes per second in many instances. It can also parse millions of small JSON documents per second.

The new version is much faster. How much faster? Last year, we could parse a file like twitter.json at a speed of 2.0 GB/s, and then we reached 2.2 GB/s. We are now reaching 2.5 GB/s. Why go so fast? In comparison, a fast disk can reach 5 GB/s and the best network adapters are even faster.

The following plot presents the 2020 simdjson library (version 0.3) compared with the fastest standard-compliant C++ JSON parsers (RapidJSON and sajson). It ran on a single Intel Skylake core, and the code was compiled with the GNU GCC 9 compiler. All tests are reproducible using Docker containers.

In this plot, RapidJSON and simdjson have exact number parsing, while RapidJSON (fast float) and sajson use approximate number parsing. Furthermore, sajson has only partial Unicode validation whereas the other parsers offer exact encoding (UTF-8) validation.

If we had only improved the performance, it would already be amazing. But our new release packs a whole lot of improvements:

  1. Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually.
  2. Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. In the past, using simdjson was a bit of a chore; the new approach is decidedly modern. See for yourself:
    auto cars_json = R"( [
      { "make": "Toyota", "model": "Camry",  "year": 2018, 
           "tire_pressure": [ 40.1, 39.9 ] },
      { "make": "Kia",    "model": "Soul",   "year": 2012, 
           "tire_pressure": [ 30.1, 31.0 ] },
      { "make": "Toyota", "model": "Tercel", "year": 1999, 
           "tire_pressure": [ 29.8, 30.0 ] }
    ] )"_padded;
    dom::parser parser;
    dom::array cars = parser.parse(cars_json).get<dom::array>();
    // Iterating through an array of objects
    for (dom::object car : cars) {
      // Accessing a field by name
      cout << "Make/Model: " << car["make"] 
               << "/" << car["model"] << endl;
      // Casting a JSON element to an integer
      uint64_t year = car["year"];
      cout << "- This car is " << 2020 - year 
               << "years old." << endl;
      // Iterating through an array of floats
      double total_tire_pressure = 0;
      for (double tire_pressure : car["tire_pressure"]) {
        total_tire_pressure += tire_pressure;
      }
      cout << "- Average tire pressure: " 
          << (total_tire_pressure / 2) << endl;
      // Writing out all the information about the car
      for (auto [key, value] : car) {
        cout << "- " << key << ": " << value << endl;
      }
    }
  3. Exact Float Parsing: simdjson parses floats flawlessly at high speed.
  4. Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines. This means that you no longer need to check whether the system supports simdjson.
  5. Automatic allocation: as part of API simplification, the parser no longer has to be preallocated: it will adjust automatically when it encounters larger files.
  6. Runtime selection API: We have exposed simdjson’s runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
  7. Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to the value you need, and exceptions will be thrown automatically if an error happens. Use the same API either way! (A small example follows this list.)
  8. Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
  9. We now have a dedicated web site (https://simdjson.org) in addition to the GitHub site (https://github.com/simdjson/simdjson).
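To illustrate the two error-handling styles (items 7 and 8), here is a small program. I am assuming the 0.3-era DOM API used in the snippet above, where a simdjson_result can be unpacked into a value and an error code:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;

int main() {
  dom::parser parser;
  // Error-code style: unpack the result and check before using the value.
  auto [doc, error] = parser.parse("[1.5, 2.5]"_padded);
  if (error) { std::cerr << error << std::endl; return 1; }
  std::cout << doc << std::endl; // prints the parsed JSON back
  // Exception style: assign straight to the type you want; a
  // simdjson_error exception is thrown if anything goes wrong.
  dom::array values = parser.parse("[1.5, 2.5]"_padded).get<dom::array>();
  for (double value : values) { std::cout << value << std::endl; }
  return 0;
}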

Credit: many people contributed to simdjson, but John Keiser played a substantial role worthy of mention.

Science and Technology links (March 28th 2020)

    1. In a laboratory, we know how to turn any of our cells into youthful stem cells using something called the Yamanaka factors. If you expose cells to such factors for a short time, they appear to remain functional, specialized cells, but they become more youthful. Researchers demonstrated this using cartilage cells from elderly patients suffering from osteoarthritis. They also rejuvenated muscle cells. It now remains to do the same in live human beings.
    2. It has been widely reported that artificial intelligence, and specifically deep learning, can match or surpass clinicians in medical imaging tasks. Unfortunately, it appears that this is far from demonstrated with the necessary rigor:

      Few prospective deep learning studies and randomised trials exist in medical imaging. Most non-randomised trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small.

    3. Apple released its latest tablet (the iPad Pro) with an integrated LiDAR that can accurately map your environment at up to 5 meters of distance. A LiDAR is basically a laser radar. It was famously used by NASA to map the lunar surface in the 1970s, but twenty years ago it remained out of reach to most of us: reserved for military and highly specialized applications.
    4. Japan and Korea have more than 13 hospital beds per 1000 people; Spain, Italy, and the U.S. have about 3 beds per 1000 people.
    5. Due to a worldwide pandemic, we are running the largest distance-learning experiment in history. Countless teachers worldwide are teaching online for the first time.
    6. Modern storage devices (such as USB drives) might be lighter when they are filled with data than when they are empty.
    7. Our smartphones will soon move from 4G networks to 5G networks. The latter are much faster. However, they cause the phones to consume 20% more energy according to some report.
    8. A few decades ago, most of us had computers with a single processor. Over time we acquired processors with two processor cores, each core acting as a processor. Then we got four cores with some cores able to act as if they are made of two or four “logical” processors. The next gaming consoles (e.g., the PS5) will have main CPUs with eight processor cores. It is not difficult to find servers that have well over a hundred logical processors. Yet it appears that Windows was limited to at most 64 logical processors, possibly because the architects of Windows never imagined that we would soon reach this limit.

Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

Memory is organized in cache lines, frequently blocks of 64 bytes. On Intel and AMD processors, you can store and load memory in blocks of various sizes, such as 64 bits, 128 bits or 256 bits.

In the old days, and on some limited devices today, reading and storing to memory required you to respect alignment. You could not simply write any block of memory anywhere. Today you mostly can write wherever you like. There is a small penalty for misalignment, but the penalty is typically under 10% as I argued in my 2012 post Data alignment for speed: myth or reality?

Yet writing or reading from two cache lines (what Intel calls a cache line split) instead of a single one is likely to be more expensive at least some of the time. Let us explore an interesting scenario. It is sometimes possible to avoid crossing a cache line boundary by doing two memory accesses instead of a single large one. Is it worth it?

Cache lines in memory are aligned on addresses that are divisible by 64 bytes. Suppose that you want to store 256 bits (32 bytes) of data every 64 bytes, at just the right offset so that the 256 bits overlap two cache lines. You hit the last 16 bytes of one cache line and the first 16 bytes of the next one. You can achieve the desired result by starting with an offset of 48 bytes. That is, you find a memory address that is divisible by 64 bytes, and then you add 48 bytes.
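Concretely, you might get such a pointer as follows (a small sketch; note that std::aligned_alloc requires the size to be a multiple of the alignment):

#include <cstdlib>

// Get a buffer aligned on a 64-byte (cache line) boundary, then move
// 48 bytes in so that each 32-byte store straddles two cache lines.
char *buffer = static_cast<char *>(std::aligned_alloc(64, 1 << 20));
char *p = buffer + 48;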

In code, using Intel intrinsics, it looks as follows:

char * p = ...
for (size_t i = 0; i < ... ; i++) {
  _mm256_storeu_si256((__m256i *)(p + i * 64), vec);
}
You can avoid crossing the cache line boundary entirely by first storing 128 bits of data at the 48-byte offset, and then storing another 128 bits of data. The first store is at the end of the first cache line and the second store is at the beginning of the second one.

char * p = ...
for (size_t i = 0; i < ... ; i++) {
  _mm_storeu_si128((__m128i *)(p + i * 64), vec);
  _mm_storeu_si128((__m128i *)(p + i * 64 + 16), vec);
}

How do these two approaches fare? I wrote a simple benchmark that stores many blocks of 256 bits at a 48-byte offset. It either stores them in one 256-bit step or in two 128-bit steps. I record the number of cycles per iteration on an AMD Rome processor. I rely on the pre-installed RedHat LLVM compiler (clang version 3.4.2).

A single 256-bit write   2.33 cycles
Two 128-bit writes       2.08 cycles

It is a gain of slightly over 10% for the two 128-bit writes. What if you remove the 48-byte offset (or set it to zero)? Then both benchmarks clock at 2.08 cycles per iteration. I expect that the 48-byte offset is a best-case scenario for the two 128-bit writes: if you change it, then both approaches have the same cache-line overlap problems. So this 10% gain requires you to choose the alignment carefully.

My source code is available. Of course, your results will depend on the processor and to some extent on the compiler. You should be able to run my benchmark program on your own Linux x64 system.

Be mindful that if you are getting worse results on a per cycle basis on comparable hardware, you might be limited by your compiler. An analysis of the assembly might be required.

Further reading: Travis Downs has an interesting complementary analysis. He finds that unaligned stores crossing a 32-byte boundary can be tremendously expensive (i.e., 5 cycles) on the type of processor I am using for these tests (Zen 2). The 32-byte boundary exists irrespective of cache lines. Meanwhile, he finds that stores aligned exactly on a 32-byte boundary are much cheaper (1 cycle).

Number of atoms in the universe versus floating-point values

It is estimated that there are about 10^80 atoms in the universe. The estimate for the total number of electrons is similar.

It is a huge number and it far exceeds the maximal value of a single-precision floating-point type in our current computers (which is about 10^38).

Yet the maximal value that we can represent using the common double-precision floating-point type is larger than 10^308. It is an unimaginably large number. There will never be any piece of engineering involving as many as 10^308 parts.
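You can query these limits directly in C++; a minimal check:

#include <iostream>
#include <limits>

int main() {
  // prints about 3.4e38 (single precision) and 1.8e308 (double precision)
  std::cout << std::numeric_limits<float>::max() << std::endl;
  std::cout << std::numeric_limits<double>::max() << std::endl;
  return 0;
}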

Using a double-precision floating-point value, we can easily represent the number of atoms in the universe. We could also represent the number of ways you can pick any three individual atoms at random in the universe: that is roughly (10^80)^3 = 10^240, still far below 10^308.

If your software ever produces a number so large that it will not fit in a double-precision floating-point value, chances are good that you have a bug.

Further reading: Lloyd N. Trefethen, Numerical Analysis, Princeton Companion to Mathematics, 2008