Science and Technology links (May 11th 2019)

  1. Bone marrow transplanted from young mice to very old (almost dying) mice extended the life of the old mice by 30%. The authors conclude that bone-marrow transplantation affects the intrinsic aging mechanism.
  2. Artery calcification is an easily diagnosed condition which predicts cardiovascular diseases. In animal studies, vitamin K2 supplement reduced artery calcification. There is an ongoing clinical trial to test whether the same occurs in human beings. The study should conclude later this year.
  3. Ovarian tissues can be removed and frozen, and then reinserted into a women’s body, allowing her to become pregnant. Though it is not simple, it seems to work.

Almost picking N distinct numbers at random

In Picking N distinct numbers at random: how to do it fast?, I describe how to quickly pick N distinct integer values are random from a range of integer values. It comes down to using either bitset/bitmap or a hash set.

The bitset approach is very fast when you need to pick many integer values out of a small range. The hash set approach is very fast if you need to pick very few values, irrespective of the range of values.

What about the middle ground? What if you need to pick lots of integer values from an even greater range?

Because N is large, you may not care to get exactly N values. That is, if you need to pick 100 million integer values at random, it might be fine to pick 99,999,999 integer values.

What can you do?

  1. Fill an array with N randomly generated integer values (using a uniform distribution).
  2. Sort the array.
  3. Remove duplicates.

That is pretty good, but the sort function could be expensive if N is large: it is O(N log N), after all.

Assuming that there are no duplicates, can we model this using probabilities? What is the distribution corresponding to the first value? We have N values picked out of a large range. So the probability that any value has been picked is N over the range. We recognize the geometric distribution. Once you have found the first value, you can repeat this same reasoning except that we now have N-1 values over a somewhat restricted range (because we generate them in sorted order).

  1. Generate a value over the range R using a geometric distribution with probability N/R.
  2. Append the value to my output array.
  3. Reduce the range R with the constraint that all future values must be larger than the last value appended to the output array.
  4. Decrement N by one.
  5. Repeat until N is 0.

You can use the fact that we can cheaply generate numbers according to a geometric distribution:

floor(log(random_number_in_0_1()) /log(1-p));

All you need is a way to generate random numbers in the unit interval [0,1] but that is easy. In C++ and many other programming languages, you have builtin support for geometric distributions.

The net result is an O(N) algorithm to pick N values at random over a range.

There is a catch, however. My model is not quite correct. For one thing, we do not quite have a geometric distribution: it is only valid if the range is very, very large. This manifests itself by the fact that the values I generate may exceed the range (a geometric distribution is unbounded). We can patch things up by stopping the algorithm once a value exceeds the range or some other anomaly occurs.

So I ran a benchmark where I have to pick 100,000,000 values among all integers smaller than 40,000,000,000. I get that the time per value generated is about half using the geometric-distribution approach:

sort-based 170 ns
geometric 80 ns

For larger arrays, I can achieve 3x to 4x gains but then my software runs out of memory.

My code is available.

What else could you do? Another practical approach would be to divide up the range into many small subranges and to use the fact that the number of values within each subrange follows a binomial distribution (which can be approximated by a normal distribution), to do a divide-and-conquer approach: instead of having a pick many values in a large range problem, we would have several small “pick few values into a small range” problems. For each small problem, you can afford to do a sort-based approach since sorting small arrays is fast.

Science and Technology links (May 4th 2019)

  1. It is often believed that colleges help class mobility… if you were born poor, college can help your rise up. So is there more class mobility when more people go to college? Maybe not:

    (…) researchers have characterized a college degree as a great equalizer leveling the playing field, and proposed that expanding higher education would promote mobility. This line of reasoning rests on the implicit assumption that the relatively high mobility observed among college graduates reflects a causal effect of college completion on intergenerational mobility, an assumption that has rarely been rigorously evaluated (…) I find that once selection processes are adjusted for, intergenerational income mobility among college graduates is very close to that among non-graduates.

  2. How many people does it take to form a group? The answer appears to be “at least 5”.
  3. The productivity of a computer science professor is not sensitive to where they got his PhD, but it is correlated with their place of employment.
  4. If you have ever taken management classes, you know about Maslow’s pyramid of human needs. However, Maslow never proposed such a pyramid.
  5. Twenty years ago, Microsoft introduced a mouse that did away with the mechanical ball and replaced with with a digital camera and lights.
  6. Rats were given olive oil, sunflower oil or fish oil. Rats consuming sunflower oil have a shorter lifespan.
  7. Seed oils are a source of Omega 6. It appears that keeping your consumption of Omega 6 low (compared to your Omega 3 intake) is important to reduce cardiovascular events.
  8. Shifting our energy production from coal to gas would allow us to meet our climate stabilization targets.
  9. Many studies find that red meat is associated with higher cancer rates. It is not known whether it is a mere correlation or a causal factor. However, if you narrow down your investigation to Asia, the correlation disappears. The association between red meat and cancer is apparently specific to the Western civilization.
  10. It appears that 60% of all bird species come from Australia.
  11. A new Alzheimer’s-like disease has been identified (it is called “LATE”).

Really fast bitset decoding for “average” densities

Suppose I give you a word and you need to determine the location of the 1-bits. For example, given the word 0b100011001, you would like to get 0,3,4,8.

You could check the value of each bit, but that would take too long. A better approach is use the fact that modern processors have fast instructions to count the number of “trailing zeros” (on x64 processors, you have tzcnt). Given 0b100011001, this instruction would give you 0. Then you if you set this first bit to zero (getting 0b100011000), the trailing-zero instruction gives you 3, and so forth. Conveniently enough, many processors can set the least significant 1-bit to zero using a single instruction (blsr); you can implement the desired operation in most programming languages like C as a bitwise AND: word & (word - 1).

Thus, the following loop should suffice and it is quite efficient…

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);

How efficient is it exactly?

To answer this question, we first need to better define the problem. If the words you are receiving have few 1-bits (say less than one 1-bit per 64-bit words), then you have the sparse regime, and it becomes important to detect quickly zero inputs, for example. If half of your bits are set, you have the dense regime and it is best handled using using vectorization and lookup tables.

But what do you do when your input data is neither really sparse (that is, you almost never have zero inputs) nor really dense (that is, most of your bits are set to zero)? In such cases, the fact that the instructions in your loop are efficient does not help you as much as you’d like because you have another problem: almost every word will result in at least one mispredicted branch. That is, your processor has a hard time predicting when the loop will stop. This prevent your processor from doing a good job retiring instructions.

You can try to have fewer branches at the expense of more instructions:

  while (word != 0) {
    result[i] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+1] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+2] = trailingzeroes(word);
    word = word & (word - 1);
    result[i+3] = trailingzeroes(word);
    word = word & (word - 1);

The downside of this approach is that you need an extra step to count how many 1-bit there are in your words. Thankfully, it is a cheap operation that can be resolved with a single instruction on x64 processors.

This ‘unrolled’ approach can void more than half of the mispredicted branches, at the expense of a few fast instructions. It results in a substantial reduction in the number of CPU cycles elapsed (GNU GCC 8, Skylake processor):

cycles / 1-bit instructions / 1-bit branch misses / word
conventional 4.7 8.2 0.68
fast 3.4 8.2 0.41

So we save about 1.3 cycles per 1-bit with the fast approach. Can the mispredicted branches explain this gain? There about 6 bits set per input word, so the number of mispredicted branches per 1-bit is either 0.15 or 0.065. If you multiply these fractions by 15 cycles (on the assumption that each mispredicted branch costs 15 cycles), you get 2.25 cycles and 1 cycles; or a difference of 1.25 cycles. It does seem credible that the mispredicted branches are an important factor.

I offer my source code, it runs under Linux.

We use this decoding approach in simdjson.

How close are we to the optimal scenario? We are using one instruction per 1-bit to count the number of trailing zeros, one instruction to zero the least significant 1-bit, one instruction to advance a pointer where we write, one store instruction. Let us say about 5 instructions. We are getting 9.8 instructions. So we probably cannot reduce the instruction count by most than a factor of two without using a different algorithmic approach.

Still, I expect that further gains are possible, maybe you can go faster by a factor of two or so.

Futher reading: Parsing Gigabytes of JSON per Second and Bits to indexes in BMI2 and AVX-512.

Credit: Joint work with Geoff Langdale. He has a blog.

Science and Technology links (April 27th 2019)

  1. Women who use oral contraceptives have a harder time recognizing emotions of others.
  2. Worldwide, livestock has ten times the mass of wild animals and nearly twice the mass of human beings. Fishes have more than ten times the mass of human beings.
  3. Using software, we can map brain activity to speech so that listeners can easily identify words (source: Nature).
  4. Aging is associated with a reduction in tissue NAD levels. In turn, this is believed to be associated with physical decline. It is not entirely clear what makes the NAD level decline, but it seems that it might be caused by the accumulation of senescent cells. As we age, we tend to accumulate these non-functional and slightly harmful “dead” cells called “senescent cells”. We now have drugs called senolytics that can remove some of these senescent cells, with at least one ongoing clinical trial.

Speeding up a random-access function?

A common problem in software performance is that you are essentially limited by memory access. Let us consider such a function where you write at random locations in a big array.

 for ( i = 0; i < N; i++) {
    // hash is a randomized hash function
    bigarray[hash(i)] = i; 

This is a good model for how one might construct a large hash table, for example.

It can be terribly slow if the big array is really large because each and every access is likely to be an expensive cache miss.

Can you speed up this function?

It is difficult, but you might be able to accelerate it a bit, or maybe more than a bit. However, it will involve doing extra work.

Here is a strategy which works, if you do it just right. Divide your big array into regions. For each region, create a stack. Instead of writing directly to the big array, when you are given a hash value, locate the corresponding stack, and append the hash value to it. Then, later, go through the stacks and apply them to the big array.

 for ( i = 0; i < N; i++) {
    loc = hash(i)
    add loc, i to buffer[loc / bucketsize]
 for each buffer {
   for each loc,i in buffer
     bigarray[loc] = i

It should be clear that this second strategy is likely to save some expensive cache misses. Indeed, during the first phase, we append to only a few stacks: the top of each stack is likely to be in cache because we have few stacks. Then when you unwind the stacks, you are writing in random order, but within a small region of the big array.

This is a standard “external memory” algorithm: people used to design a lot of these algorithms when everything was on disk and when disks were really slow.

So how well do I do? Here are my results on a Skylake processor using GNU GCC 8 (with -O3 -march=native, THP set to madvise).

cycles/access instructions/access
standard 57 13
buffered 45 36

So while the buffered version I coded uses three times as many instructions, and while it needs to allocate a large buffer, it still comes up on top.

My code is available.

The shopper’s dilemma: wait for new technology or buy now?

Technology is accelerating. It took less than a decade for smartphone to go from 1% of the population to almost everyone. Television took longer. The phone even longer.

Anyone who has been in the market for a technology product knows about what I call the “shopper’s dilemma”. Should you buy the current iPhone or wait another six months for an even better iPhone?

It sounds like a form of interest. You either take $1000 to buy the current model, or hold on to your $1000 and buy a much better model in six months. However, there is no free lunch: by waiting you lose access to the current product for six months.

The shopper’s dilemma also applies more broadly.

Consider medical therapies. You could have eye surgery today for a good improvement in your eyesight, or wait in five years for much better surgery giving you great eyesight. Should you wait or should you take whatever is available today? If you are sick and badly in need of treatment, there is no choice. But sometimes you can afford to wait.

The shopper’s dilemma becomes increasingly more challenging as technology accelerates. Its effect is more and more important. The variance also increases: some progress comes suddenly while unexpected setbacks delay long-promised breakthroughs.

How do different people behave when faced with this dilemma?

  1. Not everyone is aware of the rate of progress. Some people are pessimistic. These people will tend to favour buying now. They are betting against the future. Somewhat ironically, this means that if you work in marketing, you should probably avoid the topic of “progress”.
  2. Technophiles, or people who follow closely technology, should favour delaying their acquisitions. They are betting on the future being better. I conjecture that they might be more likely to delay purchases or therapies.

It seems that I have a testable conjecture. It should be easy to test?

Science and Technology links (April 20th 2019)

  1. Early-career setback cause a performance improvement among those who persevere. This is related to the observation that immigrants are four times more likely to become millionaires. In biology, that is called hormesis: by challenging your muscles, you get stronger; by exposing yourself to some radiations or starving a little, you live longer. So you should seek out challenges and expose your kids to difficulties.
  2. We have a significant bias in favor of tall men. Tall men get promoted more often and earn more money; they are also much more successful with women. However, heightism, like ageism, is considered an acceptable form of discrimination. It is fine to mock a man because he is small or old. It is not fine to mock a man for being gay, transgender or black.
  3. Canadians who finish high school, get a full time job and only have children within marriage have less than a one percent chance of being poor.
  4. Currently, a sizeable fraction of men go bald with age and there is relatively little that can be done. There is some good surgery, but it is expensive. Products like minoxidil work, but only so. A new product (clascoterone) has passed a Phase II clinical trial with good results. It seems quite safe and probably more effective than current drugs.
  5. A stem-cell therapy for knee arthritis got solid results during a clinical trial.
  6. It seems that you can bring back some brain function hours after death (in pigs).
  7. We are making progress against the “bubble boy” syndrome.
  8. For every 100 women who earn a bachelor’s degree from US colleges and universities there are 74 men.

Parsing short hexadecimal strings efficiently

It is common to represent binary data or numbers using the hexadecimal notation. Effectively, we use a base-16 representation where the first 10 digits are 0, 1, 2, 3, 5, 6, 7, 8, 9 and where the following digits are A, B, C, D, E, F, with the added complexity that we can use either lower or upper case (A or a).

We sometimes want to convert strings of hexadecimal characters into a numerical value. For simplicity, let us assume that we have sequences of four character. This is a common use case due to unicode escape sequences in C, JavaScript, C# and so forth.

Each character is represented as a byte value using its corresponding ASCII code point. So ‘0’ becomes 48, ‘1’ is 49, ‘A’ is 65 and so forth.

The most efficient approach I have found is to simply rely on memoization. Build a 256-byte array where 48 (or ‘0’) is mapped to 0, 65 (or ‘A’) is mapped to 10 and so forth. As an extra feature, map all disallowed values to -1 so we can detect them. Then just lookup the four values and combine them.

uint32_t hex_to_u32_lookup(const uint8_t *src) {
  uint32_t v1 = digittoval[src[0]];
  uint32_t v2 = digittoval[src[1]];
  uint32_t v3 = digittoval[src[2]];
  uint32_t v4 = digittoval[src[3]];
  return v1 << 12 | v2 << 8 | v3 << 4 | v4;

What else could you do?

You could replace the table lookup with a fancy mathematical function:

uint32_t convertone(uint8_t c) {
  return (c & 0xF) + 9 * (c >> 6);

How do they compare? I implemented both of these and I find that the table lookup approach is more than twice as fast when the function is called frequently. I report the number of instructions and the number of cycles to parse 4-character sequences on a Skylake processor (code compiled with GNU GCC 8).

Instruction count Cycle count
lookup 18 4.3
math 38 9.6

I am still frustrated by the cost of this operation. Using 4 cycles to convert 4 characters to a number feels like too much of an expense.

My source code is available (run it under Linux).

Further reading: Fast hex number string to int by Johnny Lee; Using PEXT to convert from hexadecimal ASCII to number by Mula.

Science and Technology links (April 13th 2019)

  1. There is little evidence that digital screens are harmful to teenager’s mental health. If there is an effect, it is small.
  2. Cotton bags must be reused thousands of times before they match the environmental performance of plastic bags. Organic cotton bags are much worse than regular ones, requiring 20,000 reuse instead of only 7,000, due to the lower yield of organic farming. Cotton bags cannot be recycled. Paper bags must be reused dozens of times to have the same environmental impact as single-use plastic bags. For extra points, compute how many years you need to use an organic cotton bag, at a rate of two utilization a week, to use it 20,000 times. Why are we outlawing plastic bags, and not reusable organic cotton bags?
  3. I never understood the appeal of artificial-intelligence system that take very little input from human beings (self-taught software). Rich Sutton makes a powerful case for it:

    The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.

    To put it another way, our most powerful weapon for ‘smarter’ software is to design systems that get better as we add more computational power, and then to add the computational power.

    The net trend is to build software that looks more and more like ‘brute force’ at a high level, but with increasing sophistication in the computational substrate to provide the necessary brute force.

  4. Goldstein, Qvist and Pinker make a powerful case for nuclear power in the New York Times. Nuclear power is safe, clean, relatively inexpensive and environmentally friendly. Renewal energies are not the solution despite all the propaganda at the moment:

    Where will this gargantuan amount of carbon-free energy come from? The popular answer is renewables alone, but this is a fantasy. Wind and solar power are becoming cheaper, but they are not available around the clock, rain or shine, and batteries that could power entire cities for days or weeks show no sign of materializing any time soon. Today, renewables work only with fossil-fuel backup. Germany, which went all-in for renewables, has seen little reduction in carbon emissions.

  5. Human beings have better color perception than most other mammals.

    Humans, some primates, and some marsupials see an extended range of colors, but only by comparison with other mammals. Most non-mammalian vertebrate species distinguish different colors at least as well as humans, and many species of birds, fish, reptiles and amphibians, and some invertebrates, have more than three cone types and probably superior color vision to humans.

    So why would human beings have superior color vision compared to other mammals?

    A recent evolutionary account posits that trichromacy facilitates detecting subtle skin color changes to better distinguish important social states related to proceptivity, health, and emotion in others.

  6. As you age, your working memory degrades. A Nature article reports on how this can be reversed with electric brain stimulation.
  7. Genetically modified plants (GMOs) have reduced pesticide use by 37% while improving yields by 22%. Though no new technology is free from risk, neither lower yields nor higher pesticide use are free from risk.
  8. The poverty rate in China went from 34.5% of the population to 0.7% of the population between 2001 and 2015.