Science and Technology links (March 16th, 2018)

  1. From the beginning of the 20th century to 2010, the life expectancy at birth for females in the United States increased by more than 32 years. The 3 major causes of death for females in 1900 were pneumonia and influenza, tuberculosis, and enteritis and diarrhea. In 2010, the 3 major causes were heart disease, all cancers, and stroke.
  2. It looks like dwarf stars could be orbited by habitable planets.
  3. More evidence that intelligence is genetic.
  4. Sugar and bread are killing you: Dietary Carbohydrates Impair Healthspan and Promote Mortality (in Cell).
  5. It turns out that people in organized crime are probably saner than you’d expect: “we were able to determine that in the sample analysed there was not one subject with a psychotic personality”.
  6. If you made it to Pluto and could somehow survive, would there be enough light to read? More than enough according to Cook.
  7. China has reduced fine particulates in the air by a third in four years.
  8. You can cure blindness (in mice) using small wires: “artificial photoreceptors based on gold nanoparticle-decorated titania nanowire arrays restored visual responses in the blind mice with degenerated photoreceptors”. (In Nature.)
  9. According to Nature, a science doctorate has high value in the UK and Canadian job markets. It sounds true to me. However, you should simply not expect to automatically become a professor: “Nearly 30% of those with full- or part-time jobs ended up in academia.”
  10. Brenda Milner is a professor at McGill University who is going to turn 100 this summer. She is still an active professor with an ongoing publication record. Here is what the New York Times wrote about her last year:

    “People think because I’m 98 years old I must be emerita,” she said. “Well, not at all. I’m still nosy, you know, curious.” (…) Dr. Milner continues working, because she sees no reason not to. Neither McGill nor the affiliated Montreal Neurological Institute and Hospital has asked her to step aside. She has funding: In 2014 she won three prominent achievement awards, which came with money for research.

  11. Mozilla has released an open source speech recognition model “so that anyone can develop compelling speech experiences” (via Leonid Boytsov).
  12. One of my favorite authors, Brian Martin, has published a new book: Vaccination panic in Australia. We all know that vaccination can be an effective public health policy. So you think that it is crazy to question vaccination policies? Not so fast. Brian explains carefully that there is room for reasonable disagreement on how exactly vaccination is to be used. But most importantly, the book reviews how authorities proceed to suppress dissent, even reasonable, well-founded dissent. The book can be freely accessed online.
  13. As we get older, our muscles tend to waste away. This condition is called sarcopenia, a term coined in 1988. It is still unclear what causes it, but there is now evidence that it has to do with the loss of nerves. Even if we did nothing to cure cancer and heart disease, simply keeping the muscles of older people strong would make a huge difference. Sadly, we have barely begun to consider doing something about it.

Iterating over hash sets quickly in Java

There are many ways in software to represent a set. The most common approach is to use a hash table. We define a “hash function” that takes our elements as input and produces as output an integer that “looks random”. An element is then stored at the location indicated by the value of its hash function. To check whether a value is in the hash set, you compute its hash value and look at the corresponding memory location. Your elements will thus appear in “random” order.
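
If it helps to make this concrete, here is a simplified Java sketch of how an element's hash value might be mapped to a storage slot. Real hash tables (including Java's) are more sophisticated; the class below is purely illustrative:

public class HashBucket {
  // Simplified sketch: derive a bucket index from the element's hash code.
  static int bucketIndex(Object element, int capacity) {
    // clear the sign bit, then reduce modulo the table capacity
    return (element.hashCode() & 0x7fffffff) % capacity;
  }

  public static void main(String[] args) {
    // prints a "random-looking" slot in [0, 16)
    System.out.println(bucketIndex("hello", 16));
  }
}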

This means that unless you do extra work, iterating through the elements in your hash set will be done in random order. Let me pull out my Swift console to demonstrate:

$ var x = Set<Int>()
$ x.insert(1)
$ x.insert(2)
$ x.insert(3)
$ for i in x { print(i) }
2
3
1

That is right: you insert the numbers 1, 2, 3 and the numbers 2, 3, 1 come out.

You get the same kind of result in Python:

>>> x = set()
>>> x.add(-1)
>>> x.add(-2)
>>> x.add(-3)
>>> x
set([-2, -1, -3])

That is, the hash set is visited starting with the elements having the smallest hash values. The hash function is often designed to appear random, so the elements come out in a seemingly random order.

This randomness can take programmers by surprise, so some programming languages like JavaScript and PHP preserve the “insertion order”. If you pull out your JavaScript console, you will see that the set keeps track of the order in which you inserted the elements and gives them back to you in that order:

> var x = new Set()
> x.add(-3)
Set { -3 }
> x.add(-2)
Set { -3, -2 }
> x.add(-1)
Set { -3, -2, -1 }
> x.add(3)
Set { -3, -2, -1, 3 }
> x.add(2)
Set { -3, -2, -1, 3, 2 }
> x.add(1)
Set { -3, -2, -1, 3, 2, 1 }

This can be implemented as a linked list working on top of the hash table.

Java supports both approaches through HashSet and LinkedHashSet.
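
If you want to see the difference for yourself, here is a small Java example; the printed order of the HashSet is unspecified and may vary, while the LinkedHashSet prints in insertion order:

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class SetOrderDemo {
  public static void main(String[] args) {
    Set<Integer> hashed = new HashSet<>();
    Set<Integer> linked = new LinkedHashSet<>();
    for (int x : new int[] {-3, -2, -1, 3, 2, 1}) {
      hashed.add(x);
      linked.add(x);
    }
    System.out.println(hashed); // some order determined by the hash function
    System.out.println(linked); // [-3, -2, -1, 3, 2, 1], the insertion order
  }
}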

The LinkedHashSet will use more memory, but it gives back the elements in insertion order. The HashSet gives back the elements in an order determined in large part by the hash function. The LinkedHashSet may allow you to iterate over the elements faster because you are essentially bypassing the hash table entirely and just following the linked list. Linked lists are not great in the sense that each node being visited can incur a cache miss. However, Java’s HashSet is implemented using a fancy chaining approach, so you will be chasing pointers in memory and possibly also incurring cache misses.

So it would seem like LinkedHashSet is a good choice in Java if you are not memory bound.

To explore this problem, I took a set made of 1 million integers generated randomly. I insert them into both a HashSet and a LinkedHashSet. Then I sum the values. I ran my benchmark on a Skylake processor with Java 8:

class          sum values
HashSet        50 ops/s
LinkedHashSet  150 ops/s

My numbers are clear: in my tests, it is three times faster to sum up the values in a LinkedHashSet. You can reproduce my benchmark.
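
For reference, the following simplified sketch gives the flavour of such a benchmark. It is not my actual benchmark code, and a serious measurement should use a harness such as JMH to deal with warm-up and dead-code elimination:

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

public class IterationBenchmark {
  static long sum(Set<Integer> set) {
    long total = 0;
    for (int x : set) {
      total += x;
    }
    return total;
  }

  public static void main(String[] args) {
    Random r = new Random(1234);
    Set<Integer> hashed = new HashSet<>();
    Set<Integer> linked = new LinkedHashSet<>();
    // build two sets holding the same 1 million random integers
    while (hashed.size() < 1_000_000) {
      int x = r.nextInt();
      hashed.add(x);
      linked.add(x);
    }
    for (String label : new String[] {"HashSet", "LinkedHashSet"}) {
      Set<Integer> set = label.equals("HashSet") ? hashed : linked;
      long checksum = 0;
      int repetitions = 50;
      long start = System.nanoTime();
      for (int i = 0; i < repetitions; i++) {
        checksum += sum(set);
      }
      long elapsed = System.nanoTime() - start;
      System.out.println(label + ": " + (repetitions * 1e9 / elapsed)
          + " ops/s (checksum " + checksum + ")");
    }
  }
}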

Science and Technology links (March 9th, 2018)

  1. The Audi A8, which goes on sale this year, will be the first car to offer Level 3 autonomy, which means that as a driver, you are expected to be able to relinquish control of the car to the computer in most instances.
  2. Standardized tests do predict, with reasonable accuracy, how well you will do in college.
  3. On the topic of psychology being in trouble as a field… introductory psychology textbooks are misrepresenting the science of intelligence:

    (…) over 93 per cent of the books covered Gardner’s multiple intelligences and over 89 per cent covered Sternberg’s triarchic theory of intelligence, even though neither of these theories are mainstream or well-supported by evidence (…) In contrast, fewer than a quarter of the books covered the most strongly supported contemporary, hierarchical theories (…) Other common inaccuracies included promotion of the idea that it is not possible to measure intelligence in a meaningful way (…), and claims that intelligence is only relevant in academic settings (…). Among the logical fallacies in the books is what’s known as Lewontin’s fallacy – this idea, advanced in six of the books, states that because humans share about 99 per cent of the same genes, that genes cannot therefore have a role in the differences between individuals or groups. (…) Twelve other fallacies appeared in the books such as (…) claiming that intelligence doesn’t exist because it is a collection of abilities (…), Warne highlights issues around the discussion of the taboo topic of race and IQ; textbook authors overplaying the role of stereotype threat, and authors having a tendency to overestimate environmental influences on intelligence (…)

  4. According to several news reports, luck plays a significant role in how successful you are. The article that generated this discussion proves nothing of the sort. More critically, it is unclear whether believing that success comes through luck is a good belief to hold. I prefer to assume that I am responsible for the outcomes, both positive and negative.
  5. Nature reports on nanorobots that can deliver cancer medicine in a targeted manner.
  6. According to the famous Harvard economist Rogoff, a bitcoin might be worth only $100 in ten years. I discussed Rogoff’s work earlier regarding the Reinhart-Rogoff case.
  7. According to an article in Nature, genetics explains only a small part of differences in how long a person lives.
  8. Visible light affects how long worms live, also in Nature. It means you must control for lighting when trying to reproduce experiments on worms.
  9. Inviting men with no symptoms to a one-off test for prostate cancer does not save lives according to results from the largest ever prostate cancer trial conducted over 10 years. That more medical testing does not save more lives is unintuitive, but it is well known in epidemiology.

Iterating over set bits quickly (SIMD edition)

Suppose that you have a long sequence of bits 10101011100000… you want to visit all the bits set to 1. That is, given 10101011100000, you would like to get the indexes of all the bits set to one: 0, 2, 4, 6, 7, 8.

In a recent blog post, I reviewed fast techniques to iterate over the position of the bits set to 1 in such a bit stream. A fast function in C to solve the problem makes use of the trailing-zero instruction found in recent x64 processors and generated by the __builtin_ctzll intrinsic in several compilers such as LLVM’s clang and GNU gcc. The trailing-zero instruction gives us the number of consecutive zeroes present in a word starting from the least significant bit (so that odd integers have a number of trailing zeros equal to 0). The code is relatively elegant:

size_t bitmap_decode_ctz(uint64_t *bitmap,
  size_t bitmapsize,
  uint32_t *out) {
  size_t pos = 0;
  uint64_t bitset;
  for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    while (bitset != 0) {
      uint64_t t = bitset & -bitset;
      int r = __builtin_ctzll(bitset);
      out[pos++] = k * 64 + r;
      bitset ^= t;
    }
  }
  return pos;
}

In the comments, powturbo (it is a pseudonym) remarked that you could solve the same problem using SIMD instructions. SIMD instructions operate on vectors of words and are often faster. I have not looked at powturbo’s code, but I happen to have an implementation of my own, inherited from my friend Nathan Kurz. It basically processes bytes one by one, looking up the indexes in a table (8kB). Omitting the table, the code is just a bit longer than the trailing-zero version:

int bitmap_decode_avx2(uint64_t * array,
  size_t sizeinwords, uint32_t *out) {
  uint32_t *initout = out;
  __m256i baseVec = _mm256_set1_epi32(-1);
  __m256i incVec = _mm256_set1_epi32(64);
  __m256i add8 = _mm256_set1_epi32(8);

  for (size_t i = 0; i < sizeinwords; ++i) {
    uint64_t w = array[i];
    if (w == 0) {
      baseVec = _mm256_add_epi32(baseVec, incVec);
    } else {
      // process the 64-bit word 16 bits (two bytes) at a time
      for (int k = 0; k < 4; ++k) {
        uint8_t byteA = (uint8_t) w;
        uint8_t byteB = (uint8_t)(w >> 8);
        w >>= 16;
        // vt is the precomputed 8kB lookup table of indexes and lengthTable
        // gives the number of set bits per byte; both are omitted here
        __m256i vecA = _mm256_load_si256((const __m256i *) vt[byteA]);
        __m256i vecB = _mm256_load_si256((const __m256i *) vt[byteB]);
        uint8_t advanceA = lengthTable[byteA];
        uint8_t advanceB = lengthTable[byteB];
        vecA = _mm256_add_epi32(baseVec, vecA);
        baseVec = _mm256_add_epi32(baseVec, add8);
        vecB = _mm256_add_epi32(baseVec, vecB);
        baseVec = _mm256_add_epi32(baseVec, add8);
        _mm256_storeu_si256((__m256i *) out, vecA);
        out += advanceA;
        _mm256_storeu_si256((__m256i *) out, vecB);
        out += advanceB;
      }
    }
  }
  return out - initout;
}

(At the expense of some performance, you can use a smaller 2kB table.)

As usual, it looks like gibberish if you do not know about Intel intrinsics, but trust me: this code does nothing complicated. It is basically just memoization.

So how does the vectorized version (using SIMD instructions) fare against the trailing-zero version? Here are approximate timings on a Skylake processor:

density   trailing-zero version   SIMD (AVX2) version
0.0625    ~6.5 cycles per int     ~6 cycles per int
0.125     ~5 cycles per int       ~3 cycles per int
0.25      ~4 cycles per int       ~2 cycles per int
0.5       ~3 cycles per int       ~1.25 cycles per int
0.9       ~3 cycles per int       ~0.4 cycles per int

For dense bitstreams, my timings are accurate, but they grow more and more inaccurate for sparser bitstreams. Nevertheless, the pattern is clear: if you have dense bitstreams, the SIMD code is about twice as fast. The gains are higher when the bitmaps are very dense. However, the gains disappear when the bit stream is sparse…

I should point out that these timings are not directly comparable with those from my earlier blog post because I write the indexes to an array instead of calling a function for each index. Thus I am currently solving a slightly easier problem.

My code is available.

Note: The “compress” instructions in AVX-512 (e.g., vpcompressd) could make this problem easier. Sadly, AVX-512 instructions are not yet widely available on commodity processors.

Science and Technology links (March 2nd, 2018)

  1. Flashing lights might cure Alzheimer’s, according to Nature.
  2. There is no paradox: being obese is definitively bad for you.
  3. Class attendance predicts success in college.
  4. Barbra Streisand had her dog cloned, more than once.
  5. Contrary to what we have been told: stopping or reducing dietary fiber intake reduces constipation.
  6. Fasting can help prevent and treat cancer. Sadly, my wife would not let me fast if I wanted to (I’m quite thin as it is).
  7. If you cannot fast, maybe you can take aspirin: Aspirin mimics some of the effects of caloric restriction.
  8. Mitochondria, the power plants of your cells, run at a temperature of 50°C.
  9. Apparently, nobody knows how airplanes fly. (Credit: Leonid Boytsov)
  10. Another well-established psychology result bites the dust:

    Dijksterhuis and van Knippenberg (1998) reported that participants primed with a category associated with intelligence (“professor”) subsequently performed 13% better on a trivia test than participants primed with a category associated with a lack of intelligence (“soccer hooligans”). (…) The procedure used in those replications served as the basis for this multilab Registered Replication Report. A total of 40 laboratories collected data for this project, and 23 of these laboratories met all inclusion criteria. Here we report the meta-analytic results for those 23 direct replications (total N = 4,493), which tested whether performance on a 30-item general-knowledge trivia task differed between these two priming conditions (results of supplementary analyses of the data from all 40 labs, N = 6,454, are also reported). We observed no overall difference in trivia performance between participants primed with the “professor” category and those primed with the “hooligan” category (0.14%) and no moderation by gender.

    Psychology as a field is in big trouble.

  11. An old reference that should serve as a good reminder not to trust what you read:

    Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. (…) Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape.

    I like this example because it is very common for statisticians to build into their model the assumption that the data must follow some prescribed distribution. Real life is complicated and rarely behaves like our models.

  12. We are setting up a mobile phone network on the Moon.
  13. The state of California will authorize fully automated cars on its roads.
  14. The cells in your heart do not regenerate. Scientists have found that by turning four genes “on”, they can get the cells to divide and maybe regenerate your heart.
  15. PhD students face significant mental health problems.

Vectorized shifts: are immediates faster?

Our processors are all equipped with vector instructions, also called SIMD (single instruction, multiple data) instructions. One common instruction is the “shift”. Roughly speaking, a shift is either a multiplication by a power of two or a division by a power of two, when considering the data as unsigned integers. Thus 1<<3 is 8.

Older Intel and AMD processors have shift instructions so that you can shift several integers at once. So given [1,2,3,4], you can compute [1<<3, 2<<3, 3<<3, 4<<3] using a single instruction. It is faster, obviously.

However, one form of these instructions can only shift by an “immediate” integer value, that is, the shift count must be hard-coded into the instruction (as opposed to being read from a register). That is an annoying constraint. It is fine if you want to shift all your values by 3 bits, and 3 is a fixed quantity, but what if you want to shift by a quantity that is known only at runtime? You can’t. Thus a compiler cannot vectorize the following code using only shifts by immediate values:

void scalarshift(uint32_t *array, size_t N, int shift) {
  for (size_t k = 0; k < N; ++k) {
    array[k] = array[k] >> shift;
  }
}

Thankfully, there is also a version of this instruction that takes a non-immediate value, but it must be passed as a vector register: you store the shift count in the first few bits of the vector register.

More recent processors can do variable shifts. Thus given the vectors [1,2,3,4] and [7,8,9,10], one could compute [1<<7, 2<<8, 3<<9, 4<<10] in one instruction. And the vector [7,8,9,10] does not need to be known at compile time.

For some reason, Intel has preserved the older form of the instructions. If you have an immediate integer, you can shift with it. This saves you from having to populate a shift vector and from explicitly occupying a register.

But is using immediate integers when possible worth it from a performance point of view? Agner Fog’s instruction table shows little difference between the two forms of the instruction, but I wanted to check for myself. Thus I wrote two versions of the same code, shifting a whole array.

The version with immediate values looks like this…

void vectorshift(uint32_t *array, size_t N) {
  __m256i * a = (__m256i *) array;
  for (size_t k = 0; k < N / 8; k++, a++) {
    __m256i v = _mm256_loadu_si256(a);
    v = _mm256_srli_epi32(v, SHIFT); // SHIFT is a compile-time constant
    _mm256_storeu_si256(a, v);
  }
}

It looks messy because I use intrinsic functions, but you should be able to figure out quickly what my code is doing.

The version with variable shifts is just a bit more complicated…

void vectorshift(uint32_t *array, size_t N, int shift) {
  __m256i * a = (__m256i *) array;
  __m256i s = _mm256_set1_epi32(shift);
  for (size_t k = 0; k < N / 8; k++, a++) {
    __m256i v = _mm256_loadu_si256(a);
    v = _mm256_srlv_epi32(v, s);
    _mm256_storeu_si256(a, v);
  }
}

In practice, you want to generously unroll these loops for greater speed.

What is the verdict? On a Skylake processor, both reach an identical speed, at 0.35 cycles per input integer. The instruction count is nearly identical.

Thus it is probably not worth bothering with shifts by immediate values for performance reasons.

My code is available.

Science and Technology links (February 24th, 2018)

Iterating over set bits quickly

A common problem in my line of work is to iterate over the set bits (bits having value 1) in a large array.

My standard approach involves a “counting trailing zeroes” function. Given an integer, this function counts how many consecutive bits are zero starting from the least significant bit. Any odd integer has no “trailing zero”. Any even integer has at least one “trailing zero”, and so forth. Many compilers such as LLVM’s clang and GNU GCC have an intrinsic called __builtin_ctzl for this purpose. There are equivalent standard functions in Java (numberOfTrailingZeros), Go and so forth. There is a whole Wikipedia page dedicated to these functions, but recent x64 processors have a fast dedicated instruction, so the implementation is taken care of at the processor level.

The following function will call the function “callback” with the index of each set bit:

uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    while (bitset != 0) {
      uint64_t t = bitset & -bitset;
      int r = __builtin_ctzl(bitset);
      callback(k * 64 + r);
      bitset ^= t;
    }
}

The trick is that bitset & -bitset returns an integer having just the least significant bit of bitset turned on; all other bits are off. For example, if bitset is 0b10100, then bitset & -bitset is 0b00100, and after the exclusive or (bitset ^= t) the remaining value is 0b10000. With this observation, you should be able to figure out why the routine works.

Note that your compiler can probably optimize bitset & -bitset to a single instruction on x64 processors. Java has an equivalent function called lowestOneBit.
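
For reference, here is what the same loop might look like in Java, using the standard Long.numberOfTrailingZeros and Long.lowestOneBit functions; the class and method names below are mine, chosen only for illustration:

import java.util.function.IntConsumer;

public final class SetBitIterator {
  // Calls the callback with the index of every set bit in the bitmap.
  public static void forEachSetBit(long[] bitmap, IntConsumer callback) {
    for (int k = 0; k < bitmap.length; k++) {
      long bitset = bitmap[k];
      while (bitset != 0) {
        long t = Long.lowestOneBit(bitset);          // isolate the least significant set bit
        int r = Long.numberOfTrailingZeros(bitset);  // index of that bit within the word
        callback.accept(k * 64 + r);
        bitset ^= t;                                 // clear the bit we just visited
      }
    }
  }

  public static void main(String[] args) {
    long[] bitmap = {0b10110L};
    forEachSetBit(bitmap, i -> System.out.println(i)); // prints 1, 2, 4
  }
}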

If you are in a rush, that’s probably not how you’d program it. You would probably iterate through all bits, in this manner:

uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
    bitset = bitmap[k];
    size_t p = k * 64;
    while (bitset != 0) {
      if (bitset & 0x1) {
        callback(p);
      }
      bitset >>= 1;
      p += 1;
    }
}

Which is faster?

Obviously, you have to make sure that your code compiles down to the fast x64 trailing-zero instruction. If you do, then the trailing-zero approach is much faster.

I designed a benchmark where the callback function just adds the indexes together. The speed per decoded index will depend on the density (fraction of set bits). I ran my benchmark on a Skylake processor:

density   trailing-zero approach   naive approach
0.125     ~5 cycles per int        40 cycles per int
0.25      ~3.5 cycles per int      30 cycles per int
0.5       ~2.6 cycles per int      23 cycles per int

My code is available.

Thus using a fast trailing-zero function is about ten times faster.

Credit: This post was inspired by Wojciech Muła.

Science and Technology links (February 16th, 2018)

  1. In all countries, in all years–without exception–girls did better than boys in academic performance (PISA) tests.
  2. Vinod Khosla said:

    There are, perhaps, a few hundred sensors in the typical car and none in the body. A single ad shown to you on Facebook has way more computing power applied to it than a $10,000 medical decision you have to make.

  3. The gender of an unknown user can be identified with an accuracy of over 95% from the way they type.
  4. Citation counts work better than a random baseline (by a margin of 10%) in distinguishing important seminal research papers.
  5. By consuming vitamin B3 (or niacin), you can increase your body’s production of Nicotinamide Adenine Dinucleotide (NAD+ for short). It turns out that NAD+ supplementation normalizes key Alzheimer’s features (in mice). If I suffered from Alzheimer’s, I would not hesitate to take niacin supplements.
  6. The U.S. has never produced so much oil.
  7. According to Nature, birds living in large groups are smarter.
  8. A few years ago, we were told that the Pacific nation of Tuvalu would soon disappear due to climate change. In fact, for now, it is growing in size.

Science and Technology links (February 9th, 2018)

  1. We shed 50 million skin cells every day.
  2. A mutant crayfish reproduces by cloning. To my knowledge, this might be the largest animal to reproduce by cloning.

    Before about 25 years ago, the species simply did not exist (…) it has spread across much of Europe and gained a toehold on other continents. In Madagascar, where it arrived about 2007, it now numbers in the millions (…)

    I note two interesting aspects to this story. The first one is that it shows that, contrary to a common belief, new species are created even today. The second one is that it brings us back to an interesting puzzle. Cloning is a lot more efficient than sex for procreation. So why do most large animals use sex? See The Red Queen: Sex and the Evolution of Human Nature.

  3. Some evidence that moderate (but not heavy) alcohol consumption might be good for your brain. I still would not recommend you start drinking if you aren’t drinking right now.
  4. While the average is 106 boys born to every 100 girls, for vegetarian mothers the ratio is just 85 boys to 100 girls. In other words, being a vegetarian makes it much more likely that you will give birth to girls.
  5. Researchers can simulate a worm’s brain with a few artificial neurons.
  6. Elon Musk’s SpaceX company launched the most powerful rocket in the world:

    The Falcon Heavy is the world’s 4th highest capacity rocket ever to be built (…) Falcon Heavy was designed from the outset to carry humans into space, including the Moon and Mars (…) The Falcon Heavy was developed with private capital with Musk stating that the cost was more than $500 million. No government financing was provided for its development.

    The Verge has a nice documentary on YouTube.

  7. Mitochondria are the “power stations” of our cells. As we age, we tend to accumulate malfunctioning mitochondria which might lead to various medical conditions. Researchers have found that a drug targeting mitochondria could improve cognition in old mice.
  8. Graphics processors are in high demand. Some of the best ones are made by NVIDIA. Year-over-year, NVIDIA’s full-year revenue increased 41% to finish at $9.71 billion in 2017.
  9. Using lasers, we found whole new Mayan settlements:

    The data reveals that the area was three or four times more densely populated than originally thought. “I mean, we’re talking about millions of people, conservatively,” says Garrison. “Probably more than 10 million people.”

  10. According to a recent research article, vitamin D-3 has the potential to significantly reverse the damage that high blood pressure, diabetes, atherosclerosis, and other diseases inflict on the cardiovascular system.
  11. A vaccine entirely cleared mice of cancer.