My kindergarten story

Though I was a straight-A student for most of my high school and college years, I failed kindergarten. I have told this story many times but I realize that I have never dedicated a blog post to it.

I ended up with a PhD from a top Canadian university and some of my research as an academic has been impactful.

But, as a kid, I wasn’t a good fit for schooling.

I could not tie my shoelaces. You might think that I was particularly lacking in hand-eye coordination, but I don’t think that’s the case. To this day, if you meet me, there is a good chance that my shoelaces will be undone. While I am not an athlete, I can beat some hard video games that demand quick reflexes and good hand-eye coordination. I can do knots, as I was a boy scout. For some reason, I am just not good with shoelaces. Yet tying them was one of the tests you had to pass in kindergarten.

The other test you had to pass was to learn your phone number. To this day, if you ask me for my phone number, I just don’t know it. I have to look it up, copy and paste it. So I failed this test.

You had to learn to count up to 10. I decided that since I was 5, it was fine to only know how to count up to 5. So I failed this test as well.

I probably failed all sorts of other tests as well.

So I was put in a class for students with learning disabilities. It wasn’t a punishment; I actually enjoyed it.

In the end, I would get really good grades. But my troubles were not over. Like Mandelbrot, I never learned the multiplication tables; instead, I learned tricks to do multiplications in my head. Teachers tried really hard to prevent such an approach, so you had to hide your schemes. I also never learned the quadratic formula; instead, I figured out (probably on my own) how to complete the square quickly enough.

What I did learn, apparently, is independent thinking. Because I had a difficult relationship with teachers and what they wanted, I think I learned to figure out what I wanted to learn on my own.

How far can you scale interleaved binary searches?

The binary search is the standard, textbook approach to searching through a sorted array. In a previous post, I showed how you can do multiple binary searches faster if you interleave them. The core idea is that while the processor is waiting for data from one memory access, you can squeeze in several more accesses from another search. So you effectively compute the searches “in parallel” even though the code itself is single-threaded.

I hinted that it probably made no sense to scale this up beyond 32 interleaved searches. That is, you cannot do much better than a factor-of-9 speedup on a recent Intel processor (Cannon Lake), with a lesser benefit on older architectures (Skylake). I have since written a benchmark to show that this is so, trying all interleave counts between 2 and 64.

So how could we squeeze out even more performance? Why can’t we reach a speedup factor of 15 or 20? A significant cost might be related to pages: computers organize their memory in blocks (pages) spanning a small number of bytes (e.g., 4 kB). Yet they can only keep a few thousand pages readily accessible at any one time. If you ask for a page that has not been recently accessed, the processor needs to remap it, an expensive process. If you are doing lots of random accesses in a large array, it is likely that you will frequently hit a page that has not been recently accessed, and you might be limited, speed-wise, by the resulting miss. You can alleviate this problem by asking the operating system to use “huge pages”.
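As a minimal, hedged sketch of what asking for huge pages can look like on Linux (this is not the benchmark code; it assumes transparent huge pages are enabled, and the madvise request may be silently ignored by the kernel):

#include <cstdint>
#include <cstdlib>
#include <sys/mman.h>

// allocate 'count' 32-bit integers and ask the kernel to back them with huge pages
uint32_t *allocate_with_huge_pages(size_t count) {
  void *buffer = nullptr;
  // align to a 2 MB boundary so the region can be mapped with 2 MB pages
  if (posix_memalign(&buffer, 1 << 21, count * sizeof(uint32_t)) != 0) {
    return nullptr;
  }
  // advisory only: the kernel may or may not honor the request
  madvise(buffer, count * sizeof(uint32_t), MADV_HUGEPAGE);
  return static_cast<uint32_t *>(buffer);
}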

I present the results as a figure which I hope will be intuitive. There is quite a bit of measurement noise, which I leave as a sort of error bar.

  • Huge pages help: we go from a maximum speedup of about 9 to 11. This suggests that page size is a concern. Using huge pages is not very practical, however. Thankfully, there are other ways to alleviate page issues, such as well-timed prefetching (see the sketch after this list).
  • The plot shows a very nice performance progression up to about 9 parallel searches; up to that point, there is little difference between the version with regular pages and the version with huge pages. I suspect that up to 9 searches, we are not yet fast enough for page size to be a performance concern.
  • Whether we use regular or huge pages, we reach a “noisy” plateau, some kind of hard speed limit. Furthermore, performance does not scale linearly with the number of parallel searches, and where the progression does look linear, the slope is much smaller than one. What might be the limiting factors? I am not sure. If you know, please share your thoughts!
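Here is a hedged sketch of the kind of prefetching I have in mind. It relies on the GCC/Clang __builtin_prefetch extension and is not the code used in the benchmark: during one step of a branchless binary search, we prefetch the probe locations of both possible next steps, so the needed cache line is already in flight whichever half we end up keeping.

#include <cstddef>
#include <cstdint>

// one step of a branchless binary search, with software prefetching of both
// candidate probes for the next step (GCC/Clang builtin; illustrative only)
const uint32_t *search_step(const uint32_t *base, size_t &n, uint32_t target) {
  size_t half = n >> 1;
  size_t quarter = (n - half) >> 1;
  __builtin_prefetch(base + quarter);        // next probe if we keep the low half
  __builtin_prefetch(base + half + quarter); // next probe if we move to the high half
  base = (base[half] < target) ? base + half : base;
  n -= half;
  return base;
}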

My code is available; it should run anywhere C++ runs. The benchmark assumes that you use the GNU GCC compiler, however, because I do not know how to deliberately generate the needed conditional move instructions with other compilers. Again, this is the result of joint work with Nathan Kurz and Travis Downs, but the mistakes are mine.

Science and Technology links (September 14th 2019)

  1. Streaming music makes up 80% of the revenue of the music industry. Revenue is up 18% for the first six months of 2019. This follows a record year in 2018 when the music industry reached its highest revenues in 10 years. Though this should be good news for musicians, record labels often take the lion’s share of the revenues.
  2. We have seen massive progress in the last five years in artificial intelligence. Yet we do not see obvious signs of economic gains from this progress.
  3. A widely cited study contradicting the existence of free will was fatally flawed.
  4. A common diabetes drug might spur neurogenesis and repair in the brains of (female) mice.
  5. Facebook is working on creating real-time models of your face for use in virtual reality.
  6. A simple and inexpensive eye test might be sufficient to detect early signs of Alzheimer’s.

Speeding up independent binary searches by interleaving them

Given a long list of sorted values, how do you find the location of a particular value? A simple strategy is to first look at the middle of the list. If your value is larger than the middle value, look at the last half of the list; if not, look at the first half. Then repeat with the selected half, again looking at its middle. This algorithm is called a binary search, and it is one of the first algorithms that computer science students learn. I suspect that many people figure it out as kids.
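As a minimal sketch of the textbook algorithm (not the benchmark code), here is a binary search that returns the position of the first element at least as large as the target:

#include <cstddef>
#include <cstdint>

// repeatedly compare the target against the middle of the remaining range and
// keep the half that may contain it; returns the index of the first element
// that is >= target (n if there is none)
size_t binary_search(const uint32_t *array, size_t n, uint32_t target) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (array[mid] < target) {
      lo = mid + 1; // target is in the upper half
    } else {
      hi = mid;     // target is in the lower half (or at mid)
    }
  }
  return lo;
}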

It is hard to drastically improve on the binary search if you only need to do one.

But what if you need to execute multiple binary searches, over distinct lists? Perhaps surprisingly, in such cases, you can multiply the speed.

Let us first reason about how a binary search works. The processor needs to retrieve the middle value and compare it with your target value. Before the middle value is loaded, the processor cannot tell whether it will need to access the first or the last half of the list. It might speculate: most processors have branch predictors. In the case of a binary search, the branch is hard to predict, so we might expect the branch predictor to get it wrong half the time. Still, when the processor speculates properly, it is a net win. Importantly, we are using the fact that processors can do multiple memory requests at once.

What else could the processor do? If there are many binary searches to do, it might initiate the second one. And then maybe initiate the third one and so forth. This might be a lot more beneficial than speculating wastefully on a single binary search.

How can it start the second or third search before it finishes the current one? Again, through speculation. If it is far enough along in the first search to predict how it will end, it might see the next search coming and start it, even though it is still waiting for data from the current binary search. This should be especially easy if your sorted arrays have a predictable size.

There is a limit to how many instructions your processor can reorder in this manner. Let us say, for example, that the limit is 200 instructions. If each binary search takes 100 instructions, and this count is reliable (maybe because the arrays have fixed sizes), then your processor might be able to do up to two binary searches at once. So it might go twice as fast. But it probably cannot easily go much further (to 3 or 4).

But the programmer can help. We can manually tell the processor to initiate four binary searches right away:

given arrays a1, a2, a3, a4
given target t1, t2, t3, t4

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

cut a1 in two based on above comparison
cut a2 in two based on above comparison
cut a3 in two based on above comparison
cut a4 in two based on above comparison

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

...

You can go far higher than 4 interleaved searches. You can do 8, 16, 32… I suspect that there is no practical need to go beyond 32.
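To make the idea concrete, here is a hedged C++ sketch of four interleaved searches, not the benchmark code itself. It assumes all four sorted arrays are non-empty and share the same length, and it uses a compare-and-select step written as a ternary expression (anticipating the conditional moves discussed below):

#include <cstddef>
#include <cstdint>

// advance four binary searches in lockstep so that while one probe waits on
// memory, the loads of the other three can already be in flight
void interleaved_search4(const uint32_t *const arrays[4], size_t length,
                         const uint32_t targets[4], size_t results[4]) {
  const uint32_t *base[4] = {arrays[0], arrays[1], arrays[2], arrays[3]};
  size_t n = length;
  while (n > 1) {
    size_t half = n >> 1;
    for (int j = 0; j < 4; j++) { // one step of each search per round
      base[j] = (base[j][half] < targets[j]) ? base[j] + half : base[j];
    }
    n -= half;
  }
  for (int j = 0; j < 4; j++) { // final position of each search
    results[j] = (*base[j] < targets[j]) + (base[j] - arrays[j]);
  }
}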

How well does it work? Let us take 1024 sorted arrays containing a total of 64 million integers. In each array, I want to do one and only one binary search. Between each test, I access all of the data in all of the arrays a couple of times to fool the cache.

By default, if you code a binary search, the resulting assembly will be made of comparisons and jumps, so your processor will execute this code in a speculative manner. At least with GNU GCC, we can write the C/C++ code in such a way that the branches are implemented as “conditional move” instructions, which prevents the processor from speculating.
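A sketch of the kind of branchless search meant here looks as follows (assuming that GNU GCC compiles the ternary expression into a conditional move; this is illustrative rather than the exact benchmark code):

#include <cstddef>
#include <cstdint>

// branchless lower-bound search: the loop contains no data-dependent branch,
// so the processor cannot speculate past the comparison
size_t branchfree_search(const uint32_t *source, size_t n, uint32_t target) {
  if (n == 0) return 0;
  const uint32_t *base = source;
  while (n > 1) {
    size_t half = n >> 1;
    base = (base[half] < target) ? base + half : base; // expected to become a cmov
    n -= half;
  }
  return (*base < target) + (base - source);
}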

My results on a recent Intel processor (Cannon Lake) with GNU GCC 8 are as follows:

algorithm             time per search      relative speed
1-wise (independent)  2100 cycles/search   1.0 ×
1-wise (speculative)   900 cycles/search   2.3 ×
1-wise (branchless)   1100 cycles/search   2.0 ×
2-wise (branchless)    800 cycles/search   2.5 ×
4-wise (branchless)    600 cycles/search   3.5 ×
8-wise (branchless)    350 cycles/search   6.0 ×
16-wise (branchless)   280 cycles/search   7.5 ×
32-wise (branchless)   230 cycles/search   9.0 ×

So, by interleaving 32 binary searches, we are able to go 3 to 4 times faster than a conventional (speculative) binary search on what is effectively a memory-bound problem. On some systems, interleaving merely 16 searches might be preferable.

But why is this not higher than 4 times faster? Surely the processor can issue more than 4 memory loads at once?

That is because the 1-wise search, even without speculation, already benefits from the fact that we are streaming multiple binary searches, so that more than one is ongoing at any one time. Indeed, I can prevent the processor from usefully executing more than one search, either by inserting memory fences between searches or by making the target of one search dependent on the index found in the previous search. When I do so, the time goes up to 2100 cycles/search, which is approximately 9 times longer than the 32-wise approach. The exact ratio (9) varies depending on the processor: I get a factor of 7 on the older Skylake architecture and a factor of 5 on an ARM Skylark processor.

My source code is available.

Implementation-wise, I code my algorithms in pure C/C++. There is no need for fancy, esoteric instructions. The conditional move instructions are pretty much standard and old at this point. Sadly, I only know how to convince one compiler (GNU GCC) to reliably produce conditional move instructions. And I have no clue how to control how Java, Swift, Rust or any other language deals with branches.

Could you do better than my implementation? Maybe, but there are arguments suggesting that you can’t beat it by much in general. Each data access is done using fewer than 10 instructions in my implementation, which is far below the number of cycles each access takes and small compared to the size of the instruction buffers, so finding ways to reduce the instruction count should not help. Furthermore, it looks like I am already nearly maxing out the amount of memory-level parallelism, at least on some hardware (9 on Cannon Lake, 7 on Skylake, 5 on Skylark). On Cannon Lake, however, you should be able to do better.

Credit: This work is the result of a collaboration with Travis Downs and Nathan Kurz, though all of the mistakes are mine.

Science and Technology links (September 7th 2019)

  1. In a small clinical trial, scientists administered some “anti-aging” therapies to people in their fifties and sixties. They used growth hormone to regenerate the thymus, a part of our immune system that is nearly all gone by the time people reach 50 years of age. They complemented this therapy with anti-diabetes drugs (dehydroepiandrosterone and metformin) to compensate for the fact that growth hormone is pro-diabetic. The researchers found that in a majority of the participants, the thymus was indeed regenerated: to my knowledge, this is a breakthrough in human beings (we only knew that it worked in animal models). Further, using an epigenetic clock, they found that participants’ biological age was reversed, by 2.5 years on average. The result persisted six months after the trial stopped. These surprising results should be viewed with caution given that it was a small trial with no control group. Nevertheless, to my knowledge, it is the first time that biological aging has been reversed in human beings following a supervised therapy. The article is short and readable. (Source: Nature)
  2. Researchers found out how to edit large pieces of DNA, making it potentially easier than before to construct synthetic genomes. (Source: Science)
  3. Mathematician Hannah Fry on statin drugs:

    Of a thousand people who take statins for five years, the drugs will help only eighteen to avoid a major heart attack or stroke.

  4. Chronological aging (how many years you have lived) and biological aging (loss of fitness with time) differ. There is apparently a tendency for human beings to age biologically more slowly. The trend was confirmed between 1988 and 2010. Older men benefited most from this trend. I do not think that we fully understand this phenomenon.
  5. A woman received a cornea made from reprogrammed stem cells.
  6. Drones are used in China to spray insecticides more efficiently than farmers can.
  7. If you go to Harvard University, you will have access to the best teaching and the best professors, won’t you? Not so fast, write Hacker and Dreifus:

    At Harvard, even untenured assistant professors get a fully paid year to complete a promotion-worthy book. Thus in a recent year, of its history department’s six assistant professors, only two were on hand to teach classes. In Harvard’s department of philosophy that same year, almost half of its full-time faculty were away on sabbaticals. Of course it was the students who paid. Many of their undergraduate courses were canceled or given by one-year visitors unfamiliar with the byways of the university.

    Harvard undergraduate students give their professors C- on classroom performance.

  8. Regarding anxiety, no psychological intervention had greater effects than a placebo.
  9. Men who are avid consumers of pornography are no more sexist or misogynistic than the general public. In fact, pornography users hold more gender egalitarian views.
  10. A new drug was developed quickly using deep learning.
  11. A software program can pass an 8th grade science test.
  12. Our cells can replicate only so many times before their telomeres become too short. To keep on dividing, cells must use an enzyme called telomerase. It was believed until now that, except for cancer cells, most of our cells (somatic cells) did not use telomerase. It appears that this belief is wrong: when our cells have short telomeres, they will sometimes use telomerase to protect themselves.

Passing integers by reference can be expensive…

In languages like C++, you can pass values to functions in two ways. You can pass by value: the value is semantically “copied” before being passed to the function. Any change you make to the value within the function is gone after the function terminates; the value is ephemeral. You can also pass by pointer or reference: semantically, we point the function at the value we already have. Changes made to the value within the function persist after the function returns.

As a rule of thumb, whenever a data element is large (e.g., it cannot fit in registers), it is probably faster to pass by reference or pointer. For tiny data elements that fit in registers, like integers, passing them by value ought to be best.

However, sometimes it is convenient to pass integer values by reference because you want to capture how the value changed during the scope of the function. C++ makes this trivially easy as you just need to add the ampersand (&) in front of the parameter. It looks “free”.

Let us consider the silly example of some function that passes a counter (the variable ‘i’) which gets incremented and added to multiple locations in an array:

void incrementr(uint64_t &i, uint64_t *x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
}

We can write an almost entirely equivalent function without using a reference. We just take the counter by value, modify it, and then return it at the end of the function:

uint64_t increment(uint64_t i, uint64_t *x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
  return i;
}

I expect these two functions to be largely equivalent semantically, as long as the counter is assumed not to reside in the array (x). What about performance? The function call itself is different, so in the best case the pass-by-reference version might cost a couple of extra move instructions due to calling conventions. However, compilers like GNU GCC produce vastly different code that goes far beyond a few extra move instructions at the start and end of the function.

GNU GCC 8.3 keeps the passed-by-reference counter in memory throughout, instead of using a register, due to an aliasing issue. Thus each time you access the counter, you need to reload it, and if you modify it, you need to store the new value again. The net result is far worse performance on my Skylake processor…

            cycles per value   instructions per value
reference   5.9                7.0
value       1.3                3.5

The effect is quite large as you can see. My source code is available.

You can avoid the penalty by copying the passed-by-reference variable to a local variable at the start of the function and copying it back at the end. Or, if your compiler supports it, you can add the __restrict qualifier to your reference.
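Here is a minimal sketch of the local-copy workaround (assuming, as in the snippets above, that SIZE is a compile-time constant and <cstdint> is included):

void incrementr_fixed(uint64_t &i, uint64_t *x) {
  uint64_t local = i;             // one load at function entry
  for (int k = 0; k < SIZE; k++) {
    x[k] += local++;              // the counter can now live in a register
  }
  i = local;                      // one store at function exit
}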

Other languages like Swift are likely affected as well. In Swift, you can pass an integer as an inout variable which is semantically equivalent to a reference…

func fx(_ allints: inout [Int], _ j: inout Int) {
    for k in allints.indices {
        allints[k] = j
        j &+= 1
    }
}

You should avoid such code if you care for performance.

Of course, these considerations are likely to be irrelevant if the function gets inlined or if the function is very inexpensive.

Science and Technology links (August 31st 2019)

  1. The Conboy laboratory in Berkeley is responsible for some of the best work in aging research. In 2005, they showed that by connecting the blood vessels of an old mouse and a young mouse, they could rejuvenate the old one (via a process called parabiosis). This earlier work showed that there are probably “aging factors” in one’s blood. Of course, this suggests that old people could benefit from blood transfusions from younger people, but that is not expected to be effective. In their latest paper, the Conboy lab shows that by inhibiting something called ALK5 and by increasing oxytocin, they could rejuvenate old mice.

    Alk5i plus OT quickly and robustly enhanced neurogenesis, reduced neuro-inflammation, improved cognitive performance, and rejuvenated livers and muscle in old mice. Summarily, simultaneously re-normalizing two pathways that change with age in opposite ways (up vs. down) synergistically reverses multiple symptoms of aging.

    This looks like a major breakthrough. The authors find that the treatment (which is relatively mild in dosage) reduces the number of senescent cells, which might reduce inflammation. Senescent cells are old cells that will not die; they cause all sorts of problems as we age.

  2. If you want to motivate teachers to improve student scores with bonuses, you might think to give the bonuses at the end of the year. Yet it is more effective to give the bonus at the start of the year and take it back if the students do not meet the required targets. Also, it is better to reward students for completing tasks (e.g., homework) rather than for achieving good results.
  3. The pharmaceutical industry has a surprisingly high carbon footprint, larger than that of the automotive industry.

How fast can scancount be?

In an earlier post, I described the following problem. Suppose that you have tens of arrays of integers. You wish to find all integers that appear in more than 3 (say) of the arrays. This is neither a union nor an intersection. A simple algorithm to solve this problem is to use a counter for each possible value and to increment the counter each time the value is seen. We can finally check the counters and output the answer.

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
      counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
  if (counter[i] > threshold)
     output i;
}

You can do much better if you break down the problem space so that all of your counters fit in the CPU cache, solving the problem in multiple passes. It also helps if you use small counters, say 8-bit counters.
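Here is a hedged sketch of the blocking idea, not the library code itself. It assumes that each input array is sorted and that there are fewer than 256 arrays, so the 8-bit counters cannot overflow.

#include <algorithm>
#include <cstdint>
#include <vector>

// sweep the value space in blocks small enough that the 8-bit counters stay in
// cache, advancing a cursor per (sorted) array as we go
std::vector<uint32_t> scancount_blocked(const std::vector<std::vector<uint32_t>> &arrays,
                                        uint32_t universe_size, uint8_t threshold) {
  std::vector<uint32_t> out;
  const uint32_t range = 1 << 16;              // one block of counters: 64 kB
  std::vector<uint8_t> counters(range);
  std::vector<size_t> pos(arrays.size(), 0);   // per-array cursor
  for (uint64_t start = 0; start < universe_size; start += range) {
    std::fill(counters.begin(), counters.end(), 0);
    uint64_t end = start + range;
    for (size_t j = 0; j < arrays.size(); j++) {
      const std::vector<uint32_t> &x = arrays[j];
      size_t &p = pos[j];
      while (p < x.size() && x[p] < end) {     // only the values in this block
        counters[x[p] - start]++;
        p++;
      }
    }
    for (uint32_t i = 0; i < range; i++) {
      if (counters[i] > threshold) out.push_back((uint32_t)(start + i));
    }
  }
  return out;
}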

With this early optimization, I was happy to report a performance of about 16 cycles per input value. It is pretty good… but Nathan Kurz and Travis Downs pointed out that you can do much better.

What are some of the issues?

  1. I was using textbook C++ with iterators and STL algorithms. Converting the problem into lower-level code (using pointers or indexes) seems better for performance.
  2. Using 8-bit counters triggers all sorts of nasty “aliasing” problems: effectively, the compiler becomes paranoid and thinks that you might be writing all over the memory. Travis wrote a blog post on the topic and on how you can mitigate the issue.
  3. There are many low-level issues one might want to handle. For example, if you have 8-bit counter values and a 32-bit threshold value, an extra instruction might be needed to zero-extend your counter value.

We now have two much faster versions to offer. They run at between 3 and 4 cycles per input element. If you are keeping track, that is more than four times faster than my original “optimized” version. I make the code available on GitHub as a header-only C++ library.

    1. The first version is derived from a comment by Nathan Kurz. It works identically to my original version, but instead of using a second pass through the counters, we check the counter values as we modify them. Other than that, I just carefully rewrote my original C++ in “plain C”. I achieve 4 cycles per input element, with 3.1 instructions per cycle, on a Skylake processor with GNU GCC, and around 3.5 cycles per input element under LLVM clang. The implementation should be portable.
    2. Travis was nice enough to share his implementation which follows the same original algorithm, but has been vectorized: it uses fast AVX2 instructions. It is slightly faster under the GNU GCC 8 compiler, sometimes approaching 3 cycles per input element.

So how fast can scancount be? I have not yet answered this question.

In an earlier comment, Travis gave a bound. If each input element must result in the increment of a counter that is not in a register, then we have at least one store per input element, and thus we need at least one cycle per input element on processors that are limited to one store per cycle.

Of course, we also need to load data: say one load to get the element, and one load to get the corresponding counter. Assuming that our hardware can do either two loads per cycle or one store per cycle, then we get a limit of at least two cycles per input element. Based on this napkin mathematics, if we achieve 4 cycles, then we are within a factor of two of the best possible algorithm.

Still, we could go much faster without breaking the speed of light. We are almost solving an embarrassingly parallel problem: we break the problem into subspaces (for the caching benefits), which means that we should be able to dispatch independent components of the problem to distinct cores.

Science and Technology links (August 24th 2019)

  1. The net contribution of the Amazon ecosystem to the world’s oxygen is effectively zero. Furthermore, there is a lot of oxygen in the air and it would be excessively difficult to lower or increase the ratio. In effect, we are not going to run out of oxygen on Earth any time soon, even if we tried to.
  2. According to Nature, the Earth is getting greener. The main driver is climate change and CO2 which acts as a fertilizer. A large fraction of this greening is happening in China and India. These countries have increased their food production by 35% since 2000.
  3. Vitamin D supplements appear to reduce cancer risks. Vitamin D supplements are typically available as either vitamin D2 or D3. We know that D2 supplements do not lower mortality risks. We do not have enough data to assess vitamin D3. (Zhang et al., 2019)
  4. There is no evidence that plastic particles in water are harmful to human beings.
  5. CNBC reports that Mackey, a vegan who runs the successful Whole Foods chain, is critical of the fashionable Beyond Meat “plant-based meat”:

    I don’t think eating highly processed foods is healthy. I think people thrive on eating whole foods. As for health, I will not endorse that, and that is about as big of criticism that I will do in public.

    Meanwhile the stock price of Beyond Meat went up 800% in a few months.

Science and Technology links (August 17th 2019)

  1. Google may soon release an augmented-reality version of Google Maps for mobile phones. I have always found it difficult to identify the streets I see in an unfamiliar setting, and it looks like Google may be solving this very problem. I am looking forward to seeing how well it works in practice.
  2. The delivery company UPS has been using self-driving trucks for some time. However, the trucks are still manned by a driver and an engineer.
  3. Researchers have rejuvenated the brain cells of old rats. There is a press release.
  4. Blood pressure is an important health marker, and it would be valuable to monitor it more broadly. There are cheap devices to measure blood pressure automatically, but they typically require that you put a strap around your arm. Researchers have shown that we can measure blood pressure using the camera of a smartphone. It remains to be verified that this works outside of the lab, in the real world.
  5. Are electric cars good for the environment? According to Holland et al., they are worse than gasoline cars on average, even though they may be preferable where you live:

    Ignoring local pollution leads to an overestimate of the benefits of electric vehicles and an underestimate of the geographic heterogeneity. Accounting for both global and local pollution, we find electric vehicles generate negative environmental benefits of 0.73 cents per mile on average relative to comparable gasoline vehicles. (…) On average, electric vehicles driven in metropolitan areas generate benefits of about $0.01 per mile while those driven outside metropolitan areas generate benefits of −1.7 cents per mile.