Speeding up independent binary searches by interleaving them

Given a long list of sorted values, how do you find the location of a particular value? A simple strategy is to first look at the middle of the list. If your value is larger than the middle value, look at the last half of the list; if not, look at the first half. Then repeat with the selected half, again looking in the middle. This algorithm is called a binary search. It is one of the first algorithms that computer science students learn, and I suspect that many people figure it out on their own as kids.

It is hard to drastically improve on the binary search if you only need to do one.

But what if you need to execute multiple binary searches over distinct lists? Perhaps surprisingly, in such cases, you can multiply the speed.

Let us first reason about how a binary search works. The processor needs to retrieve the middle value and compare it with your target value. Before the middle value is loaded, the processor cannot tell whether it will need to access the last or first half of the list. It might speculate. Most processors have branch predictors. In the case of a binary search, the branch is hard to predict so we might expect that the branch predictor will get it wrong half the time. Still: when it speculates properly, it is a net win. Importantly, we are using the fact that processors can do multiple memory requests at once.

What else could the processor do? If there are many binary searches to do, it might initiate the second one. And then maybe initiate the third one and so forth. This might be a lot more beneficial than speculating wastefully on a single binary search.

How can it start the second or third search before it finishes the current one? Again, due to speculation. If it is far enough along in the first search to predict its end, it might see the next search coming and start it even though it is still waiting for data regarding the current binary search. This should be especially easy if your sorted arrays have a predictable size.

There is a limit to how many instructions your processor might reorder in this manner. Let us say, for example, that the limit is 200 instructions. If each binary search takes 100 instructions, and this count is reliable (maybe because the arrays have fixed sizes), then your processor might be able to do up to two binary searches at once. So it might go twice as fast. But it probably cannot easily go much further (to 3 or 4).

But the programmer can help. We can explicitly tell the processor to initiate four binary searches right away:

given arrays a1, a2, a3, a4
given target t1, t2, t3, t4

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

cut a1 in two based on above comparison
cut a2 in two based on above comparison
cut a3 in two based on above comparison
cut a4 in two based on above comparison

compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4

...

You can go far higher than 4 interleaved searches. You can do 8, 16, 32… I suspect that there is no practical need to go beyond 32.
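
To make the structure concrete, here is a minimal C++ sketch of a 4-wise interleaved batch of searches. It assumes, for simplicity, that the four arrays share the same length; the function name is mine and it is meant to convey the idea rather than to reproduce my benchmark code.

#include <cstddef>
#include <cstdint>

// Sketch: advance four binary searches in lockstep. Each iteration issues
// four independent loads (b1[half], ..., b4[half]), so several cache misses
// can be in flight at the same time. Assumes length > 0 and that each array
// is sorted; returns, for each array, the index of the first element that is
// greater than or equal to the corresponding target.
void interleaved_search4(const uint64_t *a1, const uint64_t *a2,
                         const uint64_t *a3, const uint64_t *a4,
                         size_t length, uint64_t t1, uint64_t t2,
                         uint64_t t3, uint64_t t4, size_t idx[4]) {
  const uint64_t *b1 = a1, *b2 = a2, *b3 = a3, *b4 = a4;
  size_t n = length;
  while (n > 1) {
    size_t half = n / 2;
    // cut each array in two based on its own comparison
    if (b1[half] < t1) b1 += half;
    if (b2[half] < t2) b2 += half;
    if (b3[half] < t3) b3 += half;
    if (b4[half] < t4) b4 += half;
    n -= half;
  }
  idx[0] = (*b1 < t1) + (b1 - a1);
  idx[1] = (*b2 < t2) + (b2 - a2);
  idx[2] = (*b3 < t3) + (b3 - a3);
  idx[3] = (*b4 < t4) + (b4 - a4);
}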

How well does it work? Let us take 1024 sorted arrays containing a total of 64 million integers. In each array, I want to do one and just one binary search. Between each test, I access all of the data in all of the arrays a couple of times so that the searches do not benefit from a warm cache.

By default, if you code a binary search, the resulting assembly will be made of comparisons and jumps, so your processor will execute this code in a speculative manner. At least with GNU GCC, we can write the C/C++ code in such a way that the branches are compiled into “conditional move” instructions, which prevents the processor from speculating.
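
For reference, here is the kind of branchless inner loop I have in mind: a sketch, not my exact benchmark code. With GNU GCC, the ternary operator in the loop typically compiles to a conditional move rather than a jump.

#include <cstddef>
#include <cstdint>

// Sketch of a branchless binary search (lower bound): the loop body has no
// data-dependent branch, so GCC tends to compile the pointer update into a
// conditional move. Returns the index of the first element >= target.
size_t branchless_search(const uint64_t *array, size_t length,
                         uint64_t target) {
  if (length == 0) return 0;
  const uint64_t *base = array;
  size_t n = length;
  while (n > 1) {
    size_t half = n / 2;
    base = (base[half] < target) ? base + half : base;  // no jump needed
    n -= half;
  }
  return (*base < target) + (base - array);
}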

My results on a recent Intel processor (Cannon Lake) with GNU GCC 8 are as follows:

algorithm              time per search       relative speed
1-wise (independent)   2100 cycles/search    1.0 ×
1-wise (speculative)    900 cycles/search    2.3 ×
1-wise (branchless)    1100 cycles/search    2.0 ×
2-wise (branchless)     800 cycles/search    2.5 ×
4-wise (branchless)     600 cycles/search    3.5 ×
8-wise (branchless)     350 cycles/search    6.0 ×
16-wise (branchless)    280 cycles/search    7.5 ×
32-wise (branchless)    230 cycles/search    9.0 ×

So we are able to go 3 to 4 times faster, on what is effectively a memory-bound problem, by interleaving 32 binary searches. Interleaving only 16 searches may be preferable on some systems.

But why is this not higher than 4 times faster? Surely the processor can issue more than 4 memory loads at once?

That is because the 1-wise search, even without speculation, already benefits from the fact that we are streaming multiple binary searches, so that more than one is ongoing at any one time. Indeed, I can prevent the processor from usefully executing more than one search at a time, either by inserting memory fences between searches or by making the target of one search depend on the index found by the previous search. When I do so, the time goes up to 2100 cycles per search, which is approximately 9 times longer than the 32-wise approach. The exact ratio (9) varies with the processor: I get a factor of 7 on the older Skylake architecture and a factor of 5 on an ARM Skylark processor.

My source code is available.

Implementation-wise, I code my algorithms in pure C/C++. There is no need for fancy, esoteric instructions. Conditional move instructions are pretty much standard and old at this point. Sadly, I only know how to convince one compiler (GNU GCC) to reliably produce conditional move instructions. And I have no clue how to control how Java, Swift, Rust or any other language deals with branches.

Could you do better than my implementation? Maybe, but there are arguments suggesting that you cannot beat it by much in general. Each data access is done using fewer than 10 instructions in my implementation, which is far fewer than the number of cycles per access and small compared to the size of the instruction buffers, so reducing the instruction count further should not help. Furthermore, it looks like I am already nearly maxing out the amount of memory-level parallelism, at least on some hardware (9 on Cannon Lake, 7 on Skylake, 5 on Skylark). On Cannon Lake, however, you should be able to do better.

Credit: This work is the result of a collaboration with Travis Downs and Nathan Kurz, though all of the mistakes are mine.

Science and Technology links (September 7th 2019)

  1. In a small clinical trial, scientists administered some “anti-aging” therapies to people in their fifties and sixties. They used growth hormone to regenerate the thymus, a part of our immune system which is nearly all gone by the time people reach 50 years old. They complemented this therapy with anti-diabetes drugs (dehydroepiandrosterone and metformin) as an attempt to compensate for the fact that growth hormone is pro-diabetic. The researchers found that in a majority of the participants, the thymus was indeed regenerated: to my knowledge, this is a breakthrough in human beings (we only knew that it worked in animal models). Further, using an epigenetic clock, they found that participants’ biological age was reversed, by 2.5 years on average. The result persisted six months after stopping the trial. These surprising results should be viewed with caution given that it was a small trial with no control group. Nevertheless, to my knowledge, it is the first time that biological aging has been reversed in human beings following a supervised therapy. The article is short and readable. (Source: Nature)
  2. Researchers found out how to edit large pieces of DNA, making it potentially easier than before to construct synthetic genomes. (Source: Science)
  3. Mathematician Hannah Fry on statin drugs:

    Of a thousand people who take statins for five years, the drugs will help only eighteen to avoid a major heart attack or stroke.

  4. Chronological aging (how many years you lived) and biological aging (loss of fitness with time) differ. There is apparently a tendency for human beings to biologically age slower. The trend was confirmed between 1988 and 2010. Older men benefited most from this trend. I do not think that we fully understand this phenomenon.
  5. A woman received a cornea made from reprogrammed stem cells.
  6. Drones are used in China to spray insecticides more efficiently than farmers can.
  7. If you go to Harvard University, you will have access to the best teaching and the best professors, won’t you? Not so fast, write Hacker and Dreifus:

    At Harvard, even untenured assistant professors get a fully paid year to complete a promotion-worthy book. Thus in a recent year, of its history department’s six assistant professors, only two were on hand to teach classes. In Harvard’s department of philosophy that same year, almost half of its full-time faculty were away on sabbaticals. Of course it was the students who paid. Many of their undergraduate courses were canceled or given by one-year visitors unfamiliar with the byways of the university.

    Harvard undergraduate students give their professors C- on classroom performance.

  8. Regarding anxiety, no psychological intervention had greater effects than a placebo.
  9. Men who are avid consumers of pornography are no more sexist or misogynistic than the general public. In fact, pornography users hold more gender egalitarian views.
  10. A new drug was developed quickly using deep learning.
  11. A software program can pass an 8th grade science test.
  12. Our cells can replicate only so many times before their telomeres are too short. To keep on dividing, cells must use an enzyme called telomerase. It was believed until now that, except for cancer, most of our cells (somatic cells) did not use telomerase. It appears that this belief is wrong: when our cells have short telomeres, they will sometimes use telomerase to protect themselves.

Passing integers by reference can be expensive…

In languages like C++, you can pass values to functions in two ways. You can pass by value: the value is semantically “copied” before being passed to the function. Any change you make to the value within the function is gone after the function terminates. The value is ephemeral. You can also pass by pointer or reference: semantically, we point the function at the value we already have. Changes made to the value within the function persist after the function returns.

As a rule of thumb, whenever a data element is large (e.g., it cannot fit in registers), it is probably faster to pass by reference or pointer. For tiny data elements that fit in registers, like integers, passing them by value ought to be best.

However, sometimes it is convenient to pass integer values by reference because you want to capture how the value changed during the scope of the function. C++ makes this trivially easy as you just need to add the ampersand (&) in front of the parameter. It looks “free”.

Let us consider the silly example of some function that passes a counter (the variable ‘i’) which gets incremented and added to multiple locations in an array:

void incrementr(uint64_t &i, uint64_t *x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
}

We can write an almost entirely equivalent function without using a reference. We just take the counter by value, modify it, and then return it at the end of the function:

uint64_t increment(uint64_t i, uint64_t *x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
  return i;
}

I expect these two functions to be largely equivalent semantically, as long as the counter is assumed not to reside within the array (x). What about performance? The function call itself differs, so in the best of cases there might be a couple of extra move instructions in the pass-by-reference version due to calling conventions. However, compilers like GNU GCC produce vastly different code that goes far beyond a few extra move instructions at the start and end of the function.

GNU GCC 8.3 keeps the passed-by-reference value in memory throughout, instead of using a register, due to an aliasing issue. Thus each time you access the counter, you need to reload it, and each time you modify it, you need to store the new value again. The net result is far worse performance on my Skylake processor…

            cycles per value   instructions per value
reference   5.9                7.0
value       1.3                3.5

The effect is quite large as you can see. My source code is available.

You can avoid the penalty by copying the passed-by-reference variable to a local variable at the start of the function and copying it back at the end. Or, if your compiler supports it, you can add the __restrict qualifier to your reference.
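
As a sketch, the two workarounds look like this; the function names are mine, SIZE is the same compile-time constant as in the snippets above, and __restrict is a compiler extension (supported by GCC and clang), not standard C++.

// Workaround 1: copy the reference into a local variable so the compiler
// can keep the counter in a register, then write it back once at the end.
void incrementr_local(uint64_t &i, uint64_t *x) {
  uint64_t counter = i;
  for (int k = 0; k < SIZE; k++) {
    x[k] += counter++;
  }
  i = counter;
}

// Workaround 2: promise the compiler that i and x do not alias.
void incrementr_restrict(uint64_t &__restrict i, uint64_t *__restrict x) {
  for (int k = 0; k < SIZE; k++) {
    x[k] += i++;
  }
}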

Other languages like Swift are likely affected as well. In Swift, you can pass an integer as an inout variable which is semantically equivalent to a reference…

func fx(_ allints: inout [Int], _ j: inout Int) {
    for k in allints.indices {
        allints[k] = j
        j &+= 1
    }
}

You should avoid such code if you care about performance.

Of course, these considerations are likely to be irrelevant if the function gets inlined or if the function is very inexpensive.

Science and Technology links (August 31st 2019)

  1. The Conboy laboratory in Berkeley is responsible for some of the best work in aging research. In 2005, they showed that by connecting the blood vessels of an old and a young mouse, they could rejuvenate the old mouse (via a process called parabiosis). This earlier work showed that there are probably “aging factors” in one’s blood. Of course, this suggests that old people could benefit from blood transfusions coming from younger people, but that is not expected to be effective. In their latest paper, the Conboy lab shows that by inhibiting something called ALK5 and by increasing oxytocin, they could rejuvenate old mice.

    Alk5i plus OT quickly and robustly enhanced neurogenesis, reduced neuro-inflammation, improved cognitive performance, and rejuvenated livers and muscle in old mice. Summarily, simultaneously re-normalizing two pathways that change with age in opposite ways (up vs. down) synergistically reverses multiple symptoms of aging.

    This looks like a major breakthrough. The authors find that the treatment (which is relatively mild in dosage) reduces the number of senescent cells which might reduce inflammation. Senescent cells are old cells that will not die and they cause all sorts of problems as we age.

  2. If you want to motivate teachers to improve student scores with bonuses, you might think of giving the bonuses at the end of the year. Yet it is more effective to give the bonus at the start of the year and take it back if the students do not meet the required targets. Also, it is better to reward students for completing tasks (e.g., homework) rather than for achieving good results.
  3. The pharmaceutical industry has a surprisingly high carbon footprint, larger than that of the automotive industry.

How fast can scancount be?

In an earlier post, I described the following problem. Suppose that you have tens of arrays of integers. You wish to find all integers that appear in more than 3 (say) of the arrays. This is neither a union nor an intersection. A simple algorithm to solve this problem is to use a counter for each possible value and to increment the counter each time the value is seen. At the end, we check the counters and output the answer.

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
        counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
    if (counter[i] > threshold)
        output i;
}

You can do much better if you break down the space of the problem, so that all of your counters fit in the CPU cache. You solve the problem using multiple passes. It also helps if you use small counters, say 8-bit counters.

With this early optimization, I was happy to report a performance of about 16 cycles per input value. It is pretty good… but Nathan Kurz and Travis Downs pointed out that you can do much better.

What are some of the issues?

  1. I was using textbook C++ with iterators and STL algorithms. Converting the problem into lower-level code (using pointers or indexes) seems better for performance.
  2. Using 8-bit counters triggers all sorts of nasty “aliasing” problems: effectively, the compiler becomes paranoid and thinks that you might be writing all over the memory. Travis wrote a blog post on the topic and on how you can mitigate the issue.
  3. There are many low-level issues one might want to handle. For example, if you have 8-bit counter values and a 32-bit threshold value, an extra instruction might be needed to zero-extend your counter value.

We now have two much faster versions to offer. They run at between 3 and 4 cycles per input element. If you are keeping track, that is more than four times faster than my original “optimized” version. I make the code available on GitHub as a header-only C++ library.

    1. The first version is derived from a comment by Nathan Kurz. It works identically to my original version, except that instead of making a second pass through the counters, we check the counter values as we modify them (see the sketch after this list). Other than that, I just carefully rewrote my original C++ in “plain C” style. I achieve 4 cycles per input element with about 3.1 instructions per cycle on a Skylake processor with GNU GCC, and around 3.5 cycles per input element under LLVM clang. The implementation should be portable.
    2. Travis was nice enough to share his implementation which follows the same original algorithm, but has been vectorized: it uses fast AVX2 instructions. It is slightly faster under the GNU GCC 8 compiler, sometimes approaching 3 cycles per input element.
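
Here is a rough sketch of the idea behind the first version, assuming 8-bit counters, at most 255 arrays, and inputs that have already been restricted to a cache-sized range; my actual code in the repository differs in the details.

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: single-pass scancount over one cache-sized range. Instead of a
// second pass over the counters, we emit a hit the moment a counter goes
// over the threshold. Assumes all values are in [0, range_size).
void scancount_onepass(uint8_t *counters, size_t range_size,
                       const uint32_t *const *arrays, const size_t *sizes,
                       size_t array_count, unsigned threshold,
                       std::vector<uint32_t> &hits) {
  for (size_t i = 0; i < range_size; i++) counters[i] = 0;
  for (size_t a = 0; a < array_count; a++) {
    const uint32_t *z = arrays[a];
    for (size_t j = 0; j < sizes[a]; j++) {
      unsigned c = ++counters[z[j]];
      if (c == threshold + 1) hits.push_back(z[j]);  // report exactly once
    }
  }
}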


So how fast can scancount be? I have not yet answered this question.

In an earlier comment, Travis gave a bound. If each input element must result in the increment of a counter that is not in a register, then we have one store per input element, and thus we need at least one cycle per input element on processors that are limited to one store per cycle.

Of course, we also need to load data: say one load to get the element, and one load to get the corresponding counter. Assuming that our hardware can do either two loads per cycle or one store per cycle, then we get a limit of at least two cycles per input element. Based on this napkin mathematics, if we achieve 4 cycles, then we are within a factor of two of the best possible algorithm.

Still, we could go much faster without breaking the speed of light. We are almost solving an embarrassingly parallel problem: we break the problem into subspaces (for the caching benefits), which means that we should be able to dispatch independent components of the problem to distinct cores.

Science and Technology links (August 24th 2019)

  1. The net contribution of the Amazon ecosystem to the world’s oxygen is effectively zero. Furthermore, there is a lot of oxygen in the air and it would be excessively difficult to lower or increase the ratio. In effect, we are not going to run out of oxygen on Earth any time soon, even if we tried to.
  2. According to Nature, the Earth is getting greener. The main driver is climate change and CO2 which acts as a fertilizer. A large fraction of this greening is happening in China and India. These countries have increased their food production by 35% since 2000.
  3. Vitamin D supplements appear to reduce cancer risks. Vitamin D supplements are typically available as either vitamin D2 or D3. We know that D2 supplements do not lower mortality risks. We do not have enough data to assess vitamin D3. (Zhang et al., 2019)
  4. There is no evidence that plastic particles in water are harmful to human beings.
  5. CNBC reports that Mackey, a vegan who runs the successful Whole Foods chain, is critical of the fashionable Beyond Meat “plant-based meat”:

    I don’t think eating highly processed foods is healthy. I think people thrive on eating whole foods. As for health, I will not endorse that, and that is about as big of criticism that I will do in public.

    Meanwhile the stock price of Beyond Meat went up 800% in a few months.

Science and Technology links (August 17th 2019)

  1. Google may soon release an augmented-reality version of Google Maps for mobile phones. I have always found it difficult to identify the streets I see in an unfamiliar setting, and it looks like Google may be solving this very problem. I am looking forward to seeing how well it works in practice.
  2. The delivery company UPS has been using self-driving trucks for some time. However, the trucks are still manned by a driver and an engineer.
  3. Researchers have rejuvenated the brain cells of old rats. There is a press release.
  4. Blood pressure is an important health marker and it would be important to monitor it more broadly. There are cheap devices to measure blood pressure automatically, but they typically require that you put a strap around your arm. Researchers have shown that we can measure blood pressure using the camera of a smartphone. It remains to verify that it can work outside of the lab, in the real world.
  5. Are electric cars good for the environment? According to Holland et al., they are worse than gasoline cars on average, even though they may be preferable where you live:

    Ignoring local pollution leads to an overestimate of the benefits of electric vehicles and an underestimate of the geographic heterogeneity. Accounting for both global and local pollution, we find electric vehicles generate negative environmental benefits of 0.73 cents per mile on average relative to comparable gasoline vehicles. (…) On average, electric vehicles driven in metropolitan areas generate benefits of about $0.01 per mile while those driven outside metropolitan areas generate benefits of −1.7 cents per mile.

Faster threshold queries with cache-sensitive scancount

Suppose that you are given 100 sorted arrays of integers. You can compute their union or their intersection. It is a common setup in data indexing: the integers might be unique identifiers.

But there is more than just intersections and unions… What if you want all values that appear in more than three arrays?

A really good algorithm for this problem is called scancount. It is good because it is simple and usually quite fast.

Suppose that all my integers are in the interval [0, 20M). You start with an array of counters, initialized at zero. You scan all your arrays: for each value encountered, you increment the corresponding counter. When you have scanned all the arrays, you scan your counters, looking for counter values greater than your threshold (3).

The pseudocode looks like this…

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
        counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
    if (counter[i] > threshold)
        output i;
}

This algorithm is almost entirely bound by memory accesses. Memory-wise, if you only have about 100 arrays, you only need 8-bit counters, so I can store all of the counters in about 20 MB. Sadly, this means that the counters do not fit in the processor cache.

Can you make scancount faster without sacrificing too much simplicity?

So far, we have not used the fact that our arrays are sorted. Because they are sorted, you can solve the problem in “cache-sensitive” or “cache-aware” chunks.

Build a small array of counters, spanning maybe only 256 kB. Process all arrays as in the naive scancount, but suspend the processing of an array as soon as you reach a value of 262144 or more. This allows you to find all matching values in the interval [0, 262144). Then repeat with the next interval ([262144, 524288)), and so forth. In this manner, you incur far fewer expensive cache misses.
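
Here is a sketch of this cache-sensitive approach in C++, with the chunking made explicit. It is a simplified version assuming sorted 32-bit inputs and at most 255 arrays, not my exact implementation.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: cache-sensitive scancount. We process the value space in chunks of
// 'range' values (e.g., 262144) so that the 8-bit counters fit in cache.
// Because each input array is sorted, we can resume each array where we
// left off in the previous chunk.
std::vector<uint32_t> scancount(const std::vector<std::vector<uint32_t>> &data,
                                unsigned threshold, uint32_t range = 262144) {
  std::vector<uint32_t> hits;
  std::vector<uint8_t> counters(range);
  std::vector<size_t> positions(data.size(), 0);  // per-array resume points
  uint64_t maxval = 0;
  for (const auto &z : data)
    if (!z.empty() && z.back() > maxval) maxval = z.back();
  for (uint64_t start = 0; start <= maxval; start += range) {
    std::fill(counters.begin(), counters.end(), 0);
    uint64_t end = start + range;
    for (size_t a = 0; a < data.size(); a++) {
      const std::vector<uint32_t> &z = data[a];
      size_t j = positions[a];
      while (j < z.size() && z[j] < end) {
        counters[z[j] - start]++;  // counter index is relative to the chunk
        j++;
      }
      positions[a] = j;  // suspend this array until the next chunk
    }
    for (uint32_t i = 0; i < range; i++) {
      if (counters[i] > threshold) hits.push_back(uint32_t(start + i));
    }
  }
  return hits;
}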

I implemented this solution in C++. Here are my results using random arrays, GNU GCC 8 and a Skylake processor. I report the number of CPU cycles per value in the arrays.

naive scancount             37 cycles
cache-sensitive scancount   16 cycles

Further reading: Compressed bitmap indexes: beyond unions and intersections, Software: Practice & Experience 46 (2), 2016.

Science and Technology links (August 10th 2019)

    1. A short (less than one hour) online lesson improved grades among weaker students. The lesson taught the growth mindset: that intellectual abilities can be developed.
    2. The chip maker AMD has released a processor (EPYC 7742) which has 32 billion transistors and 64 cores.
    3. According to a recent study, we tend to see women as victims and men as perpetrators. Managers are perceived as less moral and fair when they fire a group of female (versus male) employees. Female victims are expected to experience more pain from an ambiguous joke.
    4. Blue zones are regions in the world where people claim to live longer. Newman questions the validity of these zones in an article entitled Supercentenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans. The most likely explanation for these zones is that you find more centenarians in regions where it is easiest to lie about one’s own age.
    5. Young men spend more time than ever playing video games. It is also true young men in the West work fewer hours. Yet video games are not to blame for the reduced work.
    6. There is a massive overproduction of coffee beans, leading to falling prices.
    7. In most cases, electric fans are a good way to mitigate heat-related illnesses, despite contrary government advice. Don’t be shy and use a fan when it is too hot! It is safe and it cools you down.
    8. McDonald’s admitted that its paper straws cannot be recycled at this time, contrary to its plastic straws.
    9. Bilingual people are much more resilient to dementia (like Alzheimer’s) than monolingual people, suffering from dementia four years later on average.
    10. In mice, we can stimulate hair growth using prescription drugs rapamycin and metformin. These drugs are often viewed as anti-aging drugs.
    11. A single season of collegiate football (without injury) leads to a reduction in brain integrity.
    12. Playing video games (but not taking piano lessons) increases grey matter volume in the brain.
    13. Building one wind turbine requires 900 tons of steel, 2,500 tons of concrete and 45 tons of plastic, much of which is produced with coal power. Moreover, most solar panels are built in China using coal power.

Science and Technology links (August 3rd 2019)