Given a long list of sorted values, how do you find the location of a particular value? A simple strategy is to look at the middle of the list: if your value is larger than the middle value, search the last half of the list; otherwise, search the first half. Then repeat on the selected half, again looking at the middle. This algorithm is called a binary search. It is one of the first algorithms that computer science students learn, and I suspect that many people figure it out on their own as kids.
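In code, the textbook version looks like this (a minimal sketch; the function name and integer types are my choice, not from the post):

```cpp
#include <cstddef>
#include <cstdint>

// A textbook binary search over a sorted array: returns the index of
// the first element that is not less than the target (a "lower bound").
size_t binary_search(const uint32_t *array, size_t length, uint32_t target) {
    size_t low = 0, high = length;
    while (low < high) {
        size_t middle = low + (high - low) / 2; // written to avoid overflow
        if (array[middle] < target) {
            low = middle + 1; // the target is in the last half
        } else {
            high = middle;    // the target is in the first half
        }
    }
    return low;
}
```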

It is hard to drastically improve on the binary search if you only need to do one.

But what if you need to execute multiple binary searches over distinct lists? Perhaps surprisingly, in such cases you can multiply the speed.

Let us first reason about how a binary search works. The processor needs to retrieve the middle value and compare it with your target value. Before the middle value is loaded, the processor cannot tell whether it will need to access the last or first half of the list. It might speculate. Most processors have branch predictors. In the case of a binary search, the branch is hard to predict so we might expect that the branch predictor will get it wrong half the time. Still: when it speculates properly, it is a net win. Importantly, we are using the fact that processors can do multiple memory requests at once.

What else could the processor do? If there are many binary searches to do, it might initiate the second one. And then maybe initiate the third one and so forth. This might be a lot more beneficial than speculating wastefully on a single binary search.

How can it start the second or third search before it finishes the current one? Again, due to speculation. If it is far enough along in the first search to predict its end, it might see the next search coming and start it, even though it is still waiting for data from the current binary search. This should be especially easy if your sorted arrays have a predictable size.

There is a limit to how many instructions your processor can reorder in this manner. Let us say, for example, that the limit is 200 instructions. If each binary search takes 100 instructions, and this count is reliable (maybe because the arrays have fixed sizes), then your processor might be able to do up to two binary searches at once. So it might go twice as fast, but it probably cannot easily go much further (to 3 or 4).

But the programmer can help. We can explicitly instruct the processor to initiate four binary searches right away:

```
given arrays a1, a2, a3, a4
given targets t1, t2, t3, t4
compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4
cut a1 in two based on above comparison
cut a2 in two based on above comparison
cut a3 in two based on above comparison
cut a4 in two based on above comparison
compare t1 with middle of a1
compare t2 with middle of a2
compare t3 with middle of a3
compare t4 with middle of a4
...
```
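Here is what the interleaving might look like in C++ (a sketch with my own names, not the post's actual benchmark code): each "lane" keeps its own base pointer, and every pass through the loop advances all four lanes by one step, so four memory loads can be in flight at once. It assumes all four arrays share the same length, and each answer is the index of the largest element not exceeding the lane's target (or 0 when every element exceeds it).

```cpp
#include <cstddef>
#include <cstdint>

// Four interleaved branchless binary searches over same-length arrays.
void interleaved_search4(const uint32_t *arrays[4], size_t length,
                         const uint32_t targets[4], size_t answers[4]) {
    const uint32_t *base[4] = {arrays[0], arrays[1], arrays[2], arrays[3]};
    size_t n = length;
    while (n > 1) {
        size_t half = n / 2;
        for (int lane = 0; lane < 4; lane++) {
            // one independent load per lane; the ternary typically
            // compiles to a conditional move under GNU GCC
            base[lane] = (base[lane][half] <= targets[lane])
                             ? base[lane] + half
                             : base[lane];
        }
        n -= half;
    }
    for (int lane = 0; lane < 4; lane++) {
        answers[lane] = (size_t)(base[lane] - arrays[lane]);
    }
}
```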

You can go far higher than 4 interleaved searches. You can do 8, 16, 32… I suspect that there is no practical need to go beyond 32.

How well does it work? Let us take 1024 sorted arrays containing a total of 64 million integers. In each array, I want to do one and only one binary search. Between each test, I access all of the data in all of the arrays a couple of times to flush the cache.

By default, if you code a binary search, the resulting assembly will be made of comparisons and jumps, and your processor will execute this code in a speculative manner. At least with GNU GCC, we can write the C/C++ code in such a way that the branches are implemented as “conditional move” instructions, which prevents the processor from speculating.
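A branchless variant can be written as follows (a sketch; whether a conditional move is actually emitted depends on the compiler version and flags). The ternary expression leaves no branch for the processor to speculate on; the function returns the index of the largest element not exceeding the target, or 0 when every element exceeds it:

```cpp
#include <cstddef>
#include <cstdint>

// Branchless binary search: GNU GCC typically compiles the ternary
// into a conditional move (cmov) rather than a conditional jump.
size_t branchless_search(const uint32_t *array, size_t length,
                         uint32_t target) {
    const uint32_t *base = array;
    size_t n = length;
    while (n > 1) {
        size_t half = n / 2;
        // a conditional move instead of a conditional jump
        base = (base[half] <= target) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - array);
}
```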

My results on a recent Intel processor (Cannon Lake) with GNU GCC 8 are as follows:

algorithm | time per search | relative speed
---|---|---
1-wise (independent) | 2100 cycles/search | 1.0 ×
1-wise (speculative) | 900 cycles/search | 2.3 ×
1-wise (branchless) | 1100 cycles/search | 2.0 ×
2-wise (branchless) | 800 cycles/search | 2.5 ×
4-wise (branchless) | 600 cycles/search | 3.5 ×
8-wise (branchless) | 350 cycles/search | 6.0 ×
16-wise (branchless) | 280 cycles/search | 7.5 ×
32-wise (branchless) | 230 cycles/search | 9 ×

So we are able to go 3 to 4 times faster, on what is effectively a memory bound problem, by interleaving 32 binary searches together. Interleaving merely 16 searches might also be preferable on some systems.

But why is this not higher than 4 times faster? Surely the processor can issue more than 4 memory loads at once?

That is because the 1-wise search, even without speculation, already benefits from the fact that we are streaming multiple binary searches, so that more than one is ongoing at any one time. Indeed, I can prevent the processor from usefully executing more than one search at a time, either by inserting memory fences between searches or by making the target of one search dependent on the index found by the previous search. When I do so, the time goes up to 2100 cycles/search, which is approximately 9 times longer than the 32-wise approach. The exact ratio (9) varies depending on the processor: I get a factor of 7 on the older Skylake architecture and a factor of 5 on an ARM Skylark processor.
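The dependency trick can be sketched as follows (an illustration with my own names, not the post's exact benchmark code): each target is derived from the index found by the previous search, so search i+1 cannot begin before search i completes.

```cpp
#include <cstddef>
#include <cstdint>

// A branchless search used as a building block.
static size_t search_leq(const uint32_t *a, size_t n, uint32_t t) {
    const uint32_t *base = a;
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half] <= t) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - a);
}

// Runs `count` searches serially on purpose: the XOR creates a data
// dependency chain, defeating memory-level parallelism.
uint64_t serialized_searches(const uint32_t *const *arrays, size_t length,
                             const uint32_t *seeds, size_t count) {
    uint64_t sum = 0;
    uint32_t target = seeds[0];
    for (size_t i = 0; i < count; i++) {
        size_t idx = search_leq(arrays[i], length, target);
        sum += idx;
        // the next target cannot be known until idx is available
        if (i + 1 < count) target = seeds[i + 1] ^ (uint32_t)idx;
    }
    return sum;
}
```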

Implementation-wise, I code my algorithms in pure C/C++. There is no need for fancy, esoteric instructions. The conditional move instructions are pretty much standard and old at this point. Sadly, I only know how to convince one compiler (GNU GCC) to reliably produce conditional move instructions. And I have no clue how to control how Java, Swift, Rust or any other language deals with branches.

Could you do better than my implementation? Maybe, but there are arguments suggesting that you cannot beat it by much in general. Each data access takes fewer than 10 instructions in my implementation, which is far below the number of cycles per access and small compared to the size of the instruction buffers, so reducing the instruction count should not help. Furthermore, it looks like I am already nearly maxing out the amount of memory-level parallelism, at least on some hardware (9 on Cannon Lake, 7 on Skylake, 5 on Skylark). On Cannon Lake, however, you should be able to do better.

Credit: This work is the result of a collaboration with Travis Downs and Nathan Kurz, though all of the mistakes are mine.

That’s a bold claim, since that would imply there isn’t really any room for further speedup! I thought Travis had concluded that Cannon Lake was actually able to sustain more than 20 outstanding L1 misses: https://www.realworldtech.com/forum/?threadid=181499&curpostid=182779.

Measuring the speedup ratio does seem like a good way of estimating the degree of memory parallelism, though. I wondered how much the ordering of the small array would affect the results, so I tried deleting the std::sort at L1141. To my surprise, it seemed to have no effect. I would have thought that the searching for nearby results in succession would have resulted in better caching — and hence faster results — than searching for them in random order. Is it expected that this would have no effect?

Actually, 20x might be an underestimation on Cannon Lake, so you are correct that I am way off from the maximum on this processor. However, on Skylake, I am not too far off.

nkurz:

L1141 looks like sorting the vector for bsearch. The targets are overwritten with rand() on L1199.

Re sorting the search keys. The top of the implicit search tree should remain in cache even with random ordering, so sorting the search keys might only be expected to help data cache hits in iterations ~11-12 and above. However, the density of search keys with respect to the size of the sorted array means that there will rarely be any overlap between the keys, at these later iterations. There might be an impact on TLB hits?

If we increase the number of searches by a factor of 100-1000, we should observe more sharing in cache hits. However, I expect we’ll just end up underutilising the line fill buffers with a straight up sort of the search keys. It probably makes more sense to transpose the search keys into multiple (one per interleaved search) sorted substreams, after sorting.

Thank you, I was indeed lost with what was happening with that line. I’m still somewhat lost, but perhaps getting closer to understanding.

Neat! Two other improvements for large arrays:

Cache the midpoints (corresponding to the first few levels of a binary search tree) in a small block, thereby reducing the number of memory accesses.

If the values are evenly distributed (e.g., they are hash values), one can scan the array and record the maximum error from the expected position. For example, the NIST’s NSRL hash set has ~34M unique SHA-1 values, requiring close to 700MB of memory to store as a sorted binary array. However, the maximum error is not that great: empirically, any hash value is within 16KB of its expected position. Although it takes a few more cycles to compute the expected position, the reduced window size allows for double the throughput over a naive binary search over the whole thing.

Not so: if the data is well distributed (as is usually the case if you're seeking on a hash, for example), then linear interpolation will be dramatically faster than bisection. If the point you seek has a value 1/3 of the way between your current bounds, you seek 1/3 of the way in. Care should be taken to switch back to bisection if interpolation isn't converging, to avoid a linear worst-case cost. There are simple procedures that guarantee a worst case only a small factor more probes than bisection, but that converge MUCH faster on well-conditioned data.

Similarly, when the bounds are close enough, switching to a linear scan can be faster than continuing to bisect.

The downside, of course, is that batching like you suggest works better with bisection, because it's easy to arrange lookups to share probe locations.

Recently this was posted on HN: https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/ , which has some interesting observations on binary search as well.
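The interpolation strategy described in the comment above can be sketched as follows (my own sketch, without the fallback to bisection that the comment recommends for ill-conditioned data; it returns the array length when the target is absent):

```cpp
#include <cstddef>
#include <cstdint>

// Interpolation search: instead of probing the middle, probe at the
// position suggested by linearly interpolating the target between the
// values at the current bounds. On evenly distributed data this
// converges much faster than bisection.
size_t interpolation_search(const uint32_t *a, size_t n, uint32_t t) {
    if (n == 0) return 0;
    size_t low = 0, high = n - 1;
    while (low <= high && t >= a[low] && t <= a[high]) {
        uint32_t range = a[high] - a[low];
        // guess where t sits between a[low] and a[high]
        size_t pos = (range == 0)
            ? low
            : low + (size_t)(((uint64_t)(t - a[low]) * (high - low)) / range);
        if (a[pos] == t) return pos;
        if (a[pos] < t) {
            low = pos + 1;
        } else {
            if (pos == 0) break; // avoid size_t underflow
            high = pos - 1;
        }
    }
    return n; // not found
}
```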

All that neat and clean CS theory is worth less and less with modern architectures and real world datasets. Even with the simplest of algorithms you have to think of questions such as: Are your searches mostly hits or misses? Are your keys evenly distributed or not? Which parts of your data structures might be hot and which are cold? Can you increase benefits of locality without much overhead? All possible combinations of these and many other parameters are going to affect the choice of algorithm adaptation to the architecture deep below the simple-looking ISA model and probably even creating feedback loops to approaches you want to take.

Sometimes all this makes one feel that there has to be a good, but currently unknown, approach (instead of plain black magic) to creating efficient solutions: that is, solid engineering backed by good theory. Yet I am entirely unaware of it, and treating something like “how can we make binary search efficient” as a mystery is like construction engineers treating the proper use of a nail as a mystery.

This (and other forms of interleaving of memory access cost dominated algorithms) can be implemented easily using coroutines http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf.

If I have a single binary search to perform, then, am I better off to interleave it with 31 other, unimportant binary searches? I’m confused a bit on the implications.

Here is the second paragraph from the post: