Consider the scenario where you are given a small sorted array of integers (e.g., [1,10,100]) and a large sorted array ([1,2,13,51,…]). You want to compute the intersection between these two arrays.

A simple approach would be to take each value in the small array and do a binary search in the large array. Let the size of the large array be *n*, then a binary search takes up to log *n* comparisons and data accesses. Of course, log *n* may not be an integer, so I need to round it up to the smallest integer at least as large as log *n: *ceil( log *n *). If the small array contains *k* elements, then I need a total of* k* ceil( log *n *) comparisons. Each comparison involves one data access in the large array.

Is *k* log *n *comparisons the best bound? Clearly not: if *k* is almost as big as *n*, then *k* log *n *can far exceeds* k+n.* Yet a standard merge algorithm (as in the Merge Sort algorithm) requires as little as *k+n* comparisons. So for the bound to be reasonable, we must require that *k* is much smaller than *n*.

Even so, intuitively, it is wasteful to do *k* distinct binary searches. If the values in the small array are in sorted order, then you have new knowledge after doing the first binary search that you can reuse to help in the next binary search.

How few comparisons can you do? We can use an information-theoretical argument. The basic idea behind such an argument is that for an algorithm based on a decision tree to generate *M* distinct possible output, it must involve up to ceil( log( *M *) ) decisions in the worst case, a count corresponding to the height of the corresponding balanced tree.

To be clear, this means that if an adversary gets to pick the input data, and you have divine algorithm, then I can give you a lower bound on the number of decisions that your algorithm takes. It is entirely possible that, in practice, you will do better given some data; it is also entirely possible that you could do much worse, evidently. Nevertheless, let us continue.

We can recast the intersection problem as the problem of locating, for each key in the small value, the largest value in the large array that is small or equal to the key. How many possible outputs (of this algorithm) are there? I have to choose *k* values from a set of *n* elements, but the order does not matter: that’s *n ^{k}*/k! possibilities. This means that if my algorithm relies on a (balanced) decision tree, it must involve ceil(

*k*log

*n*– log

*k*! ) decisions.

Let us put some numbers in there. Suppose that *n* is 4 million and *k* is 1024. Then ceil( log *n *) is 22 so each key in the small array will involve 22 decisions. On a per key basis, our tighter model gives something like log *n* – (log *k*!)/*k.* The log of the factorial is almost equal to *k *log* k *by Stirling’s approximation, so we can approximate the result as log *n – *log* k*.

Let us plug some numbers in these formula to track the lower bound on the number of decisions per key in the small array:

k |
log n |
log n – (log k!) / k |
---|---|---|

8 | 22 | 20 |

16 | 22 | 19 |

64 | 22 | 17 |

1024 | 22 | 13 |

This suggests that you may be able to do quite a bit better than *k* distinct binary searches.

Often, however, it is not so much the decisions that one cares about as much as the number of accesses. Can this value 13 in table when *n* is 1024 be taken a lower bound on the number of accesses? Not as such because you can access one value from the large array and then reuse it for many comparisons.

To complicate matters, our systems have cache and cache lines. The cache means that repeated accesses to the same value can be much cheaper than accesses to distinct values and may even be free (if the value remains in a register). And our cache does not operate on a per-value basis, but rather data is loaded in chunks of, say, 64 bytes (a cache line).

All in all, my derivations so far may not be highly useful in practical applications, except maybe to argue that we ought to be able to beat the multiple binary search approach.

Can we sketch such an approach? Suppose that we start by sampling *B* different values in the big array, at intervals of size *n* / ( *B *+ 1 ). And then we issue the same *k* binary searches, but targeting one of the interval of size *n* / ( *B *+ 1 ). This will incur *B* + *k* log (*n / *( *B* + 1 )) accesses in the large array. Let us pick *B* so as to minimize the number of accesses. We could pick *B* to be *k* / ln (2) – 1. We get *k* / ln (2) – 1 + *k* log (*n* ln (2) / *k*) accesses in the large array. For simplicity, I am ignoring the fact that *B* must be an integer. I am also ignoring the cost of matching each value in the small array to the right subinterval. However, if I am only interested in the number of accesses in the large array, it is an irrelevant cost. Even so, under the assumption that *k* is small, this is a small cost (O(*k*)).

Is *k* / ln (2) + *k* log (*n* ln (2) / *k*) accesses in the large array much better than the naive approach doing *k *log* n *accesses? It is unclear to me how much better it might be on real systems once caching effects are accounted for.

**Credit**: Travis Downs for the initial questions and many ideas. The mistakes are mine.

**Further reading**: SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016

Daniel Lemire, "Bounding the cost of the intersection between a small array and a large array," in *Daniel Lemire's blog*, June 27, 2019.

In our roaring implementation, we have an

`intersectionCountArrayArray`

, which does exactly what it sounds like, and during some unrelated benchmarking, I noticed that the performance varied significantly depending on whether the first or second array was larger. Looking at the current code, it appears that we got noticably better performance when the larger array is the inner loop. I think I did look at trying to binary-search instead, but ended up leaving it alone for now. Total amounts of data are smallish (<8KB for each array), and having the accesses be sequential seems to improve cache performance enough to make up for reading more bits than are probably necessary. It’s been too long for me to remember the impact of the simple change, but I think that particular op ended up about 20% faster across “random” cases. (It matters less in practice, since N is nearly always under 5.)For two sorted arrays we can use the SVS algorithm see : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.2636&rep=rep1&type=pdf

For a two levels structure like Roaring-Bitmaps it might be possible to use intersections with skip intervals like in the TurboPFor “inverted index” demo.

Maybe this is a typo:

“The log of the factorial is almost equal to n log k”

Should have been “k log k”

Ignoring memory hierarchy considerations for the time being, you can approach (log n – (log k!) / k) comparisons by performing a suitably imbalanced binary search. I.e. when searching the large array for the smallest element in the small array with k = 1000, you first compare against the element at position ~0.693 * (n/1000) in the large array, since that comes the closest to dividing the state space evenly. The principle of dividing the state space evenly can be used to estimate all the other optimal binary search indices (this isn’t guaranteed to yield the minimal worst-case number of comparisons, but it should come close enough); and then you can come up with a simple approximation to use to select comparison indices in practice.

More precisely, the approach I just described should do a good job of reducing the average-case # of comparisons. To optimize more for the worst case, you’d want to round the first comparison index to a power of 2, and switch to plain binary search once you’ve found a larger element in the large array.

Thanks. I have now sketched something like what you describe in the blog post.

Ignoring memory hierarchy won’t work in practice. A regular binary search will hit L1 most of the time, as the indices are the same as from the previous binary search. By doing something fancy like binary search of an array subset, we reduce the number of memory lookups, but frequently go to L3 or DRAM instead. L3 is 10x more expensive than L1, DRAM is 50x more expensive.

But we could do something like evenly divide the large array into 1024 fixed subsets. Make a guess which subset contains what you are searching for. If you guessed right, you saved yourself the first 10 steps of binary search. If you guessed wrong, you can merge subsets following buddy-allocator rules. That way you keep reusing the same indices, so they are likely in L1.

Now I wonder how far that approach could be extended. One way to deal with latency is to replace bisection with octasection. I use that approach for git bisection already – if tests take a long time, but you can run them in parallel, octasection is 3x faster while doing 2.3x more work. But bisection is doing a single search. For intersection, you can do multiple searches in parallel and get a speedup without doing 2.3x more work, so probably not a good idea.

But you could just do a streaming lookup of the edges between the 1024 subsections. That removes the guessing. With k=1000, you expect to continue the search in most of those subsections, so effectively you just removed the 10 (fast) lookups at the beginning. In Daniels table, that means 22-10=12 lookups, which seems impressive, but remember that the remaining 12 lookups are likely cache-cold and 10x more expensive. So overall more like a 5% improvement, not 50%.

Ignoring memory hierarchy won’t work in practice.I agree.

There is another optimization we can copy from various sort-implementations. Things like quicksort and binary search work great when your n is large. But after enough divisions, the remaining set is small and things like bubble sort or linear search can outperform the fancy algorithms. Most sort-implementations eventually fall back to something like bubble-sort when dealing with leaves.

Here the obvious choice would be to stop bisections once you reached 1-2 cachelines and do linear search within the remainder. Or to vectorize the compare, movemask and tzcnt to find the precise index without branches.

If you have 4-byte integers in 64-byte cachelines, you replace the last 4-5 compares with this. Again, benefit is relatively low because we optimize the cheap L1 accesses, not the expensive ones.

If the goal is to replace the expensive accesses, I’d try to do better than binary search. If you have a search window of 100 values between 0 and 1000 and are searching for 20, the assumption would be that you’re looking for the second entry or one very close to that. If the values are typically spread very evenly, you can guess a tight search window and skip a lot of bisections. If the value are clumped with random patterns, you need a wide search window and might revert back to binary search.

You could try to hard-code whether you assume random clumps are even distribution. Or you could do a feedback loop, where future searches are influenced by past searches. If guessing worked well in the past, do more guessing. If not, fall back to binary search.

Hwang and Lin have a similar bound in their paper, although they use a slightly different choice formulation (it appears you allow duplicates). Their lower bound is basically log2((n+k) choose k).

The UB complexity of their algorithm g is similar: log(n/k) and change. It comes from probing the n/k-th element first and binary-searching within that range only.

Interesting reference. Though I have not read it, I have come away with a similar idea.

Right. And it is a bit like what Christopher (another commenter) described.

And N. Kurz also privately alluded to the same idea, months ago.

It’s in Knuth 😉

Of course, it would be in Knuth.

I think you can do this in the same time bound (big O, not exact) by iterating over the smaller array in order and doing exponential search in the larger array from the front until you exceed the sought-for key, then doing binary search on the induced prefix of the larger array. On the next key from the smaller array, start your exponential search in the larger array at the index you stopped at in the previous search.

If the ith search covers b_i range in the indexes of the larger array, the total cost is 2 * sum(lg b_i), where sum(b_i) < 2n. Since lg is convex, the cost is maximized when all the b_i’s are equal, so lg (2n/k).

I suspect the constant factors could be eliminated by reusing more of the exponential overshoot when searching subsequent keys.

Thanks. This algorithm is described in the reference I offer… SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016. In this instance, I am concerned with constants and caching… so whether this is the best algorithm is in question.

(I will come back on this topic later, on this blog.)

As far as caching goes, if the large array is arranged as a level-linked B-tree with parent pointers, I think the exponential search algorithm turns into a finger search algorithm with cost O(k log_B (n/k)).