Consider the scenario where you are given a small sorted array of integers (e.g., [1,10,100]) and a large sorted array ([1,2,13,51,…]). You want to compute the intersection between these two arrays.
A simple approach would be to take each value in the small array and do a binary search in the large array. Let the size of the large array be n, then a binary search takes up to log n comparisons and data accesses. Of course, log n may not be an integer, so I need to round it up to the smallest integer at least as large as log n: ceil( log n ). If the small array contains k elements, then I need a total of k ceil( log n ) comparisons. Each comparison involves one data access in the large array.
Is k log n comparisons the best bound? Clearly not: if k is almost as big as n, then k log n can far exceeds k+n. Yet a standard merge algorithm (as in the Merge Sort algorithm) requires as little as k+n comparisons. So for the bound to be reasonable, we must require that k is much smaller than n.
Even so, intuitively, it is wasteful to do k distinct binary searches. If the values in the small array are in sorted order, then you have new knowledge after doing the first binary search that you can reuse to help in the next binary search.
How few comparisons can you do? We can use an information-theoretical argument. The basic idea behind such an argument is that for an algorithm based on a decision tree to generate M distinct possible output, it must involve up to ceil( log( M ) ) decisions in the worst case, a count corresponding to the height of the corresponding balanced tree.
To be clear, this means that if an adversary gets to pick the input data, and you have divine algorithm, then I can give you a lower bound on the number of decisions that your algorithm takes. It is entirely possible that, in practice, you will do better given some data; it is also entirely possible that you could do much worse, evidently. Nevertheless, let us continue.
We can recast the intersection problem as the problem of locating, for each key in the small value, the largest value in the large array that is small or equal to the key. How many possible outputs (of this algorithm) are there? I have to choose k values from a set of n elements, but the order does not matter: that’s nk/k! possibilities. This means that if my algorithm relies on a (balanced) decision tree, it must involve ceil( k log n – log k! ) decisions.
Let us put some numbers in there. Suppose that n is 4 million and k is 1024. Then ceil( log n ) is 22 so each key in the small array will involve 22 decisions. On a per key basis, our tighter model gives something like log n – (log k!)/k. The log of the factorial is almost equal to k log k by Stirling’s approximation, so we can approximate the result as log n – log k.
Let us plug some numbers in these formula to track the lower bound on the number of decisions per key in the small array:
|log n – (log k!) / k
This suggests that you may be able to do quite a bit better than k distinct binary searches.
Often, however, it is not so much the decisions that one cares about as much as the number of accesses. Can this value 13 in table when n is 1024 be taken a lower bound on the number of accesses? Not as such because you can access one value from the large array and then reuse it for many comparisons.
To complicate matters, our systems have cache and cache lines. The cache means that repeated accesses to the same value can be much cheaper than accesses to distinct values and may even be free (if the value remains in a register). And our cache does not operate on a per-value basis, but rather data is loaded in chunks of, say, 64 bytes (a cache line).
All in all, my derivations so far may not be highly useful in practical applications, except maybe to argue that we ought to be able to beat the multiple binary search approach.
Can we sketch such an approach? Suppose that we start by sampling B different values in the big array, at intervals of size n / ( B + 1 ). And then we issue the same k binary searches, but targeting one of the interval of size n / ( B + 1 ). This will incur B + k log (n / ( B + 1 )) accesses in the large array. Let us pick B so as to minimize the number of accesses. We could pick B to be k / ln (2) – 1. We get k / ln (2) – 1 + k log (n ln (2) / k) accesses in the large array. For simplicity, I am ignoring the fact that B must be an integer. I am also ignoring the cost of matching each value in the small array to the right subinterval. However, if I am only interested in the number of accesses in the large array, it is an irrelevant cost. Even so, under the assumption that k is small, this is a small cost (O(k)).
Is k / ln (2) + k log (n ln (2) / k) accesses in the large array much better than the naive approach doing k log n accesses? It is unclear to me how much better it might be on real systems once caching effects are accounted for.
Credit: Travis Downs for the initial questions and many ideas. The mistakes are mine.
Further reading: SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016