# Faster threshold queries with cache-sensitive scancount

Suppose that you are given 100 sorted arrays of integers. You can compute their union or their intersection. This is a common setup in data indexing: the integers might be unique identifiers.

But there is more than just intersections and unions… What if you want all values that appear in more than three arrays?

A really good algorithm for this problem is called scancount. It is good because it is simple and usually quite fast.

Suppose that all the integers are in the interval [0, 20M). You start with an array of counters, initialized at zero. You scan all your arrays and, for each value you encounter, you increment the corresponding counter. When you have scanned all arrays, you scan your counters, looking for counter values greater than your threshold (3).

The pseudocode looks like this…

```
counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
        counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
    if (counter[i] > threshold)
        output i
}
```

This algorithm is almost entirely bound by memory accesses. Memory-wise, if you have only about 100 arrays, 8-bit counters suffice, so you can store all counters in about 20 MB. Sadly, this means that the counters do not fit in processor cache.

Can you make scancount faster without sacrificing too much simplicity?

So far we have not used the fact that our arrays are sorted. Because they are sorted, you can solve the problem in “cache-sensitive” or “cache-aware” chunks.

Build a small array of counters, spanning maybe only 256 kB. Process all arrays, as with the naive scancount, but suspend the processing of an array as soon as one of its values reaches 262144. This allows you to find all matching values in the interval [0, 262144). Next repeat the process with the next interval ([262144, 524288)), and so forth. In this manner, you incur far fewer expensive cache misses.

I implemented this solution in C++. Here are my results using random arrays, GNU GCC 8 and a Skylake processor. I report the number of CPU cycles per value in the arrays.

| algorithm                  | cycles per value |
|----------------------------|------------------|
| naive scancount            | 37               |
| cache-sensitive scancount  | 16               |

Further reading: Compressed bitmap indexes: beyond unions and intersections, Software: Practice & Experience 46 (2), 2016.

Daniel Lemire, "Faster threshold queries with cache-sensitive scancount," in Daniel Lemire's blog, August 16, 2019.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 14 thoughts on “Faster threshold queries with cache-sensitive scancount”

1. Nathan Kurz says:

Daniel is probably already viewing it this way, but readers might benefit from making the example more concrete.

Assume that the lists of sorted integers are “posting lists” from an inverted index, with each integer representing a document in which a word appears. Assume that we have a “query” that consists of 100 words, so that each of the 100 lists of integers represents all the documents that contain a given word. We’d like to return as the result of the query all documents that contain at least 3 of the words. How can we do this efficiently? And while we are at it, wouldn’t it be nice if we could sort the results (the list of documents that contain the words) based on how many of the terms they contain?

Operations like this are essential to the functioning of search engines, and thus making them more efficient is a big deal.

2. Nice article!

I have implemented here some “T-overlap” algorithms such as “ScanCount”, “MergeSkip”, “DivideSkip”, and “CPMerge”; the “CPMerge” algorithm described here performs really well.

1. Please see my other comment in this thread; it was meant to be a reply to this comment.

3. Thanks for that reference; I was wondering if anyone had any use for this type of algorithm — I came up with MergeSkip in the early 00’s as part of a frequent itemset finder. Any iterator that can “skip” in sublinear time can benefit; I believe even graph cliques are possible.

My version had some tweaks that might help performance — I always kept T elements below the heap in a ring buffer instead of dynamically expanding each run. I started each iteration by swapping the tail of the FIFO with the top of the heap and rotating the ring head pointer. The new heap top would then skip and sift down as in MergeSkip.

One of my todo’s is to look into how this method might compare for simpler types like arrays — thanks for helping to answer that question.

1. Interesting design! I will definitely try to implement it. My benchmarks show that the bottleneck of this algorithm is the heap: `Pop` and `Push`. Thank you for the idea.

1. KWillets says:

It looks like Go’s heap has a Fix method that allows you to update element 0 and sift down, so that’s good (many heap implementations only allow sift-up). The skip operation seems like it would produce a new value near the minimum.

The ring buffer to be clear had only one pointer, as it was always full, making head/tail adjacent. A standard two-pointer version might have more overhead.

4. Travis Downs says:

I was able to get it down to around 4 cycles/element with a few changes.

Also, on my machine (Skylake i7-6700HQ) the benefit of the cache-blocked technique is even bigger than you show: I still get ~16 cycles for the blocked approach, but usually around 50 cycles for the other one. It may depend on activity on my system.

1. > I was able to get it down to around 4 cycles/element with a few changes.

Can you tell us more?

1. Nathan Kurz says:

Let’s flip the question and ask: When it took 16 cycles to do what should be a load, a compare, and a write per element, where did you think the extra time was going?

Having not heard back from Travis, I played with it a little. The first thing was to use “-Ofast” instead of “-O2”, which got me down from 16 cycles to 12. Beyond that I had to use “perf record”. The next thing I noticed was that the second loop that checks the threshold and writes the hits takes a surprising amount of time. This can easily be combined with the other loop, so that a value gets written as soon as it equals the threshold. This got me down from 12 to 8.

The next step was to blame C++. Well, maybe that’s just me. But I think there is a big abstraction issue that’s causing it to do a lot more reads and writes than should be necessary. Every time you call “it++” it’s writing the new value to memory (well, cache) and then rereading it on the next iteration.

Worse, it seems like almost everything else is working off the stack too. The IO for the inner loop should be basically a single read and write, but instead it’s got two writes and a half-dozen reads. By the time I figured out what the assembly was doing, my C++ allergy was making it hard for me to breathe normally, so I didn’t actually try to fix this. But I think if you were to write something straightforward in C, you’d easily get down to 4 cycles, possibly fewer. Travis is stronger than I am, and probably managed to avoid the excessive IO without switching languages.

1. Travis Downs says:

Yeah the constant reloading from memory thing was interesting enough that I wrote a post about it:

https://travisdowns.github.io/blog/2019/08/26/vector-inc.html

The motivation was exactly this problem. Basically once you have writes to `char` (or `uint8_t`) arrays in some loop, you better make sure everything else is an unescaped local variable.

Sorry I didn’t get back to this: I got to 4 cycles quickly (basically `std::fill` -> `memset`, fix the iterator problems Nathan mentions, vectorize the final scanning of the counters array), but of course I wanted more. I tried a few things that got close to 3 cycles but then also tried a bunch of unproductive things and got the code in an ugly state without being faster, so I didn’t want to send a PR with that mess.

The speed of light here is 1 cycle since the only absolutely compulsory thing that is O(N) in the size of the input elements seems to be the scattered writes, and I can’t see an easy way to vectorize that (it’s basically similar to radix sort, which is also limited by the scattered writes). It seems hard to get to 1, but 2 could be possible.

> This can easily be combined with the other loop, so that a value gets written as soon as it equals the threshold.

That’s a good idea. I took a different approach of vectorizing the scan of the counter array: it’s almost all zeros so this goes fast, but I feel your approach may be faster if it can slip in for free among all the other work. I’m going to try it.

1. Travis Downs says:

> it’s almost all zeros so this goes fast

Sorry, that should say “it’s almost all values below the threshold”, and that still goes fast. Only about 2,500 hits over the entire input, so the scanning has to be fast but the hit handling doesn’t.

2. Nathan Kurz says:

Great writeup! Comparing it to my semi-coherent blog post reply, I feel slightly outdone by your complete and coherent explanation. I hadn’t figured out that the char-type aliasing was the base issue.

I’m still confused why the generated assembly is rereading all the constants from memory in the inner loop. I tried changing all the uint8_t’s to uint32_t’s (which solves the aliasing issues at the cost of some cache), but GCC still rereads things like ‘threshold’ from the stack on every iteration. I’m not sure that this is a performance issue (clang doesn’t seem to do it and ends up slower), but it seems like a silly choice. Is there a good way to convince GCC to behave more sensibly?

1. Travis Downs says:

Yes, what you noticed about those reads comes from another effect: the function is complicated enough that it runs out of registers, and gcc makes some less-than-ideal choices about which registers to spill, and this results in it reading spilled registers from the stack in the inner loop.

I work around this by using “noinline” to force the loop into a standalone function, where it gets a full set of registers and so doesn’t need to spill. An interesting case where forcing code out of line speeds it up, without having anything to do with code-size effects (indeed, the inline version may be smaller).
