# How fast can scancount be?

In an earlier post, I described the following problem. Suppose that you have tens of arrays of integers. You wish to find all integer that are in more than 3 (say) of the sets. This is neither a union nor an intersection. A simple algorithm to solve this problem is to use a counter for each possible value, and increment the counter each time the value is seen. We can finally check the counters and output the answer.

```counter <- array of zeros
for array x in set of arrays {
for value y in array x {
counter[y] += 1
}
for i = 0; i < counter.size(); i++ {
if(counter[i] > threshold)
output i;
}
```

You can do much better if you break down the space of the problem, so that all of your counters fit in the CPU cache. You solve the problem using multiple passes. It also helps if you use small counters, say 8-bit counters.

With this early optimization, I was happy to report a performance of about 16 cycles per input value. It is pretty good… but Nathan Kurz and Travis Downs pointed out that you can do much better.

What are some of the issues?

1. I was using textbook C++ with iterators and STL algorithms. Converting the problem into lower-level code (using pointers or indexes) seems better for performance.
2. Using 8-bit counters triggers all sorts of nasty “aliasing” problems: effectively, the compiler becomes paranoid and thinks that you might be writing all over the memory. Travis wrote a blog post on the topic and on how you can mitigate the issue.
3. There are many low-level issues one might want to handle. For example, if you have 8-bit counter values and an 32-bit threshold value, an extra instruction might be needed to zero-extend your counter value.

We now have two much faster versions to offer. They run at between 3 to 4 cycles per input element. If you are keeping track, it is better than four times faster than my original “optimized” version. I make the code available on GitHub as a header-only C++ library.

1. The first version is derived from a comment by Nathan Kurz. It works identically to my original version, but instead of using a second pass through the counters, we check the counter values as we modify them. Other than that, I just carefully rewrote my original C++ in “plain C”. I achieve 4 cycles per input element and a number of instructions per cycle of 3.1 on a skylake processor with GNU GCC and around 3.5 cycles per input element under LLVM clang. The implementation should be portable.
2. Travis was nice enough to share his implementation which follows the same original algorithm, but has been vectorized: it uses fast AVX2 instructions. It is slightly faster under the GNU GCC 8 compiler, sometimes approaching 3 cycles per input element.

So how fast can scancount be? I have not yet answered this question.

In an earlier comment, Travis gave a bound. If each input element must result into the increment of a counter that is not a register, then we have one store per input element, and thus we need at least one cycle per input element on processors that are limited to one store per cycle.

Of course, we also need to load data: say one load to get the element, and one load to get the corresponding counter. Assuming that our hardware can do either two loads per cycle or one store per cycle, then we get a limit of at least two cycles per input element. Based on this napkin mathematics, if we achieve 4 cycles, then we are within a factor of two of the best possible algorithm.

Still: we could go much faster without breaking the speed of light. We are almost solving an embarrassingly parallel problem: we break the problem into subspaces (for the caching benefits) but this means that should be able to dispatch independent components of the problem to distinct cores.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 9 thoughts on “How fast can scancount be?”

1. foobar says:

I think that if your integer lists were (sorted and) dense, you could theoretically go significantly below two reads and one write per element on the update operation by partially transposing the update loop, especially if you can benefit from wide reads and writes (AVX2/AVX-512). The optimal case would cause only the integer lists to be read once with the widest possible read size, and the output array to be written once(!) with maximum width, that is, very close to zero memory operations per element. (Actually you could collect the results straight away at this phase.)

The problem here is all the juggling required in the middle. Maybe some of those very un-RISCy vector instructions could save the day? (I don’t believe current architectures would be that good, or at least this would require significant register spilling which would undo large portion of potential benefits.)

1. Jörn Engel says:

The most common trick in this direction is to use the AH/AL pair for byte reads or writes. Requires minimal loop unrolling.

```read(AL, foo); read(AH, bar); write(AX, baz); // writes two bytes in one instruction ```

It doesn’t always help and compilers tend to forget about this optimization once you get into register pressure. But it allows you to double the number of byte reads/writes without writing vector code.

1. foobar says:

That is not the hard part in this case, the hard part is expanding an array of sorted integers into a corresponding bit-mappy vector of 1/0 integers (probably of the same size) in an efficient manner. 🙂

I think this could be accomplished with AVX-512 with a core loop magic consisting basically of a broadcast, a compare to mask and a VPERM instruction. Reasoning about AVX-512 instructions sincerely makes my head hurt, but it would seem plausible that this kind of a loop could process 8 32-bit input integers per one loop iteration for 50% dense integer list. What would probably be the performance killer is conditional branching…

1. foobar says:

Apparently my idea was naive, but you can still perform the following operation relatively efficiently on AVX-512:

A list of sorted unique integers (32 32-bit integers in a pair of AVX-512 registers):

0, 3, 4, 6, 9, 10, 11, 20, 25, 26, 30, 35, …

can be converted to a list of integers (or maybe their two’s complements) on similarly sized vector entries:

1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0 (35 didn’t fit on first 32 entries)

These results could be added with a vectored operation 16 entries at a time.

My back-of-an-envelope calculation of this operation requires around 4 instructions for masking outlying values, 2 instructions for adjusting offsets, 2 variable vectored shifts, 11 operations for 32-way reduction (OR or ADD operation and shuffles) and two TEST-like compares against constant vectors to achieve this result. This is on the ballparks of 23 uops per 32 resulting output values. (No, I didn’t write the actual code…)

Density of input lists is a crucial point, but with densities of >20% this approach might be very competitive with less vectorised approaches.

1. foobar says:

The benchmarks on the scancount problem are sparse, but I believe here is a chance for another toy problem (unless it’s already solved!):

How fast can you convert a sorted array of integers into the corresponding bitmap?

That is essentially the reverse of this problem: https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/

My bet is that naive implementation is fastest for sparse bitmaps (roughly one clock cycle per set bit, but not that fast for dense bitmaps!), but AVX-512 implementation might be faster for dense bitmaps (as a wild-assed guess, possibly something like four bits of bitmap produced per clock cycle).

2. Nathan Kurz says:

Assuming that our hardware can do either two loads per cycle or one
store per cycle, then we get a limit of at least two cycles per input
element.

For modern Intel processors you can actually do two loads and a write within the same cycle. The trickiness is that the addressing mode for the write would need to be “simple”, which is to say, based on just one register plus a fixed offset (rather than a base and index register). That’s difficult to achieve in this case, since we want to index into the counter array by the value we just read. So in practice, I think you are right that this algorithm can’t achieve 1 cycle timings. But this is because of the demands of the algorithm, rather than a simple hardware limitation.

And it’s not necessarily the case that just because we can’t achieve 1.0 cycles, that the minimum jumps to 2.0 — it might be fractional. For example, a pattern of R+R, R+W, R+W gives you 4 reads and 2 writes in 3 cycles, which allows 1.5 cycles per element. Anyway, by fiddling with the range and array sizes, I’ve gotten results that are getting pretty close to 2 cycles per iteration. For example, demo(100000000, 1500000, 50, 5) gets down to 2.2 cycles/element for scalar:

```[email protected]:~/git/fastscancount\$ perf record counter Got 9924 hits optimized cache-sensitive scancount 2.19437 cycles/element 3.66452 instructions/cycles 0.00227307 miss/element AVX2-based scancount 2.48756 cycles/element 2.5749 instructions/cycles 0.00199687 miss/element ```

I still don’t quite understand why some sizes are faster than others. I think the issue is that both the data and counter arrays are competing for the same space in L3, such that increasing the number or size of the data arrays causes significantly slower results. It’s possible that a streaming read for the data (reading in a way that avoids polluting L3) might help get things a little bit faster.

3. Travis Downs says:

The trickiness is that the addressing mode for the write would need to
be “simple”, which is to say, based on just one register plus a fixed
offset (rather than a base and index register). That’s difficult to
achieve in this case, since we want to index into the counter array by

One way to get simple addressing is to use a static (global) array. Then the only register needed in the addressing expression is the one holding the just loaded value: the array base is a constant offset. Here’s a godbolt.

Of course, this code is not thread-safe, so if you really wanted to do this type of really, really low-level optimization and keep things thread safe, you’d have to use a mutex to protect this code, perhaps falling back to the non-static array variant if the function is called while another thread holds the lock.

One downside of this approach is that you don’t really have any control over the allocation of the array, in particular you probably won’t be able to ask for huge pages (perhaps you can if the array lives in the .bss and you do a madvise before the page is accessed – but there are a lot of issues to work out).

Another approach is to keep using a dynamically allocated counters array, but JIT-compile the core loop embedding the array offset as a static offset in the same way as above. Yes, a crazy hack.

I think the issue is that both the data and counter arrays are
competing for the same space in L3, such that increasing the number or
size of the data arrays causes significantly slower results.

How big was the counter array? I thought that most of the performance effects as probably related to L1 cache misses on the counter array: stores to L2 (i.e., stores that miss in L1 bit hit in L2) have a speed limit of 3 cycles per store, so any result less than 3 cycles indicates that you are getting at least some hits in L1 for the counter increments, which I think requires a small counter array (or dense values, i.e., more than one hit in the same cache line).

4. Travis Downs says:

I was using textbook C++ with iterators and STL algorithms. Converting
the problem into lower-level code (using pointers or indexes) seems
better for performance.

I would caution readers from using this as any kind of general rule.

In fact, for the aliasing stuff, using iterators or even more idiomatic C++ such as for-each loops would have helped also, versus say indexing with `vector[i]` (it may not be simple if you wanted the value of i for another purpose in the loop).

Some C++ abstractions are zero cost, some are not. As we’ve seen here, replacing plain array indexing with vector indexing is not always zero cost, especially in the presence of char writs and the consequent aliasing problems.

You can still make it fast using C++ idioms, but indeed it might require a bit more care.

Of course, we also need to load data: say one load to get the element,
and one load to get the corresponding counter.

Strictly speaking you don’t need 1 load to get the element: these elements are adjacent, so in principle you can get say 2 elements with a 64-bit load, or 8 elements with a 256-bit vector load. In practice this doesn’t help too much because your “budget” of stuff you can do per cycle (4 ops) is too small for this to save much – but it’s something to keep in mind when one input or output is contiguous.

I think a carefully tuned implementation could achieve ~1.5 cycles per cycle for your parameters (20M, 50K, 100, 3) which are kind of a worst case (admittedly the “worst case” region is very large). At some other points you could do better, e.g., when the data arrays are dense, since vectorized approaches become feasible, or when the threshold is much larger (since you can eliminate many elements without performing the corresponding store).

1. I would caution readers from using this as any kind of general rule.

I agree with you. My statement was not meant as a prescription.

You may subscribe to this blog by email.