Suppose that you have ever larger sets of 64-bit integers, and you want to quickly find out how many distinct integers there are. So given `{10, 12, 10, 16}`, you want an algorithm to output 3, as there are three distinct integers in the set. I choose 64-bit integers, but strings would do fine as well.

There are sensible algorithms to estimate this number, but you want an exact count.

Though there are many good ways to solve this problem, most programmers would first attempt to use one of these two techniques:

- Create a hash set. Throw all the values in the hash set (implemented with a hash table). Then check how many values are found in the hash set in the end. In C++, you might implement it as such:
size_t distinct_count_hash(const uint64_t * values, size_t howmany) { std::unordered_set<uint64_t> hash(values, values + howmany); return hash.size(); }

- Put all the values in an array, sort the array then run through it, deduplicating the values. In C++, you might implement it as follows:
size_t distinct_count_sort(const uint64_t * values, size_t howmany) { std::vector<uint64_t> array(values, values + howmany); std::sort(array.begin(), array.end()); return std::unique(array.begin(), array.end()) - array.begin(); }

Which is best? Sorting has complexity *O*(*n* log *n*) whereas insertion in a hash set has expected constant time *O*(1). That would seem to predict that the hash set approach would always be best.

However, there are many hidden assumptions behind textbook naive big-O analysis, as is typical. So we should be careful.

Simple engineering considerations do ensure that as long as the number of distinct elements is small (say no larger than some fixed constant), then the hash set approach has to be best. Indeed, sorting and copying a large array with lots of repeated elements is clearly wasteful. There is no need for fancy mathematics to understand that scenario.

But that’s not the difficult problem that will give you engineering nightmares. The nasty problem is the one where the number of distinct elements can grow large. In that case, both the array and the hash set can become large.

Which is best in that difficult case? I wrote a small C++ benchmark which you can run yourself.

N |
hash set (cycles/value) | array sort (cycles/value) |

1,000 |
220 | 161 |

10,000 |
260 | 163 |

100,000 |
340 | 200 |

1,000,000 |
820 | 245 |

10,000,000 |
1,100 | 282 |

So when there are many distinct values to be counted, sorting an array is an efficient approach whereas the hash table should be avoided.

How can we understand this problem? One issue is that as the hash table becomes large, it comes to reside in RAM (as it no longer fits in CPU cache). Because of how hash sets work, each operation risks incurring an expensive cache miss. A single retrieval from RAM can take dozens of CPU cycles. Meanwhile, sorting and scanning an array can be done while avoiding most cache misses. It may involve many more operations, but avoiding cache misses can be worth it.

What if I kept cranking up the data size (*N*)? Would the hash set ever catch up? It might not.

The problem is the underlying assumption that you can access all memory using a constant time. That’s not even close to true.

Very interesting post, Daniel (as always!). One thing that is probably worth at leadt mentioning is that the hash approach can apply in situations where the sorting approach might not (e.g., when the elements don’t have a total ordering — or any orderibg at all), since it only requires hashing and equality operations, whole sorting requires less-than comparability. Though, usually, one could come up with an irderibg thasy works for this purpose, even if it’s not completely natural.

P.S. sorry for the mis-spellings; I’m on a new mobile phone :P.

There are other benefits to the hash set approach, such as the fact that it can maintain counts dynamically.

However, the sorting approach can be optimized far more than in my example. There is no good reason to actually fully sort the data and the call to

uniqueis easy to optimize.Sorting a multiset is actually O(Hn), since you get back the work of sorting the duplicates. For ternary quicksort I think you can also just modify it to count how many times it gets called, since each call consumes a unique pivot value.

Are you interested in doing a similar comparison for strings?

Thanks for the post – it’s an useful insight that hashing may not be fastest. Couple factors though for consideration….

When the input includes many duplicate values, the vector approach needs to store the full data set in memory whereas the hash table approach only needs to retain the distinct elements – past a point that can be the difference between driving the machine into heavy swapping and running just fine.

std::unordered_set is not a particularly good choice for this, as it’s implemented as a vector of iterators into a linked list. A hash table implementation that uses open addressing (aka closed hashing) should outperform it – I think my old benchmarks that show about an order of magnitude difference should be applicable here, but unfortunately there’s no such implementation in the Standard Library that makes that easy for you to try.

When the performance is dominated by the time it takes to read in the values, the hash table approach has the advantage of doing its work gradually as the input becomes available, so the final count of unique elements is available almost immediately after the final input’s processed. That contrasts with the vector approach, where std::sort and std::unique are easiest to do after all the values are known, though there are some optimisation options for doing some of the work earlier.

@Tony

When the performance is dominated by the time it takes to read in the values, the hash table approach has the advantage of doing its work gradually as the input becomes available, so the final count of unique elements is available almost immediately after the final input’s processed. That contrasts with the vector approach, where std::sort and std::unique are easiest to do after all the values are known, though there are some optimisation options for doing some of the work earlier.Absolutely, the hash set is the best way (out of the two options) to solve the problem the problem online, without delay.

When the input includes many duplicate values, the vector approach needs to store the full data set (…)I address this point I believe with the paragraph that starts with “Simple engineering considerations do ensure that as long as the number of distinct elements is small (…)”.

std::unordered_set is not a particularly good choice for this, as it’s implemented as a vector of iterators into a linked list.Open addressing, given enough data, will also be hurting from cache misses. I debated about whether extending the benchmark to include other implementations, but, in the end, I decided against it for the sake of simplicity. The same conclusion would apply, with different numbers.

It intuitively feels like you might be able to do even better with a modification of 3-way quicksort that throws away all elements equal to the pivot. Then, once the array is sorted, it also contains the right number of elements.

I agree that it ought to be possible to do much better but I wanted two-three lines of code.

Cool! I think another interesting aspect of hashing worth mentioning is hash collisions – meaning that insertions into a hash table are no longer O(1) even regardless of the CPU cache problem. And then, of course, rehashing is often used to “fix” that as the table grows. Also not a cheap operation.

But a hybrid method is quite approachable. Start with the hash table. After the table exceeds some number of distinct elements, fall back to a sort. That method would adapt fairly well to various datasets.

I think another interesting aspect of hashing worth mentioning is hash collisions â€“ meaning that insertions into a hash table are no longer O(1) even regardless of the CPU cache problemIn this instance, collisions are not the problem. You can increase the capacity of the hash table, and it won’t fix the issue… it might even make it worse.

But a hybrid method is quite approachable. Start with the hash table. After the table exceeds some number of distinct elements, fall back to a sort. That method would adapt fairly well to various datasets.Yes.

Copying my comment from the Hacker News thread on your blog entry (https://news.ycombinator.com/item?id=14403840) :

Just weeks ago, I had to pay out a $5000 bounty on my Cuckoo Cycle proof of work scheme [1], because I had wrongly assumed that hash sets were faster, even though the hash set was reduced to a simple bitmap.

Where the article considers up to 10M elements,

Cuckoo Cycle deals with about a billion elements,

thus answering the question of what happens when cranking up the data size. It turns out that despite using 32x to 64x more memory than the bitmap, sorting is about 4x faster.

Blog entry [2] explains how Cuckoo Cycle reduces to a counting problem.

[1] https://github.com/tromp/cuckoo

[2] http://cryptorials.io/beyond-hashcash-proof-work-theres-mining-hashing

This is not exactly a fair comparison, unordered_set is a bad hash table for small types because it does one allocation per entry. So you’re really benchmarking the allocator, not the hash table. This is not even about open vs closed addressing.

Try google::dense_hash_set for a reasonably good hash table.

Daniel, I believe your conclusion is incorrect. You aren’t really comparing sorting an array to a hash set – you a comparing a particular (maybe slow?) sort implementation to a particular (slow!) hash set implementation. Your problem isn’t just cache misses, it’s a lot of things – an allocation per unordered_set insertion is one of them.

I have tried to solve this problem for a set of 32-bit integers and the fastest solution was a well tuned hash set, similar to Google’s dense hash set.

Here’s the same hash set adapted for this problem, in your benchmark: https://gist.github.com/zeux/e271d172b820e67bebd565a9cd13de30

In this case for 10M elements I get 170 cycles/element for the hash and 260 cycles/element for sort. I haven’t profiled or instrumented the resulting hash code, it might be that the hash function isn’t a very good one in this case.

Now, the sort is also not necessarily the fastest possible in this case; for my problem (with 32-bit integers) a 3-step radix sort was faster than std::sort but that still wasn’t enough to beat the hash. Maybe this case is different – I didn’t analyze this in detail.

Updated the gist with a slightly better hash function, now it’s 145 cycles vs 260 cycles. I will leave exploring other sort options, such as radix, to somebody else, although radix will probably lose here because it will require many passes (around 6) and have significant issues wrt cache as well.

Maybe you (Arseny) are right, but there is nothing magical in the 10M point. You should provide an estimate of the asymptotics of your implementation, ant not just a single point. Maybe for your implementation the crossover between hash sets and sorting is just father away…

I tried hashing vs sorting with larger amounts of data:

10M elements: 146 vs 259 cycles (hashing vs sorting)

100M elements: 169 vs 285 cycles

1G elements: 199 vs 322 cycles

8G elements: 286 vs 364 cycles

Of course, the real difference with 8G elements is memory usage: 187 vs 119 GB. If we can work in-place, the memory usage of the sorting-based method is reduced to 60 GB.

I also tried the multithreaded std::sort implementations from libstdc++ parallel mode. When working in-place, I got the following improvements over the sequential version with 8G elements and 32 threads:

Quicksort: 5.5x speedup, same memory usage

Mergesort: 9.6x speedup, 2x memory usage

I wonder how the hashing-based algorithm would work with tens of threads.

Thanks, this is good data. I’m not sure how to exactly explain the slow deterioration of performance for hash set – the only thing that comes to mind is that TLB misses grow more and more expensive as you need more levels in the page table hierarchy and/or more cache misses *into* the page table structure. Is this using 4K pages, and if so, can you try using huge pages if this is an option? (not sure what the system you’re testing is).

I don’t think it’s straightforward to implement the hashing algorithm on multiple threads assuming that the input set is mostly unique (if it has high redundancy ratio then doing a unique pass with local hash tables will probably be a win). If you do have a few hundred gigabytes of memory and a matching dataset, multithreaded sorting followed by merging might be a faster solution overall (you can even do a unique count without actually merging, so the entire algorithm can work in place) – although 6x speedup on 32 threads is not as exciting as it could have been :).

Slower random access due to TLB misses is the expected behavior with large arrays. You could avoid it with hugepages, but that seems a bit artificial, as you usually cannot expect hugepages in a production system. (I also can’t test it myself, as I don’t have root access to the system I run my benchmarks on.)

The last test with 8G elements also ran into a NUMA bottleneck. As the system has two Opterons and 256 GB memory, the first 64 GB (the original data) is in local memory, while there rest (the hash table) is in non-local memory.

The “perfect” speedup from 32 threads tends to be around 18x on that system. CPU frequency goes down whan all cores are running, the memory bottleneck gets worse, and the data is usually in non-local memory. Mergesort achieved half of that, which is a bit worse than what I expected, while the slower quicksort reached 1/3.

I think the O(1)/O(N lg N) comparison you make is a little misleading, you compare sorting an array of N elements and inserting 1 element into a hash map, I think it should be O(N) with a single insertion being O(1). Your point will still remain valid.

Good research!

Great post! I was curious how Judy arrays would stack up, so I modified your code a bit and wrote a blog post with the results:

https://logperiodic.com/blog/2017/05/counting-distinct-elements-judy-arrays

Sounds like there’s a few lessons to potentially be learned here…

The first is one I hear a lot from people trying to do optimization – you have to focus not on “operations” as is common in College CS 101 but on data access, specifically cache misses of various types. Doesn’t matter how few operations you do if your algorithm accidentally causes data to be written to a hard drive in the middle of it.

The second is that there’s no substitute for real-world testing. You can write out the “algorithm” on paper as much as you want, but reality has all sorts of messy edge cases that can dominate, and the implementation matters a lot (see previous point).

Nice post!

With multiple threads a hash table suffers from serious synchronization overhead issues. Locking and atomics have both an immediate cost and a secondary cost in inter-CPU synchronization of shared cache rows (whether false or true sharing). It is possible to use a hash table per thread and merge them later but I don’t think it’s likely to be worth it.

Merge sort has excellent memory prefetch behavior – straight linear reads. At the top levels of the merge sort recursion it may be better to use the non-temporal memory access instructions that do not update the cache – it will be too long before these locations are accessed again for it to have any effect and cache contention can be minimized for the benefit of other (hyper)threads that share some of the same cache layers.

QuickSort and friends with access patterns that jump around are less prefetch-friendly but will better benefit from caching – as long as they fit in the cache. Below a certain size threshold it pays to switch to a quicksort or even insertion sort for the deeper levels of the merge sort recursion.

A variation of merge sort that stores counts of repeated unique values instead of actually repeating the value could make a significant improvement, unless the values are mostly unique. Some clever encoding to avoid having lots of “1” counts take extra space could overcome that and be the overall winner for all value distributions.

Thanks for this concise reminder that big-O doesn’t tell the whole story.

Log

grows so slowly that I usually consider it as a constant factor, that can often be dwarfed by other down-to-earth hidden factors including slow allocations, CPU-cache-misses, etc.nI love this quote from Damian Gryski in “Slices: Performance through cache-friendliness”

https://www.youtube.com/watch?v=jEG4Qyo_4Bc :

“What

needs to be in order to be considerednsmall… is getting bigger all the time.”