In a previous post, I considered the problem of finding the k smallest (or k largest) elements from a stream of values.
The naive approach is to collect all the values in an array, sort the array, and return the first k values. When an index is not available, that is how some relational databases reportedly solve SELECT/ORDER BY/LIMIT queries. It is also how many programmers will solve this problem.
Lots of people objected that we could do even better using an algorithm called QuickSelect (e.g., Rasmus Pagh). Whereas QuickSelect has far higher memory requirements than a binary heap (O( n )) and a worst complexity of O( n2 ), it has a superior average case complexity of O( n ). (Note: you can combine a binary heap and QuickSelect using introselect, I do not consider this possibility.)
In this benchmark, we receive 1408 random integers and we must collect the smallest 128.
The approach using a binary heap can run about 37,000 times a second, whereas QuickSelect runs 45,000 times per second, or about 20% faster. They are both about an order of magnitude faster than the naive sort/slice approach.
For all practical purposes, 20% faster is negligible in this case. I have actually hit a sweet spot where QuickSelect and the binary heap are comparable.
What are other cases of interest?
- If you only seek the top 5 elements out of an array, then the binary heap is likely to beat QuickSelect, irrespective of how many elements I have. The binary heap will fit in one or two cache lines, and the log k factor will be small.
- If I keep k at a sizeable 128, but I increase substantially the size of the array, then QuickSelect will start to win big.
- However, if I keep increasing the array size, the benefits of QuickSelect might start to fade. Indeed, QuickSelect will start to operate in RAM whereas the binary heap will remain in CPU cache. QuickSelect will become increasingly limited by potentially expensive cache misses.
- QuickSelect still has the worst case quadratic-time scenario that could be triggered by really bad input data. The binary heap is more likely to provide consistent speed.
What else could we do?
- Martin Cohen suggested a simple insertion sort on the basis that, for large streams, you are unlikely to encounter many candidates for the top-k position. This would probably work well for very long streams, but it could degenerate badly in some cases.
- Michel Lemay refered to a fancier approach used by Google Guava: maintain a buffer of 2k elements initially empty. Fill it up. Once it is full, use QuickSelect on it and discard half of it. Rinse and repeat. This seems like an appealing idea and Michel testifies that it provides very good practical performance.
So I am probably going to have to return to this problem.
Follow-up blog post: Top speed for top-k queries.