Quickly returning the top-k elements: computer science vs. the real world

A standard problem in computer science is to quickly return the smallest or largest K elements of a list. If the list is made of N elements, then we can solve this problem in time N log (K) using a binary heap.

How would the average programmer solve this problem? My guess is that unless K is 1, then most programmers would simply sort the list and pick the first or last K elements.

That’s not such a bad deal. The difference between log (K) and log (N) might not be so large. The logarithm function grows slowly.

It could even be that a well optimized sort function could beat a poorly optimized binary heap.

To test it out in the real world, I wrote a benchmark using a JavaScript library. In this benchmark, I generate 1408 random integers and I seek to select the smallest 128 integers.

You might ask: why JavaScript? Because JavaScript is everywhere and we should care about its performance.

The code with a priority queue backed by a binary heap looks like this… we eat the first 128 integers, and then we do a push/poll danse to maintain always the smallest 128 integers. (The blocks parameter is set at 10 in my tests.)

var b = new FastPriorityQueue(defaultcomparator);
for (var i = 0 ; i < 128  ; i++) {
  b.add(rand(i));
}
for (i = 128 ; i < 128 * blocks  ; i++) {
  b.add(rand(i));
  b.poll();
}

The sorting routine is somewhat simpler. Just construct a big array, sort it and then extract the first few values:

var a = new Array();
for (var i = 0 ; i < 128 * (blocks + 1)  ; i++) {
   a.push(rand(i));
}
a.sort()
return a.slice(0,128);

How does it fare?

The approach using a binary heap can run about 37,000 times a second, whereas the sorting approach is limited to about 4,000 times a second. So a factor-of-ten difference.

In this instance, computer science wins out: using a binary heap pays handsomely.

Another way programmers might implement a top-K is by a SQL ORDER BY/LIMIT query. My friend, professor Antonio Badia, checked how PostgreSQL solves these queries and he believes that they result in a full sort unless there is a tree index on the relevant attribute. Can other databases, such as Oracle or SQL Server do better than PostgreSQL? It is possible, but PostgreSQL is hardly a naive database engine.

Interestingly, it might be quite challenging for a database user to somehow implement a binary heap solution on top of a relational database. Database engines rarely give you direct access to the underlying data structures.

So all your fancy computer science knowledge might be of limited use if your data is managed by an engine you cannot hack.

11 thoughts on “Quickly returning the top-k elements: computer science vs. the real world”

  1. Hi, thanks for the nice post and a couple of comments:
    – Depending on the application, sorting may fail badly. In web search, for instance, each server (in the data center) is responsible for scoring millions of documents and returning top-k, where k is very small (like 10 or 100). So, using a heap may yield a huge performance gain (+ there is no need for an array for millions of docs, a gain from the space and advantage for multi-core parallelism).

    Databases is another story and I would agree on your observation (although there was once a huge interest on making such top-k queries efficient; and at one point (back in 2004) we have also proposed an extended SQL algebra for handling tuple scores natively in databases, together with processing algorithms that would also exploit such scores for faster processing). Nevertheless, I am not aware of industrial adoption of such ideas, but maybe in some research prototypes…

  2. How about a simple-minded insertion sort into an array of size K. After a while most entries would be immediately rejected.

  3. Note that from the computer science perspective it is possible to do better than the heap solution:

    1. find the Kth smallest element from the list using the quick select algorithm. This takes time O(N)

    2. Use the Kth smallest element as the pivot to partition the array and only keep the elements smaller than the pivot (similar to what is done in quicksort). This takes time O(N).

    The resulting algorithm is linear and thus optimal. If the result needs to be sorted this simply requires an extra O(K log K) additional time, for an overall running time of O(N + K log K).

    However, this algorithm seems to have worse locality of reference than the heap-based one, so I wouldn’t expect it to perform better in pratice. It could be interesting to try though

  4. The database engine I work on (Impala) uses the binary heap solution for top-k (ORDER BY/LIMIT) queries. It’s a common pattern in practice (and in benchmarks). I suspect other engines do too. Maybe Postgres is an outlier.

  5. As coincidence would have it, I just ran across the Postgres commit for top-N sorting.https://postgrespro.com/list/thread-id/1270227Apparently it shows up in the plan as a regular sort, since it uses the same module and switches modes only when the tuple count is large:  https://github.com/postgres/postgres/blob/2df537e43fdc432cccbe64de166ac97363cbca3c/src/backend/utils/sort/tuplesort.c#L1587

  6. I got exactly same question in Amazon interivew and like an average programmer I followed the sorting approach and I failed the interview miserably. Then I spend some time on what went wrong and later found what went wrong with me and I came to know about binary heap way of solving this problem. But this RDBMS is good way of solving this approach in real world senarios. Also people have shared other solutions too like @Thibaut have shared a good solution i would definitely like to give it a try

Leave a Reply

Your email address will not be published. Required fields are marked *