Quickly returning the top-k elements: computer science vs. the real world

A standard problem in computer science is to quickly return the smallest or largest K elements of a list. If the list is made of N elements, then we can solve this problem in time N log (K) using a binary heap.

How would the average programmer solve this problem? My guess is that unless K is 1, then most programmers would simply sort the list and pick the first or last K elements.

That’s not such a bad deal. The difference between log (K) and log (N) might not be so large. The logarithm function grows slowly.

It could even be that a well optimized sort function could beat a poorly optimized binary heap.

To test it out in the real world, I wrote a benchmark using a JavaScript library. In this benchmark, I generate 1408 random integers and I seek to select the largest 128 integers.

You might ask: why JavaScript? Because JavaScript is everywhere and we should care about its performance.

The code with a priority queue backed by a binary heap looks like this… we eat the first 128 integers, and then we do a push/poll dance to maintain always the largest 128 integers. (The blocks parameter is set at 10 in my tests, generating (blocks + 1) * 128 values.)

var b = new FastPriorityQueue(defaultcomparator);
for (var i = 0 ; i < 128  ; i++) {
  b.add(rand(i));
}
for (i = 128 ; i < 128 * blocks  ; i++) {
  b.add(rand(i));
  b.poll();
}

The sorting routine is somewhat simpler. Just construct a big array, sort it and then extract the first few values:

var a = new Array();
for (var i = 0 ; i < 128 * (blocks + 1)  ; i++) {
   a.push(rand(i));
}
a.sort(function(a, b) {
  return b - a; // in reverse
});
return a.slice(0,128);

How does it fare?

The approach using a binary heap can run about 37,000 times a second, whereas the sorting approach is limited to about 4,000 times a second. So a factor-of-ten difference.

In this instance, computer science wins out: using a binary heap pays handsomely.

Another way programmers might implement a top-K is by a SQL ORDER BY/LIMIT query. My friend, professor Antonio Badia, checked how PostgreSQL solves these queries and he believes that they result in a full sort unless there is a tree index on the relevant attribute. Can other databases, such as Oracle or SQL Server do better than PostgreSQL? It is possible, but PostgreSQL is hardly a naive database engine.

Interestingly, it might be quite challenging for a database user to somehow implement a binary heap solution on top of a relational database. Database engines rarely give you direct access to the underlying data structures.

So all your fancy computer science knowledge might be of limited use if your data is managed by an engine you cannot hack.

14 thoughts on “Quickly returning the top-k elements: computer science vs. the real world”

  1. Hi, thanks for the nice post and a couple of comments:
    – Depending on the application, sorting may fail badly. In web search, for instance, each server (in the data center) is responsible for scoring millions of documents and returning top-k, where k is very small (like 10 or 100). So, using a heap may yield a huge performance gain (+ there is no need for an array for millions of docs, a gain from the space and advantage for multi-core parallelism).

    Databases is another story and I would agree on your observation (although there was once a huge interest on making such top-k queries efficient; and at one point (back in 2004) we have also proposed an extended SQL algebra for handling tuple scores natively in databases, together with processing algorithms that would also exploit such scores for faster processing). Nevertheless, I am not aware of industrial adoption of such ideas, but maybe in some research prototypes…

  2. How about a simple-minded insertion sort into an array of size K. After a while most entries would be immediately rejected.

  3. Note that from the computer science perspective it is possible to do better than the heap solution:

    1. find the Kth smallest element from the list using the quick select algorithm. This takes time O(N)

    2. Use the Kth smallest element as the pivot to partition the array and only keep the elements smaller than the pivot (similar to what is done in quicksort). This takes time O(N).

    The resulting algorithm is linear and thus optimal. If the result needs to be sorted this simply requires an extra O(K log K) additional time, for an overall running time of O(N + K log K).

    However, this algorithm seems to have worse locality of reference than the heap-based one, so I wouldn’t expect it to perform better in pratice. It could be interesting to try though

  4. The database engine I work on (Impala) uses the binary heap solution for top-k (ORDER BY/LIMIT) queries. It’s a common pattern in practice (and in benchmarks). I suspect other engines do too. Maybe Postgres is an outlier.

  5. As coincidence would have it, I just ran across the Postgres commit for top-N sorting.https://postgrespro.com/list/thread-id/1270227Apparently it shows up in the plan as a regular sort, since it uses the same module and switches modes only when the tuple count is large:  https://github.com/postgres/postgres/blob/2df537e43fdc432cccbe64de166ac97363cbca3c/src/backend/utils/sort/tuplesort.c#L1587

  6. I got exactly same question in Amazon interivew and like an average programmer I followed the sorting approach and I failed the interview miserably. Then I spend some time on what went wrong and later found what went wrong with me and I came to know about binary heap way of solving this problem. But this RDBMS is good way of solving this approach in real world senarios. Also people have shared other solutions too like @Thibaut have shared a good solution i would definitely like to give it a try

  7. Hello Daniel,

    There’s a bit of inconsistency in this post:
    – you say you’re generating 1408 random integers but then say blocks is set to 10
    – you say you’re looking for the 128 smallest integers but with defaultcomparator you’re polling the smallest one each time

    1. you say you’re generating 1408 random integers but then say blocks is set to 10

      There are (blocks + 1) * 128 values being generated. You can verify that 11 * 128 is indeed 1408.

    2. you say you’re looking for the 128 smallest integers but with defaultcomparator you’re polling the smallest one each time

      Yes, I have updated the blog post to reflect the fact that we seek the 128 largest values.

      I also corrected a critical mistake in the code samples: the sorting code was flat out wrong (incredibly).

Leave a Reply

Your email address will not be published. If you leave an email, you will be notified when there are replies. The comment form expects plain text. If you need to format your text, you can use HTML tags such <strong> and <em>. For formatting code as HTML automatically, I recommend tohtml.com.