I like to sort things. If you should learn one thing about Computer Science, it is that **sorting is fast and useful**.

Here’s a little example. You want to check **quickly** whether an integer belongs to a set. Maybe you want to determine whether a userID is valid. The solutions:

- Use a hash table. Java programmers use the HashSet class.
- Use a tree structure such as a red-black tree or a B-tree. Java programmers use the TreeSet class.
- If your set of integers changes rarely, you can sort it, and then try to locate integers using binary search.
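A membership test with each of the three approaches might be sketched as follows (a minimal illustration with made-up IDs, not the benchmark itself):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class MembershipDemo {
    public static void main(String[] args) {
        int[] ids = {42, 7, 19, 3, 88};

        // Option 1: hash table, O(1) expected lookup (boxes each int to Integer).
        Set<Integer> hash = new HashSet<>();
        for (int id : ids) hash.add(id);

        // Option 2: balanced tree, O(log n) lookup, keeps elements in sorted order.
        Set<Integer> tree = new TreeSet<>(hash);

        // Option 3: sorted array plus binary search, O(log n) lookup,
        // no per-element object overhead.
        int[] sorted = ids.clone();
        Arrays.sort(sorted);

        System.out.println(hash.contains(19));                     // true
        System.out.println(tree.contains(19));                     // true
        System.out.println(Arrays.binarySearch(sorted, 19) >= 0);  // true
    }
}
```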

I wrote a Java benchmark to compare the three solutions:

Binary search over a sorted array is **only 10% slower** than the HashSet. Yet **the sorted array uses half the memory**. Hence, the sorted array is the clear winner for this problem.

If you think that’s a little bit silly, consider that column-oriented DBMSes like Vertica use binary search over sorted columns as an indexing technique.

**Code**: Source code posted on my blog is available from a github repository.

Unfortunately, I think you’re mostly measuring the cost of boxing raw ints into Integers in the HashSet. If you use fastutil’s IntOpenHashSet, which is completely unboxed, the array’s performance (and HashSet’s and TreeSet’s) is blown away, consistently by a factor of 4 or more. See the last column of the data here:

http://gist.github.com/408391

Sorted arrays are great, but O(1) lookup is hard to beat.
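To see why the unboxed structure wins, here is a minimal open-addressing hash set over primitive ints, in the spirit of (but much simpler than) fastutil’s IntOpenHashSet: no boxing, no per-entry node objects. This sketch assumes a fixed power-of-two capacity, does no resizing, and reserves 0 as the “empty slot” marker, so it cannot store 0.

```java
// Minimal open-addressing int set: one flat int[] array, linear probing.
// Assumptions: capacity is a power of two, table never fills up, and the
// value 0 is reserved to mark empty slots (so 0 cannot be stored).
final class IntSet {
    private final int[] slots;
    private final int mask;

    IntSet(int capacityPow2) {
        slots = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    void add(int x) {
        int i = mix(x) & mask;
        while (slots[i] != 0 && slots[i] != x) i = (i + 1) & mask; // linear probing
        slots[i] = x;
    }

    boolean contains(int x) {
        int i = mix(x) & mask;
        while (slots[i] != 0) {
            if (slots[i] == x) return true;
            i = (i + 1) & mask;
        }
        return false; // hit an empty slot: x was never inserted
    }

    // Cheap integer mixing so clustered keys spread across the table.
    private static int mix(int x) {
        x *= 0x9E3779B1; // multiply by a golden-ratio-derived constant
        return x ^ (x >>> 16);
    }
}
```

Every lookup touches a flat int array directly, with no Integer objects to allocate, hash, or chase through pointers, which is where the factor-of-4 gap comes from.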

@David Thanks for this great benchmark. I was well aware of the cost of boxing in Java, hence I was careful to document precisely how I ran my benchmark.

By your numbers, you beat the sorted array by a factor of 3.6 with over a million elements. That’s considerable, but remember that the hash table doesn’t give you access to the elements in sorted order, so it is not as useful in practice. And, of course, your gains are less with smaller sets.

Google’s BigTable (on AppEngine) also maintains sorted indices which are used for queries. For example, say you wanted to do the following query:

company = "x" and date > y and date < z

Then BigTable would maintain an index sorted first by company and then by date. Finding rows that satisfy the query then just requires a binary search in that index.

This has a number of limitations (e.g., you can only have inequality filters on one field), but in practice it works really well.
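The range query above can be sketched with two binary searches over a composite-sorted index. Everything here is a toy illustration: the `company|date` string encoding, the zero-padded dates, and the `lowerBound` helper are all made up for this example, not BigTable’s actual format.

```java
public class IndexScan {
    // lowerBound: index of the first key >= target (hypothetical helper;
    // Java's Arrays.binarySearch does not return this directly).
    static int lowerBound(String[] keys, String target) {
        int lo = 0, hi = keys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid].compareTo(target) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        // Index sorted by (company, date); dates are zero-padded so that
        // lexicographic order matches chronological order.
        String[] index = {
            "x|20100101", "x|20100215", "x|20100401",
            "y|20100110", "y|20100301",
        };
        // company = "x" and date > 20100101 and date < 20100401
        int from = lowerBound(index, "x|20100102"); // first key strictly after the lower date
        int to = lowerBound(index, "x|20100401");   // first key at or past the upper date
        for (int i = from; i < to; i++) {
            System.out.println(index[i]); // prints x|20100215
        }
    }
}
```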

And moreover, the sorted array has good locality of reference, so it will tend to perform better as the speed of cache memory relative to regular RAM increases.

@Bannister

Thanks for this great comment.

Counterpoints:

(1) Not many non-trivial operations scale as well as sorting. If sorting is too expensive for you, what else are you going to do with the data?

(2) I’m pretty sure that sorting is much less expensive than building a B-tree.

(3) I used Java because it makes it easy for people to reproduce my benchmark.

@Bannister

On the flip side, I tend to find binary search a shade inelegant. In this case, you know that the distribution of keys is uniform, so you could definitely do better than a binary search.

This time around, it seems we swapped hats.

Sorting is undoubtedly useful, but often not “fast”. We care about performance when the size of data is large, the number of repetitions is large – or both. Sorting over very large data is expensive. Even with small data, frequent sorts can be horribly expensive. Your curve does not add in the cost of sorts – which is all you can do without knowing more about the specific problem.

Guess I am really bothered by your use of a Java benchmark, a small measured difference, and final mention of a specific sort of database. These are all quite different animals.

I’d be happier if you started with the mention of a problem space. If your premise is:

- Data small enough to fit easily in memory.
- Very large number of reads, very small number of writes.
- Frequent(?) need for sort-order access.
- Somewhat(?) general purpose usage.

In the specific case of a column-oriented database – by choosing the right compression method – you might effectively be able to do lookups without decompressing the data. Could be a big win. (Didn’t you write something about this prior?)

Within the above frame, the comparison makes some sense. Outside the frame … not so much.

Guess I was unclear.

As you started with use of sorting, a Java benchmark and a small percentage difference, I was distracted by all the (many!) ways that combination could go wrong.

On re-reading, I suspect the final bit is a more a clue to what you had in mind. For some very specific sorts of problems, in particular when sorted access is at least an occasional need, using binary search over sorted data can be very reasonable. To this, I agree.

1) Over most of the problems I have worked on, anything like general purpose sorting is relatively rare. Lots of use of custom algorithms to suit access patterns (and presuming we distinguish ordered structures from general purpose sorting). Not a general result by any means.

2) Exclusive of update, I suspect you are right. Including update, I suspect you are more often wrong. (Lots of possible variants here that could push the result in either direction.)

3) Using Java as you meant is entirely suitable. I was distracted by the order of presentation.

A 10% difference over small data in simple Java code could come out very different. Reading led to an explosion of questions. (Client or server JVM? JIT code differs across CPUs? Data fits or spills from innermost CPU cache? Time dominated by Java’s wrapped array and integer access?) None of which is where you wanted to lead, I suspect.

On the flip side, I tend to find binary search a shade inelegant. Something a bit silly about comparing a billion numbers, and starting each with “is this number 43?”. If the problem allows use of some sort of address-based hash for at least part of the lookup, this can be a big win.

I’ve touched this before.

http://bannister.us/weblog/2006/06/15/breakage-or-misapplication/

http://bannister.us/weblog/2005/01/18/a-diversion-into-hash-tables-and-binary-searches/