Why can’t hash tables preserve the order of keys?

One of the most common data structuring in Computer Science is the hash table. It is used to store key-value pairs. For example, it is a good data structure to implement a phone directory: given the name of the individual, find his phone number.

Implementing a hash table is not difficult. Start with an array of size N. Each element of the array can be used to store a list of elements. You need a hash function f which maps keys (e.g. people’s name) to an integer in {0, 1, …, N-1}. Store the key-value pair (k,v) at location f(k) in the array. Assuming that

  1. the computation of f(k) is independent of the number of keys, and that
  2. the hash function distributes the keys disperses the elements properly in the array,

then a hash table has constant-time look-ups.

The problem with hash tables is that they don’t necessarily store the data in order. For example, if you have retrieved the phone number of “Smith, Jill”, you can’t expect the phone number of “Smith, John” to be nearby.

This can be a big problem in practice. It is especially bad if the hash table is on a high-latency medium, like a spinning disk. But it can also hurt you if you have a gigantic hash table in RAM.

At this point, the astute engineer might decide to switch to another data structure such as a tree (e.g., red-black or  the B-tree). But trees have slower look-ups because you have to do a binary search to retrieve your key. (That is, you require up to log n comparisons where n is the number of elements in your tree.)

An even more ambitious engineer could ask a more fundamental question:

Question: Why can’t hash tables preserve the order of keys?

Answer: There is no fundamental reason why they can’t.

If you knew enough about your problem, you could design an order preserving hash function (i.e., f(k2)< f(k1) whenever k2< k1). The problem is that designing such a function that can be both computed quickly and such that it disperses keys adequately is hard work.

We can find many examples of order-preserving hash tables in real-life. A friend of mine would store his research papers in one of dozen of files. He would use the last name of the first author. He then wrote on the cover of each (physical) file the range of last names that this file covered. Of course, in his head, he applied some type of binary search, but the essential point is that his brain took a key (the last name of the first author) and it converted it to the location of the files. The files themselves were not organized as a tree. His brain computed the hash function.

Alas we can’t always abstract out the cost of computing the hash function. But let us think: we may not really need the hash function to be order-preserving. All we may need is that nearby keys (e.g., all Smiths) are hashed to nearby locations. But you often have this feature, almost for free!

Let us consider an example. Suppose that the array supporting your hash table has 224 elements. Then Java would put the following keys at the following locations:

“Smith, A” 1674
“Smith, B” 1675
“Smith, C” 1676

That is, these similar keys will be located nearby! This is no accident: in Java, strings that only differ by the last character are always hashed nearby.

That is, by using only the first character of each first name in the key, instead of the entire first name, I can make sure that all individuals with the same last name a located nearby. By scanning at most 26 buckets, you could retrieve all the Smiths.

Conclusion: It is possible but difficult to build order-preserving hash tables. However, many hash functions have good locality: nearby keys are nearby in the hash table.

Further reading: See Locality preserving hashing on Wikipedia for a related idea.

16 thoughts on “Why can’t hash tables preserve the order of keys?”

  1. @Itman

    Good point. I think that hash tables are more common however.

    @glenn

    I think that Ruby uses some extra data structures (liked a linked list) to achieve order-preservation in its hash tables. So they are a bit more than a hash table… Good point though!

    @Homer

    From a practical point of view, I agree with you. However, exploiting the properties of your hash functions can be a very neat trick whenever it works.

  2. Tries have comparable (and constant) retrieval times (sometimes even better than hashes). The are also locality preserving. Perhaps, even better are tries where lower-levels are grouped into buckets.

  3. I think that relying on an ordering property like this in code would be intrinsically dangerous. The two big things you have to consider are collisions (and whether the hash handles them internally or externally) and resizing, which would change the distribution of the keys.

    If you need both a hash and an ordering, I think it is far better to keep them as separate (but intertwining data-structures). That is, you insert a node into an ordered list and you hash it too (or create a hash index for the list). Otherwise I suspect that you are trying to use a hammer on a screw …

    Paul.

  4. When I was a younger programmer I came up with this really clever trick to solve a difficult programming problem. In my excitement I approached one of the older developers and energetically explained my cleverness.

    He looked at me for a long quiet moment, then finally said “Monkey’s are clever”.

    The problem with clever tricks is that they tend to come back to haunt you when you least expect it.

    Paul.

  5. @Homer

    But you should be excited about your work from time to time, otherwise, what is the point?

    True. Most clever tricks are overrated. But they sometimes expand our mind.

  6. Hi Angelo,

    I don’t think the other programmer was trying to tell me not to be excited, but rather that clever tricks tend to be rather nasty “landmines” placed haphazardly in the code. What we’re looking for in a solution is to encapsulate the problem in a way that we can move onto to larger and more interesting problems. I think that’s also the rational behind “premature optimization is the root of all evil” (you spend too much time on the small stuff, and make it too hard to reuse later).

    I absolutely love programming, always have, and over the years I’ve learned that it’s important to solve the problems the right way, not just any way that is quick (or possibly clever). There is a subtle difference there, but one worth noting.

    I took a combinatorics class long, long ago and the Prof used to always talk about “the obvious first try” in terms of algorithms. It’s essentially the first thing that most people think of when they are presented with the problem. His lesson to us, was that we had to get beyond that to get to the really interesting and functional solutions. It’s a lesson that always stuck with me.

    Paul.

  7. You can also sort your LinkedHashMap/Set (in Java) by the key. The sorting operation has high cost (as you basically have to rebuild it in the proper order), but the HashMap/Set would then already be presorted if you were to want to give a list of alphabetized results, for example. If you have much fewer inserts than you have ordered listings, it might be worth it.

    I’m still a little unsure that I see the practical value for this. This is for cases where Tree-based Maps are too slow, but you need ordered result lists?

  8. @Paul,
    I think that a trick is not to use clever tricks unless you really-really need it. Sometimes, however, clever tricks are necessary.

  9. @Wes

    Consider the following scenario. You have a massive key-value database. You sometimes, but rarely, need to access the keys sequentially.

    So, you basically want a hash table, with the ability to scan the keys in order when you must. In the case of my example, you could scan all keys in 26 buckets, and get all the Smiths. Of course, you would need to filter these buckets for the Smiths (as other keys would be there as well).

    Having to retrieve from disk only 26 buckets instead of, say, a thousand, could be a great boost.

    My point being that you can use locality-preserving hash functions on your keys to make sure your “similar keys” end up clustered in a few buckets.

    This is entirely general. If you use the key “city”, you could use a hash function that hash “nearby cities” to “nearby buckets”. It can be entirely reasonable.

    Of course, whether custom hash functions are worth it… well, I guess you have to try it out.

  10. @Angelo
    I see–optimizing for the lower use case without affecting [much] the higher use case. In massive key-value databases, limiting your buckets as in the name example, would probably have a negative impact on performance, depending on your collision resolution. I’m tempted to try it out with some tests.

    How about nested Hashes, where you have larger partial-key buckets first (like last name, first letter of first name), and then full-key buckets within?

  11. A monotone hash function from [1..N] to [1..N]? Cool, then it’s trivial to sort N keys in O(N) time and space.

    The locality-preserving hash functions aren’t that good; they don’t *perfectly* preserve the order of the keys. But I do wonder if one could use a locality-preserving hash to get a fast-on-average sorting function. For example, a single pass to (loc.-pres.) hash N keys into an array, followed by a quick pass of insertion-sort to clean up.

  12. @Angelo: Yes, I’m aware of that. The interesting thing is that locality-preserving hashes appear to ensure that the number of possible keys N will be the same order of magnitude as the number of actual keys n.

Leave a Reply

Your email address will not be published. If you leave an email, you will be notified when there are replies. The comment form expects plain text. If you need to format your text, you can use HTML tags such <strong> and <em>. For formatting code as HTML automatically, I recommend tohtml.com.