Theory in Computer Science—as in any other field—is based on models. These models make many hidden assumptions. This is one of the fundamental reason why pure theory is wasteful. We must constantly revisit our old assumptions and run experiments to determine whether our models are useful.


Hash tables are a fundamental data structure. They are part of every elementary introduction to Computer Science. From a programmer’s point of view, they associate keys with values. Hash tables go back to the beginnings of Computer Science even though new variations keep being invented (e.g., Cuckoo hashing). Incredibly, hash tables were just recently integrated in the C++ language, but less conservative languages like Java, Python, Ruby, Perl, PHP, all have hash tables built-in.

The secret sauce is that the key is first transformed (or hashed) to a number between 1 and m corresponding to a cell in an array. As long as this number of generated with sufficient randomness, hash tables are almost equivalent to looking up values in an array—except for the overhead of hashing the key.

Thus, everyone knows that hash table queries run in amortized constant time. That is, as the number of keys increases, the average time necessary to recover a key-value pair does not increase.

But wait! How do we compute hash values? Consider the standard universal hashing technique described in Corman et al. Introduction to Algorithms. We pick any prime number m. Pick randomly a number a in [0,m). Then h(x)= x a modulo m is a universal hash function. Multiplications of numbers in [0,m) can almost be done in time O(log m).  For the purposes of hash tables, the number m must be close to the number of different keys n. Thus—formally—hash tables often run in time O(log n).

Am I being pedantic? Does the time required to multiply integers on modern machine depend on the size of the integers? It certainly does if you are using vectorization. And vectorization is used in commercial databases!

And what if your key is a string? As you have more keys, you are more likely to encounter longer strings which take longer to hash.

Thus, it is possible for hash tables to run in time O(log n) in practice, despite what the textbooks say.

What other hidden assumptions are you making right now?

Note 1: Fortunately, universal hashing can be computed in linear time with respect to the number of bits, by avoiding multiplications.

Note 2: The problems that result from picking the wrong model of computation are well known and addressed by most textbooks. I have not discovered anything new.

Note 3: In a recent paper on fast string hashing, we show that, in practice, you can hash strings for a fraction of a CPU cycles per byte.

Update: See my more recent post Sensible hashing of variable-length strings is impossible.

17 Comments

  1. You lost a logarithm: “Multiplications of numbers in [0,m) can almost be done in time O(m log m)” should read “Multiplications of numbers in [0, 2^m) can almost be done in time O(m log m)”.

    Thus accessing an entry in a hash table of size n takes time O((log n)^(1+epsilon)), not O(n^(1+epsilon)).

    There’s a more fundamental for hash tables not being constant-time, however: Random-access memory isn’t constant-time. You’ve got at least log n gate delays; and once you start dealing with large amounts of storage, you’ve got a speed-of-light cost of O(n^(1/2)) or O(n^(1/3)), depending on whether your circuits are two- or three- dimensional.

    Comment by Colin Percival — 18/8/2009 @ 12:48

  2. Thus accessing an entry in a hash table of size n takes time O((log n)^(1+epsilon)), not O(n^(1+epsilon)).

    Duh! Thanks. (I don’t think anyone would be using hash tables, had I been right…) (:shame:)

    Comment by Daniel Lemire — 18/8/2009 @ 13:07

  3. You’re playing semantic games. If you look at algorithms, modulus computation is always considered O(1). And it is in most commercial CPUs. The time for a multiplication is indepedent of the size of the number, assuming we stay in the natural word size. The complexity you mention has simply been moved from time-complexity to gate-complexity.

    And even if we stipulate hashing were O(log n) – where do you make the jump to O(n log n)?

    Comment by Rachel Blum — 18/8/2009 @ 13:17

  4. @Rachel

    Thank you for your comments. Yes, I like to play games, but please give me some credit.

    Yes, we always assume that multiplications take constant time because we assume that we take in 64-bit integers and produce 64-bit integers using 64-bit processors.

    But is it true?

    No, it is not if you use vectorization. With vectorization, you can multiply four pairs of 16-bit numbers in the time it takes to multiply one pair of 64-bit numbers. And yes, people use vectorization, right now, in commercial products.

    Further reading:

    http://en.wikipedia.org/wiki/Vectorization_(computer_science)

    Comment by Daniel Lemire — 18/8/2009 @ 13:37

  5. When doing complexity analysis you are supposed to be comparing apples with apples: how fast does the time or space complexity grow /with respect to the size of the input n/.

    You are taking n as the size of the hash table at the beginning, then using n as the size of one element in another part, and Colin is (I’m guessing facetiously) taking n as the number of logic gates in a stick of RAM.

    Neither of these variables change as n changes, so it doesn’t make sense to include them in analysis. Even if you do include them, they are still constants, and choosing M and x (following http://en.wikipedia.org/wiki/Big_O_notation) appropriately would show you the correct result.

    Comment by Sammy Larbi — 18/8/2009 @ 13:37

  6. @Sammy

    When using larger hash tables, you have to compute larger hash values. This takes more time.

    Try it out. Right now. Implement a hash table with 2^256 keys and one with 2^16 keys. You’ll see that multiplying 256-bit integers takes longer. Thus, your 2^256-element hash table will be slower.

    True. Nobody right now has databases with 2^256 elements. But with vectorization, you can do 4 multiplications in 16-bit in the time it takes to do one 64-bit multiplication.

    So, right now, in real systems, I claim that larger hash tables could be slower, even if I discard the memory access time.

    True. Few people are using vectorization *right now*. But some important people are using it for important database applications.

    Comment by Daniel Lemire — 18/8/2009 @ 13:47

  7. That is *still* constant time… 4*O(1)=O(1)

    (I’m well aware of vectorization – given that I work on games, it’s inevitable ;)

    Only if your key size were unlimited it would be an issue. (Because then time required for hash computation would indeed change w/ the size of your key)

    And yes, larger hash tables are slower – but their time complexity is the same. (Don’t you love complexity arithemtic? ;)

    Comment by Rachel Blum — 18/8/2009 @ 15:21

  8. @Rachel

    And yes, larger hash tables are slower – but their time complexity is the same. (Don’t you love complexity arithemtic?)

    When a 2^m-key hash table runs 2 times slower than a 2^2m-key hash table, then you have O(log n) complexity by definition. And that’s precisely what happens when running multiple hash table queries simultaneously with vectorization.

    Comment by Daniel Lemire — 18/8/2009 @ 15:46

  9. As I said – it *is* a true statement when you consider variable key size.

    But for most hash table implementations, the key size is fixed. Hence there is a constant upper bound for modulus, hence O(1)

    That shouldn’t change for vectorization, either, as long as the key size has a fixed upper bound.

    What am I missing?

    Comment by Rachel Blum — 18/8/2009 @ 16:30

  10. Perhaps I am not tracking your argument, but as far as I can tell, 32 and 64-bit integer multiply instructions take a fixed number of clocks on recent x86 CPUs (the fixed time multiply hardware went into either the 386 or 486, if memory serves). Presuming you are interested in hash tables that fit in memory, integer multiply instructions are sufficient.

    Not sure how you would use vectorization with hashing, at least not in the usual cases.

    Then again, for my purpose, simple Pearson hashing (not using multiply or divide) has always won out when measured over actual problem sizes.

    Comment by Preston L. Bannister — 18/8/2009 @ 16:44

  11. @Rachel

    But for most hash table implementations, the key size is fixed. Hence there is a constant upper bound for modulus, hence O(1)

    That shouldn’t change for vectorization, either, as long as the key size has a fixed upper bound.

    What am I missing?

    If you fix the maximal number of keys, then all data structures run in constant time. Even a linear scan through an array will never use more than a fixed number of operations.

    That’s not very exciting.

    Comment by Daniel Lemire — 18/8/2009 @ 16:52

  12. @Preston

    Regarding Pearson hashing, I specifically pointed out that multiplication was not required (see Disclaimer 1). As for Pearson being quite good… I agree, and more about this another day. Watch this blog for follow-ups. (Man! Do I wish I knew more people who knew about Pearson hashing!)

    As for using Vectorization, the idea I had was to implement several hash tables that are queried simultaneously; or a single hash table that is queried in small batches. Why not? People are implementing and selling database engines designed around vectorization. Am I going to try doing building these vectorized hash tables today? Probably not.

    I was just trying to get people to think about their assumptions this morning and it degenerated…

    Comment by Daniel Lemire — 18/8/2009 @ 17:07

  13. Yes, multiplication isn’t really constant time and memory access isn’t really constant time, but if you’re going to assess a penalty to those operations when you use them in hashing then you need to assess them consistently throughout whatever other algorithms you’re using.

    Or, you could use the uniform cost model, pretend the penalty is one, and get basically the same comparison between any two algorithms (that don’t have drastically different memory hierarchy behavior) without all the pain.

    Where this falls down, of course, is that the memory hierarchy behavior of hashing is bad. So if you compare it to something that’s designed to have good lcality of reference, the uniform cost model may not be the right thing to yse.

    Comment by D. Eppstein — 18/8/2009 @ 18:05

  14. @Daniel – even w/ an upper bound on the number of elements, an array scan still depends on how full the array is, though.

    A hashtable still is independent of the number of entries, for a fixed *key* size. The number of elements can be variable.

    That’s not saying keysize is not important, just that a hashtable w/ a fixed keysize is O(1) w/ respect to number of entries, and an array scan is O(n)

    So I guess my quibble is that for a given key size, hash tables should be O(log k) – k being the maximum number of keys – , not O(log n).

    Comment by Rachel Blum — 18/8/2009 @ 18:25

  15. @Rachel I am happy with O(log k). But what if you are using strings as keys?

    Comment by Daniel Lemire — 18/8/2009 @ 18:46

  16. @Rachel

    Don’t you think number of keys is relevant to number of entries?

    @Daniel

    You even don’t have to discuss the time complexity of integer multiplication in modern CPUs. You only need to show the lower bound of the cost to generate the hash value. For ex. you have n data entries and an ideal hash function which generate hash value between 1~n without a single collision. You need a data field to contain this value in bits(Size >= log n). So the cost to process this size of data is obviously O(log n). At least you have to read through it!

    Comment by sleepnova — 1/5/2011 @ 13:06

  17. @sleepnova

    You are correct. My blog post was over-engineered.

    Comment by Daniel Lemire — 1/5/2011 @ 13:13

Sorry, the comment form is closed at this time.

« Blog's main page

Powered by WordPress