When accessing hash tables, how much time is spent computing the hash functions?

Suppose that you create a large set of objects that you store in a hash table. Let us take 10 million objects. It is large enough that it probably will not fit in the cache memory of your processor. Let us keep the objects simple… say they are lists of three integers.

In Java, it looks like this…

int N = 10000000;
HashSet<List<Integer>> hm = new HashSet<List<Integer>>();
for (int k = 0; k < N; k++) {
    List<Integer> s = new ArrayList<Integer>(3);
    for (int z = 0; z < 3; z++) s.add(k * z - 2);
    hm.add(s);
}

Then you take another set made of a small number of values, say 100 objects. You count how many of these values are within the large set.

int count = 0;
for (List<Integer> st : small_hm) {
    if (hm.contains(st)) count++;
}

Because of the way a hash table works, each of your target objects is first hashed, which means that we apply some function that returns a random-like integer. Then we look up the value at an address determined by the hash value in the large hash table.

Java uses a really simple hash function:

int hashCode = 1;
for (Integer e : list)
    hashCode = 31*hashCode + e.hashCode();

Moreover, the hash value of each Integer object is the integer value itself. Thus hashing a list of integers in Java ought to be fast.
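As a quick sanity check, the 31-based formula above is exactly what List.hashCode() returns (the class name below is mine, for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class ListHashDemo {
    // Recompute the 31-based polynomial hash specified for List.hashCode().
    static int listHash(List<Integer> list) {
        int hashCode = 1;
        for (Integer e : list)
            hashCode = 31 * hashCode + e.hashCode();
        return hashCode;
    }

    public static void main(String[] args) {
        List<Integer> s = Arrays.asList(1, 2, 3);
        System.out.println(listHash(s));   // prints 30817
        System.out.println(s.hashCode());  // prints 30817 as well
    }
}
```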

Meanwhile, accessing a table that is outside of the CPU cache is potentially expensive. But how expensive? Can we consider the computation of the hash values negligible?

I decided to run a small experiment.

Computing hash values (100 objects):              0.9 us
Accessing the large table (10,000,000 entries):   0.7 us

Thus, for an out-of-cache hash table in this test, the majority of the time is due to the computation of the hash values! The result is even stronger when the table is reduced in size so that it fits in cache.

There are certainly cases where cache misses will pile up and make anything else look ridiculously expensive. However, you might be more computationally bound than you think.

As an aside, it means that precomputing the hash values of your objects, so that they do not need to be computed each time they are needed, can speed up your code noticeably.
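A minimal sketch of that idea, using a hypothetical wrapper class (not part of the benchmark code) that computes the list's hash once, at construction time, and reuses it on every probe:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper: precomputes the hash of an immutable list of integers.
// Only safe if the wrapped list is never mutated after construction.
public final class CachedList {
    private final List<Integer> list;
    private final int hash; // computed once, up front

    public CachedList(List<Integer> list) {
        this.list = list;
        this.hash = list.hashCode();
    }

    @Override
    public int hashCode() {
        return hash; // no recomputation on each HashSet probe
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof CachedList) && list.equals(((CachedList) o).list);
    }

    public static void main(String[] args) {
        CachedList a = new CachedList(Arrays.asList(1, 2, 3));
        // Same value as Arrays.asList(1, 2, 3).hashCode(), but cached.
        System.out.println(a.hashCode());
    }
}
```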

My code is available.

Update: as pointed out by Evan in the comments, the HashSet class will then do extra bit-mixing work on top of what the Java hashCode method returns. It should be considered part of the hash-value computation. Unfortunately, it is not easy to estimate this cost so I have attributed it to the table-access cost. Therefore I underestimate the fraction of the time spent computing the hash value.
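For reference, the extra bit-mixing step in OpenJDK (Java 8 and later) XORs the high 16 bits of the hashCode into the low 16 bits before indexing, so that high-order bits still influence small tables:

```java
public class SpreadDemo {
    // The spreading step applied by OpenJDK's HashMap (which backs HashSet):
    // fold the high 16 bits of the hashCode into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        // A hashCode whose entropy is all in the high bits still
        // affects the low bits after spreading.
        System.out.println(Integer.toHexString(spread(0x12340000)));
    }
}
```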

Follow-up: See "Should you cache hash values even for trivial classes?" and "For greater speed, try batching your out-of-cache data accesses".

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

8 thoughts on “When accessing hash tables, how much time is spent computing the hash functions?”

  1. Java uses a really simple hash function… Moreover, the hash value of each Integer object is the integer value itself.

    Are you sure about that? It seems pretty weak, and it doesn’t match what I’ve read elsewhere, though that information may be obsolete. My understanding is that hashCode will return the integer itself, but HashMap does some additional work to reduce unintentional collisions…

    That said, the Java hash function I’m thinking of is quite fast; significantly faster than anything else on http://www.burtleburtle.net/bob/hash/integer.html (I did some benchmarks on the algorithms on that page recently using uarch-bench; it’s pretty trivial, but I’d be happy to post it if you’re interested).

  2. Feels good to read this… That is the main reason that, for the hot structures in our code, we keep a list of hash caches, usually small. But for the most common data it makes a big impact. Now I can safely put this link in a comment so that people will stop looking at me weird 🙂

  3. Rerun the benchmark with an object holding just three ints, so that you measure more of the hash function and less of the memory layout.

    Also, note that the String class, because of this effect, caches its own hash code to avoid the recomputation cost. Thus, for those interested, also benchmark a variant that caches the hash code, to get a clearer view.

  4. In my benchmark, my targets are going to be in L1 with high probability: I have only 100 lists and they are created right before I run the test function. So memory access should not be a problem.

    I would have used arrays for a simpler example but you can’t use arrays as keys in Java.
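    To illustrate why arrays make poor hash keys: they inherit identity-based hashCode and equals from Object, so two arrays with equal content are neither equal nor guaranteed to hash alike (a small sketch of my own, not from the post's code):

```java
import java.util.Arrays;

public class ArrayKeyDemo {
    public static void main(String[] args) {
        int[] x = {1, 2, 3};
        int[] y = {1, 2, 3};
        // Arrays use Object's identity-based equals: equal contents don't match.
        System.out.println(x.equals(y)); // prints false
        // Arrays.hashCode gives a content-based hash (same 31-based formula).
        System.out.println(Arrays.hashCode(x) == Arrays.hashCode(y)); // prints true
    }
}
```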

    Indeed, String hashes are cached (computed at most once) in Java. It is a common recommendation to cache hash values.

    I do include an example of hash-value caching in the companion code to this blog post. The blog post does not describe it because it is a secondary point.

  5. Perhaps this is an aside, but … a fixed size array of three integers? Is this a fixed aspect of your problem? (Does the dimension of three ever vary?) Would this be more efficient as a tuple?

    public class Tuple {
        int a, b, c;
    }

    Not tested, but guessing the compiler would emit more efficient code in this case.
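    A fleshed-out version of that tuple sketch (my addition: content-based equals and hashCode, so it can actually serve as a HashSet key, while avoiding the boxing of the list-based version):

```java
// A fixed three-int tuple usable as a hash key, without boxing.
public final class Tuple {
    final int a, b, c;

    Tuple(int a, int b, int c) {
        this.a = a; this.b = b; this.c = c;
    }

    @Override
    public int hashCode() {
        // Same 31-based polynomial as List.hashCode(), minus the boxing.
        return 31 * (31 * (31 + a) + b) + c;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Tuple)) return false;
        Tuple t = (Tuple) o;
        return a == t.a && b == t.b && c == t.c;
    }

    public static void main(String[] args) {
        // Matches the hash of the equivalent List<Integer>.
        System.out.println(new Tuple(1, 2, 3).hashCode());
    }
}
```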

    1. I think it’d be interesting to see what happens if data does not fit into cache (which seems more realistic anyways): using (say) a large varchar (anything from 256 bytes to 1K) instead of 3 integers. My guess is that this would change the result tremendously, since hashing is notorious for destroying locality of reference anyways. Maybe when I have some free time…
