How fast is tabulation-based hashing? The downsides of Zobrist…

In practice, hashing is the process of taking an input, such as a string, and turning it into an integer value. It is a fundamental tool in programming, as most software relies on hashing in one way or another. We often expect hashing to appear “random” so that any two strings are unlikely to have the same hash value.

One of the earliest and simplest forms of hashing is Zobrist hashing. Suppose you want to hash strings of up to 16 characters. For each possible 16 character positions, you generate a table made of 256 randomly chosen words. To hash a given string, you pick the first character, look up the corresponding chosen word in the first table, then you pick the second character and look up the word in the second table… and you XOR the retrieved words. In C, you get something like this:

for (size_t i = 0; i < length ; i++ )
      h ^= hashtab [ i ] [ s[ i ] ];
return h;

Zobrist hashing is an instance of a more general class of hashing said to be tabulation-based.

The benefits of Zobrist are obvious. It is simple. It can be implemented in seconds. It has excellent theoretical properties and is generally very convenient.

The first downside is also obvious: Zobrist hashing uses a lot of memory. To be able to hash all 4-byte strings to 64-bit values, you need 8 KB. But let us ignore this obvious downside.

State-of-the-art universal hash functions such as CLHash can process over 10 bytes per cycle on a recent Intel processor. The good old CityHash can process over 4 bytes per cycle.

How fast can Zobrist hashing be? To get a better idea, I implemented Zobrist hashing as a small C library. How fast is it? Not very fast. I can barely process 0.65 bytes per cycle when hashing repeatedly the same short string, taking into account the function-call overhead. Thus, tabulation-based hashing is probably about an order of magnitude slower than commonly used fast hash functions, assuming you can avoid cache misses.

In an exhaustive experimental evaluation of hash-table performance, Richter et al. (VLDB, 2016) found that Zobrist hashing produces a low throughput. Consequently, the authors declare it to be “less attractive in practice” than its strong randomness properties would suggest.

Software reference. zobristhashing: Zobrist hashing in C

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

5 thoughts on “How fast is tabulation-based hashing? The downsides of Zobrist…”

  1. The usual strategy is to use a fast hash function to hash down to 32 or 64 bits and then use tabulation hashing on that 32-to-64-bit value.

    1. There are real-world systems that use Zobrist hashing as I described it. See for example Gigablast (https://www.gigablast.com/). Of course, you can use tabulation-based as a final step, and then its speed is less of a concern. You should still expect it to be slower than arithmetic-based functions.

      1. I agree with your expectation of slowness compared to arithmetic functions. However, I suspect it is the fastest known hash function for 32-bit or 64-bit input that has the nice theoretical guarantees it has for hash tables, including for linear probing and cuckoo hashing, as noted in the work of Mihai Pătrașcu and Mikkel Thorup.

  2. Where is the crossover point between CLHash and tabulation hashing? That is, if my input is guaranteed to be 64 bytes long, in your recent experiments, is Zobrist or CLHash faster? What about 16 bytes or 1024 bytes?

    1. Your 16-byte Zobrist hashing would use 16kB of memory. Cache faults, if they occur, could dominate running time for sure. Maybe you can hold the 16kB in L1 cache and avoid expensive cache misses. Then you might do well. But this 16kB of L1 cache is a precious resource that the rest of your code might need…

      If you have little need for the L1 cache, maybe because you are working with small data structures, or you have mostly sequential access… then Zobrist would do well.

      Otherwise, it will start to generate cache faults.

      As usual, there is nothing better than actual testing.

Leave a Reply to Daniel Lemire Cancel reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.