Fast strongly universal 64-bit hashing everywhere!

In software, hashing is the process of taking a value and mapping it to a random-looking value. Suppose you are given 64-bit integers (a long in Java). You might want to “hash” these integers to other 64-bit values.

There are many good ways to achieve this result, but let me add some constraints:

  1. The hashing should be strongly universal, also called pairwise independent. This means that if h is a hash function picked at random, then you want the knowledge that h(x)=y for some x and some y, to give you absolutely no information about the value of h(x') for x' distinct from x. That’s not, as it appears, a security/cryptography issue, but more of a useful constraint for probabilistic algorithms. Indeed, there are many “probabilistic” algorithms that require different, and “independent”, hash functions. You want to be absolutely sure that your hash functions are unrelated. Strong universality is not perfect independence, but it is pretty good in practice.
  2. You don’t want to have large look-up tables occupying your cache.
  3. You want to code that works efficiently in most programming languages (including, say, Java).

My proposal is as follows. Instead of trying to hash the 64-bit values to other 64-bit values directly, we can hash them to 32-bit values. If you repeat twice (using two different hash functions), you get the 64-bit result you seek. All that is needed per hash function are three 64-bit values chosen at random, and then two multiplications, two additions and a single shift. The two multiplications are faster than they appear because they can be executed simultaneously as there is no data dependency. Two get a full 64-bit output, you thus need four multiplications.

// in Java, long is 64-bit, int 32-bit

long a,b,c; // randomly assigned 64-bit values

int hash32(long x) {
  int low = (int)x;
  int high = (int)(x >>> 32);
  return (int)((a * low + b * high + c) >>> 32);
}

How fast is it? We can try to compare it against a standard bit-mixing function:

long murmur64(long h) {
  h ^= h >>> 33;
  h *= 0xff51afd7ed558ccdL;
  h ^= h >>> 33;
  h *= 0xc4ceb9fe1a85ec53L;
  h ^= h >>> 33;
  return h;
}

This bit-mixing function is “obviously” faster. It has half the number of multiplications, and none of the additions. However, in my tests, the difference is less than you might expect (only about 50%). Moreover if you do need two 32-bit hash values, the 64-bit mixing function loses much of its edge and is only about 25% faster.

My code is available so you can run your own Java benchmark.

How do I know that this hashing function is, indeed, strongly universal? We wrote a paper about it: Strongly universal string hashing is fast. At the time, we were interested in hashing strings. The trick is to view a 64-bit word as a string of two 32-bit words. We could extend the same trick to 128-bit inputs or, indeed, inputs of any length.

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

10 thoughts on “Fast strongly universal 64-bit hashing everywhere!”

  1. Is there a bug in lines 35-40 of HashFast.java?

    static long hash64(long x) {
    int low = (int)x;
    int high = (int)(x >>> 32);
    return ((a1 * low + b1 * high + c1) >>> 32)
    | ((a2 * low + b2 * high + c2) & 0xFFFFFFFFL);
    }

    Should 0xFFFFFFFFL be 0xFFFFFFFF00000000L ?

    1. Peter: Strongly universal means that if the constants (a,b,c…) are generated randomly, and I give you x and h(x), and x’ is distinct from x, then you know nothing about h(x’). In some sense, it ensures that there is a possible collision in that if you knew that h(x) and h(x’) could not collide, then it would not be strongly universal.

        1. The default values for a1,b1,c1,a2,b2,c2 produce the effect mentioned. Other specific for a1,b1,c1,a2,b2,c2 produce order of magnitude more collisions for nearby values than would be expected from random.

  2. Most languages offer a multiply of two N-bit values into an N-bit result, but actual hardware often offers the full 2N-bit result for no little extra cost (for example, on x86, getting the 128-bit product of two 64-bit multiplies has the same latency and throughput as the variants that only give a 64-bit result, although they do cost an extra uop).

    Given that, could you calculate your 64-bit strongly universal hash more directly, using only two multiplications if you had access to the full 128-bit result of a 64 * 64 multiplication?

    If I understand it correctly, you are using the multiply-shift scheme to calculate an N-bit hash of an N-bit input x using: (a*x + b) >> N, where a and b are 2N-bit values and x is N-bit.

    So given N = 64, on most hardware we can do a 64 * 64 -> 128 bit multiplication, but for the a * x part we actually need 128 * 64 -> 128, which could be implemented with two 64 * 64 -> 128 multiplies and a 64-bit add. Then you have two more adds for the + b part, although in principle it seems like adding the bottom half of b is mostly pointless because it only very weakly affects the actual result since the bottom bits are all thrown away [1]. The shift >> 64 is a no-op, since we’re splitting the 128-bit result into two 64-bit registers, so we just directly use the top half.

    So you end up with half the number of multiplies and fewer additions as well.

    The trick, of course, is convincing the language of your choice to generate sensible code under the covers that actually uses the hardware capabilities! From C or C++ this is somewhat easier due to the presence of 128-bit types for most compilers. I’m not sure about Java though.

    [1] Of course, you’d have to walk through the strong universality proof to see if dropping the low-bit addition breaks the guarantee in theory. Perhaps it ends up being strongly 1.0000001-universal or something.

    1. I guess this kind of shows that there is a parallel between the explicit decomposition approach described in Daniels post and multi-precision arithmetic.

      I.e., one way to extend the multiply-shift scheme to larger words is decompose it explicitly into smaller hashes and concatenate the hashes together to get your word-sized hash, another another way is to just perform the multiply-shift formula once with the full word size and rely on multi-precision arithmetic to calculate the answer when the needed arithmetic exceeds the machine’s word size.

      The end up scaling in about the same way: both are quadratic in the word size (at least as I understand Daniel’s approach) and in fact ultimately produce similar operations.

      The explicit decomposition approach has the advantage that you can perhaps rely on some knowledge of the required result to eliminate or combine operations. For example, Daniel’s approach only uses 4 multiplications while a naive 64128->128 multiplication would need 8 (I think). Essentially the decomposition was able to work around the lack of a way to get the high-half of a 6464-bit multiplication in Java.

      The multi-precision arithmetic approach has the advantage of being able to lean on existing multi-precision libraries[1], which may end up being faster (i.e., if they know how to get a 64*64->128-bit result, you cut the multiplications down to 2 as in my suggestion above), and being obviously correct without proof since they use the original multiply-shift-add formula directly.

      [1] Of course if you actually use some big heavy multi-precision library maybe that’s also a downside.

  3. Thanks for an interesting article as always.

    This is the type of result which could be very influenced by your choice of compiler and VM. For the heck of it, I went and ran your tests on Azul Zing. Overall, the big picture results are roughly the same, but the percentage difference between the hash32 and murmur scores was a bit bigger (80% on Zing vs 65% on Zulu).

    (a random recent zulu8 snapshot - e.g. openjdk)

    Benchmark Mode Cnt Score Error Units
    HashFast.fast2_32 avgt 10 235.174 ± 0.125 us/op
    HashFast.fast64 avgt 10 205.715 ± 0.651 us/op
    HashFast.murmur avgt 10 143.495 ± 0.044 us/op
    HashFast.murmur_32 avgt 10 181.772 ± 0.441 us/op

    (a random recent Zing8 snapshot)

    Benchmark Mode Cnt Score Error Units
    HashFast.fast2_32 avgt 10 134.482 ± 0.026 us/op
    HashFast.fast64 avgt 10 122.611 ± 0.832 us/op
    HashFast.murmur avgt 10 74.235 ± 0.017 us/op
    HashFast.murmur_32 avgt 10 87.856 ± 0.024 us/op

    Results were collected on a skylake-client box I had sitting around. I tested on two other architectures (an ancient AMD, and haswell) and got roughly the same picture (though the absolute scores differed of course).

    Disclaimer: I work for Azul on our compilers, so I’m obviously biased.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax