Crazily fast hashing with carry-less multiplications

We all know the regular multiplication that we learn in school. To multiply a number by 3, you can multiply a number by two and add it with itself. Programmers write:

a * 3 = a + (a<<1)

where a<<1 means "shift the bit values by one to the left, filling in with a zero". That's a multiplication by two. So a multiplication can be described as a succession of shifts and additions.

But there is another type of multiplication that you are only ever going to learn if you study advanced algebra or cryptography: carry-less multiplication (also called "polynomial multiplication). When multiplying by powers of two, it works the same as regular multiplication. However, when you multiply numbers that are not powers of two, you combine the results with a bitwise exclusive OR (XOR). Programmers like to write "XOR" as "^", so multiplying by 3 in carry-less mode becomes:

a "*" 3 = a ^ (a<<1)

where I put the multiplication symbol (*) used by programmers in quotes ("*") to indicate that I use the carry-less multiplication.

Why should you care about carry-less multiplications? It is actually handier than you might expect.

When you multiply two numbers by 3, you would assume that

a * 3 == b * 3

is only true if a has the same value as b. And this works because in an actual computer using 64-bit or 32-bit arithmetic because 3 is coprime with any power of two (meaning that their greatest common factor is 1).

The cool thing about this is that there is an inverse for each odd integer. For example, we have that

0xAAAAAAAB * 3 == 1.

Sadly, troubles start if you multiply two numbers by an even number. In a computer, it is entirely possible to have

a * 4 == b * 4

without a being equal to b. And, of course, the number 4 has no inverse.

Recall that a good hash function is a function where different inputs are unlikely to produce the same value. So multiplying by an even number is troublesome.

We can "fix" this up by using the regular arithmetic modulo a prime number. For example, Euler found out that 231 -1 (or 0x7FFFFFFF) is a prime number. This is called a Mersenne prime because its value is just one off from being a power of two. We can then define a new multiplication modulo a prime:

a '*' b = (a * b) % 0x7FFFFFFF.

With this modular arithmetic, everything is almost fine again. If you have

a '*' 4 == b '*' 4

then you know that a must be equal to b.

So problem solved, right? Well... You carry this messy prime number everywhere. It makes everything more complicated. For example, you cannot work with all 32-bit numbers. Both 0x7FFFFFFF and 0 are zero. We have 0x80000000 == 1 and so forth.

What if there were prime numbers that are powers of two? There is no such thing... when using regular arithmetic... But there are "prime numbers" (called "irreducible polynomials" by mathematicians) that act a bit like they are powers of two when using carry-less multiplications.

With carry-less multiplications, it is possible to define a modulo operation such that

modulo(a "*" c) == modulo(b "*" c)

implies a == b. And it works with all 32-bit or 64-bit integers.

That's very nice, isn't it?

Up until recently, however, this was mostly useful for cryptographers because computing the carry-less multiplications was slow and difficult.

However, something happened in the last few years. All major CPU vendors have introduced fast carry-less multiplications in their hardware (Intel, AMD, ARM, POWER). What is more, the latest Intel processors (Haswell and better) have a carry-less multiplication that is basically as fast as a regular multiplication, except maybe for a higher latency.

Cryptographers are happy: their fancy 128-bit hash functions are now much faster. But could this idea have applications outside cryptography?

To test the idea out, we created a non-cryptographic 64-bit hash function (CLHash). For good measure, we made it XOR universal: a strong property that ensures your algorithms will behave probabilistically speaking. Most of our functions is a series of carry-less multiplications and bitwise XOR.

It is fast. How fast is it? Let us look at the next table...

CPU cycles per input byte:

64B input 4kB input
CLHash 0.45 0.16
CityHash 0.48 0.23
SipHash 3.1 2.1

That's right: CLHash is much faster that competing alternatives as soon as your strings are a bit large. It can hash 8 input bytes per CPU cycles. You are more likely to run out of memory bandwidth than to wait for this hash function to complete. As far as we can tell, it might be the fastest 64-bit universal hash family on recent Intel processors, by quite a margin.

As usual, the software is available under a liberal open source license. There is even a clean C library with no fuss.

Further reading:

Daniel Lemire, "Crazily fast hashing with carry-less multiplications," in Daniel Lemire's blog, October 26, 2015.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

12 thoughts on “Crazily fast hashing with carry-less multiplications”

  1. What does the avalanche bias profile look like? One of the problems with most hash function families is that they trade quality for speed. As computing systems become larger and faster, bias in non-cryptographic hash functions starts to manifests as real pathologies in the software that uses them. The development of algorithms like MurmurHash, CityHash, etc were driven by this.

    One of my motivations for the research that led to the MetroHash functions is that I needed non-cryptographic hash functions that were significantly faster than CityHash but which had randomness comparable to cryptographic functions (which you do not get from CityHash) for some use cases inside database engines. And it looked like an interesting research problem, of course.

  2. 1. only competitive-speed on recent CPUs, right?

    2. small key (64 bytes or less) performance is worse than xxHash ( ) and Farmhash ( ) which are both faster than City, but perhaps there’s a weaker quality use of this instruction that can give something suitable e.g. for typical string-keyed hash tables? All I know about quality is that these all past smhasher, but perhaps they cheated a bit by tuning to meet those particular tests.

    1. Note that CityHash, xxHash, Farmhash… are not universal. CLHash is XOR universal. (Almost so on long strings.)

      1. only competitive-speed on recent CPUs, right?

      Yes. On older CPUs, the carry-less multiplication had a low throughput (except maybe on AMD CPUs where it did better). This is reviewed in our earlier paper (Strongly universal string hashing is fast).

      2. small key (64 bytes or less) performance is worse (…)

      For short strings, performance is likely driven by latency, and the carry-less multiplications have relatively high latency (7 cycles). So I expect it will be difficult to get really exciting results on short strings with carry-less multiplications, at least if you care about the latency of the hash function.

  3. I’m sorry, I’d like asking why it is possible to have a*4 == b*4 with a != b in a computer? Maybe I’m missing some basic math but I can’t see that.

    Thank you.

    1. You can verify that, in Java, you have the following:

      int a = 0;
      int b = 1073741824;
      a != b; // true
      4 * a == 4 * b ; // true

      The result is not Java specific of course, but I have to write the code in some language.

      1. In the post it says:
        a * 4 == b * 4
        whereas in your comment it’s:
        4 * a == a * b
        Which is a totally different thing and I’m still confused.

        1. Let me try with actual code.

          You can run the following Java program:

          public class CrazyMath
            public static void main (String[] args) 
              int a = 0;
              int b = 1073741824;
              System.out.println(4 * a == 4 * b);

          Or you can write the following C program:

          int main() {
            uint32_t a = 0;
            uint32_t b = 1073741824;
            printf("%s\n",a*4==4*b?"equal":"not equal");

          Or you can try the following Go program…

          package main
          import "fmt"
          func main() {

          And so forth, and so forth…

          Please try it out for yourself.

          1. But isn’t that because 1073741824 * 4 = 2^32 * 2? Where a 32bit integer cannot go beyond? Thus wrapping itself around to 0? (I havn’t studied computer engineering, only been programming for a couple years, so only speaking from hands on experience, no theoretical knowledge, basically).

            1. Right. Most programming languages rely on integers spanning a finite and pre-determined number of bits. That’s because this is how the underlying hardware works.

              A language like Python will support arbitrary integers… this comes at a performance cost however.

              JavaScript is just strange as it has no support for integers per se. The following is true in JavaScript…


  4. One use of carry-less multiplication I’ve run across is morton encoding. Morton encoding is a process to produce an octree from a list of objects to retain locality of reference. Assuming a space of objects, if we take their X and Y (and Z and any other dimension) position as binary, then interleave the bitstring into one binary string, then order these resultant binary strings from 0000… to 1111…, we arrive at a Z-ordered curve where objects in our list that are physically close together are also close together in the list. This can reduce collision checks by orders of magnitude. The problem is the process for encoding a morton string is usually too slow for real time applications without dedicated opcodes (eg PDEP and PEXT) or memory-expensive LUTs.

    Well, one neat side effect of carry-less multiplication is that any number multiplied by itself will yield the original bitstring interleaved with 0s, which is perfect for bitwise operation to combine into one morton code!

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.