A hash function maps values (e.g., strings) into a fixed number of strings, typically smaller than the original. It is useful to compare quickly two long strings, for example. Instead of comparing the strings, you may compare the hash values.

A simple hash function consists in repeatedly multiplying the hash value by some constant B (e.g., you may pick a number such as 31):

uint32_t hash = 0; for (size_t i = 0; i < len; i++) { hash = hash * B + data[i]; } return hash;

I credit Karp-Rabin for this type of hash functions. It is an example of recursive hashing: the hash function is computed by taking the hash value and updating it with the next character.

Given a long strings, you may want to hash all sequences of N characters. A naive approach might be as follows:

for(size_t i = 0; i < len-N; i++) { uint32_t hash = 0; for(size_t j = 0; j < N; j++) { hash = hash * B + data[i+j]; } //... }

You are visiting each character value N times. If N is large, it is inefficient.

You can do better using a rolling hash function: instead of recomputing the hash function anew each time, you just update it. It is possible to only access each character twice:

uint32_t BtoN = 1; for(size_t i = 0; i < N; i++) { BtoN *= B; } uint32_t hash = 0; for(size_t i = 0; i < N; i++) { hash = hash * B + data[i]; } // ... for(size_t i = N; i < len; i++) { hash = hash * B + data[i] - BtoN * data[i-N]; // ... }

The most expensive component of this routine is the line with two character accesses and two multiplications.

I wrote a simple benchmark in C++ to see how fast a straight-forward implementation could be. I use a set window of size 8 for the purpose of this analysis, but the larger the window, the less competitive the naive implementation becomes.

rolling hashing | 0.75 GB/s | 13 instructions/byte |

naive hashing | 0.18 GB/s | 61 instructions/byte |

Karp-Rabin is not the only type hash functions. Tabulated hashing is another approach that would be much more compute efficient.

However, I suspect we could greatly improve my naive implementation of the Karp-Rabin rolling hash function.

**Further reading**:

- The universality of iterated hashing over variable-length strings , Discrete Applied Mathematics 160 (4-5), 2012
- Recursive n-gram hashing is pairwise independent, at best Computer Speech & Language 24 (4), pages 698-710, 2010
- Strongly universal string hashing is fast, Computer Journal 57 (11), 2014

**Possibly relevant software**:

Daniel Lemire, "How fast is rolling Karp-Rabin hashing?," in *Daniel Lemire's blog*, February 4, 2024.

i got hooked on rabin hashing from reading the blog of the author of the archive tool pcompress, to the extent that I learned about microarchitecture optimization, field math and vectorizing in avx, avx2 and avx512.

in my experiments rabin with the extra complexity doesn’t yield better deduplication ratios results than gear hash which uses a random lookup table instead of an irreducible polynomial in GF2

https://moinakg.wordpress.com/2013/06/22/high-performance-content-defined-chunking/

Improvement suggestion: you could use fast exponentiation to compute BtoN, and reduce the number of multiplications in the first loop from N to 2 log N.

That’s a good idea although I don’t think it would change my performance numbers.

It seems vectorizable if we break a few loop-carried dependencies.

The first is the bytewise subtraction of the trailing edge of the window. Since that accumulates to subtracting the hash of the entire prefix up to that point, why not just save that (showing an array here for clarity, but a ring buffer is better):

`full_hash[i] = full_hash[i-1]*B + data[i]`

rolling_hash[i] = full_hash[i] - BtoN*full_hash[i-N-1]

That makes the second line vectorizable with independent lanes, but the first line is still loop-dependent.

However the first line looks a lot like prefix sum with an extra multiply, and there are a few ways to obtain that from the raw bytes in a logarithmic number of steps.

I think log2(register width) iterations of

`shift right 2^i lanes`

multiply by B^(2^i)

add

would do it, followed by adding the full_hash at the end of the preceding register.

I’m looking into how to do this in NEON, and the initial instruction budget appears to be 8-9 SIMD instructions per 4 bytes, plus some scalars for loop control etc.

Four of those are multiply/accumulate, which fits pretty much all of the arithmetic. The others are shifting across lanes and load, dup, or store.

Doubling the register width would only add two instructions (another shift and multiply/add).

Rolling hashes are just beautifully simple!

Reminded me of this video (https://www.youtube.com/watch?v=rA1ZevamGDc&t=1907s&ab_channel=AlgorithmsLive%21) from Algorithms Live