A few years ago, we worked on automatically removing boilerplate text from e-books taken from the Project Gutenberg. In this type of problem, you want to quickly identify whether a line of text is likely part of the boilerplate text. You could build a hash table, but hash tables tend to be use a lot of memory because they must store all the keys. This is especially bad if the keys are long strings.

Can you implement a hash table without storing the keys? You cannot, of course. However, you can learn from Bloom filters: hash functions are sufficient to test quickly whether an element is not part of a set.

Let me consider an example. Suppose that I have three strings (“dad”, “mom” and “dog”) and they have the hash values 2, 4, and 5. Suppose that I want to check whether “cat” is part of the set. Maybe its hash value is 3. In this case, I can immediately see that 3 is not part of the set (because 3 is not in {2,4,5}). Of course, if “cat” hashes to 2, I could falsely conclude that “cat” is part of my set. We can increase the accuracy of this method by using several hash functions.

In my e-book scenario, I can check quickly whether a line is **not** part of the boilerplate. This might be good enough.

There is a catch however: computing many hash functions is expensive. Thankfully, Kirsch and Mitzenmacher showed that only two hash functions were enough. Will Fitzgerald went a step further and hinted that a single hash function might be required. Indeed, you can compute a 64-bit hash function and make two 32-bit hash functions out of it: use the first 32 bits for the first hash function and the remaining 32 bits for the other. On 64-bit processors, computing a 64-bit hash function might be just as fast as computing a 32-bit hash function. Will’s idea is perfectly sound. However, it might be trickier to implement than one might expect.

Let me consider a simple form of Fowler-Noll-Vo hashing: given a 32-bit integer *c*, the hash value is given by *cp* mod 2^{64} for some prime *p*. This a decent hash functions: as remarked by Dietzfelbinger et al. (1997), it is almost universal if you pick the prime *p* at random. But the first 32 bits are not a good hash function. Indeed, simply pick *c* to be 2^{31}: all but the last of the first 32 bits will be zero irrespective of *p*. It also does not help if you apply a bitwise exclusive or (XOR) on *c *using a random seed before multiplying it by *p,* that is if you compute (*c ^ y *)* p*: the first 31 bits will be determined by the product of the prime *p* and the random seed *y*. I can extend this analysis to string hashing as well.

**Hence, even if you have a good 64-bit hash function (say it is almost universal), it may not be true that the first 32 bits form a good hash function.** How can you avoid this problem? If your hash function is strongly universal, then the first few bits will indeed form a strongly universal hash function of its own. But this may not be practical if you require 32-bit hash functions: I do not know how to compute quickly a strongly universal 64-bit hash function using 64-bit arithmetic. You can, however, generate a strongly universal 32-bit hash function using 64-bit arithmetic. Just compute *h*(*x*)=(*ax*+*b*) / 2^{32} where *a* and *b* are random 64-bit integers and the value *x* is a 32-bit integer. Then, you can safely consider the first and the last 16 bits: they will be strongly universal.

So it is simply not true that a 64-bit hash function can be treated safely as two 32-bit hash functions.

**Further reading**: Daniel Lemire and Owen Kaser, Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010.

**Note**: I am abusing the language since a single function cannot be universal or strongly universal, but I did not want to talk about families of hash functions.

@Turney

If the boilerplate text is given in advance, you could generate a perfect hash function. Then you wouldn’t need to store the keys.Your domain space is not limited to boilerplate text.

Can you implement a hash table without storing the keys?If the boilerplate text is given in advance, you could generate a perfect hash function. Then you wouldn’t need to store the keys.

http://en.wikipedia.org/wiki/Perfect_hash_function

Can you implement a hash table without storing the keys?If you have multiple hash functions, you call the first one to see whether the text might be boiler plate. You only need to call the second one if the text is possibly boiler plate. If the text is mostly non-boiler plate and the first hash table is mostly empty, then you’ll rarely need to call the second (third, fourth, …) hash function(s). On average, you’ll only make slightly more than one hash call per chunk of text.

Given hash functions f1, f2, f3, …, fn, use f1 to index into a hash table and store the values of f2 … fn in that slot in the hash table. Most text will likely index into empty slots, so only f1 will need to be calculated. If the slot is not empty, then you calculate f2. If the value, matches, then try f3, and so on. It’s unlikely that all fn would match for a non-boiler plate text. On average, you would only need to calculate f1. And you wouldn’t be storing the keys.

@Daniel

You could still define a Perfect Hash function if you’re a little loose with your definition of “distinct element”. If you define any two elements of non-boilerplate as identical (since we’re only interested in boilerplate), you could write a hash that maps any nonboiler plate to zero, and otherwise maps to a hash for that text.

Of course, if you have the boilerplate, you could just write a trie or finite state machine to detect them.

@Peter

On a single pass of data, you’d still need to compute f1…fn for every piece of text. You don’t know during the initial computation how deep you need to hash. If you only do f1, since you hit an empty slot, and a later piece of text collides you’d need to go back and retrieve the original text. That basic idea would work well for a two pass detection though: You use a hash to see what could possibly be boilerplate (vs. innocent collisions), then you do a second pass over that much smaller set, hopefully down into the region that full keys can fit in memory.

You don’t know during the initial computation how deep you need to hash.If the boiler plate is given in advance, you do one pass through the boiler plate collection, then you can handle a constant stream of non-boiler text. All you need to know is a rough estimate of the size of the boiler plate collection.

If the boiler plate is not given in advance, you’ll need to make two passes through the data, regardless of what algorithm you use, or you’re certain to mess up on the first few documents you see. Also, high frequency text is not necessarily boiler plate. Might be a famous quotation, for example.

@Itman

(1) There are definitively applications where hashing is a bottleneck. It is why people spent so much time designing fast hash functions.

(2) Your processor is likely to pipeline the computation of hash functions and you can use many tricks if you code it in assembly.

(3) Not every machine these days run an Intel processors. What about your iPhone?

@Item

(1) I guess it depends whether your Bloom filter fits in CPU cache.

(3) I think more and more serious applications run on tablets and mobile phones.

Daniel,

Sorry did not have time to read any of these articles. Yet, I wonder if it’s really computation that takes a lot of time. And not the time wasted in missing the cache (accessing bits in the Bloom-filter table or the string for which hash values are computed)?

BTW, a lot of hash functions can be computed in parallel. What about using SIMD intel extensions?

Daniel

(1) True, but is it really the case for a Bloom-filter?

(3) It is also true, but those gadgets are mostly for fun (with a few exceptions). You are not going to run boilerplate detecting applications on smartphones? If not, you will quickly find out that everything else today is Intel ðŸ™‚