Sensible hashing of variable-length strings is impossible

Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.

I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.

Pick any hash function h: there is at least one hash value y such that the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h‘ and hash the keys in the set A. There is at least one hash value y‘ such that the set A‘={s is in A: h‘(s)=y‘} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. CQFD.

Further reading: Do hash tables work in constant time? (blog post) and Recursive n-gram hashing is pairwise independent, at best (research paper).

15 thoughts on “Sensible hashing of variable-length strings is impossible”

  1. Suppose A = {0, 2, 4, 6, …}
    and A’ = {1, 3, 5, 7, …}
    Their intersection is empty so the h and h’ that produced A and A’, respectively, do not collide on any values.

  2. @Yuriy I am not saying that any two infinite sets have a non-empty intersection.

    I am saying that if you distribute an infinite number of elements into a finite number of buckets, one of the buckets will contain infinitely many elements.

    I construct A’ from A, so A’ is a subset of A.

  3. Yes, one bucket, for h, will contain infinitely many elements. Then one different bucket, for h’, will contain infinitely many elements. And so on. You have |H| infinite-sized buckets.

    But when you say that an infinite number of keys collide over all H functions, you are saying that the intersection of the buckets is infinite.

    (Or is the statement you are trying to prove “There is always an infinite number of keys that are certain to collide over each h in H”, as opposed to “over all H”?

  4. Let us restart from the beginning. With one (predetermined) hash function h, you can find one hash value y such that infinitely many strings are mapped to y.

    Ok. Now, consider all these strings and only these strings (discard any string that does not satisfy h(s)=y). Then use a *different* hash function f’ and repeat. Once more there will be a hash value y’ such that infinitely many strings will map to y’ (and they also map to y using h).

    So far we have infinitely many strings s such that h(s)=y and h(s’)=y’.

    Just keep going, H times. Each time your set will grow smaller, but it will remain infinite. Because you only have H hash functions in total, you’ll end up with a set of strings such that they all map to y using the hash function h, they all map to y’ using hash function h’, they all map to y” using hash function h” and so on.

    So far so good?

    The trick, of course, is that I assume you have a finite number of hash functions, while having infinitely many strings. That’s the secret sauce.

  5. Clearly, we have very different definitions of the word “sensible”. Especially in combination with the “infinite” anything.

    Given the odd usage of “sensible”, I suppose you could use the function:

    h(s) = s

    Otherwise you are going to have to clarify.

  6. the set A={s:h(s)=y} is infinite, that is, you have infinitely many strings colliding

    Curious hypothesis about the keys.
    Infinitely many distinct keys entails an infinite key size average.
    Not terribly realistic…

  7. I’m not quite sure why you have several comments arguing against your relatively straightforward proof. It’s really not much more than a kind of pigeonhole principle for infinite sets.

    It is also reminiscent of part of the proof of the Heine-Borel theorem (closed and bounded implies compact) that “chases” the infinite part of a partition.

  8. If put into different words:
    You showed that there is an infinite number of keys, that for each hash function, will hash to the same value under that hash function.
    That is, there exists a set A of keys, such that
    for each h in H and for each x1, x2 in A, h(x1) = h(x2).

    Yuriy:
    This still works, because he didn’t force them to hash to *the same value*.
    So your h and h’ do not collide. Instead, they have a same infinite set of colliding keys.

    Daniel:
    I’m not sure that this makes hashing of infinite number of keys impractical. I will have to give it some more thought.
    Thanks for the interesting blog post.

  9. @Daniel

    So far as I could tell, you did not specify that the set of hashed-to integers was finite. 🙂

    Sometimes address-based hashing (if “hashing” is the right word) is the right solution.

    Please define “sensible”, in this context.

  10. I agree with Preston. It all depends on your definition of the word ‘sensible’.

    There is a math theorem that between any two irrational numbers there are infinitely many rational numbers. It’s true, but it’s not, shall we say, a helpful fact, because there are loads more irrational numbers in the gap than rational numbers. This result is a similar to your theorem: true but misleading.

    It’s always possible to be unlucky with your choice of hash function. But it’s fairly easy to choose a hash function where you are very, very unlikely to be unlucky. And for this pragmatist, that is good enough.

    I say ‘fairly easy’. It took Sun three major versions of Java to do it. In JDK 1.0 and 1.1 which sampled every n’th character and XORed and this, naturally, had issues.

    In JDK 1.2, String.hashCode() used to quit at about the 64th character, presumably to keep the cost of maintaining sets and maps of long strings in check. But I was using the hash map to keep unique names of members in mondrian. Many of these strings had a common prefix, and they would all end up in the same hash bucket. Performance was abyssmal.

    The problem was fixed in later versions of the JDK. (I may be wrong about the precise JDK versions. This is from memory.) At the time I seriously considered using a Trie [ http://en.wikipedia.org/wiki/Trie ] , but a web search didn’t turn up any robust Java implementations. It always struck me as strange that this simple and elegant data structure seems to find so few applications in the real world.

  11. One point that has not been made clear, I think, is that there are classes of hash functions for which the length of keys needed to generate high-probability collisions is exponential in the number of bits needed to specify the hash function. Thus, you can rest assure if your strings are not too long, even if they are drawn from an infinite set.

  12. As an aside: played a bit with a Trie implementation in Java over the weekend. After Julian’s comment, wondered how performance of a “fast” Trie might compare to a stock hash table, and this serves as an excuse to play with Tries.

    Also – in the context of the prior discussion – Tries can be viewed as a form of hashing without collisions.

    My result was not encouraging:
    http://bannister.us/weblog/2009/10/05/example-general-purpose-trie-in-java/

  13. Preston,

    Interesting. Thanks for doing the experiments. I would have thought that trie insert performance would suck (because of the effort of clearing an array of 16 or 256 pointers), but you found that access performance sucks too.

    The case I had in mind was for long strings — recall that my original problem was with strings longer than 64 chars, and JDK 1.2 stopped hashing at the 64th character — so it’s possible that tries have their place if strings are very long, say 1k each.

    But you’ve shown that tries are a long way behind hash tables. I think it’s down to poor use of the CPU cache. Resolving a string in a hash table should be only a couple of memory accesses, whereas a trie access might read 3 or 4 or 5 nodes. The memory wait time overwhelms the extra CPU cost of computing a hash code, while the string being scanned is all within the same block of CPU cache.

    Looks like tries just aren’t cut out for modern architectures.

    Julian

Leave a Reply

Your email address will not be published. Required fields are marked *