Hashing is a programming technique that maps objects (such as strings) to integers. It is a necessary component of hash tables, one of the most frequently used data structures in Computer Science.
Typically, hash tables have the property that looking up or storing the value associated with a key takes constant time. If you use user identifiers to retrieve names and phone numbers, you can scale up to millions and millions of users without a performance penalty. However, the worst-case complexity of a hash table is linear: it may need to go through most of the stored values each time you look up a key. Thankfully, the worst case is typically improbable: it only happens when too many objects hash to the same value. In practice, hash functions are chosen so as to spread hash values uniformly (pseudo-randomly).
Most programming languages like Java or C++ use deterministic hash functions. This means that a given string always hashes to the same integer, in every piece of Java software in the world. Overall, deterministic hashing works quite well. Unfortunately, it is also insecure. If you are building a web application and hackers know which hash function you are using, they can mount a denial-of-service attack and bring down your application. The gist of it is not complicated: it suffices to force the hash tables into their worst-case performance.
This is very serious: it means that if you rely on the default hash functions of your programming language (e.g., String.hashCode in Java), your application could be at risk. On this issue, Alexander Klink and Julian Wälde issued a well-written security advisory.
The fix is relatively simple: programming languages need to adopt random hashing. In random hashing, every time the software is initialized, a new hash function is picked, at random. This does not make attacks impossible, but it makes them much more difficult.
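To make this concrete, here is a sketch (in Java, purely for illustration; this is not how any particular language implements it) of a randomized iterated string hash, where the constants are drawn at random once per process:

```java
import java.util.Random;

// Sketch of random hashing: pick a random odd multiplier and a random seed
// once when the process starts, so an attacker cannot predict in advance
// which strings will collide. (A cryptographic generator such as
// java.security.SecureRandom would be preferable in an adversarial setting.)
public class RandomizedStringHash {
    private static final Random RNG = new Random();
    private static final int MULTIPLIER = RNG.nextInt() | 1; // odd, hence invertible
    private static final int SEED = RNG.nextInt();

    static int hash(String s) {
        int h = SEED;
        for (int i = 0; i < s.length(); i++) {
            h = MULTIPLIER * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // The value changes from one run of the program to the next.
        System.out.println(hash("AB"));
    }
}
```

Such a simple randomization is only a sketch; the papers listed under further reading discuss families of hash functions with stronger guarantees.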
The problem is not novel. In 2003, Crosby and Wallach raised the issue and many responsible vendors fixed their products. Alas, the only programming languages to adopt random hashing were Ruby and Perl. Others are more reluctant.
So, how easy is it to hack the hash functions in, say, Java? Java uses an iterated hash function. At each iteration, iterated hash functions compute a new hash value from the preceding hash value and the next character. Strings in Java are hashed using the function
F(y,c) = 31 y + c,
where y is the previous hash value and c is the current character value. Thus, the hash value of a string made of the characters 65, 66 (corresponding to “AB” in ASCII) is 31 times 65 + 66, which is 2081.
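In code, the recurrence looks like this (a small sketch; it computes the same value as Java's String.hashCode):

```java
public class StringHashDemo {
    // Iterated hashing as used by java.lang.String: start from 0 and fold in
    // one character at a time using F(y, c) = 31 * y + c.
    static int iteratedHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(iteratedHash("AB"));  // prints 2081
        System.out.println("AB".hashCode());     // also 2081
    }
}
```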
Why does Java use the number 31? The choice is somewhat arbitrary (and 31 might fail to be ideal), but because it is an odd number, the compression function F is a permutation of the 32-bit integers (for any fixed character c), which helps distribute the hash values more uniformly.
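To see why oddness matters: an odd multiplier has a multiplicative inverse modulo 2^32, so multiplying by it loses no information. A quick check (the inverse is computed here by Newton iteration, a standard trick):

```java
public class OddInverse {
    public static void main(String[] args) {
        // Because 31 is odd, it has an inverse modulo 2^32, so multiplication
        // by 31 is a bijection on 32-bit integers. Newton's iteration
        // inv <- inv * (2 - 31 * inv) doubles the number of correct low bits
        // per step, so five steps give a full 32-bit inverse.
        int inv = 1;
        for (int i = 0; i < 5; i++) {
            inv *= 2 - 31 * inv;
        }
        System.out.println(31 * inv); // prints 1 (int arithmetic wraps modulo 2^32)
    }
}
```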
It is fairly hard to construct reasonable strings that collide over 32 bits in Java. However, a modest hash table will use only the first few bits of the hash values. Let us consider only the first 16 bits. It is not difficult to check that the strings “Ace”, “BDe”, “AdF” and “BEF” all have the same hash value in Java.
Of course, having 4 strings colliding will not disrupt hash tables. But because the hash function is iterated, we can multiply the number of collisions. Indeed, any two strings built by concatenating the same number of these four colliding blocks will also collide. This means that you can construct 16 strings of length 6 that all collide (“AceAce”, “AceBDe”, “AceAdF”, “AceBEF”, “BDeAce”, “BDeBDe”, “BDeAdF”, “BDeBEF”, “AdFAce”, “AdFBDe”, “AdFAdF”, “AdFBEF”, “BEFAce”, “BEFBDe”, “BEFAdF”, and “BEFBEF”). You can keep going to 64 strings of length 9. And so on.
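Here is a sketch of the construction: start from the four colliding blocks and take every concatenation of k of them, giving 4^k strings that all share the same hash value.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: build 4^k colliding strings by concatenating k blocks taken from
// four strings that already collide under String.hashCode.
public class CollisionGenerator {
    static List<String> collide(int k) {
        String[] blocks = {"Ace", "BDe", "AdF", "BEF"};
        List<String> result = new ArrayList<>();
        result.add("");
        for (int round = 0; round < k; round++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result)
                for (String block : blocks)
                    next.add(prefix + block);
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // With k = 2 we get the 16 strings of length 6 listed above;
        // all of them report the same hashCode.
        for (String s : collide(2))
            System.out.println(s + " -> " + s.hashCode());
    }
}
```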
How badly does this impact the performance of a hash table? I tried inserting all the colliding strings into a Java Hashtable container. For comparison, I also inserted randomly chosen strings into either a Hashtable or a TreeMap (a tree structure). The net result is that what should be a tiny cost (0.006 s) becomes a massive cost (30 s). A server able to process thousands of queries per second could find itself bogged down, handling only a couple of queries per second.
| number of strings | hash table: average time (s) | hash table: worst time (s) | tree: average time (s) |
|---|---|---|---|
| 16384 | 0.002 | 1.1 | 0.005 |
| 65536 | 0.006 | 30 | 0.03 |
For these tests, I am using a MacBook Air with a 1.8 GHz Intel Core i7 running Java 6. My code is available.
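For reference, a minimal timing experiment along these lines might look like the sketch below (an illustration reusing the collide() helper from the earlier sketch, not the exact benchmark from the repository):

```java
import java.util.Hashtable;
import java.util.List;

// Sketch of the experiment: time the insertion of 4^8 = 65536 colliding
// strings into a Hashtable. Because every key shares one hash value, the
// insertions degrade to the linear worst case and take orders of magnitude
// longer than inserting 65536 random strings.
public class CollisionBenchmark {
    public static void main(String[] args) {
        List<String> keys = CollisionGenerator.collide(8); // 65536 colliding keys
        Hashtable<String, Integer> table = new Hashtable<>();
        long start = System.nanoTime();
        for (String key : keys) {
            table.put(key, key.length());
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println("insertion time: " + seconds + " s");
    }
}
```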
Why aren’t programming languages adopting random hashing? A potential issue is that language designers like determinism. They much prefer reproducible bugs. Nevertheless, any expert programmer should be aware of this problem.
Further reading:
- Scott A. Crosby and Dan S. Wallach, Denial of Service via Algorithmic Complexity Attacks, USENIX Security '03.
- Daniel Lemire, The universality of iterated hashing over variable-length strings, Discrete Applied Mathematics 160 (4-5), 2012.
- Owen Kaser and Daniel Lemire, Strongly universal string hashing is fast.
Update: I initially reported that Ruby was the only language to adopt random hashing. In fact, Perl adopted random hashing with version 5.8.1. In version 5.8.2, Perl adopted a hybrid that switches between a deterministic and a random hash function when needed. (Thanks to Mike Giroux for the pointer.)
Code: Source code posted on my blog is available from a github repository.
> Why aren’t programming languages adopting random hashing?
There are many reasons to use deterministic hashing besides making a language easier to debug.
I recommend reading this extensive comment in the python source tree about its choice of hash function: http://hg.python.org/cpython/file/12de1ad1cee8/Objects/dictobject.c#l34
Basically, in a hash-based language like python, speed is absolutely critical, and so a simple deterministic function beats a random one. They also choose a hash function with good memory performance in common cases.
ISTM that a better solution than “use randomized hashing”, which has many downsides for *everyone*, is this: if you’re accepting possibly malicious input, you should check it before hashing.
It’s more work for the implementor of the program, but it doesn’t penalize absolutely everyone else. (Think of the poor scientist whose numpy program would slow down because Django wanted randomized hashing 🙂)
Also, it’s notable that the implementers of the algorithm implicitly acknowledge that it’s vulnerable to an attack of this sort: “it’s extremely unlikely hash codes will follow a 5*j+1 recurrence by accident”.
Great bit of information. Thanks for summarizing!
It seems to me that this should be handled at the application level, rather than at language design. Security isn’t really a concern for many users of the language (such as those in the research community who mostly use Java for building prototypes).
For a commercial application, programmers would also need to do a lot more in order to be truly resistant to all types of DoS attacks. So overriding the default hash function doesn’t seem like a lot to ask for.
@Marie
I agree that it is one of countless issues to watch out for.
As it was successfully implemented in Ruby, a mainstream language, it is unclear why it cannot be implemented in other languages like Java.
As far as I can tell, it would be fairly difficult for a user to override String.hashCode since the String class is final. You would need to create your own String class or your own collection framework. You can also modify the strings themselves, for example by prepending a seed, but again you pay a price. Most likely, software using Java will rely on workarounds other than random hashing.
@Bill
An interesting link, thanks! I think it makes plenty of sense not to sacrifice too much performance for security.
I wonder however how much of a price Ruby is paying, if any, for its random hash function.
Is there a performance price at all? If I am reading it correctly, a “random hash function” would only replace that 31 with a new random value chosen every time the *application* is started. So for a large application the overhead would be negligible.
> Of course, having 4 strings colliding will not disrupt hash tables. But because the hash function is iterated, we can multiply the number of collisions. Indeed, any two same-length sequences of these four colliding strings will also collide. This means that you can construct 16 strings of length 6 all colliding. You can keep going to 64 strings of length 9.
Daniel,
I’m sorry but I did not grasp this part. Can you please explain it more?
<>
Well, and perl – perl’s hashes don’t randomly permute by default, but they do if a given hash chain gets too long. See “perldoc perlsec” for any perl newer than 5.8.1 – all hashes were randomized in 5.8.1, but for compatibility in 5.8.2 that was changed so that only hashes which are “behaving badly” get randomized.
(Whoops, I was trying to quote your “Alas, the only programming language to adopt random hashing was Ruby.” line, but it looks like it got swallowed at the top of comment 7)
@Amey I’ll edit my post to give an example.
@Mike Thanks. I updated the post and I give you credit for the observation.
Thanks a lot Daniel. Appreciate it.
The hash involving 31 is the Knuth hash right? Or is that 33? In any case it’s not a top-performance hash anyway so that’s probably something to fix first.
@Federico Picking values other than 31 or 33 will have a performance impact on the hash table because those hash functions produce far more collisions.
What I’d rather see is two HashMap classes so that ordinary people can still have high performance when necessary. For security reasons, I’d say that HashMap should be randomized and DetHashMap could be a faster deterministic version.
After looking, I must’ve confused Knuth with Kernighan and Ritchie – that’s who I’ve seen credited with 31, and Bernstein with 33.
I believe the note I saw about 31 and 33 being much better was in Sedgewick’s book but I’ll have to check when I get home. Here’s one article that’s picky about using 33, but no data that I see:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
I seem to recall reading that 33 and 31 were partly important because they were primes greater than the number of lowercase English characters, which is a common evaluation for hashes. There’s some inkling of that in the following link and it also suggests that you don’t want to use those hashes with numeric data.
http://www.strchr.com/hash_functions
Part of my issue with multiplicative hashes is that short keys tend to clump up. In language processing, that can accidentally lead to many collisions when looking up common words like “a”, “the”, etc., and kill performance.
@Keith
> The hash involving 31 is the Knuth hash right?
This type of hash function is used in the Karp-Rabin algorithm, though maybe not with value 31. Could be that Knuth should get credit for it, I don’t know.
> Picking values other than 31 or 33 will have a performance impact on the hash table because those hash functions produce far more collisions.
Do you have a reference supporting this statement? GCC uses 5.
I looked around but couldn’t find where I’d seen that 31 or 33 were special for multiplicative hash. I still think I read that somewhere but can’t find it. It could’ve been something in one of my students’ projects, but I don’t have those around anymore.
If I get some free time, I’ll run a test. In the meantime, I’ll weaken my statement and say that the constant should be a prime and there may be collision ramifications for some of those primes compared to others and it’d be good to check before blindly using any arbitrary prime. Also, I’m assuming that we’re only talking about power-of-2 table sizes for what it’s worth.
@Keith
You probably meant “odd” and not “prime”.
whoops, that was an editing goof on my part – when I wrote the first part (saying that we want primes) I was assuming an API-like hashtable where we might not have control over the size. Then I wrote that I was assuming power-of-two sizes.