Hashing is a common concept in programming. It is the process of mapping any object to an integer. It is used in fast search algorithms (like Rabin-Karp) and in hash maps. Almost all non-trivial software depends on hashing. Hashing often works best when it appears random.

It used to be that hashing was deterministic. The creators of a language would specify once and for all how a given string would be hashed. For example, in Java, the hash value of the string “a” is 97 ("a".hashCode()). No matter how often you run a program, the hash value will always be 97.

This can be a security problem because an adversary can find many strings that hash to the value 97. And then hashing no longer appears random.

For this reason, language designers are migrating to random hashing. In random hashing, the hash value of the string “a” can change every time you run your program. Random hashing is now the default in Python (as of version 3.3), Ruby (as of version 1.9) and Perl (as of version 5.18).

In Ruby, random hashing means that if you run the following program many times, you will get different outputs…

puts "a".hash

The same is true of the following Python program…

print ({"a":1,"b":2})

Here is the output I got, running it twice:

$ python3 tmp.py
{'a': 1, 'b': 2}
$ python3 tmp.py
{'b': 2, 'a': 1}

To my knowledge, Java still relies on deterministic hashing by default. However, the specification requires you to expect random hashing: the return value of hashCode() can change from one program execution to another.

How many Ruby, Python and Perl programmers realize that they are relying on random hashing?

10 Comments

  1. Hi

    Main problem for hash it’s collisions resolution. Random hash it’s attempt exclude algorithmic attacks on hash table, because mostly collision resolution based on linked-list. But in java HashMap from jdk8 exist 2 different strategies: for buckets with low collisions use LinkedList, for buckets with high collisions use Balanced Tree. Strategies can switch in runtime based on statistics for buckets.

    http://openjdk.java.net/jeps/180

    =) I think it’s solved issue with security and in mostly attacks on hashtable.

    What you think about this technique?

    Comment by Alexey Diomin — 24/4/2014 @ 4:19

  2. @Alexey

    Yes, in Java, they have introduced specific strategies to alleviate problems resulting from collision attacks in the HashMap class. On the surface, it sounds reasonable and should help people who rely on HashMap. However, programmers use the hashCode method to create their own data structures and algorithms, and some use data structures from other libraries. By not adopting random hashing, Java puts such programmers at risk.

    Comment by Daniel Lemire — 24/4/2014 @ 10:35

  3. hi

    But how handle correct matching for distributed cache if every jvm will be use custom hashCode.

    Method hashCode can be override if you create own data structures (and it’s necessary, because supertype return wrong value for your instance). For Object.hashCode you can change behavior over -XX:hashCode=n

    Yes, you can’t override hashCode for internal java classes from jdk, but it’s really so important?

    Comment by Alexey Diomin — 25/4/2014 @ 4:21

  4. Clustering for memcache apply on client side, also on client side we find necessary shard by hash. Each Memcache server it’s very simple key-value storage. Any changes for distribution apply in client code and “yes” we can change algo for hashing, but it’s make some value unreachable, because we will ask incorrect server. Can you get me link for random hashing for Memcache?

    Comment by Alexey Diomin — 25/4/2014 @ 8:39

  5. Small example:

    N – count of servers in cluster,
    we try get data from server

    serverId = hash(object) % N

    but if we change hash function between runs or on different client then we can’t get data back =\

    Comment by Alexey Diomin — 25/4/2014 @ 8:47

  6. @Alexey

    Well, for distributed systems, you have other constraints. For example, you should use consistent hashing… so whatever Java or Python provides is not good enough in the first place.

    Comment by Daniel Lemire — 25/4/2014 @ 8:59

  7. Summary:
    1) internal hash never used for distributed systems
    2) algorithmic attacks on hash table not critical because we can auto-switch on tree for resolve collisions. (HashMap and ConcurrentHashMap implement this, 3th part implementation maybe yes, maybe not)
    3) for custom object we always have custom implementation for hashCode, because vm don’t know which field must be used for hash function
    4) Object.hashCode we always can change on start vm

    One place where random hash can be apply it’s internal classes from jvm, but problem really so important as some people thinking?

    Comment by Alexey Diomin — 25/4/2014 @ 9:13

  8. @Alexey

    The important point is that, as a programmer, you need to know about random hashing as it is now widespread.

    Comment by Daniel Lemire — 25/4/2014 @ 13:43

  9. I agreed with you, about every good programmer must know about random hashing (and I know about it), but I not agreed with necessary implement it in jvm.

    Comment by Alexey Diomin — 25/4/2014 @ 16:30

  10. @Alexey

    Sure, we can debate whether the core java classes should implement random hashing. I personally think that they should. There is no real downside for competent programmers and increased security as an upside.

    Comment by Daniel Lemire — 25/4/2014 @ 16:49

Sorry, the comment form is closed at this time.

« Blog's main page

Powered by WordPress