Probabilities and the C++ standard

The new C++ standard introduced hash functions and hash tables into the language (as “unordered maps”).

As every good programmer should know, hash tables only work well if collisions between keys are rare. That is, if you have two distinct keys k1 and k2, you want their hash values h(k1) and h(k2) to differ most of the time.

The C++ standard does not tell us how the keys are hashed, but it gives us two rules:

  • The value returned by h(k) shall depend only on the argument k.
  • For two different values k1 and k2, the probability that h(k1) and h(k2) “compare equal” (sic) should be very small.

The first rule says that h(k) must be deterministic. This is in contrast with languages like Java, where the hash value can depend on a random number if you want (as long as the value remains the same throughout the execution of a given program).

It is a reasonable rule. It means that if you are iterating through the keys of an “unordered set”, you will always visit the keys in the same order… no matter how many times you run your program.
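For example, assuming a conforming implementation, the following sketch prints the keys in the same order on every run of the program (which order, exactly, is up to the implementation):

    #include <iostream>
    #include <string>
    #include <unordered_set>

    int main() {
        std::unordered_set<std::string> keys = {"apple", "banana", "cherry"};
        // std::hash is deterministic, so repeated runs print the keys
        // in the same (implementation-defined) order.
        for (const std::string& k : keys)
            std::cout << k << '\n';
    }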

It also means, unfortunately, that if you find two values k1 and k2 such that h(k1) and h(k2) are equal, then they will always be equal, for every program and every execution of said programs.

The second rule is less reasonable. By the first rule, h(k1) and h(k2) are constant values that are always the same. There is no random model involved. Yet, somehow, we want the probability that they are equal to be low.

I am guessing that they mean that if you pick k1 and k2 at random, the probability that they hash to the same value is low, but I am not sure. If that is what they mean, then it is a very weak requirement: a vendor could simply hash strings down to their first character. That is a terrible hash function!
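To make the point concrete, here is a sketch of such a degenerate hash function (the name FirstCharHash is mine, not anything in the standard):

    #include <cstddef>
    #include <string>
    #include <unordered_set>

    // A terrible hash: it depends only on the first character, so "apple"
    // and "avocado" always collide. Yet it is deterministic, and two strings
    // picked at random usually start with different characters.
    struct FirstCharHash {
        std::size_t operator()(const std::string& s) const {
            return s.empty() ? 0
                : static_cast<std::size_t>(static_cast<unsigned char>(s[0]));
        }
    };

    std::unordered_set<std::string, FirstCharHash> words; // compiles and "works"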

I am under the impression that the next revision of the C++ standard will fix this issue by following in Java’s footsteps and allowing hash functions to vary from one run of a program to another. That is, C++ will embrace random hashing. This will help us build safer software.
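Meanwhile, nothing stops you from approximating random hashing yourself with a custom hash functor. Here is a sketch; the per-process seed and the mixing step are my own choices, not anything mandated by the standard:

    #include <cstddef>
    #include <functional>
    #include <random>
    #include <string>
    #include <unordered_set>

    // A seed drawn once per process: hash values vary between runs.
    static const std::size_t hash_seed = std::random_device{}();

    struct SeededHash {
        std::size_t operator()(const std::string& s) const {
            std::size_t h = std::hash<std::string>{}(s);
            // Mix in the seed (a hash_combine-style step).
            return h ^ (hash_seed + 0x9e3779b9 + (h << 6) + (h >> 2));
        }
    };

    std::unordered_set<std::string, SeededHash> keys;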

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Probabilities and the C++ standard”

  1. Since the hash function isn’t specified, it seems to me that an implementation that picks a new hash function each time the program runs would be standard compliant. One way to “pick a new hash function on each run” would be to have the hash function rely on a random number that is allowed to differ between runs. However, I don’t know enough to say whether that actually would comply.

But, even assuming that the hash function isn’t allowed to change between runs (i.e., it’s “set in stone”), I don’t think the standard prohibits a hash table from re-ordering elements within a bucket when they are accessed (move-to-front), which would mean that iterating the hash table multiple times could return different orders even if there have been no insertions or deletions. But, again, I don’t know enough of the standard to say so definitively.

  2. The C++11 standard you link already gives you the capability to specify a custom hash function.

    Examine the template type parameters for the unordered_map class:

    template<
        class Key,
        class T,
        class Hash = std::hash<Key>,
        class Pred = std::equal_to<Key>,
        class Allocator = std::allocator<std::pair<const Key, T>>
    > class unordered_map;

    The third type parameter is Hash, and you can supply a custom type (a functor) to instantiate it.
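    For example, here is a minimal sketch of a custom hash (with its matching equality predicate) for a user-defined key type; the names and the mixing scheme are just one possible choice:

        #include <cstddef>
        #include <functional>
        #include <string>
        #include <unordered_map>

        struct Point { int x, y; };

        // Custom hash functor for Point.
        struct PointHash {
            std::size_t operator()(const Point& p) const {
                std::size_t h1 = std::hash<int>{}(p.x);
                std::size_t h2 = std::hash<int>{}(p.y);
                return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
            }
        };

        // Matching key-equality predicate.
        struct PointEq {
            bool operator()(const Point& a, const Point& b) const {
                return a.x == b.x && a.y == b.y;
            }
        };

        std::unordered_map<Point, std::string, PointHash, PointEq> labels;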

    See this StackOverflow answer for a quick walkthrough of implementing custom hashing and key-equality functions:
    http://stackoverflow.com/a/15809592/1301580

  3. @Max

    > have the hash function rely on a random number

    They go out of their way to specify that “h(k) shall depend only on the argument k”, so my interpretation is that it cannot depend on a random seed. (Update: Of course, what can they do to prevent you from doing exactly what you describe?)

    > I don’t think the standard prohibits a hash table from re-ordering elements in a bucket when they are accessed

    Yes, the unordered map can certainly do something like this, but it would still mean that the same program, with the same inputs, lists the keys in the same order.

    Once you have random hashing, running the same program with the same inputs twice would give you different results.

    This might throw some people off.

  4. > Once you have random hashing, running the same program with the same inputs twice would give you different results.

    > This might throw some people off.

    There’s no question about that. The Perl guys originally turned random hashing on everywhere, but that broke several modules that relied on deterministic ordering of a hash’s contents, so they changed things to only use random hashing when they detected large numbers of collisions.

    But it turned out that wasn’t random enough, so they recently randomized it more ( http://blog.booking.com/hardening-perls-hash-function.html ). If there wasn’t a valid security rationale, I doubt the changes would have stayed. As it is, they upset several programmers by breaking expectations.

    I’m not on the committee, but in my opinion, it would be a mistake to add any requirements on the hashing function other than “it’s idempotent (per program run)” and “it minimizes collisions.”

  5. Sorry, “idempotent” is the wrong term. I guess the best term is “it’s deterministic (per run of the program).” But, usually, you have to say “NOTE: the function *may* return different results for different runs of the program” to really get the point across.

  6. @Geoff

    Thanks. Yes, I was aware that, with some hard work, one could provide his own hash functions. And as I hint in my answer to Max, I believe that you can ignore the specification when building your own functions.

In regard to “that might throw some people off”, I wanted to say that the Perl experience is that this is a *good* thing. It prevents a developer from relying on insertion order into the hash (many if not most hash implementations are sensitive to this). And it helps find edge-case scenarios most developers would never think to test.

    Most devs think that making things random makes it harder to reproduce bugs. The Perl experience is the opposite: the randomization effectively tries out every permutation of keys, and if the code is sensitive to this, it very quickly becomes apparent. Run it enough times and the failure will eventually occur.
