Are 64-bit random identifiers free from collision?

It is common in software system to map objects to unique identifiers. For example, you might map all web pages on the Internet to a unique identifier.

Often, these identifiers are integers. For example, many people like to use 64-bit integers. If you assign two 64-bit integers at random to distinct objects, the probability of a collision is very, very small. You can be confident that they will not collide.

However, what about the case where you have 300 million objects? Or maybe 7 billion objects? What is the probably that at least two of them collide?

This is just the Birthday’s paradox. Wikipedia gives us an approximation to the collision probability assuming that the number of objects r is much smaller than the number of possible values N: 1-exp(-r**2/(2N)). Because there are so many 64-bit integers, it should be a good approximation.

Number of objects Collision probability
500M 0.7%
1B 3%
1.5B 6%
2B 10%
4B 35%

Thus if you have a large system with many objects, it is quite conceivable that your randomly assigned 64-bit identifiers might collide. If a collision is a critical flaw, you probably should not use only 64 bits.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

8 thoughts on “Are 64-bit random identifiers free from collision?”

    1. Yes. Some people report using 64-bit hash values as identifiers and they sometimes believe that it makes collision highly improbable. That’s true when dealing with tiny sets, but as I demonstrate, for many practical cases, collisions are actually quite likely.

      So, yes, this post is motivated by my discussions with people building systems.

  1. Randomly generated UUIDs, which are 122 bits, are less problematic: the probability of getting a duplicate is much lower than the probability of being hit by a meteorite (I calculated that some time ago and put it on Wikipedia, but somebody edited it away).

    128-bit values are even better. However, there is a risk if you generate them with a “lower quality” hash function (not sure what low quality is… Murmur hash seems OK, SipHash is better I think). If an attacker can supply the data, then MD4, MD5, and even SHA-1 are risky: the attacker could use specially crafted data that will result in a hash collision. So, for this case, better use SHA-256.

  2. And that is why I always choose sequential id assignment instead of random 🙂

    This is, however, an interesting idea. What would be the probability to collide twice in a row if whenever we find a collision we retry and generate a new random 64-bot id?

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.