It is common in software systems to map objects to unique identifiers. For example, you might map every web page on the Internet to a unique identifier.

Often, these identifiers are integers; for example, many people like to use 64-bit integers. If you assign two 64-bit integers at random to two distinct objects, the probability of a collision is very, very small. You can be confident that they will not collide.

However, what about the case where you have 300 million objects? Or maybe 7 billion objects? What is the probability that at least two of them collide?

This is just the birthday paradox. Wikipedia gives us an approximation to the collision probability, assuming that the number of objects r is much smaller than the number of possible values N: 1-exp(-r**2/(2N)). Because there are so many 64-bit integers (N = 2**64), it should be a good approximation.

Number of objects | Collision probability
---|---
500M | 0.7%
1B | 3%
1.5B | 6%
2B | 10%
4B | 35%
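You can reproduce these numbers from the approximation with a few lines of Python (a quick sketch; "500M" means 5 × 10⁸, and so on):

```python
import math

def collision_probability(r: float, n_bits: int = 64) -> float:
    """Approximate probability that at least two of r randomly
    assigned n_bits-bit identifiers collide.

    Uses the birthday-paradox approximation 1 - exp(-r^2 / (2N)),
    which is valid when r is much smaller than N = 2**n_bits."""
    N = 2.0 ** n_bits
    return 1.0 - math.exp(-r * r / (2.0 * N))

for r in (500e6, 1e9, 1.5e9, 2e9, 4e9):
    print(f"{r:>12,.0f} objects: {collision_probability(r):.1%}")
```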

Thus if you have a large system with many objects, it is quite conceivable that your randomly assigned 64-bit identifiers might collide. If a collision is a critical flaw, you probably should not use only 64 bits.

Interesting! I’d be curious to know why you wrote a post about that. Did you encounter such a problem in your own work?

Yes. Some people report using 64-bit hash values as identifiers and they sometimes believe that it makes collision highly improbable. That’s true when dealing with tiny sets, but as I demonstrate, for many practical cases, collisions are actually quite likely.

So, yes, this post is motivated by my discussions with people building systems.

And with 64 bits, collisions are easy to find! I did this exercise recently with FNV-1a: https://twitter.com/pstbrn/status/1201044024892829697

(Those are hash collisions, not random identifiers. But finding them uses the same principle: the birthday attack.)
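The birthday attack itself is easy to sketch. The toy hash below (my own illustration, not the FNV-1a attack from the tweet) truncates SHA-256 to 20 bits, so a collision turns up after roughly 2**10 attempts; the same brute-force search against a full 64-bit hash would need on the order of 2**32 attempts:

```python
import hashlib

def tiny_hash(data: bytes, bits: int = 20) -> int:
    # Truncate SHA-256 to a small number of bits so that
    # collisions are cheap to find in a demo. The attack
    # scales to 64-bit hashes at ~2**32 work.
    h = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return h & ((1 << bits) - 1)

def find_collision(bits: int = 20) -> tuple[bytes, bytes]:
    # Birthday attack: hash distinct inputs, remember each
    # hash value seen, and stop at the first repeat.
    seen: dict[int, bytes] = {}
    i = 0
    while True:
        data = str(i).encode()
        h = tiny_hash(data, bits)
        if h in seen:
            return seen[h], data
        seen[h] = data
        i += 1

a, b = find_collision()
print(a, b, tiny_hash(a) == tiny_hash(b))
```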

Randomly generated UUIDs, which are 122 bits, are less problematic: the probability of getting a duplicate is much lower than the probability of being hit by a meteorite (I calculated that some time ago and put it on Wikipedia, but somebody edited it away).

128-bit values are even better. However, there is a risk if you generate them with a “lower quality” hash function (I am not sure exactly where to draw the line: MurmurHash seems OK, and SipHash is better, I think). If an attacker can supply the data, then MD4, MD5, and even SHA-1 are risky: the attacker could use specially crafted data that results in a hash collision. So, in that case, it is better to use SHA-256.

What about a 160-bit hash built by ripemd(sha256), i.e., Bitcoin?

160-bit is perfectly fine.

There is a huge difference between 64-bit and 160-bit.

And that is why I always choose sequential id assignment instead of random 🙂

This is, however, an interesting idea. What would be the probability of colliding twice in a row if, whenever we find a collision, we retry and generate a new random 64-bit id?
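One way to reason about that question (my sketch, assuming uniform, independent draws and n ids already assigned): a fresh random 64-bit id collides with probability n/2**64, so two independent draws colliding in a row happens with probability roughly (n/2**64)**2, which is vanishingly small even for billions of existing ids:

```python
def p_single(n: int, n_bits: int = 64) -> float:
    # Probability that one freshly drawn n_bits-bit id
    # collides with one of n already-assigned ids.
    return n / 2.0 ** n_bits

def p_twice_in_a_row(n: int, n_bits: int = 64) -> float:
    # Probability that two independent retries both collide.
    return p_single(n, n_bits) ** 2

n = 4_000_000_000  # 4 billion existing ids
print(p_single(n))          # ~2.2e-10
print(p_twice_in_a_row(n))  # ~4.7e-20
```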