Almost all programming languages have a built-in library function for generating a Universally Unique Identifier (UUID), sometimes also called a Globally Unique Identifier (GUID). These libraries generate an identifier that we can trust to be unique not just in our database but in the whole universe, so that even if two databases are combined, there will be no common identifiers or merge conflicts. How do these libraries manage to produce an identifier that is unique in the whole universe in just a few milliseconds? We are going to unravel this mystery in this post.
So, let's start by looking directly at one such library function: the unique-id generator from Google Firestore (a NoSQL database)-
Github: Firestore id generation
Wait, what!? This isn't going around and checking identifiers across the universe. This is just a random string of 16 characters! But but but… the library said it is a universally unique identifier. How can it be unique in the universe when it doesn't even check that the identifier is unique in the local database? The answer lies in probability.
Probability of collision
As we can see, the code just generates a random string of 16 characters (128 bits), and it definitely does not verify that the generated string is unique. So, if we are generating strings purely at random, what is the probability that we generate a string once and then, at some later point in time, generate the same identifier again?
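To make that concrete, here is a minimal Python sketch of an equivalent generator. This is not the actual Firestore source (which draws printable characters from a fixed alphabet); it just shows the essential idea: produce 128 random bits and perform no uniqueness check whatsoever.

```python
import secrets

def random_id() -> str:
    # 16 random bytes = 128 bits, hex-encoded; no database lookup,
    # no uniqueness check of any kind
    return secrets.token_hex(16)

print(random_id())  # e.g. a 32-character hex string
```

Calling it twice in a row gives two independent random values; uniqueness is never enforced, only made overwhelmingly likely.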
We are assuming that the probability distribution over the strings is uniform, i.e. every string is equally likely to be generated. So, if we are generating a random string of 128 bits, the probability of generating any particular string will be $\frac{1}{2^{128}}$.
We are making this assumption to simplify the calculation. In reality, computer systems can't generate a truly random string; instead they use high-entropy data to approximate true randomness. If you don't know: entropy is the amount of randomness/unpredictability in a piece of data. For example, the time delay between keystrokes while a user is typing, or the time difference between two CPU clock cycles, are considered high-entropy sources because they are highly unpredictable.
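In Python, for instance, both of the calls below ultimately read from the operating system's cryptographically secure random number generator, which the kernel seeds from high-entropy events such as the interrupt and device timings mentioned above:

```python
import os
import secrets

raw = os.urandom(16)             # 16 bytes = 128 bits from the OS CSPRNG
token = secrets.token_bytes(16)  # same underlying entropy source

print(len(raw), len(token))  # 16 16
```

So "random" here really means "unpredictable to any observer", not "checked against previously issued values".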
With this, we can go back to our birthday-paradox argument from the previous post and say: the probability of collision reaches $\sim 0.5$ only if we generate on the order of $2^{64}$ strings, which is an enormous number.
Let's recall the same formula we derived in the previous post-

$$P(\text{collision}) \approx 1 - e^{-\frac{n^2}{2N}}$$

where $N = 2^{128}$ and $n$ is the number of random strings generated.
In this case, we can see that $e^{-\frac{n^2}{2N}}$ is a value only slightly lower than $1$, as $\frac{n^2}{2N}$ is a very small value because $N$ is very large.
So, the probability stays very close to $0$ for small values of $n$, becomes significant (around $0.5$) near $n \approx 2^{64}$, and then continues to grow toward $1$.
Quoting the Wikipedia article, the probability of finding a duplicate within 103 trillion UUIDs is one in a billion, which is a negligible probability.
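We can sanity-check that figure by evaluating the approximation numerically. Note that Wikipedia's number is for version-4 UUIDs, which carry 122 random bits rather than 128 (6 bits are fixed by the version and variant fields):

```python
import math

def collision_probability(n: float, random_bits: int) -> float:
    # birthday-paradox approximation: P ~= 1 - e^(-n^2 / 2N)
    N = 2.0 ** random_bits
    return 1 - math.exp(-(n * n) / (2 * N))

# 103 trillion version-4 UUIDs (122 random bits)
p = collision_probability(103e12, 122)
print(p)  # roughly 1e-9, i.e. one in a billion
```

The same function with `random_bits=128` shows how much further a full 128-bit identifier pushes the collision odds down.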
Is it all?
That’s it for this post. Thank you for reading 🙂
I would encourage you to read more about the UUID standards on Wikipedia.