Almost every programming language ships a library function for generating a Universally Unique Identifier (UUID), sometimes more humbly called a Globally Unique Identifier (GUID). These libraries generate an identifier that we can trust to be unique not just in our database but in the whole universe, so that even if two databases are merged, there will not be any common identifiers or merge conflicts. How do these libraries survey the entire universe and produce a universally unique identifier in just a few milliseconds? We are going to unravel this mystery in this post.
So, let's start by looking directly at one of these library functions, the unique-ID generator from Google Firestore (a NoSQL database):
Github: Firestore id generation
Wait, what!? This isn't going around the universe checking identifiers. It is just a random string of 16 characters! But the library said it is a universally unique identifier. How can it be unique in the whole universe when it doesn't even check that the identifier is unique in the local database? The answer lies in probability.
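In spirit, the generator is nothing more than the following. This is a minimal Python sketch under my own assumptions (alphabet, function name), not Firestore's actual code, which lives in its JavaScript client library:

```python
import secrets
import string

# A hypothetical stand-in for a Firestore-style auto-ID generator:
# pick each character uniformly at random from an alphanumeric alphabet.
# Note: no database lookup, no uniqueness check of any kind.
ALPHABET = string.ascii_letters + string.digits

def generate_id(length: int = 16) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_id())  # e.g. 'kP3xQ9aT0bYcL2Mz'
```

That's the whole trick: generate, don't verify.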
Probability of collision
As we can see, the code just generates a random string of 16 characters, on the order of 128 bits, and it definitely does not verify that the generated string is unique. So, if we are just generating strings randomly, what is the probability that we generate a string once and then, at a later point in time, generate the same identifier again?
Let’s see.
Assumption
We are assuming that the probability distribution over the strings is uniform, i.e. the probability of generating every string is equal. So, if we are generating a random string of 128 bits, the probability of generating any particular string will be 1/2^128.
We are making this assumption to simplify our calculation. In reality, computer systems can't generate a truly random string and instead use high-entropy data to simulate one. And if you don't know, entropy is the amount of randomness/unpredictability in a piece of data. For example, the time delay between keystrokes when a user is typing, or the time difference between two CPU clock cycles, are considered high-entropy sources because they are highly unpredictable.
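On most operating systems, the kernel pools this kind of entropy and exposes it through a cryptographically secure pseudo-random number generator (CSPRNG). In Python, for instance, `os.urandom` (and the `secrets` module built on top of it) reads from that source:

```python
import os
import secrets

# 16 bytes = 128 bits drawn from the OS CSPRNG, which is seeded
# from the kernel's entropy pool.
raw = os.urandom(16)
print(raw.hex())

# The secrets module is a higher-level wrapper over the same source.
token = secrets.token_hex(16)  # 32 hex characters = 128 bits
print(token)
```

Different platforms collect entropy differently, but the interface looks the same to the application.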
With this, we can go back to our argument of the birthday paradox from the previous post and we can say: the probability of collision reaches ~0.5 only once we have generated about 2^64 strings, which is an enormous number (roughly 1.8 × 10^19).

Let's recall the formula we derived in the previous post:

p(n) ≈ 1 − e^(−n² / 2N)

where N = 2^128 is the total number of possible strings, and n is the number of random strings generated.

In this case, we can see that e^(−n² / 2N) is a value only slightly lower than 1, as n² / 2N is a very small value because N is very large. So, the probability is going to be very close to 0 for small values of n, reaches ~0.5 around n ≈ 2^64, and then continues to grow towards 1.
Quoting the Wikipedia article, the probability of finding a duplicate within 103 trillion UUIDs is one in a billion (this figure is for version-4 UUIDs, which carry 122 random bits), and that is a negligible probability.
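We can check these numbers directly by plugging them into the birthday-bound formula above:

```python
import math

def collision_probability(n: float, bits: int) -> float:
    # Birthday bound: p(n) ≈ 1 - e^(-n^2 / 2N) with N = 2^bits.
    # expm1 avoids the floating-point loss of 1 - exp(-x) for tiny x.
    return -math.expm1(-(n * n) / (2.0 * 2.0 ** bits))

# 103 trillion version-4 UUIDs (122 random bits):
print(collision_probability(103e12, 122))  # ≈ 1e-9, one in a billion

# A billion 128-bit identifiers: vanishingly small.
print(collision_probability(1e9, 128))
```

The formula is an approximation, but for n much smaller than 2^64 it is extremely accurate.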
Is that all?
Not really. There is still a lot to discuss about UUIDs, like the multiple standards and formats available in standard libraries (please refer to Wikipedia for the details). In the end, the choice of implementation depends on the requirements. When the value needs to be unpredictable, as in security applications, we must generate the identifier from high-entropy data, while in other cases low-entropy data is sufficient. For example, the C library's rand() function does not use any high-entropy data and is insecure for cryptographic applications. And as we saw, Firestore simply uses JavaScript's random function to generate document IDs, which is not a very high-entropy source.
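In Python, the same contrast looks like this: the `random` module is a fast but seedable (hence predictable) PRNG, while `uuid.uuid4()` and the `secrets` module draw from the OS CSPRNG:

```python
import random
import secrets
import uuid

# Predictable: Mersenne Twister, seedable and fully reproducible.
# Fine for simulations, unsafe for security tokens (like C's rand()).
random.seed(42)
predictable = random.getrandbits(128)

# Unpredictable: backed by the OS CSPRNG; cannot be reproduced by reseeding.
unpredictable = secrets.randbits(128)

# A standard version-4 UUID: 122 random bits (CSPRNG-backed in CPython).
u = uuid.uuid4()
print(u)
```

Pick the source to match the threat model, not just the API convenience.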
That’s it for this post. Thank you for reading 🙂
I would encourage you to read more about the UUID standards on Wikipedia.