The Birthday Paradox

T


Hello there! Welcome to the first post on this blog.
In this post, we will understand what is popularly known as “The Birthday Paradox” and along with that, we will look into what this paradox means in cryptography, where the birthday paradox is known as the birthday attack.

Prerequisites

You will need to know the basics of combinatorics & probability theory to understand this article.

Assumptions

For the first part of the article, we also assume that birthdays are distributed uniformly across the year i.e. an equal number of people are born on all days of the year. This assumption is not even close to true, but to begin, it will make it easy to understand the calculations. We will also see what happens when the distribution is not uniform.

The answer

The probability that at least two people have the same birthday with 23 people is at least \approx 50\%. The reason this is called a paradox is that the number 23 looks so small compared to the number of days in a year i.e. 365 days, but still, the probability of at least two people having the same birthday is as high as \approx50\%. These values are calculated with the assumption of uniform distribution and if the distribution is non-uniform, the probability will be higher than 50\%.

Next, let’s see how we get this answer. I would highly recommend that you try to solve the problem yourself before you start reading the next section. You can also start reading the solution, and whenever you feel you can do the rest of the process yourself, feel free to stop reading and continue on your own.

Solution

Let’s think of the problem this way, we have 23 people and we have to assign a birthday between 1 and 365 (days of the year) to each of them. We can define the probability that at least two people will share the birthday

    \begin{equation*}= \frac{number\:of\:combinations\:where\:at\:least\:2\:people\:share\:the\:birthday}{total\:number\:of\:possible\:combinations\:of\:birthdays}\end{equation*}

(We can only use this formula when the probability distribution for the values is uniform.)

For every person, we have 365 possible values and so the total number of possible combinations of the values that the people can take is 365^{23}.

Now, out of these 365^{23} combinations, we need the number of combinations where at least two people have the same birthday. To calculate that, we will calculate the number of combinations where no one shares the birthday with anyone else i.e. every person is assigned a different birthday and then subtract this number from the total number of combinations.

Number of combinations where no two people share the birthday = 365 * 364 * 363 * .. *343
So, the number of combinations where at least two people share the birthday will be 365^{23} - (365*364*363*...*343)

Using the above formula, the final probability will be

    \begin{equation*}\begin{align*}&=\frac{365^{23} - (365*364*...*343)}{365^{23}} \\&\approx 0.50729\end{equation*}

Simple! Wasn’t it? This paradox works because intuitively we try to compare the number of people with the number of days in a year. Instead, we should be comparing the number of possible pairs which is quite high even with relatively fewer people.

But birthdays are not uniform…

We have got the answer to the question we posed, but it works only if the given distribution is uniform. Now, we’ll see what to do if the birthday distribution is not uniform. Let’s start with the notations we will now use:

N – Total number of days
d_{i}i^{th} day of the year
p_{i} – Probability that a person is born on d_{i}
q – Total number of people
q_{i}i^{th} person from the group
P(q, N, distribution) – Probability that in a group of q people, and considering N days in a year, at least two people will share the birthday for the given distribution

For any probability distribution, data will be given like –

The data will specify the probability that a person is born on that day for every day of the year. In case of uniform distribution, all these probabilities become equal to \frac{1}{N}. Let’s say the name of distribution given to us is dist.

The probability that the q_{i}‘s birthday lies on a specific day d_{k} is given in the data, say p_{k}. Let’s take another person q_{j} where i \neq j, the probability that q_{i} is born on d_{k} and q_{j} is also born on the same day d_{k} = p_{k} * p_{k} (Because birthday’s are independent)

The probability that the birthday of both the people lie on the same day (any day of the year) = probability for the first day + probability for the second day + … probability for the N^{th} day

    \begin{equation*}\begin{align*}&= (p_{1})^{2} + (p_{2})^{2} + (p_{3})^{2}... + (P_{N})^{2} \\&= p (say)\end{align*}\end{equation*}

The probability that q_{i} and q_{j} will have the different birthday = 1-p

This is the probability that a given pair (q_{i},\:q_{j}) have the different birthday.
Total number of pairs in the group of q people

    \begin{equation*}\begin{align*}&= \binom{q}{2} \\&= \frac{q*(q-1)}{2}\end{align*}\end{equation*}

Probability that all the pairs have different birthdays,

    \begin{equation*}\begin{align*}&= (1-p)*(1-p)*(1-p)... \frac{q*(q-1)}{2}\:times \\&= (1-p)^{\frac{q*(q-1)}{2}}\end{align*}\end{equation*}

The probability that at least one pair will have the same birthday for the given distribution dist,

    \begin{equation*}P(q,N,dist) = 1 - (1-p)^{\frac{q*(q-1)}{2}}\end{equation*}

If we put dist as uniform distribution where p_{i} = \frac{1}{N} for all values of i.

    \begin{equation*}\begin{align*}p &= (p_{1})^{2} + (p_{2})^{2} + (p_{3})^{2}... + (P_{N})^{2} \\p &= \frac{1}{N^{2}} + \frac{1}{N^{2}} + \frac{1}{N^{2}} +... + N times \\p &= N * \frac{1}{N^{2}} \\p &= \frac{1}{N}\end{align*}\end{equation*}

Using the formula derived, for uniform distribution

    \begin{equation*}P(q,N,"uniform") = 1 - (1 - \frac{1}{N})^{\frac{q*(q-1)}{2}}\end{equation*}

Which will yield the same answer as before for q = 23 and N = 365 i.e. \approx0.50729
Let’s see the graph for different values of q and N = 365

Observations

1. The probability that at least two people will have the same birthday is \approx 1 with just 70 people in the group.
2. The probability that at least two people will have the same birthday is minimum if the birthday distribution is uniform. For a non-uniform distribution, the probability is higher. This comes with the fact that value of p will be minimum if p_{1} = p_{2} = p_{3}...p_{N} = \frac{1}{N} (Proof here)
3. If we put q = 1.2*\sqrt{N} then, the value of P(q,N) will be \approx50\% in case of uniform distribution

The third observation is a bit non-trivial and works on certain approximations. Let’s work it out.

For large values of q,

    \begin{equation*}\frac{q*(q-1)}{2} \approx \frac{q^{2}}{2}\end{equation*}

For small values of x,

    \begin{equation*}1 - x \approx e^{-x}\end{equation*}

Using these approximations, for uniform distribution,

    \begin{equation*}P(q,N) = 1 - e^{\frac{q^{2}}{2*N}} \\\end{equation*}

Putting q = 1.2*\sqrt{N},

    \begin{equation*}\begin{align*}P(q,N) &\approx 1 - e^{-0.72} \\P(q,N) &\approx 0.51325\end{align*}\end{equation*}

This is what we saw in the above case as well i.e. for N=365, q = 1.2 * \sqrt{365} \approx 22.926 \approx 23, we get the probability very close to 50\%. This has some implications in the real world and we will see one next.

Breaking the hash function

In cryptography, the birthday paradox is used to analyze the time complexity of the generic attack to find the collisions for a hash function. The final aim is to find two inputs which will have the same output for the given hash function. The generic attack is to keep evaluating the hash function on different inputs until we find a collision. What will be the time complexity of this attack?
Assuming that the hash function outputs b bits, the total number of possible outputs for the function = 2^{b}
By the pigeonhole principle, by evaluating the function at 2^{b}+1 different inputs, we will find a collision for the hash function with probability 1. According to this analysis, the time complexity to find a collision will be O(2^{b}).

Now, If we compare this problem with the birthday paradox we discussed above, we can compare the inputs to the hash function with the people and compare the outputs of the hash function with the birthdays. As per our observation above, if we evaluate the function at \approx 1.2 * \sqrt{2^{b}} values, we will be able to find a collision with the probability \frac{1}{2}.
So, if we evaluate the function at \approx 1.2 * \sqrt{2^{b}} different inputs in a single iteration to find the collision with probability \frac{1}{2}, the expected number of iterations required to find a collision will be \frac{1}{\frac{1}{2}} = 2.
And so the time complexity to find a collision for a given hash function will be O(\sqrt(2^{b}) = O(2^{\frac{b}{2}}) which shows the significant reduction in the number of function evaluations we will need to make to find a collision for a given hash function.

Thank you for reading 🙂

References

1. Jonathan Katz and Yehuda Lindell. 2007. Introduction to Modern Cryptography (Chapman & Hall/Crc Cryptography and Network Security Series). Chapman & Hall/CRC.
2. How to prove the sum of squares is minimum?
3. Birthday Problem on Wikipedia

About the author

Akshay Jain

I am passionate about programming and feel amazing when I see people using the software I have contributed in. I believe it is essential to write high quality code, which is easy to understand and test. I am still a work in progress and decided to document and share some of my learnings with everyone. Apart from that, I like to read books, and so you might find a lot of book reviews on my blog. I am working as a Software Engineer at Oracle since post-graduation from IISc Bangalore, and happily married :)

Add Comment

Akshay Jain

I am passionate about programming and feel amazing when I see people using the software I have contributed in. I believe it is essential to write high quality code, which is easy to understand and test. I am still a work in progress and decided to document and share some of my learnings with everyone. Apart from that, I like to read books, and so you might find a lot of book reviews on my blog. I am working as a Software Engineer at Oracle since post-graduation from IISc Bangalore, and happily married :)

Get in touch