Created
March 15, 2017 17:43
-
-
Save dmackinnon1/03cfa8b325509754f46dd02a8fd7f921 to your computer and use it in GitHub Desktop.
Exploring the Birthday Problem in R
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Birthday" | |
author: "Dan MacKinnon" | |
date: "March 15, 2017" | |
output: html_document | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE) | |
``` | |
##Introduction | |
The Birthday Problem, as frequently happens in questions of probability, is easy to state and understand, yet leads to an (initially) suprising solution. | |
> In a group of *n* people, how likely is it to find two or more people sharing the same birthday? | |
Many people are surprised to find out that in a group of 23 people, there is just over a fifty percent chance of finding a birthday match, and in a group of 60 people, you are almost certain to find two or more people sharing a birthday. | |
## Direct Calculation | |
If $B_n$ is the event that we find two or more people in a group of $n$ that share a birthday, then $\neg B_n$, is the probability that there are no birthday matches in a group of $n$ people, and the probability of this event is a bit easier to calculate. | |
In a group of one, $n = 1$, there is no chance to find two or more people that share a birthday. In a group of two, the probabability of not sharing a birthday is $p(\neg B_2) = \frac{364}{365}$. For a group of three, we have $p(\neg B_2) = \frac{364}{365} \times \frac{363}{365}$, and in general: | |
$$p(\neg B_n) = \prod_{i=1}^n \frac{366-i}{365}$$ | |
We can set up a data frame **BCalc** in R to calculate this directly. | |
```{r} | |
group_size <- 60 | |
group <- 1:group_size | |
BCalc <- data.frame(group) | |
BCalc$noMatch[1] = 1; | |
for (i in 2:group_size) { | |
BCalc$noMatch[i] <- BCalc$noMatch[i-1]*(366-i)/365 | |
} | |
BCalc$match <- 1 - BCalc$noMatch | |
``` | |
Plotting this relationship between the probability of a birthday match and the size of the group involved gives a characteristic sigmoid or logistic curve. | |
```{r} | |
plot(BCalc$group, BCalc$match,main='probability of a birthday match', xlab = 'group size', ylab='probability of a match') | |
lines(BCalc$group, BCalc$match) | |
``` | |
This gives the somewhat surprising result that, in a group of 23 people there is over a fifity percent chance of finding two people with the same birthday. | |
```{r} | |
BCalc$match[23] | |
``` | |
By the time you have sixty people together, it is almost a sure thing to find a match. | |
```{r} | |
BCalc$match[60] | |
``` | |
## Setting up the Simulation | |
A simulation can be set up by randomly generating groups of numbers between 1 and 365, and checking to see if there are matches. For each group size between 1 and 60, we'll generate 100 groups and check for collisions, the results will be cases in the **BSim** data frame. | |
```{r} | |
group_size <- 60 | |
trial_limit <-100 | |
trial_index <- 1:(group_size * trial_limit) | |
BSim <- data.frame(trial_index) | |
current_index <- 1; | |
for (i in 1:group_size) { | |
for (k in 1:trial_limit) { | |
t <- sample(1:365, i, replace=TRUE) | |
BSim$groupSize[current_index] <- i | |
BSim$foundMatch[current_index] <- as.logical(anyDuplicated(t)) | |
current_index <- current_index +1 | |
} | |
} | |
``` | |
## Exploring the Simulation | |
For each group size, we can find the proportion of groups that contained a match, and plot the results in comparison to the expected curve from the **BCalc** data frame. | |
```{r} | |
tbl <- table(BSim$groupSize,BSim$foundMatch) | |
mtrx <- as.matrix(prop.table(tbl,1)[,2]) | |
plot(1:group_size, mtrx[,1],main='proportion of found birthday matches', xlab = 'group size', ylab='proportion with a match') | |
lines(BCalc$group, BCalc$match) | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment