Estimate population size from number of unique collisions in a generalized birthday problem
We have 100 new people entering the room each year. We need to figure out number of unique day+month dates (across all years) during which any birthday (shared or not) has occured.
import random
D = 365
N = 100
def count(x):
days = [0] * D
for i in range(0,N*x):
days[random.randint(0,D-1)] += 1
return D - days.count(0)
for y in range(1,10):
n = count(y)
print("y=%d years, n=%d (%.2f%%) birthdays occured" %( y, n, n*100/D))
Output of which is:
y=1 years, n=86 (23.56%) birthdays occured
y=2 years, n=155 (42.47%) birthdays occured
y=3 years, n=208 (56.99%) birthdays occured
y=4 years, n=248 (67.95%) birthdays occured
y=5 years, n=271 (74.25%) birthdays occured
y=6 years, n=302 (82.74%) birthdays occured
y=7 years, n=314 (86.03%) birthdays occured
y=8 years, n=321 (87.95%) birthdays occured
y=9 years, n=338 (92.60%) birthdays occured
Is there a closed formula for n=count(y)
function, given n
, and
y
, D
and N
? Further, is there an inverse of it - given a particular n,
return number of years - y=count_invert(n)
.
A formula exists for the case of counting all collisions occuring, not just the unique ones - http://matt.might.net/articles/counting-hash-collisions/
The formulas for unique collisions are of special importance to determine optimal replica/RS recovery block counts in data archival systems with stochastic replica dispersal.