Skip to content

Instantly share code, notes, and snippets.

@ezdiy
Created December 7, 2017 03:51
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ezdiy/d22bea12a45ad71b1f0138dc5e1f52b8 to your computer and use it in GitHub Desktop.
Save ezdiy/d22bea12a45ad71b1f0138dc5e1f52b8 to your computer and use it in GitHub Desktop.

Estimate population size from number of unique collisions in a generalized birthday problem

We have 100 new people entering the room each year. We need to figure out number of unique day+month dates (across all years) during which any birthday (shared or not) has occured.

import random
D = 365
N = 100
def count(x):
    days = [0] * D
    for i in range(0,N*x):
        days[random.randint(0,D-1)] += 1
    return D - days.count(0)

for y in range(1,10):
    n = count(y)
    print("y=%d years, n=%d (%.2f%%) birthdays occured" %( y, n, n*100/D))

Output of which is:

y=1 years, n=86 (23.56%) birthdays occured
y=2 years, n=155 (42.47%) birthdays occured
y=3 years, n=208 (56.99%) birthdays occured
y=4 years, n=248 (67.95%) birthdays occured
y=5 years, n=271 (74.25%) birthdays occured
y=6 years, n=302 (82.74%) birthdays occured
y=7 years, n=314 (86.03%) birthdays occured
y=8 years, n=321 (87.95%) birthdays occured
y=9 years, n=338 (92.60%) birthdays occured

Is there a closed formula for n=count(y) function, given n, and y, D and N? Further, is there an inverse of it - given a particular n, return number of years - y=count_invert(n).

A formula exists for the case of counting all collisions occuring, not just the unique ones - http://matt.might.net/articles/counting-hash-collisions/

The formulas for unique collisions are of special importance to determine optimal replica/RS recovery block counts in data archival systems with stochastic replica dispersal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment