@pilipolio
Created May 4, 2013 22:32
# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>
# <headingcell level=1>
# "Chance to beat" or C2B empirical estimation
# <markdowncell>
# We define two independent random variables $A$ and $B$ with probability densities $f_A :\mathbb{R} \mapsto [0;1]$ and $f_B: \mathbb{R} \mapsto [0;1]$, respectively. The "chance to beat" probability, or $C2B$, is the probability that a sample from $A$ is greater than a sample from $B$, i.e.
#
#
# <center>$C2B(A,B) = \mathcal{P}[A > B]$.</center>
# <headingcell level=2>
# First example with known Gaussian variables
# <markdowncell>
# Let's draw the two pdfs from an example where $A\sim\mathcal{N}(\mu_A,\sigma_A)$ and $B\sim\mathcal{N}(\mu_B,\sigma_B)$ with $\mu_A = 3$, $\mu_B = 2$, $\sigma_A = 0.4$ and $\sigma_B = 0.6$.
# <codecell>
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm

mu_a, mu_b = 3, 2
sigma_a, sigma_b = .4, .6
xs = np.linspace(0, 6, 100)
pdf_a_xs = norm.pdf(xs, loc=mu_a, scale=sigma_a)
pdf_b_xs = norm.pdf(xs, loc=mu_b, scale=sigma_b)
plt.figure(figsize=(12, 6))
plt.plot(xs, pdf_a_xs, xs, pdf_b_xs, lw=3, alpha=.75)
plt.legend(['$f_A$', '$f_B$']);
# <markdowncell>
# For a fixed value $a$ drawn from $A$ with density $f_A(a)$, the chance that it beats a sample from $B$ is $P[B \le a] = F_B(a)$, where $F_B$ is the cumulative distribution function of $B$. Let's illustrate this with an arbitrary value of $a$.
# <codecell>
# Initialize the figure and main axes
f, ax = plt.subplots(1, 1)
f.set_size_inches(12, 6)
ax.plot(xs, pdf_a_xs, xs, pdf_b_xs, lw=3, alpha=.75)
ax.legend(['$f_A$', '$f_B$'], prop={'size': 20});
# Evaluate the pdfs of A and B at a
a = 2.75
f_A_a = norm.pdf(a, loc=mu_a, scale=sigma_a)
f_B_a = norm.pdf(a, loc=mu_b, scale=sigma_b)
ax.vlines(a, 0, f_A_a, color='blue', linestyles='dashed');
ax.hlines(f_A_a, 0, a, color='blue', linestyles='dashed');
ax.fill_between(xs[xs<=a], y1=0, y2=pdf_b_xs[xs<=a], color='green', alpha=.25);
ax.text(a/2, f_A_a, s='$f_A(a)$', color='blue', size=20, ha='center',va='bottom');
ax.text(a, 0, s='$a$', color='blue', size=20, ha='center',va='top');
ax.text(mu_b, norm.pdf(mu_b, loc = mu_b, scale=sigma_b)/2, s='$F_B(a)$',color='green', size=20, ha='center',va='bottom');
# <markdowncell>
# The probability $P[A > B]$ is the integral of $F_B(a)$ over every possible value of $a \in \mathbb{R}$, weighted by the density $f_A(a)$ :
#
# <center> $$C2B(A,B) = P[A > B] = \int\limits_{a \in \mathbb{R}} F_B(a) f_A(a) da$$</center>
#
# Empirically (and for the purpose of plotting), we have defined a grid $ (x_i)_{i=1, \cdots,n} $ of $ n = 100 $ equidistant points between $ x_1 = 0 $ and $ x_n = 6 $. The value $ F_B(x_i) $ of the cumulative distribution of $B$ at a point $x_i$ can be numerically approximated by the cumulative sum of $ f_B(x_j) $ for $j$ up to $i$ :
#
# <center> $F_B(x_i) \approx \sum \limits_{j=1}^{i} f_B(x_j) \frac{x_n - x_1}{n} $ </center>
# <codecell>
approx_FB_xs = np.cumsum(pdf_b_xs * (xs[-1] - xs[0]) / len(xs))
plt.figure(figsize=(12, 6))
plt.plot(xs, norm.cdf(xs, loc=mu_b, scale=sigma_b), 'k-')
plt.bar(xs, approx_FB_xs, width=(xs[-1] - xs[0]) / len(xs), color='black', alpha=.1)
plt.legend(['$F_B(a)$', r'$\approx F_B(a)$'], loc='upper left', prop={'size': 20});
# <markdowncell>
# And then apply the same scheme to approximate $C2B(A,B)$ :
# <center> $$ C2B(A,B) \approx \sum \limits_{i=1}^{n} F_B(x_i) f_A(x_i) \frac{x_n - x_1}{n} \approx \sum \limits_{i=1}^{n} \Big( \sum \limits_{j=1}^{i} f_B(x_j) \frac{x_n - x_1}{n} \Big) f_A(x_i) \frac{x_n - x_1}{n} $$</center>
# <codecell>
C2B = np.sum(approx_FB_xs * pdf_a_xs * (xs[-1] - xs[0]) / len(xs))
print('The C2B is {:.3f}'.format(C2B))
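# <markdowncell>
# As a sanity check (an added sketch, not part of the original derivation): for independent Gaussians, $A - B \sim \mathcal{N}(\mu_A - \mu_B, \sigma_A^2 + \sigma_B^2)$, so $C2B$ has the closed form $\Phi\big((\mu_A - \mu_B)/\sqrt{\sigma_A^2 + \sigma_B^2}\big)$, which a quick Monte Carlo estimate should also reproduce.
# <codecell>

```python
import numpy as np
from scipy.stats import norm

mu_a, mu_b = 3, 2
sigma_a, sigma_b = .4, .6

# Closed form: A - B is Gaussian with mean mu_a - mu_b and variance sigma_a**2 + sigma_b**2
closed_form = norm.cdf((mu_a - mu_b) / np.sqrt(sigma_a ** 2 + sigma_b ** 2))

# Monte Carlo estimate, with a fixed seed for reproducibility
rng = np.random.RandomState(0)
n = 200000
mc_estimate = np.mean(rng.normal(mu_a, sigma_a, n) > rng.normal(mu_b, sigma_b, n))

print('closed form: {:.4f}, Monte Carlo: {:.4f}'.format(closed_form, mc_estimate))
```

Both numbers should agree with the grid-based approximation above to about three decimal places.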
# <markdowncell>
# Let's define a generic function which calculates the $C2B(A,B)$ given :
#
# * An array of probability measures of $A$ (pointwise or integrated over a bin, anyway taking into account the $dx$ part) on a set of points or bins spanning the whole domain of $A$.
# * An array of the probability measures of $B$ on the same set, also taking into account the $dx$ part.
# <codecell>
def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B), \
        "prob_measures_A and prob_measures_B must have the same length"
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

print(calculate_C2B((xs[-1] - xs[0]) / len(xs) * pdf_a_xs,
                    (xs[-1] - xs[0]) / len(xs) * pdf_b_xs))
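# <markdowncell>
# To see that `calculate_C2B` is not tied to the Gaussian case, here is an added check against a pair of exponential variables, where the closed form is $P[A>B] = \lambda_B / (\lambda_A + \lambda_B)$.
# <codecell>

```python
import numpy as np
from scipy.stats import expon

def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B)
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

# A ~ Exp(lambda_a), B ~ Exp(lambda_b): P[A > B] = lambda_b / (lambda_a + lambda_b)
lambda_a, lambda_b = 1.0, 2.0
grid = np.linspace(0, 20, 10000)
dx = grid[1] - grid[0]

# Pointwise densities times dx give the probability measures on the grid
c2b = calculate_C2B(expon.pdf(grid, scale=1 / lambda_a) * dx,
                    expon.pdf(grid, scale=1 / lambda_b) * dx)
print('numerical: {:.4f}, closed form: {:.4f}'.format(c2b, lambda_b / (lambda_a + lambda_b)))
```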
# <headingcell level=2>
# Samples from two unknown discrete distributions
# <markdowncell>
# In a more realistic context, we observe finite samples of sizes $n_a$ and $n_b$ from unknown variables $A$ and $B$.
# <codecell>
n_a = 20
n_b = 15
a_samples = np.random.binomial(n=20, p=.7, size=n_a)
b_samples = np.random.binomial(n=30, p=.4, size=n_b)
plt.figure(figsize=(12, 6))
plt.plot(a_samples, 'ob', b_samples, 'og', alpha=0.9)
plt.ylim([0, np.max(np.concatenate((a_samples, b_samples))) + 5])
plt.legend([r'$(A_i)_{i=1,\ldots,n_A}$', r'$(B_i)_{i=1,\ldots,n_B}$'], prop={'size':20});
# <markdowncell>
# The $C2B(A,B)$, or $P[A > B]$, can also be computed in a combinatorial way :
#
# "For every sample $a$ drawn from $A$, count the samples from $B$ that are smaller than or equal to $a$."
# "The probability $P[A > B]$ is the number of pairs where $A$ actually beat $B$, divided by the total number of pairs $n_a n_b$."
# <codecell>
C2B = np.sum([np.sum(b_samples<=a) for a in a_samples], dtype='float') / (n_a * n_b)
print('With the combinatorial method, we found that the C2B is {:.3f}'.format(C2B))
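# <markdowncell>
# The one-liner above vectorizes the pairwise enumeration; an explicit double loop over all $(a, b)$ pairs (an added sketch, with a fixed seed since the original samples are random) gives the same number. Up to the handling of ties, this count is the Mann-Whitney $U$ statistic scaled by $1 / (n_a n_b)$.
# <codecell>

```python
import numpy as np

rng = np.random.RandomState(42)
n_a, n_b = 20, 15
a_samples = rng.binomial(n=20, p=.7, size=n_a)
b_samples = rng.binomial(n=30, p=.4, size=n_b)

# Vectorized count, as above
c2b_vectorized = np.sum([np.sum(b_samples <= a) for a in a_samples], dtype='float') / (n_a * n_b)

# Explicit enumeration of all (a, b) pairs
wins = sum(1 for a in a_samples for b in b_samples if b <= a)
c2b_pairs = wins / float(n_a * n_b)

print(c2b_vectorized, c2b_pairs)
```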
# <markdowncell>
# But we can apply the same reasoning as before, writing the probability measures of $A$ and $B$ as sums of Dirac masses of weight $1 / n_a$ and $1 / n_b$ respectively.
#
# <center> $$ P[A>B] = \int\limits_{x \in \mathbb{R}} F_B(x) f_A(x) dx $$ </center>
#
# <center> $$ P[A>B] = \int\limits_{x \in \mathbb{R}} F_B(x) \frac{1}{n_a} \sum \limits_{i=1}^{n_a} \delta_{a_i}(x) dx = \frac{1}{n_a} \sum \limits_{i=1}^{n_a} F_B(a_i) $$ </center>
#
# <center> $$ P[A>B] = \frac{1}{n_a n_b} \sum \limits_{i=1}^{n_a} \sum \limits_{j=1}^{n_b} \mathbb{1}(b_j \le a_i) $$ </center>
#
# This can be visualized in a similar fashion to the Gaussian example above :
# <codecell>
grid = np.linspace(0, np.max(np.concatenate((a_samples, b_samples))), 1000)
# Normalize the counts by n_b so the green curve really is the empirical cdf F_B
FB_on_A_samples = np.array([np.sum(b_samples <= a) for a in a_samples], dtype='float') / n_b
FB_on_grid = np.array([np.sum(b_samples <= x) for x in grid], dtype='float') / n_b
plt.plot(grid, FB_on_grid, '-g')
plt.plot(a_samples, FB_on_A_samples, 'ob', alpha=.5)
plt.legend(['$F_B$', r'$(A_i)_{i=1,\cdots,n_A}$'], loc='upper left', prop={'size': 20});
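# <markdowncell>
# Finally, the Dirac-mass formulation means the generic `calculate_C2B` from above reproduces the combinatorial result exactly when fed the empirical weights. This added check exploits the fact that the binomial samples are integer-valued, so the masses can live on an integer grid (with a fixed seed for reproducibility).
# <codecell>

```python
import numpy as np

def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B)
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

rng = np.random.RandomState(7)
n_a, n_b = 20, 15
a_samples = rng.binomial(n=20, p=.7, size=n_a)
b_samples = rng.binomial(n=30, p=.4, size=n_b)

# Empirical measures: mass 1/n on each observed integer value in [0, 30]
values = np.arange(0, 31)
measures_a = np.array([np.sum(a_samples == v) for v in values]) / float(n_a)
measures_b = np.array([np.sum(b_samples == v) for v in values]) / float(n_b)

c2b_combinatorial = np.sum([np.sum(b_samples <= a) for a in a_samples], dtype='float') / (n_a * n_b)
c2b_measures = calculate_C2B(measures_a, measures_b)
print(c2b_measures, c2b_combinatorial)
```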