@pilipolio
Created May 4, 2013 22:32
# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>
# <headingcell level=1>
# "Chance to beat" or C2B empirical estimation
# <markdowncell>
# We define two independent random variables $A$ and $B$ with probability densities $f_A :\mathbb{R} \mapsto [0;1]$ and $f_B: \mathbb{R} \mapsto [0;1]$, respectively. The "chance to beat" probability, or $C2B$, is the probability that a sample from $A$ is greater than a sample from $B$, i.e.
#
#
# <center>$C2B(A,B) = \mathcal{P}[A > B]$.</center>
# <headingcell level=2>
# First example with known Gaussian variables
# <markdowncell>
# Let's draw the two pdfs from an example where $A\sim\mathcal{N}(\mu_A,\sigma_A)$ and $B\sim\mathcal{N}(\mu_B,\sigma_B)$ with $\mu_A = 3$, $\mu_B = 2$, $\sigma_A = 0.4$ and $\sigma_B = 0.6$.
# <codecell>
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm

mu_a, mu_b = 3, 2
sigma_a, sigma_b = .4, .6
xs = np.linspace(0, 6, 100)
pdf_a_xs = norm.pdf(xs, loc=mu_a, scale=sigma_a)
pdf_b_xs = norm.pdf(xs, loc=mu_b, scale=sigma_b)
plt.figure(figsize=(12, 6))
plt.plot(xs, pdf_a_xs, xs, pdf_b_xs, lw=3, alpha=.75)
plt.legend(['$f_A$', '$f_B$']);
# <markdowncell>
# For a fixed value $a$ drawn from $A$ with density $f_A(a)$, the chance that it beats a sample from $B$ is $P[B \le a] = F_B(a)$, where $F_B$ is the cumulative distribution function of $B$. Let's illustrate this with an arbitrary value of $a$.
# <codecell>
# Initialize the figure and main axes
f, ax = plt.subplots(1, 1)
f.set_size_inches(12, 6)
ax.plot(xs, pdf_a_xs, xs, pdf_b_xs, lw=3, alpha=.75)
ax.legend(['$f_A$', '$f_B$'], prop={'size': 20});
# Evaluate the pdfs of A and B at a
a = 2.75
f_A_a = norm.pdf(a, loc=mu_a, scale=sigma_a)
f_B_a = norm.pdf(a, loc=mu_b, scale=sigma_b)
ax.vlines(a, 0, f_A_a, color='blue', linestyles='dashed');
ax.hlines(f_A_a, 0, a, color='blue', linestyles='dashed');
ax.fill_between(xs[xs<=a], y1=0, y2=pdf_b_xs[xs<=a], color='green', alpha=.25);
ax.text(a/2, f_A_a, s='$f_A(a)$', color='blue', size=20, ha='center',va='bottom');
ax.text(a, 0, s='$a$', color='blue', size=20, ha='center',va='top');
ax.text(mu_b, norm.pdf(mu_b, loc = mu_b, scale=sigma_b)/2, s='$F_B(a)$',color='green', size=20, ha='center',va='bottom');
# <markdowncell>
# The probability $P[A > B]$ is the integral of $F_B(a)$ over every possible value of $a \in \mathbb{R}$, weighted by the density $f_A(a)$ :
#
# <center> $$C2B(A,B) = P[A > B] = \int\limits_{a \in \mathbb{R}} F_B(a) f_A(a) da$$</center>
#
# Empirically (and for the purpose of plotting), we have defined a grid $ (x_i)_{i=1, \cdots,n} $ of $ n = 100 $ equidistant points between $ x_1 = 0 $ and $ x_n = 6 $. The value $ F_B(x_i) $ of the cumulative distribution of $B$ at a point $x_i$ can be numerically approximated by the cumulative sum of $ f_B(x_j) $ for $j$ up to $i$ :
#
# <center> $F_B(x_i) \approx \sum \limits_{j=1}^{i} f_B(x_j) \frac{x_n - x_1}{n} $ </center>
# <codecell>
approx_FB_xs = np.cumsum(pdf_b_xs * (xs[-1] - xs[0]) / len(xs))
plt.figure(figsize=(12, 6))
plt.plot(xs, norm.cdf(xs, loc=mu_b, scale=sigma_b), 'k-')
plt.bar(xs, approx_FB_xs, width=(xs[-1] - xs[0]) / len(xs), color='black', alpha=.1)
plt.legend(['$F_B(a)$', r'$\approx F_B(a)$'], loc='upper left', prop={'size': 20});
# <markdowncell>
# And then apply the same scheme to approximate $C2B(A,B)$ :
# <center> $$ C2B(A,B) \approx \sum \limits_{i=1}^{n} F_B(x_i) f_A(x_i) \frac{x_n - x_1}{n} \approx \sum \limits_{i=1}^{n} \Big( \sum \limits_{j=1}^{i} f_B(x_j) \frac{x_n - x_1}{n} \Big) f_A(x_i) \frac{x_n - x_1}{n} $$</center>
# <codecell>
C2B = np.sum(approx_FB_xs * pdf_a_xs * (xs[-1] - xs[0]) / len(xs))
print('The C2B is {:.3f}'.format(C2B))
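# <markdowncell>
# As a sanity check (an added sketch, not part of the original derivation): for independent Gaussians, $A - B \sim \mathcal{N}(\mu_A - \mu_B, \sigma_A^2 + \sigma_B^2)$, so $C2B$ has the closed form $\Phi\big((\mu_A - \mu_B)/\sqrt{\sigma_A^2 + \sigma_B^2}\big)$, which a quick Monte Carlo estimate should also reproduce.
# <codecell>

```python
import numpy as np
from scipy.stats import norm

mu_a, mu_b = 3, 2
sigma_a, sigma_b = .4, .6

# Closed form: A - B is Gaussian with mean mu_a - mu_b and variance sigma_a**2 + sigma_b**2
closed_form = norm.cdf((mu_a - mu_b) / np.sqrt(sigma_a ** 2 + sigma_b ** 2))

# Monte Carlo estimate, with a fixed seed for reproducibility
rng = np.random.RandomState(0)
n = 200000
mc_estimate = np.mean(rng.normal(mu_a, sigma_a, n) > rng.normal(mu_b, sigma_b, n))

print('closed form: {:.4f}, Monte Carlo: {:.4f}'.format(closed_form, mc_estimate))
```

Both numbers should agree with the grid-based approximation above to about three decimal places.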
# <markdowncell>
# Let's define a generic function which calculates the $C2B(A,B)$ given :
#
# * An array of probability measures of $A$ (pointwise or integrated over a bin, anyway taking into account the $dx$ part) on a set of points or bins spanning the whole domain of $A$.
# * An array of the probability measures of $B$ on the same set, also taking into account the $dx$ part.
# <codecell>
def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B), \
        "prob_measures_A and prob_measures_B must have the same length"
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

print(calculate_C2B((xs[-1] - xs[0]) / len(xs) * pdf_a_xs,
                    (xs[-1] - xs[0]) / len(xs) * pdf_b_xs))
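# <markdowncell>
# To see that `calculate_C2B` is not tied to the Gaussian case, here is an added check against a pair of exponential variables, where the closed form is $P[A>B] = \lambda_B / (\lambda_A + \lambda_B)$.
# <codecell>

```python
import numpy as np
from scipy.stats import expon

def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B)
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

# A ~ Exp(lambda_a), B ~ Exp(lambda_b): P[A > B] = lambda_b / (lambda_a + lambda_b)
lambda_a, lambda_b = 1.0, 2.0
grid = np.linspace(0, 20, 10000)
dx = grid[1] - grid[0]

# Pointwise densities times dx give the probability measures on the grid
c2b = calculate_C2B(expon.pdf(grid, scale=1 / lambda_a) * dx,
                    expon.pdf(grid, scale=1 / lambda_b) * dx)
print('numerical: {:.4f}, closed form: {:.4f}'.format(c2b, lambda_b / (lambda_a + lambda_b)))
```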
# <headingcell level=2>
# Samples from two unknown discrete distributions
# <markdowncell>
# In a more realistic context, we observe finite samples of sizes $n_a$ and $n_b$ from unknown variables $A$ and $B$.
# <codecell>
n_a = 20
n_b = 15
a_samples = np.random.binomial(n=20, p=.7, size=n_a)
b_samples = np.random.binomial(n=30, p=.4, size=n_b)
plt.figure(figsize=(12, 6))
plt.plot(a_samples, 'ob', b_samples, 'og', alpha=0.9)
plt.ylim([0, np.max(np.concatenate((a_samples, b_samples))) + 5])
plt.legend([r'$(A_i)_{i=1,\ldots,n_A}$', r'$(B_i)_{i=1,\ldots,n_B}$'], prop={'size':20});
# <markdowncell>
# The $C2B(A,B)$, or $P[A > B]$, can also be computed in a combinatorial way :
#
# "For every sample $a$ drawn from $A$, count the samples from $B$ that are smaller than or equal to $a$."
# "The probability $P[A > B]$ is the number of pairs where $A$ actually beat $B$, divided by the total number of pairs $n_a n_b$."
# <codecell>
C2B = np.sum([np.sum(b_samples<=a) for a in a_samples], dtype='float') / (n_a * n_b)
print('With the combinatorial method, we found that the C2B is {:.3f}'.format(C2B))
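# <markdowncell>
# The one-liner above vectorizes the pairwise enumeration; an explicit double loop over all $(a, b)$ pairs (an added sketch, with a fixed seed since the original samples are random) gives the same number. Up to the handling of ties, this count is the Mann-Whitney $U$ statistic scaled by $1 / (n_a n_b)$.
# <codecell>

```python
import numpy as np

rng = np.random.RandomState(42)
n_a, n_b = 20, 15
a_samples = rng.binomial(n=20, p=.7, size=n_a)
b_samples = rng.binomial(n=30, p=.4, size=n_b)

# Vectorized count, as above
c2b_vectorized = np.sum([np.sum(b_samples <= a) for a in a_samples], dtype='float') / (n_a * n_b)

# Explicit enumeration of all (a, b) pairs
wins = sum(1 for a in a_samples for b in b_samples if b <= a)
c2b_pairs = wins / float(n_a * n_b)

print(c2b_vectorized, c2b_pairs)
```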
# <markdowncell>
# But we can apply the same reasoning as before, writing the probability measures of $A$ and $B$ as sums of Dirac masses of weight $1 / n_a$ and $1 / n_b$ respectively.
#
# <center> $$ P[A>B] = \int\limits_{x \in \mathbb{R}} F_B(x) f_A(x) dx $$ </center>
#
# <center> $$ P[A>B] = \int\limits_{x \in \mathbb{R}} F_B(x) \frac{1}{n_a} \sum \limits_{i=1}^{n_a} \delta_{a_i}(x) dx = \frac{1}{n_a} \sum \limits_{i=1}^{n_a} F_B(a_i) $$ </center>
#
# <center> $$ P[A>B] = \frac{1}{n_a n_b} \sum \limits_{i=1}^{n_a} \sum \limits_{j=1}^{n_b} \mathbb{1}(b_j \le a_i) $$ </center>
#
# This can be visualized in a similar fashion to the Gaussian example above :
# <codecell>
grid = np.linspace(0, np.max(np.concatenate((a_samples, b_samples))), 1000)
# Normalize the counts by n_b so the green curve really is the empirical cdf F_B
FB_on_A_samples = np.array([np.sum(b_samples <= a) for a in a_samples], dtype='float') / n_b
FB_on_grid = np.array([np.sum(b_samples <= x) for x in grid], dtype='float') / n_b
plt.plot(grid, FB_on_grid, '-g')
plt.plot(a_samples, FB_on_A_samples, 'ob', alpha=.5)
plt.legend(['$F_B$', r'$(A_i)_{i=1,\cdots,n_A}$'], loc='upper left', prop={'size': 20});
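# <markdowncell>
# Finally, the Dirac-mass formulation means the generic `calculate_C2B` from above reproduces the combinatorial result exactly when fed the empirical weights. This added check exploits the fact that the binomial samples are integer-valued, so the masses can live on an integer grid (with a fixed seed for reproducibility).
# <codecell>

```python
import numpy as np

def calculate_C2B(prob_measures_A, prob_measures_B):
    assert len(prob_measures_A) == len(prob_measures_B)
    return np.sum(np.cumsum(prob_measures_B) * prob_measures_A)

rng = np.random.RandomState(7)
n_a, n_b = 20, 15
a_samples = rng.binomial(n=20, p=.7, size=n_a)
b_samples = rng.binomial(n=30, p=.4, size=n_b)

# Empirical measures: mass 1/n on each observed integer value in [0, 30]
values = np.arange(0, 31)
measures_a = np.array([np.sum(a_samples == v) for v in values]) / float(n_a)
measures_b = np.array([np.sum(b_samples == v) for v in values]) / float(n_b)

c2b_combinatorial = np.sum([np.sum(b_samples <= a) for a in a_samples], dtype='float') / (n_a * n_b)
c2b_measures = calculate_C2B(measures_a, measures_b)
print(c2b_measures, c2b_combinatorial)
```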