@clarkgrubb
Created September 25, 2014 01:49
{
"metadata": {
"name": "",
"signature": "sha256:8e904721f90a0b0c0c0dd648eb9f9f2473c22dc731e07b4f14c14c1cf2c3df9c"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Click Mathematics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This IPython notebook accompanies [http://clarkgrubb.com/click-math](http://clarkgrubb.com/click-math), which shows how to perform some calculations involving the two fundamental quantities of web metrics: the impression and the click.\n",
"\n",
"In this notebook we show how to perform those same calculations in Python."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to IPython, we are using the Python libraries NumPy and SciPy. An easy way to get all three of these products is to install the [Anaconda Scientific Python Distribution](http://continuum.io/downloads)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys, os, re, math"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Python 2.7, the / operator performs floor division and returns an integer when both operands are integers, so 1 / 2 evaluates to 0.\n",
"\n",
"We can make / perform true division, as it does in Python 3:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from __future__ import division"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
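{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check, added here as an illustration rather than taken from the original article: once the import above has run, / should perform true division, while // remains available for floor division."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(7 / 2)   # 3.5: true division after the __future__ import\n",
"print(7 // 2)  # 3: floor division is unchanged"
],
"language": "python",
"metadata": {},
"outputs": []
},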
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following module aliases are commonly used in the Scientific Python community:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import scipy as sp\n",
"import scipy.stats as stats\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"NumPy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumPy provides an array type that we will use instead of the built-in Python list.\n",
"\n",
"One of the advantages of the NumPy array is that the basic arithmetic operations are vectorized. This will help us to avoid writing loops.\n",
"\n",
"Note the difference in meaning of the + operator:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.array([1, 2, 3]) + np.array([3, 4, 5])\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"array([4, 6, 8])"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[1,2,3] + [3,4,5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"[1, 2, 3, 3, 4, 5]"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When an arithmetic operator has a NumPy array and a scalar as arguments, the scalar is \"broadcast\" over the entire array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.array([1, 2, 3]) * 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"array([2, 4, 6])"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The native Python list does not broadcast. Multiplying a list by an integer repeats the list instead:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[1, 2, 3] * 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"[1, 2, 3, 1, 2, 3]"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of Python's list comprehension syntax, which we will use:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"squares = [n * n for n in range(0, 11)]\n",
"squares"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]"
]
}
],
"prompt_number": 18
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"CTR"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is our raw click data.\n",
"\n",
"Each element of the clicks array corresponds to the element at the same position in the impressions array. For example, the first link was displayed 313 times and received 23 clicks."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"imps = np.array([313, 285, 298, 34, 3398, 333, 301])\n",
"clicks = np.array([23, 20, 8, 2, 128, 15, 11])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"non_clicks = imps - clicks\n",
"non_clicks"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"array([ 290, 265, 290, 32, 3270, 318, 290])"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ctr = clicks/imps\n",
"ctr"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"array([ 0.07348243, 0.07017544, 0.02684564, 0.05882353, 0.03766922,\n",
" 0.04504505, 0.03654485])"
]
}
],
"prompt_number": 11
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"P-Values"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"expected_ctr = 0.04\n",
"expected_clicks = expected_ctr * imps\n",
"expected_non_clicks = imps - expected_clicks\n",
"expected_clicks\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 18,
"text": [
"array([ 12.52, 11.4 , 11.92, 1.36, 135.92, 13.32, 12.04])"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"p_values = np.array([stats.chisquare([clicks[i], non_clicks[i]], [expected_clicks[i], expected_non_clicks[i]])[1]\n",
" for i\n",
" in range(len(imps))])\n",
"p_values"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 25,
"text": [
"array([ 0.00250367, 0.00933262, 0.24653354, 0.57540302, 0.48809458,\n",
" 0.63849131, 0.7596781 ])"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python functions carry docstrings. This is how you print the documentation for a function:\n",
"\n",
"    print(stats.chisquare.__doc__)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"bonferroni_p_values = [min(1.0, p_values[i] * len(p_values))\n",
" for i\n",
" in range(len(p_values))]\n",
"bonferroni_p_values"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"[0.017525659707906406, 0.065328314692726625, 1.0, 1.0, 1.0, 1.0, 1.0]"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: Holm-Bonferroni.\n",
"\n",
"Would it be better to use Pandas for the Holm-Bonferroni example? We need to sort."
]
},
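{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one possible answer to the TODO above, here is a minimal sketch of the Holm-Bonferroni adjustment that uses only NumPy. It assumes the usual step-down definition: sort the p-values, multiply the k-th smallest (counting from zero) by the number of tests minus k, enforce that adjusted values never decrease along the sorted order, and cap them at 1. np.argsort handles the sorting, so Pandas is not strictly required. The names holm_bonferroni and holm_p_values are ours and do not come from the original article."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def holm_bonferroni(p):\n",
"    # sketch of Holm-Bonferroni adjusted p-values (hypothetical helper)\n",
"    p = np.asarray(p, dtype=float)\n",
"    m = len(p)\n",
"    order = np.argsort(p)  # indices that sort p in ascending order\n",
"    # k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k)\n",
"    scaled = p[order] * (m - np.arange(m))\n",
"    # running maximum keeps the adjusted values monotone; cap at 1\n",
"    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))\n",
"    adjusted = np.empty(m)\n",
"    adjusted[order] = adjusted_sorted  # restore the original order\n",
"    return adjusted\n",
"\n",
"holm_p_values = holm_bonferroni(p_values)\n",
"holm_p_values"
],
"language": "python",
"metadata": {},
"outputs": []
},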
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Confidence Intervals"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"alpha = 0.05\n",
"z = stats.norm.ppf(1 - alpha / 2)\n",
"z"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 30,
"text": [
"1.959963984540054"
]
}
],
"prompt_number": 30
},
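{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reading aid, here are the two intervals that the functions in this section compute, written with $i$ impressions, $c$ clicks, and $\\hat{p} = c / i$. This is our transcription of the code in this notebook, not a formula quoted from the article.\n",
"\n",
"Normal approximation:\n",
"\n",
"$$\\hat{p} \\pm z \\sqrt{\\frac{c\\,(i - c)}{i^3}}$$\n",
"\n",
"Wilson score:\n",
"\n",
"$$\\frac{i}{i + z^2} \\left( \\hat{p} + \\frac{z^2}{2i} \\pm z \\sqrt{\\frac{c\\,(i - c)}{i^3} + \\frac{z^2}{4 i^2}} \\right)$$\n",
"\n",
"Since $c\\,(i - c) / i^3 = \\hat{p}\\,(1 - \\hat{p}) / i$, the first expression is the familiar $\\hat{p} \\pm z \\sqrt{\\hat{p}(1 - \\hat{p}) / i}$."
]
},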
{
"cell_type": "code",
"collapsed": false,
"input": [
"def lower_normal_conf(i, c, z):\n",
"    # half-width of the normal-approximation interval\n",
"    h = z * np.sqrt(c * (i - c) / (i ** 3))\n",
"    ctr = c / i\n",
"    return ctr - h\n",
"\n",
"lower_normal_conf(imps, clicks, z)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 31,
"text": [
"array([ 0.044576 , 0.04051902, 0.0084943 , -0.02026613, 0.03126757,\n",
" 0.02276885, 0.01534691])"
]
}
],
"prompt_number": 31
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def upper_normal_conf(i, c, z):\n",
"    # half-width of the normal-approximation interval\n",
"    h = z * np.sqrt(c * (i - c) / (i ** 3))\n",
"    ctr = c / i\n",
"    return ctr + h\n",
"\n",
"upper_normal_conf(imps, clicks, z)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 32,
"text": [
"array([ 0.10238886, 0.09983186, 0.04519697, 0.13791319, 0.04407087,\n",
" 0.06732124, 0.05774279])"
]
}
],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def lower_wilson_score_conf(i, c, z):\n",
"    # half-width term of the Wilson score interval, before rescaling\n",
"    h = z * np.sqrt(c * (i - c) / (i ** 3) + z ** 2 / (4 * i ** 2))\n",
"    return (i / (i + z ** 2)) * (c / i + z ** 2 / (2 * i) - h)\n",
"\n",
"lower_wilson_score_conf(imps, clicks, z)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 34,
"text": [
"array([ 0.04946129, 0.04588384, 0.01366458, 0.01628266, 0.031772 ,\n",
" 0.02748511, 0.02052648])"
]
}
],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def upper_wilson_score_conf(i, c, z):\n",
"    # half-width term of the Wilson score interval, before rescaling\n",
"    h = z * np.sqrt(c * (i - c) / (i ** 3) + z ** 2 / (4 * i ** 2))\n",
"    return (i / (i + z ** 2)) * (c / i + z ** 2 / (2 * i) + h)\n",
"\n",
"upper_wilson_score_conf(imps, clicks, z)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 35,
"text": [
"array([ 0.10784596, 0.10589998, 0.05207013, 0.19093607, 0.04461059,\n",
" 0.07298191, 0.06424368])"
]
}
],
"prompt_number": 35
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Beta Distribution"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"expected_mean = .03\n",
"expected_stddev = .03"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def beta_a(m, sd):\n",
" return (m ** 2 - m ** 3 - m * sd ** 2)/sd ** 2\n",
"\n",
"prior_a = beta_a(expected_mean, expected_stddev)\n",
"prior_a"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"0.94"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def beta_b(m, sd):\n",
" return ((-1 + m) * (-m + m ** 2 + sd ** 2))/sd ** 2\n",
"\n",
"prior_b = beta_b(expected_mean, expected_stddev)\n",
"prior_b"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"30.39333333333333"
]
}
],
"prompt_number": 7
},
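{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short note on where the two helper functions above come from, offered as our reading of the code rather than as the article's own derivation. A Beta distribution with parameters $a$ and $b$ has mean $a / (a + b)$ and variance $ab / ((a + b)^2 (a + b + 1))$. Solving these for $a$ and $b$, given a mean $m$ and standard deviation $s$, gives the standard method-of-moments result:\n",
"\n",
"$$a = m \\left( \\frac{m\\,(1 - m)}{s^2} - 1 \\right), \\qquad b = (1 - m) \\left( \\frac{m\\,(1 - m)}{s^2} - 1 \\right)$$\n",
"\n",
"Expanding the products yields the expressions used in beta_a and beta_b. With $m = s = 0.03$ we get $m(1 - m)/s^2 - 1 = 31.333\\ldots$, so $a = 0.94$ and $b \\approx 30.39$, which agrees with the outputs above. The result is only a valid pair of parameters when $s^2 < m\\,(1 - m)$."
]
},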
{
"cell_type": "code",
"collapsed": false,
"input": [
"imps2 = np.array([1, 100, 100, 100, 100, 10, 10])\n",
"clicks2 = np.array([1, 10, 4, 5, 3, 0, 10])\n",
"non_clicks2 = imps2 - clicks2\n",
"non_clicks2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"array([ 0, 90, 96, 95, 97, 10, 0])"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"posterior_a = prior_a + clicks2\n",
"posterior_a"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"array([ 1.94, 10.94, 4.94, 5.94, 3.94, 0.94, 10.94])"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"posterior_b = prior_b + non_clicks2\n",
"posterior_b"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"array([ 30.39333333, 120.39333333, 126.39333333, 125.39333333,\n",
" 127.39333333, 40.39333333, 30.39333333])"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"posterior_mean = stats.beta.mean(posterior_a, posterior_b)\n",
"posterior_mean"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"array([ 0.06 , 0.08329949, 0.03761421, 0.04522843, 0.03 ,\n",
" 0.02274194, 0.26467742])"
]
}
],
"prompt_number": 15
},
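{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sanity check, added as an illustration: because the Beta prior is conjugate to click and non-click counts, the posterior mean should equal (prior_a + clicks) / (prior_a + prior_b + impressions). The cell below compares that closed form with the value SciPy returned. The name manual_mean is ours."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# closed-form posterior mean of a Beta prior updated with click counts\n",
"manual_mean = (prior_a + clicks2) / (prior_a + prior_b + imps2)\n",
"# should agree with stats.beta.mean up to floating-point error\n",
"np.allclose(manual_mean, posterior_mean)"
],
"language": "python",
"metadata": {},
"outputs": []
},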
{
"cell_type": "code",
"collapsed": false,
"input": [
"posterior_stddev = stats.beta.std(posterior_a, posterior_b)\n",
"posterior_stddev"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"array([ 0.04113393, 0.02402151, 0.01653926, 0.01806429, 0.014829 ,\n",
" 0.02291274, 0.06780413])"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"random_draws = stats.beta.rvs(posterior_a, posterior_b)\n",
"random_draws"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": [
"array([ 0.02125831, 0.07894332, 0.04574358, 0.040027 , 0.00985993,\n",
" 0.01236459, 0.19463317])"
]
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}