Skip to content

Instantly share code, notes, and snippets.

@mcburton
Last active December 18, 2015 10:39
Show Gist options
  • Save mcburton/5769709 to your computer and use it in GitHub Desktop.
Save mcburton/5769709 to your computer and use it in GitHub Desktop.
This is a translation of Kirby Sheddon's tutorial Elementary statistical calculations and simulations (http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/stats_calculations.html) from the ARC/CSCAR workshop from June 10-14th 2013 at the University of Michigan. The purpose of the workshop was to introduce data management and and analysis wi…
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "stats_calculations"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Elementary statistical calculations and simulations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Contents:**\n",
"\n",
"- Goals\n",
"- Comparing the means of two populations\n",
"- Odds ratios\n",
"- Correlation coefficients\n",
"- Analysis of a 2-way contingency table\n",
"- Files"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Goals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are several Python packages that provide high-quality routines for statistical analysis. However sometimes it is useful to be able to do common statistical calculations directly, without relying on libraries. Here we will illustrate how this can be done using a few commonly-encountered statistical calculations. We also illustrate how simulations can be used to assess the properties of statistical procedures."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following import statements are used in the code examples below:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"from scipy.stats.distributions import norm"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Comparing the means of two populations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most common elementary statistical analyses is the comparison of two populations in terms of their means, based on data sampled from the two populations. This is often described as a \"t-test\", which is not a very descriptive term.\n",
"\n",
"If you have vectors `X` and `Y` sampled independently from two populations, you can calculate the mean difference as"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" md = X.mean() - Y.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, using function calls you can use"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" md = np.mean(X) - np.mean(Y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the second form works for both NumPy arrays and Python lists, but the first form only works for NumPy arrays.\n",
"\n",
"The standard error of the mean difference can be estimated as"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" se = np.sqrt(np.var(X)/len(X) + np.var(Y)/len(Y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This can be used to form a 95% confidence interval for the true mean difference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" lcl,ucl = md-2*se,md+2*se"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can obtain a Z-score for testing the hypothesis that the true difference is zero"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" z = md/se"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then you can get a 2-sided p-value for the test using"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" pvalue = -2*norm.cdf(-np.abs(z))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This p-value is the Z-test p-value, which differs slightly from the more common t-test p-value. If the sample size is not very small, the two approaches give almost the same result.\n",
"\n",
"We can write a small simulation study to assess the coverage probability of this confidence interval."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Simulation study to assess coverage probability of\n",
"## the confidence interval for the comparison of two\n",
"## population means.\n",
"\n",
"## Number of simulated data sets\n",
"nrep = 1000\n",
"\n",
"## Sample sizes for the two observed samples\n",
"nx,ny = 10,20\n",
"\n",
"## Standard deviation for the sample from the second population\n",
"sigy = 2\n",
"\n",
"## Estimate the coverage probability using simulation\n",
"CP = 0\n",
"for it in range(nrep):\n",
"\n",
" ## Generate a data set that satisfies the null hypothesis\n",
" X = np.random.normal(size=nx)\n",
" Y = sigy*np.random.normal(size=ny)\n",
"\n",
" ## The Z-score\n",
" md = X.mean() - Y.mean()\n",
" se = np.sqrt(X.var()/nx + Y.var()/ny)\n",
"\n",
" ## Check whether the interval contains the true value\n",
" if (md - 2*se < 0) and (md + 2*se > 0):\n",
" CP += 1\n",
"\n",
"CP /= float(nrep)\n",
"\n",
"print CP"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.929\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vectorization can be used to do the same simulation analysis using much less code:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Vectorized version of the simulation study to assess the coverage\n",
"## probability of the confidence interval for the comparison of two\n",
"## population means.\n",
"\n",
"## Generate nrep data sets simultaneously\n",
"X = np.random.normal(size=(nrep,nx))\n",
"Y = sigy*np.random.normal(size=(nrep,ny))\n",
"\n",
"## Calculate Z-scores for all the data sets simultaneously\n",
"MD = X.mean(1) - Y.mean(1)\n",
"SE = np.sqrt(X.var(1)/nx + Y.var(1)/ny)\n",
"\n",
"## The coverage probability\n",
"CP = np.mean( (MD - 2*SE < 0) & (MD + 2*SE > 0) )\n",
"\n",
"print CP"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.924\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also consider non-Gaussian populations:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Modify the previous simulation study so that one of the \n",
"## populations is exponential.\n",
"\n",
"## Generate nrep data sets\n",
"X = -np.log(np.random.uniform(size=(nrep,nx)))\n",
"Y = 1 + sigy*np.random.normal(size=(nrep,ny))\n",
"\n",
"## Calculate all the Z-scores\n",
"MD = X.mean(1) - Y.mean(1)\n",
"SE = np.sqrt(X.var(1)/nx + Y.var(1)/ny)\n",
"\n",
"## The coverage probability\n",
"CP = np.mean( (MD - 2*SE < 0) & (MD + 2*SE > 0) )\n",
"\n",
"print CP"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.928\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Odds ratios"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The odds ratio is a measure of dependence between two binary values. Suppose X and Y are two binary data values, jointly observed on each observed unit. For example, X could be a person's gender (coded 0/1), and Y could be the person's political affiliation (also coded 0/1).\n",
"\n",
"If nij is the number of units observed with X=i and Y=j (for i,j=0,1), then the odds ratio is\n",
"\n",
"(n<sub>00</sub>n<sub>11</sub>) / (n<sub>10</sub>n<sub>01</sub>).\n",
"\n",
"Usually we prefer to look at the log odds ratio\n",
"\n",
"log(n<sub>00</sub>) + log(n<sub>11</sub>) - log(n<sub>10</sub>) - log(n<sub>01</sub>).\n",
"\n",
"Note that this is the natural log, which is also the log that is given by the NumPy function np.log. It is easy to compute the log odds ratio in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" N = np.array([[35,23], [20,29]])\n",
" LOR = np.log(N[0,0]) + np.log(N[1,1]) - np.log(np.log(N[0,1])) - np.log(N[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The standard error of the log odds ratio is\n",
"\n",
"sqrt(1/n<sub>00</sub> + 1/n<sub>01</sub> + 1/n<sub>10</sub> + 1/n<sub>11</sub>),\n",
"\n",
"which can be computed in Python as"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" SE = np.sqrt(np.sum(1/N.astype(np.float64)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a simulation study assessing the performance of the 95% confidence interval based on this standard error. We also look at the bias of the estimated log odds ratio."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Simulation study for the log odds ratio\n",
"\n",
"from scipy.stats import rv_discrete\n",
"\n",
"## Cell probabilities\n",
"P = np.array([[0.3,0.2],[0.1,0.4]])\n",
"\n",
"## The population log odds ratio\n",
"PLOR = np.log(P[0,0]) + np.log(P[1,1]) - np.log(P[0,1]) - np.log(P[1,0])\n",
"\n",
"## Sample size\n",
"n = 100\n",
"\n",
"## ravel vectorizes by row\n",
"m = rv_discrete(values=((0,1,2,3), P.ravel()))\n",
"\n",
"## Generate the data\n",
"D = m.rvs(size=(nrep,n))\n",
"\n",
"## Convert to cell counts\n",
"Q = np.zeros((nrep,4))\n",
"for j in range(4):\n",
" Q[:,j] = (D == j).sum(1)\n",
"\n",
"## Calculate the log odds ratio\n",
"LOR = np.log(Q[:,0]) + np.log(Q[:,3]) - np.log(Q[:,1]) - np.log(Q[:,2])\n",
"\n",
"## The standard error\n",
"SE = np.sqrt((1/Q.astype(np.float64)).sum(1))\n",
"\n",
"print \"The mean estimated standard error is %.3f\" % SE.mean()\n",
"print \"The standard deviation of the estimates is %.3f\" % LOR.std()\n",
"\n",
"## 95% confidence intervals\n",
"LCL = LOR - 2*SE\n",
"UCL = LOR + 2*SE\n",
"\n",
"## Coverage probability\n",
"CP = np.mean((PLOR > LCL) & (PLOR < UCL))\n",
"\n",
"print \"The population LOR is %.2f\" % PLOR\n",
"print \"The expected value of the estimated LOR is %.2f\" % LOR[np.isfinite(LOR)].mean()\n",
"print \"The coverage probability of the 95%% CI is %.3f\" % CP"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The mean estimated standard error is 0.473\n",
"The standard deviation of the estimates is 0.494\n",
"The population LOR is 1.79\n",
"The expected value of the estimated LOR is 1.86\n",
"The coverage probability of the 95% CI is 0.949\n"
]
}
],
"prompt_number": 30
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Correlation coefficients"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation coefficient is a measure of dependence between paired quantitative observations. Given two data vectors `X` and `Y`, you can calculate the correlation coefficient using the NumPy function `np.corrcoef`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" X = np.random.normal(size=100)\n",
" Y = np.random.normal(size=100)\n",
" r = np.corrcoef(X, Y)[0,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `np.corrcoef` returns the 2x2 correlation matrix, the correlation coefficient is in the 0,1 position. That is all we want here, so we immediately select that value from the result.\n",
"\n",
"The standard error of the correlation coefficient is roughly 1/sqrt(n), where n is the sample size. A more accurate statistical analysis can be obtained by applying the Fisher transformation to the correlation coefficient:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" F = 0.5*np.log((1+r)/(1-r))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Fisher transformation approximately has standard error 1/sqrt(n-3).\n",
"\n",
"Here is a simulation study looking at the sampling properties of the correlation coefficient and its 95% confidence interval."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Simulation study for the correlation coefficient\n",
"\n",
"## Sample size\n",
"n = 20\n",
"\n",
"## Correlation between X and Y\n",
"r = 0.3\n",
"\n",
"## Generate matrices X and Y such that the i^th rows of X and Y are\n",
"## correlated with correlation coefficient 0.3.\n",
"X = np.random.normal(size=(nrep,n))\n",
"Y = r*X + np.sqrt(1-r**2)*np.random.normal(size=(nrep,n))\n",
"\n",
"## Get all the correlation coefficients\n",
"R = [np.corrcoef(x,y)[0,1] for x,y in zip(X,Y)]\n",
"R = np.array(R)\n",
"\n",
"## Fisher transform all the correlation coefficients\n",
"F = 0.5*np.log((1+R)/(1-R))\n",
"\n",
"print \"The standard deviation of the Fisher transformed \" +\\\n",
" \"correlation coefficients is %.3f\" % F.std() \n",
"print \"1/sqrt(n-3)=%.3f\" % (1/np.sqrt(n-3))\n",
"\n",
"## 95% confidence intervals on the Fisher transform scale\n",
"LCL = F - 2/np.sqrt(n-3)\n",
"UCL = F + 2/np.sqrt(n-3)\n",
"\n",
"## Convert the intervals back to the correlation scale\n",
"LCL = (np.exp(2*LCL)-1) / (np.exp(2*LCL)+1)\n",
"UCL = (np.exp(2*UCL)-1) / (np.exp(2*UCL)+1)\n",
"\n",
"CP = np.mean((LCL < r) & (r < UCL))\n",
"\n",
"print \"The coverage probability is %.3f\" % CP"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The standard deviation of the Fisher transformed correlation coefficients is 0.244\n",
"1/sqrt(n-3)=0.243\n",
"The coverage probability is 0.954\n"
]
}
],
"prompt_number": 31
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Analysis of a 2-way contingency table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose we have a 2-way contingency table, and we want to assess the relationship between the rows and the columns. This is often done using a chi square analysis. The first step in performing the chi square analysis is to fit a table to the observed counts, such that the rows and columns of the fitted table are exactly independent. This can be accomplished as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"T = np.array([[45,27], [80,35], [22, 18]])\n",
"\n",
"PR = T.sum(1) / float(T.sum())\n",
"PC = T.sum(0) / float(T.sum())\n",
"\n",
"E = np.outer(PR, PC)\n",
"C = T.sum()*E\n",
"\n",
"print C"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 46.62555066 25.37444934]\n",
" [ 74.47136564 40.52863436]\n",
" [ 25.9030837 14.0969163 ]]\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 3x2 table T is the observed data, E is the fitted probabilities that satisfy exact independence between rows and columns, and C are the fitted cell counts that satisfy exact independence between rows and columns. The chi-square statistic, degrees of freedom, and p-value can be calculated as follows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" cs = ((T-C)**2/C).sum()\n",
" df = (T.shape[0]-1)*(T.shape[1]-1)\n",
" pvalue = 1 - chi2.cdf(cs, df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chi2 function gives the chi square distribution probabilities, quantiles, etc., and can be obtained using:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.stats.distributions import chi2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a simulation study that looks at the chi square test for a particular 3 x 2 classification. The value of x determines how the results should be interpreted. When x=0, the data are generated under exact independence between rows and columns. Thus the test should only reject 5% of the time (since it is a level 0.05 test). The simulation assesses whether the test is performing as expected. If you set x to a nonzero value, the data are generated under a distribution in which the rows and columns are dependent. In this, case we are doing a power analysis to assess whether a given level of dependence can be detected using the specified sample size.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## Simulation study for contingency tables\n",
"\n",
"## Row and column cell probabilities\n",
"PR = np.r_[0.3, 0.4, 0.3]\n",
"PC = np.r_[0.6, 0.4]\n",
"\n",
"## Cell probabilities satisfying row/column independence\n",
"T = np.outer(PR, PC)\n",
"\n",
"## Modify the probabilities a bit to deviate from independence\n",
"x = 0.05 ## set to 0 to give independent rows and columns\n",
"T1 = T.copy()\n",
"T1[1,:] += [x, -x]\n",
"\n",
"## Sample size\n",
"n = 100\n",
"\n",
"## ravel vectorizes by row\n",
"m = rv_discrete(values=(range(6), T1.ravel()))\n",
"\n",
"## Generate the data\n",
"D = m.rvs(size=(nrep,n))\n",
"\n",
"## Convert to cell counts\n",
"Q = np.zeros((nrep,6))\n",
"for j in range(6):\n",
" Q[:,j] = (D == j).sum(1)\n",
"Q = np.reshape(Q, (nrep, 3, 2))\n",
"\n",
"## Row and column margins\n",
"RM = Q.sum(2) / float(n)\n",
"CM = Q.sum(1) / float(n)\n",
"\n",
"## Fitted probabilities and counts under independence\n",
"E = [np.outer(rm,cm) for rm,cm in zip(RM,CM)]\n",
"C = [n*x for x in E]\n",
"\n",
"## Chi-square statistics\n",
"CS = [((x-c)**2/c).sum() for x,c in zip(Q,C)]\n",
"CS = np.array(CS)\n",
"\n",
"## Test statistic threshold for alpha=0.04\n",
"from scipy.stats.distributions import chi2\n",
"df = (3-1)*(2-1)\n",
"ts = chi2.ppf(0.95, df)\n",
"\n",
"print \"Proportion of chi^2 tests that reject the independence hypothesis: %.2f\" %\\\n",
" np.mean(CS > ts)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Proportion of chi^2 tests that reject the independence hypothesis: 0.19\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/Code_download/stats_calculations.py\">stats_calculations.py</a>"
]
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment