---
title: "Investigating the Central Limit Theorem through Simulation"
author: "AJ Heller (aj@drfloob.com; [http://drfloob.com](http://drfloob.com))"
date: "May 26, 2016"
output:
  html_document:
    toc: true
    toc_float: true
    theme: readable
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 10, fig.height = 6)
```
## Overview
This report explores the [Central Limit Theorem (CLT)][clt] by repeatedly sampling from a non-normal distribution and analyzing the means of those samples. We compare the mean and variance of this distribution of sample means to their theoretical values, and finally show the distribution to be approximately normal.
### Setup
This report does not require many libraries. I prefer `ggplot2`, though `lattice` and `base` plotting are fine alternatives.
```{r setup_lib}
library(ggplot2)
library(grid)
```
## Simulations
We begin by running 1000 simulations, each taking a sample of 40 random values from an [exponential distribution][exp] with parameter $\lambda = 0.2$. This creates a matrix of 1000 samples along the rows, with the 40 sample points stored along the columns.
```{r sim}
set.seed(420)
lambda <- 0.2
n <- 40
sims <- 1000
x <- matrix(rexp(n*sims, lambda), sims, n)
```
Those 1000 rows -- each containing 40 random values -- are then averaged to create a vector of sample means from every sample of size 40.
```{r means}
means <- apply(x, 1, mean)
```
## Sample Mean versus Theoretical Mean
For a bit of theoretical background, it's helpful to think of an exponential distribution as describing the time *between* independent events that occur at some continuous rate (See [Poisson Process][poisproc] to dig deeper into that). To quote [wikipedia][exp]:
> The mean or expected value of an exponentially distributed random variable $X$ with rate parameter $\lambda$ is given by
> $E[X] = \frac{1}{\lambda}$
> ... [intuitively] this makes sense: if you receive phone calls at an average rate of 2 per hour, then you can expect to wait half an hour for every call.
In terms of this report, the "phone calls" would come on average 0.2 per hour, so we would expect to wait $\dfrac{1}{.2} = 5$ hours for every call (again, on average).
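For concreteness, that expected value is trivial to compute directly (a small illustrative aside, not part of the original analysis):

```{r theoreticalMeanAside}
# Expected waiting time for an exponential distribution with rate lambda = 0.2
lambda <- 0.2
1 / lambda  # 5
```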
Let's now see how that value measures up against our distribution of averages of 40 exponentials. The sample mean is:
```{r sampleMean}
mean(means)
```
```{r plot_avg}
ggplot(data.frame(means = means), aes(means)) +
  geom_histogram(binwidth = 0.3, boundary = 0, fill = "#8b5577", color = "#333333") +
  geom_vline(xintercept = 1/lambda, size = 1.5, color = "#333333") +
  geom_vline(xintercept = mean(means), size = 1.5, color = "#cccccc") +
  xlab("mean value") +
  ggtitle("Distribution of Means of 40 Exponential Random Values")
```
In the plot above, the gray line marks the sample mean and the black line marks the theoretical mean. As you can see, they are nearly on top of each other. The difference can be explained by sampling error; the larger our sample size, the closer our sample mean should be to the theoretical population mean.
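To illustrate that last point, here is a hedged sketch (not part of the original analysis): rerunning the simulation with 100,000 sample means instead of 1,000 should pull the grand mean even closer to $\frac{1}{\lambda} = 5$, since the standard error of the grand mean shrinks with the square root of the number of samples.

```{r samplingErrorAside}
set.seed(42)
# 100,000 sample means of 40 exponentials each; the gap from the
# theoretical mean should be on the order of sqrt(0.625/1e5), about 0.0025
more_means <- rowMeans(matrix(rexp(40 * 1e5, 0.2), 1e5, 40))
abs(mean(more_means) - 5)
```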
## Sample Variance versus Theoretical Variance
The theoretical variance of the distribution of sample means is given by $\dfrac{\sigma^2}{n}$, where $\sigma^2$ is the exponential distribution's variance, and $n$ is the number of observations in each sample (40). We're given from the exponential distribution $\sigma^2 = \dfrac{1}{\lambda^2}$, so the population (theoretical) variance is:
```{r theoreticalVar}
tVar <- (1/lambda^2)/n
print(tVar)
```
The sample variance is easy to calculate in `R`:
```{r sampleVar}
sVar <- var(means)
print(sVar)
```
These numbers are very close: the sample variance is `r sprintf("%0.2f%%", sVar/tVar*100)` of the theoretical variance, and the agreement improves as the number of simulated samples grows. A sample size of 40 is also fairly small for our distribution of sample means to exhibit the full normality promised by the CLT (stay tuned), so we should expect some small difference here.
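The same convergence can be sketched for the variance (again an illustrative aside, not part of the original analysis): with many more simulated samples, the empirical variance of the sample means should sit very close to the theoretical $\dfrac{\sigma^2}{n} = \dfrac{(1/0.2)^2}{40} = 0.625$.

```{r varianceAside}
set.seed(7)
# 100,000 sample means; var() of this vector should approach 0.625
big_means <- rowMeans(matrix(rexp(40 * 1e5, 0.2), 1e5, 40))
var(big_means)
```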
## Normality of the Distribution
To inspect for normality visually, we plot a histogram and look for a bell-shaped curve. For comparison, to show non-normality of our original data, we first plot the first 1000 random variables sampled from our exponential distribution. Then we'll again plot the 1000 sample means. Both plots have an overlaid normal curve with the same mean and standard deviation as the data being plotted (code found [here][overlayNorm]).
```{r plotCompare}
p1 <- ggplot(data.frame(v = x[,1]), aes(v)) +
  geom_histogram(binwidth = 2, boundary = 0, fill = "#8b5577", color = "#333333") +
  xlab("value") +
  ggtitle("Samples from an Exponential Distribution") +
  stat_function(fun = function(y, m, sd) {
    # scale the normal density up to the histogram's count scale (n * binwidth)
    dnorm(y, m, sd) * length(x[,1]) * 2
  }, args = list(m = mean(x[,1]), sd = sd(x[,1])))
p2 <- ggplot(data.frame(means = means), aes(means)) +
  geom_histogram(binwidth = 0.3, boundary = 0, fill = "#8b5577", color = "#333333") +
  xlab("value") +
  ggtitle("Means of Samples from an Exponential Distribution") +
  stat_function(fun = function(x, m, sd) {
    dnorm(x, m, sd) * sims * 0.3
  }, args = list(m = mean(means), sd = sqrt(sVar)))
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
```
As you can see, the first plot towers well above the normal curve for values below about 4, while the second plot lines up with its curve quite closely. In this way, we can visually confirm that the distribution of sample means is approximately normal, and that the distribution of exponentials is definitely not.
I'll add one more visual measure of normality, namely a [Q-Q Normal Plot][qq], which compares the quantiles of our sample means with those of a normal distribution. The more normal the data, the more closely the points will follow a straight line.
```{r qq}
par(mfrow = c(1, 2))
qqnorm(x[,1], main = "Normal Q-Q Plot: sampled exponentials")
qqline(x[,1])
qqnorm(means, main = "Normal Q-Q Plot: sample means")
qqline(means)
```
The plot of mean samples (right) is quite linear, leading us again to believe it's an approximately normal distribution. The strong curve in the exponentials Q-Q Plot (left) indicates again that the data is unlikely to be normal.
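Beyond visual inspection, a formal test could complement the Q-Q plots. As an optional aside (not part of the original analysis), base R's Shapiro-Wilk test rejects normality decisively for raw exponential draws; note that with many observations the test can also detect the slight residual skew remaining in the sample means, so it is a stricter check than the plots above.

```{r shapiroAside}
set.seed(420)
# Shapiro-Wilk test: a small p-value is evidence against normality.
# For raw exponential draws, the p-value should be far below 0.05.
shapiro.test(rexp(100, 0.2))$p.value
```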
[exp]: https://en.wikipedia.org/wiki/Exponential_distribution
[clt]: https://en.wikipedia.org/wiki/Central_limit_theorem
[poisproc]: https://en.wikipedia.org/wiki/Poisson_point_process
[overlayNorm]: http://stackoverflow.com/a/36344354/10161
[qq]: https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot