---
title: "Investigating the Central Limit Theorem through Simulation"
author: "AJ Heller (aj@drfloob.com; [http://drfloob.com](http://drfloob.com))"
date: "May 26, 2016"
output:
  html_document:
    toc: true
    toc_float: true
    theme: readable
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 10, fig.height = 6)
```
## Overview
This report explores the [Central Limit Theorem (CLT)][clt] by repeatedly sampling from a non-normal distribution and analyzing the means of those samples. We compare the mean and variance of this distribution of sample means to their theoretical values, and finally show the distribution to be approximately normal.
### Setup
This report does not require many libraries. I prefer `ggplot2`, though `lattice` and `base` plotting are fine alternatives.
```{r setup_lib}
library(ggplot2)
library(grid)
```
## Simulations
We begin by running 1000 simulations, each taking a sample of 40 random values from an [exponential distribution][exp] with parameter $\lambda = 0.2$. This creates a matrix of 1000 samples along the rows, with the 40 sample points stored along the columns.
```{r sim}
set.seed(420)
lambda <- 0.2
n <- 40
sims <- 1000
x <- matrix(rexp(n*sims, lambda), sims, n)
```
Those 1000 rows -- each containing 40 random values -- are then averaged to create a vector of sample means from every sample of size 40.
```{r means}
means <- apply(x, 1, mean)
```
## Sample Mean versus Theoretical Mean
For a bit of theoretical background, it's helpful to think of an exponential distribution as describing the time *between* independent events that occur at some continuous rate (See [Poisson Process][poisproc] to dig deeper into that). To quote [wikipedia][exp]:
> The mean or expected value of an exponentially distributed random variable $X$ with rate parameter $\lambda$ is given by
> $E[X] = \frac{1}{\lambda}$
> ... [intuitively] this makes sense: if you receive phone calls at an average rate of 2 per hour, then you can expect to wait half an hour for every call.
In terms of this report, the "phone calls" would come on average 0.2 per hour, so we would expect to wait $\dfrac{1}{.2} = 5$ hours for every call (again, on average).
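For concreteness, that expected value is trivial to compute directly (a small illustrative aside, not part of the original analysis):

```{r theoreticalMeanAside}
# Expected waiting time for an exponential distribution with rate lambda = 0.2
lambda <- 0.2
1 / lambda  # 5
```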
Let's now see how that value measures up against our distribution of averages of 40 exponentials. The sample mean is:
```{r sampleMean}
mean(means)
```
```{r plot_avg}
ggplot(data.frame(means = means), aes(means)) +
  geom_histogram(binwidth = 0.3, boundary = 0, fill = "#8b5577", color = "#333333") +
  geom_vline(xintercept = 1/lambda, size = 1.5, color = "#333333") +
  geom_vline(xintercept = mean(means), size = 1.5, color = "#cccccc") +
  xlab("mean value") +
  ggtitle("Distribution of Means of 40 Exponential Random Values")
```
In the plot above, the gray line marks the sample mean and the black line marks the theoretical mean. As you can see, they are nearly on top of each other. The difference can be explained by sampling error; the larger our sample size, the closer our sample mean should be to the theoretical population mean.
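To illustrate that last point, here is a hedged sketch (not part of the original analysis): rerunning the simulation with 100,000 sample means instead of 1,000 should pull the grand mean even closer to $\frac{1}{\lambda} = 5$, since the standard error of the grand mean shrinks with the square root of the number of samples.

```{r samplingErrorAside}
set.seed(42)
# 100,000 sample means of 40 exponentials each; the gap from the
# theoretical mean should be on the order of sqrt(0.625/1e5), about 0.0025
more_means <- rowMeans(matrix(rexp(40 * 1e5, 0.2), 1e5, 40))
abs(mean(more_means) - 5)
```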
## Sample Variance versus Theoretical Variance
The theoretical variance of the distribution of sample means is given by $\dfrac{\sigma^2}{n}$, where $\sigma^2$ is the exponential distribution's variance, and $n$ is the number of observations in each sample (40). We're given from the exponential distribution $\sigma^2 = \dfrac{1}{\lambda^2}$, so the population (theoretical) variance is:
```{r theoreticalVar}
tVar <- (1/lambda^2)/n
print(tVar)
```
The sample variance is easy to calculate in `R`:
```{r sampleVar}
sVar <- var(means)
print(sVar)
```
These numbers are very close: the sample variance is `r sprintf("%0.2f%%", sVar/tVar*100)` of the theoretical variance, and the agreement improves as the number of simulated samples grows. A sample size of 40 is also fairly small for our distribution of sample means to exhibit the full normality promised by the CLT (stay tuned), so we should expect some small difference here.
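The same convergence can be sketched for the variance (again an illustrative aside, not part of the original analysis): with many more simulated samples, the empirical variance of the sample means should sit very close to the theoretical $\dfrac{\sigma^2}{n} = \dfrac{(1/0.2)^2}{40} = 0.625$.

```{r varianceAside}
set.seed(7)
# 100,000 sample means; var() of this vector should approach 0.625
big_means <- rowMeans(matrix(rexp(40 * 1e5, 0.2), 1e5, 40))
var(big_means)
```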
## Normality of the Distribution
To inspect for normality visually, we plot a histogram and look for a bell-shaped curve. For comparison, to show non-normality of our original data, we first plot the first 1000 random variables sampled from our exponential distribution. Then we'll again plot the 1000 sample means. Both plots have an overlaid normal curve with the same mean and standard deviation as the data being plotted (code found [here][overlayNorm]).
```{r plotCompare}
p1 <- ggplot(data.frame(v = x[,1]), aes(v)) +
  geom_histogram(binwidth = 2, boundary = 0, fill = "#8b5577", color = "#333333") +
  xlab("value") +
  ggtitle("Samples from an Exponential Distribution") +
  stat_function(fun = function(y, m, sd) {
    # scale the normal density up to the histogram's count scale (n * binwidth)
    dnorm(y, m, sd) * length(x[,1]) * 2
  }, args = list(m = mean(x[,1]), sd = sd(x[,1])))
p2 <- ggplot(data.frame(means = means), aes(means)) +
  geom_histogram(binwidth = 0.3, boundary = 0, fill = "#8b5577", color = "#333333") +
  xlab("value") +
  ggtitle("Means of Samples from an Exponential Distribution") +
  stat_function(fun = function(x, m, sd) {
    dnorm(x, m, sd) * sims * 0.3
  }, args = list(m = mean(means), sd = sqrt(sVar)))
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
```
As you can see, the first plot towers well above the normal curve for values below about 4, while the second plot lines up with its curve quite closely. In this way, we can visually confirm that the distribution of sample means is approximately normal, and that the distribution of exponentials is definitely not.
I'll add one more visual measure of normality, namely a [Q-Q Normal Plot][qq], which compares the quantiles of our sample means with those of a normal distribution. The more normal the data, the more closely the points will follow a straight line.
```{r qq}
par(mfrow = c(1, 2))
qqnorm(x[,1], main = "Normal Q-Q Plot: sampled exponentials")
qqline(x[,1])
qqnorm(means, main = "Normal Q-Q Plot: sample means")
qqline(means)
```
The plot of mean samples (right) is quite linear, leading us again to believe it's an approximately normal distribution. The strong curve in the exponentials Q-Q Plot (left) indicates again that the data is unlikely to be normal.
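Beyond visual inspection, a formal test could complement the Q-Q plots. As an optional aside (not part of the original analysis), base R's Shapiro-Wilk test rejects normality decisively for raw exponential draws; note that with many observations the test can also detect the slight residual skew remaining in the sample means, so it is a stricter check than the plots above.

```{r shapiroAside}
set.seed(420)
# Shapiro-Wilk test: a small p-value is evidence against normality.
# For raw exponential draws, the p-value should be far below 0.05.
shapiro.test(rexp(100, 0.2))$p.value
```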
[exp]: https://en.wikipedia.org/wiki/Exponential_distribution
[clt]: https://en.wikipedia.org/wiki/Central_limit_theorem
[poisproc]: https://en.wikipedia.org/wiki/Poisson_point_process
[overlayNorm]: http://stackoverflow.com/a/36344354/10161
[qq]: https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot