---
title: "Shapiro Test and Levene Test Simulations"
author: "Clay Ford"
date: "September 25, 2018"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Shapiro Test for Normal data
Can you tell the difference between these two distributions?
```{r echo=FALSE, fig.height=3}
op <- par(mfrow = c(1,2))
curve(expr = dnorm(x), from = -3, to = 3, ylab = "")
curve(expr = dt(x, df = 10), from = -3, to = 3, ylab = "")
par(op)
```
Probably not. But I can definitively say one of them is not normal. The one on the right. It's a *t* distribution with df = 10.
What about these two QQ Plots? Which one is for the non-normal data?
```{r echo=FALSE, fig.height=3}
set.seed(100)
x1 <- rnorm(500)
x2 <- rt(500, df = 10)
op <- par(mfrow = c(1,2))
qqnorm(x1); qqline(x1)
qqnorm(x2); qqline(x2)
par(op)
```
They both look the same to me, but the one on the right is non-normal. The plot on the left shows random data from a Normal(0,1) distribution; the one on the right shows random data from a *t* distribution with df = 10.
In both cases I have compared Normal data to data from a *t* distribution. The latter is, by definition, not normal. But I think we can agree it's probably normal enough.
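To put a number on "normal enough", we can compare tail probabilities directly. This is a quick base-R check of my own, not part of the original simulations:

```r
# Probability of a value beyond +/-2 under each distribution.
# The t(10) puts only modestly more mass in the tails than the Normal(0,1).
c(normal = 2 * pnorm(-2), t10 = 2 * pt(-2, df = 10))
```

A bit more tail mass (on the order of 0.07 versus 0.05) is essentially the only difference between the two distributions.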
Let's simulate 10,000 data sets of size 500 drawn from a *t* distribution with 10 degrees of freedom, and each time run a Shapiro test and check whether the p-value is less than 0.05.
```{r}
r.out <- replicate(n = 1e4, expr = shapiro.test(rt(500, df = 10))$p.value < 0.05)
mean(r.out)
```
The Shapiro test rejects normality at the 0.05 level `r round(mean(r.out),3)*100` percent of the time. Yet, based on our graphical evaluations, the data look virtually indistinguishable from Normal data.
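Much of this is a sample-size effect: with n = 500 the test has the power to detect even this slight departure. Here is a sketch of how the rejection rate varies with n (the seed and replication count are my choices, not from the simulation above):

```r
# Rejection rate of the Shapiro test on t(10) data at several sample sizes.
# Smaller samples rarely flag this mild departure from normality.
set.seed(200)  # arbitrary seed for reproducibility
reject_rate <- function(n, reps = 1000) {
  mean(replicate(reps, shapiro.test(rt(n, df = 10))$p.value < 0.05))
}
sapply(c(50, 100, 250, 500), reject_rate)
```

The same data-generating process goes from "looks normal" to "reject normality" simply by collecting more data.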
## Levene test for constant variance
Take a look at this plot. Are we concerned about non-constant variance?
```{r echo=FALSE,fig.height=4}
set.seed(1)
l.out <- lapply(X = c(2, 2.3, 2.5, 2, 2), function(x)rnorm(n = 500, mean = 10, sd = x))
dat <- data.frame(y = unlist(l.out), grp = rep(letters[1:5], each = 500))
stripchart(y ~ grp, data = dat)
```
I'm not. There are about 4 outliers. But that's out of 2500 observations.
What does Levene's test say?
```{r message=FALSE}
library(car)
leveneTest(y ~ grp, data = dat)
```
Based on its small p-value, it reports that the data have non-constant variance. And it's right! The data do not have constant variance across groups. But is it really cause for concern?
Here is how I generated the data and made the plot:
```{r eval=FALSE}
l.out <- lapply(X = c(2, 2.3, 2.5, 2, 2), function(x)rnorm(n = 500, mean = 10, sd = x))
dat <- data.frame(y = unlist(l.out), grp = rep(letters[1:5], each = 500))
stripchart(y ~ grp, data = dat)
```
Notice the standard deviations are not that different: three are 2, one is 2.3, and one is 2.5. No, they're not *exactly* equal, but I'd say they're close enough to satisfy the constant variance assumption, which frankly is an unrealistic assumption in the real world. We should be so lucky to have real data with such similar variability across groups!
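As a closing check, we can ask how often Levene's test rejects under this data-generating process, mirroring the Shapiro simulation above. Rather than loop over `car::leveneTest`, here is a base-R sketch of its default (median-centered, Brown-Forsythe) form -- a one-way ANOVA on absolute deviations from the group medians. The seed and replication count are my choices, not from the original:

```r
# Brown-Forsythe form of Levene's test: ANOVA on absolute deviations
# from each group's median (what car::leveneTest does by default).
levene_p <- function(y, grp) {
  z <- abs(y - ave(y, grp, FUN = median))
  anova(lm(z ~ grp))[1, "Pr(>F)"]
}

set.seed(2)  # arbitrary seed
p.out <- replicate(200, {
  y <- unlist(lapply(c(2, 2.3, 2.5, 2, 2),
                     function(s) rnorm(n = 500, mean = 10, sd = s)))
  levene_p(y, factor(rep(letters[1:5], each = 500)))
})
mean(p.out < 0.05)
```

With 500 observations per group, even these modest differences in spread get flagged most of the time -- another reminder that statistical significance is not the same as practical importance.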