---
title: "Observed Alpha Levels"
author: "Lakens"
date: "January 1, 2020"
output:
  word_document: default
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: console
---
"In the long run we are all dead." - John Maynard Keynes
When we perform hypothesis tests in a Neyman-Pearson framework, we want to make decisions while controlling the rate at which we make errors. We do this by setting an alpha level, which guarantees that, in the long run, we will not claim there is an effect when the null hypothesis is true more than α × 100% of the time (e.g., 5% of the time when α = 0.05). If you have to make decisions that can be right or wrong, it is smart to limit how often you make the wrong ones.
I like my statistics applied. And in practice I don't do an infinite number of studies. As Keynes astutely observed, I will be dead before then. So when I control the error rate for my studies, what is a realistic Type 1 error rate I will observe in the 'somewhat longer run'?
```{r, include = FALSE}
# Distribution of observed Type 1 error rate in non-infinite samples
options(scipen = 999999999)
n_sims <- 100000
n_studies <- 10
alpha <- 0.05
obs_alpha <- numeric(n_sims)  # store observed Type 1 error rates

# Brute-force version: simulate n_studies two-sample t-tests under the null for each set
# library(progress)
# pb <- txtProgressBar(style = 3)
# for (i in 1:n_sims) {
#   setTxtProgressBar(pb, i / n_sims)
#   p <- numeric(n_studies)
#   for (j in 1:n_studies) {
#     x <- rnorm(n = 50, mean = 0, sd = 1)
#     y <- rnorm(n = 50, mean = 0, sd = 1)
#     p[j] <- t.test(x, y)$p.value
#   }
#   obs_alpha[i] <- sum(p < alpha) / n_studies
# }
# close(pb)
#
# hist(obs_alpha, breaks = 20, xlim = c(0, 0.12))

# Shortcut: under the null, the number of significant tests per set is Binomial(n_studies, alpha)
successes <- rbinom(n_sims, size = n_studies, prob = alpha)
proportion <- successes / n_studies

# Plot figure
# png(file = paste0("obs_alpha_", n_sims, "_sims_", n_studies, "_n_studies.png"),
#     width = 4000,
#     height = 3000,
#     units = "px",
#     res = 500)
hist(proportion,
     breaks = 30,
     xlab = "Observed Type 1 error rates",
     ylab = paste0("Frequency (", n_sims, " simulations)"),
     main = paste("Observed Type 1 error rates with", n_studies, "studies"),
     col = "grey",
     # ylim = c(0, n_sims),
     xlim = c(0, max(proportion))
)
# dev.off()
```
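Throughout these simulations the number of significant results in a set of tests of a true null hypothesis is drawn directly from a binomial distribution with `rbinom()`, since each test is significant with probability alpha; this replaces the brute-force approach of simulating every individual *t*-test. A quick sanity check that the shortcut is justified (a small sketch, not part of the original code):
```{r}
# Under the null, the proportion of significant two-sample t-tests should be close
# to alpha, which is what justifies drawing the number of significant results from
# rbinom() instead of running each individual test.
set.seed(2020)
mean(replicate(10000, t.test(rnorm(50), rnorm(50))$p.value < 0.05))  # should be ~0.05
```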
Let's assume you publish a paper that contains only a single *p*-value. Let's also assume the true effect size is 0, so the null hypothesis is true. Your study is either going to be significant (and thus a Type 1 error) or not. With a single study, you don't have the granularity to talk about a 5% error rate: the observed Type 1 error rate is either 0% or 100%.
```{r, echo = FALSE}
# Distribution of observed Type 1 error rate in non-infinite samples
n_sims <- 1000000
n_studies <- 1
alpha <- 0.05
successes <- rbinom(n_sims, size = n_studies, prob = alpha)  # number of significant results per set
proportion <- successes / n_studies                          # observed Type 1 error rate per set
# prop.test(x = (n_studies * alpha), n = n_studies)
# png(file = paste0("obs_alpha_", n_sims, "_sims_", n_studies, "_n_studies.png"),
#     width = 4000,
#     height = 3000,
#     units = "px",
#     res = 500)
hist(proportion,
     breaks = 30,
     xlab = "Observed Type 1 error rates",
     ylab = paste0("Frequency (total = ", n_sims, ")"),
     main = paste("Observed Type 1 error rates with", n_studies, "studies"),
     col = "grey",
     # ylim = c(0, n_sims),
     xlim = c(0, max(proportion))
)
# dev.off()
```
In experimental psychology, 30 seems to be a reasonable average for the number of *p*-values reported in a single paper (http://doi.org/10.1371/journal.pone.0127872). Let's assume you perform 30 tests in a single paper and that the null hypothesis is true every time (even though this is often unlikely in a real paper). In the long run, with an alpha level of 0.05, 30 × 0.05 = 1.5 *p*-values will be significant on average. But in a real set of 30 *p*-values, you will observe 0, 1, 2, 3, 4, 5, or even more Type 1 errors, which corresponds to observed error rates of 0%, 3.33%, 6.67%, 10%, 13.33%, 16.67%, or even more.
```{r, echo = FALSE}
# Distribution of observed Type 1 error rate in non-infinite samples
n_sims <- 1000000
n_studies <- 30
alpha <- 0.05
successes <- rbinom(n_sims, size = n_studies, prob = alpha)  # number of significant results per set
proportion <- successes / n_studies                          # observed Type 1 error rate per set
# prop.test(x = (n_studies * alpha), n = n_studies)
# png(file = paste0("obs_alpha_", n_sims, "_sims_", n_studies, "_n_studies.png"),
#     width = 4000,
#     height = 3000,
#     units = "px",
#     res = 500)
hist(proportion,
     breaks = 30,
     xlab = "Observed Type 1 error rates",
     ylab = paste0("Frequency (total = ", n_sims, ")"),
     main = paste("Observed Type 1 error rates with", n_studies, "studies"),
     col = "grey",
     # ylim = c(0, n_sims),
     xlim = c(0, max(proportion))
)
# dev.off()
# sort(unique(proportion))
```
Each of these error rates occurs with a certain frequency. 21.5% of the time, you will not make any Type 1 errors. 12.7% of the time, you will make 3 Type 1 errors in 30 tests. The average over thousands of papers reporting 30 tests will be a Type 1 error rate of 5%, but no single set of studies is average.
```{r, echo = FALSE}
prop_table <- as.data.frame(table(round(proportion, 3)) / n_sims)
colnames(prop_table) <- c("Observed Type 1 error rate", "Proportion of simulations")
prop_table
```
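These frequencies do not require simulation: the number of Type 1 errors in 30 tests of a true null hypothesis follows a binomial distribution, so the percentages above can be computed exactly (a short check using the same numbers):
```{r}
# Exact probability of observing k Type 1 errors in 30 tests when the null is always
# true and alpha = 0.05: the number of significant results is Binomial(30, 0.05).
dbinom(0, size = 30, prob = 0.05)  # ~0.215: no Type 1 errors at all
dbinom(3, size = 30, prob = 0.05)  # ~0.127: three Type 1 errors (a 10% observed error rate)
```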
Now maybe a single paper with 30 tests is not 'long-runnish' enough. What we really want to control is the Type 1 error rate across the literature, past, present, and future. Except that we will never read the entire literature. So let's assume we are interested in 200 studies examining a topic where the true effect size is 0.
```{r, echo = FALSE}
# Distribution of observed Type 1 error rate in non-infinite samples
n_sims <- 1000000
n_studies <- 200
alpha <- 0.05
successes <- rbinom(n_sims, size = n_studies, prob = alpha)  # number of significant results per set
proportion <- successes / n_studies                          # observed Type 1 error rate per set
# prop.test(x = (n_studies * alpha), n = n_studies)
# png(file = paste0("obs_alpha_", n_sims, "_sims_", n_studies, "_n_studies.png"),
#     width = 4000,
#     height = 3000,
#     units = "px",
#     res = 500)
hist(proportion,
     breaks = 30,
     xlab = "Observed Type 1 error rates",
     ylab = paste0("Frequency (total = ", n_sims, ")"),
     main = paste("Observed Type 1 error rates with", n_studies, "studies"),
     col = "grey",
     # ylim = c(0, n_sims),
     xlim = c(0, max(proportion))
)
# dev.off()
# sort(unique(proportion))
```
Now things start to look a bit more like what you would expect. It's likely that the observed Type 1 error rate is close to 5%. However, it is almost exactly as likely that the observed Type 1 error rate is 4.5%. 90% of the distribution of observed alpha levels will lie between `r quantile(proportion, probs = .05)` and `r quantile(proportion, probs = .95)`. So even in a 'somewhat long-runnish' 200 tests, the observed Type 1 error rate will rarely be exactly 5%, and it might be more useful to think of it as lying between 2.5% and 7.5%.
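The same interval can be computed exactly from the binomial distribution rather than from the simulated draws (a short check using the same parameters):
```{r}
# Exact 5th and 95th percentiles of the observed Type 1 error rate when 200 tests of a
# true null hypothesis are performed with alpha = 0.05.
qbinom(c(0.05, 0.95), size = 200, prob = 0.05) / 200  # 0.025 and 0.075
```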
## Statistical models are not reality.
A 5% error rate exists only in the abstract world of infinite repetitions, and you will not live long enough to perform an infinite number of studies. In practice, if you (or a group of researchers examining a specific question) do real research, the observed error rate will be somewhere in the vicinity of 5%. Everything we estimate from samples drawn from a larger population shows variation - error rates are no exception.
When we quantify things, there is a tendency to get lost in the digits. But in practice, the level of random noise we can reasonably expect quickly overwhelms anything three digits after the decimal point. I know we can compute the alpha level after a Pocock correction for two looks at the data in sequential analyses as 0.0294. But this is not the level of granularity we should have in mind when we think about the error rate we will observe in real lines of research. When we control our error rates, we do so with the goal of ending up somewhere reasonably low after a decent number of hypotheses have been tested. Whether we end up observing 2.5% Type 1 errors or 7.5% errors: potato, potahto.
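As an aside, the 0.0294 mentioned above can be recovered numerically. The sketch below is not part of the original post and assumes the mvtnorm package is available; it finds the common critical value for two equally spaced looks such that the overall two-sided Type 1 error rate is 5%, which corresponds to a nominal alpha of about 0.0294 per look.
```{r, eval = FALSE}
library(mvtnorm)

# Correlation between the test statistics at the interim and the final look,
# assuming two equally spaced looks: sqrt(n1 / n2) = sqrt(0.5).
corr <- matrix(c(1, sqrt(0.5), sqrt(0.5), 1), nrow = 2)

# Find the common critical value crit so that P(|Z1| > crit or |Z2| > crit) = 0.05.
pocock_c <- uniroot(function(crit) {
  as.numeric(1 - pmvnorm(lower = rep(-crit, 2), upper = rep(crit, 2), corr = corr)) - 0.05
}, interval = c(1.96, 3))$root

2 * (1 - pnorm(pocock_c))  # nominal alpha per look, approximately 0.0294
```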
This does not mean we should stop reporting numbers precisely where they can be quantified precisely, but we should realize the limitations of what we get from the statistical procedures we use. Statistical inferences can guide us roughly to where we would ideally like to end up. By all means calculate exact numbers where you can, and strictly adhere to hard thresholds so you don't fool yourself too often. But maybe in 2020 we can learn to appreciate that statistical inferences are always a bit messy. Do the best you can reasonably do, but don't expect perfection. In 2020, and in statistics.