---
title: "Effect and Sample Size"
author: "Leighton Pritchard"
date: "10 May 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Create dataset
Sampling from a Normal distribution with mean zero and unit standard deviation, for n in {3, 5, 7, 10, 50, 100, 250, 500, 1000, 5000, 10000}, one thousand times each, and applying a one-sample t-test to each sample to estimate the probability of observing a mean at least that extreme under the null hypothesis of zero mean.
```{r sim}
sample_sizes = c(3, 5, 7, 10, 50, 100, 250, 500, 1000, 5000, 10000)
df = data.frame(samples = integer(), mean = double(), sd = double(),
                p = double(), lci = double(), uci = double())
for (n in sample_sizes) {
  for (i in 1:1000) {
    data = rnorm(n)
    tt = t.test(data)  # one-sample t-test against mu = 0
    df = rbind(df, setNames(as.list(c(n, mean(data), sd(data), tt$p.value,
                                      tt$conf.int[[1]], tt$conf.int[[2]])),
                            names(df)))
  }
}
```
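Growing a data frame with `rbind` inside a double loop re-copies the frame on every iteration; an equivalent construction that builds each block of rows with `lapply` and binds once is usually much faster. This is a sketch, not part of the original analysis; `df_fast` is a name introduced here, and holds the same columns as `df`:

```{r sim_fast}
# Build one data.frame per sample size, then bind everything once;
# uses the sample_sizes vector defined in the previous chunk
df_fast = do.call(rbind, lapply(sample_sizes, function(n) {
  do.call(rbind, lapply(1:1000, function(i) {
    data = rnorm(n)
    tt = t.test(data)  # one-sample t-test against mu = 0
    data.frame(samples = n, mean = mean(data), sd = sd(data), p = tt$p.value,
               lci = tt$conf.int[[1]], uci = tt$conf.int[[2]])
  }))
}))
```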
## Significant effects
How likely are we to see 'significant' effects, or other signs of bias due to sample size?
```{r sig}
library("dplyr")
# Count of times 95% CI doesn't include zero (equivalent to two-tailed P<0.05)
sig_summary = df %>%
  group_by(samples) %>%
  summarize(sig_effect = sum(lci > 0 | uci < 0))
```
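As a sanity check on that count: under a true null, a two-sided test at alpha = 0.05 should reject about 5% of the time regardless of sample size. A standalone sketch (the seed is an arbitrary choice, not from the original analysis):

```{r fp_check}
set.seed(42)  # arbitrary seed, for reproducibility only
# Fraction of 1000 null samples (n = 5) rejected at alpha = 0.05
fp_rate = mean(replicate(1000, t.test(rnorm(5))$p.value < 0.05))
fp_rate  # should sit close to the nominal 0.05 even at n = 5
```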
Plotting mean distributions against sample size, we see that the range of means is greater at smaller sample sizes. However, if statistical tests are being run correctly, this should not translate into unduly optimistic estimates of statistical significance.
```{r plot_means}
library(ggplot2)
p1 = ggplot(df, aes(x = samples, y = mean))
p1 + geom_point(alpha = 0.3) + scale_x_log10()
```
Plotting the count of replicates in which a statistically significant difference is seen, there is no strong relationship between sample size and the frequency of statistically significant effects: under a true null, the false-positive rate should stay near the nominal 5% at every sample size.
```{r plot_sigs}
p2 = ggplot(sig_summary, aes(x=samples, y=sig_effect))
p2 + geom_point() + scale_x_log10()
```
Where there is scope for misinterpretation, it likely stems from the apparent absolute size of an effect, which must be larger at small sample sizes to produce the same P-value. That may lead to a presumption of biological (or domain-specific) significance because the effect is 'large'. The statistical test does not itself demonstrate that kind of importance; it shows only that the observed difference is sufficiently improbable under the null hypothesis to perhaps be worth further investigation.
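That size dependence can be made concrete: for a two-sided one-sample t-test at alpha = 0.05, significance requires |mean| * sqrt(n) / sd >= qt(0.975, n - 1), so the smallest detectable standardized effect shrinks roughly as 1/sqrt(n). A short sketch illustrating the point above:

```{r min_effect}
# Smallest |mean| / sd that can reach P < 0.05 (two-sided) at each n
n = c(3, 10, 100, 1000)
min_effect = qt(0.975, df = n - 1) / sqrt(n)
round(min_effect, 2)  # roughly 2.48, 0.72, 0.20, 0.06
```

So a 'just significant' effect at n = 3 is about forty times larger, in standardized units, than one at n = 1000.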