Last active
April 28, 2019 14:12
-
-
Save jmcastagnetto/666a2e543cacca541e53 to your computer and use it in GitHub Desktop.
Assignment for the "Statistical Inference" course (Coursera, August 2014)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Part 1: Simulation using the exponential distribution" | |
author: "Jesus M. Castagnetto" | |
date: "2014-08-23"" | |
output: | |
pdf_document: | |
highlight: zenburn | |
fontsize: 10pt | |
geometry: margin=0.8in | |
--- | |
*Project for the "Statistical Inference" course (Coursera, Aug. 2014)* | |
### Comparing the simulated mean and variance with the theoretical values | |
We will run 1000 rounds of simulation of 40 exponentials with $\lambda = 0.2$, | |
using a fixed seed, and comparing the distribution of the simulated mean | |
and variance with the theoretical value of $1 / \lambda$: | |
```{r results='asis'} | |
library(pander) | |
nsim <- 1000 | |
nvals <- 40 | |
lambda <- 0.2 | |
set.seed(567) | |
simdata <- t(replicate(nsim, rexp(nvals, lambda))) | |
df <- data.frame(Mean=c(mean(rowMeans(simdata)), 1/lambda), | |
Variance=c(mean(apply(simdata, 1, var)), 1/lambda^2)) | |
rownames(df) <- c("Simulated", "Theoretical") | |
pander(df, round=2) | |
``` | |
The simulated and theoretical values are very close, as expected by the CLT. | |
### Assessing if the simulated values are approximately normal | |
Also, according to the CLT, the distribution of the simulated means | |
should be approximately normal. To illustrate this we will | |
normalize the vectors and compare it to a $N(0,1)$ distribution. | |
```{r} | |
library(ggplot2) | |
meanvals <- rowMeans(simdata) | |
zmean <- (meanvals - mean(meanvals)) / sd(meanvals) | |
qplot(zmean, geom = "blank") + | |
geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') + | |
stat_function(fun = dnorm, aes(colour = 'Normal')) + | |
geom_histogram(aes(y = ..density..), alpha = 0.4, binwidth=.35) + | |
geom_vline(xintercept=0, colour="red", linetype="longdash") + | |
scale_colour_manual(name = 'Density', values = c('red', 'blue')) + | |
ylab("Density") + xlab("z") + ggtitle("Mean values distribution") + | |
theme_bw() + theme(legend.position = c(0.85, 0.85)) | |
``` | |
### Evaluating the coverage of the confidence interval | |
Theoretically, a 95% confidence interval should contain, if we simulate | |
a big number of them, the mean value for the exponential distribution | |
($1 / \lambda$) 95% of the time. | |
```{r} | |
set.seed(567) | |
lambda <- 0.2 | |
# checks for each simulation if the mean is in the confidence interval | |
inconfint <- function(lambda) { | |
ehats <- rexp(1000, lambda) | |
se <- sd(ehats)/sqrt(1000) | |
ll <- mean(ehats) - 1.96 * se | |
ul <- mean(ehats) + 1.96 * se | |
(ll < 1/lambda & ul > 1/lambda) | |
} | |
# estimate the coverage in each round of simulations | |
coverage <- function(lambda) { | |
covvals <- replicate(100, inconfint(lambda)) | |
mean(covvals) | |
} | |
# perform the simulation | |
simres <- replicate(100, coverage(lambda)) | |
mean(simres) | |
``` | |
As expected, the confidence interval contains the theoretical value 94.84% of the | |
time (close to the expected 95%). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Part 2: Analysis of the ToothGrowth data set" | |
author: "Jesus M. Castagnetto" | |
date: "2014-08-23" | |
output: | |
pdf_document: | |
highlight: zenburn | |
fontsize: 10pt | |
geometry: margin=0.8in | |
--- | |
*Project for the "Statistical Inference" course (Coursera, Aug. 2014)* | |
The `ToothGrowth` data set contains "... *The response is the length of | |
odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of | |
Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice | |
or ascorbic acid).* ..." [Ref.; C. I. Bliss (1952) The Statistics of Bioassay. | |
Academic Press] | |
### Quick exploration of the data set | |
For the purposes of this analysis, we will convert the dose variable into a factor. | |
```{r} | |
data(ToothGrowth) | |
ToothGrowth$dose <- as.factor(ToothGrowth$dose) | |
``` | |
The distributions of tooth length by delivery method and dose levels (see graph | |
in the next page), seem to indicate that for 0.5 and 1 mg, the mean tooth length | |
is greater for those where orange juice was used. For the 2 mg | |
level, no noticeable difference in the mean is observed, but the | |
dispersion for the OJ group seems to be smaller. | |
### Multiple hypothesis testing | |
To ascertain whether there is a real difference between the groups by | |
dose level and delivery method, we will employ a series of two-sided | |
unpaired t-tests to obtain the confidence intervals and p-values. | |
The comparison will be done at each dose level, between delivery methods, | |
asumming unequal variance (supported by the distribution graph). The p-values will | |
be adjusted using the (conservative) Bonferroni correction, and the comparative | |
results shown in the table below. | |
Our null hypothesis in all cases is that there is no difference in the means | |
between the two groups (two-sided test). | |
```{r results='asis'} | |
library(pander) | |
ts <- lapply(c(.5, 1, 2), function(x) { | |
t.test(len ~ supp, data=subset(ToothGrowth, dose==x), paired=FALSE, var.equal=FALSE) | |
}) | |
pvals <- c(ts[[1]]$p.value, ts[[2]]$p.value, ts[[3]]$p.value) | |
stats <- c(ts[[1]]$statistic, ts[[2]]$statistic, ts[[3]]$statistic) | |
adjp <- p.adjust(pvals, method = "bonferroni") | |
lls <- sapply(c(ts[[1]]$conf.int[1], ts[[2]]$conf.int[1], ts[[3]]$conf.int[1]), round, 3) | |
uls <- sapply(c(ts[[1]]$conf.int[2], ts[[2]]$conf.int[2], ts[[3]]$conf.int[2]), round, 3) | |
df <- data.frame(dose=c(0.5, 1, 2), t=stats, p=pvals, adj=adjp, | |
ci=paste0("[",paste(lls, uls, sep=", "), "]")) | |
colnames(df) <- c("Dose", "t", "p-value", "adj. p-value", "conf. int.") | |
pander(df, round=3, split.tables=120, | |
caption="*Two-sided comparison of delivery methods by dose*") | |
``` | |
```{r} | |
library(ggplot2) | |
ggplot(data=ToothGrowth, aes(y=len, x=supp, fill=supp)) + geom_boxplot() + | |
facet_wrap(~ dose, ncol=3) + ylab("Tooth length") + xlab("Delivery method") + | |
ggtitle("Tooth growth by delivery and dose") + | |
stat_summary(fun.y=mean, geom="point", shape=5, size=2) + | |
theme_bw()+ theme(legend.position = "none") | |
``` | |
### Conclusions | |
- At the 0.5 and 1 mg dose levels, there is a statistically significant difference | |
between the means of the OJ and VC groups. The adjusted p-values are significant | |
at the $\alpha$ = 0.05 level, and the 95% confidence intervals do not include zero. | |
- For the 2 mg dose level, we fail to reject the null hypothesis, the adjusted | |
p-value is much greater than 0.5, and the 95% confidence interval includes zero. | |
So, it seems that at that Vitamin C level, there is no significative influence | |
of the delivery method on tooth growth in guinea pigs. | |
- Because the effect size is very small for the 2 mg level, to be able to detect | |
a significative difference, we would need a much bigger sample size (n=10 does | |
not give the test enough power) | |
```{r} | |
s <- subset(ToothGrowth, dose==2) | |
d <- split(s, s$supp) | |
n1 <- 10; m1 <- mean(d$OJ$len); s1 <- sd(d$OJ$len) | |
n2 <- 10; m2 <- mean(d$VC$len); s2 <- sd(d$VC$len) | |
pooled_sd <- sqrt( ((n1 - 1) * s1^2 + (n2-1) * s1^2) / (n1 + n2-2)) | |
cat("effect size:", round((m2 - m1)/pooled_sd, 3), "\nestimated sample size:", | |
round(power.t.test(power=0.9, delta = m1 - m2, sd=pooled_sd)$n,0)) | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment