---
title: "Designing simulation studies in R"
author: "James E. Pustejovsky"
date: "September 28, 2016"
output:
  ioslides_presentation:
    css: custom.css
    widescreen: true
    transition: faster
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)
```
## Simulation
>1. Generate data from a __probability model__ where you know the true parameters.
>2. Apply one or more estimation procedures to the data, record the results.
>3. Repeat many times ("in repeated samples...").
>4. Compare the results to the truth (bias, accuracy, error/coverage rates), as in the sketch below.
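In code, the recipe amounts to a short loop. Here is a minimal sketch, where `generate_data()` and `estimate()` stand in for problem-specific functions (they are hypothetical placeholders, not defined here):

```{r, echo = TRUE, eval = FALSE}
# Minimal sketch of the simulation recipe; generate_data() and
# estimate() are hypothetical placeholders.
one_rep <- function(params) {
  dat <- generate_data(params)  # 1. generate data from a known model
  estimate(dat)                 # 2. apply estimation procedure(s), record results
}

results <- replicate(1000, one_rep(params), simplify = FALSE)  # 3. repeat many times

# 4. summarize results against the true parameter values (bias, coverage, etc.)
```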
## Why is simulation useful?
>* Building understanding & intuition about statistical models
>* Assessing/checking modeling strategies
>* Formal research
## Modeling batting averages
h/t David Robinson for this example (more details [on his blog](http://varianceexplained.org/r/beta_binomial_baseball/))
<div class="columns-2">
<img src="babe-ruth.jpg" height="400px" />
>* Baseball batting average = # hits / # at-bats
>* Correlated with # at-bats because better hitters get more opportunities
>* Beta-binomial regression is a useful way to model this relationship
</div>
## Batting averages vs. at-bats
```{r}
library(dplyr)
library(tidyr)
library(Lahman)
library(ggplot2)
pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize_each(funs(sum), H, AB) %>%
  mutate(average = H / AB)

career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")
career %>%
  filter(AB >= 20) %>%
  ggplot(aes(AB, average)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() +
  theme_minimal() +
  labs(x = "Career at-bats", y = "Career batting average")
```
```{r, eval = FALSE, include = FALSE}
BA_betabin <- gamlss(cbind(H, AB - H) ~ log(AB),
                     data = career, family = BB(mu.link = "identity", sigma.link = "inverse"))
summary(BA_betabin)
```
Source: [Lahman's Baseball Database, 1871-2015](http://www.seanlahman.com/baseball-archive/statistics/) (R package `Lahman`)
## Beta-binomial regression
>* $Y_i$ is number of hits by player $i$
>* $n_i$ is number of at-bats
>* $Y_i \sim Binomial(n_i, \pi_i)$, where $\pi_i$ is the true batting ability of player $i$
>* $\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)$, where $\mu_{n_i}$ is average batting ability of players with $n_i$ at-bats and $\sigma$ describes the variability in true ability.
>* $\mu_{n_i} = \beta_0 + \beta_1 \log(n_i)$
## Simulate to build understanding
```{r}
career_at_bats <- career$AB
```
```{r, echo = TRUE}
set.seed(20160928)
players <- 1000
at_bats <- sample(career_at_bats, size = players)
summary(at_bats)
mu <- 0.140 + 0.015 * log(at_bats)
summary(mu)
```
## Simulate true abilities
```{r, echo = TRUE}
sigma <- 1 / 500
ability <- rbeta(players,
                 shape1 = mu / sigma,
                 shape2 = (1 - mu) / sigma)
```
## Simulate true abilities
```{r}
dat <- data_frame(at_bats, mu, ability)
ggplot(dat, aes(at_bats, ability)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() +
  theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Batting ability")
```
## Simulate batting averages
```{r, echo = TRUE}
dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
dat$batting_avg <- with(dat, hits / at_bats)
```
## Simulate batting averages
```{r}
ggplot(dat, aes(at_bats, batting_avg)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth() +
  scale_x_log10() +
  theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Observed batting average")
```
## Fit the beta-binomial regression
```{r, echo = TRUE, results = "hide"}
library(gamlss)
bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat, family = BB(mu.link = "identity"))
coef(bb_fit)
```
```{r}
coef(bb_fit)
```
## Simulate to check modeling strategies
>* In real-world data analysis, there are __almost always__ multiple ways to approach a problem.
>* Small simulations are a useful way to test out strategies for use in a given setting.
>* For modeling batting averages, beta-binomial regression is useful but __SLOW__.
>* Would it work to use a __quasi-binomial glm__ instead?
## Data-generating function
```{r, echo = TRUE}
simulate_batting_avgs <- function(players, beta, sigma) {
  at_bats <- sample(career_at_bats, size = players)
  mu <- beta[1] + beta[2] * log(at_bats)
  ability <- rbeta(players,
                   shape1 = mu / sigma,
                   shape2 = (1 - mu) / sigma)
  hits <- rbinom(n = players, size = at_bats, prob = ability)
  data_frame(at_bats, hits)
}

new_dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
```
## Data-generating function
```{r, echo = TRUE}
new_dat
```
## Modeling function
```{r, echo = TRUE}
quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcov(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}
quasibinomial_CI(new_dat)
```
## Put them together!
```{r, echo = TRUE}
lots_of_CIs <-
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
    quasibinomial_CI(dat)
  }, simplify = FALSE)
```
## Confidence interval coverage
```{r}
CI_dat <-
  bind_rows(lots_of_CIs) %>%
  group_by(term) %>%
  mutate(n = row_number())

true_dat <- data_frame(term = c("(Intercept)", "log(at_bats)"),
                       val = c(0.140, 0.015))

CI_dat %>%
  filter(n < 200) %>%
  ggplot() +
  geom_segment(aes(x = n, xend = n, y = L, yend = U)) +
  geom_hline(data = true_dat, aes(yintercept = val), color = "blue") +
  facet_wrap(~ term, ncol = 1, scales = "free") +
  theme_minimal() +
  labs(x = "Iteration", y = "Confidence interval")
```
## Confidence interval coverage
```{r, echo = TRUE}
bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
```{r, eval = FALSE, include = FALSE}
library(sandwich)
robust_quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcovHC(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

lots_of_robust_CIs <-
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
    robust_quasibinomial_CI(dat)
  }, simplify = FALSE)

bind_rows(lots_of_robust_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
## Simulation studies in formal research
Questions about quantitative methodology:
> * Which type of confidence intervals should be used for the beta-binomial model?
> * Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
> * How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
> * Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?
## Why focus on simulation studies?
There are few alternatives for assessing:
>* small-sample performance of estimation methods
>* performance of combinations of methods (data analysis pipelines)
>* robustness under model mis-specification
>* comparison of competing methods
## Simulation design
```{r, fig.width = 10, fig.height = 6, out.width = "800px"}
library(diagram)
par(mar = c(0.1, 0.1, 0.1, 0.1))
openplotmat()
elpos <- coordinates(c(2, 1, 2))
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(4, 1, 1, 2, 2, 3, 3, 5))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr) {
  arrpos[i, ] <- straightarrow(from = elpos[fromto[i, 1], ],
                               to = elpos[fromto[i, 2], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
}
box_dat <- data_frame(lab = c("Data-generating model",
                              "Estimation methods",
                              "Performance criteria",
                              "Experimental design",
                              "Results"),
                      col = c("lightblue", "lightgreen", "yellow", "orange", "red"))
for (i in 1:nrow(box_dat)) {
  textrect(elpos[i, ], 0.2, 0.1, lab = box_dat$lab[i],
           box.col = box_dat$col[i], shadow.size = 0, cex = 2)
}
```
## Simulation design strategy
* Write separate functions for each component of the simulation
    * Makes it easier to debug, modify, or re-use code
* Test each component
* Run in parallel where possible (see the sketch below)
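As an illustration of the last point, here is a minimal sketch of running the earlier replications in parallel with the base `parallel` package. The choice of four workers is arbitrary; adjust it to your machine.

```{r, echo = TRUE, eval = FALSE}
library(parallel)

cl <- makeCluster(4)                # start 4 worker processes
clusterEvalQ(cl, library(dplyr))    # load needed packages on each worker
clusterExport(cl, c("simulate_batting_avgs", "quasibinomial_CI", "career_at_bats"))
clusterSetRNGStream(cl, 20160928)   # reproducible parallel RNG streams

lots_of_CIs <- parLapply(cl, 1:2000, function(i) {
  dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
  quasibinomial_CI(dat)
})

stopCluster(cl)
```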
## Learning more
* Spring 2017 course: Data Analysis, Simulation, & Programming in R
* My blog: http://jepusto.github.io/
    * code for today's examples
    * lots of other examples
    * [another lecture on designing simulations](http://jepusto.github.io/Designing-simulation-studies-using-R)
* Ask faculty for articles with good simulation studies
* November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"