Rmd for my presentation on simulation studies in R, Quant Methods brownbag colloquium 2016/09/28
---
title: "Designing simulation studies in R"
author: "James E. Pustejovsky"
date: "September 28, 2016"
output:
  ioslides_presentation:
    css: custom.css
    widescreen: true
    transition: faster
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)
```
## Simulation
>1. Generate data from a __probability model__ where you know the true parameters.
>2. Apply one or more estimation procedures to the data, record the results.
>3. Repeat many times ("in repeated samples...").
>4. Compare the results to the truth (bias, accuracy, error/coverage rates).
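A minimal sketch of this loop (using a toy normal-mean example that is not part of the original slides) might look like:
```{r, echo = TRUE, eval = FALSE}
true_mean <- 5                                  # known parameter of the probability model
estimates <- replicate(1000, {                  # 3. repeat many times
  y <- rnorm(n = 50, mean = true_mean, sd = 2)  # 1. generate data from the model
  mean(y)                                       # 2. apply an estimation procedure, record result
})
mean(estimates) - true_mean                     # 4. compare to the truth: bias
sd(estimates)                                   # 4. compare to the truth: sampling variability
```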
## Why is simulation useful?
>* Building understanding & intuition about statistical models
>* Assessing/checking modeling strategies
>* Formal research
## Modeling batting averages
h/t David Robinson for this example (more details [on his blog](http://varianceexplained.org/r/beta_binomial_baseball/))
<div class="columns-2">
<img src="babe-ruth.jpg" height="400px" />
>* Baseball batting average = # hits / # at-bats
>* Correlated with # at-bats because better hitters get more opportunities
>* Beta-binomial regression is a useful way to model this relationship
</div>
## Batting averages vs. at-bats
```{r}
library(dplyr)
library(tidyr)
library(Lahman)
library(ggplot2)
pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize_each(funs(sum), H, AB) %>%
  mutate(average = H / AB)

career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")

career %>%
  filter(AB >= 20) %>%
  ggplot(aes(AB, average)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() + theme_minimal() +
  labs(x = "Career at-bats", y = "Career batting average")
```
```{r, eval = FALSE, include = FALSE}
BA_betabin <- gamlss(cbind(H, AB - H) ~ log(AB),
                     data = career,
                     family = BB(mu.link = "identity", sigma.link = "inverse"))
summary(BA_betabin)
```
Source: [Lahman's Baseball Database, 1871-2015](http://www.seanlahman.com/baseball-archive/statistics/) (R package `Lahman`)
## Beta-binomial regression
>* $Y_i$ is number of hits by player $i$
>* $n_i$ is number of at-bats
>* $Y_i \sim Binomial(n_i, \pi_i)$, where $\pi_i$ is the true batting ability of player $i$
>* $\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)$, where $\mu_{n_i}$ is average batting ability of players with $n_i$ at-bats and $\sigma$ describes the variability in true ability.
>* $\mu_{n_i} = \beta_0 + \beta_1 \log(n_i)$
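As a quick sanity check (not in the original slides), this Beta parameterization has mean $\mu_{n_i}$, and smaller $\sigma$ means less spread in true ability:
```{r, echo = TRUE, eval = FALSE}
mu <- 0.25
sigma <- 1 / 500
draws <- rbeta(1e5, shape1 = mu / sigma, shape2 = (1 - mu) / sigma)
mean(draws)  # approximately mu = 0.25
sd(draws)    # small, because sigma is small
```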
## Simulate to build understanding
```{r}
career_at_bats <- career$AB
```
```{r, echo = TRUE}
set.seed(20160928)
players <- 1000
at_bats <- sample(career_at_bats, size = players)
summary(at_bats)
mu <- 0.140 + 0.015 * log(at_bats)
summary(mu)
```
## Simulate true abilities
```{r, echo = TRUE}
sigma <- 1 / 500
ability <- rbeta(players,
                 shape1 = mu / sigma,
                 shape2 = (1 - mu) / sigma)
```
## Simulate true abilities
```{r}
dat <- data_frame(at_bats, mu, ability)

ggplot(dat, aes(at_bats, ability)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() + theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Batting ability")
```
## Simulate batting averages
```{r, echo = TRUE}
dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
dat$batting_avg <- with(dat, hits / at_bats)
```
## Simulate batting averages
```{r}
ggplot(dat, aes(at_bats, batting_avg)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth() +
  scale_x_log10() + theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Observed batting average")
```
## Fit the beta-binomial regression
```{r, echo = TRUE, results = "hide"}
library(gamlss)
bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat, family = BB(mu.link = "identity"))
coef(bb_fit)
```
```{r}
coef(bb_fit)
```
## Simulate to check modeling strategies
>* In real-world data analysis, there are __almost always__ multiple ways to approach a problem.
>* Small simulations are a useful way to test out strategies for use in a given setting.
>* For modeling batting averages, beta-binomial regression is useful but __SLOW__.
>* Would it work to use a __quasi-binomial glm__ instead?
## Data-generating function
```{r, echo = TRUE}
simulate_batting_avgs <- function(players, beta, sigma) {
  at_bats <- sample(career_at_bats, size = players)
  mu <- beta[1] + beta[2] * log(at_bats)
  ability <- rbeta(players,
                   shape1 = mu / sigma,
                   shape2 = (1 - mu) / sigma)
  hits <- rbinom(n = players, size = at_bats, prob = ability)
  data_frame(at_bats, hits)
}

new_dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
```
## Data-generating function
```{r, echo = TRUE}
new_dat
```
## Modeling function
```{r, echo = TRUE}
quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcov(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

quasibinomial_CI(new_dat)
```
## Put them together!
```{r, echo = TRUE}
lots_of_CIs <-
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
    quasibinomial_CI(dat)
  }, simplify = FALSE)
```
## Confidence interval coverage
```{r}
CI_dat <-
  bind_rows(lots_of_CIs) %>%
  group_by(term) %>%
  mutate(n = row_number())

true_dat <- data_frame(term = c("(Intercept)", "log(at_bats)"), val = c(0.140, 0.015))

CI_dat %>%
  filter(n < 200) %>%
  ggplot() +
  geom_segment(aes(x = n, xend = n, y = L, yend = U)) +
  geom_hline(data = true_dat, aes(yintercept = val), col = "blue") +
  facet_wrap(~ term, ncol = 1, scales = "free") +
  theme_minimal() +
  labs(x = "Iteration", y = "Confidence interval")
```
## Confidence interval coverage
```{r, echo = TRUE}
bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
```{r, eval = FALSE, include = FALSE}
library(sandwich)
robust_quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats), data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcovHC(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

lots_of_robust_CIs <-
  replicate(2000, {
    dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
    robust_quasibinomial_CI(dat)
  }, simplify = FALSE)

bind_rows(lots_of_robust_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
## Simulation studies in formal research
Questions about quantitative methodology:
> * Which type of confidence intervals should be used for the beta-binomial model?
> * Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
> * How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
> * Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?
## Why focus on simulation studies?
Few alternatives for assessing
>* small-sample performance of estimation methods
>* performance of combinations of methods (data analysis pipelines)
>* robustness under model mis-specification
>* comparison of competing methods
## Simulation design
```{r, fig.width = 10, fig.height = 6, out.width = "800px"}
library(diagram)
par(mar = c(0.1, 0.1, 0.1, 0.1))
openplotmat()
elpos <- coordinates(c(2, 1, 2))
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(4, 1, 1, 2, 2, 3, 3, 5))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr) {
  arrpos[i, ] <- straightarrow(from = elpos[fromto[i, 1], ],
                               to = elpos[fromto[i, 2], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
}
box_dat <- data_frame(lab = c("Data-generating model",
                              "Estimation methods",
                              "Performance criteria",
                              "Experimental design",
                              "Results"),
                      col = c("lightblue", "lightgreen", "yellow", "orange", "red"))
for (i in 1:nrow(box_dat)) {
  textrect(elpos[i, ], 0.2, 0.1, lab = box_dat$lab[i],
           box.col = box_dat$col[i], shadow.size = 0, cex = 2)
}
```
## Simulation design strategy
* Write separate functions for each component of the simulation
    * Makes it easier to debug, modify, or re-use code
* Test each component
* Run in parallel where possible (see the sketch below)
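As a sketch of the last point (not from the original slides, and assuming a Mac or Linux machine with at least four cores), the replication loop above could be run on forked workers with the base `parallel` package; on Windows, a `parLapply()` cluster with the functions exported would be needed instead.
```{r, echo = TRUE, eval = FALSE}
library(parallel)

# Reproducible parallel random-number streams
RNGkind("L'Ecuyer-CMRG")
set.seed(20160928)

# Each replication runs on a separate forked worker
lots_of_CIs <- mclapply(1:2000, function(i) {
  dat <- simulate_batting_avgs(players = 400, beta = c(0.140, 0.015), sigma = 1 / 500)
  quasibinomial_CI(dat)
}, mc.cores = 4)
```
The result is a list, just like `replicate(..., simplify = FALSE)`, so the `bind_rows()` coverage summary works unchanged.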
## Learning more
* Spring, 2017: Data Analysis, Simulation, & Programming in R
* My blog: http://jepusto.github.io/
    * code for today's examples
    * lots of other examples
    * [another lecture on designing simulations](http://jepusto.github.io/Designing-simulation-studies-using-R)
* Ask faculty for articles with good simulation studies
* November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"