--- title: "Designing simulation studies in R" author: "James E. Pustejovsky" date: "September 28, 2016" output: ioslides_presentation: css: custom.css widescreen: true transition: faster --- {r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)  ## Simulation >1. Generate data from a __probability model__ where you know the true parameters. >2. Apply one or more estimation procedures to the data, record the results. >3. Repeat many times ("in repeated samples..."). >4. Compare the results to the truth (bias, accuracy, error/coverage rates). ## Why is simulation useful? >* Building understanding & intuition about statistical models >* Assessing/checking modeling strategies >* Formal research ## Modeling batting averages h/t David Robinson for this example (more details [on his blog](http://varianceexplained.org/r/beta_binomial_baseball/)) >* Baseball batting average = # hits / # at-bats >* Correlated with # at-bats because better hitters get more opportunities >* Beta-binomial regression is a useful way to model this relationship ## Batting averages vs. at-bats {r} library(dplyr) library(tidyr) library(Lahman) library(ggplot2) pitchers <- Pitching %>% group_by(playerID) %>% summarize(gamesPitched = sum(G)) %>% filter(gamesPitched > 3) career <- Batting %>% filter(AB > 0) %>% anti_join(pitchers, by = "playerID") %>% group_by(playerID) %>% summarize_each(funs(sum), H, AB) %>% mutate(average = H / AB) career <- Master %>% tbl_df() %>% select(playerID, nameFirst, nameLast) %>% unite(name, nameFirst, nameLast, sep = " ") %>% inner_join(career, by = "playerID") career %>% filter(AB >= 20) %>% ggplot(aes(AB, average)) + geom_point(alpha = 0.3) + geom_smooth() + scale_x_log10() + theme_minimal() + labs(x = "Career at-bats", y = "Career batting-average")  {r, eval = FALSE, include = FALSE} BA_betabin <- gamlss(cbind(H, AB - H) ~ log(AB), data = career, family = BB(mu.link = "identity", sigma.link = "inverse")) summary(BA_betabin)  Source: [Lahman's Baseball Database, 1871-2015](http://www.seanlahman.com/baseball-archive/statistics/) (R package Lahman) ## Beta-binomial regression >*$Y_i$is number of hits by player$i$>*$n_i$is number of at-bats >*$Y_i \sim Binomial(n_i, \pi_i)$, where$p_i$is true batting ability of player$i$>*$\pi_i \sim Beta\left(\mu_{n_i} / \sigma, (1 - \mu_{n_i}) / \sigma\right)$, where$\mu_{n_i}$is average batting ability of players with$n_i$at-bats and$\sigma$describes the variability in true ability. 
## Simulate to build understanding

```{r}
career_at_bats <- career$AB
```

```{r, echo = TRUE}
set.seed(20160928)

players <- 1000
at_bats <- sample(career_at_bats, size = players)
summary(at_bats)

mu <- 0.140 + 0.015 * log(at_bats)
summary(mu)
```

## Simulate true abilities

```{r, echo = TRUE}
sigma <- 1 / 500
ability <- rbeta(players,
                 shape1 = mu / sigma,
                 shape2 = (1 - mu) / sigma)
```

## Simulate true abilities

```{r}
dat <- data_frame(at_bats, mu, ability)
ggplot(dat, aes(at_bats, ability)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  scale_x_log10() +
  theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Batting ability")
```

## Simulate batting averages

```{r, echo = TRUE}
dat$hits <- with(dat, rbinom(n = players, size = at_bats, prob = ability))
dat$batting_avg <- with(dat, hits / at_bats)
```

## Simulate batting averages

```{r}
ggplot(dat, aes(at_bats, batting_avg)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth() +
  scale_x_log10() +
  theme_minimal() +
  ylim(0, 0.6) +
  labs(x = "Career at-bats", y = "Observed batting average")
```

## Fit the beta-binomial regression

```{r, echo = TRUE, results = "hide"}
library(gamlss)
bb_fit <- gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat,
                 family = BB(mu.link = "identity"))
coef(bb_fit)
```

```{r}
coef(bb_fit)
```

## Simulate to check modeling strategies

>* In real-world data analysis, there are __almost always__ multiple ways to approach a problem.
>* Small simulations are a useful way to test out strategies for use in a given setting.
>* For modeling batting averages, beta-binomial regression is useful but __SLOW__ (see the timing sketch after the modeling function).
>* Would it work to use a __quasi-binomial glm__ instead?

## Data-generating function

```{r, echo = TRUE}
simulate_batting_avgs <- function(players, beta, sigma) {
  at_bats <- sample(career_at_bats, size = players)
  mu <- beta[1] + beta[2] * log(at_bats)
  ability <- rbeta(players, shape1 = mu / sigma, shape2 = (1 - mu) / sigma)
  hits <- rbinom(n = players, size = at_bats, prob = ability)
  data_frame(at_bats, hits)
}

new_dat <- simulate_batting_avgs(players = 400,
                                 beta = c(0.140, 0.015),
                                 sigma = 1 / 500)
```

## Data-generating function

```{r, echo = TRUE}
new_dat
```

## Modeling function

```{r, echo = TRUE}
quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcov(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

quasibinomial_CI(new_dat)
```
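
## How slow is the beta-binomial fit?

A rough timing comparison of the two fits on the same simulated data (a minimal sketch; exact timings are machine-dependent, so the chunk is not evaluated here):

```{r, echo = TRUE, eval = FALSE}
# Beta-binomial fit via gamlss
system.time(
  gamlss(cbind(hits, at_bats - hits) ~ log(at_bats),
         data = new_dat, family = BB(mu.link = "identity"))
)

# Quasi-binomial fit via glm
system.time(
  glm(cbind(hits, at_bats - hits) ~ log(at_bats),
      data = new_dat, family = quasibinomial(link = "identity"))
)
```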
## Put them together!

```{r, echo = TRUE}
lots_of_CIs <- replicate(2000, {
  dat <- simulate_batting_avgs(players = 400,
                               beta = c(0.140, 0.015),
                               sigma = 1 / 500)
  quasibinomial_CI(dat)
}, simplify = FALSE)
```

## Confidence interval coverage

```{r}
CI_dat <- bind_rows(lots_of_CIs) %>%
  group_by(term) %>%
  mutate(n = row_number())

true_dat <- data_frame(term = c("(Intercept)", "log(at_bats)"),
                       val = c(0.140, 0.015))

CI_dat %>%
  filter(n < 200) %>%
  ggplot() +
  geom_segment(aes(x = n, xend = n, y = L, yend = U)) +
  geom_hline(data = true_dat, aes(yintercept = val), col = "blue") +
  facet_wrap(~ term, ncol = 1, scales = "free") +
  theme_minimal() +
  labs(x = "Iteration", y = "Confidence interval")
```

## Confidence interval coverage

```{r, echo = TRUE}
bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```

```{r, eval = FALSE, include = FALSE}
library(sandwich)

robust_quasibinomial_CI <- function(dat, level = 0.95) {
  glm_fit <- glm(cbind(hits, at_bats - hits) ~ log(at_bats),
                 data = dat,
                 family = quasibinomial(link = "identity"))
  b <- coef(glm_fit)
  se <- sqrt(diag(vcovHC(glm_fit)))
  crit <- qnorm(1 - (1 - level) / 2)
  data_frame(term = names(b), L = b - se * crit, U = b + se * crit)
}

lots_of_robust_CIs <- replicate(2000, {
  dat <- simulate_batting_avgs(players = 400,
                               beta = c(0.140, 0.015),
                               sigma = 1 / 500)
  robust_quasibinomial_CI(dat)
}, simplify = FALSE)

bind_rows(lots_of_robust_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered))
```
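
## Monte Carlo error of the coverage estimates

Each coverage rate is itself an estimate based on 2000 replicates, so it helps to report its Monte Carlo standard error, $\sqrt{p(1 - p) / K}$ for a proportion $p$ estimated from $K$ replicates. A minimal sketch (the column name `MCSE` is illustrative):

```{r, echo = TRUE}
bind_rows(lots_of_CIs) %>%
  left_join(true_dat, by = "term") %>%
  mutate(covered = L < val & val < U) %>%
  group_by(term) %>%
  summarise(coverage = mean(covered),
            # Monte Carlo SE of a proportion: sqrt(p * (1 - p) / K)
            MCSE = sqrt(mean(covered) * (1 - mean(covered)) / n()))
```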
## Simulation studies in formal research

Questions about quantitative methodology:

>* Which type of confidence intervals should be used for the beta-binomial model?
>* Which of these twelve tests should I use for one-way ANOVA when the variances are non-homogeneous?
>* How big a sample is needed to get accurate estimates of variance components in a multi-level logistic regression model?
>* Is it reasonable to use a multivariate normal model to impute missing data, even though the variables look skewed?

## Why focus on simulation studies?

There are few alternatives for assessing

>* small-sample performance of estimation methods
>* performance of combinations of methods (data analysis pipelines)
>* robustness under model mis-specification
>* comparisons of competing methods

## Simulation design

```{r, fig.width = 10, fig.height = 6, out.width = "800px"}
library(diagram)

par(mar = c(0.1, 0.1, 0.1, 0.1))
openplotmat()
elpos <- coordinates(c(2, 1, 2))
fromto <- matrix(ncol = 2, byrow = TRUE,
                 data = c(4, 1, 1, 2, 2, 3, 3, 5))
nr <- nrow(fromto)
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nr) {
  arrpos[i, ] <- straightarrow(from = elpos[fromto[i, 1], ],
                               to = elpos[fromto[i, 2], ],
                               lwd = 2, arr.pos = 0.6, arr.length = 0.5)
}

box_dat <- data_frame(lab = c("Data-generating model", "Estimation methods",
                              "Performance criteria", "Experimental design",
                              "Results"),
                      col = c("lightblue", "lightgreen", "yellow",
                              "orange", "red"))

for (i in 1:nrow(box_dat)) {
  textrect(elpos[i, ], 0.2, 0.1,
           lab = box_dat$lab[i], box.col = box_dat$col[i],
           shadow.size = 0, cex = 2)
}
```

## Simulation design strategy

* Write separate functions for each component of the simulation
    * Makes it easier to debug, modify, or re-use code
* Test each component
* Run in parallel where possible (see the appendix slide)

## Learning more

* Spring, 2017: Data Analysis, Simulation, & Programming in R
* My blog: http://jepusto.github.io/
    * code for today's examples
    * lots of other examples
    * [another lecture on designing simulations](http://jepusto.github.io/Designing-simulation-studies-using-R)
* Ask faculty for articles with good simulation studies
* November 9th QM colloquium: Anita Israni on "Running Simulations on the Texas Advanced Computing Cluster"
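
## Appendix: running replicates in parallel

The `replicate()` call used above runs serially. One way to parallelize it is with the base `parallel` package (a minimal sketch; `mc.cores = 4` is illustrative, and `mclapply()` relies on forking, so it falls back to serial execution on Windows):

```{r, echo = TRUE, eval = FALSE}
library(parallel)

# Use a parallel-safe RNG stream so results are reproducible across workers
RNGkind("L'Ecuyer-CMRG")
set.seed(20160928)

# Each replicate is independent, so the work parallelizes trivially
lots_of_CIs <- mclapply(1:2000, function(i) {
  dat <- simulate_batting_avgs(players = 400,
                               beta = c(0.140, 0.015),
                               sigma = 1 / 500)
  quasibinomial_CI(dat)
}, mc.cores = 4)
```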