Created February 17, 2020 03:12
R Markdown file implementing a prediction experiment to compare three estimates of future batting averages.
---
title: "Prediction Exercise"
author: "Jim Albert"
date: "2/16/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
```
Note: the BApredict package is available on my GitHub site: https://github.com/bayesball/BApredict
```{r}
library(mgcv)
library(tidyverse)
library(BApredict)
```
Read in the data and focus on balls in play.
```{r}
statcast2017 <- read_csv("~/Dropbox/2016 WORK/BLOG Baseball R/OTHER/StatcastData/statcast2017.csv")
scip <- filter(statcast2017, type == "X")
```
Remove sacrifices and define the hit indicator variable:
```{r}
hits <- c("single", "double", "triple", "home_run")
scip %>%
  filter(woba_denom == 1) %>%
  mutate(Hit = ifelse(events %in% hits, 1, 0)) ->
  scip_no_sac
```
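As a quick sanity check on the indicator logic, `%in%` flags exactly the four hit events. Here is a toy events vector (hypothetical values, not drawn from the real data):

```{r}
# Toy events vector (hypothetical), checked against the hits list:
events_demo <- c("single", "field_out", "home_run", "grounded_into_double_play")
ifelse(events_demo %in% c("single", "double", "triple", "home_run"), 1, 0)
```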
Question: Is expected AVG a better predictor of future performance than AVG?

1. Randomly divide the season into two parts.

Here I set the seed so that anyone can replicate this exercise.
```{r}
set.seed(123456)
```
```{r}
n_bip <- nrow(scip_no_sac)
j <- sample(n_bip, size = floor(n_bip / 2),
            replace = FALSE)
d1 <- scip_no_sac[j, ]
d2 <- scip_no_sac[-j, ]
```
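A quick check that the two halves actually partition the balls in play (this adds nothing new; it just confirms the split):

```{r}
# Every row of scip_no_sac should appear in exactly one half:
nrow(d1) + nrow(d2) == nrow(scip_no_sac)
```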
2. Fit the GAM on data from the first part and find the probability of a hit for each ball in play in the 1st and 2nd halves.
```{r}
fit <- gam(Hit ~ s(launch_speed, launch_angle),
           family = binomial,
           data = d1)
d1$Prob <- predict(fit, d1, type = "response")
d2$Prob <- predict(fit, d2, type = "response")
```
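The fitted model returns a hit probability for any launch condition. For instance, a hypothetical batted ball at 100 mph exit velocity and a 25-degree launch angle (values chosen purely for illustration):

```{r}
# Hypothetical batted ball: 100 mph exit velocity, 25-degree launch angle
predict(fit,
        data.frame(launch_speed = 100, launch_angle = 25),
        type = "response")
```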
3. Find the AVG and expected AVG in each part.
```{r}
d1 %>% group_by(batter) %>%
  summarize(N = n(),
            AVG = mean(Hit),
            EAVG = mean(Prob)) -> S1
d2 %>% group_by(batter) %>%
  summarize(N = n(),
            AVG = mean(Hit),
            EAVG = mean(Prob)) -> S2
```
Merge the two datasets and restrict to batters with at least 100 balls in play in each part.
```{r}
S12 <- inner_join(S1, S2, by = "batter")
S12a <- filter(S12, N.x >= 100, N.y >= 100)
head(S12a)
```
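The `.x`/`.y` suffixes are added by `inner_join()` to disambiguate the shared column names: `.x` columns hold first-half values and `.y` columns hold second-half values. A quick count of how many batters clear the 100-BIP cutoff in both halves:

```{r}
# Number of batters with at least 100 balls in play in each half
nrow(S12a)
```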
4. Which is the better predictor?

Sum of squared prediction errors when using AVG from the 1st half to predict AVG in the 2nd half:
```{r}
(p1 <- with(S12a, sum((AVG.x - AVG.y) ^ 2)))
```
Use the expected average from the 1st half to predict AVG in the 2nd half:
```{r}
(p2 <- with(S12a, sum((EAVG.x - AVG.y) ^ 2)))
```
What if we use the expected average from the 2nd half to predict AVG in the 2nd half? (In real life you wouldn't have access to the 2nd-half launch conditions.)
```{r}
with(S12a, sum((EAVG.y - AVG.y) ^ 2))
```
How about fitting an exchangeable (multilevel) model to the first-half data?
```{r}
newfit <- fit_bb_model(list(y = round(S12a$AVG.x * S12a$N.x),
                            n = S12a$N.x))
S12a$bayes <- newfit$d$est
(p3 <- with(S12a, sum((bayes - AVG.y) ^ 2)))
```
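`fit_bb_model()` comes from the BApredict package, and I haven't reproduced its internals here, but the exchangeable beta-binomial posterior mean has the familiar shrinkage form. A minimal sketch, assuming a prior mean `eta` and precision `K` (both hypothetical values for illustration; the package estimates these from the data):

```{r}
eta <- 0.33   # assumed overall hit rate on balls in play (hypothetical)
K   <- 300    # assumed prior precision / pseudo-count (hypothetical)
shrink <- function(y, n) (y + K * eta) / (n + K)
# A batter with 35 hits on 100 balls in play is pulled toward eta:
shrink(35, 100)
```

Each estimate shrinks the observed rate toward the overall mean, and more strongly for batters with small N.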
Summarize the results in terms of predictive performance (smaller is better):
```{r}
data.frame(Type = c("BA", "xBA", "Multilevel"),
           Performance = c(p1, p2, p3),
           Rank = rank(c(p1, p2, p3)))
```