---
title: "Prediction Exercise"
author: "Jim Albert"
date: "2/16/2020"
output: html_document
---

This document implements a prediction experiment to compare three estimates of future batting averages.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
```
Note: the BApredict package is available on my GitHub site: https://github.com/bayesball/BApredict
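If BApredict is not already installed, it can be obtained from that repository. A sketch, assuming the remotes package is available (set `eval = FALSE` so the install doesn't rerun on every knit):

```{r, eval=FALSE}
# One-time install of BApredict from GitHub (requires the remotes package)
remotes::install_github("bayesball/BApredict")
```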
```{r}
library(mgcv)
library(tidyverse)
library(BApredict)
```
Read in data and focus on balls in play.
```{r}
# adjust the path to your local copy of the 2017 Statcast data
statcast2017 <- read_csv("~/Dropbox/2016 WORK/BLOG Baseball R/OTHER/StatcastData/statcast2017.csv")
scip <- filter(statcast2017, type == "X")
```
Remove sacrifices and define hit variable:
```{r}
hits <- c("single", "double", "triple", "home_run")
scip %>%
  filter(woba_denom == 1) %>%
  mutate(Hit = ifelse(events %in% hits, 1, 0)) ->
  scip_no_sac
```
Question: Is expected AVG a better predictor of future performance than AVG?
1. Randomly divide the season into two parts.
Here I set the seed so that anyone can replicate this exercise.
```{r}
set.seed(123456)
```
```{r}
# sample half of the balls in play (without replacement) for the first part;
# using nrow() avoids the hard-coded row counts
n <- nrow(scip_no_sac)
j <- sample(n, size = floor(n / 2))
d1 <- scip_no_sac[j, ]
d2 <- scip_no_sac[-j, ]
```
2. Fit the GAM model on the first-half data and compute the probability of a hit for each ball in play in both halves.
```{r}
fit <- gam(Hit ~ s(launch_speed, launch_angle),
           family = binomial,
           data = d1)
d1$Prob <- predict(fit, d1, type = "response")
d2$Prob <- predict(fit, d2, type = "response")
```
3. Find the AVG and expected AVG for each part.
```{r}
d1 %>%
  group_by(batter) %>%
  summarize(N = n(),
            AVG = mean(Hit),
            EAVG = mean(Prob)) -> S1
d2 %>%
  group_by(batter) %>%
  summarize(N = n(),
            AVG = mean(Hit),
            EAVG = mean(Prob)) -> S2
```
Merge the two datasets and restrict to batters with at least 100 balls in play in each part.
```{r}
S12 <- inner_join(S1, S2, by = "batter")
S12a <- filter(S12, N.x >= 100, N.y >= 100)
head(S12a)
```
4. Which is the better predictor?
Sum of squared prediction errors using the first-half AVG to predict the second-half AVG.
```{r}
(p1 <- with(S12a, sum((AVG.x - AVG.y) ^ 2)))
```
Use the expected average from the first half to predict AVG in the second half.
```{r}
(p2 <- with(S12a, sum((EAVG.x - AVG.y) ^ 2)))
```
What if we use the expected average from the second half to predict AVG in the second half? (Of course, in real life you wouldn't have access to the second-half launch conditions.)
```{r}
with(S12a, sum((EAVG.y - AVG.y) ^ 2))
```
How about fitting an exchangeable (multilevel) model to the first-half data?
```{r}
newfit <- fit_bb_model(list(y = round(S12a$AVG.x * S12a$N.x),
                            n = S12a$N.x))
S12a$bayes <- newfit$d$est
(p3 <- with(S12a, sum((bayes - AVG.y) ^ 2)))
```
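For intuition about what the exchangeable model is doing: it shrinks each hitter's first-half AVG toward the overall mean, with more shrinkage for smaller BIP counts. A rough hand-rolled stand-in (not the package's actual method; the prior "sample size" `K` is hand-picked here rather than estimated from the data) would be:

```{r}
# Hypothetical illustration of shrinkage, NOT fit_bb_model's estimate.
# K controls the amount of shrinkage toward the overall mean.
K <- 300
overall <- with(S12a, sum(AVG.x * N.x) / sum(N.x))
S12a$shrunk <- with(S12a, (N.x * AVG.x + K * overall) / (N.x + K))
```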
In terms of predictive performance, summarize the results:
```{r}
data.frame(Type = c("BA", "xBA", "Multilevel"),
           Performance = c(p1, p2, p3),
           Rank = rank(c(p1, p2, p3)))
```
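One follow-up not in the original exercise: the ranking above is based on a single random split. To check that it isn't an artifact of one particular seed, the BA-vs-xBA comparison could be wrapped in a function and repeated over several seeds. A sketch (with `eval = FALSE`, since each run refits the GAM):

```{r, eval=FALSE}
# Hypothetical helper: rerun the split/fit/score pipeline for one seed
# and return the two prediction-error sums (BA and xBA as predictors).
run_split <- function(seed, data) {
  set.seed(seed)
  n <- nrow(data)
  j <- sample(n, size = floor(n / 2))
  d1 <- data[j, ]
  d2 <- data[-j, ]
  fit <- gam(Hit ~ s(launch_speed, launch_angle),
             family = binomial, data = d1)
  d1$Prob <- predict(fit, d1, type = "response")
  d2$Prob <- predict(fit, d2, type = "response")
  summarize_half <- function(d)
    d %>%
      group_by(batter) %>%
      summarize(N = n(), AVG = mean(Hit), EAVG = mean(Prob))
  S <- inner_join(summarize_half(d1), summarize_half(d2),
                  by = "batter") %>%
    filter(N.x >= 100, N.y >= 100)
  c(BA  = with(S, sum((AVG.x - AVG.y) ^ 2)),
    xBA = with(S, sum((EAVG.x - AVG.y) ^ 2)))
}
# e.g. sapply(c(1, 2, 3), run_split, data = scip_no_sac)
```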