---
title: "Prediction Exercise"
author: "Jim Albert"
date: "2/16/2020"
output: html_document
---

This document implements a prediction experiment to compare three estimates of future batting averages.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
```
Note: the BApredict package is available on my GitHub site: https://github.com/bayesball/BApredict
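If BApredict is not already installed, it can be obtained from that repository. A sketch, assuming the remotes package is available (set `eval = FALSE` so the install doesn't rerun on every knit):

```{r, eval=FALSE}
# One-time install of BApredict from GitHub (requires the remotes package)
remotes::install_github("bayesball/BApredict")
```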
```{r}
library(mgcv)
library(tidyverse)
library(BApredict)
```
Read in data and focus on balls in play.
```{r}
# adjust the path to your local copy of the 2017 Statcast data
statcast2017 <- read_csv("~/Dropbox/2016 WORK/BLOG Baseball R/OTHER/StatcastData/statcast2017.csv")
scip <- filter(statcast2017, type == "X")
```
Remove sacrifices and define hit variable:
```{r}
hits <- c("single", "double", "triple", "home_run")
scip %>%
  filter(woba_denom == 1) %>%
  mutate(Hit = ifelse(events %in% hits, 1, 0)) ->
  scip_no_sac
```
Question: Is expected AVG a better predictor of future performance than AVG?
1. Randomly divide the season into two parts.
Here I set the seed so that anyone can replicate this exercise.
```{r}
set.seed(123456)
```
```{r}
# sample half of the balls in play (without replacement) for the first part;
# using nrow() avoids the hard-coded row counts
n <- nrow(scip_no_sac)
j <- sample(n, size = floor(n / 2))
d1 <- scip_no_sac[j, ]
d2 <- scip_no_sac[-j, ]
```
2. Fit the GAM model on the first-half data and compute the probability of a hit for each ball in play in both halves.
```{r}
fit <- gam(Hit ~ s(launch_speed, launch_angle),
           family = binomial,
           data = d1)
d1$Prob <- predict(fit, d1, type = "response")
d2$Prob <- predict(fit, d2, type = "response")
```
3. Find the AVG and expected AVG for each part.
```{r}
d1 %>%
  group_by(batter) %>%
  summarize(N = n(),
            AVG = mean(Hit),
            EAVG = mean(Prob)) -> S1
d2 %>%
  group_by(batter) %>%
  summarize(N = n(),
            AVG = mean(Hit),
            EAVG = mean(Prob)) -> S2
```
Merge the two datasets and restrict to batters with at least 100 balls in play in each part.
```{r}
S12 <- inner_join(S1, S2, by = "batter")
S12a <- filter(S12, N.x >= 100, N.y >= 100)
head(S12a)
```
4. Which is the better predictor?
Sum of squared prediction errors using the first-half AVG to predict the second-half AVG.
```{r}
(p1 <- with(S12a, sum((AVG.x - AVG.y) ^ 2)))
```
Use the expected average from the first half to predict AVG in the second half.
```{r}
(p2 <- with(S12a, sum((EAVG.x - AVG.y) ^ 2)))
```
What if we use the expected average from the second half to predict AVG in the second half? (Of course, in real life you wouldn't have access to the second-half launch conditions.)
```{r}
with(S12a, sum((EAVG.y - AVG.y) ^ 2))
```
How about fitting an exchangeable (multilevel) model to the first-half data?
```{r}
newfit <- fit_bb_model(list(y = round(S12a$AVG.x * S12a$N.x),
                            n = S12a$N.x))
S12a$bayes <- newfit$d$est
(p3 <- with(S12a, sum((bayes - AVG.y) ^ 2)))
```
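For intuition about what the exchangeable model is doing: it shrinks each hitter's first-half AVG toward the overall mean, with more shrinkage for smaller BIP counts. A rough hand-rolled stand-in (not the package's actual method; the prior "sample size" `K` is hand-picked here rather than estimated from the data) would be:

```{r}
# Hypothetical illustration of shrinkage, NOT fit_bb_model's estimate.
# K controls the amount of shrinkage toward the overall mean.
K <- 300
overall <- with(S12a, sum(AVG.x * N.x) / sum(N.x))
S12a$shrunk <- with(S12a, (N.x * AVG.x + K * overall) / (N.x + K))
```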
In terms of predictive performance, summarize the results:
```{r}
data.frame(Type = c("BA", "xBA", "Multilevel"),
           Performance = c(p1, p2, p3),
           Rank = rank(c(p1, p2, p3)))
```
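One follow-up not in the original exercise: the ranking above is based on a single random split. To check that it isn't an artifact of one particular seed, the BA-vs-xBA comparison could be wrapped in a function and repeated over several seeds. A sketch (with `eval = FALSE`, since each run refits the GAM):

```{r, eval=FALSE}
# Hypothetical helper: rerun the split/fit/score pipeline for one seed
# and return the two prediction-error sums (BA and xBA as predictors).
run_split <- function(seed, data) {
  set.seed(seed)
  n <- nrow(data)
  j <- sample(n, size = floor(n / 2))
  d1 <- data[j, ]
  d2 <- data[-j, ]
  fit <- gam(Hit ~ s(launch_speed, launch_angle),
             family = binomial, data = d1)
  d1$Prob <- predict(fit, d1, type = "response")
  d2$Prob <- predict(fit, d2, type = "response")
  summarize_half <- function(d)
    d %>%
      group_by(batter) %>%
      summarize(N = n(), AVG = mean(Hit), EAVG = mean(Prob))
  S <- inner_join(summarize_half(d1), summarize_half(d2),
                  by = "batter") %>%
    filter(N.x >= 100, N.y >= 100)
  c(BA  = with(S, sum((AVG.x - AVG.y) ^ 2)),
    xBA = with(S, sum((EAVG.x - AVG.y) ^ 2)))
}
# e.g. sapply(c(1, 2, 3), run_split, data = scip_no_sac)
```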