wessonmo/ep_models_and_team_str.md

## ep_models_and_team_str.md

      
    Raw
  

              ep_models_and_team_str.md
            
          
    Team Strength Exclusion Bias in Expected Points Models

@nflscrapR's Expected Points (EP) is a popular metric among analysts doing public research of play in the NFL. Detailed in the creators' research paper, the metric is derived from a model that was built as a part of a larger system designed to calculate individual wins above replacement values for offensive skill players.
The authors very graciously made public all of their data (nflscrapR-data) and code (nflWAR, nflscrapR-models, nflscrapR) for this project, including the code used to build the EP model. In the init_ep_fg_models.R file of the nflscrapR-models repository, we can see that the following variables are used as inputs in the model:

TimeSecs_Remaining: time remaining until end of half
yrdline100: yardline from 1-99, where 1 is the possessing team's own 1 yardline and 99 is their opponent's
down
log_ydstogo: log transform of yards to go
GoalToGo: indicator for goal-to-go down and distances
Under_TwoMinute_Warning: indicator for two minute warning

Noticably absent are variables for the possessing and defensive team's "strength", as well as an indicator for whether or not the possessing team was at home. When I asked why these variables were excluded, it was explained to me (via tweet):

[The model is] creating an average baseline for which EPA calculations can then be attributed to players in some manner (then you can adjust for home field at that point) - one could make an EP mode taking into account home, team strength etc - different purpose

Regaring a home-field advantage (HFA) variable, I am not sure I understand how it is different (in a way that would merit exclusion) from any of the variables that are included. In addition, while I do see the logic regarding the exclusion of team strength variables, I believe their exclusion could be leading to biases in the EP model.
As such, the point of this analysis is to compare the performance of the current model vs one that includes HFA and team strength estimates.
1. Getting the Data

I did not have the proper permissions to access the nflscrapR-models repository in order to import the functions it used to build the play-by-play (pbp) data. So, I just copied the necessary code from that repository and used it to create virtually identical functions in my own scripts. If you want to replicate this analysis, you'll need to create a file in your working directory named utils.R that is identical to the file here.
To begin, we'll use the get_ep_model_pbp_data function created in the utils.R file to import the data and create two new variables:

ActPts: number of points of the next score in the half
pos_loc: HFA indicator for the team possessiong the ball

source('utils.R')

pbp_ep_model_data <- get_ep_model_pbp_data() %>%
  mutate(
    ActPts = case_when(
      Next_Score_Half == 'No_Score' ~ 0,
      Next_Score_Half == 'Opp_Safety' ~ -2,
      Next_Score_Half == 'Safety' ~ 2,
      Next_Score_Half == 'Opp_Field_Goal' ~ -3,
      Next_Score_Half == 'Field_Goal' ~ 3,
      Next_Score_Half == 'Opp_Touchdown' ~ -7,
      Next_Score_Half == 'Touchdown' ~ 7,
      TRUE ~ NA_real_
    ),
    pos_loc = case_when(
      posteam == HomeTeam ~ 1,
      posteam == AwayTeam ~ -1,
      TRUE ~ 0
    )
  )
2. Replicating the EP Model

I'm not sure if we can actually access the EP model that nflscrapR uses but we can attempt to rebuild it from init_ep_fg_models.R:
ep_model <- nnet::multinom(Next_Score_Half ~ TimeSecs_Remaining + yrdline100 +
                             down + log_ydstogo + GoalToGo + log_ydstogo*down + 
                             yrdline100*down + GoalToGo*log_ydstogo + 
                             Under_TwoMinute_Warning, data = pbp_ep_model_data,
                           weights = Total_W_Scaled, maxit = 300)
In order to evaluate how closely this EP model matches the one actually used, we'll compare the EP estimates in the pbp_ep_model_data versus the estimates our model generates. We will exclude field goals because the process for generating those estimates is a little more involved and they only represent about 3% of the dataset.
model_check_data <- pbp_ep_model_data %>%
  filter(PlayType != 'Field Goal', is.na(FieldGoalResult))

ep_model_pred <- data.frame(predict(ep_model, type = 'probs', model_check_data)) %>%
  mutate(
    our_ExpPts = (-2 * Opp_Safety) + (2 * Safety) + (-3 * Opp_Field_Goal) + (3 * Field_Goal) +
      (-7 * Opp_Touchdown) + (7 * Touchdown)
  )

model_check_data <- model_check_data %>%
  bind_cols(ep_model_pred %>% select(our_ExpPts))

MLmetrics::R2_Score(model_check_data$our_ExpPts, model_check_data$ExpPts)
[1] 0.9999999
So, yeah, our model is virtually the same as the one currently used.
2. Team Strength Estimates

There are many ways to build team strength estimates. Similar to my previous analysis on accumulated QB hits, let's use a linear mixed-effects regression (lmer) model.
First, we'll need data for the model. Let's aggregate a team's offensive and defensive EPA at a game level and create the factors for what will be our other input variables.
game_epa <- pbp_ep_model_data %>%
  filter(!is.na(EPA)) %>%
  group_by(Season, GameID, posteam, DefensiveTeam, pos_loc) %>%
  summarize(mean_epa = mean(EPA)) %>%
  mutate(
    game = as.factor(GameID),
    offense = as.factor(paste(Season, posteam, sep='_')),
    defense = as.factor(paste(Season, DefensiveTeam, sep='_')),
    pos_loc = as.factor(pos_loc)
  )
So, here is our fomula:
team_epa_formula <- mean_epa ~ pos_loc + (1|game) + (1|offense) + (1|defense)
offense and defense will estimate the strength of each team while game will, sort of, capture the variance in EPA for  each game. If otherwise prolific offenses underperform in a specific game (say, due to bad weather), game should capture some of that and "punish" the offense less for it. Now, we build the model and extract the random effects (i.e., the team strengths) from the model.
team_epa_fit <- lmer(team_epa_formula, game_epa)

ran <- ranef(team_epa_fit)

pos_str <- ran$offense %>%
  rename(pos_epa_str = '(Intercept)') %>%
  tibble::rownames_to_column(var = 'offense') %>%
  separate(offense, c('Season', 'posteam'), sep='_') %>%
  mutate(
    Season = as.numeric(Season),
    pos_epa_str = as.vector(scale(pos_epa_str))
    ) %>%
  arrange(-pos_epa_str)

def_str <- ran$defense %>%
  rename(def_epa_str = '(Intercept)') %>%
  tibble::rownames_to_column(var = 'defense') %>%
  separate(defense, c('Season', 'DefensiveTeam'), sep='_') %>%
  mutate(
    Season = as.numeric(Season),
    def_epa_str = as.vector(scale(def_epa_str))
    ) %>%
  arrange(def_epa_str)
Looking at the top and bottom defenses, I'd say these do a decent job of estimating the team strengths.
head(pos_str, 10)
#    Season posteam pos_epa_str
# 1    2016     ATL    2.830115
# 2    2011      GB    2.492116
# 3    2011      NO    2.431377
# 4    2013     DEN    2.296974
# 5    2011      NE    2.269525
# 6    2010      NE    2.204658
# 7    2012      NE    2.134231
# 8    2014      GB    1.812909
# 9    2013      SD    1.712631
# 10   2015     ARI    1.660256

tail(pos_str)
#     Season posteam pos_epa_str
# 252   2009     DET   -2.055390
# 253   2011     STL   -2.096951
# 254   2009     CLE   -2.142117
# 255   2009      TB   -2.253331
# 256   2010     CAR   -2.647988
# 257   2010     ARI   -2.664356

head(def_str, 10)
#    Season DefensiveTeam def_epa_str
# 1    2009           NYJ   -3.406461
# 2    2010           CHI   -2.462656
# 3    2012           CHI   -2.240408
# 4    2013           SEA   -2.132376
# 5    2010           PIT   -2.127570
# 6    2010            SD   -1.936222
# 7    2014           BUF   -1.825720
# 8    2012           ARI   -1.777001
# 9    2010           NYG   -1.750598
# 10   2013           CIN   -1.726135

tail(def_str)
#     Season DefensiveTeam def_epa_str
# 252   2010           DEN    1.856413
# 253   2010           HOU    1.888760
# 254   2016           CLE    1.914799
# 255   2013           ATL    2.066987
# 256   2014            NO    2.127412
# 257   2015            NO    2.267538
Finally, let's append them to our dataset.
pbp_ep_model_data <- pbp_ep_model_data %>%
  select(-pos_wpa_str, -def_wpa_str) %>% 
  left_join(pos_str, by = c('Season', 'posteam')) %>%
  left_join(def_str, by = c('Season', 'DefensiveTeam'))
3. Hypothesized Bias

So, what do I mean when I talk about biases in an EP model that result from the exclusion of team strength variables? I'll illustrate with a chart.
Let's take all second downs where the distance was less than or equal to 20 yards and plot the average offensive and defensive strengths:
library(ggplot2)

sec_dwn_data <- pbp_ep_model_data %>%
  filter(down == 2, GoalToGo == 0, ydstogo <= 20, !is.na(pos_epa_str), !is.na(def_epa_str)) %>%
  group_by(ydstogo) %>% summarize(
    off_mean=mean(pos_epa_str),
    def_mean=mean(def_epa_str)
  )

sec_dwn_plot <- ggplot(sec_dwn_data, aes(x = ydstogo)) + 
  geom_point(aes(y = off_mean, colour = 'off')) +
  geom_smooth(method = "lm", aes(y = off_mean, colour = 'off')) + 
  geom_point(aes(y = def_mean, colour = 'def')) +
  geom_smooth(method = "lm", aes(y = def_mean, colour = 'def')) +
  labs(
    title = 'Mean Team Strength by Distance - 2nd Down',
    x = 'Distance',
    y = 'Mean Strength'
    )

plot(sec_dwn_plot)

As the 2nd down distance increase, we see that the values for the mean offense and defense get lower. In other words, the longer the distance, the worse the offense and the better the defense.
Does the trend hold for third and fourth down?


Yep! Notice also that the mean values themselves get lower on each sucessive down.
One final plot of the mean team strengths for 1st and 10s by yardline:
yl_data <- pbp_ep_model_data %>%
  filter(down == 1, ydstogo == 10, !is.na(pos_epa_str), !is.na(def_epa_str)) %>%
  group_by(yrdline100) %>% summarize(
    off_mean=mean(pos_epa_str),
    def_mean=mean(def_epa_str)
  )

yl_plot <- ggplot(yl_data, aes(x = yrdline100)) + 
  geom_point(aes(y = off_mean, colour = 'off')) +
  geom_smooth(method = "lm", aes(y = off_mean, colour = 'off')) + 
  geom_point(aes(y = def_mean, colour = 'def')) +
  geom_smooth(method = "lm", aes(y = def_mean, colour = 'def')) +
  labs(
    title = 'Mean Team Strength by Yardline - 1st & 10',
    x = 'Yardline',
    y = 'Mean Strength'
  )

plot(yl_plot)

At this point, you might be thinking "Duh, better teams put themselves in more advantageous positions. Big deal."
Consider a hypothetical 3rd and 10 between an average strength offense and an average strength defense. In the pbp dataset used to train the model, the mean strengths of an offense and defense facing a 3rd and 10 are 0.0743 and -0.0893, values that are roughly equivalent to 46th percentile and 52nd percentile teams, respectively. In other words, a below average offense and an above average defense. However, because it is blind to team strength, the naive model will treat the average offense like it is below average and vice-versa for the defense, leading to an inflated EP value. Conversely, in more offensively advantageous down and distances, it will deflate the EP.
At least, that is the hypothesis.
4. Informed Model

For the informed model, we'll just add the team strengths (pos_epa_str and def_epa_str) and HFA indicator (pos_loc) to the previous formula with no interactions. Run the new model.
new_ep_model <- nnet::multinom(Next_Score_Half ~ TimeSecs_Remaining + yrdline100 + 
                                 down + log_ydstogo + GoalToGo + log_ydstogo*down + 
                                 yrdline100*down + GoalToGo*log_ydstogo + 
                                 Under_TwoMinute_Warning + pos_epa_str + def_epa_str +
                                 pos_loc, data = pbp_ep_model_data, 
                               weights = Total_W_Scaled, maxit = 300)
Now, let's see how it compares to our original (naive) model.
new_model_check_data <- pbp_ep_model_data %>%
  filter(PlayType != 'Field Goal', is.na(FieldGoalResult))

new_ep_model_pred <- data.frame(predict(new_ep_model, type = 'probs', new_model_check_data)) %>%
  mutate(
    new_ExpPts = (-2 * Opp_Safety) + (2 * Safety) + (-3 * Opp_Field_Goal) + (3 * Field_Goal) +
      (-7 * Opp_Touchdown) + (7 * Touchdown)
  )

new_model_check_data <- new_model_check_data %>%
  bind_cols(new_ep_model_pred %>% select(new_ExpPts))

MLmetrics::R2_Score(new_model_check_data$new_ExpPts, new_model_check_data$ExpPts)
# [1] 0.8348314
An R2 of 0.83 suggest there is some variance between the two models. If you want to visualize the variance:
mod_var <- ggplot(new_model_check_data, aes(x=new_ExpPts, y=ExpPts) ) +
  geom_bin2d(bins = 100) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

plot(mod_var)

So we know it's different. Now, let's see if it is better by comparing the ActPts variable we created at the beginning to each model's EP predictions. We'll compare using the RMSE metric.
MLmetrics::RMSE(
  new_model_check_data %>% filter(!is.na(ExpPts)) %>% pull(ExpPts),
  new_model_check_data %>% filter(!is.na(ExpPts)) %>% pull(ActPts)
)
# [1] 4.688036

MLmetrics::RMSE(
  new_model_check_data %>% filter(!is.na(new_ExpPts)) %>% pull(new_ExpPts),
  new_model_check_data %>% filter(!is.na(new_ExpPts)) %>% pull(ActPts)
)
# [1] 4.625228
A lower RMSE indicates a better fit for the informed model.
5. Resetting the Baseline

So, we have a better model but there is a problem with using before making any other changes. Recall @nflscrapR's reasoning on leaving out team strengths in the first place:

[The model is] creating an average baseline for which EPA calculations can then be attributed to players in some manner (then you can adjust for home field at that point) - one could make an EP mode taking into account home, team strength etc - different purpose

With no further changes, players (and teams) would no longer be working from the same, average baseline. Every team would be working on a different baseline derived from their team strength.
To visualize this, let's aggregate each offense's old and new EP by season, calculate the difference in those two values, and compare the difference to each offense's strength.
ep_compare <- new_model_check_data %>%
  filter(!is.na(ExpPts)) %>%
  group_by(Season, posteam) %>%
  summarize(
    mean_ep = mean(ExpPts),
    mean_new_ep = mean(new_ExpPts)
    ) %>%
  left_join(pos_str, by = c('Season', 'posteam')) %>%
  mutate(diff_ExpPts = mean_new_ep - mean_ep)

ep_diff <- ggplot(ep_compare, aes(x=pos_epa_str, y=diff_ExpPts)) +
  geom_point() +
  ggtitle('Change in Offensive Expected Points by Team Strength')

plot(ep_diff)

We can see a pretty clear trend that as an offense gets better, their EP inflates.
So, again, if we use this method without making any other changes, our baseline for each team will no longer be average and we won't be able to use EP in the way that it is often used by analysts. The good news is: there is a simple change we can make to reset the baseline for each team.
But the change is not to the model itself, it is to the dataset that the model will be scoring. The baseline EP for each team were only different because, in the dataset that the model scored, we told it how good each team was. If we "trick" the model into thinking each team is average, it should put them back on the same, average baseline again.
Let's try it out, starting with resetting each team's strength to zero.
pbp_ep_model_data_reset <- pbp_ep_model_data %>%
  filter(PlayType != 'Field Goal', is.na(FieldGoalResult)) %>%
  mutate(
    pos_epa_str = 0,
    def_epa_str = 0
  )
Next, score this dataset with our new EP model and compare with the original EP values.
new_ep_model_pred_reset <- data.frame(predict(new_ep_model, type = 'probs', pbp_ep_model_data_reset)) %>%
  mutate(
    reset_ExpPts = (-2 * Opp_Safety) + (2 * Safety) + (-3 * Opp_Field_Goal) + (3 * Field_Goal) +
      (-7 * Opp_Touchdown) + (7 * Touchdown)
  )

pbp_ep_model_data_reset <- pbp_ep_model_data_reset %>%
  bind_cols(new_ep_model_pred_reset %>% select(reset_ExpPts))

MLmetrics::R2_Score(pbp_ep_model_data_reset$reset_ExpPts, pbp_ep_model_data_reset$ExpPts)
# [1] 0.9816411
So, our EP values are a lot closer to the original values but a little bit of variance remains. Visualizing what that variance looks like:
reset_ep <- ggplot(pbp_ep_model_data_reset, aes(x=reset_ExpPts, y=ExpPts)) +
  geom_bin2d(bins = 100) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

plot(reset_ep)

We can also compare how our reset predictions compare to the original predictions.
MLmetrics::RMSE(
  pbp_ep_model_data_reset %>% filter(!is.na(ExpPts)) %>% pull(ExpPts),
  pbp_ep_model_data_reset %>% filter(!is.na(ExpPts)) %>% pull(ActPts)
)
# [1] 4.688036

MLmetrics::RMSE(
  pbp_ep_model_data_reset %>% filter(!is.na(reset_ExpPts)) %>% pull(reset_ExpPts),
  pbp_ep_model_data_reset %>% filter(!is.na(reset_ExpPts)) %>% pull(ActPts)
)
# [1] 4.681047
So, our reset EP actually gets closer to the eventual points.
For one last point of comparison, let's revisit the change in EP versus team strength plot for our reset EP values.
*Note: For some reason, there are only 75 plays for the Jacksonville Jaguar offense in the pbp dataset. Due to their extremely small sample size relative to the other teams (every other team has 1000+ plays), their value is an extremely high outlier and skews the graph. As a result, I am going to exclude their value so we can better visualize the result.
ep_reset_compare <- pbp_ep_model_data_reset %>%
  filter(!is.na(ExpPts)) %>%
  group_by(Season, posteam) %>%
  summarize(
    mean_ep = mean(ExpPts),
    mean_reset_ep = mean(reset_ExpPts)
  ) %>%
  left_join(pos_str, by = c('Season', 'posteam')) %>%
  mutate(diff_ExpPts = mean_reset_ep - mean_ep)

ep_reset_diff <- ggplot(
  ep_reset_compare %>% filter(Season != 2016 | posteam != 'JAC'),
  aes(x=pos_epa_str, y=diff_ExpPts)
  ) +
  geom_point() +
  ggtitle('Change in Offensive Expected Points by Team Strength - Reset EP')

plot(ep_reset_diff)

Compared to the previous plot of the change in EP, this is virtually random, suggesting that resetting our baseline with this method worked.
6. Conclusion

Given the added predictive value of the team-strength (and HFA) informed model and the success of the technique to reset the baseline for each team, I would recommend the nflscrapR EP model adopt a similar methodology.

  
## utils.R
library(tidyverse)
library(nflWAR)

find_game_next_score_half <- function(pbp_dataset) {

  # Which rows are the scoring plays:
  score_plays <- which(pbp_dataset$sp == 1 & pbp_dataset$PlayType != "No Play")

  # Define a helper function that takes in the current play index,
  # a vector of the scoring play indices, play-by-play data,
  # and returns the score type and drive number for the next score:
  find_next_score <- function(play_i, score_plays_i,pbp_df) {

    # Find the next score index for the current play
    # based on being the first next score index:
    next_score_i <- score_plays_i[which(score_plays_i >= play_i)[1]]

    # If next_score_i is NA (no more scores after current play)
    # or if the next score is in another half,
    # then return No_Score and the current drive number
    if (is.na(next_score_i) |
        (pbp_df$qtr[play_i] %in% c(1, 2) & pbp_df$qtr[next_score_i] %in% c(3, 4, 5)) |
        (pbp_df$qtr[play_i] %in% c(3, 4) & pbp_df$qtr[next_score_i] == 5)) {

      score_type <- "No_Score"

      # Make it the current play index
      score_drive <- pbp_df$Drive[play_i]

      # Else return the observed next score type and drive number:
    } else {

      # Store the score_drive number
      score_drive <- pbp_df$Drive[next_score_i]

      # Then check the play types to decide what to return
      # based on several types of cases for the next score:

      # 1: Return TD
      if (identical(pbp_df$ReturnResult[next_score_i], "Touchdown")) {

        # For return touchdowns the current posteam would not have
        # possession at the time of return, so it's flipped:
        if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

          score_type <- "Opp_Touchdown"

        } else {

          score_type <- "Touchdown"

        }
      } else if (identical(pbp_df$FieldGoalResult[next_score_i], "Good")) {

        # 2: Field Goal
        # Current posteam made FG
        if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

          score_type <- "Field_Goal"

          # Opponent made FG
        } else {

          score_type <- "Opp_Field_Goal"

        }

        # 3: Touchdown (returns already counted for)
      } else if (pbp_df$Touchdown[next_score_i] == 1) {

        # Current posteam TD
        if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

          score_type <- "Touchdown"

          # Opponent TD
        } else {

          score_type <- "Opp_Touchdown"

        }
        # 4: Safety (similar to returns)
      } else if (pbp_df$Safety[next_score_i] == 1) {

        if (identical(pbp_df$posteam[play_i],pbp_df$posteam[next_score_i])) {

          score_type <- "Opp_Safety"

        } else {

          score_type <- "Safety"

        }
        # 5: Extra Points
      } else if (identical(pbp_df$ExPointResult[next_score_i], "Made")) {

        # Current posteam Extra Point
        if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

          score_type <- "Extra_Point"

          # Opponent Extra Point
        } else {

          score_type <- "Opp_Extra_Point"

        }
        # 6: Two Point Conversions
      } else if (identical(pbp_df$TwoPointConv[next_score_i], "Success")) {

        # Current posteam Two Point Conversion
        if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

          score_type <- "Two_Point_Conversion"

          # Opponent Two Point Conversion
        } else {

          score_type <- "Opp_Two_Point_Conversion"

        }

        # 7: Defensive Two Point (like returns)
      } else if (identical(pbp_df$DefTwoPoint[next_score_i], "Success")) {

        if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

          score_type <- "Opp_Defensive_Two_Point"

        } else {

          score_type <- "Defensive_Two_Point"

        }

        # 8: Errors of some sort so return NA (but shouldn't take place)
      } else {

        score_type <- NA

      }
    }

    return(data.frame(Next_Score_Half = score_type,
                      Drive_Score_Half = score_drive))
  }

  # Using lapply and then bind_rows is much faster than
  # using map_dfr() here:
  lapply(c(1:nrow(pbp_dataset)), find_next_score,
         score_plays_i = score_plays, pbp_df = pbp_dataset) %>%
    bind_rows() %>%
    return
}


get_ep_model_pbp_data <- function(min_season=2009, max_season=2016){
  pbp_data <- get_pbp_data(min_season:max_season)

  pbp_data <- pbp_data %>% filter(GameID != "2011121101")

  # Apply to each game (ignore the warning messages here):
  pbp_next_score_half <- map_dfr(unique(pbp_data$GameID),
                                 function(x) {
                                   pbp_data %>%
                                     filter(GameID == x) %>%
                                     find_game_next_score_half()
                                 })

  # Join to the pbp_data:
  pbp_data_next_score <- bind_cols(pbp_data, pbp_next_score_half)

  # Create the EP model dataset that only includes plays with basic seven
  # types of next scoring events along with the following play types:
  # Field Goal, No Play, Pass, Punt, Run, Sack, Spike

  pbp_ep_model_data <- pbp_data_next_score %>%
    filter(Next_Score_Half %in% c("Opp_Field_Goal", "Opp_Safety", "Opp_Touchdown",
                                  "Field_Goal", "No_Score", "Safety", "Touchdown") &
             PlayType %in% c("Field Goal", "No Play", "Pass", "Punt", "Run", "Sack",
                             "Spike") & is.na(TwoPointConv) & is.na(ExPointResult) &
             !is.na(down) & !is.na(TimeSecs))

  nrow(pbp_ep_model_data)
  # 304805

  # Now adjust and create the model variables:
  pbp_ep_model_data <- pbp_ep_model_data %>%

    # Reference level should be No_Score:
    mutate(Next_Score_Half = fct_relevel(factor(Next_Score_Half), "No_Score"),

           # Create a variable that is time remaining until end of half:
           # (only working with up to 2016 data so can ignore 2017 time change)
           TimeSecs_Remaining = as.numeric(ifelse(qtr %in% c(1,2), TimeSecs - 1800,
                                                  ifelse(qtr == 5, TimeSecs + 900,
                                                         TimeSecs))),

           # log transform of yards to go and indicator for two minute warning:
           log_ydstogo = log(ydstogo),
           Under_TwoMinute_Warning = ifelse(TimeSecs_Remaining < 120, 1, 0),

           # Changing down into a factor variable:
           down = factor(down),

           # Calculate the drive difference between the next score drive and the
           # current play drive:
           Drive_Score_Dist = Drive_Score_Half - Drive,

           # Create a weight column based on difference in drives between play and next score:
           Drive_Score_Dist_W = (max(Drive_Score_Dist) - Drive_Score_Dist) /
             (max(Drive_Score_Dist) - min(Drive_Score_Dist)),
           # Create a weight column based on score differential:
           ScoreDiff_W = (max(abs(ScoreDiff)) - abs(ScoreDiff)) /
             (max(abs(ScoreDiff)) - min(abs(ScoreDiff))),
           # Add these weights together and scale again:
           Total_W = Drive_Score_Dist_W + ScoreDiff_W,
           Total_W_Scaled = (Total_W - min(Total_W)) /
             (max(Total_W) - min(Total_W)))

  return(pbp_ep_model_data)
}
	library(tidyverse)
	library(nflWAR)

	find_game_next_score_half <- function(pbp_dataset) {

	# Which rows are the scoring plays:
	score_plays <- which(pbp_dataset$sp == 1 & pbp_dataset$PlayType != "No Play")

	# Define a helper function that takes in the current play index,
	# a vector of the scoring play indices, play-by-play data,
	# and returns the score type and drive number for the next score:
	find_next_score <- function(play_i, score_plays_i,pbp_df) {

	# Find the next score index for the current play
	# based on being the first next score index:
	next_score_i <- score_plays_i[which(score_plays_i >= play_i)[1]]

	# If next_score_i is NA (no more scores after current play)
	# or if the next score is in another half,
	# then return No_Score and the current drive number
	if (is.na(next_score_i) \|
	(pbp_df$qtr[play_i] %in% c(1, 2) & pbp_df$qtr[next_score_i] %in% c(3, 4, 5)) \|
	(pbp_df$qtr[play_i] %in% c(3, 4) & pbp_df$qtr[next_score_i] == 5)) {

	score_type <- "No_Score"

	# Make it the current play index
	score_drive <- pbp_df$Drive[play_i]

	# Else return the observed next score type and drive number:
	} else {

	# Store the score_drive number
	score_drive <- pbp_df$Drive[next_score_i]

	# Then check the play types to decide what to return
	# based on several types of cases for the next score:

	# 1: Return TD
	if (identical(pbp_df$ReturnResult[next_score_i], "Touchdown")) {

	# For return touchdowns the current posteam would not have
	# possession at the time of return, so it's flipped:
	if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

	score_type <- "Opp_Touchdown"

	} else {

	score_type <- "Touchdown"

	}
	} else if (identical(pbp_df$FieldGoalResult[next_score_i], "Good")) {

	# 2: Field Goal
	# Current posteam made FG
	if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

	score_type <- "Field_Goal"

	# Opponent made FG
	} else {

	score_type <- "Opp_Field_Goal"

	}

	# 3: Touchdown (returns already counted for)
	} else if (pbp_df$Touchdown[next_score_i] == 1) {

	# Current posteam TD
	if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

	score_type <- "Touchdown"

	# Opponent TD
	} else {

	score_type <- "Opp_Touchdown"

	}
	# 4: Safety (similar to returns)
	} else if (pbp_df$Safety[next_score_i] == 1) {

	if (identical(pbp_df$posteam[play_i],pbp_df$posteam[next_score_i])) {

	score_type <- "Opp_Safety"

	} else {

	score_type <- "Safety"

	}
	# 5: Extra Points
	} else if (identical(pbp_df$ExPointResult[next_score_i], "Made")) {

	# Current posteam Extra Point
	if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

	score_type <- "Extra_Point"

	# Opponent Extra Point
	} else {

	score_type <- "Opp_Extra_Point"

	}
	# 6: Two Point Conversions
	} else if (identical(pbp_df$TwoPointConv[next_score_i], "Success")) {

	# Current posteam Two Point Conversion
	if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

	score_type <- "Two_Point_Conversion"

	# Opponent Two Point Conversion
	} else {

	score_type <- "Opp_Two_Point_Conversion"

	}

	# 7: Defensive Two Point (like returns)
	} else if (identical(pbp_df$DefTwoPoint[next_score_i], "Success")) {

	if (identical(pbp_df$posteam[play_i], pbp_df$posteam[next_score_i])) {

	score_type <- "Opp_Defensive_Two_Point"

	} else {

	score_type <- "Defensive_Two_Point"

	}

	# 8: Errors of some sort so return NA (but shouldn't take place)
	} else {

	score_type <- NA

	}
	}

	return(data.frame(Next_Score_Half = score_type,
	Drive_Score_Half = score_drive))
	}

	# Using lapply and then bind_rows is much faster than
	# using map_dfr() here:
	lapply(c(1:nrow(pbp_dataset)), find_next_score,
	score_plays_i = score_plays, pbp_df = pbp_dataset) %>%
	bind_rows() %>%
	return
	}



	get_ep_model_pbp_data <- function(min_season=2009, max_season=2016){
	pbp_data <- get_pbp_data(min_season:max_season)

	pbp_data <- pbp_data %>% filter(GameID != "2011121101")

	# Apply to each game (ignore the warning messages here):
	pbp_next_score_half <- map_dfr(unique(pbp_data$GameID),
	function(x) {
	pbp_data %>%
	filter(GameID == x) %>%
	find_game_next_score_half()
	})

	# Join to the pbp_data:
	pbp_data_next_score <- bind_cols(pbp_data, pbp_next_score_half)

	# Create the EP model dataset that only includes plays with basic seven
	# types of next scoring events along with the following play types:
	# Field Goal, No Play, Pass, Punt, Run, Sack, Spike

	pbp_ep_model_data <- pbp_data_next_score %>%
	filter(Next_Score_Half %in% c("Opp_Field_Goal", "Opp_Safety", "Opp_Touchdown",
	"Field_Goal", "No_Score", "Safety", "Touchdown") &
	PlayType %in% c("Field Goal", "No Play", "Pass", "Punt", "Run", "Sack",
	"Spike") & is.na(TwoPointConv) & is.na(ExPointResult) &
	!is.na(down) & !is.na(TimeSecs))

	nrow(pbp_ep_model_data)
	# 304805

	# Now adjust and create the model variables:
	pbp_ep_model_data <- pbp_ep_model_data %>%

	# Reference level should be No_Score:
	mutate(Next_Score_Half = fct_relevel(factor(Next_Score_Half), "No_Score"),

	# Create a variable that is time remaining until end of half:
	# (only working with up to 2016 data so can ignore 2017 time change)
	TimeSecs_Remaining = as.numeric(ifelse(qtr %in% c(1,2), TimeSecs - 1800,
	ifelse(qtr == 5, TimeSecs + 900,
	TimeSecs))),

	# log transform of yards to go and indicator for two minute warning:
	log_ydstogo = log(ydstogo),
	Under_TwoMinute_Warning = ifelse(TimeSecs_Remaining < 120, 1, 0),

	# Changing down into a factor variable:
	down = factor(down),

	# Calculate the drive difference between the next score drive and the
	# current play drive:
	Drive_Score_Dist = Drive_Score_Half - Drive,

	# Create a weight column based on difference in drives between play and next score:
	Drive_Score_Dist_W = (max(Drive_Score_Dist) - Drive_Score_Dist) /
	(max(Drive_Score_Dist) - min(Drive_Score_Dist)),
	# Create a weight column based on score differential:
	ScoreDiff_W = (max(abs(ScoreDiff)) - abs(ScoreDiff)) /
	(max(abs(ScoreDiff)) - min(abs(ScoreDiff))),
	# Add these weights together and scale again:
	Total_W = Drive_Score_Dist_W + ScoreDiff_W,
	Total_W_Scaled = (Total_W - min(Total_W)) /
	(max(Total_W) - min(Total_W)))

	return(pbp_ep_model_data)
	}