guga31bb/qb hits.md

## qb hits.md

      
The accumulation of QB hits vs passing efficiency

Ben Baldwin
In a follow-up to his excellent piece on the value of the run game in The Athletic (great website, highly recommended), Ted Nguyen shared the following:
"In-house NFL analytics crews track QB hits and the results of the accumulation of hits and how it affects offensive performance over the course of a game."
Does the accumulation of hits affect offensive performance over the game? Is this finally a feather in the cap for the run game defenders?
Because QB hits are tracked by the NFL, we can investigate this ourselves. Let's dive in.
1. Get the data

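library(tidyverse)  # also attaches dplyr and ggplot2, so no separate library(dplyr) is needed
library(na.tools)

# "FILENAME" is a placeholder for the folder holding the saved play-by-play file
pbp_all_rp <- readRDS("FILENAME/pbp_rp.rds")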
As a starting point, I'm using the saved dataset of rush plays and pass plays that I created in this tutorial.
Let's make sure the qb_hit variable includes penalty plays:
pbp_all_rp %>%
  filter(pass == 1, qb_hit == 1, play_type == "no_play") %>%
  select(desc, qb_hit) %>%
  head()
  
1 (13:14) (Shotgun) J.Cutler pass incomplete short right to D.Hester [N.Eason]. PENALTY on CHI~      1
2 (2:54) S.Hill pass incomplete short right to Unidentified [R.Edwards]. PENALTY on SF-S.Hill,~      1
3 (:51) (Shotgun) T.Edwards pass incomplete short right [S.Ellis]. PENALTY on BUF-T.Edwards, I~      1
4 (6:53) T.Romo pass incomplete short middle [J.Beason]. PENALTY on DAL-T.Romo, Intentional Gr~      1
5 (7:11) (Shotgun) J.Flacco pass incomplete short left [G.Guyton]. PENALTY on BAL-J.Flacco, In~      1
6 (12:40) (Shotgun) M.Sanchez pass incomplete [R.Starks]. PENALTY on NYJ-M.Sanchez, Intentiona~      1
Nice. So we don't need to mess with anything (note: QB hits are denoted in the play description by players in [brackets], as seen above).
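As a quick illustration of the bracket convention, here is a minimal sketch in base R that flags descriptions containing a bracketed name and pulls the name out. The first description is the start of the Cutler play shown above; the second is a made-up play with no hit, and the variable names `has_bracket` and `hitters` are just for illustration.

```r
# Two play descriptions: one with a bracketed hitter, one without.
desc <- c(
  "(13:14) (Shotgun) J.Cutler pass incomplete short right to D.Hester [N.Eason].",
  "(10:02) J.Cutler pass complete short left to D.Hester."  # made-up play, no hit
)

# TRUE when the description contains an opening bracket
has_bracket <- grepl("[", desc, fixed = TRUE)

# Pull out the text between the brackets for the flagged plays
hitters <- sub(".*\\[(.*)\\].*", "\\1", desc[has_bracket])

print(has_bracket)
print(hitters)
```

On the real data, one could compare a flag like `has_bracket` against `qb_hit` to spot-check that the two agree.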
2. Calculate total hits and cumulative hits

Now we need to create two variables: (1) QB hits taken up to the current point in the game and (2) total QB hits taken in the game. I'll also filter out run plays.
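hits_data <- pbp_all_rp %>%
  filter(pass == 1) %>%             # keep dropbacks only
  group_by(posteam, game_id) %>%    # one group per offense per game
  mutate(
    cum_hits = cumsum(qb_hit),      # hits taken so far, including the current play
    total_hits = sum(qb_hit)        # hits taken over the whole game
  ) %>%
  ungroup()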
I'm grouping by team (posteam), which isn't quite perfect. If a team has to switch quarterbacks mid-game, then the count of hits won't be accurate for the second quarterback. But because these situations are so rare, it shouldn't matter in the aggregate.
The variable cum_hits is created using cumsum, which totals up how many QB hits a team has suffered to that point in the game. And total_hits just sums up the total number of hits over the whole game. I'm kind of amazed at how easy this is to do in R.
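To see what the two variables look like, here is a small sketch on made-up data (two games, seven dropbacks). The tibble `toy` and its values are invented for illustration; only the `group_by` / `cumsum` / `sum` pattern mirrors the code above.

```r
library(dplyr)

# Made-up dropbacks: four for CLE in game 1, three for BAL in game 2
toy <- tibble(
  game_id = c(1, 1, 1, 1, 2, 2, 2),
  posteam = c("CLE", "CLE", "CLE", "CLE", "BAL", "BAL", "BAL"),
  qb_hit  = c(0, 1, 0, 1, 1, 0, 0)
)

toy_hits <- toy %>%
  group_by(posteam, game_id) %>%
  mutate(
    cum_hits = cumsum(qb_hit),   # running count within each team-game
    total_hits = sum(qb_hit)     # same value repeated on every row of the team-game
  ) %>%
  ungroup()

print(toy_hits)
```

The running count resets for each team-game, while `total_hits` is constant within one.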
Now let's see how total_hits affects EPA per dropback at the game level:
hits_data %>%
  group_by(total_hits) %>%
  summarize(mean_epa = mean(epa), games = n_distinct(game_id, posteam))
  
   total_hits mean_epa games
        <int>    <dbl> <int>
 1          0  0.256     120
 2          1  0.214     354
 3          2  0.166     596
 4          3  0.112     762
 5          4  0.0758    771
 6          5  0.0353    754
 7          6  0.0202    557
 8          7 -0.00527   428
 9          8 -0.0383    298
10          9 -0.0681    192
11         10 -0.0463    116
12         11 -0.0795     81
13         12 -0.162      38
14         13 -0.124      30
15         14 -0.0417     11
16         15 -0.242       6
17         16 -0.184       4
Wow, the most efficient games are most decidedly the ones in which a QB isn't hit often!
3. Make sure the data are sound

I was surprised that there have been so many games where a QB was never hit (120, the first row above). Initially I thought I did something wrong, but it checks out. Let's make sure we can replicate the official NFL data. I'm going to look at the later stage of Cleveland's season because I know that's where some of the 0-hit games come from.
hits_data %>%
  filter(posteam == "CLE" & season == 2018 & week >= 10) %>%
  group_by(week) %>%
  summarize(hits = mean(total_hits), mean_epa = mean(epa))
  
     week  hits mean_epa
  <int> <dbl>    <dbl>
1    10     0   0.711 
2    12     1   0.792 
3    13     1  -0.0750
4    14     1  -0.0370
5    15     3  -0.0217
6    16     0   0.366 
7    17     2   0.217 
Now compare to the official stats (with thanks to SportRadar):

Boom! A perfect match!
4. Some final cleaning up

Returning to the relationship between hits and EPA per dropback, case closed, right? Games with fewer hits have higher EPA per dropback. Well, not so fast. This is picking up, in part, a game script effect, where overmatched teams fall behind early and are forced to pass a lot, resulting in their QB being hit more often.
So we want to create a level playing field. To do this, let's take teams with a given number of hits and see how the number of accumulated hits affects passing efficiency, holding the total number of hits received in the game constant. There are a number of other ways we could have approached this -- looking at plays within some range of win probability or score differential, for example -- but I think this is a nice illustration.
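For reference, the win probability alternative mentioned above might look something like the sketch below. It assumes the play-by-play data carry a `wp` (win probability) column, as nflscrapR output does; the 20-80% window and the toy values are arbitrary choices for illustration, not something used in this analysis.

```r
library(dplyr)

# Made-up plays with a win probability, hits so far, and EPA
toy <- tibble(
  wp       = c(0.05, 0.35, 0.50, 0.65, 0.95),
  cum_hits = c(4, 1, 2, 1, 0),
  epa      = c(-0.5, 0.25, 0.0, 0.25, 1.0)
)

# Keep only plays where the game is still reasonably competitive
neutral <- toy %>%
  filter(wp >= 0.20, wp <= 0.80)

# Then summarize efficiency by accumulated hits, as in the main analysis
by_hits <- neutral %>%
  group_by(cum_hits) %>%
  summarize(avg_epa = mean(epa), plays = n())

print(by_hits)
```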
hits_data <- hits_data %>%
  mutate(
    # Bin games by total hits taken in the game
    hit_in_game = case_when(
      total_hits <= 1 ~ "0-1",
      total_hits <= 3 ~ "2-3",
      total_hits <= 5 ~ "4-5",
      total_hits <= 7 ~ "6-7",
      total_hits <= 9 ~ "8-9",
      total_hits > 9 ~ "10+"
    ) %>%
      # Levels must match the labels above exactly; a level of "0" would turn the "0-1" bin into NA
      factor(levels = c("0-1", "2-3", "4-5", "6-7", "8-9", "10+"))
  )
Above, we've created some BINS based on how often a quarterback is hit in a game (the factor(levels... part at the end isn't strictly necessary, but allows the legend to display in the right order later on).
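The same kind of binning can also be done in one step with base R's `cut()`, which returns a factor whose levels are already in the order given. This is an alternative sketch, not the approach used above; the vector `total_hits` here is made up.

```r
# Made-up hit totals covering the edges of each bin
total_hits <- c(0, 1, 2, 3, 9, 10, 16)

# Intervals are right-closed by default: (-Inf,1], (1,3], (3,5], (5,7], (7,9], (9,Inf]
hit_bin <- cut(
  total_hits,
  breaks = c(-Inf, 1, 3, 5, 7, 9, Inf),
  labels = c("0-1", "2-3", "4-5", "6-7", "8-9", "10+")
)

print(data.frame(total_hits, hit_bin))
print(levels(hit_bin))
```

Because the labels are supplied once, alongside the breaks, there is no separate list of levels to keep in sync.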
Now we can group by our bins, along with how many hits a QB has taken up to that point in a given game.
chart <- hits_data %>%
  group_by(hit_in_game, cum_hits) %>%
  summarize(avg_epa = mean(epa), plays = n())
5. Make the graph

Now all that's left to do is plot the data (with a huge thanks to R genius Josh Hornsby for helping make this look good).
chart %>%
  # Drop the 0-hit point and the very rare high counts
  filter(cum_hits > 0 & cum_hits <= 12 & !is.na(hit_in_game)) %>%
  ggplot(aes(x = cum_hits, y = avg_epa, color = hit_in_game, shape = hit_in_game)) +
  geom_jitter(aes(x = cum_hits, y = avg_epa, fill = hit_in_game),
              shape = 21, stroke = 1.25, size = 4, width = 0.1, show.legend = FALSE) +
  # One linear fit per bin of total hits
  geom_smooth(method = lm, se = FALSE) +
  theme_minimal() +
  theme(
    legend.position = c(0.99, 0.99),
    legend.justification = c(1, 1),
    plot.title = element_text(size = 16, hjust = 0.5),
    panel.grid.minor = element_blank()
  ) +
  ggsci::scale_color_locuszoom(name = "Total Hits\nIn-Game") +
  scale_y_continuous(name = "EPA Per Dropback", breaks = scales::pretty_breaks(n = 5)) +
  scale_x_continuous(breaks = 0:50, name = "Cumulative QB Hits Suffered In Game So Far") +
  labs(title = "QB hits versus QB efficiency", caption = "Data from nflscrapR")
    
ggsave('FILENAME.png', dpi=1000)


Well then. The negative relationship between QB hits and efficiency arises because only the teams that get hit often ever reach the high hit counts. Stated this way, it sounds obvious, but it's important. These teams aren't necessarily inefficient because their QBs are getting hit a lot; rather, their QBs are getting hit a lot because they're bad teams to begin with.
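A tiny made-up example shows how this selection effect can produce a pooled decline even when hits do nothing within a game. Everything below is invented for illustration: three games with 2, 6, and 10 total hits, each with a constant EPA per dropback that never changes as hits accumulate.

```r
library(dplyr)

# Made-up games: worse offenses (lower EPA) also take more total hits
games <- data.frame(total_hits = c(2, 6, 10), epa = c(0.2, 0.0, -0.2))

# One row per game and cumulative-hit level that game reaches
toy <- do.call(rbind, lapply(seq_len(nrow(games)), function(i) {
  data.frame(
    total_hits = games$total_hits[i],
    cum_hits   = seq_len(games$total_hits[i]),
    epa        = games$epa[i]
  )
}))

# Pooled across games: average EPA falls as cum_hits rises,
# because only the weak offenses remain at high hit counts
pooled <- toy %>%
  group_by(cum_hits) %>%
  summarize(mean_epa = mean(epa))

# Within each game: EPA does not vary with cum_hits at all
within_game <- toy %>%
  group_by(total_hits) %>%
  summarize(epa_range = max(epa) - min(epa))

print(pooled)
print(within_game)
```

Holding total hits fixed, as the chart does with its bins, removes the apparent decline in this toy setup.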
Side note: I'm not showing 0 hits because there's a mechanical relationship between QB hits and efficiency at that point. It is the one x-axis point whose plays contain no QB hits, by definition, so of course EPA per play is higher there: it's a comparison of a set of plays with no QB hits to other sets of plays that include QB hits. I also truncated the x-axis at 12 hits because anything higher is extremely rare.
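The mechanical point follows from how `cumsum` counts: the running total includes the current play, so a play on which the QB is hit can never sit in the 0 bucket. A minimal sketch with a made-up sequence of dropbacks:

```r
# Made-up sequence of dropbacks: 1 = QB was hit on the play
qb_hit <- c(0, 0, 1, 0, 1, 0)

# Running total includes the current play
cum_hits <- cumsum(qb_hit)

# Share of plays in each cum_hits bucket on which the QB was hit
hit_share <- tapply(qb_hit, cum_hits, mean)

print(cum_hits)
print(hit_share)
```

The 0 bucket has a hit share of exactly zero, while every later bucket contains at least one hit play.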
6. Wrapping up

Letting your QB get hit is bad. Obviously. Teams that allow more hits are less likely to have efficient offenses. But for a given level of hits, there is no evidence that the accumulation of hits makes any difference throughout the course of a game. The evidence suggests that we've found a variation of Brian Burke's "passing paradox":

As with the Rule of 53, the NFL appears to have drawn the wrong conclusions from a correlation driven by game state.