Introduction to College Football Data with R and cfbscrapR
title: Introduction to College Football Analytics with R
author: Parker Fleming @statsowar
date: 1/10/2020
output: pdf_document (default), html_document (default)

Hello, friends.

You, like me, might be a big college football fan who is interested in statistics and analytics, wanting to get into the game for yourself. This simple tutorial is just the way to do that!

First of all, I need to thank a few people who inspired this document and made it possible. First, @CFB_Data, the proprietor of CollegeFootballData.com. Second, @903124s, an amazing data scientist and Twitter follow who has been very helpful in my understanding and investigation of sports analytics. Third, I have to thank and credit @benbaldwin, who wrote a great nflscrapr tutorial (you should go through that one if you're learning R!) and has been engaging and helpful on Twitter.

Fourth, I'd like to thank @msubbaiah1 and Chad Peltier, both talented data scientists, who have been immensely helpful in collaborating on and troubleshooting data issues and have also made the college football season much more fun for me, along with the plethora of other friends on Twitter whose conversation, feedback, and questions have helped me learn.

We'll do three things in this document:

  1. Aggregate data from past seasons and retrieve current data in-season.
  2. Clean the data, define success rate.
  3. Learn some basics of visualization.

I hope you enjoy! Know that there is a learning curve to all this! It's difficult, and at times can be frustrating. The best way to learn this stuff is to play around with it, understand the syntax, and beat your head against the wall for a little bit until it works. Feel free to reach out and ask questions on Twitter!

-Parker Fleming | @statsowar

Setting up R and R Studio

First things first, you need to install R and download R Studio. I'll direct you to this link. Go do that now, then come back.

Now that we have R and R Studio all put together, it's time to get organized. That's often the first step of any project working with data, and the more organized you can be, the easier a time you'll have trying to wrangle and analyze the data.

Notice there are four quadrants to your R Studio setup: the top right is the Environment tab, which shows you the data, values, and functions you currently have loaded. The bottom right is where you'll see your charts and graphs in the viewer or plots tab. The bottom left is the console, where output goes. When you execute code, the console responds and shows you some stuff down there. The top left is the place you'll live; it's where you can edit scripts. We'll load a new project and then get started.

In the top right of your R Studio window, there will be a small dropdown menu that says "None". Click it, and select "New Project." Title your new folder something like "CFB Analysis".

This takes care of setting your working directory, so everything you create and output will be saved in that folder.
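If you ever want to double-check where R is pointed, base R can tell you. A quick sketch (the folder you see printed will depend on where you created your project):

```r
# Print the folder R is currently working in. Inside a project,
# this should be the project folder you just created.
getwd()

# List the files R can see in that folder. A brand-new project
# will mostly just contain its .Rproj file.
list.files()
```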

What should pop up then is a completely fresh R project. Click the white square with a green plus sign in the top left of the window and select "New R Script". A script is the place you will write your commands before you execute them. You can work in the console, which is just below the scripts and where output is displayed, but I find it easier to stick to scripts - that way, you can save what you've done and replicate it easily, like I do with the EPA rankings, for example.

Throughout this document, I'll have lines of code interspersed with commentary. What I'd advise is that you read the commentary, then copy the code into your script and then run the code from there. Then, at the end, you'll have a complete script that will guide you through CFB Data analysis.

Getting the Data

We are going to make use of a couple of packages for our analysis. Packages are bundles of shared code that make execution more convenient for everyone. They include functions and scripts that help us do the things we want to do more easily. We'll use, very briefly, three packages today: tidyverse, cfbscrapR, and gt.

The tidyverse is a collection of data wrangling tools that will prove itself indispensable to your college football analytics experience. I'll run over some of the basic functions below.

cfbscrapR is an awesome tool created by Meyappan Subbaiah, which allows for easy access to college football play-by-play data from none other than CollegeFootballData.com. You can see more about cfbscrapR here, and you can find CollegeFootballData.com at, well, CollegeFootballData.com.

gt is a convenient way to make customizable tables. We will just use the basics of that today.

Installing packages

To use the packages, we need to do two things. First, one time only on your machine, you need to install the packages. Then, every time you open up R Studio, you will need to load the packages.

To install the packages, simply run each line of code below once (without the '#' at the beginning):

#install.packages('tidyverse')

#install.packages("devtools")
#devtools::install_github("meysubb/cfbscrapR")

#remotes::install_github("rstudio/gt")
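If you're not sure whether the installs worked, here's a small sketch for checking. It uses base R's requireNamespace(), which reports whether a package is installed without attaching it; the variable names (pkgs, installed, not_installed) are just ones I made up for this example.

```r
# Packages this tutorial relies on.
pkgs <- c("tidyverse", "cfbscrapR", "gt")

# requireNamespace() returns TRUE if a package is installed and FALSE
# otherwise, without attaching it the way library() does.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing.
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Still need to install: ", paste(not_installed, collapse = ", "))
} else {
  message("All set!")
}
```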

Don't worry about what all that means for now. Now, after running those lines of code, you have all the packages you need to get started with college football analytics. Let's load the packages and download the play-by-play data for 2019. First we load the three necessary packages, then we run a loop over the weeks of the season and use the cfb_pbp_data() function, which pulls the play-by-play data.

Note in the code below how we define what pbp_2019 is: we use the little arrow, <-. If you're on a Mac, you can press option and the dash key in R Studio to pull one of those little guys up. That's just R's fun way of saying "name that thing this".
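Here's the arrow on its own, in a tiny example with made-up names (season and next_season aren't used anywhere else in this tutorial):

```r
# Store the number 2019 under the name `season`.
season <- 2019

# Typing the name by itself prints what is stored there.
season

# Names can be reused in later calculations.
next_season <- season + 1
next_season
```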

library(tidyverse)
library(cfbscrapR)
library(gt)


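# Start with an empty data frame to collect every week's plays.
pbp_2019 <- data.frame()
for(i in 1:15){
  # Pull one week of play-by-play (with EPA and win probability) and tag it with the week number.
  data <- cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>% mutate(week = i)
  df <- data.frame(data)
  # Stack this week underneath the weeks already collected.
  pbp_2019 <- bind_rows(pbp_2019, df)
}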

THIS LOOP IS GOING TO TAKE A MINUTE OR TWO TO RUN. You'll know it's done running when you can see "pbp_2019" in the "Environment" window in the top right of your R Studio window (and when the little caret > pops up in your console, at the bottom left).

Now, we have the entire season of play-by-play data stored in an object called "pbp_2019". Let's use the functions head() and tail() to see what this dataset looks like. head() and tail() just show us the first and last six rows of the dataset. They can be very useful to just step into a dataset. If you want to see the entire dataset (not advised, because pbp_2019 is over 150,000 observations), you can just type View(pbp_2019).
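While that runs, here's the same loop pattern in isolation so you can see what it's doing. This is only a sketch: fake_week() is a hypothetical stand-in I wrote for cfb_pbp_data(), and it returns a tiny made-up data frame instead of calling the real API.

```r
library(dplyr)

# Hypothetical stand-in for cfb_pbp_data(): returns two made-up plays
# for a given week instead of downloading anything.
fake_week <- function(week) {
  data.frame(play_id = 1:2, yards_gained = c(3, 7) * week)
}

all_weeks <- data.frame()                      # start with an empty container
for (i in 1:3) {
  one_week <- fake_week(i) %>%                 # "download" a single week
    mutate(week = i)                           # tag every row with its week number
  all_weeks <- bind_rows(all_weeks, one_week)  # stack it under what we already have
}

# Three weeks of two plays each end up in one data frame.
all_weeks
```

Each trip through the loop adds that week's rows to the bottom of the running data frame, which is exactly how pbp_2019 gets built.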

head(pbp_2019)
##     game_id   drive_id    new_id          id_play.x offense_play defense_play    home away period half clock.minutes clock.seconds offense_score defense_score
## 1 401110720 4011107201 101849902 401110720101849902         Duke      Alabama Alabama Duke      1    1            30             0             0             0
## 2 401110720 4011107201 101855301 401110720101855301      Alabama         Duke Alabama Duke      1    1            29            46             0             0
## 3 401110720 4011107201 101858401 401110720101858401      Alabama         Duke Alabama Duke      1    1            29            15             0             0
## 4 401110720 4011107201 101866901 401110720101866901      Alabama         Duke Alabama Duke      1    1            28            30             0             0
## 5 401110720 4011107201 101874201 401110720101874201      Alabama         Duke Alabama Duke      1    1            27            57             0             0
## 6 401110720 4011107202 101874901 401110720101874901         Duke      Alabama Alabama Duke      1    1            27            50             0             0
##        play_type                                                                       play_text scoring TimeSecsRem Under_two down Goal_To_Go distance adj_yd_line
## 1        Kickoff   AJ Reed kickoff for 65 yds , Henry Ruggs III return for 22 yds to the Alab 22   FALSE        1800     FALSE    1      FALSE       10          78
## 2           Rush                                         Jerome Ford run for 1 yd to the Alab 23   FALSE        1786     FALSE    1      FALSE       10          78
## 3 Pass Reception            Tua Tagovailoa pass complete to Jerome Ford for 2 yds to the Alab 25   FALSE        1755     FALSE    2      FALSE        9          77
## 4           Sack      Tua Tagovailoa sacked by Koby Quansah for a loss of 6 yards to the Alab 19   FALSE        1710     FALSE    3      FALSE        7          75
## 5           Punt Will Reichard punt for 39 yds , Josh Blackwell returns for 3 yds to the Duke 45   FALSE        1677     FALSE    4      FALSE       13          81
## 6           Rush                            Deon Jackson run for a loss of 1 yard to the Duke 44   FALSE        1670     FALSE    1      FALSE       10          55
##   yards_gained TimeSecsRem_end down_end distance_end adj_yd_line_end                   ppa start_yardline yards drive_result pts_drive turnover  ep_before   ep_after
## 1           22            1786        1           10              78                  <NA>             22    -3         PUNT         0        1  1.0107843  0.8099986
## 2            1            1755        2            9              77  -0.51421952939679995             22    -3         PUNT         0        0  0.8099986 -0.1160054
## 3            2            1710        3            7              75 -0.227264862029380229             22    -3         PUNT         0        0 -0.1160054 -0.9544627
## 4           -6            1677        4           13              81 -0.337947526010106361             22    -3         PUNT         0        0 -0.9544627 -2.2805163
## 5            3            1670        1           10              55                  <NA>             22    -3         PUNT         0        1 -2.2805163 -2.4563457
## 6           -1            1647        2           11              56   -1.3838406426271828             55     3         PUNT         0        0  2.4563457  1.5136236
##          EPA score_diff   home_EPA   away_EPA ExpScoreDiff ExpScoreDiff_Time_Ratio rz_play scoring_opp pass rush stuffed_run success epa_success        wp    def_wp
## 1 -0.2007857          0  0.2007857 -0.2007857    1.0107843            5.612350e-04       0           0    0    0           0       1           0 0.5064591 0.4935409
## 2 -0.9260040          0 -0.9260040  0.9260040    0.8099986            4.532728e-04       0           0    0    1           0       0           0 0.5007075 0.4992925
## 3 -0.8384573          0 -0.8384573  0.8384573   -0.1160054           -6.606233e-05       0           0    1    0           0       0           0 0.4683062 0.5316938
## 4 -1.3260536          0 -1.3260536  1.3260536   -0.9544627           -5.578391e-04       0           0    1    0           0       0           0 0.4408228 0.5591772
## 5 -0.1758294          0 -0.1758294  0.1758294   -2.2805163           -1.359068e-03       0           0    0    0           0       0           0 0.3908703 0.6091297
## 6 -0.9427222          0  0.9427222 -0.9427222    2.4563457            1.469985e-03       0           0    0    1           1       0           0 0.5816861 0.4183139
##     home_wp   away_wp change_of_poss end_of_half   lead_wp     wpa_base   wpa_change          wpa home_wp_post away_wp_post adj_TimeSecsRem week
## 1 0.4935409 0.5064591              1           0 0.5007075 -0.005751569 -0.007166633 -0.007166633    0.5007075    0.4992925            3600    1
## 2 0.5007075 0.4992925              0           0 0.4683062 -0.032401320 -0.032401320 -0.032401320    0.4683062    0.5316938            3586    1
## 3 0.4683062 0.5316938              0           0 0.4408228 -0.027483455 -0.027483455 -0.027483455    0.4408228    0.5591772            3555    1
## 4 0.4408228 0.5591772              0           0 0.3908703 -0.049952503 -0.049952503 -0.049952503    0.3908703    0.6091297            3510    1
## 5 0.3908703 0.6091297              1           0 0.5816861  0.190815878  0.027443615  0.027443615    0.4183139    0.5816861            3477    1
## 6 0.4183139 0.5816861              0           0 0.5468710 -0.034815138 -0.034815138 -0.034815138    0.4531290    0.5468710            3470    1
tail(pbp_2019)
##          game_id    drive_id    new_id          id_play.x      offense_play      defense_play              home      away period half clock.minutes clock.seconds
## 155886 401132984 40113298427 104987201 401132984104987201         Louisiana Appalachian State Appalachian State Louisiana      4    2             1            27
## 155887 401132984 40113298427 104988001 401132984104988001         Louisiana Appalachian State Appalachian State Louisiana      4    2             1            19
## 155888 401132984 40113298428 104988004 401132984104988004         Louisiana Appalachian State Appalachian State Louisiana      4    2             1            19
## 155889 401132984 40113298428 104989101 401132984104989101 Appalachian State         Louisiana Appalachian State Louisiana      4    2             1             8
## 155890 401132984 40113298428 104989102 401132984104989102 Appalachian State         Louisiana Appalachian State Louisiana      4    2             1             8
## 155891 401132984 40113298428 104999901 401132984104999901 Appalachian State         Louisiana Appalachian State Louisiana      4    2             0             0
##        offense_score defense_score         play_type
## 155886            31            45 Pass Incompletion
## 155887            38            45 Passing Touchdown
## 155888            38            45           Kickoff
## 155889            45            38              Rush
## 155890            45            38           Timeout
## 155891            45            38              Rush
##                                                                                                                                           play_text scoring TimeSecsRem
## 155886                                                                                                 Levi Lewis pass incomplete to Jarrod Jackson    TRUE          87
## 155887 Levi Lewis pass complete to Peter Leblanc for 38 yds for a TD LOUISIANA Penalty, False Start (-5 Yards) to the AppSt 8 (Stevie Artigue KICK)    TRUE          79
## 155888                                                                                                        Kenneth Almendares kickoff for 17 yds   FALSE          79
## 155889                                                                                        Zac Thomas run for a loss of 12 yards to the AppSt 41   FALSE          68
## 155890                                                                                                               Timeout LOUISIANA, clock 01:08   FALSE          68
## 155891                                                                                         Zac Thomas run for a loss of 6 yards to the AppSt 35   FALSE           0
##        Under_two down Goal_To_Go distance adj_yd_line yards_gained TimeSecsRem_end down_end distance_end adj_yd_line_end                 ppa start_yardline yards
## 155886      TRUE    1      FALSE       10          38            0              79        2           10              38 -0.7828922522763084             64    64
## 155887      TRUE    2      FALSE       10          38           38              79        2           15              47  3.8774849008614964             64    64
## 155888      TRUE    2      FALSE       15          47            0              68        1           10              47                <NA>             53   -18
## 155889      TRUE    1      FALSE       10          47          -12              68        2           22              59 -3.1871949675841462             53   -18
## 155890      TRUE    2      FALSE       22          59            0               0        2           22              59                <NA>             53   -18
## 155891      TRUE    2      FALSE       22          59           -6               0        4           99              99 -0.5082402762391265             53   -18
##              drive_result pts_drive turnover ep_before  ep_after        EPA score_diff   home_EPA   away_EPA ExpScoreDiff ExpScoreDiff_Time_Ratio rz_play scoring_opp pass
## 155886                 TD         7        0 1.9245973 1.5856217 -0.3389757        -14  0.3389757 -0.3389757   -12.075403             -0.13722049       0           1    1
## 155887                 TD         7        0 1.5856217 7.0000000  5.4143783         -7 -5.4143783  5.4143783    -5.414378             -0.06767973       0           1    1
## 155888 END OF 4TH QUARTER         0        1 0.8564118 1.5039958  0.6475840         -7 -0.6475840  0.6475840    -6.143588             -0.07679485       0           0    0
## 155889 END OF 4TH QUARTER         0        0 1.5039958 0.9346700 -0.5693259          7 -0.5693259  0.5693259     8.503996              0.12324632       0           0    0
## 155890 END OF 4TH QUARTER         0        0 0.9346700 0.7662598 -0.1684101          7 -0.1684101  0.1684101     7.934670              0.11499522       0           0    0
## 155891 END OF 4TH QUARTER         0        0 0.0000000 0.0000000  0.0000000          7  0.0000000  0.0000000     7.000000              7.00000000       0           0    0
##        rush stuffed_run success epa_success         wp    def_wp   home_wp    away_wp change_of_poss end_of_half   lead_wp    wpa_base  wpa_change         wpa home_wp_post
## 155886    0           0       0           0 0.09041892 0.9095811 0.9095811 0.09041892              0           0 0.2674507  0.17703176  0.17703176  0.17703176    0.7325493
## 155887    0           0       1           1 0.26745068 0.7325493 0.7325493 0.26745068              0           0 0.2409835 -0.02646721 -0.02646721 -0.02646721    0.7590165
## 155888    0           0       0           1 0.24098347 0.7590165 0.7590165 0.24098347              1           0 0.8121967  0.57121323 -0.05318017 -0.05318017    0.8121967
## 155889    1           1       0           0 0.81219670 0.1878033 0.8121967 0.18780330              0           0 0.7966327 -0.01556398 -0.01556398 -0.01556398    0.7966327
## 155890    0           0       0           0 0.79663273 0.2033673 0.7966327 0.20336727              0           0 0.8382075  0.04157473  0.04157473  0.04157473    0.8382075
## 155891    1           1       0           0 0.83820745 0.1617925 0.8382075 0.16179255              0          NA        NA          NA          NA          NA           NA
##        away_wp_post adj_TimeSecsRem week
## 155886    0.2674507              87   15
## 155887    0.2409835              79   15
## 155888    0.1878033              79   15
## 155889    0.2033673              68   15
## 155890    0.1617925              68   15
## 155891           NA               0   15

Ok, well, that was kind of messy. If we don't want to see that much data at one time, we can pick the variables we want with the conveniently named select() function. This won't alter our data; it will just display what we asked for in the console. Don't worry about syntax and symbols yet; I'll discuss that below.

pbp_2019 %>% select(offense_play, defense_play, down, distance, play_type, yards_gained) %>% head()
##   offense_play defense_play down distance      play_type yards_gained
## 1         Duke      Alabama    1       10        Kickoff           22
## 2      Alabama         Duke    1       10           Rush            1
## 3      Alabama         Duke    2        9 Pass Reception            2
## 4      Alabama         Duke    3        7           Sack           -6
## 5      Alabama         Duke    4       13           Punt            3
## 6         Duke      Alabama    1       10           Rush           -1
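To see that select() really does leave the original data alone, here's a small sketch on a made-up data frame (plays and plays_small are names invented for this example, loosely modeled on the columns above):

```r
library(dplyr)

# Made-up plays, for illustration only.
plays <- data.frame(
  offense_play = c("Alabama", "Alabama", "Duke"),
  down         = c(1, 2, 1),
  distance     = c(10, 9, 10),
  yards_gained = c(1, 2, -1)
)

# Printing a selection leaves `plays` untouched...
plays %>% select(offense_play, yards_gained)
ncol(plays)

# ...unless you assign the result to a name with the arrow.
plays_small <- plays %>% select(offense_play, yards_gained)
ncol(plays_small)
```

The first ncol() call still reports all four columns; only plays_small has been trimmed down to two.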

Alright, now we have some data. We can see that each line is a play in the 2019 season. The data includes some key variables like which team was on offense (offense_play) and defense (defense_play), the down, distance, and yard line. The variable adj_yd_line, adjusted yard line, converts each team's position on the field into distance from the opponent's end zone. That way, we know that if a team starts at the 10, they're close to scoring, and if they start at the 90, they're backed up. Let's see all the variables we have using glimpse().

glimpse(pbp_2019)
## Observations: 155,891
## Variables: 63
## $ game_id                 <int> 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 401110720, 40111…
## $ drive_id                <dbl> 4011107201, 4011107201, 4011107201, 4011107201, 4011107201, 4011107202, 4011107202, 4011107202, 4011107202, 4011107203, 4011107203, 40111…
## $ new_id                  <dbl> 101849902, 101855301, 101858401, 101866901, 101874201, 101874901, 101877201, 101884401, 101885001, 101885701, 101887501, 101888801, 10189…
## $ id_play.x               <chr> "401110720101849902", "401110720101855301", "401110720101858401", "401110720101866901", "401110720101874201", "401110720101874901", "4011…
## $ offense_play            <chr> "Duke", "Alabama", "Alabama", "Alabama", "Alabama", "Duke", "Duke", "Duke", "Duke", "Alabama", "Alabama", "Duke", "Duke", "Duke", "Duke",…
## $ defense_play            <chr> "Alabama", "Duke", "Duke", "Duke", "Duke", "Alabama", "Alabama", "Alabama", "Alabama", "Duke", "Duke", "Alabama", "Alabama", "Alabama", "…
## $ home                    <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
## $ away                    <chr> "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke", "…
## $ period                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,…
## $ half                    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ clock.minutes           <dbl> 30, 29, 29, 28, 27, 27, 27, 26, 26, 26, 26, 26, 25, 25, 24, 24, 24, 23, 23, 22, 22, 22, 22, 21, 21, 20, 20, 19, 19, 19, 18, 18, 18, 17, 1…
## $ clock.seconds           <int> 0, 46, 15, 30, 57, 50, 27, 55, 49, 42, 24, 11, 46, 13, 49, 11, 3, 28, 22, 50, 50, 40, 17, 38, 3, 31, 6, 56, 28, 23, 59, 25, 17, 48, 21, 4…
## $ offense_score           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ defense_score           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ play_type               <chr> "Kickoff", "Rush", "Pass Reception", "Sack", "Punt", "Rush", "Rush", "Pass Incompletion", "Punt", "Rush", "Fumble Recovery (Opponent)", "…
## $ play_text               <chr> "AJ Reed kickoff for 65 yds , Henry Ruggs III return for 22 yds to the Alab 22", "Jerome Ford run for 1 yd to the Alab 23", "Tua Tagovail…
## $ scoring                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ TimeSecsRem             <dbl> 1800, 1786, 1755, 1710, 1677, 1670, 1647, 1615, 1609, 1602, 1584, 1571, 1546, 1513, 1489, 1451, 1443, 1408, 1402, 1370, 1370, 1360, 1337,…
## $ Under_two               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ down                    <int> 1, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 1, 2, 2, 2, 3, 1, 1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3, 3, 4, 1, 2, 3, 1, 1, 2, 3,…
## $ Goal_To_Go              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ distance                <int> 10, 10, 9, 7, 13, 10, 11, 7, 7, 10, 4, 10, 5, 5, 10, 8, 8, 1, 10, 10, 10, 15, 21, 5, 10, 10, 5, 5, 10, 10, 5, 3, 10, 6, 10, 9, 5, 10, 10,…
## $ adj_yd_line             <dbl> 78, 78, 77, 75, 81, 55, 56, 52, 52, 83, 77, 26, 21, 21, 16, 14, 14, 7, 93, 83, 83, 88, 94, 78, 67, 53, 48, 48, 38, 38, 33, 48, 68, 64, 48…
## $ yards_gained            <int> 22, 1, 2, -6, 3, -1, 4, 0, 35, 6, 3, 5, 0, 5, 2, 0, 7, 0, 10, 0, -5, -6, 16, 11, 14, 5, 0, 10, 0, 5, 2, 49, 4, 16, 1, 4, -5, 0, 0, 7, 2, …
## $ TimeSecsRem_end         <dbl> 1786, 1755, 1710, 1677, 1670, 1647, 1615, 1609, 1602, 1584, 1571, 1546, 1513, 1489, 1451, 1443, 1408, 1402, 1370, 1370, 1360, 1337, 1298,…
## $ down_end                <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 1, 2, 2, 2, 3, 1, 1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3, 3, 4, 1, 2, 3, 1, 1, 2, 3, 1,…
## $ distance_end            <dbl> 10, 9, 7, 13, 10, 11, 7, 7, 10, 4, 10, 5, 5, 10, 8, 8, 1, 10, 10, 10, 15, 21, 5, 10, 10, 5, 5, 10, 10, 5, 3, 10, 6, 10, 9, 5, 10, 10, 10,…
## $ adj_yd_line_end         <dbl> 78, 77, 75, 81, 55, 56, 52, 52, 83, 77, 26, 21, 21, 16, 14, 14, 7, 93, 83, 83, 88, 94, 78, 67, 53, 48, 48, 38, 38, 33, 48, 68, 64, 48, 47…
## $ ppa                     <chr> NA, "-0.51421952939679995", "-0.227264862029380229", "-0.337947526010106361", NA, "-1.3838406426271828", "-0.1489327290029632", "-0.77989…
## $ start_yardline          <int> 22, 22, 22, 22, 22, 55, 55, 55, 55, 17, 17, 26, 26, 26, 26, 26, 26, 26, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 68, 68, 68, 68, 68, 68,…
## $ yards                   <int> -3, -3, -3, -3, -3, 3, 3, 3, 3, 9, 9, 19, 19, 19, 19, 19, 19, 19, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 20, 20, 20, 20,…
## $ drive_result            <chr> "PUNT", "PUNT", "PUNT", "PUNT", "PUNT", "PUNT", "PUNT", "PUNT", "PUNT", "FUMBLE", "FUMBLE", "DOWNS", "DOWNS", "DOWNS", "DOWNS", "DOWNS", …
## $ pts_drive               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, 7, 7, 7,…
## $ turnover                <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ ep_before               <dbl> 1.01078431, 0.80999858, -0.11600545, -0.95446272, -2.28051629, 2.45634573, 1.51362356, 0.93704841, 0.04468223, 0.54965957, -0.04042256, 4…
## $ ep_after                <dbl> 0.80999858, -0.11600545, -0.95446272, -2.28051629, -2.45634573, 1.51362356, 0.93704841, 0.04468223, -0.54965957, -0.04042256, -4.43576801…
## $ EPA                     <dbl> -0.200785727, -0.926004026, -0.838457274, -1.326053572, -0.175829436, -0.942722165, -0.576575150, -0.892366184, -0.594341799, -0.59008213…
## $ score_diff              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ home_EPA                <dbl> 0.200785727, -0.926004026, -0.838457274, -1.326053572, -0.175829436, 0.942722165, 0.576575150, 0.892366184, 0.594341799, -0.590082134, -4…
## $ away_EPA                <dbl> -0.200785727, 0.926004026, 0.838457274, 1.326053572, 0.175829436, -0.942722165, -0.576575150, -0.892366184, -0.594341799, 0.590082134, 4.…
## $ ExpScoreDiff            <dbl> 1.01078431, 0.80999858, -0.11600545, -0.95446272, -2.28051629, 2.45634573, 1.51362356, 0.93704841, 0.04468223, 0.54965957, -0.04042256, 4…
## $ ExpScoreDiff_Time_Ratio <dbl> 5.612350e-04, 4.532728e-04, -6.606233e-05, -5.578391e-04, -1.359068e-03, 1.469985e-03, 9.184609e-04, 5.798567e-04, 2.775294e-05, 3.428943…
## $ rz_play                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ scoring_opp             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pass                    <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,…
## $ rush                    <dbl> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,…
## $ stuffed_run             <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ success                 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,…
## $ epa_success             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ wp                      <dbl> 0.5064591, 0.5007075, 0.4683062, 0.4408228, 0.3908703, 0.5816861, 0.5468710, 0.5253019, 0.4893685, 0.5098827, 0.4858052, 0.6567467, 0.642…
## $ def_wp                  <dbl> 0.4935409, 0.4992925, 0.5316938, 0.5591772, 0.6091297, 0.4183139, 0.4531290, 0.4746981, 0.5106315, 0.4901173, 0.5141948, 0.3432533, 0.357…
## $ home_wp                 <dbl> 0.4935409, 0.5007075, 0.4683062, 0.4408228, 0.3908703, 0.4183139, 0.4531290, 0.4746981, 0.5106315, 0.5098827, 0.4858052, 0.3432533, 0.357…
## $ away_wp                 <dbl> 0.5064591, 0.4992925, 0.5316938, 0.5591772, 0.6091297, 0.5816861, 0.5468710, 0.5253019, 0.4893685, 0.4901173, 0.5141948, 0.6567467, 0.642…
## $ change_of_poss          <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ end_of_half             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lead_wp                 <dbl> 0.5007075, 0.4683062, 0.4408228, 0.3908703, 0.5816861, 0.5468710, 0.5253019, 0.4893685, 0.5098827, 0.4858052, 0.6567467, 0.6421370, 0.613…
## $ wpa_base                <dbl> -0.005751569, -0.032401320, -0.027483455, -0.049952503, 0.190815878, -0.034815138, -0.021569115, -0.035933372, 0.020514200, -0.024077459,…
## $ wpa_change              <dbl> -0.0071666330, -0.0324013200, -0.0274834555, -0.0499525030, 0.0274436151, -0.0348151376, -0.0215691149, -0.0359333720, 0.0007487863, -0.0…
## $ wpa                     <dbl> -0.0071666330, -0.0324013200, -0.0274834555, -0.0499525030, 0.0274436151, -0.0348151376, -0.0215691149, -0.0359333720, 0.0007487863, -0.0…
## $ home_wp_post            <dbl> 0.5007075, 0.4683062, 0.4408228, 0.3908703, 0.4183139, 0.4531290, 0.4746981, 0.5106315, 0.5098827, 0.4858052, 0.3432533, 0.3578630, 0.386…
## $ away_wp_post            <dbl> 0.4992925, 0.5316938, 0.5591772, 0.6091297, 0.5816861, 0.5468710, 0.5253019, 0.4893685, 0.4901173, 0.5141948, 0.6567467, 0.6421370, 0.613…
## $ adj_TimeSecsRem         <dbl> 3600, 3586, 3555, 3510, 3477, 3470, 3447, 3415, 3409, 3402, 3384, 3371, 3346, 3313, 3289, 3251, 3243, 3208, 3202, 3170, 3170, 3160, 3137,…
## $ week                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
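One relationship you can check for yourself in the output above: in the six rows printed by head(), EPA is just ep_after minus ep_before. The sketch below copies those values by hand from the printed output; I've only checked the rows shown here, so treat it as a sanity check rather than a statement about every play type.

```r
# ep_before, ep_after, and EPA for rows 1-6, copied from the head() output.
ep_before <- c(1.0107843, 0.8099986, -0.1160054, -0.9544627, -2.2805163, 2.4563457)
ep_after  <- c(0.8099986, -0.1160054, -0.9544627, -2.2805163, -2.4563457, 1.5136236)
epa       <- c(-0.2007857, -0.9260040, -0.8384573, -1.3260536, -0.1758294, -0.9427222)

# Expected points after the play minus expected points before it.
ep_after - ep_before

# Compare with the printed EPA column, allowing for rounding in the display.
all(abs((ep_after - ep_before) - epa) < 1e-6)
```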

I won't explain every single one of these variables, but we have plenty of information here. There are straightforward descriptives like home, away, period, down, distance, and rush, which tell us what happened on each play, and there are some calculated variables: rz_play is whether the play happened in the red zone, and scoring_opp indicates whether the play comes in a scoring opportunity. We have the full EPA calcs, and even a Win Probability Added model. Included as well is CollegeFootballData's Predicted Points Added metric (ppa), which is very similar to EPA. We also have play_type, which tells us what kind of play each row is. Using the levels() function, let's see what kinds of plays there are. Alternatively, you could use count() to show frequencies of each play type.

levels(factor(pbp_2019$play_type))
pbp_2019 %>% count(play_type, sort = TRUE)
##  [1] "Blocked Field Goal"                 "Blocked Punt"                       "Blocked Punt Touchdown"             "Defensive 2pt Conversion"          
##  [5] "Field Goal Good"                    "Field Goal Missed"                  "Fumble Recovery (Opponent)"         "Fumble Recovery (Own)"             
##  [9] "Fumble Return Touchdown"            "Interception Return Touchdown"      "Kickoff"                            "Kickoff Return (Offense)"          
## [13] "Kickoff Return Touchdown"           "Missed Field Goal Return"           "Missed Field Goal Return Touchdown" "Pass Incompletion"                 
## [17] "Pass Interception Return"           "Pass Reception"                     "Passing Touchdown"                  "Penalty"                           
## [21] "placeholder"                        "Punt"                               "Punt Return Touchdown"              "Rush"                              
## [25] "Rushing Touchdown"                  "Sack"                               "Safety"                             "Timeout"                           
## [29] "Uncategorized"

There's the entire array of offense, defense, special teams, and penalties for plays. For most of the analysis you'll want to do, you want this to be just offensive plays. We can use the filter() function from the tidyverse to select certain observations. We need to discuss two things before we do that.

  1. Filtering on conditions: We will use the double equality (==) to identify conditions we need to meet for our filter function. We can also use !=, "not equals", to select values on that condition (i.e. offense_play != "TCU" would remove TCU from the data), and we can use the standard greater than > and less than <. You might have to stop and think a bit before you write your filter function, so just make sure you know what it is you are actually selecting.

  2. The pipe: %>% is the magrittr pipe. The pipe is extremely handy. Instead of us having to type a function and select the data we want to apply that function to over and over again, the pipe tells R what to do in the structure of "get this thing, and then do this to it."
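To make those two ideas concrete, here is a minimal sketch on a made-up four-row table. The toy data frame and its values are invented for illustration; only the column names mirror the play-by-play data. It shows that a piped call and a nested call return the same thing, and how ==, !=, and > behave inside filter().

```r
library(dplyr)

# Toy data: invented values, column names borrowed from the pbp data
toy <- tibble(
  offense_play = c("TCU", "TCU", "Baylor", "Texas"),
  rush         = c(1, 0, 1, 0),
  yards_gained = c(4, 12, -2, 7)
)

# Without the pipe: the data is the first argument to filter()
nested <- filter(toy, offense_play == "TCU")

# With the pipe: "take toy, and then filter it"
piped <- toy %>% filter(offense_play == "TCU")

identical(nested, piped)  # TRUE

# != drops the matching rows; > keeps rows above a threshold
not_tcu   <- toy %>% filter(offense_play != "TCU")
big_plays <- toy %>% filter(yards_gained > 5)
```

Running a filter on a tiny table like this first is a cheap way to check that the condition selects what you think it does.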

I'm naming our dataset of play-by-play data, filtered down to rushes and passes, "plays". I don't like to overwrite my data once I've imported it, so I'm going to give it a new name. That way, if I mess something up along the way, I can go back to the start and clean up without too much hassle.

We'll use the indicator variables "rush" and "pass" to select every play categorized as rushes or passes. (Yes, this leaves out special teams and yes this leaves out penalties. We're starting small, and can always go back later.)

Also note we use the | symbol to denote "or" in our filter function. We'll filter the data for any observations that are runs (rush == 1) or passes (pass == 1), and store it in a new object called plays.

plays <- pbp_2019 %>% filter(rush == 1 | pass == 1)

Now we have a clean dataset of every rush and pass that happened in 2019. (Quick note: unfortunately, due to the way the pbp data is generated on ESPN, QB scrambles from dropbacks are coded as rushes. It's not ideal, but it's not a dealbreaker; just something to remember as you do your analysis).

Creating Some Stats

Ok, now let's play with the tidyverse a little more. We've used filter(), which is an extremely useful function. The next most useful functions actually do things to the data: group_by() and summarise().

Summarise is the way of creating season-long or game-by-game stats. Let's start with some season long raw offense numbers, yards per attempt (passing) and yards per rush. We're going to use the pipe (%>%) to tell R to grab our plays dataset, group it by offense team, and summarize their yards per attempt and yards per rush in a new object called offense.

offense <- plays %>% group_by(offense_play) %>% summarise(ypa = mean(yards_gained[pass==1]), ypr = mean(yards_gained[rush==1]), num.plays = n()) %>% filter(num.plays > 300)

I also created a summary variable that totals the plays an offense has run all season, and then filtered the dataset for those teams with more than 300 plays. That gets rid of FCS teams for now.
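If the square brackets inside mean() look odd, here is a small sketch of the same pattern on invented data (the teams "A" and "B" and all yardage values are made up). Inside summarise(), yards_gained[pass == 1] keeps only the passing plays for the current group before the mean is taken.

```r
library(dplyr)

# Toy data: invented values for two teams, three plays each
toy <- tibble(
  offense_play = c("A", "A", "A", "B", "B", "B"),
  pass         = c(1, 1, 0, 1, 0, 0),
  rush         = c(0, 0, 1, 0, 1, 1),
  yards_gained = c(10, 0, 3, 8, 4, 6)
)

toy_offense <- toy %>%
  group_by(offense_play) %>%
  summarise(
    ypa = mean(yards_gained[pass == 1]),  # passes only, within each team
    ypr = mean(yards_gained[rush == 1]),  # rushes only, within each team
    num.plays = n()                       # every play for the team
  )

toy_offense
```

One thing to watch for: if a group has no plays matching the condition, mean() of an empty vector returns NaN, so a team with zero passes would show NaN for ypa.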

Now we can ask and answer some fun questions. Who had the best rushing offense, on a per play basis? Who had the best passing offense, on a per-play basis? We will do this using the arrange(desc()) function, which tells R to order the data from greatest to least (descending).

offense %>% arrange(desc(ypr))
## # A tibble: 130 x 4
##    offense_play        ypa   ypr num.plays
##    <chr>             <dbl> <dbl>     <int>
##  1 Mississippi State  5.91  6.27       816
##  2 Oklahoma           9.5   6.13       903
##  3 Kentucky           4.68  5.96       791
##  4 Clemson            7.96  5.96      1028
##  5 Ohio State         8.09  5.69      1023
##  6 Louisiana          7.58  5.61       963
##  7 Washington State   6.79  5.56       901
##  8 TCU                5.18  5.53       864
##  9 Texas              6.94  5.47       921
## 10 Fresno State       6.57  5.45       730
## # … with 120 more rows
offense %>% arrange(desc(ypa))
## # A tibble: 130 x 4
##    offense_play   ypa   ypr num.plays
##    <chr>        <dbl> <dbl>     <int>
##  1 Air Force    11.0   5.27       838
##  2 Alabama      10.2   4.87       812
##  3 LSU           9.74  5.20      1044
##  4 Oklahoma      9.5   6.13       903
##  5 Navy          9.13  5.44       822
##  6 Minnesota     8.86  4.76       854
##  7 Utah          8.81  5.05       878
##  8 Memphis       8.71  4.54       953
##  9 UCF           8.23  4.97       977
## 10 Louisville    8.16  5.11       846
## # … with 120 more rows
offense %>% arrange(ypa)
## # A tibble: 130 x 4
##    offense_play       ypa   ypr num.plays
##    <chr>            <dbl> <dbl>     <int>
##  1 Northwestern      3.95  4.40       817
##  2 Akron             4.11  3.33       719
##  3 Old Dominion      4.24  3.86       788
##  4 UMass             4.33  3.77       802
##  5 Vanderbilt        4.56  4.59       732
##  6 Georgia Southern  4.61  5.03       784
##  7 Duke              4.65  3.94       852
##  8 Georgia Tech      4.65  5.15       695
##  9 Kentucky          4.68  5.96       791
## 10 South Florida     4.70  5.43       754
## # … with 120 more rows
offense %>% arrange(ypr)
## # A tibble: 130 x 4
##    offense_play           ypa   ypr num.plays
##    <chr>                <dbl> <dbl>     <int>
##  1 West Virginia         5.80  3.15       778
##  2 Akron                 4.11  3.33       719
##  3 Southern Mississippi  7.59  3.63       851
##  4 San Diego State       5.96  3.68       919
##  5 UMass                 4.33  3.77       802
##  6 Miami (OH)            6.25  3.81       845
##  7 Purdue                6.42  3.85       841
##  8 Old Dominion          4.24  3.86       788
##  9 Michigan State        6.15  3.87       917
## 10 Rutgers               4.76  3.87       723
## # … with 120 more rows

I took out the desc() part of the function to display the worst offenses, for fun. We see that Mississippi State had the highest yards per rush, followed by Oklahoma and Kentucky. The top passing teams were Air Force, Alabama, LSU, and Oklahoma. Bonus points if you can tell me why Air Force is ranked that highly!

The worst offenses for rushing and passing were West Virginia, who averaged only 3.15 yards per rush, and Northwestern, who averaged an inconceivable 3.95 yards per passing attempt.

We can easily do the defensive side of the ball as well, grouping plays instead by defense_play. Let's make a dataset and then use left_join() to put offenses and defenses together, but let's do EPA instead of yards, because EPA is more fun. (What's EPA? Glad you asked: An EPA Primer).

offense <- plays %>% group_by(offense_play) %>% summarise(epa.pass.off = mean(EPA[pass==1]), epa.rush.off = mean(EPA[rush==1]), num.plays = n()) %>% filter(num.plays > 300)

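# The rush column gets its own name, epa.rush.def; reusing epa.pass.def would overwrite the pass value with the rush value
defense <- plays %>% group_by(defense_play) %>% summarise(epa.pass.def = mean(EPA[pass==1]), epa.rush.def = mean(EPA[rush==1]), num.plays = n()) %>% filter(num.plays > 300)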

team.epa <- left_join(offense, defense, by = c("offense_play" = "defense_play")) 

head(team.epa)
## # A tibble: 6 x 6
##   offense_play      epa.pass.off epa.rush.off num.plays.x epa.pass.def num.plays.y
##   <chr>                    <dbl>        <dbl>       <int>        <dbl>       <int>
## 1 Air Force               0.578       0.00825         838      -0.238          725
## 2 Akron                  -0.318      -0.373           719      -0.130          837
## 3 Alabama                 0.538      -0.0213          812      -0.195          862
## 4 Appalachian State       0.217      -0.0682          931      -0.285          924
## 5 Arizona                 0.0808     -0.0914          848      -0.0592         862
## 6 Arizona State           0.0872     -0.234           836      -0.233          915
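Here is a small sketch of what left_join() is doing in that step, using two invented tables (the team names and play counts are made up). The by = c("offense_play" = "defense_play") argument matches rows where the left table's offense_play equals the right table's defense_play, and any column name that appears in both tables, like num.plays, gets a .x or .y suffix.

```r
library(dplyr)

# Toy tables: invented teams and play counts
off <- tibble(offense_play = c("A", "B", "C"), num.plays = c(800, 900, 850))
def <- tibble(defense_play = c("A", "B"),      num.plays = c(780, 910))

# Keep every row of off; attach matching rows of def
joined <- left_join(off, def, by = c("offense_play" = "defense_play"))
names(joined)  # "offense_play" "num.plays.x" "num.plays.y"

# Team "C" has no defensive row, so its num.plays.y is NA
joined

# Optional: choose clearer suffixes than .x and .y
joined_named <- left_join(off, def,
                          by = c("offense_play" = "defense_play"),
                          suffix = c(".off", ".def"))
names(joined_named)  # "offense_play" "num.plays.off" "num.plays.def"
```

Because it is a left join, a team that appears only in the left table is kept with NA in the right-hand columns rather than dropped.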

Visualization

Let's learn how to make a basic plot using ggplot() to visualize college football teams! We didn't have to load ggplot2 separately; it's included in the tidyverse. We will start with a basic plot, then we'll add some fancy formatting.

The syntax here is pretty simple: we will grab our dataset and use the pipe to say "open a plot". Then we'll tell the plot what the x and y coordinates are, then indicate the markers, in this case a scatterplot with geom_point(). Notice that we use + instead of %>% once we've called ggplot(). The syntax here tells R to take 'team.epa' and open a plot with epa.rush.off on the x-axis and epa.pass.off on the y-axis, PLUS all these features. That can be a little confusing to keep straight, but R will tell you what's going on with an error if you mix that up!

team.epa %>% ggplot(aes(x=epa.rush.off, y=epa.pass.off)) + geom_point()

[Figure: scatter plot of epa.rush.off (x-axis) against epa.pass.off (y-axis), one point per team]

Ok, that's pretty cool! Let's dress it up a little bit by adding a title and labelling the axes, using the labs() option. We will also throw a couple reference lines - a vertical and horizontal line indicating the mean values of each statistic to help us compare. Plus, here are some tweaks I like that make the graph look a little better (all the theme() stuff).

team.epa %>% ggplot(aes(x=epa.rush.off, y=epa.pass.off)) + geom_point() +
  geom_vline(xintercept = mean(team.epa$epa.rush.off), linetype = "dashed", color = "blue") +
  geom_hline(yintercept = mean(team.epa$epa.pass.off), linetype = "dashed", color = "blue") +
  labs(x = "Rush EPA/Play", y= "Pass EPA/Play",
       title = "2019 NCAA Team Efficiency") +
  theme_bw() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12))

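[Figure: the same scatter plot titled "2019 NCAA Team Efficiency", with Rush EPA/Play on the x-axis, Pass EPA/Play on the y-axis, and dashed blue lines at the mean of each]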

That's a pretty cool visualization! If you wanted to get really fancy, once you got comfortable with the basics, you could use the ggimage package and the team page from CollegeFootballData.com to get team logos on there.

Another useful visualization might be to see the EPA of each of a team's plays by location on the field. We will filter the dataset to include only our favorite team, and then plot the EPA of each play to examine outliers.

Here's the 2019 TCU Offense:

tcu <- plays %>% filter(offense_play == "TCU") 

tcu %>%
  ggplot(aes(x=adj_yd_line, y=EPA)) +
  geom_point() +
  labs(x = "Yard Line",
       y = "EPA",
       title = "Expected Points Added by Field Position",
       subtitle = "TCU Offense 2019") +
  geom_abline(slope=0, intercept = 0, alpha = 0.5, col = "purple") +
  theme_bw() +
  theme(axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12))

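[Figure: scatter plot "Expected Points Added by Field Position" (subtitle "TCU Offense 2019"), with Yard Line on the x-axis, EPA on the y-axis, and a purple reference line at zero]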

I added a couple fun features here: a geom_abline() reference line at zero, to help us better understand the graph, and a subtitle to clarify what was on the graph. Notice that TCU's EPA was spread almost evenly across the field, but you can see some serious negative outliers that weighed them down. We could call those outliers into our console to explore them more:

plays %>% filter(offense_play == "TCU" & EPA < -4) %>% select(offense_play, defense_play, play_text, down, distance, adj_yd_line)
##   offense_play   defense_play
## 1          TCU            SMU
## 2          TCU            SMU
## 3          TCU            SMU
## 4          TCU            SMU
## 5          TCU          Texas
## 6          TCU Oklahoma State
## 7          TCU         Baylor
## 8          TCU         Baylor
## 9          TCU  West Virginia
##                                                                                                                                        play_text down distance adj_yd_line
## 1                      Sewo Olonilua run for 1 yd to the TCU 4 Sewo Olonilua fumbled, forced by Richard McBryde, recovered by SMU Patrick Nelson    1       10          99
## 2   Max Duggan sacked by Turner Coxe for a loss of 8 yards to the TCU 23 Max Duggan fumbled, recovered by SMU Demerick Gary , return for 0 yards    3       12          70
## 3                                            Max Duggan run for a loss of 4 yards to the TCU 24 Max Duggan fumbled, recovered by SMU Toby Ndukwe    2        6          71
## 4                                                 Sewo Olonilua sacked by Patrick Nelson and Delano Robinson for a loss of 6 yards to the SMU 16    4        1          10
## 5 Max Duggan pass intercepted Brandon Jones return for a loss of 3 yards to the Texas 24 TEXAS Penalty, Illegal Block (12 Yards) to the Texas 12    1       10          24
## 6                       Max Duggan pass complete to John Stephens Jr. for 13 yds John Stephens Jr. fumbled, recovered by OKSt Kolby Harvell-Peel    1       10          31
## 7                                                                  Max Duggan pass intercepted Grayland Arnold return for no gain to the Bayl 14    1       10          24
## 8                                                                      Max Duggan pass intercepted Terrel Bernard return for 20 yds to the TCU 8    1       10          81
## 9                                                                        Max Duggan pass intercepted Tykee Smith return for 39 yds to the TCU 14    3        4          57

Descriptive Analysis

Let's do some more descriptive analysis. How did playoff teams fare in the passing game in 2019? We will use filter() to keep only the four playoff teams and only plays coded as passes, using %in% to match any of the four team names and & ("and") to require pass == 1. Then we will use group_by() and summarise() to create a new variable called mean_epa.

plays %>% 
  filter(offense_play %in% c("LSU", "Ohio State", "Clemson", "Oklahoma") & pass == 1) %>%  
  group_by(offense_play) %>%
  summarize(mean_epa = mean(EPA)) %>%
  arrange(desc(mean_epa))
## # A tibble: 4 x 2
##   offense_play mean_epa
##   <chr>           <dbl>
## 1 Oklahoma        0.459
## 2 LSU             0.438
## 3 Ohio State      0.404
## 4 Clemson         0.246
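The %in% operator in that filter is shorthand for a chain of "or" conditions. Here is a minimal sketch on invented data (the five rows and EPA values are made up) showing that the two spellings select the same rows. Note the parentheses around the | version: without them, & binds more tightly than | and the pass condition would only apply to the last team.

```r
library(dplyr)

# Toy data: invented plays and EPA values
toy <- tibble(
  offense_play = c("LSU", "LSU", "Clemson", "TCU", "Oklahoma"),
  pass         = c(1, 0, 1, 1, 1),
  EPA          = c(0.5, -0.1, 0.2, 0.3, 0.4)
)

# %in%: is the team any one of these names?
with_in <- toy %>%
  filter(offense_play %in% c("LSU", "Clemson", "Oklahoma") & pass == 1)

# The same condition spelled out with |
with_or <- toy %>%
  filter((offense_play == "LSU" | offense_play == "Clemson" |
            offense_play == "Oklahoma") & pass == 1)

identical(with_in, with_or)  # TRUE
```

With four or more teams, %in% is much easier to read and much harder to get wrong.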

We can then easily recreate the same idea above using success rate instead of EPA.

plays %>% 
  filter(offense_play %in% c("LSU", "Ohio State", "Clemson", "Oklahoma") & pass == 1) %>%  
  group_by(offense_play) %>%
  summarize(success.rate = mean(success)) %>%
  arrange(desc(success.rate))
## # A tibble: 4 x 2
##   offense_play success.rate
##   <chr>               <dbl>
## 1 LSU                 0.573
## 2 Oklahoma            0.535
## 3 Ohio State          0.533
## 4 Clemson             0.455

Lastly, let's explore the gt package to list the top ten offenses in the country, starting with passing (the same steps work for rushing and for defenses). We will take our dataset, sort it by the value we want, add a rank variable using mutate(), then apply gt(), simply enough.

#Passing
team.epa %>% arrange(desc(epa.pass.off)) %>% mutate(rank = dense_rank(desc(epa.pass.off))) %>% 
  filter(rank < 11) %>% gt()

[Figure: default gt table of the top passing offenses from team.epa]

Now, over in your viewer (bottom right), you should see a sleek table. Ok, well, not so sleek. We will do three things to this table to make it a little more palatable: first, we'll add a title; second, we'll switch the order of the variables; and third, we'll change the column titles.

team.epa %>% arrange(desc(epa.pass.off)) %>% mutate(rank = dense_rank(desc(epa.pass.off))) %>%
  select(rank, offense_play, epa.pass.off) %>% 
  filter(rank < 11) %>% gt() %>%
  tab_header(title = "Best Passing Teams") %>%
  cols_label(rank = "Rank", offense_play = "Offense", epa.pass.off = "EPA/Attempt")

[Figure: gt table titled "Best Passing Teams" with columns Rank, Offense, and EPA/Attempt]
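A quick note on the rank variable, sketched on invented data (the four teams and EPA values are made up). dense_rank() gives tied teams the same rank and does not skip the next number, so a filter like rank < 11 can return more than ten rows when there are ties. dplyr's min_rank() is shown alongside for comparison.

```r
library(dplyr)

# Toy data: invented teams, with a tie between B and C
toy <- tibble(team = c("A", "B", "C", "D"),
              epa  = c(0.50, 0.40, 0.40, 0.10))

ranked <- toy %>%
  arrange(desc(epa)) %>%
  mutate(rank       = dense_rank(desc(epa)),  # ties share a rank, no gaps
         rank_gaps  = min_rank(desc(epa)))    # ties share a rank, gap after

ranked$rank       # 1 2 2 3
ranked$rank_gaps  # 1 2 2 4

# "Top two" by dense rank keeps three teams because of the tie
top_two <- ranked %>% filter(rank < 3)
```

With EPA averaged over hundreds of plays, exact ties are unlikely, but it is worth knowing which behavior you are getting.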

This is just a little taste of what you can do in college football data, thanks to collegefootballdata.com and #cfbscrapR.

Let me know of any work you do, resources you can think of, or code you want to share, and I'll link to it below!

Other Great Resources

Books I'd Recommend:

  • MATHLETICS by Wayne L. Winston
  • Introductory Econometrics: A Modern Approach by Jeffrey Wooldridge
  • Analyzing Baseball Data with R by Baumer, Albert, and Marchi

Glossary of terms and functions:

  • cfbscrapR: the R package that helps you get the college football data. Documentation can be found here:
    • Functions in cfbscrapR:
      • cfb_pbp_data() gets you the play by play for a given week or team. Can loop over this to pull the whole season. If you include 'drives=TRUE', you can get the drive data.
      • cfb_game_info() gets you the home team, away team, game score, and more info about games.
      • plot_pbp_sequencing() and plot_wpa() are both built-in graphics you can play with. See documentation.
  • tidyverse: the R data science package that helps you wrangle and analyze data.
    • Functions we used in the tidyverse:
      • mutate(): creates new variables
      • group_by() and summarise(): group observations and calculate summary statistics
      • filter(): keeps observations based on certain conditions (remember to use ==)
      • left_join(): takes one dataframe and merges it with another
      • head(), tail(), glimpse(), and levels(): all help you inspect the data
      • ggplot() and geom_point(): graphing functions that set up a graph and plot scatter plots
  • gt: a nice package for getting publication-quality tables

@brockthebear brockthebear commented Jan 17, 2020

Incredible stuff man 👏

@DSupps DSupps commented Jan 25, 2020

I only get a data frame with 22095 observations and not 155,891

@vcardamone vcardamone commented Mar 28, 2020

Hi, I am having a problem running this code:
library(gt)
library(tidyverse)
library(cfscrapR)
pbp_2019 <- data.frame()
for(i in 1:15){
  data <- cfb_pbp_data(year = 2019, season_type = "both", week = i, epa_wpa = TRUE) %>% mutate(week = i)
  df <- data.frame(data)
  pbp_2019 <- bind_rows(pbp_2019, df)
}
When I run it I get an error message saying that "could not find function "cfp_pbp_data"", any help would be greatly appreciated as I am very new to R. Thanks in advance.

@akornberg akornberg commented Apr 28, 2020

@vcardamone

I had the same problem, however it was fixed by re-installing the packages. Try these again!

install.packages('tidyverse')
install.packages("devtools")
devtools::install_github("meysubb/cfbscrapR")
remotes::install_github("rstudio/gt")

@JagsStats JagsStats commented May 6, 2020

How can I filter plays by player? For example, how would I look at passing plays by Joe Burrow or rushing plays by Jonathan Taylor? I don't see any variables like "passer_player_name"

@akornberg akornberg commented May 7, 2020

@JagsStats You can do it with str_detect! If your data frame is called "plays", this will select every play featuring Jonathan Taylor (run or pass) and create a data frame called "JT". Good player choice by the way, On Wisconsin!

JT <- plays %>% filter(str_detect(play_text, "Jonathan Taylor"))

@JagsStats JagsStats commented May 8, 2020

Thank you! @akornberg
