Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?

nflfastR Python Guide

This guide serves as an update to my original nflscrapR Python Guide. A new R library has been created, appropriately titled nflfastR, which speeds up the process of scraping new play by play data.

Using Jupyter Notebooks or Jupyter Lab, which come pre-installed with Anaconda is typically the best way to work with data in Python. This guide assumes you are using the Ananconda distribution and therefore already have the required packages installed. If you are not using the Anaconda distribution, install the required libraries mentioned throughout the guide. Once Anaconda has been downloaded and installed, open the Anaconda Navigator. Click launch on the Jupyter Notebook or Jupyter Lab section which will open in your browser.

Since the nflfastR data is largely the same as nflscrapR, the process to analyze and plot it is nearly identical. This guide will use much of the same code from the original nflscrapR guide.

Reading in Data

One season of play by play data is around 80 megabytes of data which can make reading in multiple seasons a slow process. Luckily, nflfastR now includes compressed CSVs in their repository which speeds up the process of reading data into Python/pandas and it only requires an extra parameter to work.

First on the to do list is to import pandas, type the following into the first cell. Then press Shift+Enter, this will execute the code and provide a new cell for you.

import pandas as pd

Pandas is now known as pd so instead typing out pandas for every commnad, pd can be typed instead to save time.

The data repository for nflfastR can be found here.

Single Season

To read in a year of data declare a variable with the value of the desired year, like the following.

YEAR = 2019

Next the data will be read in using pandas .read_csv() method.

data = pd.read_csv('https://github.com/guga31bb/nflfastR-data/blob/master/data/' \
                         'play_by_play_' + str(YEAR) + '.csv.gz?raw=True',
                         compression='gzip', low_memory=False)

The url is location of the data and str(YEAR) fills in the desired year of data.

The two extra parameters compression='gzip' tells pandas the file is compressed as a gzip and low_memory=False eliminates a warning that pops up when reading in large CSV files.

Multiple Seasons

To read in multiple years of data, a for loop can be used, which will iterate through a list of years, putting all the data into a single dataframe. First, specify the desired years of data in a list.

YEARS = [2019,2018,2017]

Next, declare an empty pandas dataframe that data will be added to and write the for loop. The for loop will read in a year of data, add it to the dataframe, then go to the next year, and repeat.

data = pd.DataFrame()

for i in YEARS:  
    i_data = pd.read_csv('https://github.com/guga31bb/nflfastR-data/blob/master/data/' \
                         'play_by_play_' + str(i) + '.csv.gz?raw=True',
                         compression='gzip', low_memory=False)

    data = data.append(i_data, sort=True)

The .append() method adds the data to the existing dataframe, sort=True sorts the columns alphabetically.

When reading in multiple seasons, the indexes will be replicated in each season, meaning there will be several 0 indexes, 1 indexes, 2 indexes, etc. (if this doesn't make sense don't worry). The following line will reset the index to ensure each row has a unique identifier.

data.reset_index(drop=True, inplace=True)

drop=True drops the existing index, replacing it with the new one. Without specifying inplace=True the command will return a dataframe with those changes, but not make the change permanent to the dataframe.

Saving Data

Saving data locally is a good idea so that reading in data doesn't require an internet connection and a fresh download for each analysis. The data can be outputed in a compressed CSV to save space (only ~15 megabytes per season) so the quick reading in of data is retained.

The .to_csv() method in pandas makes it easy to turn a dataframe into a CSV file.

data.to_csv('filename.csv.gz', compression='gzip', index=False)

A full file path can also be specified to save in a specific location, such as C:/Users/Name/Documents/filename.csv.gz, otherwise it will be stored in wherever the Jupyter Notebook is stored.

The process to read in the saved data is the same as above, but instead of using the url, specify the full filepath of the file.

data = pd.read_csv('C:/Users/Name/Documents/Football/play_by_play_2019.csv.gz', compression='gzip', low_memory=False)

Or if the data is in the same folder as the Jupyter Notebook, only the filename (with the extension .csv.gz) needs to be specified.

data = pd.read_csv('play_by_play_2019.csv.gz', compression='gzip', low_memory=False)

Setting With Copy Warning

Here's a good place to mention a warning pandas will give when making changes to data. The warning is SettingWithCopyWarning, which is explained in detail here. The tldr is that pandas creates a warning when making changes to a slice of a dataframe that could potentially change the original dataframe as well. As shown in the link above, when creating a new dataframe from a portion of another dataframe, calling .copy() can avoid the SettingWithCopyWarning.

For example, pretend there's a dataframe named A containing every play from a season. Then dataframe B is created but only with 1st down plays. Making a change to dataframe B's values will result in a SettingWithCopyWarning because the change may also affect dataframe A.

#Full season
A = pd.read_csv('play_by_play_2019.csv.gz', compression='gzip', low_memory=False)

#First downs only
B = A.loc[A.down==1]

#Making a change to B will raise the warning
B.epa.loc[B.fumble==1] = 0

The code above would result in a warning. The simple change would be to add .copy() to the creation of B.

B = A.loc[A.down==1].copy()

Alternatively, the following line of code can be ran that will surpress the warning, but runs the risk of making changes to the original dataframe when trying to modify a slice of a dataframe.

pd.options.mode.chained_assignment = None

The next two lines have nothing to do with warnings but simply making viewing data easier. Play by play data has nearly 300 columns so being able to scroll horizontally is important. Adding some extra scrolling ability vertically is also a good idea for working with smaller portions of data.

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 300)

Now that the data has been read in, some optional cleaning will be outlined below.

Cleaning Data

The good news continues here, nflfastR's data repository includes cleaned data by default, eliminating most of the work necessary to clean the data. However, there are now a few redundant columns which can be removed.

The traditional columns passer_player_name, rusher_player_name, and receiver_player_name have been supplanted by passer, rusher, and receiver, respectfully. Note that QB scrambles will mark the QB in the passer column, as the play call was a pass. Likewise, the player ID columns have also changed and the old ones can be dropped.

As such, the six old columns can be removed with the following code.

data.drop(['passer_player_name', 'passer_player_id',
           'rusher_player_name', 'rusher_player_id',
           'receiver_player_name', 'receiver_player_id'],
          axis=1, inplace=True)

axis=1 indicates that these are columns that should be dropped, not rows. inplace=True saves the change to the dataframe.

The compressed CSVs in the data repository also include postseason data by default, so this next step is completely up to the user if they want to include playoff data or not. To remove postseason data, use the following code.

data = data.loc[data.season_type=='REG']

Another optional change would be to remove plays that aren't typically used for analysis such as kickoffs, field goals, kneel downs, etc. Those can be removed if desired.

data = data.loc[(data.play_type.isin(['no_play','pass','run'])) & (data.epa.isna()==False)]

This will limit plays to passes, runs, and plays with penalties.

One last cleaning step is to change the play_type to match the play call. For example, QB scrambles are labeled as run plays but in reality they are pass plays. The data already contains binary value columns for pass or rush.

data.play_type.loc[data['pass']==1] = 'pass'
data.play_type.loc[data.rush==1] = 'run'

Note how in the first line data['pass'] is used with the brackets and column name in quotes, the second line data.rush features a dot and no quotes for the column name. pass is a special word in Python so the interpreter gets confused when using the dot method and therefore requires quotes so it knows the word is referencing the column.

That finishes up the cleaning portion, with plays being removed in that process it may be a good idea to reset the index once again, ensuring there's no missing numbers.

data.reset_index(drop=True, inplace=True)

This would also be a good place to save the data so that the cleaning process doesn't need to be repeated every time data is used. See Saving Data section above for how to save the data as a compressed CSV file.

Note that some of the above cleaning will result in SettingWithCopyWarning so make sure to either surpress it or use the .copy() method and create separate dataframes.

Analyzing Data

Now to the fun part of analyzing data. The two most important processes will be selecting the data needed for an analysis, and grouping data, those are both described below.

Querying Data

To understand what the dataframe looks like, enter the name of the dataframe ('data' in this case) into a cell and execute. The dataframe will be displayed. All 250+ columns can be scrolled through, on the left there is an unnamed column with a number. This is the index and a play or range of plays can be specified using the .loc[] function.

In Python, counters start at 0 so the first row is actually index 0, not 1. To get just the first row enter

data.loc[0]

To get a range of index use a colon (:) to specify the range. When using .loc[] Pandas follows the rules of [inclusive:inclusive] meaning both the first and last number index is included in the range. For example data.loc[0:5] would return the first six rows (index 0,1,2,3,4,5).

This is not the case for other Python lists or arrays. Even the Pandas function .iloc[] uses the more traditional [inclusive:exclusive] syntax. Meaning data.iloc[0:5] would return only five rows (index 0,1,2,3,4). Since the play by play data's index is already numerical, using .loc makes sense. If the key were a word or not in a numerical order, .iloc[] would be used to return specific elements.

To clarify, if the index was a random number and the first rows' index was 245, data.loc[245] or data.iloc[0] would be used to return the first row. Using just data.loc[0] would search for the row where the index is 0.

Filtering Data

The .loc[] method is not just limited to indexing with numbers, it's also used for filtering by column values.

There are many filters you can use to check for values:

  • Greater than > or greater than equal to >=
  • Less than < or less than equal to <=
  • Equal to ==
  • String contains .str.contains('word')

Multiple filters can also be chained together using paranthesis () around each filter and an & as an AND to make all conditions necessary or a | to indicate OR.

Each filter needs to state the name of the dataframe and the field/column to condition on. Fields can be specified using data.field_name or data['field_name'] they work the same way. Just note that is a field shares the same name with a special word in Python (such as pass) then the data['field_name'] method must be used.

Here's an example to filter to just first and second down, with run plays only. Below is a couple examples of filtering.

#Run plays on 1st or 2nd down
data.loc[(data.down<3) & (data.rush==1)]

#Plays where the Seahawks are on offense
data.loc[data.posteam=='SEA']

That will return a filtered dataframe with plays matching those criteria. You can save that smaller dataframe to a new variable like this early_down_runs = data.loc[(data.down<3) & (data.rush==1)]

Grouping Data

Data can be grouped by any category such as a player, team, game, down, etc. and can also have multiple groupings. Use the .groupby() function to do this. The columns that should be grouped are then specified in double brackets and a function applied such as .mean(),.sum(),.count()

The visuals and data in the next couple sections will use 2019 regular season data, following the cleaning procedures outlined above.

Here's an example of finding the average expected points added (EPA) per play by offense (posteam).

data.groupby('posteam')[['epa']].mean()

This will return the following dataframe (this is cropped to save space).

posteamepa

Note that the index is now the team names and no longer a number. To query this table, it could be saved to a new dataframe and the .loc[] method could be used to find a specific team's EPA per play.

#Create table
team_epa = data.groupby('posteam')[['epa']].mean()

#Get Cardinals EPA per play
team_epa.loc[team_epa.index=='ARI']

Since the index is the team abbreviations, team_epa.index is used to find the teams. To have the table contain a numerical index instead, with posteam as a column, the code would be data.groupby('posteam', as_index=False)[['epa']].mean().

Groupby can also have multiple values to groupby, for example, receiver targets. Naming convention in nflscrapR is FirstInitial.LastName, this becomes an issue when two players in the league have the same first initial and last name. For this reason it is best to use a secondary groupby to eliminate any conflicts.

data.groupby(['receiver','posteam'])[['play_id']].count()

When using multiple groupby conditions they need to be put in brackets and separated by commas. Again, the index is now players and teams, not a numerical index. This can make further filtering an issue so it can be fixed in two ways. First, the original groupby can be changed to not be an index.

data.groupby(['receiver','posteam'], as_index=False)[['play_id']].count()

The as_index=False means do not make the name and team the index, create a simple numerical index instead and make the names and team separate columns.

The other way is to reset the index after the data is grouped by simply calling .reset_index(inplace=True).

Useful Columns

Since nflfastR features all the columns in nflscrapR, the nflscrapR documentation can be used to view most of the columns in nflfastR.

Below is an incomplete list of frquently used columns:

  • posteam - the offensive team (possesion team)
  • defteam - the defensive team
  • game_id - a unique id given to each NFL game
  • epa - expected points added
  • wp - current win probability of the posteam
  • def_wp - current win probability of the defteam
  • yardline_100 - number of yards from the opponent's endzone
  • passer - the player that passed the ball (on QB scramble plays, the QB is marked as a passer, not rusher)
  • rusher - the player that ran the ball
  • receiver - the player that was targeted on a pass
  • passer_id, rusher_id, receiver_id - the player ID for the passer, rusher, or receiver on a play (useful for joining roster data)
  • cpoe - completion percentage over expected on a given pass
  • down - down of the play
  • play_type - either run, pass, or no_play
  • series_success - marked as a 1 if the series becomes successful (first down or a touchdown)

Finding Specific Games

There are a couple ways to find specific games. The most straight forward is using the home and away team.

data.loc[(data.home_team=='ARI') & (data.away_team=='SEA')]

The above would return all plays from the Seattle @ Arizona game. If there are multiple seasons of data in the dataframe just add the season.

data.loc[(data.home_team=='ARI') & (data.away_team=='SEA') & (data.season==2019)]

Another way to find a specific game is to find it on NFL's website. For example, here is the link for the 2019 game between the Seahawks and Cardinals: https://www.nfl.com/gamecenter/2019092908/2019/REG4/seahawks@cardinals

The number after gamecenter/ is the game_id.

data.loc[data.game_id==2019092908]

Example Analysis

This example will look at QB performance in 2019.

qbs = data.groupby(['passer','posteam'], as_index=False).agg({'epa':'mean',
                                                              'cpoe':'mean',
                                                              'play_id':'count'})

This groups the data by the passer and the team, averaging the expected points added (epa), the completion percentage over expected, and the number of dropbacks they had. To filter out players with only a few plays, a filter on play_id can be used.

#Filter to players with 200 or more dropbacks
qbs = qbs.loc[qbs.play_id>199]

Next, sort by EPA, round the values off, and rename the columns.

#Sort in descending order by EPA 
qbs.sort_values('epa', ascending=False, inplace=True)

#Round to two decimal places where appropriate
qbs = qbs.round(2)

#Rename columns
qbs.columns = ['Player','Team','EPA per Dropback','CPOE','Dropbacks']

Note that dropbacks are specified because this does not include designed running plays for the QB (scrambles are counted). This leaves a pretty good looking table. qbperf

Plotting Data

The most popular library for plotting data is Matplotlib.

import matplotlib.pyplot as plt

Like with pandas the name is abbreviated to plt so there's no need to type out matplotlib.pyplot every time. There's a couple ways to plot in matplotlib, both are outlined below.

Simple Method

The first plot will be a couple histograms, showing the distribution of EPA based on play type. First, create the necessary series.

rush_epa = data.epa.loc[data.play_type=='run']
pass_epa = data.epa.loc[data.play_type=='pass']

A single column of a pandas dataframe is known as a series. By selecting just the epa column in each of the above statements, two series have been created. The following lines are commented along the way to explain what's happening.

#Create figure and enter in a figsize
plt.figure(figsize=(12,8))

#Place a histogram on the figure with the EPA of all pass plays
#Bins are how many groupings or buckets the data will be split into
#Assign a label for the legend and choose a color
plt.hist(pass_epa, bins=25, label='Pass', color='slategrey')

#Place a second histogram this time for rush plays, 
#The alpha < 1 will make this somewhat transparent
plt.hist(rush_epa, bins=25, label='Run', alpha=.7, color='lime')

#Add labels and title
plt.xlabel('Expected Points Added',fontsize=12)
plt.ylabel('Number of Plays',fontsize=12)
plt.title('EPA Distribution Based on Play Type',fontsize=14)

#Add source, the first two numbers are x and y 
#coordinates as a decimal of the whole image
plt.figtext(.8,.04,'Data: nflfastR', fontsize=10)

#Add a legend
plt.legend()

#Save the figure as a png
plt.savefig('epa_dist.png', dpi=400)

epa_dist

The next will be a scatter plot of neutral situation passing rates and EPA per dropback, with each dot being a team game in 2019. Neutral situation is typically defined as 1st and 2nd down, with win probability between 20-80%, outside of the last two minutes of each half.

#Create dataframe of just plays in neutral situations
neutral_situation = data.loc[(data.down<3) & (data.half_seconds_remaining>120) &
                             (data.wp>=.2) & (data.wp<=.8)]
                             
#Groupby team game, taking the average of the 'pass' column
#Pass column = 1 when the play call is a pass and 0 when the play call is a run
pass_rates = neutral_situation.groupby(['game_id','posteam'])[['pass']].mean()

#Add a new column to pass rates dataframe for the EPA per dropback
#Filter to pass plays and groupby the same game_id and posteam
pass_rates['epa'] = neutral_situation.loc[neutral_situation['pass']==1].groupby(
    ['game_id','posteam'])[['epa']].mean()

pass_rates.reset_index(inplace=True)                           

The resulting pass_rates dataframe will contain two rows for each game (one for each team), with the team's passing rate and EPA per dropback in neutral situations.

Declaring the x and y data series before creating the graph can help reduce clutter.

x = pass_rates['pass']
y = pass_rates.epa

To create a line of best fit, another Python library numpy will be needed.

import numpy as np

Now to create the plot.

#Create figure and enter in a figsize
plt.figure(figsize=(10,10))

#Make a scatter plot with neutral situation pass rate on the x-axis, EPA per dropback on the y
plt.scatter(x, y, alpha=.7, color='navy')

#Create line of best fit
#Linestyle gives a dashed line
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), 
         color='darkorange', linestyle='--')

#Add grid lines
plt.grid(zorder=0,alpha=.4)

#Add labels and title
plt.xlabel('Neutral Situation Pass Rate',fontsize=12)
plt.ylabel('EPA per Dropback',fontsize=12)
plt.title('Neutral Situation Pass Rate and EPA per Dropback \n' \
          '1st & 2nd Down, WP between 20-80%, Outside of Two Minute Warnings',fontsize=14)

#Add source, the first two numbers are x and y 
#coordinates as a decimal of the whole image
plt.figtext(.79,.05,'Data: nflfastR', fontsize=10)

#Save the figure as a png
plt.savefig('ns_rates.png', dpi=400)

ns_rates

Getting Team Logos

Before going any further, a few more imports are required for gathering and using team logos:

import os
import urllib.request
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

os comes with Python, but urllib may need to be installed. Open the terminal/command prompt and enter: pip install urllib or conda install urllib if using Anaconda.

Now to download the logos.

Note where it says FOLDER in the last line, make a new folder in the same folder as the jupyter notebook is with a name such as logos and replace FOLDER with the logos.

urls = pd.read_csv('https://raw.githubusercontent.com/statsbylopez/BlogPosts/master/nfl_teamlogos.csv')

for i in range(0,len(urls)):
    urllib.request.urlretrieve(urls['url'].iloc[i], os.getcwd() + '/FOLDER/' + urls['team_code'].iloc[i] + '.png')

Some logos are significantly bigger than others. A program like paint.net can be used to manually scale down the large logos.

Python doesn't easily allow for images to be used on charts as you'll see below, but luckily jezlax found a solution.

Create this function to be able to put images onto a chart.

def getImage(path): 
    return OffsetImage(plt.imread(path), zoom=.5)

Next, a list will be created to store the required information to use the logos, again replace FOLDER with the name the folder with the logos in it.

logos = os.listdir(os.getcwd() + '/FOLDER')

logo_paths = []

for i in logos:
    logo_paths.append(os.getcwd() + '/FOLDER/' + str(i))

Complex Method

This will show you the second way to build charts in Matplotlib, it takes some getting used to but ultimately allows for more flexibility.

This will plot EPA per rush and EPA per dropback for each team in 2019.

#Filter to pass plays and groupby offensive team
team_epa = data.loc[data['pass']==1].groupby('posteam')[['epa']].mean()

#Do the same but for rushing plays
team_epa['rush_epa'] = data.loc[data.rush==1].groupby('posteam')[['epa']].mean()

#Define x and y
x = team_epa.rush_epa
y = team_epa.epa

#Create a figure with size 12x12
fig, ax = plt.subplots(figsize=(15,15))

#Make a scatter plot first to get the points to place logos
ax.scatter(x, y, s=.001)

#Adding logos to the chart
for x0, y0, path in zip(x, y, logo_paths):
    ab = AnnotationBbox(getImage(path), (x0, y0), frameon=False, fontsize=4)
    ax.add_artist(ab)
    
#Add a grid
ax.grid(zorder=0,alpha=.4)
ax.set_axisbelow(True)
    
#Adding labels and text
ax.set_xlabel('EPA per Rush', fontsize=16)
ax.set_ylabel('EPA per Dropback', fontsize=16)
ax.set_title('Avg. EPA by Team & Play Type - 2019', fontsize=20)
plt.figtext(.81, .07, 'Data: nflfastR', fontsize=12)

#Save the figure as a png
plt.savefig('team_epas.png', dpi=400)

team_epas

The above uses manually resized logos so the output with the unadjusted logos will look slightly different.

Another example will show using logos and team colors on a bar chart. For team colors, scroll down to find the dictionary of team colors, copy and paste it into a cell, then execute.

This bar chart shows neutral situation pass rate by team in 2019.

#Create dataframe of just plays in neutral situations
neutral_situation = data.loc[(data.down<3) & (data.half_seconds_remaining>120) &
                             (data.wp>=.2) & (data.wp<=.8)]
                             
#Get neutral situation pass rate
teams = plays.groupby('posteam')[['pass']].mean()

#Add team colors
teams['color'] = COLORS.values()

#Add path column where each row will have the location of the team's logo
teams['path'] = logo_paths

#Sort from highest to lowest pass rate
teams.sort_values('pass',ascending=False,inplace=True)

fig, ax = plt.subplots(figsize=(30,10))

#Add logos
#X data is an array from 0-31 (for 32 teams)
for x0, y0, path in zip(np.arange(0,32),teams['pass']+.005,teams['path']):
    ab = AnnotationBbox(getImage(path), (x0, y0), frameon=False, fontsize=4)
    ax.add_artist(ab)

#Create bar chart, setting color(s) equal to the 32 team color series
ax.bar(np.arange(0,32),teams['pass'],color=teams.color,width=.5)

#Add league average line
ax.axhline(y=teams['pass'].mean(),linestyle='--',color='black')
    
#Add grid
ax.grid(zorder=0,alpha=.6,axis='y')
ax.set_axisbelow(True)

#Set x-ticks labels to be team abbreviations
ax.set_xticks(np.arange(0,32))
ax.set_xticklabels(teams.index,fontsize=16)

#Manually adjust y-axis
ax.set_yticklabels([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7],fontsize=16)

#Set labels and give source
ax.set_ylabel('Pass Rate',fontsize=20,labelpad=20)
ax.set_title('Neutral Situation Pass Rates - 2019 \n 1st & 2nd Down, Win Prob. Between 20-80%, Final Two Minutes Excluded',
             fontsize=26,pad=20)
plt.figtext(.85,.05,'Data: nflfastR',fontsize=14)

#League average line label
plt.text(30.5,.53,'NFL Average',fontsize=14)

plt.savefig('pr.png',dpi=400)

pr

The last example will show how to create an annotated scatter plot. This also requires a library to be installed, called AdjustText.

Open the terminal/command prompt again and enter pip install adjustText or conda install -c conda-forge adjusttext.

Then in the jupyter notebook, import the library.

from adjustText import adjust_text

This will be a plot with CPOE on the x-axis and EPA per dropback on the y-axis.

#Create QBs dataframe with avg epa, avg cpoe, and number of plays
qbs = data.groupby(['passer','posteam'], as_index=False).agg({'epa':'mean',
                                                              'cpoe':'mean',
                                                              'play_id':'count'})

#Set minimum limit of 200 dropbacks
qbs = qbs.loc[qbs.play_id>200]


fig, ax = plt.subplots(figsize=(15,15))

#Create vertical and horizontal lines for averages of each metric
ax.axvline(x=qbs.cpoe.mean(), linestyle='--', alpha=.5, color='black')
ax.axhline(y=qbs.epa.mean(), linestyle='--', alpha=.5, color='black')

#Create a dot for each player
#Find their team color in the COLORS dictionary
#s stands for size, the dot size is proportional to the QBs number of plays
for i in range(len(qbs)):
    ax.scatter(qbs.cpoe.iloc[i], qbs.epa.iloc[i],
               s=qbs.play_id.iloc[i], alpha=.7,
               color=COLORS[qbs.posteam.iloc[i]])

#Add text to each dot
texts = [plt.text(x0,y0,name,ha='right',va='bottom') for x0,y0,name in zip(
    qbs.cpoe, qbs.epa, qbs.passer)]

adjust_text(texts)

#Add grid
ax.grid(zorder=0,alpha=.4)
ax.set_axisbelow(True)

#Remove top and right boundary lines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Add title, labels, and source
ax.set_title('CPOE & EPA - 2019',fontsize=20,pad=15)
ax.set_xlabel('Completion % Over Expected (CPOE)',fontsize=16,labelpad=15)
ax.set_ylabel('EPA per Attempt',fontsize=16,labelpad=15)
plt.figtext(.8,.06,'Data: nflfastR',fontsize=12)

#Save figure
plt.savefig('CPOEepa.png',dpi=400)

CPOEepa

Extras

The following didn't fit well into any section but is helpful nonetheless.

  • A list of named matplotlib colors can be found here
  • The .astype() method can be used to change column types, helpful for turning numbers into strings
  • Tutorial for matplotlib
  • Another popular visualization library is seaborn and is based on matplotlib

Using a Database

nflfastR recently added a feature to create and update a database containing play by play data. That feature only works in R however, so I created a python script to do essentially the same thing. The script, found here contains two functions, one for creating the database and one for updating. Make sure to install SQLite before proceeding.

If you're more experienced with python you can simply start python from the command line, import the script, and execute the function create_db(). If you're newer to python, you can simply copy and paste the code into a jupyter notebook and work from there. See below if you're choosing the latter route.

Once the code is copy and pasted into the jupyter notebook, execute the cell(s) and there will now be few things that happen. First, a database connection is created, if the database provided in con = sqlite3.connect(name) already exists, it will establish a connection to that database. If the database does not exist (first time use), the code will create a database for you.

Now that the connection is established and/or the database is created, you can run the create_db function by typing create_db() into a cell and executing. This will take a few minutes as it's downloading CSVs from the nflfastR data repository, cleaning them, and putting them into a database. Note, the cleaning process is the same as described above, but you can remove any cleaning you don't want (i.e. removing post season games).

The second function is called update_db and similarly can be executed by entering update_db() in a cell and executing. This function also takes a parameter called current_season that you can specify by typing update_db(2020) and that will tell the function to look for games in the 2020 season. Specifying a season that is already in the database will not result in any games being added to the database.

Below are a couple simply SQL queries to get data from the database into a usable pandas dataframe.

To get a single season of data (2019)

query = 'SELECT * FROM plays WHERE season=2019'

data = pd.read_sql(query, con)

To get a few specific columns of data

query = 'SELECT passer, posteam, epa FROM plays WHERE pass=1'

data = pd.read_sql(query, con)

Team Colors

Team colors can be useful for bar graphs and other plots. This is a dictionary of team abbreviations and colors. To get a color, specify a team abbreviation and the color is returned. Typing in COLORS['ARI'] and executing the cell would return #97233F which can be used as a color in a plot.

COLORS = {'ARI':'#97233F','ATL':'#A71930','BAL':'#241773','BUF':'#00338D','CAR':'#0085CA','CHI':'#00143F',
          'CIN':'#FB4F14','CLE':'#FB4F14','DAL':'#B0B7BC','DEN':'#002244','DET':'#046EB4','GB':'#24423C',
          'HOU':'#C9243F','IND':'#003D79','JAX':'#136677','KC':'#CA2430','LA':'#002147','LAC':'#2072BA',
          'LV':'#C4C9CC','MIA':'#0091A0','MIN':'#4F2E84','NE':'#0A2342','NO':'#A08A58','NYG':'#192E6C',
          'NYJ':'#203731','PHI':'#014A53','PIT':'#FFC20E','SEA':'#7AC142','SF':'#C9243F','TB':'#D40909',
          'TEN':'#4095D1','WAS':'#FFC20F'}

Thanks for reading, please leave feedback, and follow me on twitter @DeryckG_.

@gsadams2

This comment has been minimized.

Copy link

@gsadams2 gsadams2 commented May 8, 2020

This is amazing, thanks again Deryck

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.