@Btibert3
Created February 29, 2020 20:21
Use RMarkdown and knitr to estimate shooter embeddings in R and python via Tensorflow at the same time.
---
title: Tensorflow, Tip-Ins, and Tableau, Oh My
author: Brock Tibert
date: '2020-02-18'
slug: tensorflow-tip-ins-and-tableau-oh-my
categories:
- R
tags:
- NHL
- Tensorflow
- Keras
image:
caption: ''
focal_point: ''
draft: false
---
This notebook aims to:
1. Show the basics of Tensorflow 2.0
2. Estimate shooter embeddings for NHL player evaluation
3. Evaluate the feasibility of generating a post that switches between `R` and `python` via reticulate
4. Demonstrate the similarity of the code and approach in both languages, side by side
## TL;DR
- Combine Tensorflow/Keras with R
- NHL Data to estimate Shooter Player Embeddings
- Export to Tableau for exploration (yes, we could use ggplot et al., but this highlights that we have other options, especially for those new to the language)
```{r defaults, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```
## R Setup
```{r setup, include=TRUE}
# packages
library(keras)
suppressPackageStartupMessages(library(tidyverse))
library(reticulate)
suppressPackageStartupMessages(library(caret))
# options
options(stringsAsFactors = FALSE)
use_condaenv("tensorflow")
```
## Python setup
```{python setup2, include=TRUE}
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer
```
## Get the data
### R
```{r message=F}
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
download.file(URL, destfile="shots.zip")
shots_raw = read_csv("shots.zip")
```
What's the shape?
```{r}
dim(shots_raw)
```
### python
```{python}
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
shots_raw = pd.read_csv(URL)
```
What do we have?
```{python}
shots_raw.shape
```
## Filter rows
We want to keep shots that were on net and not taken on an empty net, and remove records where the shooter id is 0 or missing.
### R
```{r}
# keep shots that were on goal
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1 )
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0)
shots_raw = shots_raw %>% filter(shooterPlayerId != 0)
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId))
```
What do we have for a shape?
```{r}
dim(shots_raw)
```
### python
```{python}
shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :]
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :]
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :]
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :]
```
What do we have for a shape?
```{python}
shots_raw.shape
```
## Select Columns
With the rows selected, let's keep the columns that we want to include in this analysis.
### R
```{r}
# keep just the columns that we need
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance,
arenaAdjustedXCord, arenaAdjustedYCord, shotAngle, offWing)
```
The shape ...
```{r}
dim(shots_raw)
```
### python
```{python}
COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing',
'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord',
'shotAngle']
shots_raw = shots_raw[COLS]
```
The shape ...
```{python}
shots_raw.shape
```
## Encode the shot types
I am going to one-hot the shot types, though in the future I will explore the use of `keras.preprocessing.text.one_hot`. The result will be new columns added to our `shots_raw` dataset, with each shot type flagged as 0/1.
### R
```{r}
x <- dummyVars(" ~ .", data = shots_raw)
shots_raw <- data.frame(predict(x, newdata = shots_raw))
rm(x)
```
What do we have?
```{r}
glimpse(shots_raw)
```
### python
```{python}
shots_raw = pd.get_dummies(shots_raw, columns=['shotType'])
print(shots_raw.shape)
print(shots_raw.head(3).T)
```
## Scale the numeric data to 0/1
### R
```{r}
# clunky, but break out columns to standardize
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle)
tmp2 = preProcess(tmp, method = "range")
pp = predict(tmp2, tmp)
rm(tmp, tmp2)
# drop the original and append these
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle)
shots_raw = cbind(shots_raw, pp)
dim(shots_raw)
```
### python
```{python}
scaler = MinMaxScaler()
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle']
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS])
shots_raw.shape
```
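As a rough illustration of what both `preProcess(method = "range")` and `MinMaxScaler` compute, here is a minimal sketch of min-max scaling, `(x - min) / (max - min)`, using only the standard library. The function name `min_max_scale` and the sample distances are made up for this example, and mapping a constant column to 0.0 is a choice made in the sketch rather than something taken from either library.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # constant column: nothing to spread out, so this sketch returns 0.0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# hypothetical shot distances, not taken from the data set
distances = [10.0, 25.0, 40.0, 70.0]
scaled = min_max_scale(distances)
print(scaled)
```

The smallest value lands on 0 and the largest on 1, which is why the scaled columns can sit alongside the 0/1 shot type flags.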
## Setup the tokenizer and fit to the Player IDs
For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, each document represents a shot of the puck on net, and each document only includes one "word", or shooter.
> The trick here is that we have to treat our player ids as character strings.
### R
```{r}
# ensure that the shooter ID is a string
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId)
# setup the tokenizer
shooter_tokenizer = text_tokenizer()
# fit the shooters
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId)
```
What do we have?
```{r}
shooter_tokenizer$index_word[1:3]
shooter_tokenizer$word_index[1:3]
```
And how many?
```{r}
length(shooter_tokenizer$index_word)
```
### python
```{python}
# cast to integer first so a trailing ".0" is not parsed as its own token
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int')
# ensure that the player ID is a string
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str')
# setup the tokenizer
shooter_tokenizer = Tokenizer()
# fit the tokenizer to shooters
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)
```
What do we have?
```{python}
list(shooter_tokenizer.index_word.items())[:3]
list(shooter_tokenizer.word_index.items())[:3]
```
And how many?
```{python}
len(shooter_tokenizer.index_word.items())
```
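To see what the tokenizer is doing for us, here is a minimal sketch of the same idea in plain python. It is an approximation and not the Keras implementation: the helper `fit_index` and the player ids are hypothetical, and the only behaviour it tries to mirror is that indices start at 1 and the most frequent token comes first.

```python
from collections import Counter

def fit_index(tokens):
    """Assign each distinct token a 1-based index, most frequent first."""
    counts = Counter(tokens)
    # index 0 is deliberately left unused, mirroring the Keras convention
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}

# hypothetical player ids as strings, one per shot
shots = ["8471214", "8478402", "8471214", "8477934", "8471214", "8478402"]

word_index = fit_index(shots)                      # player id -> new index
index_word = {i: tok for tok, i in word_index.items()}  # new index -> player id
sequences = [word_index[s] for s in shots]         # one index per shot

print(word_index)
print(sequences)
```

Each shot becomes a single integer, and the two dictionaries let us move between the original player id and the compact index.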
## Create the Shooter `sequences`
These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using `keras` to help us easily map our data to the new id system.
### R
```{r}
# make sequences with the new index
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId)
shooters = unlist(shooters)
```
What do we have?
```{r}
class(shooters)
length(shooters)
```
### python
```{python}
shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)
shooters = [x[0] for x in shooters]
shooters = np.array(shooters)
```
What do we have?
```{python}
type(shooters)
len(shooters)
```
## Isolate the other features/targets
### R
```{r}
# Was the shot a goal? This is our target.
goal = shots_raw$goal
# the shot info
shot_info = shots_raw %>% select(-shooterPlayerId, -goal)
shot_info = as.matrix(shot_info)
```
What do we have now?
```{r}
length(goal); mean(goal);
dim(shot_info)
colnames(shot_info)
```
### python
```{python}
# Was the shot a goal? This is our target.
goal = np.array(shots_raw.goal)
# the shot info
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False)
```
What do we have?
```{python}
len(goal)
goal.mean()
shot_info.shape
shot_info.columns
```
## Define the model architecture
### R
> Note the +1. The tokenizer indexes start at 1 and index 0 is reserved, so the embedding needs one extra row to avoid an index error.
```{r}
# the setup
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) +1
SHOT_COLS = ncol(shot_info)
VEC_SIZE = 50
# the input layers
shooter_input = layer_input(shape=c(1), name = "shooter_input")
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input")
# shooter layers
s1 = layer_embedding(input_dim = NUM_SHOOTERS,
output_dim = VEC_SIZE,
input_length = 1,
name="shooter_embedding")(shooter_input)
s2 = layer_flatten(name = "shooter_flat")(s1)
s3 = layer_dense(units = 1, activation = "sigmoid")(s2)
# put the model together
model = keras_model(inputs = shooter_input, outputs = s3)
```
Summarize:
```{r}
summary(model)
```
### python
> Note the +1. As in the R version above, it's needed to avoid the index error, because the tokenizer indexes start at 1 and index 0 is reserved.
```{python}
# setup
NUM_SHOOTERS = len(np.unique(shooters)) + 1
SHOT_COLS = shot_info.shape[1]
VEC_SIZE = 50
# the input layers
shooter_input = Input(shape=(1, ), name="shooter_input")
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input")
# shooter layers
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input)
s2 = Flatten()(s1)
s3 = Dense(1, activation="sigmoid")(s2)
# put the model together
model = Model(inputs = shooter_input, outputs = s3)
```
What do we have?
```{python}
model.summary()
```
We can also plot the model; this is not available within R at the moment.
```{python, eval=F}
# below might choke RMD
plot_model(model, to_file='model.png')
```
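To make the architecture and the +1 concrete, below is a minimal sketch of what this model computes for a single shot: look up the shooter's row in the embedding table, pass it through one dense unit, and apply a sigmoid. It uses only the standard library, the sizes are deliberately tiny, and the weights are random stand-ins rather than trained values. The names `predict_goal`, `NUM_PLAYERS`, and `INPUT_DIM` are assumptions for this example.

```python
import math
import random

random.seed(0)

NUM_PLAYERS = 3                 # hypothetical vocabulary size
VEC_SIZE = 4                    # small on purpose; the post uses 50
INPUT_DIM = NUM_PLAYERS + 1     # row 0 is reserved, rows 1..3 hold the players

# randomly initialised weights stand in for trained ones
embedding = [[random.uniform(-0.05, 0.05) for _ in range(VEC_SIZE)]
             for _ in range(INPUT_DIM)]
dense_w = [random.uniform(-1.0, 1.0) for _ in range(VEC_SIZE)]
dense_b = 0.0

def predict_goal(player_index):
    """Embedding lookup -> single dense unit -> sigmoid, for one shot."""
    vec = embedding[player_index]   # raises IndexError if the table is too small
    z = sum(w * x for w, x in zip(dense_w, vec)) + dense_b
    return 1.0 / (1.0 + math.exp(-z))

# tokenizer-style indexes run from 1 to NUM_PLAYERS
probs = [predict_goal(i) for i in (1, 2, 3)]
print(probs)
```

With only `NUM_PLAYERS` rows, the lookup for the last player would fall off the end of the table, which is the index error the note refers to.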
## Train and Evaluate the Model
### R
Compile the model.
```{r}
model %>%
compile(optimizer = "adam",
loss="binary_crossentropy",
metrics =c("accuracy"))
```
Fit the model and record the history for plotting, if needed.
```{r, eval=FALSE}
history =
model %>%
fit(x=list(shooters),
y=goal,
epochs = 5,
verbose = 2)
```
```{r echo=FALSE}
model %>%
fit(x=list(shooters),
y=goal,
epochs = 5,
verbose = 2)
```
### python
Compile the model.
```{python}
model.compile(optimizer="adam", loss = "binary_crossentropy", metrics = ['accuracy'])
```
Fit the model.
```{python eval=FALSE}
# the model has a single input, so only the shooter indexes are passed
X = shooters
history = model.fit(X, goal, epochs=5)
```
```{python echo=FALSE, results='hide'}
# doing this to help the document compile so it doesn't hang RStudio
# issue is sequence is a list of tuples, on colab fixed
X = shooters
history = model.fit(X, goal, epochs=5)
```
## Get the Embeddings
With our simple model, we have estimated embeddings for each shooter. Let's grab those.
### R
```{r}
shooter_embeddings = get_weights(model)[[1]]
```
What do we have?
```{r}
shooter_embeddings[1:3, 1:3]
```
The shape.
```{r}
dim(shooter_embeddings)
```
### python
```{python}
shooter_embeddings = model.layers[1].get_weights()[0]
```
What do we have?
```{python}
shooter_embeddings[1:4, 1:4]
```
The shape.
```{python}
shooter_embeddings.shape
```
## Map the embeddings to the players
The embeddings are related to a player, so we are interested in extracting these vectors and looking at player similarity, etc.
### R
> This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion.
```{r}
# build our vocabulary (player) dataframe
# https://www.r-bloggers.com/word-embeddings-with-keras/
players = data.frame(
playerid = names(shooter_tokenizer$word_index),
id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE)
players = dplyr::arrange(players, id)
```
The embeddings with names and references
```{r comment=NA}
# row 1 of the weight matrix belongs to the reserved index 0, so shift the
# tokenizer ids by one to line up with R's 1-based rows
shooter_embeddings = shooter_embeddings[players$id + 1, ]
rownames(shooter_embeddings) = players$playerid
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings))
shooter_embeddings[1:3, 1:3]
```
### python
```{python}
# make the embed vectors a pandas dataframe
shooter_embeddings = pd.DataFrame(shooter_embeddings)
# a list of true shooter ids
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()}
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"])
# name the columns
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])]
# align the data by index
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True)
# clean up the index so its the player
shooter_embeddings.index = shooter_embeddings.playerid
# the first few
shooter_embeddings.iloc[:3, :3]
```
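Once the vectors are keyed by player, one common way to look at player similarity is cosine similarity. The post does not prescribe a similarity measure, so treat this as one option among several. The sketch below uses only the standard library, and the three-dimensional vectors and player labels are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical 3-dimensional embeddings keyed by made-up player labels
embeddings = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.8, 0.2, 0.1],
    "p3": [-0.7, 0.3, 0.5],
}

def most_similar(player, table):
    """Return (other_player, similarity) for the closest other player."""
    others = [(other, cosine_similarity(table[player], vec))
              for other, vec in table.items() if other != player]
    return max(others, key=lambda pair: pair[1])

print(most_similar("p1", embeddings))
```

The same function could be pointed at the `e1` to `e50` columns of `shooter_embeddings`, one row per player, to rank shooters against each other.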
## Export the data to Tableau
Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is to show how, in one line of code, we can export our data for rapid exploration, which can aid in our data cleaning and modeling tasks within R/python.
### R
I ported the python `pantab` library to R. The trick is that I use `reticulate` to call the python bits from R. As such, at present, it will not work if you are using Google Colab.
Installation is simple:
```{r eval=FALSE}
devtools::install_github("btibert3/pantabR")
```
```{r eval=FALSE}
sdf = as.data.frame(shooter_embeddings)
pantabR::frame_to_hyper(sdf, f="embeddings.hyper", tbl="shooters")
```
> In the python section, I am going to use t-SNE to reduce the estimated shooter embeddings into a two dimensional space. The process is similar in `R` using the `Rtsne` package.
### python
The `pantab` library is easy to install:
```{bash eval=FALSE}
pip install pantab
# if in a notebook environment
# !pip install pantab
```
However, prior to writing out the data for exploration, I am going to use t-SNE to reduce the estimated shooter embeddings to a two-dimensional coordinate system. For more on t-SNE, refer to [this introduction.](https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1)
```{python}
from sklearn.manifold import TSNE
shooter_tsne = TSNE(n_components=2).fit_transform(shooter_embeddings.iloc[:, :50])
```
Add them to the dataframe
```{python}
shooter_tsne = pd.DataFrame(shooter_tsne)
shooter_tsne.columns = ['t1', 't2']
shooter_embeddings.reset_index(inplace=True, drop=True)
shooter_embeddings = pd.concat([shooter_embeddings, shooter_tsne], axis=1)
```
With pantab set up, it is really simple to write pandas dataframes to `hyper` files for Tableau.
```{python}
import pantab
pantab.frame_to_hyper(shooter_embeddings, "embeddings.hyper", table="shooters")
```
And the simple embeddings, plotted from our exported `embeddings.hyper` file within Tableau.
![](https://github.com/Btibert3/brocktibert/blob/master/public/img/simple-nhl-shooter-embeddings.png?raw=true)