Created
February 29, 2020 20:21
Use RMarkdown and knitr to estimate shooter embeddings in R and python via Tensorflow at the same time.
--- | |
title: Tensorflow, Tip-Ins, and Tableau, Oh My | |
author: Brock Tibert | |
date: '2020-02-18' | |
slug: tensorflow-tip-ins-and-tableau-oh-my | |
categories: | |
- R | |
tags: | |
- NHL | |
- Tensorflow | |
- Keras | |
image: | |
caption: '' | |
focal_point: '' | |
draft: false | |
--- | |
This notebook aims to show the basics of: | |
1. Tensorflow 2.0 | |
2. Shooter Embedding estimation for NHL Player evaluation | |
3. Evaluating the feasibility of generating a post that switches between `R` and `python` via reticulate
4. Demonstrating code similarity/approach in both languages side-by-side
## TL;DR | |
- Combine Tensorflow/Keras with R | |
- NHL Data to estimate Shooter Player Embeddings | |
- Export to Tableau for exploration (yes, we could use ggplot et al., but this highlights that we have other options, especially for those new to the language)
```{r defaults, include=FALSE} | |
knitr::opts_chunk$set(comment = NA) | |
``` | |
## R Setup | |
```{r setup, include=TRUE} | |
# packages | |
library(keras) | |
suppressPackageStartupMessages(library(tidyverse)) | |
library(reticulate) | |
suppressPackageStartupMessages(library(caret)) | |
# options | |
options(stringsAsFactors = FALSE) | |
use_condaenv("tensorflow") | |
``` | |
## Python setup | |
```{python setup2, include=TRUE} | |
# imports | |
import pandas as pd | |
import numpy as np | |
from sklearn.preprocessing import MinMaxScaler | |
from sklearn.model_selection import train_test_split | |
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten | |
from tensorflow.keras.models import Model | |
from tensorflow.keras.utils import plot_model | |
from tensorflow.keras.preprocessing.text import Tokenizer | |
``` | |
## Get the data | |
### R | |
```{r message=F} | |
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip" | |
download.file(URL, destfile="shots.zip") | |
shots_raw = read_csv("shots.zip") | |
``` | |
What's the shape? | |
```{r} | |
dim(shots_raw) | |
``` | |
### python | |
```{python} | |
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip" | |
shots_raw = pd.read_csv(URL) | |
``` | |
What do we have? | |
```{python} | |
shots_raw.shape | |
``` | |
## Filter rows | |
We want to keep shots that were on net but not on an empty net, and remove records where the shooter id is 0 or missing.
### R | |
```{r} | |
# keep shots that were on goal | |
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1 ) | |
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0) | |
shots_raw = shots_raw %>% filter(shooterPlayerId != 0) | |
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId)) | |
``` | |
What do we have for a shape?
```{r} | |
dim(shots_raw) | |
``` | |
### python | |
```{python} | |
shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :] | |
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :] | |
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :] | |
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :] | |
``` | |
What do we have for a shape? | |
```{python} | |
shots_raw.shape | |
``` | |
## Select Columns | |
With the rows selected, let's keep the columns that we want to include in this analysis.
### R | |
```{r} | |
# keep just the columns that we need | |
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance, | |
arenaAdjustedXCord, arenaAdjustedYCord, shotAngle, offWing) | |
``` | |
The shape ... | |
```{r} | |
dim(shots_raw) | |
``` | |
### python | |
```{python} | |
COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing', | |
'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', | |
'shotAngle'] | |
shots_raw = shots_raw[COLS] | |
``` | |
The shape ... | |
```{python} | |
shots_raw.shape | |
``` | |
## Encode the shot types | |
I am going to one-hot encode the shot types, though in the future I will explore the use of `keras.preprocessing.text.one_hot`. The result will be new columns added to our `shots_raw` dataset, with each shot type flagged as 0/1.
### R | |
```{r} | |
x <- dummyVars(" ~ .", data = shots_raw) | |
shots_raw <- data.frame(predict(x, newdata = shots_raw)) | |
rm(x) | |
``` | |
What do we have? | |
```{r} | |
glimpse(shots_raw) | |
``` | |
### python | |
```{python} | |
shots_raw = pd.get_dummies(shots_raw, columns=['shotType']) | |
print(shots_raw.shape) | |
print(shots_raw.head(3).T) | |
``` | |
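To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a categorical column. The shot-type labels are illustrative placeholders rather than values taken from the data file, and the `shotType_` prefix mirrors the column naming that `pd.get_dummies` produces.

```python
# Toy shot types; the labels are made up for illustration.
shot_types = ["WRIST", "SLAP", "WRIST", "TIP"]

# One output column per distinct category, in sorted order.
categories = sorted(set(shot_types))

# Each row gets a 1 in the column matching its category and 0 elsewhere.
one_hot = [
    {f"shotType_{c}": int(s == c) for c in categories}
    for s in shot_types
]

for row in one_hot:
    print(row)
```

Every row sums to exactly 1 across the new columns, which is the 0/1 flagging described above.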
## Scale the numeric data to [0, 1]
### R | |
```{r} | |
# clunky, but break out columns to standardize | |
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle) | |
tmp2 = preProcess(tmp, method = "range") | |
pp = predict(tmp2, tmp) | |
rm(tmp, tmp2) | |
# drop the original and append these | |
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle) | |
shots_raw = cbind(shots_raw, pp) | |
dim(shots_raw) | |
``` | |
### python | |
```{python} | |
scaler = MinMaxScaler() | |
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle'] | |
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS]) | |
shots_raw.shape | |
``` | |
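Both `preProcess(method = "range")` and `MinMaxScaler()` with its default settings rescale each column to the [0, 1] interval using (x - min) / (max - min). A small sketch of that arithmetic, using made-up shot distances and a hypothetical helper name:

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0 by convention here.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

distances = [10.0, 25.0, 40.0, 70.0]  # hypothetical shot distances
scaled = min_max_scale(distances)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.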
## Setup the tokenizer and fit to the Player IDs | |
For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the unique number of players representing our complete vocabulary. As such, each document represents a shot of the puck on net, and each document only includes one "word", or shooter.
> The trick here is that we have to treat our player ids as character strings. | |
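As a rough mental model of what the tokenizer does with these ids, the sketch below builds the same kind of `word_index`/`index_word` lookup by hand: ids are ranked by how often they appear and numbered from 1, leaving 0 unused. The player ids are invented, and this is a simplification for illustration, not the Keras implementation.

```python
from collections import Counter

# Hypothetical shooter ids, one per shot, already cast to strings.
shooter_ids = ["1001", "1002", "1001", "1003", "1001", "1002"]

# Rank ids by frequency; the most frequent shooter gets index 1.
counts = Counter(shooter_ids)
word_index = {pid: i for i, (pid, _) in enumerate(counts.most_common(), start=1)}
index_word = {i: pid for pid, i in word_index.items()}

# Each shot becomes a length-1 "document" holding a single token index.
sequences = [[word_index[pid]] for pid in shooter_ids]

print(word_index)
print(sequences)
```

Because every "document" holds exactly one token, the sequences can later be flattened into a plain vector of indices without any padding.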
### R | |
```{r} | |
# ensure that the shooter ID is a string | |
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId) | |
# setup the tokenizer | |
shooter_tokenizer = text_tokenizer() | |
# fit the shooters | |
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId) | |
``` | |
What do we have? | |
```{r} | |
shooter_tokenizer$index_word[1:3] | |
shooter_tokenizer$word_index[1:3] | |
``` | |
And how many? | |
```{r} | |
length(shooter_tokenizer$index_word) | |
``` | |
### python | |
```{python} | |
# cast to integer first so a float id (e.g. 123.0) does not leave a stray "0" token
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int') | |
# ensure that the player ID is a string | |
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str') | |
# setup the tokenizer | |
shooter_tokenizer = Tokenizer() | |
# fit the tokenizer to shooters | |
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId) | |
``` | |
What do we have? | |
```{python} | |
list(shooter_tokenizer.index_word.items())[:3] | |
list(shooter_tokenizer.word_index.items())[:3] | |
``` | |
And how many? | |
```{python} | |
len(shooter_tokenizer.index_word.items()) | |
``` | |
## Create the Shooter `sequences` | |
These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot. The key here is that we are using `keras` to help us easily map our data to the new id system. | |
### R | |
```{r} | |
# make sequences with the new index | |
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId) | |
shooters = unlist(shooters) | |
``` | |
What do we have? | |
```{r} | |
class(shooters) | |
length(shooters) | |
``` | |
### python | |
```{python} | |
shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId) | |
shooters = [x[0] for x in shooters] | |
shooters = np.array(shooters) | |
``` | |
What do we have? | |
```{python} | |
type(shooters) | |
len(shooters) | |
``` | |
## Isolate the other features/targets | |
### R | |
```{r} | |
# Was the shot a goal? This is our target. | |
goal = shots_raw$goal | |
# the shot info | |
shot_info = shots_raw %>% select(-shooterPlayerId, -goal) | |
shot_info = as.matrix(shot_info) | |
``` | |
What do we have now? | |
```{r} | |
length(goal); mean(goal); | |
dim(shot_info) | |
colnames(shot_info) | |
``` | |
### python | |
```{python} | |
# Was the shot a goal? This is our target. | |
goal = np.array(shots_raw.goal) | |
# the shot info | |
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False) | |
``` | |
What do we have? | |
```{python} | |
len(goal) | |
goal.mean() | |
shot_info.shape | |
shot_info.columns | |
``` | |
## Define the model architecture | |
### R | |
> Note the +1, it's needed to avoid the index error | |
```{r} | |
# the setup | |
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) +1 | |
SHOT_COLS = ncol(shot_info) | |
VEC_SIZE = 50 | |
# the input layers | |
shooter_input = layer_input(shape=c(1), name = "shooter_input") | |
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input") | |
# shooter layers | |
s1 = layer_embedding(input_dim = NUM_SHOOTERS, | |
output_dim = VEC_SIZE, | |
input_length = 1, | |
name="shooter_embedding")(shooter_input) | |
s2 = layer_flatten(name = "shooter_flat")(s1) | |
s3 = layer_dense(units = 1, activation = "sigmoid")(s2) | |
# put the model together | |
model = keras_model(inputs = shooter_input, outputs = s3) | |
``` | |
Summarize: | |
```{r} | |
summary(model) | |
``` | |
### python | |
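> Note the +1, it's needed to avoid the index error, just as in the R version above: the tokenizer's indices start at 1, so the embedding needs one extra row for the unused index 0.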
```{python} | |
# setup | |
NUM_SHOOTERS = len(np.unique(shooters)) + 1 | |
SHOT_COLS = shot_info.shape[1] | |
VEC_SIZE = 50 | |
# the input layers | |
shooter_input = Input(shape=(1, ), name="shooter_input") | |
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input") | |
# shooter layers | |
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input) | |
s2 = Flatten()(s1) | |
s3 = Dense(1, activation="sigmoid")(s2) | |
# put the model together | |
model = Model(inputs = shooter_input, outputs = s3) | |
``` | |
What do we have? | |
```{python} | |
model.summary() | |
``` | |
And plot the model; this is not available within R at the moment.
```{python, eval=F} | |
# below might choke RMD | |
plot_model(model, to_file='model.png') | |
``` | |
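The `+ 1` on `NUM_SHOOTERS` follows from how the tokenizer numbers its vocabulary: indices start at 1, so the largest index equals the number of shooters, while an embedding table with `input_dim` rows only accepts indices 0 through `input_dim - 1`. The sketch below uses a plain list of lists as a stand-in for the embedding matrix (an assumption for illustration; it is not how Keras stores weights).

```python
def make_table(n_rows, vec_size):
    """Stand-in embedding table: one zero vector per row index."""
    return [[0.0] * vec_size for _ in range(n_rows)]

n_shooters = 3            # tokenizer indices are 1, 2, 3
largest_index = n_shooters

# Too small: valid rows are 0..2, so index 3 is out of range.
too_small = make_table(n_shooters, vec_size=2)
try:
    too_small[largest_index]
    lookup_failed = False
except IndexError:
    lookup_failed = True

# Right size: valid rows are 0..3; row 0 is simply never used.
right_size = make_table(n_shooters + 1, vec_size=2)
vector = right_size[largest_index]

print(lookup_failed, vector)
```

The extra row costs one unused vector, which is the price of keeping the tokenizer's 1-based indices as they are.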
## Train and Evaluate the Model | |
### R | |
Compile the model. | |
```{r} | |
model %>% | |
compile(optimizer = "adam", | |
loss="binary_crossentropy", | |
metrics =c("accuracy")) | |
``` | |
Fit the model and record the history for plotting, if needed | |
```{r, eval=FALSE} | |
history = | |
model %>% | |
fit(x=list(shooters), | |
y=goal, | |
epochs = 5, | |
verbose = 2) | |
``` | |
```{r echo=FALSE} | |
model %>% | |
fit(x=list(shooters), | |
y=goal, | |
epochs = 5, | |
verbose = 2) | |
``` | |
### python | |
Compile the model. | |
```{python} | |
model.compile(optimizer="adam", loss = "binary_crossentropy", metrics = ['accuracy']) | |
``` | |
Fit the model. | |
```{python eval=FALSE}
history = model.fit(shooters, goal, epochs=5)
```
```{python echo=FALSE, results='hide'}
# run the fit quietly so the document compiles without hanging RStudio
# the model has a single input (the shooter index), so only `shooters` is passed
history = model.fit(shooters, goal, epochs=5)
```
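For reference, `binary_crossentropy` averages -[y log(p) + (1 - y) log(1 - p)] over the shots, where y is the 0/1 goal flag and p is the predicted goal probability. Below is a small sketch with invented outcomes and predictions; the `eps` clipping is a simple guard against `log(0)` chosen for this example, not a claim about the exact Keras internals.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

goals = [0, 0, 1, 0]               # made-up outcomes
probs = [0.10, 0.20, 0.70, 0.05]   # made-up predicted probabilities
loss = binary_crossentropy(goals, probs)
print(round(loss, 4))
```

Confident predictions that turn out wrong are penalized heavily, which is why the loss is a more informative training signal than accuracy when goals are rare.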
## Get the Embeddings | |
With our simple model, we have estimated embeddings for each shooter. Let's grab those. | |
### R | |
```{r} | |
shooter_embeddings = get_weights(model)[[1]] | |
``` | |
What do we have? | |
```{r} | |
shooter_embeddings[1:3, 1:3] | |
``` | |
The shape. | |
```{r} | |
dim(shooter_embeddings) | |
``` | |
### python | |
```{python} | |
shooter_embeddings = model.layers[1].get_weights()[0] | |
``` | |
What do we have? | |
```{python} | |
shooter_embeddings[1:4, 1:4] | |
``` | |
The shape. | |
```{python} | |
shooter_embeddings.shape | |
``` | |
## Map the embeddings to the players | |
The embeddings are related to a player, so we are interested in extracting these vectors and looking at player similarity, etc.
### R | |
> This is to help with some of the mapping. There may be more elegant ways to do this, but below is intuitive and simple in my opinion. | |
```{r} | |
# build our vocabulary (player) dataframe | |
# https://www.r-bloggers.com/word-embeddings-with-keras/ | |
players = data.frame( | |
playerid = names(shooter_tokenizer$word_index), | |
id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE) | |
players = dplyr::arrange(players, id) | |
``` | |
The embeddings with names and references | |
```{r comment=NA} | |
# token id i sits in row i + 1: row 1 holds the unused index 0 and R is 1-based
shooter_embeddings = shooter_embeddings[players$id + 1, ]
rownames(shooter_embeddings) = players$playerid | |
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings)) | |
shooter_embeddings[1:3, 1:3] | |
``` | |
### python | |
```{python} | |
# make the embed vectors a pandas dataframe | |
shooter_embeddings = pd.DataFrame(shooter_embeddings) | |
# a list of true shooter ids | |
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()] | |
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()} | |
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"]) | |
# name the columns | |
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])] | |
# align the data by index | |
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True) | |
# clean up the index so its the player | |
shooter_embeddings.index = shooter_embeddings.playerid | |
# the first few | |
shooter_embeddings.iloc[:3, :3] | |
``` | |
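One common way to look at player similarity is the cosine similarity between two embedding vectors. This is an assumption about how the vectors might be compared, not a step taken in this post, and the sketch uses short made-up vectors keyed by hypothetical player ids.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional embeddings keyed by hypothetical player ids.
embeddings = {
    "1001": [0.2, -0.1, 0.4],
    "1002": [0.4, -0.2, 0.8],   # same direction as 1001
    "1003": [-0.3, 0.5, 0.1],
}

def most_similar(player_id, table):
    """Rank the other players by cosine similarity, highest first."""
    target = table[player_id]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in table.items() if other != player_id]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("1001", embeddings)
print(ranking)
```

With the real 50-dimensional vectors, the same ranking could be computed per player once the rows are labeled with player ids as above.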
## Export the data to Tableau | |
Whether it is R or python, you might be asking why I am exporting the data to Tableau. That is a fair question, but the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau. The tool plays a key role in my exploratory analysis pipeline, and the goal below is to show how, in one line of code, we can export our data for rapid exploration, which can aid in our data cleaning and modeling tasks within R/python.
### R
I ported a copy of the `pantab` library in python into R. The trick is that I use `reticulate` to port the python bits into R. As such, at present, it will not work if you are using Google Colab. | |
Installation is simple: | |
```{r eval=FALSE} | |
devtools::install_github("btibert3/pantabR") | |
``` | |
```{r eval=FALSE} | |
sdf = as.data.frame(shooter_embeddings) | |
pantabR::frame_to_hyper(sdf, f="embeddings.hyper", tbl="shooters") | |
``` | |
> In the python section, I am going to use t-SNE to reduce the estimated shooter embeddings into a two dimensional space. The process is similar in `R` using the `Rtsne` package. | |
### python | |
The `pantab` library is easy to install: | |
```{python eval=FALSE} | |
pip install pantab | |
# if in a notebook environment | |
# !pip install pantab | |
``` | |
However, prior to writing out the data for exploration, I am going to use t-SNE to compact the estimated shooter embeddings into a two dimensional coordinate system. For more on t-SNE, refer to [this introduction.](https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1) | |
```{python} | |
from sklearn.manifold import TSNE | |
shooter_tsne = TSNE(n_components=2).fit_transform(shooter_embeddings.iloc[:, :50]) | |
``` | |
Add them to the dataframe | |
```{python} | |
shooter_tsne = pd.DataFrame(shooter_tsne) | |
shooter_tsne.columns = ['t1', 't2'] | |
shooter_embeddings.reset_index(inplace=True, drop=True) | |
shooter_embeddings = pd.concat([shooter_embeddings, shooter_tsne], axis=1) | |
``` | |
With pantab set up, the package makes it really simple to write pandas dataframes to `hyper` files for Tableau.
```{python} | |
import pantab | |
pantab.frame_to_hyper(shooter_embeddings, "embeddings.hyper", table="shooters") | |
``` | |
And the simple embeddings, plotted from our exported `embeddings.hyper` file within Tableau. | |
![](https://github.com/Btibert3/brocktibert/blob/master/public/img/simple-nhl-shooter-embeddings.png?raw=true) | |