---
title: Tensorflow, Tip-Ins, and Tableau, Oh My
author: Brock Tibert
date: '2020-02-18'
slug: tensorflow-tip-ins-and-tableau-oh-my
categories:
  - R
tags:
  - NHL
  - Tensorflow
  - Keras
image:
  caption: ''
  focal_point: ''
draft: false
---

This notebook aims to show the basics of:

1. Tensorflow 2.0
2. Shooter Embedding estimation for NHL Player evaluation
3. Evaluate the feasibility of writing a post that switches between `R` and `python` via `reticulate`
4. Demonstrate code similarity/approach in both languages side-by-side

## TL;DR

- Combine Tensorflow/Keras with R
- NHL Data to estimate Shooter Player Embeddings
- Export to Tableau for exploration (yes, we could use ggplot et al., but this highlights that we have other options, especially for those new to the language)


```{r defaults, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```

## R Setup

```{r setup, include=TRUE}
# packages
library(keras)

suppressPackageStartupMessages(library(tidyverse))
library(reticulate)
suppressPackageStartupMessages(library(caret))

# options
options(stringsAsFactors = FALSE)
use_condaenv("tensorflow")
```

## Python setup

```{python setup2, include=TRUE}
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Activation, concatenate, Dense, Dropout, Embedding, Input, Reshape, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import Tokenizer

```


## Get the data


### R

```{r message=F}
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
download.file(URL, destfile="shots.zip")
shots_raw = read_csv("shots.zip")
```


What's the shape?

```{r}
dim(shots_raw)
```


### python

```{python}
URL = "http://peter-tanner.com/moneypuck/downloads/shots_2019.zip"
shots_raw = pd.read_csv(URL)
```

What do we have?

```{python}
shots_raw.shape
```


## Filter rows

We want to keep shots that were on net, drop shots on an empty net, and remove records where the shooter id is 0 or missing.

### R

```{r}
# keep shots that were on goal
shots_raw = shots_raw %>% filter(shotWasOnGoal == 1 )
shots_raw = shots_raw %>% filter(shotOnEmptyNet == 0)
shots_raw = shots_raw %>% filter(shooterPlayerId != 0)
shots_raw = shots_raw %>% filter(!is.na(shooterPlayerId))
```

What do we have for a shape?

```{r}
dim(shots_raw)
```


### python

```{python}
shots_raw = shots_raw.loc[shots_raw.shotOnEmptyNet == 0, :]
shots_raw = shots_raw.loc[shots_raw.shotWasOnGoal == 1, :]
shots_raw = shots_raw.loc[shots_raw.shooterPlayerId != 0, :]
shots_raw = shots_raw.loc[~shots_raw.shooterPlayerId.isna(), :]
```

What do we have for a shape?

```{python}
shots_raw.shape
```


## Select Columns

With the rows selected, let's keep the columns that we want to include in this analysis.

### R

```{r}
# keep just the columns that we need
shots_raw = shots_raw %>% select(shooterPlayerId, shotType, goal, arenaAdjustedShotDistance,
                                   arenaAdjustedXCord, arenaAdjustedYCord,  shotAngle, offWing)
```

The shape ...

```{r}
dim(shots_raw)
```


### python

```{python}
COLS = ['shooterPlayerId', 'shotType', 'goal', 'offWing',
        'arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord',
        'shotAngle']
shots_raw = shots_raw[COLS]
```


The shape ...

```{python}
shots_raw.shape
```


## Encode the shot types

I am going to one-hot the shot types, though in the future I will explore the use of `keras.preprocessing.text.one_hot`.  The result will be new columns added to our `shots_raw` dataset, with each shot type flagged as 0/1.
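
To make the encoding concrete, here is a minimal pure-Python sketch of what one-hot encoding produces. The `one_hot` helper and the shot-type labels are made up for illustration; in the post itself the work is done by `dummyVars` and `pd.get_dummies`.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    levels = sorted(set(values))
    columns = [f"{prefix}_{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return columns, rows


# hypothetical shot types, not read from the MoneyPuck file
shot_types = ["WRIST", "SLAP", "WRIST", "TIP"]
columns, rows = one_hot(shot_types, prefix="shotType")

print(columns)
for row in rows:
    print(row)
```

Each row carries exactly one `1`, in the column that matches that shot's type.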

### R

```{r}
x <- dummyVars(" ~ .", data = shots_raw)
shots_raw <- data.frame(predict(x, newdata = shots_raw))
rm(x)
```

What do we have?

```{r}
glimpse(shots_raw)
```


### python


```{python}
shots_raw = pd.get_dummies(shots_raw, columns=['shotType'])
print(shots_raw.shape)
print(shots_raw.head(3).T)
```


## Scale the numeric data to 0/1

### R

```{r}
# clunky, but break out columns to standardize
tmp = shots_raw %>% select(arenaAdjustedShotDistance:shotAngle)
tmp2 = preProcess(tmp, method = "range")
pp = predict(tmp2, tmp)
rm(tmp, tmp2)

# drop the original and append these
shots_raw = select(shots_raw, -arenaAdjustedShotDistance:-shotAngle)
shots_raw = cbind(shots_raw, pp)
dim(shots_raw)
```


### python

```{python}
scaler = MinMaxScaler()
COLS = ['arenaAdjustedShotDistance', 'arenaAdjustedXCord', 'arenaAdjustedYCord', 'shotAngle']
shots_raw[COLS] = scaler.fit_transform(shots_raw[COLS])
shots_raw.shape
```
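
Both `preProcess(method = "range")` and `MinMaxScaler` (with their default settings) rescale each column with the min-max formula `(x - min) / (max - min)`. A minimal sketch of that arithmetic, using a hypothetical `min_max_scale` helper and made-up distances:

```python
def min_max_scale(values):
    """Rescale numbers so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant column has no spread; this sketch maps it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# made-up shot distances, for illustration only
distances = [10.0, 25.0, 40.0, 70.0]
scaled = min_max_scale(distances)
print(scaled)
```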


## Setup the tokenizer and fit to the Player IDs

For this exercise, instead of converting the player ids to be 0-based, I am going to treat the player ids as if they are unique words, with the number of unique players representing our complete vocabulary.  As such, each document represents a shot of the puck on net, and each document only includes one "word", or shooter.

> The trick here is that we have to treat our player ids as character strings.
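
As a rough mental model, the tokenizer counts each "word" and hands out 1-based ids, most frequent first, leaving 0 unused. The sketch below mimics that behaviour in plain Python; `build_word_index` and the player ids are hypothetical, and this is not the Keras implementation itself.

```python
from collections import Counter


def build_word_index(tokens):
    """Assign 1-based ids to tokens, most frequent first (0 stays unused)."""
    counts = Counter(tokens)
    ranked = counts.most_common()  # ties keep first-seen order
    return {token: i for i, (token, _) in enumerate(ranked, start=1)}


# made-up player ids, already cast to strings
shots = ["8471214", "8478402", "8471214", "8477934", "8471214", "8478402"]

word_index = build_word_index(shots)
index_word = {i: token for token, i in word_index.items()}

# one "word" per shot, so every sequence has length 1
sequences = [[word_index[s]] for s in shots]

print(word_index)
print(sequences)
```

The two dictionaries play the same role as the `word_index` and `index_word` attributes inspected below.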

### R

```{r}
# ensure that the shooter ID is a string
shots_raw$shooterPlayerId = as.character(shots_raw$shooterPlayerId)

# setup the tokenizer
shooter_tokenizer = text_tokenizer()

# fit the shooters
fit_text_tokenizer(shooter_tokenizer, shots_raw$shooterPlayerId)
```

What do we have?

```{r}
shooter_tokenizer$index_word[1:3]
shooter_tokenizer$word_index[1:3]
```

And how many?

```{r}
length(shooter_tokenizer$index_word)
```


### python

```{python}
# cast to integer first so the string has no trailing ".0" (the tokenizer would otherwise parse the 0 as its own token)
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('int')

# ensure that the player ID is a string
shots_raw.shooterPlayerId = shots_raw.shooterPlayerId.astype('str')

# setup the tokenizer
shooter_tokenizer = Tokenizer()

# fit the tokenizer to shooters
shooter_tokenizer.fit_on_texts(shots_raw.shooterPlayerId)

```


What do we have?

```{python}
list(shooter_tokenizer.index_word.items())[:3]
list(shooter_tokenizer.word_index.items())[:3]
```


And how many?

```{python}
len(shooter_tokenizer.index_word.items())
```


## Create the Shooter `sequences`

These are size 1 sequences that do not require padding, as we only allow 1 word (or player) per shot.  The key here is that we are using `keras` to help us easily map our data to the new id system.

### R

```{r}
# make sequences with the new index
shooters = texts_to_sequences(shooter_tokenizer, shots_raw$shooterPlayerId)
shooters = unlist(shooters)
```


What do we have?

```{r}
class(shooters)
length(shooters)
```


### python

```{python}
shooters = shooter_tokenizer.texts_to_sequences(shots_raw.shooterPlayerId)
shooters = [x[0] for x in shooters]
shooters = np.array(shooters)
```

What do we have?

```{python}
type(shooters)
len(shooters)
```


## Isolate the other features/targets

### R

```{r}
# Was the shot a goal?   This is our target.
goal = shots_raw$goal

# the shot info
shot_info = shots_raw %>% select(-shooterPlayerId, -goal)
shot_info = as.matrix(shot_info)
```

What do we have now?

```{r}
length(goal); mean(goal);
dim(shot_info)
colnames(shot_info)
```

### python


```{python}

# Was the shot a goal?   This is our target.
goal = np.array(shots_raw.goal)

# the shot info
shot_info = shots_raw.drop(columns=['shooterPlayerId', 'goal'], axis=1, inplace=False)

```

What do we have?

```{python}
len(goal)
goal.mean()
shot_info.shape
shot_info.columns
```


## Define the model architecture

### R

> Note the +1: the tokenizer's ids start at 1, so the embedding needs one extra row to cover index 0 and avoid an index error.

```{r}
# the setup
NUM_SHOOTERS = length(unique(unlist(shooter_tokenizer$index_word))) +1
SHOT_COLS = ncol(shot_info)
VEC_SIZE = 50

# the input layers
shooter_input = layer_input(shape=c(1), name = "shooter_input")
shot_input = layer_input(shape=c(SHOT_COLS), name = "shot_input")

# shooter layers
s1 = layer_embedding(input_dim = NUM_SHOOTERS,
                     output_dim = VEC_SIZE,
                     input_length = 1,
                     name="shooter_embedding")(shooter_input)
s2 = layer_flatten(name = "shooter_flat")(s1)
s3 = layer_dense(units = 1, activation = "sigmoid")(s2)

# put the model together
model = keras_model(inputs = shooter_input, outputs = s3)
```


Summarize:

```{r}
summary(model)
```


### python

> Note the +1: as in the R version, the tokenizer's ids start at 1, so the embedding needs one extra row to avoid an index error.

```{python}
# setup
NUM_SHOOTERS = len(np.unique(shooters)) + 1
SHOT_COLS = shot_info.shape[1]
VEC_SIZE = 50

# the input layers
shooter_input = Input(shape=(1, ), name="shooter_input")
shot_input = Input(shape=(SHOT_COLS, ), name="shot_input")

# shooter layers
s1 = Embedding(NUM_SHOOTERS, VEC_SIZE, input_length=1)(shooter_input)
s2 = Flatten()(s1)
s3 = Dense(1, activation="sigmoid")(s2)

# put the model together
model = Model(inputs = shooter_input, outputs = s3)
```


What do we have?

```{python}
model.summary()
```
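
The sizes reported in the summary can be checked by hand. The sketch below uses two hypothetical helpers and a made-up five-player vocabulary to show why the embedding needs one row more than the number of shooters, and how many trainable parameters the embedding and dense layers contribute.

```python
def embedding_rows(token_ids):
    """Rows needed so that the largest token id is a valid row index."""
    return max(token_ids) + 1  # ids start at 1, row 0 is never looked up


def parameter_counts(num_rows, vec_size):
    """Trainable parameters for the embedding -> flatten -> dense(1) stack."""
    embedding = num_rows * vec_size
    dense = vec_size + 1  # one weight per embedding dimension plus a bias
    return embedding, dense


token_ids = [1, 2, 3, 4, 5]  # made-up vocabulary of five shooters
rows = embedding_rows(token_ids)
embedding_params, dense_params = parameter_counts(rows, vec_size=50)
print(rows, embedding_params, dense_params)
```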


We can also plot the model; this is not available within R at the moment.

```{python, eval=F}
# below might choke RMD
plot_model(model, to_file='model.png')
```


## Train and Evaluate the Model

### R

Compile the model.

```{r}
model %>%
  compile(optimizer = "adam",
          loss="binary_crossentropy",
          metrics =c("accuracy"))
```

Fit the model and record the history for plotting, if needed

```{r, eval=FALSE}
history =
  model %>%
  fit(x=list(shooters),
      y=goal,
      epochs = 5,
      verbose = 2)
```

```{r echo=FALSE}
model %>%
  fit(x=list(shooters),
      y=goal,
      epochs = 5,
      verbose = 2)
```


### python

Compile the model.

```{python}
model.compile(optimizer="adam", loss = "binary_crossentropy", metrics = ['accuracy'])
```
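
For intuition about the loss being minimized, here is a small pure-Python sketch of binary cross-entropy. The `binary_crossentropy` helper, the clipping constant `eps`, and the toy goals and predictions are assumptions made for this illustration, not values taken from the model.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all shots."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


goals = [0, 0, 1, 0]                 # made-up outcomes: one goal in four shots
baseline = [0.25, 0.25, 0.25, 0.25]  # always predict the overall goal rate
sharper = [0.10, 0.10, 0.70, 0.10]   # more confident and correct

loss_baseline = binary_crossentropy(goals, baseline)
loss_sharper = binary_crossentropy(goals, sharper)
print(loss_baseline, loss_sharper)
```

Predictions that are confident and correct lower the loss relative to always guessing the base rate.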


Fit the model.

```{python eval=FALSE}
# the model has a single input (the shooter ids), so shot_info is not passed
X = shooters
history = model.fit(X, goal, epochs=5)
```

```{python echo=FALSE, results='hide'}
# run the fit quietly so the document compiles without hanging RStudio
# issue is sequence is a list of tuples, on colab fixed
X = shooters
history = model.fit(X, goal, epochs=5)
```


## Get the Embeddings

With our simple model, we have estimated embeddings for each shooter.  Let's grab those.

### R

```{r}
shooter_embeddings = get_weights(model)[[1]]
```


What do we have?

```{r}
shooter_embeddings[1:3, 1:3]
```


The shape.

```{r}
dim(shooter_embeddings)
```


### python

```{python}
shooter_embeddings = model.layers[1].get_weights()[0]
```

What do we have?

```{python}
shooter_embeddings[1:4, 1:4]
```


The shape.

```{python}
shooter_embeddings.shape
```


## Map the embeddings to the players

Each embedding belongs to a player, so we are interested in extracting these vectors and looking at player similarity, etc.

### R

> This is to help with some of the mapping.  There may be more elegant ways to do this, but below is intuitive and simple in my opinion.

```{r}
# build our vocabulary (player) dataframe
# https://www.r-bloggers.com/word-embeddings-with-keras/
players = data.frame(
  playerid = names(shooter_tokenizer$word_index),
  id = as.integer(unlist(shooter_tokenizer$word_index)), stringsAsFactors=FALSE)

players = dplyr::arrange(players, id)
```

The embeddings with names and references

```{r comment=NA}
# row 1 of the weight matrix is Keras index 0 (unused), so token id k sits in row k + 1
shooter_embeddings = shooter_embeddings[players$id + 1, ]
rownames(shooter_embeddings) = players$playerid
colnames(shooter_embeddings) = paste0("e", 1:ncol(shooter_embeddings))
shooter_embeddings[1:3, 1:3]
```


### python


```{python}
# make the embed vectors a pandas dataframe
shooter_embeddings = pd.DataFrame(shooter_embeddings)

# a list of true shooter ids
#shooter_id = [v for k, v in shooter_tokenizer.index_word.items()]
shooter_id = {k:v for k, v in shooter_tokenizer.index_word.items()}
shooter_df = pd.DataFrame.from_dict(shooter_id, orient='index', columns=["playerid"])

# name the columns
shooter_embeddings.columns = ["e" + str(i + 1) for i in range(shooter_embeddings.shape[1])]

# align the data by index
shooter_embeddings = pd.merge(shooter_embeddings, shooter_df, how='inner', left_index=True, right_index=True)

# clean up the index so its the player
shooter_embeddings.index = shooter_embeddings.playerid

# the first few
shooter_embeddings.iloc[:3, :3]

```
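
Once each vector is tied to a player id, one simple way to look at player similarity is cosine similarity between vectors. The sketch below is a hedged illustration: the helpers, the player labels, and the 3-dimensional vectors are made up, and the post itself does not compute similarity this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(player_id, embeddings):
    """Rank every other player by similarity to player_id, best first."""
    target = embeddings[player_id]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != player_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# made-up vectors standing in for rows of shooter_embeddings
embeddings = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
}
ranked = most_similar("p1", embeddings)
print(ranked)
```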


## Export the data to Tableau

Whether you work in R or python, you might be asking why I am exporting the data to Tableau.  That is a fair question; the point is to show how the ecosystem of data science programming libraries can also leverage best-of-breed data visualization suites such as Tableau.  The tool plays a key role in my exploratory analysis pipeline, and the goal below is to show how, in one line of code, we can export our data for rapid exploration, which can aid our data cleaning and modeling tasks within R/python.

### R

I ported the python `pantab` library to R.  The trick is that I use `reticulate` to call the python bits from R.  As such, at present, it will not work if you are using Google Colab.

Installation is simple:

```{r eval=FALSE}
devtools::install_github("btibert3/pantabR")
```


```{r eval=FALSE}
sdf = as.data.frame(shooter_embeddings)
pantabR::frame_to_hyper(sdf, f="embeddings.hyper", tbl="shooters")
```

> In the python section, I am going to use t-SNE to reduce the estimated shooter embeddings into a two dimensional space.  The process is similar in `R` using the `Rtsne` package.

### python

The `pantab` library is easy to install:

```{python eval=FALSE}
pip install pantab
# if in a notebook environment
# !pip install pantab
```


However, prior to writing out the data for exploration, I am going to use t-SNE to compact the estimated shooter embeddings into a two dimensional coordinate system.  For more on t-SNE, refer to [this introduction.](https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1)

```{python}
from sklearn.manifold import TSNE
shooter_tsne = TSNE(n_components=2).fit_transform(shooter_embeddings.iloc[:, :50])
```

Add them to the dataframe

```{python}
shooter_tsne = pd.DataFrame(shooter_tsne)
shooter_tsne.columns = ['t1', 't2']
shooter_embeddings.reset_index(inplace=True, drop=True)
shooter_embeddings = pd.concat([shooter_embeddings, shooter_tsne], axis=1)
```


With `pantab` set up, it is really simple to write pandas dataframes to `hyper` files for Tableau.

```{python}
import pantab
pantab.frame_to_hyper(shooter_embeddings, "embeddings.hyper", table="shooters")
```

And the simple embeddings, plotted from our exported `embeddings.hyper` file within Tableau.

![](https://github.com/Btibert3/brocktibert/blob/master/public/img/simple-nhl-shooter-embeddings.png?raw=true)