@friso
Created December 6, 2018 19:33
Session held at Rockstart for AI / ML startups.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib\n",
"import graphviz as gv\n",
"\n",
"from plotnine import *\n",
"\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
"from IPython.display import IFrame"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# On Productionising Machine Learning Systems\n",
"\n",
"### Rockstart AI, December 6, 2018"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Agenda\n",
"\n",
"- Introduction\n",
"- About You?\n",
"- What is So Special about ML Systems?\n",
"- From ML 101 to a System\n",
"- A Dive into Data Systems\n",
"- A Word on Experimentation\n",
"- Feedback, boundaries, dependencies, and other pitfalls\n",
"- Your Big Design\n",
"- The End"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# About Me\n",
"\n",
"### Friso van Vollenhoven"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# About Me\n",
"\n",
"#### Specs:\n",
"- Formerly the co-founder and CTO of [GoDataDriven](https://godatadriven.com/), and CTO of [fashionTrade](https://www.fashiontrade.com/).\n",
"- Co-founder and previous organiser of [The Amsterdam Applied Machine Learning Meetup Group](https://www.meetup.com/The-Amsterdam-Applied-Machine-Learning-Meetup-Group/).\n",
"- Co-founder and previous organiser of The [Netherlands Hadoop User Group](https://www.meetup.com/Netherlands-Hadoop-User-Group/)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# About Me\n",
"\n",
"#### Misc.:\n",
"- Twitter: [@fzk](https://twitter.com/fzk)\n",
"- LinkedIn: [/in/frisovanvollenhoven](https://www.linkedin.com/in/frisovanvollenhoven/)\n",
"- I have 26 LinkedIn endorsements for _Awesomeness_.\n",
"- I am actually from here."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# About You?\n",
"\n",
"### Why AI / ML?\n",
"### Why do you need this workshop?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# What is So Special about ML Systems?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Some Things People Believe about Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Time is something that you can reliably measure."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Time is something that you can reliably correlate."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # The discrepancy between times measured in different places is constant."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # The fact that a request was served means a user interacted with the result."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # The fact that a request was served means a user is aware of the result."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # The fact that a request was served means a user requested the result."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # What can be solved using machine learning should be solved using machine learning."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Data that comes in the same format means the same thing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Data that comes in the same format from the same source means the same thing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Data that comes in the same format from the same source at the same time means the same thing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Data used for training does not change meaning or importance over time."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Most of the work goes into building the prediction model."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Most of the _important_ work goes into building the prediction model."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # User experience improves with better prediction accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> # Improved user experience is the result of better prediction accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Abstractions and Boundaries\n",
"\n",
"## In software development, boundaries are introduced through abstractions\n",
"- Implementations reason about their internals in isolation.\n",
"- Abstraction boundaries have documented requirements and constraints (APIs).\n",
"- Input and output are validated.\n",
"- _The internals of an abstraction's implementation are encapsulated from externalities._"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Boundary Erosion in ML Systems\n",
"\n",
"## ML systems are inherently exposed to externalities\n",
"- The essential trick in ML is to internalise externalities (in a more general model).\n",
"- Successful ML systems need to focus on:\n",
"  - Strictly defined representations.\n",
"  - Contextualisation of data.\n",
"  - Explicit reasoning about feedback loops.\n",
"  - Data-level monitoring (versus system-level monitoring).\n",
"  - Data-level dependency management."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# From ML 101 to a System\n",
"\n",
"> ### The essential trick in machine learning is to make a program describe a large number of samples using a small amount of data, forcing the model to internalise the essence of the data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>50000.000000</td>\n",
" <td>50000.000000</td>\n",
" <td>50000.000000</td>\n",
" <td>50000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>-30.005717</td>\n",
" <td>39.994601</td>\n",
" <td>2.616806</td>\n",
" <td>-177.456562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>5.984833</td>\n",
" <td>2.997262</td>\n",
" <td>1.133893</td>\n",
" <td>60.497689</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-54.723454</td>\n",
" <td>27.243630</td>\n",
" <td>0.165686</td>\n",
" <td>-426.655253</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>-34.038057</td>\n",
" <td>37.966763</td>\n",
" <td>1.791152</td>\n",
" <td>-218.078399</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>-30.012182</td>\n",
" <td>39.999814</td>\n",
" <td>2.461605</td>\n",
" <td>-177.512410</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>-25.984457</td>\n",
" <td>42.016299</td>\n",
" <td>3.275286</td>\n",
" <td>-136.772116</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>-6.930930</td>\n",
" <td>52.181833</td>\n",
" <td>8.987385</td>\n",
" <td>58.031864</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c y\n",
"count 50000.000000 50000.000000 50000.000000 50000.000000\n",
"mean -30.005717 39.994601 2.616806 -177.456562\n",
"std 5.984833 2.997262 1.133893 60.497689\n",
"min -54.723454 27.243630 0.165686 -426.655253\n",
"25% -34.038057 37.966763 1.791152 -218.078399\n",
"50% -30.012182 39.999814 2.461605 -177.512410\n",
"75% -25.984457 42.016299 3.275286 -136.772116\n",
"max -6.930930 52.181833 8.987385 58.031864"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a simple data set in which y looks almost perfectly linear in a,\n",
"# with a second normal feature b and a small non-normal (beta-distributed) feature c.\n",
"\n",
"a = np.random.normal(loc=-30, scale=6., size=50000)\n",
"b = np.random.normal(loc=40, scale=3., size=50000)\n",
"c = np.random.beta(5, 100, size=50000) * 55\n",
"y = 10 * a + 3 * b + c\n",
"data_frame = pd.DataFrame({\n",
" 'a': a,\n",
" 'b': b,\n",
" 'c': c,\n",
" 'y': y\n",
" })\n",
"\n",
"data_frame.describe()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<ggplot: (302691855)>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Plot it to verify visually that y can be mistaken for depending on just one predictive variable.\n",
"(\n",
" ggplot(data_frame, aes(x='a', y='y'))\n",
" + geom_point(alpha=.05)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Save two versions, one without and one with the third feature c.\n",
"data_frame[['a', 'b', 'y']].to_csv('data.csv', index=False)\n",
"data_frame[['a', 'b', 'c', 'y']].to_csv('more-data.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Machine Learning 101\n",
"\n",
"### As seen on the internet"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>50000.000000</td>\n",
" <td>50000.000000</td>\n",
" <td>50000.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>-30.005717</td>\n",
" <td>39.994601</td>\n",
" <td>-177.456562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>5.984833</td>\n",
" <td>2.997262</td>\n",
" <td>60.497689</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-54.723454</td>\n",
" <td>27.243630</td>\n",
" <td>-426.655253</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>-34.038057</td>\n",
" <td>37.966763</td>\n",
" <td>-218.078399</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>-30.012182</td>\n",
" <td>39.999814</td>\n",
" <td>-177.512410</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>-25.984457</td>\n",
" <td>42.016299</td>\n",
" <td>-136.772116</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>-6.930930</td>\n",
" <td>52.181833</td>\n",
" <td>58.031864</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b y\n",
"count 50000.000000 50000.000000 50000.000000\n",
"mean -30.005717 39.994601 -177.456562\n",
"std 5.984833 2.997262 60.497689\n",
"min -54.723454 27.243630 -426.655253\n",
"25% -34.038057 37.966763 -218.078399\n",
"50% -30.012182 39.999814 -177.512410\n",
"75% -25.984457 42.016299 -136.772116\n",
"max -6.930930 52.181833 58.031864"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"samples = pd.read_csv('data.csv')\n",
"samples.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Train"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n",
" normalize=False)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X, y = samples[['a', 'b']], samples[['y']]\n",
"train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=.15, random_state=42)\n",
"\n",
"model = LinearRegression(fit_intercept=True)\n",
"model.fit(train_X, train_y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Test"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(0.8790466600181069, 1.2398506884891785)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
"mean_absolute_error(\n",
" test_y,\n",
" model.predict(test_X)),\n",
"mean_squared_error(\n",
" test_y,\n",
" model.predict(test_X))\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Predict"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[-290.71840531],\n",
" [-248.84540031]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict([\n",
" [-39.03, 32.32],\n",
" [-36.55, 38.01]\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Small helpers that build Graphviz diagrams from composable node and edge functions.\n",
"# Each helper returns a function that adds its element to a given graph.\n",
"def normalize_name(name):\n",
"    return name.lower().replace(' ', '-')\n",
"\n",
"def node(name):\n",
"    def apply(graph):\n",
"        graph.node(normalize_name(name), label=name)\n",
"    return apply\n",
"\n",
"def actor(name):\n",
"    def apply(graph):\n",
"        graph.node(normalize_name(name), label=name, shape='circle')\n",
"    return apply\n",
"\n",
"def artefact(name):\n",
"    def apply(graph):\n",
"        graph.node(normalize_name(name), label=name, shape='tab')\n",
"    return apply\n",
"\n",
"def state(name):\n",
"    def apply(graph):\n",
"        graph.node(normalize_name(name), label=name, shape='box3d')\n",
"    return apply\n",
"\n",
"def process(name):\n",
"    def apply(graph):\n",
"        graph.node(normalize_name(name), label=name, style='filled')\n",
"    return apply\n",
"\n",
"# The Graphviz edge attribute for the arrow shape is 'arrowhead'; 'normal' is its default.\n",
"def sync_edge(source, target):\n",
"    def apply(graph):\n",
"        graph.edge(normalize_name(source), normalize_name(target), arrowhead='normal', color='black')\n",
"    return apply\n",
"\n",
"def async_edge(source, target):\n",
"    def apply(graph):\n",
"        graph.edge(normalize_name(source), normalize_name(target), arrowhead='normal', color='red', style='dashed')\n",
"    return apply\n",
"\n",
"def dependency(source, target):\n",
"    def apply(graph):\n",
"        graph.edge(normalize_name(source), normalize_name(target), arrowhead='normal', style='dashed')\n",
"    return apply\n",
"\n",
"# Graphviz draws a subgraph as a boxed cluster only when its name starts with 'cluster'.\n",
"def cluster(name, *contents):\n",
"    def apply(graph):\n",
"        sg = gv.Digraph(name='cluster %s' % normalize_name(name), graph_attr={\n",
"            'label': name, 'color': 'blue', 'style': 'solid'\n",
"        })\n",
"        for item in contents:\n",
"            item(sg)\n",
"        graph.subgraph(sg)\n",
"    return apply\n",
"\n",
"def graph(title, *contents):\n",
"    result = gv.Digraph(graph_attr={'label': title})\n",
"    for item in contents:\n",
"        item(result)\n",
"    return result"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A Systems View"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph('Simple Prediction Service',\n",
" state('model state'), actor('user'),\n",
" cluster('offline training', process('training'), artefact('data'), sync_edge('data', 'training')),\n",
" cluster('online predictions', process('evaluate model'), artefact('prediction'), artefact('unknown sample'),\n",
" sync_edge('unknown sample', 'evaluate model')),\n",
" sync_edge('training', 'model state'), dependency('evaluate model', 'model state'),\n",
" sync_edge('evaluate model', 'prediction'), sync_edge('user', 'unknown sample'), sync_edge('prediction', 'user')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"280pt\" height=\"334pt\"\n",
" viewBox=\"0.00 0.00 280.34 333.68\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 329.6803)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-329.6803 276.3409,-329.6803 276.3409,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"136.1705\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Prediction Service</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster offline&#45;training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-171.6803 8,-317.6803 107,-317.6803 107,-171.6803 8,-171.6803\"/>\n",
"<text text-anchor=\"middle\" x=\"57.5\" y=\"-302.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline training</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster online&#45;predictions</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"115,-99.6803 115,-317.6803 259,-317.6803 259,-99.6803 115,-99.6803\"/>\n",
"<text text-anchor=\"middle\" x=\"187\" y=\"-302.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">online predictions</text>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"104.1486,-143.6803 27.8514,-143.6803 23.8514,-139.6803 23.8514,-107.6803 100.1486,-107.6803 104.1486,-111.6803 104.1486,-143.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"100.1486,-139.6803 23.8514,-139.6803 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"100.1486,-139.6803 100.1486,-107.6803 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"100.1486,-139.6803 104.1486,-143.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"64\" y=\"-121.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"247\" cy=\"-46.8402\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"247\" y=\"-42.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- unknown&#45;sample -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>unknown&#45;sample</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"246.7659,-287.6803 147.2341,-287.6803 147.2341,-291.6803 135.2341,-291.6803 135.2341,-251.6803 246.7659,-251.6803 246.7659,-287.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"135.2341,-287.6803 147.2341,-287.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"191\" y=\"-265.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">unknown sample</text>\n",
"</g>\n",
"<!-- user&#45;&gt;unknown&#45;sample -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>user&#45;&gt;unknown&#45;sample</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M255.8855,-70.0684C267.3352,-103.9621 283.1785,-168.1078 260,-215.6803 254.2696,-227.4417 244.6013,-237.3954 234.2879,-245.4285\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"231.9987,-242.7645 225.9415,-251.457 236.0974,-248.4391 231.9987,-242.7645\"/>\n",
"</g>\n",
"<!-- training -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"61\" cy=\"-197.6803\" rx=\"37.7266\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"61\" y=\"-193.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<!-- training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M61.757,-179.5117C62.0779,-171.8113 62.4594,-162.6547 62.816,-154.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"66.3133,-154.2307 63.2328,-144.0936 59.3194,-153.9392 66.3133,-154.2307\"/>\n",
"</g>\n",
"<!-- data -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"88,-287.6803 46,-287.6803 46,-291.6803 34,-291.6803 34,-251.6803 88,-251.6803 88,-287.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"34,-287.6803 46,-287.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"61\" y=\"-265.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data</text>\n",
"</g>\n",
"<!-- data&#45;&gt;training -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;&gt;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M61,-251.5117C61,-243.8113 61,-234.6547 61,-226.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"64.5001,-226.0936 61,-216.0936 57.5001,-226.0936 64.5001,-226.0936\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>evaluate&#45;model</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"187\" cy=\"-197.6803\" rx=\"63.7604\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"187\" y=\"-193.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate model</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;model&#45;state -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M159.0894,-181.3424C142.7927,-171.8029 121.9568,-159.6063 103.945,-149.0628\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"105.2206,-145.754 94.8223,-143.7227 101.6843,-151.7951 105.2206,-145.754\"/>\n",
"</g>\n",
"<!-- prediction -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>prediction</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"237.2592,-143.6803 176.7408,-143.6803 176.7408,-147.6803 164.7408,-147.6803 164.7408,-107.6803 237.2592,-107.6803 237.2592,-143.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"164.7408,-143.6803 176.7408,-143.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"201\" y=\"-121.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;prediction -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;prediction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M190.5328,-179.5117C192.0301,-171.8113 193.8105,-162.6547 195.4745,-154.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"198.9465,-154.5778 197.4196,-144.0936 192.0752,-153.2417 198.9465,-154.5778\"/>\n",
"</g>\n",
"<!-- prediction&#45;&gt;user -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>prediction&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M211.6689,-107.3947C216.924,-98.3879 223.3984,-87.2913 229.3632,-77.0682\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"232.4138,-78.7848 234.4303,-68.3836 226.3677,-75.2571 232.4138,-78.7848\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;evaluate&#45;model -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;evaluate&#45;model</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M189.9906,-251.5117C189.5628,-243.8113 189.0541,-234.6547 188.5787,-226.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"192.0724,-225.884 188.023,-216.0936 185.0831,-226.2724 192.0724,-225.884\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x121993cf8>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Model State"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[9.99904184, 3.00094574]])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.coef_"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Did Someone Say State?\n",
"- How to handle updates?\n",
"- Staleness\n",
"- Partial updates\n",
"- Atomicity\n",
"- Versioning"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Other Real World Concerns\n",
"\n",
"- business rules\n",
"- failure modes\n",
"  - graceful degradation\n",
"  - user-facing degradation\n",
"- performance (computational)\n",
"- availability / uptime\n",
"- maintenance\n",
"- monitoring\n",
"- product experimentation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A Somewhat Improved System"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph('Simple Prediction Service',\n",
" state('cached predictions'), actor('user'),\n",
" cluster('offline training', state('model state'), process('training'), artefact('data'), process('offline predictions'),\n",
" sync_edge('data', 'training'), dependency('offline predictions', 'model state')),\n",
" cluster('online predictions', process('business rules'), artefact('prediction'), artefact('unknown sample'),\n",
" sync_edge('unknown sample', 'business rules')),\n",
" sync_edge('training', 'model state'), dependency('business rules', 'cached predictions'),\n",
" sync_edge('business rules', 'prediction'), sync_edge('offline predictions', 'cached predictions'), sync_edge('user', 'unknown sample'), sync_edge('prediction', 'user')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"558pt\" height=\"334pt\"\n",
" viewBox=\"0.00 0.00 558.01 333.68\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 329.6803)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-329.6803 554.0075,-329.6803 554.0075,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"275.0038\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Prediction Service</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster offline&#45;training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-99.6803 8,-317.6803 267,-317.6803 267,-99.6803 8,-99.6803\"/>\n",
"<text text-anchor=\"middle\" x=\"137.5\" y=\"-302.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline training</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster online&#45;predictions</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"403,-99.6803 403,-317.6803 537,-317.6803 537,-99.6803 403,-99.6803\"/>\n",
"<text text-anchor=\"middle\" x=\"470\" y=\"-302.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">online predictions</text>\n",
"</g>\n",
"<!-- cached&#45;predictions -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>cached&#45;predictions</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"395.0633,-143.6803 278.9367,-143.6803 274.9367,-139.6803 274.9367,-107.6803 391.0633,-107.6803 395.0633,-111.6803 395.0633,-143.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"391.0633,-139.6803 274.9367,-139.6803 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"391.0633,-139.6803 391.0633,-107.6803 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"391.0633,-139.6803 395.0633,-143.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"335\" y=\"-121.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">cached predictions</text>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"524\" cy=\"-46.8402\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"524\" y=\"-42.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- unknown&#45;sample -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>unknown&#45;sample</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"526.7659,-287.6803 427.2341,-287.6803 427.2341,-291.6803 415.2341,-291.6803 415.2341,-251.6803 526.7659,-251.6803 526.7659,-287.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"415.2341,-287.6803 427.2341,-287.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"471\" y=\"-265.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">unknown sample</text>\n",
"</g>\n",
"<!-- user&#45;&gt;unknown&#45;sample -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>user&#45;&gt;unknown&#45;sample</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M533.0101,-70.0189C544.6446,-103.8509 560.8479,-167.9205 538,-215.6803 532.404,-227.3778 522.9007,-237.3508 512.7828,-245.424\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"510.5504,-242.7221 504.5997,-251.4879 514.718,-248.3462 510.5504,-242.7221\"/>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"160.1486,-143.6803 83.8514,-143.6803 79.8514,-139.6803 79.8514,-107.6803 156.1486,-107.6803 160.1486,-111.6803 160.1486,-143.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"156.1486,-139.6803 79.8514,-139.6803 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"156.1486,-139.6803 156.1486,-107.6803 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"156.1486,-139.6803 160.1486,-143.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"120\" y=\"-121.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- training -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"54\" cy=\"-197.6803\" rx=\"37.7266\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"54\" y=\"-193.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<!-- training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M69.3076,-180.9811C77.3604,-172.1963 87.4088,-161.2344 96.4306,-151.3925\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"99.2108,-153.5391 103.388,-143.8025 94.0507,-148.809 99.2108,-153.5391\"/>\n",
"</g>\n",
"<!-- data -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"81,-287.6803 39,-287.6803 39,-291.6803 27,-291.6803 27,-251.6803 81,-251.6803 81,-287.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"27,-287.6803 39,-287.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"54\" y=\"-265.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data</text>\n",
"</g>\n",
"<!-- data&#45;&gt;training -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;&gt;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M54,-251.5117C54,-243.8113 54,-234.6547 54,-226.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"57.5001,-226.0936 54,-216.0936 50.5001,-226.0936 57.5001,-226.0936\"/>\n",
"</g>\n",
"<!-- offline&#45;predictions -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>offline&#45;predictions</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"184\" cy=\"-197.6803\" rx=\"74.9031\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"184\" y=\"-193.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline predictions</text>\n",
"</g>\n",
"<!-- offline&#45;predictions&#45;&gt;cached&#45;predictions -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>offline&#45;predictions&#45;&gt;cached&#45;predictions</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M217.8873,-181.5221C238.5981,-171.6468 265.3842,-158.8746 288.111,-148.038\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"289.7271,-151.145 297.2471,-143.6817 286.7143,-144.8265 289.7271,-151.145\"/>\n",
"</g>\n",
"<!-- offline&#45;predictions&#45;&gt;model&#45;state -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>offline&#45;predictions&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M168.1798,-179.8826C160.5452,-171.2937 151.2432,-160.829 142.8561,-151.3935\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"145.3424,-148.9223 136.0828,-143.7735 140.1105,-153.5729 145.3424,-148.9223\"/>\n",
"</g>\n",
"<!-- business&#45;rules -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>business&#45;rules</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"470\" cy=\"-197.6803\" rx=\"59.4599\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"470\" y=\"-193.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">business rules</text>\n",
"</g>\n",
"<!-- business&#45;rules&#45;&gt;cached&#45;predictions -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>business&#45;rules&#45;&gt;cached&#45;predictions</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M440.3737,-181.8796C422.066,-172.1155 398.2624,-159.4203 377.9124,-148.5669\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"379.3082,-145.3447 368.8375,-143.727 376.014,-151.5212 379.3082,-145.3447\"/>\n",
"</g>\n",
"<!-- prediction -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>prediction</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"517.2592,-143.6803 456.7408,-143.6803 456.7408,-147.6803 444.7408,-147.6803 444.7408,-107.6803 517.2592,-107.6803 517.2592,-143.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"444.7408,-143.6803 456.7408,-143.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"481\" y=\"-121.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction</text>\n",
"</g>\n",
"<!-- business&#45;rules&#45;&gt;prediction -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>business&#45;rules&#45;&gt;prediction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M472.7758,-179.5117C473.9522,-171.8113 475.3511,-162.6547 476.6586,-154.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"480.1364,-154.5075 478.1869,-144.0936 473.2167,-153.4503 480.1364,-154.5075\"/>\n",
"</g>\n",
"<!-- prediction&#45;&gt;user -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>prediction&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M490.9731,-107.3947C495.7979,-98.5485 501.7221,-87.6864 507.2146,-77.616\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"510.3779,-79.1257 512.0935,-68.6707 504.2325,-75.774 510.3779,-79.1257\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;business&#45;rules -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;business&#45;rules</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M470.7477,-251.5117C470.6407,-243.8113 470.5135,-234.6547 470.3947,-226.097\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"473.8944,-226.044 470.2557,-216.0936 466.8951,-226.1413 473.8944,-226.044\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fc0048>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# What about feedback?"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph('Simple Prediction Service',\n",
" state('model state'), actor('user'),\n",
" cluster('offline training', process('training'), artefact('data'), sync_edge('data', 'training')),\n",
" cluster('online predictions service', artefact('feedback'), process('evaluate model'), artefact('prediction'),\n",
" artefact('unknown sample'), sync_edge('unknown sample', 'evaluate model')),\n",
" sync_edge('training', 'model state'), dependency('evaluate model', 'model state'),\n",
" sync_edge('evaluate model', 'prediction'), sync_edge('user', 'unknown sample'), sync_edge('prediction', 'user'),\n",
" sync_edge('user', 'feedback'), sync_edge('feedback', 'model state')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"360pt\" height=\"314pt\"\n",
" viewBox=\"0.00 0.00 359.60 313.68\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 309.6803)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-309.6803 355.6002,-309.6803 355.6002,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"175.8001\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Prediction Service</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster offline&#45;training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-102 8,-248 107,-248 107,-102 8,-102\"/>\n",
"<text text-anchor=\"middle\" x=\"57.5\" y=\"-232.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline training</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster online&#45;predictions&#45;service</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"115,-30 115,-248 327,-248 327,-30 115,-30\"/>\n",
"<text text-anchor=\"middle\" x=\"221\" y=\"-232.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">online predictions service</text>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"107.1486,-74 30.8514,-74 26.8514,-70 26.8514,-38 103.1486,-38 107.1486,-42 107.1486,-74\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"103.1486,-70 26.8514,-70 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"103.1486,-70 103.1486,-38 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"103.1486,-70 107.1486,-74 \"/>\n",
"<text text-anchor=\"middle\" x=\"67\" y=\"-51.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"263\" cy=\"-280.8402\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"263\" y=\"-276.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- feedback -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>feedback</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"189.0193,-218 134.9807,-218 134.9807,-222 122.9807,-222 122.9807,-182 189.0193,-182 189.0193,-218\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"122.9807,-218 134.9807,-218 \"/>\n",
"<text text-anchor=\"middle\" x=\"156\" y=\"-195.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">feedback</text>\n",
"</g>\n",
"<!-- user&#45;&gt;feedback -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>user&#45;&gt;feedback</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M239.7828,-271.726C226.782,-266.0026 210.6858,-257.8434 198,-248 190.0335,-241.8185 182.4203,-233.9612 175.917,-226.4024\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"178.3922,-223.9025 169.3317,-218.4109 172.99,-228.3541 178.3922,-223.9025\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>unknown&#45;sample</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"318.7659,-218 219.2341,-218 219.2341,-222 207.2341,-222 207.2341,-182 318.7659,-182 318.7659,-218\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"207.2341,-218 219.2341,-218 \"/>\n",
"<text text-anchor=\"middle\" x=\"263\" y=\"-195.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">unknown sample</text>\n",
"</g>\n",
"<!-- user&#45;&gt;unknown&#45;sample -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>user&#45;&gt;unknown&#45;sample</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M263,-255.705C263,-247.0717 263,-237.3474 263,-228.4686\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"266.5001,-228.1877 263,-218.1878 259.5001,-228.1878 266.5001,-228.1877\"/>\n",
"</g>\n",
"<!-- training -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"61\" cy=\"-128\" rx=\"37.7266\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"61\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<!-- training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M62.5141,-109.8314C63.1558,-102.131 63.9188,-92.9743 64.6319,-84.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"68.1229,-84.6694 65.4656,-74.4133 61.1471,-84.088 68.1229,-84.6694\"/>\n",
"</g>\n",
"<!-- data -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"88,-218 46,-218 46,-222 34,-222 34,-182 88,-182 88,-218\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"34,-218 46,-218 \"/>\n",
"<text text-anchor=\"middle\" x=\"61\" y=\"-195.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data</text>\n",
"</g>\n",
"<!-- data&#45;&gt;training -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;&gt;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M61,-181.8314C61,-174.131 61,-164.9743 61,-156.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"64.5001,-156.4132 61,-146.4133 57.5001,-156.4133 64.5001,-156.4132\"/>\n",
"</g>\n",
"<!-- feedback&#45;&gt;model&#45;state -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>feedback&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M149.7016,-181.7106C142.2023,-161.2805 128.4665,-127.7056 111,-102 106.0894,-94.7731 100.0152,-87.6225 93.9645,-81.1905\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"96.4367,-78.7125 86.9435,-74.0084 91.4312,-83.6058 96.4367,-78.7125\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>evaluate&#45;model</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"255\" cy=\"-128\" rx=\"63.7604\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"255\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate model</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;model&#45;state -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M215.7654,-113.6429C189.1249,-103.8151 152.8581,-90.2835 116.8918,-76.288\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"117.8522,-72.9057 107.2641,-72.5272 115.3052,-79.4259 117.8522,-72.9057\"/>\n",
"</g>\n",
"<!-- prediction -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>prediction</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"305.2592,-74 244.7408,-74 244.7408,-78 232.7408,-78 232.7408,-38 305.2592,-38 305.2592,-74\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"232.7408,-74 244.7408,-74 \"/>\n",
"<text text-anchor=\"middle\" x=\"269\" y=\"-51.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;prediction -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;prediction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M258.5328,-109.8314C260.0301,-102.131 261.8105,-92.9743 263.4745,-84.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"266.9465,-84.8975 265.4196,-74.4133 260.0752,-83.5614 266.9465,-84.8975\"/>\n",
"</g>\n",
"<!-- prediction&#45;&gt;user -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>prediction&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M302.6091,-74.0908C312.6618,-81.3331 322.4714,-90.6791 328,-102 356.4749,-160.3074 361.9987,-192.7311 328,-248 321.0479,-259.3014 308.8723,-266.8558 296.9242,-271.8361\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"295.6742,-268.5666 287.4949,-275.3008 298.0885,-275.1371 295.6742,-268.5666\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;evaluate&#45;model -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;evaluate&#45;model</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M260.9813,-181.8314C260.1257,-174.131 259.1083,-164.9743 258.1574,-156.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"261.6289,-155.9656 257.0459,-146.4133 254.6717,-156.7386 261.6289,-155.9656\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fc09b0>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Feedback cycle: better stale than sorry\n",
"\n",
"- Always asynchronous\n",
"- Operate and scale separately of the prediction service\n",
"- Add to ground truth before re-training / partially training"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph('Simple Prediction Service',\n",
" state('model state'), actor('user'),\n",
" cluster('offline training', process('training'), artefact('data'), sync_edge('data', 'training')),\n",
" cluster('online predictions service', artefact('feedback'), process('evaluate model'), artefact('prediction'),\n",
" artefact('unknown sample'), sync_edge('unknown sample', 'evaluate model')),\n",
" sync_edge('training', 'model state'), dependency('evaluate model', 'model state'),\n",
" sync_edge('evaluate model', 'prediction'), sync_edge('user', 'unknown sample'), sync_edge('prediction', 'user'),\n",
" async_edge('user', 'feedback'), sync_edge('feedback', 'data')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"358pt\" height=\"372pt\"\n",
" viewBox=\"0.00 0.00 358.10 371.68\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 367.6803)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-367.6803 354.0996,-367.6803 354.0996,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"175.0498\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Prediction Service</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster offline&#45;training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-86 8,-232 107,-232 107,-86 8,-86\"/>\n",
"<text text-anchor=\"middle\" x=\"57.5\" y=\"-216.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline training</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster online&#45;predictions&#45;service</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"115,-86 115,-306 327,-306 327,-86 115,-86\"/>\n",
"<text text-anchor=\"middle\" x=\"221\" y=\"-290.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">online predictions service</text>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"157.1486,-58 80.8514,-58 76.8514,-54 76.8514,-22 153.1486,-22 157.1486,-26 157.1486,-58\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"153.1486,-54 76.8514,-54 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"153.1486,-54 153.1486,-22 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"153.1486,-54 157.1486,-58 \"/>\n",
"<text text-anchor=\"middle\" x=\"117\" y=\"-35.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"263\" cy=\"-338.8402\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"263\" y=\"-334.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- feedback -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>feedback</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"189.0193,-276 134.9807,-276 134.9807,-280 122.9807,-280 122.9807,-240 189.0193,-240 189.0193,-276\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"122.9807,-276 134.9807,-276 \"/>\n",
"<text text-anchor=\"middle\" x=\"156\" y=\"-253.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">feedback</text>\n",
"</g>\n",
"<!-- user&#45;&gt;feedback -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>user&#45;&gt;feedback</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M239.7828,-329.726C226.782,-324.0026 210.6858,-315.8434 198,-306 190.0335,-299.8185 182.4203,-291.9612 175.917,-284.4024\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"178.3922,-281.9025 169.3317,-276.4109 172.99,-286.3541 178.3922,-281.9025\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>unknown&#45;sample</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"318.7659,-276 219.2341,-276 219.2341,-280 207.2341,-280 207.2341,-240 318.7659,-240 318.7659,-276\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"207.2341,-276 219.2341,-276 \"/>\n",
"<text text-anchor=\"middle\" x=\"263\" y=\"-253.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">unknown sample</text>\n",
"</g>\n",
"<!-- user&#45;&gt;unknown&#45;sample -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>user&#45;&gt;unknown&#45;sample</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M263,-313.705C263,-305.0717 263,-295.3474 263,-286.4686\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"266.5001,-286.1877 263,-276.1878 259.5001,-286.1878 266.5001,-286.1877\"/>\n",
"</g>\n",
"<!-- training -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"61\" cy=\"-112\" rx=\"37.7266\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"61\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<!-- training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M74.2712,-94.937C80.9254,-86.3816 89.1311,-75.8314 96.567,-66.271\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"99.4684,-68.2415 102.8451,-58.1992 93.9429,-63.9439 99.4684,-68.2415\"/>\n",
"</g>\n",
"<!-- data -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"93,-202 51,-202 51,-206 39,-206 39,-166 93,-166 93,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"39,-202 51,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"66\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data</text>\n",
"</g>\n",
"<!-- data&#45;&gt;training -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;&gt;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M64.7383,-165.8314C64.2035,-158.131 63.5677,-148.9743 62.9734,-140.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"66.4632,-140.1467 62.2787,-130.4133 59.48,-140.6317 66.4632,-140.1467\"/>\n",
"</g>\n",
"<!-- feedback&#45;&gt;data -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>feedback&#45;&gt;data</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M122.8173,-240.3384C118.7142,-237.7208 114.6753,-234.9164 111,-232 102.6785,-225.3966 94.4643,-217.2563 87.3776,-209.5672\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"89.9786,-207.2251 80.7046,-202.1021 84.7597,-211.8902 89.9786,-207.2251\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>evaluate&#45;model</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"255\" cy=\"-184\" rx=\"63.7604\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"255\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate model</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;model&#45;state -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M238.0785,-166.3428C213.8974,-141.1103 169.3149,-94.5894 141.5184,-65.5845\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"143.9413,-63.0542 134.4953,-58.256 138.8874,-67.8975 143.9413,-63.0542\"/>\n",
"</g>\n",
"<!-- prediction -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>prediction</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"305.2592,-130 244.7408,-130 244.7408,-134 232.7408,-134 232.7408,-94 305.2592,-94 305.2592,-130\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"232.7408,-130 244.7408,-130 \"/>\n",
"<text text-anchor=\"middle\" x=\"269\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;prediction -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;prediction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M258.5328,-165.8314C260.0301,-158.131 261.8105,-148.9743 263.4745,-140.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"266.9465,-140.8975 265.4196,-130.4133 260.0752,-139.5614 266.9465,-140.8975\"/>\n",
"</g>\n",
"<!-- prediction&#45;&gt;user -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>prediction&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M297.432,-130.0399C309.2228,-139.2199 321.6204,-151.5128 328,-166 353.0765,-222.9454 360.6015,-253.0024 328,-306 321.0479,-317.3014 308.8723,-324.8558 296.9242,-329.8361\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"295.6742,-326.5666 287.4949,-333.3008 298.0885,-333.1371 295.6742,-326.5666\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;evaluate&#45;model -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;evaluate&#45;model</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M261.0225,-239.7079C260.1215,-231.3739 259.0348,-221.3216 258.0339,-212.0633\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"261.5094,-211.6476 256.9548,-202.0817 254.55,-212.4 261.5094,-211.6476\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fc0128>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Telemetry"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph('Simple Prediction Service',\n",
" state('model state'), actor('user'), process('telemetry'), artefact('feedback'),\n",
" cluster('offline training', process('training'), artefact('data'), sync_edge('data', 'training')),\n",
" cluster('online predictions service', process('evaluate model'), artefact('prediction'),\n",
" artefact('unknown sample'), sync_edge('unknown sample', 'evaluate model')),\n",
" sync_edge('training', 'model state'), dependency('evaluate model', 'model state'),\n",
" sync_edge('evaluate model', 'prediction'), sync_edge('user', 'unknown sample'), sync_edge('prediction', 'user'),\n",
" async_edge('user', 'telemetry'), async_edge('telemetry', 'feedback'), sync_edge('feedback', 'data')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"301pt\" height=\"444pt\"\n",
" viewBox=\"0.00 0.00 301.00 443.68\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 439.6803)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-439.6803 297,-439.6803 297,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"146.5\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Prediction Service</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster offline&#45;training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"186,-86 186,-232 285,-232 285,-86 186,-86\"/>\n",
"<text text-anchor=\"middle\" x=\"235.5\" y=\"-216.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline training</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster online&#45;predictions&#45;service</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-158 8,-378 167,-378 167,-158 8,-158\"/>\n",
"<text text-anchor=\"middle\" x=\"87.5\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">online predictions service</text>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"187.1486,-58 110.8514,-58 106.8514,-54 106.8514,-22 183.1486,-22 187.1486,-26 187.1486,-58\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"183.1486,-54 106.8514,-54 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"183.1486,-54 183.1486,-22 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"183.1486,-54 187.1486,-58 \"/>\n",
"<text text-anchor=\"middle\" x=\"147\" y=\"-35.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"172\" cy=\"-410.8402\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"172\" y=\"-406.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- telemetry -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>telemetry</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"243\" cy=\"-330\" rx=\"43.4974\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"243\" y=\"-325.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">telemetry</text>\n",
"</g>\n",
"<!-- user&#45;&gt;telemetry -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>user&#45;&gt;telemetry</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M188.4673,-392.0906C198.2797,-380.9183 210.8465,-366.6098 221.4428,-354.5449\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"224.1035,-356.8193 228.0728,-346.996 218.844,-352.1999 224.1035,-356.8193\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>unknown&#45;sample</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"139.7659,-348 40.2341,-348 40.2341,-352 28.2341,-352 28.2341,-312 139.7659,-312 139.7659,-348\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"28.2341,-348 40.2341,-348 \"/>\n",
"<text text-anchor=\"middle\" x=\"84\" y=\"-325.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">unknown sample</text>\n",
"</g>\n",
"<!-- user&#45;&gt;unknown&#45;sample -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>user&#45;&gt;unknown&#45;sample</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M153.3375,-393.6961C141.1911,-382.5379 125.1368,-367.7899 111.5354,-355.2951\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"113.5908,-352.4306 103.8586,-348.2429 108.8552,-357.5856 113.5908,-352.4306\"/>\n",
"</g>\n",
"<!-- feedback -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>feedback</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"274.0193,-276 219.9807,-276 219.9807,-280 207.9807,-280 207.9807,-240 274.0193,-240 274.0193,-276\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"207.9807,-276 219.9807,-276 \"/>\n",
"<text text-anchor=\"middle\" x=\"241\" y=\"-253.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">feedback</text>\n",
"</g>\n",
"<!-- telemetry&#45;&gt;feedback -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>telemetry&#45;&gt;feedback</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M242.4953,-311.8314C242.2814,-304.131 242.0271,-294.9743 241.7894,-286.4166\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"245.2879,-286.3122 241.5115,-276.4133 238.2906,-286.5066 245.2879,-286.3122\"/>\n",
"</g>\n",
"<!-- data -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"266,-202 224,-202 224,-206 212,-206 212,-166 266,-166 266,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"212,-202 224,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"239\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data</text>\n",
"</g>\n",
"<!-- feedback&#45;&gt;data -->\n",
"<g id=\"edge10\" class=\"edge\">\n",
"<title>feedback&#45;&gt;data</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M240.5056,-239.7079C240.2828,-231.4635 240.0145,-221.5376 239.7665,-212.3622\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"243.2577,-211.9835 239.4887,-202.0817 236.2602,-212.1727 243.2577,-211.9835\"/>\n",
"</g>\n",
"<!-- training -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"235\" cy=\"-112\" rx=\"37.7266\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"235\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<!-- training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M215.9053,-96.3771C204.519,-87.061 189.8148,-75.0303 176.9158,-64.4766\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"179.1194,-61.7573 169.1635,-58.1338 174.6868,-67.175 179.1194,-61.7573\"/>\n",
"</g>\n",
"<!-- data&#45;&gt;training -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;&gt;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M237.9906,-165.8314C237.5628,-158.131 237.0541,-148.9743 236.5787,-140.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"240.0724,-140.2037 236.023,-130.4133 233.0831,-140.592 240.0724,-140.2037\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>evaluate&#45;model</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"80\" cy=\"-258\" rx=\"63.7604\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"80\" y=\"-253.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate model</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;model&#45;state -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M76.6374,-239.9277C73.5412,-219.6097 70.4625,-185.8223 78,-158 87.2993,-123.6745 109.3551,-89.1014 126.0463,-66.3326\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"128.9396,-68.3083 132.1429,-58.2093 123.3409,-64.1064 128.9396,-68.3083\"/>\n",
"</g>\n",
"<!-- prediction -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>prediction</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"159.2592,-202 98.7408,-202 98.7408,-206 86.7408,-206 86.7408,-166 159.2592,-166 159.2592,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"86.7408,-202 98.7408,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"123\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;prediction -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;prediction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M90.4091,-240.0867C95.4948,-231.3346 101.7205,-220.6205 107.3629,-210.9103\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"110.5397,-212.4096 112.5377,-202.0049 104.4873,-208.8926 110.5397,-212.4096\"/>\n",
"</g>\n",
"<!-- prediction&#45;&gt;user -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>prediction&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M135.0094,-202.1384C141.4276,-212.737 148.8434,-226.6236 153,-240 167.1249,-285.4549 171.0244,-340.5868 171.9572,-375.6921\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"168.4622,-375.9823 172.1595,-385.9111 175.4608,-375.8437 168.4622,-375.9823\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;evaluate&#45;model -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;evaluate&#45;model</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M82.9906,-311.8314C82.5628,-304.131 82.0541,-294.9743 81.5787,-286.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"85.0724,-286.2037 81.023,-276.4133 78.0831,-286.592 85.0724,-286.2037\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fc0b70>"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# System Level Components\n",
"\n",
"- Model Training infrastructure\n",
" - TensorFlow\n",
" - sklearn\n",
" - custom built\n",
" - etc.\n",
"- Model Inference infrastructure\n",
" - Prediction Serving (API layer)\n",
" - Model State management\n",
"- Telemetry (clickstream or otherwise)\n",
"- Data Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Data View"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph(\n",
" 'Simple Data Pipeline',\n",
" artefact('telemetry data'), artefact('external sources'),\n",
" process('telemetry'),\n",
" cluster(\n",
" 'workflow orchestration',\n",
" actor('scheduler'),\n",
" cluster(\n",
" 'pre-processing',\n",
" process('data cleaning'),\n",
" process('data preparation'),\n",
" sync_edge('data cleaning', 'data preparation')\n",
" ),\n",
" cluster(\n",
" 'training',\n",
" process('feature extraction'),\n",
" process('model training'),\n",
" artefact('configuration'),\n",
" dependency('configuration', 'model training'),\n",
" dependency('configuration', 'feature extraction'),\n",
" sync_edge('feature extraction', 'model training')\n",
" )\n",
" ),\n",
" actor('user'),\n",
" cluster(\n",
" 'inference',\n",
" process('prediction API'),\n",
" state('model state')\n",
" ),\n",
" dependency('telemetry data', 'data preparation'),\n",
" dependency('external sources', 'data preparation'),\n",
" dependency('prediction API', 'model state'),\n",
" sync_edge('prediction API', 'user'),\n",
" async_edge('user', 'telemetry'),\n",
" sync_edge('telemetry', 'telemetry data'),\n",
" sync_edge('scheduler', 'data cleaning'),\n",
" sync_edge('data preparation', 'feature extraction'),\n",
" sync_edge('model training', 'model state')\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"576pt\" height=\"513pt\"\n",
" viewBox=\"0.00 0.00 576.00 513.43\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(.8737 .8737) rotate(0) translate(4 583.6622)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-583.6622 655.2847,-583.6622 655.2847,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"325.6423\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Data Pipeline</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster workflow&#45;orchestration</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"183,-94 183,-521.9819 537,-521.9819 537,-94 183,-94\"/>\n",
"<text text-anchor=\"middle\" x=\"360\" y=\"-506.7819\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">workflow orchestration</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster pre&#45;processing</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"191,-248 191,-396 341,-396 341,-248 191,-248\"/>\n",
"<text text-anchor=\"middle\" x=\"266\" y=\"-380.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">pre&#45;processing</text>\n",
"</g>\n",
"<g id=\"clust3\" class=\"cluster\">\n",
"<title>cluster training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"349,-102 349,-322 529,-322 529,-102 349,-102\"/>\n",
"<text text-anchor=\"middle\" x=\"439\" y=\"-306.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<g id=\"clust4\" class=\"cluster\">\n",
"<title>cluster inference</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-30 8,-176 148,-176 148,-30 8,-30\"/>\n",
"<text text-anchor=\"middle\" x=\"78\" y=\"-160.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">inference</text>\n",
"</g>\n",
"<!-- telemetry&#45;data -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>telemetry&#45;data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"174.6802,-366 91.3198,-366 91.3198,-370 79.3198,-370 79.3198,-330 174.6802,-330 174.6802,-366\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"79.3198,-366 91.3198,-366 \"/>\n",
"<text text-anchor=\"middle\" x=\"127\" y=\"-343.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">telemetry data</text>\n",
"</g>\n",
"<!-- data&#45;preparation -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>data&#45;preparation</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"266\" cy=\"-274\" rx=\"67.1265\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"266\" y=\"-269.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data preparation</text>\n",
"</g>\n",
"<!-- telemetry&#45;data&#45;&gt;data&#45;preparation -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>telemetry&#45;data&#45;&gt;data&#45;preparation</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M161.0032,-329.8976C180.8866,-319.3121 206.0687,-305.9059 226.7868,-294.8761\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"228.4773,-297.9413 235.6596,-290.1525 225.1878,-291.7623 228.4773,-297.9413\"/>\n",
"</g>\n",
"<!-- external&#45;sources -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>external&#45;sources</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"651.0702,-366 556.9298,-366 556.9298,-370 544.9298,-370 544.9298,-330 651.0702,-330 651.0702,-366\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"544.9298,-366 556.9298,-366 \"/>\n",
"<text text-anchor=\"middle\" x=\"598\" y=\"-343.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">external sources</text>\n",
"</g>\n",
"<!-- external&#45;sources&#45;&gt;data&#45;preparation -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>external&#45;sources&#45;&gt;data&#45;preparation</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M544.6818,-345.2817C481.6481,-341.5867 380.7858,-334.0259 345,-322 327.867,-316.2424 310.5187,-306.2605 296.4606,-296.8855\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"298.3846,-293.9609 288.164,-291.1699 294.4134,-299.7254 298.3846,-293.9609\"/>\n",
"</g>\n",
"<!-- telemetry -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>telemetry</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"124\" cy=\"-447.9909\" rx=\"43.4974\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"124\" y=\"-443.7909\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">telemetry</text>\n",
"</g>\n",
"<!-- telemetry&#45;&gt;telemetry&#45;data -->\n",
"<g id=\"edge10\" class=\"edge\">\n",
"<title>telemetry&#45;&gt;telemetry&#45;data</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M124.5502,-429.6511C124.9985,-414.7104 125.639,-393.3624 126.1527,-376.242\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"129.6517,-376.3241 126.4532,-366.2236 122.6548,-376.1141 129.6517,-376.3241\"/>\n",
"</g>\n",
"<!-- scheduler -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>scheduler</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"266\" cy=\"-447.9909\" rx=\"43.9819\" ry=\"43.9819\"/>\n",
"<text text-anchor=\"middle\" x=\"266\" y=\"-443.7909\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">scheduler</text>\n",
"</g>\n",
"<!-- data&#45;cleaning -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>data&#45;cleaning</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"266\" cy=\"-348\" rx=\"57.0027\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"266\" y=\"-343.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data cleaning</text>\n",
"</g>\n",
"<!-- scheduler&#45;&gt;data&#45;cleaning -->\n",
"<g id=\"edge11\" class=\"edge\">\n",
"<title>scheduler&#45;&gt;data&#45;cleaning</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M266,-403.8424C266,-394.4989 266,-384.8793 266,-376.313\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"269.5001,-376.1257 266,-366.1257 262.5001,-376.1257 269.5001,-376.1257\"/>\n",
"</g>\n",
"<!-- data&#45;cleaning&#45;&gt;data&#45;preparation -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;cleaning&#45;&gt;data&#45;preparation</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M266,-329.7079C266,-321.4635 266,-311.5376 266,-302.3622\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"269.5001,-302.0817 266,-292.0817 262.5001,-302.0818 269.5001,-302.0817\"/>\n",
"</g>\n",
"<!-- feature&#45;extraction -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>feature&#45;extraction</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"429\" cy=\"-202\" rx=\"72.4374\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"429\" y=\"-197.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">feature extraction</text>\n",
"</g>\n",
"<!-- data&#45;preparation&#45;&gt;feature&#45;extraction -->\n",
"<g id=\"edge12\" class=\"edge\">\n",
"<title>data&#45;preparation&#45;&gt;feature&#45;extraction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M300.9676,-258.5542C325.2117,-247.8451 357.7175,-233.4868 383.9088,-221.9176\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"385.5123,-225.0356 393.2455,-217.7934 382.6839,-218.6324 385.5123,-225.0356\"/>\n",
"</g>\n",
"<!-- model&#45;training -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>model&#45;training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"429\" cy=\"-128\" rx=\"61.8567\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"429\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model training</text>\n",
"</g>\n",
"<!-- feature&#45;extraction&#45;&gt;model&#45;training -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>feature&#45;extraction&#45;&gt;model&#45;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M429,-183.7079C429,-175.4635 429,-165.5376 429,-156.3622\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"432.5001,-156.0817 429,-146.0817 425.5001,-156.0818 432.5001,-156.0817\"/>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node12\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"129.1486,-74 52.8514,-74 48.8514,-70 48.8514,-38 125.1486,-38 129.1486,-42 129.1486,-74\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"125.1486,-70 48.8514,-70 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"125.1486,-70 125.1486,-38 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"125.1486,-70 129.1486,-74 \"/>\n",
"<text text-anchor=\"middle\" x=\"89\" y=\"-51.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- model&#45;training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge13\" class=\"edge\">\n",
"<title>model&#45;training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M378.8318,-117.3762C314.4821,-103.7492 203.5243,-80.2522 139.1725,-66.6248\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"139.7671,-63.1731 129.2589,-64.5254 138.3168,-70.0213 139.7671,-63.1731\"/>\n",
"</g>\n",
"<!-- configuration -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>configuration</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"497.1559,-292 418.8441,-292 418.8441,-296 406.8441,-296 406.8441,-256 497.1559,-256 497.1559,-292\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"406.8441,-292 418.8441,-292 \"/>\n",
"<text text-anchor=\"middle\" x=\"452\" y=\"-269.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">configuration</text>\n",
"</g>\n",
"<!-- configuration&#45;&gt;feature&#45;extraction -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>configuration&#45;&gt;feature&#45;extraction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M446.1961,-255.8314C443.7092,-248.0463 440.7469,-238.7729 437.9875,-230.1347\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"441.2591,-228.874 434.882,-220.4133 434.591,-231.0041 441.2591,-228.874\"/>\n",
"</g>\n",
"<!-- configuration&#45;&gt;model&#45;training -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>configuration&#45;&gt;model&#45;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M479.9609,-255.862C491.554,-246.6621 503.7402,-234.3809 510,-220 516.3858,-205.3296 517.4303,-198.1701 510,-184 501.9789,-168.7031 487.673,-156.758 473.2923,-147.9017\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"474.7987,-144.7307 464.3841,-142.7842 471.3117,-150.8004 474.7987,-144.7307\"/>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node10\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"86\" cy=\"-554.8221\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"86\" y=\"-550.6221\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- user&#45;&gt;telemetry -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>user&#45;&gt;telemetry</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M94.3393,-531.3775C100.1835,-514.9474 108.0173,-492.9238 114.1665,-475.6364\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"117.5108,-476.6779 117.5646,-466.0832 110.9156,-474.3319 117.5108,-476.6779\"/>\n",
"</g>\n",
"<!-- prediction&#45;api -->\n",
"<g id=\"node11\" class=\"node\">\n",
"<title>prediction&#45;api</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"78\" cy=\"-128\" rx=\"61.874\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"78\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction API</text>\n",
"</g>\n",
"<!-- prediction&#45;api&#45;&gt;user -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>prediction&#45;api&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M72.4056,-146.2376C64.4971,-173.6779 51,-227.3436 51,-274 51,-348 51,-348 51,-348 51,-425.8865 51.8603,-446.7442 72,-521.9819 72.0561,-522.1914 72.1133,-522.4014 72.1715,-522.6117\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"68.8695,-523.7738 75.36,-532.1478 75.5083,-521.554 68.8695,-523.7738\"/>\n",
"</g>\n",
"<!-- prediction&#45;api&#45;&gt;model&#45;state -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>prediction&#45;api&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M80.7758,-109.8314C81.9522,-102.131 83.3511,-92.9743 84.6586,-84.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"88.1364,-84.8272 86.1869,-74.4133 81.2167,-83.7699 88.1364,-84.8272\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fc6470>"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram.graph_attr.update(size='8!')\n",
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Workflow Orchestration\n",
"\n",
"- Takes care of routing data into a DAG of operations\n",
"- Useful features:\n",
" - versioning\n",
" - lineage\n",
" - audit trails\n",
" - retry logic\n",
" - failure handling\n",
" - pipeline monitoring\n",
"- Implementations:\n",
" - OSS: [Apache Airflow](https://airflow.apache.org/index.html)\n",
" - AWS: [Amazon Data Pipeline](https://aws.amazon.com/datapipeline/)\n",
" - GCP: [Cloud Composer](https://cloud.google.com/composer/) (which is hosted Airflow)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Model State\n",
"\n",
"- Owned and managed by the inference component.\n",
"- Docker is one great way to encapsulate a version of inference code and model state.\n",
"- Alternatively, the inference component should expose _internal_ API for updating the model state.\n",
"- Push \"the latest\" model state into inference.\n",
"- _Inference component defines the interface for model state updates and format._"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Training\n",
"\n",
"- This is the fun part, right?\n",
"- Version your models\n",
"- Simplest solution for versioned configuration is to just make it part of the training code\n",
"- Think about hyper param tunig"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Pre-processing\n",
"\n",
"- Bot detection\n",
"- Deduplication\n",
"- Enrichment\n",
" - e.g. ip2geo\n",
" - e.g. User Agent parsing\n",
"- Compliance\n",
" - e.g. removal of information that is not allowed to be trained on"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# JSON or CSV?\n",
"\n",
"# <font color=\"red\">NO.</font>\n",
"\n",
"&nbsp;\n",
"\n",
"- Real data has a schema.\n",
"- Enforce this from telemetry all the way through the pipeline.\n",
"- Because:\n",
" - Was it `user_id` or `userid`?\n",
" - All our timestamps are in ms since epoch, right?\n",
" - The `session_id` will never be `null`, right?\n",
" - Sorry, the new telemetry doesn't populate `segment_id` anymore.\n",
"- Good use of schemas allow you to update producers and consumer independently\n",
" - Using proper _schema evolution_."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# The Industry's Favourite Serialisation Format\n",
"\n",
"- [Apache Avro](http://avro.apache.org/)\n",
"- Takes care of enforcing reading and writing schemas\n",
"- Allows for default values and `null`s (if you must).\n",
"- Works out of the box in:\n",
" - Apache Spark (including Google Cloud Dataproc, which is Spark)\n",
" - Google BigQuery\n",
"- Code generation for Java, Python, and others.\n",
"\n",
"### Alternatives:\n",
"- [Protocol Buffers](https://developers.google.com/protocol-buffers/) (protobuf)\n",
"- ~~[MessagePack](https://msgpack.org/index.html)~~ (no schema)\n",
"- ~~[BSON](http://bsonspec.org/)~~ (no schema)\n",
"- [JSON Schema](https://json-schema.org/) (if you must)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Schema Evolution\n",
"- When adding a new field to an entity, it must be optional\n",
"- Removing fields from the schema can’t be done\n",
" - But a producer can stop populating optional fields\n",
"- Readers / consumers / clients must have sensible handling of empty optionals\n",
" - Usually default values\n",
" - Sometimes different behaviour\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# What About Streaming\n",
"\n",
"- (Which, by the way, is not the same as real-time.)\n",
"- Two types of streaming data uses:\n",
" 1. Capture the stream and divide into batches\n",
" - Telemetry data usually comes in streaming\n",
" - Event driven systems deliver streams\n",
" - But for some models, streaming updates are difficult (e.g. deep learning)\n",
" 2. Truly streaming model updates\n",
" - Fairly straightforward for Bayesian inference models and simpler statistical inference models\n",
" - Brings some pre-processing concerns to streaming pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Streaming Solutions\n",
"- Everybody's favourite: [Apache Kafka](http://kafka.apache.org/)\n",
" - Has configurable _long term_ retention.\n",
" - Allows for complete stream replay or starting from a arbitrary position.\n",
" - Log compaction.\n",
" - Additional components, including [Schema Registry](https://docs.confluent.io/current/schema-registry/docs/index.html), available in Confluent distribution.\n",
"- On AWS:\n",
" - [Amazon MSK](https://aws.amazon.com/msk/) (managed, hosted Kafka)\n",
" - [Amazon Kinesis](https://aws.amazon.com/kinesis/)\n",
"- On GCP:\n",
" - [Cloud Pub/Sub](https://cloud.google.com/pubsub/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Capturing the Stream\n",
"\n",
"- Simple strategy: capture events and periodically write to a file\n",
" - Always apply partitioning, at leat by time (e.g. 1 hour buckets)\n",
" - Make sure to keep partial files separate until completion\n",
"- For Kafka:\n",
" - [Kafka Connect](https://docs.confluent.io/current/connect/index.html)\n",
" - Writes to GCP, AWS and others\n",
" - Applies partitioning\n",
"- Hosted alternatives:\n",
" - On GCP: [Cloud Dataflow](https://cloud.google.com/dataflow/) (for Pub/Sub or Kafka)\n",
" - On AWS: Meh, just run Kafka Connect or perhaps Lambda's for Kinesis"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Streaming Model Updates\n",
"\n",
"- Training divided into two phases:\n",
" - Initial training, creates baseline model state\n",
" - Streaming model updates, as feedback comes in\n",
"- Split out responsibilities between training component and inference component\n",
" - Inference becomes a consumer of telemetry or other feedback mechanism\n",
" - Cross cutting coordination to ensure consistency"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph(\n",
" 'Simple ML System',\n",
" artefact('telemetry data'), artefact('external sources'),\n",
" process('telemetry'),\n",
" cluster(\n",
" 'workflow orchestration',\n",
" actor('scheduler'),\n",
" cluster(\n",
" 'pre-processing',\n",
" process('data cleaning'),\n",
" process('data preparation'),\n",
" sync_edge('data cleaning', 'data preparation')\n",
" ),\n",
" cluster(\n",
" 'training',\n",
" process('feature extraction'),\n",
" process('model training'),\n",
" artefact('configuration'),\n",
" dependency('configuration', 'model training'),\n",
" dependency('configuration', 'feature extraction'),\n",
" sync_edge('feature extraction', 'model training')\n",
" )\n",
" ),\n",
" actor('user'),\n",
" cluster(\n",
" 'inference',\n",
" process('prediction API'),\n",
" state('model state')\n",
" ),\n",
" dependency('telemetry data', 'data preparation'),\n",
" dependency('external sources', 'data preparation'),\n",
" dependency('prediction API', 'model state'),\n",
" sync_edge('prediction API', 'user'),\n",
" async_edge('user', 'telemetry'),\n",
" sync_edge('telemetry', 'telemetry data'),\n",
" async_edge('telemetry', 'prediction API'),\n",
" sync_edge('scheduler', 'data cleaning'),\n",
" sync_edge('data preparation', 'feature extraction'),\n",
" sync_edge('model training', 'model state')\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"576pt\" height=\"429pt\"\n",
" viewBox=\"0.00 0.00 576.00 428.84\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(.7802 .7802) rotate(0) translate(4 545.6622)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-545.6622 734.2847,-545.6622 734.2847,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"365.1423\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple ML System</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster workflow&#45;orchestration</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"262,-94 262,-533.6622 616,-533.6622 616,-94 262,-94\"/>\n",
"<text text-anchor=\"middle\" x=\"439\" y=\"-518.4622\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">workflow orchestration</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster pre&#45;processing</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"270,-254.8402 270,-407.6803 420,-407.6803 420,-254.8402 270,-254.8402\"/>\n",
"<text text-anchor=\"middle\" x=\"345\" y=\"-392.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">pre&#45;processing</text>\n",
"</g>\n",
"<g id=\"clust3\" class=\"cluster\">\n",
"<title>cluster training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"428,-102 428,-328.8402 608,-328.8402 608,-102 428,-102\"/>\n",
"<text text-anchor=\"middle\" x=\"518\" y=\"-313.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training</text>\n",
"</g>\n",
"<g id=\"clust4\" class=\"cluster\">\n",
"<title>cluster inference</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"8,-30 8,-176 148,-176 148,-30 8,-30\"/>\n",
"<text text-anchor=\"middle\" x=\"78\" y=\"-160.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">inference</text>\n",
"</g>\n",
"<!-- telemetry&#45;data -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>telemetry&#45;data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"253.6802,-146 170.3198,-146 170.3198,-150 158.3198,-150 158.3198,-110 253.6802,-110 253.6802,-146\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"158.3198,-146 170.3198,-146 \"/>\n",
"<text text-anchor=\"middle\" x=\"206\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">telemetry data</text>\n",
"</g>\n",
"<!-- data&#45;preparation -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>data&#45;preparation</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"345\" cy=\"-280.8402\" rx=\"67.1265\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"345\" y=\"-276.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data preparation</text>\n",
"</g>\n",
"<!-- telemetry&#45;data&#45;&gt;data&#45;preparation -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>telemetry&#45;data&#45;&gt;data&#45;preparation</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M222.62,-146.2748C247.3757,-173.4955 294.0169,-224.7807 322.0224,-255.5747\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"319.7374,-258.2642 329.0549,-263.3075 324.9161,-253.5545 319.7374,-258.2642\"/>\n",
"</g>\n",
"<!-- external&#45;sources -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>external&#45;sources</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"730.0702,-377.6803 635.9298,-377.6803 635.9298,-381.6803 623.9298,-381.6803 623.9298,-341.6803 730.0702,-341.6803 730.0702,-377.6803\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"623.9298,-377.6803 635.9298,-377.6803 \"/>\n",
"<text text-anchor=\"middle\" x=\"677\" y=\"-355.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">external sources</text>\n",
"</g>\n",
"<!-- external&#45;sources&#45;&gt;data&#45;preparation -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>external&#45;sources&#45;&gt;data&#45;preparation</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M623.6688,-355.8359C560.622,-350.828 459.7464,-341.2299 424,-328.8402 406.9221,-322.921 389.5804,-312.9197 375.5125,-303.5737\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"377.4351,-300.6482 367.2079,-297.8815 373.4775,-306.4221 377.4351,-300.6482\"/>\n",
"</g>\n",
"<!-- telemetry -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>telemetry</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"149\" cy=\"-202\" rx=\"43.4974\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"149\" y=\"-197.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">telemetry</text>\n",
"</g>\n",
"<!-- telemetry&#45;&gt;telemetry&#45;data -->\n",
"<g id=\"edge10\" class=\"edge\">\n",
"<title>telemetry&#45;&gt;telemetry&#45;data</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M162.2202,-184.8369C169.2088,-175.764 177.9377,-164.4317 185.7585,-154.2784\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"188.7267,-156.1605 192.0562,-146.1024 183.1811,-151.8889 188.7267,-156.1605\"/>\n",
"</g>\n",
"<!-- prediction&#45;api -->\n",
"<g id=\"node11\" class=\"node\">\n",
"<title>prediction&#45;api</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"78\" cy=\"-128\" rx=\"61.874\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"78\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction API</text>\n",
"</g>\n",
"<!-- telemetry&#45;&gt;prediction&#45;api -->\n",
"<g id=\"edge11\" class=\"edge\">\n",
"<title>telemetry&#45;&gt;prediction&#45;api</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M132.889,-185.2083C123.7364,-175.6689 112.1035,-163.5444 101.9048,-152.9149\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"104.2369,-150.2901 94.7881,-145.4974 99.1858,-155.1364 104.2369,-150.2901\"/>\n",
"</g>\n",
"<!-- scheduler -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>scheduler</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"345\" cy=\"-459.6713\" rx=\"43.9819\" ry=\"43.9819\"/>\n",
"<text text-anchor=\"middle\" x=\"345\" y=\"-455.4713\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">scheduler</text>\n",
"</g>\n",
"<!-- data&#45;cleaning -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>data&#45;cleaning</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"345\" cy=\"-359.6803\" rx=\"57.0027\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"345\" y=\"-355.4803\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data cleaning</text>\n",
"</g>\n",
"<!-- scheduler&#45;&gt;data&#45;cleaning -->\n",
"<g id=\"edge12\" class=\"edge\">\n",
"<title>scheduler&#45;&gt;data&#45;cleaning</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M345,-415.5227C345,-406.1793 345,-396.5596 345,-387.9933\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"348.5001,-387.806 345,-377.8061 341.5001,-387.8061 348.5001,-387.806\"/>\n",
"</g>\n",
"<!-- data&#45;cleaning&#45;&gt;data&#45;preparation -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;cleaning&#45;&gt;data&#45;preparation</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M345,-341.3947C345,-331.842 345,-319.9386 345,-309.2197\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"348.5001,-309.0267 345,-299.0267 341.5001,-309.0267 348.5001,-309.0267\"/>\n",
"</g>\n",
"<!-- feature&#45;extraction -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>feature&#45;extraction</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"508\" cy=\"-202\" rx=\"72.4374\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"508\" y=\"-197.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">feature extraction</text>\n",
"</g>\n",
"<!-- data&#45;preparation&#45;&gt;feature&#45;extraction -->\n",
"<g id=\"edge13\" class=\"edge\">\n",
"<title>data&#45;preparation&#45;&gt;feature&#45;extraction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M377.5937,-265.0752C402.9685,-252.8018 438.3803,-235.6738 465.8998,-222.3631\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"467.4258,-225.513 474.9041,-218.0079 464.3778,-219.2114 467.4258,-225.513\"/>\n",
"</g>\n",
"<!-- model&#45;training -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>model&#45;training</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"508\" cy=\"-128\" rx=\"61.8567\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"508\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model training</text>\n",
"</g>\n",
"<!-- feature&#45;extraction&#45;&gt;model&#45;training -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>feature&#45;extraction&#45;&gt;model&#45;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M508,-183.7079C508,-175.4635 508,-165.5376 508,-156.3622\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"511.5001,-156.0817 508,-146.0817 504.5001,-156.0818 511.5001,-156.0817\"/>\n",
"</g>\n",
"<!-- model&#45;state -->\n",
"<g id=\"node12\" class=\"node\">\n",
"<title>model&#45;state</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"129.1486,-74 52.8514,-74 48.8514,-70 48.8514,-38 125.1486,-38 129.1486,-42 129.1486,-74\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"125.1486,-70 48.8514,-70 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"125.1486,-70 125.1486,-38 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"125.1486,-70 129.1486,-74 \"/>\n",
"<text text-anchor=\"middle\" x=\"89\" y=\"-51.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state</text>\n",
"</g>\n",
"<!-- model&#45;training&#45;&gt;model&#45;state -->\n",
"<g id=\"edge14\" class=\"edge\">\n",
"<title>model&#45;training&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M454.5078,-118.808C373.1134,-104.8214 219.2377,-78.3798 139.6,-64.695\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"139.7839,-61.1754 129.3355,-62.9312 138.5983,-68.0742 139.7839,-61.1754\"/>\n",
"</g>\n",
"<!-- configuration -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>configuration</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"576.1559,-298.8402 497.8441,-298.8402 497.8441,-302.8402 485.8441,-302.8402 485.8441,-262.8402 576.1559,-262.8402 576.1559,-298.8402\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"485.8441,-298.8402 497.8441,-298.8402 \"/>\n",
"<text text-anchor=\"middle\" x=\"531\" y=\"-276.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">configuration</text>\n",
"</g>\n",
"<!-- configuration&#45;&gt;feature&#45;extraction -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>configuration&#45;&gt;feature&#45;extraction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M525.6655,-262.5545C522.8497,-252.9024 519.3337,-240.8502 516.1815,-230.045\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"519.4661,-228.8061 513.3055,-220.1865 512.7462,-230.7666 519.4661,-228.8061\"/>\n",
"</g>\n",
"<!-- configuration&#45;&gt;model&#45;training -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>configuration&#45;&gt;model&#45;training</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M555.6941,-262.6545C568.234,-251.8375 582.2211,-236.9351 589,-220 594.9459,-205.1458 596.4303,-198.1701 589,-184 580.9789,-168.7031 566.673,-156.758 552.2923,-147.9017\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"553.7987,-144.7307 543.3841,-142.7842 550.3117,-150.8004 553.7987,-144.7307\"/>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node10\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"113\" cy=\"-280.8402\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"113\" y=\"-276.6402\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- user&#45;&gt;telemetry -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>user&#45;&gt;telemetry</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M123.4127,-258.0363C127.5432,-248.9906 132.3261,-238.516 136.6311,-229.0879\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"139.9271,-230.2958 140.8971,-219.7455 133.5595,-227.3882 139.9271,-230.2958\"/>\n",
"</g>\n",
"<!-- prediction&#45;api&#45;&gt;user -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>prediction&#45;api&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M81.4301,-146.0438C85.0344,-164.6386 91.0033,-194.436 97,-220 99.0437,-228.7125 101.4183,-238.0804 103.7025,-246.7918\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"100.3678,-247.8713 106.3173,-256.6379 107.1333,-246.0746 100.3678,-247.8713\"/>\n",
"</g>\n",
"<!-- prediction&#45;api&#45;&gt;model&#45;state -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>prediction&#45;api&#45;&gt;model&#45;state</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M80.7758,-109.8314C81.9522,-102.131 83.3511,-92.9743 84.6586,-84.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"88.1364,-84.8272 86.1869,-74.4133 81.2167,-83.7699 88.1364,-84.8272\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fc69b0>"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram.graph_attr.update(size='8!,')\n",
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Telemetry\n",
"\n",
"> _Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring._\n",
"\n",
"- Remote with respect to the training pipeline\n",
" - In the user's browser.\n",
" - At inference time in the prediction API.\n",
" - Collected from externally hosted sources, e.g. banner ad CTR from Google.\n",
" - Etc.\n",
"- Should fix as much context as feasible at logging time\n",
" - As opposed to inferring information during pre-processing for training\n",
" - E.g. don't infer the sensor configuration from time, when it's available at logging time\n",
" - Examples of context:\n",
" - Which version of the model created the prediction\n",
" - Which version of the prodcut / UI was the user seeing\n",
" - Geolocation / device / connection (speed) / etc.\n",
" - Anything else that could influence the experience.\n",
"\n",
"\n",
"**IT IS SUPER IMPORTANT TO GET THIS RIGHT!**"
]
},
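{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Telemetry\n",
"\n",
"Fixing context at logging time could look roughly like the sketch below. The names (`PredictionEvent`, its fields, `log_event`) are assumptions made up for illustration, not part of any library or of the system described here.\n",
"\n",
"```python\n",
"import json\n",
"from dataclasses import asdict, dataclass\n",
"\n",
"\n",
"@dataclass(frozen=True)\n",
"class PredictionEvent:\n",
"    # Hypothetical schema: every field is known at the moment of logging.\n",
"    user_id: str\n",
"    prediction: float\n",
"    model_version: str  # which model version created the prediction\n",
"    ui_version: str     # which product / UI version the user was seeing\n",
"    device: str\n",
"    country: str\n",
"    timestamp_ms: int\n",
"\n",
"\n",
"def log_event(event, sink):\n",
"    # Write one JSON record per line; `sink` is any object with a write() method.\n",
"    sink.write(json.dumps(asdict(event), sort_keys=True) + '\\n')\n",
"```\n",
"\n",
"Pre-processing can then read `model_version` and `ui_version` straight from each record instead of reconstructing them from timestamps."
]
},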
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Telemery\n",
"\n",
"- This is inherently a cross cutting concern\n",
"- Strategies:\n",
" - Server side\n",
" - Everything under your own control\n",
" - Telemetry code all over the place\n",
" - Think about a telemetry logging service\n",
" - Be sure to still use schemas!\n",
" - Client side\n",
" - Popular for browser based apps\n",
" - Telemetry code in one place\n",
" - Requires to push more information to the client for the sake of logging back to server side\n",
" - Needs translation from client side events back into messages with a schema!\n",
" - Mixed\n",
"- Solutions:\n",
" - [Divolte Collector](https://divolte.io/)\n",
" - Disclaimer: project started by yours truly\n",
" - Takes in events in JSON or from the browser and maps onto (Avro) schema\n",
" - Writes to GCS, S3 or HDFS as well as Kafka, Pub/Sub and hopefully soon Kinesis\n",
" - MS Azure has [Event Hubs](https://azure.microsoft.com/en-us/services/event-hubs/) for this purpose\n",
" - AWS / GCP managed: none that I am aware of"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# A Word on (Online) Experimentation\n",
"\n",
"- If telemetry is setup correctly, experimentation becomes part of the context\n",
" - Which (experiment) versions was a user subject to when the event was logged\n",
"- Consistently assigning experiments to users is key\n",
"- Strategies:\n",
" - The inference API is aware of the experiments and user information and serves accordingly\n",
" - Multiple versions of inference API are deployed and request routing handles experiment assignment (preferable)\n",
"- Partial solutions:\n",
" - [Facebook PlanOut](https://facebook.github.io/planout/), for consistent experiment assignment with configured distributional properties.\n",
" - Service Mesh deployment solutions (e.g. Istio on top of Kubernetes) for dynamic request routing"
]
},
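{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# A Word on (Online) Experimentation\n",
"\n",
"Consistent assignment can be sketched by hashing the experiment name together with the user id. This is a simplified illustration with a hypothetical `assign_variant` helper and an equal split over variants; it is not how PlanOut is implemented or used.\n",
"\n",
"```python\n",
"import hashlib\n",
"\n",
"\n",
"def assign_variant(experiment, user_id, variants=('A', 'B')):\n",
"    # Hash experiment name and user id together, so the same user always gets\n",
"    # the same variant and different experiments are bucketed independently.\n",
"    key = '{}:{}'.format(experiment, user_id).encode('utf-8')\n",
"    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)\n",
"    return variants[bucket]\n",
"```\n",
"\n",
"Because the assignment is a pure function of its inputs, the routing layer and the telemetry code can both compute it without sharing state."
]
},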
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"diagram = graph('Simple Prediction Service',\n",
" actor('user'), process('telemetry'),\n",
" artefact('feedback'), state('model state A'),\n",
" state('model state B'),\n",
" cluster('offline training',\n",
" process('training A'),\n",
" process('training B'),\n",
" artefact('data'),\n",
" sync_edge('data', 'training A'),\n",
" sync_edge('data', 'training B')),\n",
" cluster('online predictions service',\n",
" process('evaluate model A'),\n",
" process('evaluate model B'),\n",
" artefact('prediction'),\n",
" artefact('unknown sample'),\n",
" sync_edge('unknown sample', 'evaluate model A'),\n",
" sync_edge('unknown sample', 'evaluate model B')),\n",
" cluster('model monitoring',\n",
" process('residual tracking'),\n",
" actor('you'),\n",
" sync_edge('residual tracking', 'you')),\n",
" cluster('experimentation service',\n",
" artefact('experiment definitions'),\n",
" process('experiment routing'),\n",
" dependency('experiment routing', 'experiment definitions'),\n",
" sync_edge('experiment routing', 'unknown sample')),\n",
" sync_edge('training A', 'model state A'),\n",
" dependency('evaluate model A', 'model state A'),\n",
" sync_edge('evaluate model', 'prediction'),\n",
" sync_edge('user', 'experiment routing'),\n",
" sync_edge('prediction', 'user'),\n",
" async_edge('user', 'clickstream collection'),\n",
" sync_edge('telemetry', 'feedback'),\n",
" sync_edge('feedback', 'data'),\n",
" dependency('residual tracking', 'feedback'),\n",
" dependency('evaluate model B', 'model state B'),\n",
" dependency('clickstream collection', 'experiment definitions'),\n",
" sync_edge('training B', 'model state B')\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: %3 Pages: 1 -->\n",
"<svg width=\"720pt\" height=\"296pt\"\n",
" viewBox=\"0.00 0.00 720.00 295.69\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(.7047 .7047) rotate(0) translate(4 415.5842)\">\n",
"<title>%3</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-415.5842 1017.6789,-415.5842 1017.6789,4 -4,4\"/>\n",
"<text text-anchor=\"middle\" x=\"506.8395\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">Simple Prediction Service</text>\n",
"<g id=\"clust1\" class=\"cluster\">\n",
"<title>cluster offline&#45;training</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"696.6789,-86 696.6789,-232 913.6789,-232 913.6789,-86 696.6789,-86\"/>\n",
"<text text-anchor=\"middle\" x=\"805.1789\" y=\"-216.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">offline training</text>\n",
"</g>\n",
"<g id=\"clust2\" class=\"cluster\">\n",
"<title>cluster online&#45;predictions&#45;service</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"366.6789,-86 366.6789,-232 688.6789,-232 688.6789,-86 366.6789,-86\"/>\n",
"<text text-anchor=\"middle\" x=\"527.6789\" y=\"-216.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">online predictions service</text>\n",
"</g>\n",
"<g id=\"clust3\" class=\"cluster\">\n",
"<title>cluster model&#45;monitoring</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"851.6789,-240 851.6789,-403.5842 1005.6789,-403.5842 1005.6789,-240 851.6789,-240\"/>\n",
"<text text-anchor=\"middle\" x=\"928.6789\" y=\"-388.3842\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model monitoring</text>\n",
"</g>\n",
"<g id=\"clust4\" class=\"cluster\">\n",
"<title>cluster experimentation&#45;service</title>\n",
"<polygon fill=\"none\" stroke=\"#0000ff\" points=\"186.6789,-158 186.6789,-319.372 358.6789,-319.372 358.6789,-158 186.6789,-158\"/>\n",
"<text text-anchor=\"middle\" x=\"272.6789\" y=\"-304.172\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">experimentation service</text>\n",
"</g>\n",
"<!-- user -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>user</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"272.6789\" cy=\"-355.5842\" rx=\"24.6814\" ry=\"24.6814\"/>\n",
"<text text-anchor=\"middle\" x=\"272.6789\" y=\"-351.3842\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">user</text>\n",
"</g>\n",
"<!-- experiment&#45;routing -->\n",
"<g id=\"node16\" class=\"node\">\n",
"<title>experiment&#45;routing</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"272.6789\" cy=\"-271.372\" rx=\"77.784\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"272.6789\" y=\"-267.172\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">experiment routing</text>\n",
"</g>\n",
"<!-- user&#45;&gt;experiment&#45;routing -->\n",
"<g id=\"edge11\" class=\"edge\">\n",
"<title>user&#45;&gt;experiment&#45;routing</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M272.6789,-330.3178C272.6789,-320.6403 272.6789,-309.542 272.6789,-299.6045\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"276.179,-299.3989 272.6789,-289.3989 269.179,-299.3989 276.179,-299.3989\"/>\n",
"</g>\n",
"<!-- clickstream&#45;collection -->\n",
"<g id=\"node18\" class=\"node\">\n",
"<title>clickstream&#45;collection</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"88.6789\" cy=\"-271.372\" rx=\"88.8583\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"88.6789\" y=\"-267.172\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">clickstream&#45;collection</text>\n",
"</g>\n",
"<!-- user&#45;&gt;clickstream&#45;collection -->\n",
"<g id=\"edge13\" class=\"edge\">\n",
"<title>user&#45;&gt;clickstream&#45;collection</title>\n",
"<path fill=\"none\" stroke=\"#ff0000\" stroke-dasharray=\"5,2\" d=\"M249.1264,-346.7725C230.8259,-339.7665 204.9043,-329.4933 182.6789,-319.372 164.9748,-311.3097 145.7886,-301.6793 129.3943,-293.168\"/>\n",
"<polygon fill=\"#ff0000\" stroke=\"#ff0000\" points=\"130.5799,-289.8384 120.0965,-288.3063 127.3363,-296.0416 130.5799,-289.8384\"/>\n",
"</g>\n",
"<!-- telemetry -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>telemetry</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"797.6789\" cy=\"-355.5842\" rx=\"43.4974\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"797.6789\" y=\"-351.3842\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">telemetry</text>\n",
"</g>\n",
"<!-- feedback -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>feedback</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"837.6983,-289.372 783.6596,-289.372 783.6596,-293.372 771.6596,-293.372 771.6596,-253.372 837.6983,-253.372 837.6983,-289.372\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"771.6596,-289.372 783.6596,-289.372 \"/>\n",
"<text text-anchor=\"middle\" x=\"804.6789\" y=\"-267.172\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">feedback</text>\n",
"</g>\n",
"<!-- telemetry&#45;&gt;feedback -->\n",
"<g id=\"edge14\" class=\"edge\">\n",
"<title>telemetry&#45;&gt;feedback</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M799.1978,-337.3114C800.106,-326.3865 801.2794,-312.2697 802.3074,-299.9026\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"805.8225,-299.8647 803.163,-289.6091 798.8466,-299.2848 805.8225,-299.8647\"/>\n",
"</g>\n",
"<!-- data -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>data</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"831.6789,-202 789.6789,-202 789.6789,-206 777.6789,-206 777.6789,-166 831.6789,-166 831.6789,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"777.6789,-202 789.6789,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"804.6789\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">data</text>\n",
"</g>\n",
"<!-- feedback&#45;&gt;data -->\n",
"<g id=\"edge15\" class=\"edge\">\n",
"<title>feedback&#45;&gt;data</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M804.6789,-253.2685C804.6789,-241.4363 804.6789,-225.7384 804.6789,-212.2785\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"808.179,-212.0803 804.6789,-202.0803 801.179,-212.0804 808.179,-212.0803\"/>\n",
"</g>\n",
"<!-- model&#45;state&#45;a -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>model&#45;state&#45;a</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"644.1721,-58 555.1858,-58 551.1858,-54 551.1858,-22 640.1721,-22 644.1721,-26 644.1721,-58\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"640.1721,-54 551.1858,-54 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"640.1721,-54 640.1721,-22 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"640.1721,-54 644.1721,-58 \"/>\n",
"<text text-anchor=\"middle\" x=\"597.6789\" y=\"-35.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state A</text>\n",
"</g>\n",
"<!-- model&#45;state&#45;b -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>model&#45;state&#45;b</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"780.1653,-58 691.1926,-58 687.1926,-54 687.1926,-22 776.1653,-22 780.1653,-26 780.1653,-58\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"776.1653,-54 687.1926,-54 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"776.1653,-54 776.1653,-22 \"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"776.1653,-54 780.1653,-58 \"/>\n",
"<text text-anchor=\"middle\" x=\"733.6789\" y=\"-35.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">model state B</text>\n",
"</g>\n",
"<!-- training&#45;a -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>training&#45;a</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"750.6789\" cy=\"-112\" rx=\"45.9548\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"750.6789\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training A</text>\n",
"</g>\n",
"<!-- training&#45;a&#45;&gt;model&#45;state&#45;a -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>training&#45;a&#45;&gt;model&#45;state&#45;a</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M721.179,-98.1177C699.6963,-88.0081 670.1399,-74.0993 645.3279,-62.423\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"646.616,-59.1611 636.0775,-58.0699 643.6354,-65.4948 646.616,-59.1611\"/>\n",
"</g>\n",
"<!-- training&#45;b -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>training&#45;b</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"859.6789\" cy=\"-112\" rx=\"45.9461\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"859.6789\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">training B</text>\n",
"</g>\n",
"<!-- training&#45;b&#45;&gt;model&#45;state&#45;b -->\n",
"<g id=\"edge19\" class=\"edge\">\n",
"<title>training&#45;b&#45;&gt;model&#45;state&#45;b</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M833.5717,-97.0816C816.5214,-87.3385 793.9046,-74.4147 774.5038,-63.3285\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"775.9821,-60.1421 765.5631,-58.2195 772.5091,-66.2198 775.9821,-60.1421\"/>\n",
"</g>\n",
"<!-- data&#45;&gt;training&#45;a -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>data&#45;&gt;training&#45;a</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M791.0525,-165.8314C784.6415,-157.2835 776.8848,-146.9412 769.8906,-137.6156\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"772.5423,-135.3177 763.7422,-129.4177 766.9423,-139.5177 772.5423,-135.3177\"/>\n",
"</g>\n",
"<!-- data&#45;&gt;training&#45;b -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>data&#45;&gt;training&#45;b</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M818.5578,-165.8314C825.0874,-157.2835 832.9878,-146.9412 840.1115,-137.6156\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"843.0847,-139.4891 846.3737,-129.4177 837.522,-135.2398 843.0847,-139.4891\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;a -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>evaluate&#45;model&#45;a</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"446.6789\" cy=\"-112\" rx=\"71.9876\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"446.6789\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate model A</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;a&#45;&gt;model&#45;state&#45;a -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;a&#45;&gt;model&#45;state&#45;a</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M480.1907,-96.0209C500.8856,-86.1531 527.7519,-73.3427 550.569,-62.463\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"552.2228,-65.552 559.7428,-58.0888 549.21,-59.2336 552.2228,-65.552\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;b -->\n",
"<g id=\"node10\" class=\"node\">\n",
"<title>evaluate&#45;model&#45;b</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"608.6789\" cy=\"-112\" rx=\"71.979\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"608.6789\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate model B</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;b&#45;&gt;model&#45;state&#45;b -->\n",
"<g id=\"edge17\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;b&#45;&gt;model&#45;state&#45;b</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M637.3565,-95.4817C653.9571,-85.9198 675.1133,-73.7338 693.3676,-63.2193\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"695.1446,-66.2349 702.063,-58.2108 691.6507,-60.1692 695.1446,-66.2349\"/>\n",
"</g>\n",
"<!-- prediction -->\n",
"<g id=\"node11\" class=\"node\">\n",
"<title>prediction</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"584.9381,-202 524.4198,-202 524.4198,-206 512.4198,-206 512.4198,-166 584.9381,-166 584.9381,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"512.4198,-202 524.4198,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"548.6789\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">prediction</text>\n",
"</g>\n",
"<!-- prediction&#45;&gt;user -->\n",
"<g id=\"edge12\" class=\"edge\">\n",
"<title>prediction&#45;&gt;user</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M535.5652,-202.3301C527.4439,-212.4401 516.1939,-224.4401 503.6789,-232 492.2346,-238.9132 486.7595,-234.2706 474.6789,-240 419.554,-266.1437 416.2342,-290.1472 362.6789,-319.372 344.6351,-329.2184 323.3402,-337.9069 305.9951,-344.3036\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"304.5166,-341.1159 296.2935,-347.7965 306.8879,-347.702 304.5166,-341.1159\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample -->\n",
"<g id=\"node12\" class=\"node\">\n",
"<title>unknown&#45;sample</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"494.4449,-202 394.913,-202 394.913,-206 382.913,-206 382.913,-166 494.4449,-166 494.4449,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"382.913,-202 394.913,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"438.6789\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">unknown sample</text>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;evaluate&#45;model&#45;a -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;evaluate&#45;model&#45;a</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M440.6977,-165.8314C441.5533,-158.131 442.5707,-148.9743 443.5215,-140.4166\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"447.0072,-140.7386 444.633,-130.4133 440.05,-139.9656 447.0072,-140.7386\"/>\n",
"</g>\n",
"<!-- unknown&#45;sample&#45;&gt;evaluate&#45;model&#45;b -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>unknown&#45;sample&#45;&gt;evaluate&#45;model&#45;b</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M481.5771,-165.8314C506.2601,-155.3774 537.2799,-142.2396 562.4957,-131.56\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"564.0072,-134.7208 571.8504,-127.598 561.2772,-128.2751 564.0072,-134.7208\"/>\n",
"</g>\n",
"<!-- residual&#45;tracking -->\n",
"<g id=\"node13\" class=\"node\">\n",
"<title>residual&#45;tracking</title>\n",
"<ellipse fill=\"#d3d3d3\" stroke=\"#000000\" cx=\"928.6789\" cy=\"-355.5842\" rx=\"69.0734\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"928.6789\" y=\"-351.3842\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">residual tracking</text>\n",
"</g>\n",
"<!-- residual&#45;tracking&#45;&gt;feedback -->\n",
"<g id=\"edge16\" class=\"edge\">\n",
"<title>residual&#45;tracking&#45;&gt;feedback</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M887.3069,-341.0441C874.0018,-335.3597 859.6248,-328.0883 847.6789,-319.372 839.3159,-313.27 831.3808,-305.3081 824.6567,-297.628\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"826.9698,-294.9375 817.8729,-289.5063 821.5973,-299.4249 826.9698,-294.9375\"/>\n",
"</g>\n",
"<!-- you -->\n",
"<g id=\"node14\" class=\"node\">\n",
"<title>you</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"928.6789\" cy=\"-271.372\" rx=\"23.2447\" ry=\"23.2447\"/>\n",
"<text text-anchor=\"middle\" x=\"928.6789\" y=\"-267.172\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">you</text>\n",
"</g>\n",
"<!-- residual&#45;tracking&#45;&gt;you -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>residual&#45;tracking&#45;&gt;you</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M928.6789,-337.3114C928.6789,-327.9269 928.6789,-316.1871 928.6789,-305.2243\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"932.179,-305.0539 928.6789,-295.0539 925.179,-305.0539 932.179,-305.0539\"/>\n",
"</g>\n",
"<!-- experiment&#45;definitions -->\n",
"<g id=\"node15\" class=\"node\">\n",
"<title>experiment&#45;definitions</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"339.2694,-202 210.0885,-202 210.0885,-206 198.0885,-206 198.0885,-166 339.2694,-166 339.2694,-202\"/>\n",
"<polyline fill=\"none\" stroke=\"#000000\" points=\"198.0885,-202 210.0885,-202 \"/>\n",
"<text text-anchor=\"middle\" x=\"268.6789\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">experiment definitions</text>\n",
"</g>\n",
"<!-- experiment&#45;routing&#45;&gt;unknown&#45;sample -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>experiment&#45;routing&#45;&gt;unknown&#45;sample</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M311.8259,-255.7664C327.8482,-248.9899 346.4086,-240.6335 362.6789,-232 376.4555,-224.6898 391.0581,-215.772 403.7422,-207.6159\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"405.7126,-210.5095 412.1833,-202.1201 401.8932,-204.6433 405.7126,-210.5095\"/>\n",
"</g>\n",
"<!-- experiment&#45;routing&#45;&gt;experiment&#45;definitions -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>experiment&#45;routing&#45;&gt;experiment&#45;definitions</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M271.8501,-253.2685C271.3085,-241.4363 270.5898,-225.7384 269.9736,-212.2785\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"273.4604,-211.9098 269.5067,-202.0803 266.4678,-212.23 273.4604,-211.9098\"/>\n",
"</g>\n",
"<!-- evaluate&#45;model -->\n",
"<g id=\"node17\" class=\"node\">\n",
"<title>evaluate&#45;model</title>\n",
"<ellipse fill=\"none\" stroke=\"#000000\" cx=\"548.6789\" cy=\"-271.372\" rx=\"64.7286\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"548.6789\" y=\"-267.172\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">evaluate&#45;model</text>\n",
"</g>\n",
"<!-- evaluate&#45;model&#45;&gt;prediction -->\n",
"<g id=\"edge10\" class=\"edge\">\n",
"<title>evaluate&#45;model&#45;&gt;prediction</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M548.6789,-253.2685C548.6789,-241.4363 548.6789,-225.7384 548.6789,-212.2785\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"552.179,-212.0803 548.6789,-202.0803 545.179,-212.0804 552.179,-212.0803\"/>\n",
"</g>\n",
"<!-- clickstream&#45;collection&#45;&gt;experiment&#45;definitions -->\n",
"<g id=\"edge18\" class=\"edge\">\n",
"<title>clickstream&#45;collection&#45;&gt;experiment&#45;definitions</title>\n",
"<path fill=\"none\" stroke=\"#000000\" stroke-dasharray=\"5,2\" d=\"M122.9593,-254.7324C151.061,-241.0918 191.1116,-221.6512 222.1155,-206.6019\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"223.956,-209.5991 231.4238,-202.0837 220.8992,-203.3018 223.956,-209.5991\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.dot.Digraph at 0x120fd66a0>"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diagram.graph_attr.update(size='10!,')\n",
"diagram"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Unfortunately, interpretation of experimentation results is beyond the scope of this talk&hellip;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Monitoring\n",
"\n",
"- You use monitoring to see to what extent the system is performing its intended behaviour.\n",
"- _In ML systems, you use monitoring to observe the quality of this intended behaviour._"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Data Level Monitoring\n",
"\n",
"- Actively monitor the distributions of your input data, telemetry data, and intermediate results.\n",
"  - Deviations in these distributions are usually a sign of external changes that cause data pipelines to fail semantically (while remaining operational technically).\n",
"- Actively monitor the distributions of your prediction residuals as soon as you can.\n",
"  - Sometimes this can be done online (e.g. CTR).\n",
"  - Sometimes the feedback latency is large (e.g. bookings can be cancelled weeks later).\n",
"    - In this case, look for proxies.\n",
"    - Monitor proxies by sampling from users (e.g. satisfaction survey, NPS, etc.)\n",
"- Put reasonable constraints on actions and alert when a system reaches the limit.\n",
"  - Any prediction that results in an action being taken should be limited to some reasonable threshold.\n",
"  - Alert when that threshold is reached, so a human can look into it.\n",
"  - Take additional care when that action has a monetary cost; model in a budget."
]
},
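{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Data Level Monitoring\n",
"\n",
"### A sketch of distribution monitoring\n",
"\n",
"A minimal sketch of monitoring one input distribution, assuming a single numeric feature and using the Population Stability Index (PSI) as the drift measure; the talk does not prescribe a metric. The function name, the bin count, and `ALERT_THRESHOLD` are illustrative assumptions, not recommendations. The same comparison could be applied to telemetry fields or to prediction residuals."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"ALERT_THRESHOLD = 0.2  # hypothetical value; tune per feature\n",
"\n",
"def population_stability_index(reference, current, bins=10):\n",
"    # Bin edges are quantiles of the reference window only.\n",
"    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))\n",
"    # Clip the current sample into the reference range so every value lands in a bin.\n",
"    current = np.clip(current, edges[0], edges[-1])\n",
"    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)\n",
"    cur_frac = np.histogram(current, bins=edges)[0] / len(current)\n",
"    eps = 1e-6  # avoids log(0) for empty bins\n",
"    ref_frac = np.clip(ref_frac, eps, None)\n",
"    cur_frac = np.clip(cur_frac, eps, None)\n",
"    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))\n",
"\n",
"rng = np.random.RandomState(0)\n",
"reference = rng.normal(0.0, 1.0, 10000)  # e.g. last month's values\n",
"same = rng.normal(0.0, 1.0, 10000)       # today's values, no change\n",
"shifted = rng.normal(1.0, 1.0, 10000)    # today's values after an upstream change\n",
"\n",
"psi_same = population_stability_index(reference, same)\n",
"psi_shifted = population_stability_index(reference, shifted)\n",
"alert = psi_shifted > ALERT_THRESHOLD  # hand over to a human when True"
]
},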
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Feedback, boundaries, dependencies, and other pitfalls\n",
"\n",
"> ### [Machine Learning: The High-Interest Credit Card of Technical Debt](https://ai.google/research/pubs/pub43146)\n",
 D. Scully">
"> D. Sculley, et al. (Google)\n",
"\n",
"```\n",
"@inproceedings{43146,\n",
"title = {Machine Learning: The High Interest Credit Card of Technical Debt},\n",
"author = {D. Sculley and Gary Holt and Daniel Golovin and Eugene Davydov and Todd Phillips and Dietmar Ebner and Vinay Chaudhary and Michael Young},\n",
"year = {2014},\n",
"booktitle = {SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### Discovery:\n",
"\n",
"- Identify all prediction models in your solution\n",
"- Idenfity all required data sources\n",
"- Identify all consumers of predictions\n",
"- Idenfity all available feedback"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### Define the interface for your Prediction API\n",
"\n",
"- Internal interface for:\n",
"  - Pushing updated model state\n",
"  - Pushing partial updates in case of streaming model updates\n",
"- External interface for:\n",
"  - Serving predictions\n",
"\n",
"> Sometimes these interfaces need to carry additional properties for the sake of telemetry that are not important to the prediction itself."
]
},
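{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### A sketch of a Prediction API\n",
"\n",
"One possible shape for the two interfaces on the previous slide, written as a plain in-process class with a toy linear model. All names here (`PredictionRequest`, `LinearPredictionService`, `push_model_state`, `push_partial_update`, `predict`) are hypothetical; a real service would sit behind whatever transport you already use. The `context` field only travels along for telemetry and plays no part in the prediction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from dataclasses import dataclass, field\n",
"from typing import Any, Dict, List\n",
"\n",
"@dataclass\n",
"class PredictionRequest:\n",
"    features: List[float]\n",
"    # Carried only for telemetry; not used to compute the prediction.\n",
"    context: Dict[str, Any] = field(default_factory=dict)\n",
"\n",
"class LinearPredictionService:\n",
"    def __init__(self):\n",
"        self.version = None\n",
"        self.weights = []\n",
"\n",
"    # Internal interface: replace the full model state after (re)training.\n",
"    def push_model_state(self, version, weights):\n",
"        self.version = version\n",
"        self.weights = list(weights)\n",
"\n",
"    # Internal interface: apply a partial update from a streaming trainer.\n",
"    def push_partial_update(self, version, deltas):\n",
"        self.version = version\n",
"        self.weights = [w + d for w, d in zip(self.weights, deltas)]\n",
"\n",
"    # External interface: serve a prediction, echoing version and context for telemetry.\n",
"    def predict(self, request):\n",
"        value = sum(w * x for w, x in zip(self.weights, request.features))\n",
"        return {'prediction': value, 'model_version': self.version, 'context': request.context}\n",
"\n",
"service = LinearPredictionService()\n",
"service.push_model_state('v1', [1.0, 2.0])\n",
"first = service.predict(PredictionRequest([3.0, 4.0], {'request_id': 'r-1'}))\n",
"service.push_partial_update('v1.1', [0.5, 0.0])\n",
"second = service.predict(PredictionRequest([3.0, 4.0]))"
]
},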
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### Data pipeline setup:\n",
"- Research the data preparation required for each prediction model in isolation\n",
"- All pre-processing steps that _do not influence training outcome_ become part of pre-processing (e.g. sessionisation, basic enrichments). These outcomes can be reused across multiple models.\n",
"- All other pre-processing steps go into the training pipeline of individual models; _even if that means duplicating them_ (e.g. common feature extractions). For code reuse, use libraries if you must. But there will be a time when one model needs tweaks another one doesn't and then you'll end up duplicating the code anyway.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### Telemetry:\n",
"- Find out for each piece of feedback where it is available the earliest.\n",
"  - Figure out if enough context is available in those components as well or whether it needs to be passed in or retrieved.\n",
"- Decide which parts of the system you care to pollute with telemetry.\n",
"- Based on this decision, either pass through information or log into telemetry from where it's available.\n",
"- Define schemas for all telemetry messages.\n",
"\n",
 Based on your identified">
"> Your identified prediction models tell you whether you need any streaming data. If not, don't bother with it: just log telemetry data to files in some storage. It's simple to later extend the telemetry logging to also write to a stream, but operating something like Kafka is very time-consuming."
]
},
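{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### A sketch of file-based telemetry with a schema\n",
"\n",
"A minimal sketch of the simple route described on the previous slide: one explicit schema per telemetry message, appended to a file as JSON lines. The `PredictionEvent` fields, the helper names, and the file name are assumptions made up for illustration; a stream sink could later be added inside `log_event` without touching the callers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import tempfile\n",
"from dataclasses import asdict, dataclass\n",
"\n",
"@dataclass\n",
"class PredictionEvent:\n",
"    # Hypothetical schema for one telemetry message.\n",
"    event_id: str\n",
"    model_version: str\n",
"    prediction: float\n",
"    timestamp_ms: int\n",
"\n",
"def log_event(path, event):\n",
"    # Append one JSON document per line.\n",
"    with open(path, 'a') as f:\n",
"        print(json.dumps(asdict(event), sort_keys=True), file=f)\n",
"\n",
"def read_events(path):\n",
"    # Parsing back through the dataclass fails loudly on unexpected or missing fields.\n",
"    with open(path) as f:\n",
"        return [PredictionEvent(**json.loads(line)) for line in f]\n",
"\n",
"log_path = os.path.join(tempfile.mkdtemp(), 'prediction_events.jsonl')\n",
"log_event(log_path, PredictionEvent('e-1', 'v1', 0.25, 1544124780000))\n",
"log_event(log_path, PredictionEvent('e-2', 'v1', 0.75, 1544124781000))\n",
"events = read_events(log_path)"
]
},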
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### Optimisation:\n",
"- Caching is purely a concern of the prediction components (the training pipeline should not pre-warm any caches or such).\n",
"- Cutting back on training time by sharing intermediate training results between models is usually dangerous.\n",
"- Storage is really cheap: favour replication over cleverness."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### Monitoring\n",
"- Identify feedback cycles of your predictions that can be:\n",
"  - directly monitored\n",
"  - ideally _correlated with individual predictions_\n",
"- Use these to calculate and report residuals, and monitor the distribution of those residuals.\n",
"- Identify proxies for long feedback cycles. You need to know if something is wrong with a model in production before users start walking away.\n",
"- Monitor the distributions of you input data for quality issues."
]
},
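{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Your Big Design\n",
"\n",
"### A sketch of residuals from correlated feedback\n",
"\n",
"A small sketch of correlating feedback with individual predictions, assuming both carry a shared `prediction_id` (a hypothetical column name, as are `predicted` and `observed`). The numbers are made up. Predictions without feedback yet simply drop out of the join, so it helps to report coverage next to the residual statistics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy telemetry: what was predicted, and the feedback that has arrived so far.\n",
"predictions = pd.DataFrame({\n",
"    'prediction_id': ['p1', 'p2', 'p3', 'p4'],\n",
"    'predicted': [10.0, 12.0, 9.0, 20.0],\n",
"})\n",
"feedback = pd.DataFrame({\n",
"    'prediction_id': ['p1', 'p2', 'p4'],\n",
"    'observed': [11.0, 12.0, 14.0],\n",
"})\n",
"\n",
"# Correlate feedback with individual predictions, then compute residuals.\n",
"joined = predictions.merge(feedback, on='prediction_id', how='inner')\n",
"joined['residual'] = joined['observed'] - joined['predicted']\n",
"\n",
"summary = joined['residual'].describe()     # distribution to track over time\n",
"coverage = len(joined) / len(predictions)   # share of predictions with feedback"
]
},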
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# On Technology Choices\n",
"\n",
"- Use the platform that you are comfortable with\n",
"- No piece of technology is going to give you a order of magnitude productivity improvement\n",
"- Except managed services: if you are in a hurry try not to setup and run anything yourself\n",
"- Use this time for a proper design\n",
" - Implement this first at the expense of more sophisticated models\n",
" - Iterate on the models using this infrastructure"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The End\n",
"\n",
"#### Models are stateful\n",
"- Training produces model state\n",
"- Prediction depends on model state\n",
"- It helps to keep those concerns strictly separate\n",
"\n",
"#### Data management is a thing\n",
"- Take great care of pipeine orchestration\n",
"- Keep your training / modelling code clean of preparation (but not feature engineering)\n",
"- Add to ground truth before re-training or incrementally training\n",
"\n",
"#### Good telemetry is a fine art\n",
"- You will invariable find this the most difficult thing to keep \"nice and clean\"\n",
"- It's a cross cutting concern and interferes with everything\n",
"- It may slow down feature development"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Q&A"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}