@lukemerrick
Last active April 28, 2021 11:50
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"import fiddler as fdl\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# use the nicer plotting styles from seaborn\n",
"sns.set()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intro\n",
"This notebook assumes the models and data you want to use are already uploaded to Fiddler; see the previous notebook in this series for details on uploading. In particular, you need to have run notebook 1, which uploads the bikeshare example used here.\n",
"\n",
"In this notebook we run through a number of other Fiddler functionalities that have been integrated into the Python package. Unlike the previous notebook, there is not as much of a sequential flow to the steps demonstrated here.\n",
"\n",
"## Before you start: set up your API connection\n",
"\n",
"### Launch onebox or authenticate with a remote server\n",
"Before you can start working with a Fiddler-integrated Jupyter environment, you should set up access to a running instance of Fiddler.\n",
"\n",
"#### Onebox\n",
"In onebox, this means running the `start.sh` script to launch onebox locally.\n",
"\n",
"#### Cloud\n",
"For the cloud version of our product, this means looking up your authentication token in the [Fiddler settings dashboard](https://app.fiddler.ai/settings/credentials).\n",
"\n",
"### Create a FiddlerApi object\n",
"\n",
"To get your data and models into the Fiddler Engine, you'll need to connect using the API. The `FiddlerApi` object handles most of the nitty-gritty for you, so all you have to do is specify a few details about the Fiddler system you're connecting to."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# NOTE: typically the API url for your running instance of Fiddler will be \"https://api.fiddler.ai\" (or \"http://localhost:4100\" for onebox)\n",
"# however, use \"http://host.docker.internal:4100\" as our URL if Jupyter is running in a docker VM on the same macOS machine as onebox\n",
"url = 'http://host.docker.internal:4100'\n",
"\n",
"# see <Fiddler URL>/settings/credentials to find, create, or change this token\n",
"token = os.getenv('FIDDLER_API_TOKEN')\n",
"\n",
"# see <Fiddler URL>/settings/general to find this id (listed as \"Organization Name\")\n",
"org_id = 'onebox'\n",
"\n",
"fiddler_api = fdl.FiddlerApi(url=url, org_id=org_id, auth_token=token)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pulling data from Fiddler"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['imdb_rnn',\n",
" 'iris',\n",
" 'bank_churn',\n",
" '20news',\n",
" 'p2p_loans',\n",
" 'winequality',\n",
" 'bikeshare']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's see which datasets we have on Fiddler\n",
"fiddler_api.list_datasets()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatasetInfo:\n",
" display_name: Bikeshare Dataset\n",
" files: ['train.csv', 'test.csv']\n",
" columns:\n",
" column dtype count(possible_values)\n",
" 0 dteday STRING -\n",
" 1 season CATEGORY 4\n",
" 2 yr INTEGER -\n",
" 3 mnth INTEGER -\n",
" 4 hr INTEGER -\n",
" 5 holiday BOOLEAN -\n",
" 6 weekday INTEGER -\n",
" 7 workingday BOOLEAN -\n",
" 8 weathersit CATEGORY 7\n",
" 9 temp FLOAT -\n",
" 10 atemp FLOAT -\n",
" 11 hum FLOAT -\n",
" 12 windspeed FLOAT -\n",
" 13 casual INTEGER -\n",
" 14 registered INTEGER -\n",
" 15 cnt INTEGER -"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the info for any dataset can quickly and easily be fetched with the `get_dataset_info` method\n",
"bikeshare_dataset_info = fiddler_api.get_dataset_info('bikeshare')\n",
"bikeshare_dataset_info"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The bikeshare_dataset object is a <class 'dict'> with keys (['train', 'test'])\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>8326</th>\n",
" <th>6451</th>\n",
" <th>6429</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>dteday</th>\n",
" <td>2011-12-18</td>\n",
" <td>2011-10-01</td>\n",
" <td>2011-09-30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>season</th>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>yr</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mnth</th>\n",
" <td>12</td>\n",
" <td>10</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hr</th>\n",
" <td>15</td>\n",
" <td>10</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>holiday</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>weekday</th>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>workingday</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>weathersit</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>temp</th>\n",
" <td>0.32</td>\n",
" <td>0.4</td>\n",
" <td>0.64</td>\n",
" </tr>\n",
" <tr>\n",
" <th>atemp</th>\n",
" <td>0.303</td>\n",
" <td>0.4091</td>\n",
" <td>0.6212</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hum</th>\n",
" <td>0.45</td>\n",
" <td>0.76</td>\n",
" <td>0.57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>windspeed</th>\n",
" <td>0.2836</td>\n",
" <td>0.3582</td>\n",
" <td>0.194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>casual</th>\n",
" <td>23</td>\n",
" <td>21</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>registered</th>\n",
" <td>184</td>\n",
" <td>100</td>\n",
" <td>195</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cnt</th>\n",
" <td>207</td>\n",
" <td>121</td>\n",
" <td>254</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 8326 6451 6429\n",
"dteday 2011-12-18 2011-10-01 2011-09-30\n",
"season 4 4 4\n",
"yr 0 0 0\n",
"mnth 12 10 9\n",
"hr 15 10 12\n",
"holiday False False False\n",
"weekday 0 6 5\n",
"workingday False False True\n",
"weathersit 1 3 2\n",
"temp 0.32 0.4 0.64\n",
"atemp 0.303 0.4091 0.6212\n",
"hum 0.45 0.76 0.57\n",
"windspeed 0.2836 0.3582 0.194\n",
"casual 23 21 59\n",
"registered 184 100 195\n",
"cnt 207 121 254"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can also pull data from the dataset directly into Pandas\n",
"bikeshare_dataset = fiddler_api.get_dataset('bikeshare', max_rows=999_999)\n",
"print(f'The bikeshare_dataset object is a {type(bikeshare_dataset)} with keys ({list(bikeshare_dataset.keys())})')\n",
"\n",
"df_train = bikeshare_dataset['train']\n",
"df_test = bikeshare_dataset['test']\n",
"\n",
"# demo the data\n",
"df_train.sample(3, random_state=0).T"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f74890f93c8>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
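"# for example, let's draw a regression plot of the target (cnt) against the temperature feature (temp)\n",
"# x and y are passed by keyword because recent seaborn releases no longer accept them positionally\n",
"sns.regplot(x=df_train['temp'], y=df_train['cnt'], marker='.', scatter_kws=dict(alpha=0.1))"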
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the Fiddler model-builder feature\n",
"If you have data but haven't built a model yet, you can take advantage of the model-builder feature to whip up a model instantly so you can dive right into running explanations."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['dteday',\n",
" 'season',\n",
" 'yr',\n",
" 'mnth',\n",
" 'hr',\n",
" 'holiday',\n",
" 'weekday',\n",
" 'workingday',\n",
" 'weathersit',\n",
" 'temp',\n",
" 'atemp',\n",
" 'hum',\n",
" 'windspeed',\n",
" 'casual',\n",
" 'registered',\n",
" 'cnt']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bikeshare_dataset_info.get_column_names()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# fiddler_api.delete_model('bikeshare_forecasting', 'generated_bikeshare_model')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'created_files': {'package.py': 'Wrapper code to run the model on the Fiddler engine',\n",
" 'model.yaml': 'Model metadata and configuration',\n",
" 'data_processor.py': 'Data cleaning and feature engineering code',\n",
" '__init__.py': 'Empty file. Makes this model directory a python package so the Fiddler engine can run it properly.',\n",
" 'model.pkl': 'Serialized model artifact.',\n",
" 'processor.pkl': 'Serialized model artifact.',\n",
" 'training_features.pkl': 'Serialized training data',\n",
" 'train.py': 'Model training script',\n",
" 'training_targets.pkl': 'Serialized training data'},\n",
" 'project_name': 'bikeshare_forecasting',\n",
" 'model_name': 'generated_bikeshare_model'}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# NOTE: to avoid training on the whole dataset, we pass `train_splits`\n",
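"# drop the target (cnt), its two components (casual + registered = cnt), and the raw date string;\n",
"# a list comprehension keeps the feature order stable, unlike a set difference\n",
"excluded = {'casual', 'registered', 'cnt', 'dteday'}\n",
"features = [col for col in bikeshare_dataset_info.get_column_names() if col not in excluded]\n",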
"fiddler_api.create_model(project_id='bikeshare_forecasting', \n",
" dataset_id='bikeshare',\n",
" target='cnt',\n",
" features=features,\n",
" model_id='generated_bikeshare_model',\n",
" train_splits=['train'])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['knn_model', 'generated_bikeshare_model']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the new model shows up when we list the models in the bikeshare_forecasting project\n",
"fiddler_api.list_models(project_id='bikeshare_forecasting')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explanations in Jupyter\n",
"We also support basic integration of our explanation and prediction functionality in Jupyter. The `FiddlerApi` object is your friend here."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>predicted_cnt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>17.201259</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>13.841701</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8.990367</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.717645</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.031056</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>-2.928235</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>16.928393</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>35.049540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>47.441858</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>72.827217</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" predicted_cnt\n",
"0 17.201259\n",
"1 13.841701\n",
"2 8.990367\n",
"3 0.717645\n",
"4 4.031056\n",
"5 -2.928235\n",
"6 16.928393\n",
"7 35.049540\n",
"8 47.441858\n",
"9 72.827217"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# running some predictions on the generated model\n",
"fiddler_api.run_model(project_id='bikeshare_forecasting', model_id='generated_bikeshare_model', df=df_test.head(10))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>predicted_cnt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>29.084173</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>21.888947</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>17.315574</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>25.716166</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>26.971802</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25.599356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>28.150124</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>28.376281</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>34.132442</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>68.728265</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" predicted_cnt\n",
"0 29.084173\n",
"1 21.888947\n",
"2 17.315574\n",
"3 25.716166\n",
"4 26.971802\n",
"5 25.599356\n",
"6 28.150124\n",
"7 28.376281\n",
"8 34.132442\n",
"9 68.728265"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# compare against predictions on our kNN model\n",
"fiddler_api.run_model(project_id='bikeshare_forecasting', model_id='knn_model', df=df_test.head(10))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Run explanations on both models\n",
"selected_point = df_test.head(1)\n",
"ex_generated = fiddler_api.run_explanation(\n",
" project_id='bikeshare_forecasting',\n",
" model_id='generated_bikeshare_model', \n",
" df=selected_point, \n",
" dataset_id='bikeshare')\n",
"\n",
"ex_knn = fiddler_api.run_explanation(\n",
" project_id='bikeshare_forecasting',\n",
" model_id='knn_model', \n",
" df=selected_point, \n",
" dataset_id='bikeshare')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Top SHAP attributions on first row of bikeshare for generated model')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x432 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create a plot comparing attributions\n",
"fig = plt.figure(figsize=(12, 6))\n",
"comparison_table = pd.DataFrame({\n",
" 'Generated Model': pd.Series(ex_generated.attributions, index=ex_generated.inputs),\n",
" 'kNN Model': pd.Series(ex_knn.attributions, index=ex_knn.inputs)\n",
"})\n",
"comparison_table = comparison_table.loc[comparison_table['kNN Model'].abs().sort_values(ascending=False).index, :]\n",
"\n",
"melted_table = (comparison_table\n",
" .reset_index()\n",
" .rename(columns={'index': 'Feature'})\n",
" .melt(id_vars='Feature', \n",
" var_name='Model', \n",
" value_name='Attribution'))\n",
"sns.barplot(x='Attribution', y='Feature', hue='Model', data=melted_table)\n",
"\n",
"plt.title('Top SHAP attributions on first row of bikeshare for generated model')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion\n",
"As we have seen in this notebook, once data and models have been deployed to Fiddler, it becomes very easy to share the data, automatically train a model on Fiddler, and run explanations, all without leaving Jupyter."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}