{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark tsfresh performance (large)\n",
"\n",
"In this example we benchmark, \n",
" * A. applying tsfresh directly to extract features on a columnar pd.DataFrame (uncompressed CSV of 300 MB)\n",
"\n",
"with aggregating the DataFrame by user and date, to convert it to a labeled array (with xarray)\n",
" * B. followed by computing individual features manually with `_do_extraction_on_chunk`\n",
" * C. re-implementing a few metrics in a vectorized fashion and applying it on the xarray directly"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/datageek/anaconda2/envs/ts-env/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n",
" from pandas.core import datetools\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import xarray as xr\n",
"from tqdm import tqdm\n",
"from IPython.display import display\n",
"import dask.dataframe as dd\n",
"from dask.diagnostics import ProgressBar\n",
"\n",
"from tsfresh import extract_features\n",
"from tsfresh.feature_extraction.extraction import _do_extraction_on_chunk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we load a randomly generated dataset,"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[########################################] | 100% Completed | 2.1s\n"
]
}
],
"source": [
"with ProgressBar():\n",
"\n",
" df_raw = dd.read_parquet('synthetic_payment_data_subset.parq/',\n",
" engine='pyarrow')\n",
" df = df_raw.compute()\n",
" \n",
"df['uid'] = df['user_id']\n",
"df['t'] = df['date']\n",
"del df['user_id']\n",
"del df['date']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (14270836, 3)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>amount</th>\n",
" <th>uid</th>\n",
" <th>t</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-28.00</td>\n",
" <td>5b3ecda7b4f48aa7fad7ceb2ae6b11</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-7.99</td>\n",
" <td>020c7a57c3393ea13d6a0c30eee62e</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.79</td>\n",
" <td>f8618d79da85a037f52221517e6147</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.89</td>\n",
" <td>f8618d79da85a037f52221517e6147</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>103.00</td>\n",
" <td>5acbd7dac6c86bc773a5689b38489d</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" amount uid t\n",
"0 -28.00 5b3ecda7b4f48aa7fad7ceb2ae6b11 2013-01-01\n",
"1 -7.99 020c7a57c3393ea13d6a0c30eee62e 2013-01-01\n",
"2 1.79 f8618d79da85a037f52221517e6147 2013-01-01\n",
"3 0.89 f8618d79da85a037f52221517e6147 2013-01-01\n",
"4 103.00 5acbd7dac6c86bc773a5689b38489d 2013-01-01"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Shape:', df.shape)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"amount 32396\n",
"uid 7500\n",
"t 1462\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Evaluate conversion to xarray**\n",
"\n",
"Next we will aggregate by `uid` and `t` (time), and convert the DataFrame to an xarray.\n",
"\n",
"**Note:** here, the dates have a daily precision, but to do this properly we should use `pd.Grouper(freq=<some_freq>)`to aggregate with a specific time frequency. "
]
},
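{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of that frequency-based aggregation (not executed here): the weekly frequency `'W'` is an arbitrary example, the `df_weekly` / `X_weekly` names are only illustrative, and `t` is assumed to be a datetime column. Below we keep the simple daily grouping."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: aggregate with an explicit time frequency via pd.Grouper.\n",
"# 'W' (weekly) is an arbitrary example frequency; 't' must be datetime-typed.\n",
"df_weekly = df.groupby(['uid', pd.Grouper(key='t', freq='W')]).sum()\n",
"X_weekly = xr.Dataset.from_dataframe(df_weekly).fillna(0.0)"
]
},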
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions: (t: 1462, uid: 7500)\n",
"Coordinates:\n",
" * uid (uid) object '0006a9bd90e4ac5c5f3f92629f4724' ...\n",
" * t (t) datetime64[ns] 2013-01-01 2013-01-02 2013-01-03 2013-01-04 ...\n",
"Data variables:\n",
" amount (uid, t) float64 -64.76 -409.9 119.9 8.44 0.0 0.0 2.61 -1.78 ..."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.01 s, sys: 460 ms, total: 6.47 s\n",
"Wall time: 6.64 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"X = xr.Dataset.from_dataframe(df.groupby(['uid', 't']).sum()).fillna(0.0)\n",
"display(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A. Feature extraction with pandas.DataFrame input\n",
"\n",
"This is the default tsfresh approach"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Feature Extraction: 100%|██████████| 7500/7500 [00:01<00:00, 3959.26it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8.69 s, sys: 1.02 s, total: 9.71 s\n",
"Wall time: 10.8 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"fc_params = {'abs_energy': None, 'absolute_sum_of_changes': None}\n",
"\n",
"F_A = extract_features(df, column_id=\"uid\", column_sort=\"t\",\n",
" default_fc_parameters=fc_params,\n",
" disable_progressbar=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## B. Feature extraction with xarray input (tsfresh)\n",
"\n",
"We manually apply `_do_extraction_on_chunk` on the rows of the aggregated matrix."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5 s, sys: 120 ms, total: 5.12 s\n",
"Wall time: 5.18 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"res = []\n",
"for row in X['amount']:\n",
" idx = np.asscalar(row.coords['uid'].values)\n",
" \n",
" res_row = _do_extraction_on_chunk((idx, 'amount', pd.Series(row.values, index=X.coords['t'].values)),\n",
" fc_params, None)\n",
" res += res_row\n",
"\n",
"F_B = pd.DataFrame(res).groupby(['id', 'variable']).value.sum().unstack()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## C. Feature extraction with xarray input (vectorized)\n",
"\n",
"Here we reimplement a few features extraction functions that work directly on the whole xarray using vectorized numpy functions"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 104 ms, sys: 40 ms, total: 144 ms\n",
"Wall time: 163 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"\n",
"def abs_energy(X):\n",
" return xr.apply_ufunc(np.linalg.norm, X,\n",
" input_core_dims=[['t']],\n",
" kwargs={'ord': 2, 'axis': -1})**2\n",
"\n",
"\n",
"def absolute_sum_of_changes(X):\n",
" return np.abs(X.diff('t')).sum('t')\n",
"\n",
"\n",
"res = []\n",
"for name, func in [('abs_energy', abs_energy),\n",
" ('absolute_sum_of_changes', absolute_sum_of_changes)]:\n",
" y = func(X['amount'])\n",
" y.coords['variable'] = \"amount__\" + name\n",
" res.append(y)\n",
"F_C = xr.concat(res, dim='variable').to_dataframe()['amount'].unstack(0)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0006a9bd90e4ac5c5f3f92629f4724</th>\n",
" <td>7.425800e+07</td>\n",
" <td>352066.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000b94b2061238d0e9d05ed0a0697e</th>\n",
" <td>3.205806e+07</td>\n",
" <td>172919.06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000cf24a99a82534abfda1ee2c4861</th>\n",
" <td>7.605662e+07</td>\n",
" <td>356121.22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0028dbdac34e8bb1a030eaa5416899</th>\n",
" <td>2.747808e+07</td>\n",
" <td>157476.23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>002dead637214c6b7245d571bf3985</th>\n",
" <td>3.091495e+07</td>\n",
" <td>195512.50</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 7.425800e+07 \n",
"000b94b2061238d0e9d05ed0a0697e 3.205806e+07 \n",
"000cf24a99a82534abfda1ee2c4861 7.605662e+07 \n",
"0028dbdac34e8bb1a030eaa5416899 2.747808e+07 \n",
"002dead637214c6b7245d571bf3985 3.091495e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 352066.56 \n",
"000b94b2061238d0e9d05ed0a0697e 172919.06 \n",
"000cf24a99a82534abfda1ee2c4861 356121.22 \n",
"0028dbdac34e8bb1a030eaa5416899 157476.23 \n",
"002dead637214c6b7245d571bf3985 195512.50 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_A.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0006a9bd90e4ac5c5f3f92629f4724</th>\n",
" <td>7.328224e+07</td>\n",
" <td>243156.58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000b94b2061238d0e9d05ed0a0697e</th>\n",
" <td>3.065817e+07</td>\n",
" <td>140270.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000cf24a99a82534abfda1ee2c4861</th>\n",
" <td>6.928719e+07</td>\n",
" <td>248496.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0028dbdac34e8bb1a030eaa5416899</th>\n",
" <td>2.788298e+07</td>\n",
" <td>136965.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>002dead637214c6b7245d571bf3985</th>\n",
" <td>2.977928e+07</td>\n",
" <td>155954.97</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 7.328224e+07 \n",
"000b94b2061238d0e9d05ed0a0697e 3.065817e+07 \n",
"000cf24a99a82534abfda1ee2c4861 6.928719e+07 \n",
"0028dbdac34e8bb1a030eaa5416899 2.788298e+07 \n",
"002dead637214c6b7245d571bf3985 2.977928e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 243156.58 \n",
"000b94b2061238d0e9d05ed0a0697e 140270.77 \n",
"000cf24a99a82534abfda1ee2c4861 248496.93 \n",
"0028dbdac34e8bb1a030eaa5416899 136965.99 \n",
"002dead637214c6b7245d571bf3985 155954.97 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_B.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>uid</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0006a9bd90e4ac5c5f3f92629f4724</th>\n",
" <td>7.328224e+07</td>\n",
" <td>243156.58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000b94b2061238d0e9d05ed0a0697e</th>\n",
" <td>3.065817e+07</td>\n",
" <td>140270.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000cf24a99a82534abfda1ee2c4861</th>\n",
" <td>6.928719e+07</td>\n",
" <td>248496.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0028dbdac34e8bb1a030eaa5416899</th>\n",
" <td>2.788298e+07</td>\n",
" <td>136965.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>002dead637214c6b7245d571bf3985</th>\n",
" <td>2.977928e+07</td>\n",
" <td>155954.97</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"uid \n",
"0006a9bd90e4ac5c5f3f92629f4724 7.328224e+07 \n",
"000b94b2061238d0e9d05ed0a0697e 3.065817e+07 \n",
"000cf24a99a82534abfda1ee2c4861 6.928719e+07 \n",
"0028dbdac34e8bb1a030eaa5416899 2.788298e+07 \n",
"002dead637214c6b7245d571bf3985 2.977928e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"uid \n",
"0006a9bd90e4ac5c5f3f92629f4724 243156.58 \n",
"000b94b2061238d0e9d05ed0a0697e 140270.77 \n",
"000cf24a99a82534abfda1ee2c4861 248496.93 \n",
"0028dbdac34e8bb1a030eaa5416899 136965.99 \n",
"002dead637214c6b7245d571bf3985 155954.97 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_C.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that approaches B and C produce identical result"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"assert ((np.abs(F_B - F_C) / F_B) < 1e-9).values.all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"\n",
"The total run time for method A is ~200 ms. Comparing with the runtime of method B, it seems likely that ~100-150ms are spent on the feature extraction proper.\n",
"\n",
"\n",
"The cost of converting to xarray is ~100ms. If we use the vectorized implementations, computing the `abs_energy` and `absolute_sum_of_changes` is ~10x faster. \n",
"\n",
"On much a much larger dataset (i.e. 14M rows instead of 0.2 M rows) this same operations have the following run time,\n",
" * the conversion to xarray takes 6.6s\n",
" * method A: 10.8 s\n",
" * method B: 5.5 s\n",
" * method C: 160 ms\n",
" \n",
"so on purely on the feature extraction we seem to get an improvement of ~30x. Conversion to xarray has some fixed cost but when computing hundreds of features, it will be negligible with respect to running feature extraction in tsfresh."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}