{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark tsfresh performance (large)\n",
"\n",
"In this example we benchmark, \n",
" * A. applying tsfresh directly to extract features on a columnar pd.DataFrame (uncompressed CSV of 300 MB)\n",
"\n",
"with aggregating the DataFrame by user and date, to convert it to a labeled array (with xarray)\n",
" * B. followed by computing individual features manually with `_do_extraction_on_chunk`\n",
" * C. re-implementing a few metrics in a vectorized fashion and applying it on the xarray directly"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/datageek/anaconda2/envs/ts-env/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n",
" from pandas.core import datetools\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import xarray as xr\n",
"from tqdm import tqdm\n",
"from IPython.display import display\n",
"import dask.dataframe as dd\n",
"from dask.diagnostics import ProgressBar\n",
"\n",
"from tsfresh import extract_features\n",
"from tsfresh.feature_extraction.extraction import _do_extraction_on_chunk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we load a randomly generated dataset,"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[########################################] | 100% Completed | 2.1s\n"
]
}
],
"source": [
"with ProgressBar():\n",
"\n",
" df_raw = dd.read_parquet('synthetic_payment_data_subset.parq/',\n",
" engine='pyarrow')\n",
" df = df_raw.compute()\n",
" \n",
"df['uid'] = df['user_id']\n",
"df['t'] = df['date']\n",
"del df['user_id']\n",
"del df['date']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (14270836, 3)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>amount</th>\n",
" <th>uid</th>\n",
" <th>t</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-28.00</td>\n",
" <td>5b3ecda7b4f48aa7fad7ceb2ae6b11</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-7.99</td>\n",
" <td>020c7a57c3393ea13d6a0c30eee62e</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.79</td>\n",
" <td>f8618d79da85a037f52221517e6147</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.89</td>\n",
" <td>f8618d79da85a037f52221517e6147</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>103.00</td>\n",
" <td>5acbd7dac6c86bc773a5689b38489d</td>\n",
" <td>2013-01-01</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" amount uid t\n",
"0 -28.00 5b3ecda7b4f48aa7fad7ceb2ae6b11 2013-01-01\n",
"1 -7.99 020c7a57c3393ea13d6a0c30eee62e 2013-01-01\n",
"2 1.79 f8618d79da85a037f52221517e6147 2013-01-01\n",
"3 0.89 f8618d79da85a037f52221517e6147 2013-01-01\n",
"4 103.00 5acbd7dac6c86bc773a5689b38489d 2013-01-01"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Shape:', df.shape)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"amount 32396\n",
"uid 7500\n",
"t 1462\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Evaluate conversion to xarray**\n",
"\n",
"Next we will aggregate by `uid` and `t` (time), and convert the DataFrame to an xarray.\n",
"\n",
"**Note:** here, the dates have a daily precision, but to do this properly we should use `pd.Grouper(freq=<some_freq>)`to aggregate with a specific time frequency. "
]
},
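{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of that frequency-based aggregation (not executed here): the weekly frequency `'W'` is an arbitrary example, the `df_weekly` / `X_weekly` names are only illustrative, and `t` is assumed to be a datetime column. Below we keep the simple daily grouping."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: aggregate with an explicit time frequency via pd.Grouper.\n",
"# 'W' (weekly) is an arbitrary example frequency; 't' must be datetime-typed.\n",
"df_weekly = df.groupby(['uid', pd.Grouper(key='t', freq='W')]).sum()\n",
"X_weekly = xr.Dataset.from_dataframe(df_weekly).fillna(0.0)"
]
},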
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions: (t: 1462, uid: 7500)\n",
"Coordinates:\n",
" * uid (uid) object '0006a9bd90e4ac5c5f3f92629f4724' ...\n",
" * t (t) datetime64[ns] 2013-01-01 2013-01-02 2013-01-03 2013-01-04 ...\n",
"Data variables:\n",
" amount (uid, t) float64 -64.76 -409.9 119.9 8.44 0.0 0.0 2.61 -1.78 ..."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.01 s, sys: 460 ms, total: 6.47 s\n",
"Wall time: 6.64 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"X = xr.Dataset.from_dataframe(df.groupby(['uid', 't']).sum()).fillna(0.0)\n",
"display(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A. Feature extraction with pandas.DataFrame input\n",
"\n",
"This is the default tsfresh approach"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Feature Extraction: 100%|██████████| 7500/7500 [00:01<00:00, 3959.26it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8.69 s, sys: 1.02 s, total: 9.71 s\n",
"Wall time: 10.8 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"fc_params = {'abs_energy': None, 'absolute_sum_of_changes': None}\n",
"\n",
"F_A = extract_features(df, column_id=\"uid\", column_sort=\"t\",\n",
" default_fc_parameters=fc_params,\n",
" disable_progressbar=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## B. Feature extraction with xarray input (tsfresh)\n",
"\n",
"We manually apply `_do_extraction_on_chunk` on the rows of the aggregated matrix."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5 s, sys: 120 ms, total: 5.12 s\n",
"Wall time: 5.18 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"res = []\n",
"for row in X['amount']:\n",
" idx = np.asscalar(row.coords['uid'].values)\n",
" \n",
" res_row = _do_extraction_on_chunk((idx, 'amount', pd.Series(row.values, index=X.coords['t'].values)),\n",
" fc_params, None)\n",
" res += res_row\n",
"\n",
"F_B = pd.DataFrame(res).groupby(['id', 'variable']).value.sum().unstack()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## C. Feature extraction with xarray input (vectorized)\n",
"\n",
"Here we reimplement a few features extraction functions that work directly on the whole xarray using vectorized numpy functions"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 104 ms, sys: 40 ms, total: 144 ms\n",
"Wall time: 163 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"\n",
"def abs_energy(X):\n",
" return xr.apply_ufunc(np.linalg.norm, X,\n",
" input_core_dims=[['t']],\n",
" kwargs={'ord': 2, 'axis': -1})**2\n",
"\n",
"\n",
"def absolute_sum_of_changes(X):\n",
" return np.abs(X.diff('t')).sum('t')\n",
"\n",
"\n",
"res = []\n",
"for name, func in [('abs_energy', abs_energy),\n",
" ('absolute_sum_of_changes', absolute_sum_of_changes)]:\n",
" y = func(X['amount'])\n",
" y.coords['variable'] = \"amount__\" + name\n",
" res.append(y)\n",
"F_C = xr.concat(res, dim='variable').to_dataframe()['amount'].unstack(0)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0006a9bd90e4ac5c5f3f92629f4724</th>\n",
" <td>7.425800e+07</td>\n",
" <td>352066.56</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000b94b2061238d0e9d05ed0a0697e</th>\n",
" <td>3.205806e+07</td>\n",
" <td>172919.06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000cf24a99a82534abfda1ee2c4861</th>\n",
" <td>7.605662e+07</td>\n",
" <td>356121.22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0028dbdac34e8bb1a030eaa5416899</th>\n",
" <td>2.747808e+07</td>\n",
" <td>157476.23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>002dead637214c6b7245d571bf3985</th>\n",
" <td>3.091495e+07</td>\n",
" <td>195512.50</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 7.425800e+07 \n",
"000b94b2061238d0e9d05ed0a0697e 3.205806e+07 \n",
"000cf24a99a82534abfda1ee2c4861 7.605662e+07 \n",
"0028dbdac34e8bb1a030eaa5416899 2.747808e+07 \n",
"002dead637214c6b7245d571bf3985 3.091495e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 352066.56 \n",
"000b94b2061238d0e9d05ed0a0697e 172919.06 \n",
"000cf24a99a82534abfda1ee2c4861 356121.22 \n",
"0028dbdac34e8bb1a030eaa5416899 157476.23 \n",
"002dead637214c6b7245d571bf3985 195512.50 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_A.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0006a9bd90e4ac5c5f3f92629f4724</th>\n",
" <td>7.328224e+07</td>\n",
" <td>243156.58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000b94b2061238d0e9d05ed0a0697e</th>\n",
" <td>3.065817e+07</td>\n",
" <td>140270.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000cf24a99a82534abfda1ee2c4861</th>\n",
" <td>6.928719e+07</td>\n",
" <td>248496.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0028dbdac34e8bb1a030eaa5416899</th>\n",
" <td>2.788298e+07</td>\n",
" <td>136965.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>002dead637214c6b7245d571bf3985</th>\n",
" <td>2.977928e+07</td>\n",
" <td>155954.97</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 7.328224e+07 \n",
"000b94b2061238d0e9d05ed0a0697e 3.065817e+07 \n",
"000cf24a99a82534abfda1ee2c4861 6.928719e+07 \n",
"0028dbdac34e8bb1a030eaa5416899 2.788298e+07 \n",
"002dead637214c6b7245d571bf3985 2.977928e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"id \n",
"0006a9bd90e4ac5c5f3f92629f4724 243156.58 \n",
"000b94b2061238d0e9d05ed0a0697e 140270.77 \n",
"000cf24a99a82534abfda1ee2c4861 248496.93 \n",
"0028dbdac34e8bb1a030eaa5416899 136965.99 \n",
"002dead637214c6b7245d571bf3985 155954.97 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_B.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>uid</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0006a9bd90e4ac5c5f3f92629f4724</th>\n",
" <td>7.328224e+07</td>\n",
" <td>243156.58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000b94b2061238d0e9d05ed0a0697e</th>\n",
" <td>3.065817e+07</td>\n",
" <td>140270.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000cf24a99a82534abfda1ee2c4861</th>\n",
" <td>6.928719e+07</td>\n",
" <td>248496.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0028dbdac34e8bb1a030eaa5416899</th>\n",
" <td>2.788298e+07</td>\n",
" <td>136965.99</td>\n",
" </tr>\n",
" <tr>\n",
" <th>002dead637214c6b7245d571bf3985</th>\n",
" <td>2.977928e+07</td>\n",
" <td>155954.97</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"uid \n",
"0006a9bd90e4ac5c5f3f92629f4724 7.328224e+07 \n",
"000b94b2061238d0e9d05ed0a0697e 3.065817e+07 \n",
"000cf24a99a82534abfda1ee2c4861 6.928719e+07 \n",
"0028dbdac34e8bb1a030eaa5416899 2.788298e+07 \n",
"002dead637214c6b7245d571bf3985 2.977928e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"uid \n",
"0006a9bd90e4ac5c5f3f92629f4724 243156.58 \n",
"000b94b2061238d0e9d05ed0a0697e 140270.77 \n",
"000cf24a99a82534abfda1ee2c4861 248496.93 \n",
"0028dbdac34e8bb1a030eaa5416899 136965.99 \n",
"002dead637214c6b7245d571bf3985 155954.97 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_C.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that approaches B and C produce identical result"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"assert ((np.abs(F_B - F_C) / F_B) < 1e-9).values.all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"\n",
"The total run time for method A is ~200 ms. Comparing with the runtime of method B, it seems likely that ~100-150ms are spent on the feature extraction proper.\n",
"\n",
"\n",
"The cost of converting to xarray is ~100ms. If we use the vectorized implementations, computing the `abs_energy` and `absolute_sum_of_changes` is ~10x faster. \n",
"\n",
"On much a much larger dataset (i.e. 14M rows instead of 0.2 M rows) this same operations have the following run time,\n",
" * the conversion to xarray takes 6.6s\n",
" * method A: 10.8 s\n",
" * method B: 5.5 s\n",
" * method C: 160 ms\n",
" \n",
"so on purely on the feature extraction we seem to get an improvement of ~30x. Conversion to xarray has some fixed cost but when computing hundreds of features, it will be negligible with respect to running feature extraction in tsfresh."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}