{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark tsfresh performance\n",
"\n",
"In this example we benchmark, \n",
" * A. applying tsfresh directly to extract features on a columnar pd.DataFrame (uncompressed CSV of ~10MB)\n",
"\n",
"with aggregating the DataFrame by user and date, to convert it to a labeled array (with xarray)\n",
" * B. followed by computing individual features manually with `_do_extraction_on_chunk`\n",
" * C. re-implementing a few metrics in a vectorized fashion and applying it on the xarray directly"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/datageek/anaconda2/envs/ts-env/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n",
" from pandas.core import datetools\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import xarray as xr\n",
"from tqdm import tqdm\n",
"from IPython.display import display\n",
"\n",
"from tsfresh import extract_features\n",
"from tsfresh.feature_extraction.extraction import _do_extraction_on_chunk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we load a randomly generated dataset,"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://github.com/blue-yonder/tsfresh/files/1751897/sample_dataset.csv.gz')\n",
"df['t'] = pd.to_datetime(df.t)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (200000, 3)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>amount</th>\n",
" <th>t</th>\n",
" <th>uid</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-28.00</td>\n",
" <td>2013-01-01</td>\n",
" <td>5b3ecda7b4f48aa7fad7ceb2ae6b11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-7.99</td>\n",
" <td>2013-01-01</td>\n",
" <td>020c7a57c3393ea13d6a0c30eee62e</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.79</td>\n",
" <td>2013-01-01</td>\n",
" <td>f8618d79da85a037f52221517e6147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.89</td>\n",
" <td>2013-01-01</td>\n",
" <td>f8618d79da85a037f52221517e6147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>103.00</td>\n",
" <td>2013-01-01</td>\n",
" <td>5acbd7dac6c86bc773a5689b38489d</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" amount t uid\n",
"0 -28.00 2013-01-01 5b3ecda7b4f48aa7fad7ceb2ae6b11\n",
"1 -7.99 2013-01-01 020c7a57c3393ea13d6a0c30eee62e\n",
"2 1.79 2013-01-01 f8618d79da85a037f52221517e6147\n",
"3 0.89 2013-01-01 f8618d79da85a037f52221517e6147\n",
"4 103.00 2013-01-01 5acbd7dac6c86bc773a5689b38489d"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Shape:', df.shape)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"amount 24738\n",
"t 601\n",
"uid 250\n",
"dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Evaluate conversion to xarray**\n",
"\n",
"Next we will aggregate by `uid` and `t` (time), and convert the DataFrame to an xarray.\n",
"\n",
"**Note:** here, the dates have a daily precision, but to do this properly we should use `pd.Grouper(freq=<some_freq>)`to aggregate with a specific time frequency. "
]
},
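{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the `pd.Grouper` approach mentioned above (the weekly frequency `'W'` and the variable names are only illustrative, not what is used in the rest of this notebook):\n",
"\n",
"```python\n",
"# aggregate amounts per user and per week, then pivot to a labeled (uid, t) array\n",
"df_weekly = df.groupby(['uid', pd.Grouper(key='t', freq='W')]).sum()\n",
"X_weekly = xr.Dataset.from_dataframe(df_weekly).fillna(0.0)\n",
"```"
]
},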
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions: (t: 601, uid: 250)\n",
"Coordinates:\n",
" * uid (uid) object '00c052da3b927126b4553e96c4083c' ...\n",
" * t (t) datetime64[ns] 2013-01-01 2013-01-02 2013-01-03 2013-01-04 ...\n",
"Data variables:\n",
" amount (uid, t) float64 24.25 25.49 -1.98 254.7 9.94 0.0 202.2 -230.2 ..."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 84 ms, sys: 12 ms, total: 96 ms\n",
"Wall time: 96.7 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"X = xr.Dataset.from_dataframe(df.groupby(['uid', 't']).sum()).fillna(0.0)\n",
"display(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A. Feature extraction with pandas.DataFrame input\n",
"\n",
"This is the default tsfresh approach"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 156 ms, sys: 24 ms, total: 180 ms\n",
"Wall time: 235 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"fc_params = {'abs_energy': None, 'absolute_sum_of_changes': None}\n",
"\n",
"F_A = extract_features(df, column_id=\"uid\", column_sort=\"t\",\n",
" default_fc_parameters=fc_params,\n",
" disable_progressbar=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## B. Feature extraction with xarray input (tsfresh)\n",
"\n",
"We manually apply `_do_extraction_on_chunk` on the rows of the aggregated matrix."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 160 ms, sys: 12 ms, total: 172 ms\n",
"Wall time: 170 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"res = []\n",
"for row in X['amount']:\n",
" idx = np.asscalar(row.coords['uid'].values)\n",
" \n",
" res_row = _do_extraction_on_chunk((idx, 'amount', pd.Series(row.values, index=X.coords['t'].values)),\n",
" fc_params, None)\n",
" res += res_row\n",
"\n",
"F_B = pd.DataFrame(res).groupby(['id', 'variable']).value.sum().unstack()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## C. Feature extraction with xarray input (vectorized)\n",
"\n",
"Here we reimplement a few features extraction functions that work directly on the whole xarray using vectorized numpy functions"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 12 ms, sys: 4 ms, total: 16 ms\n",
"Wall time: 21.5 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"\n",
"def abs_energy(X):\n",
" return xr.apply_ufunc(np.linalg.norm, X,\n",
" input_core_dims=[['t']],\n",
" kwargs={'ord': 2, 'axis': -1})**2\n",
"\n",
"\n",
"def absolute_sum_of_changes(X):\n",
" return np.abs(X.diff('t')).sum('t')\n",
"\n",
"\n",
"res = []\n",
"for name, func in [('abs_energy', abs_energy),\n",
" ('absolute_sum_of_changes', absolute_sum_of_changes)]:\n",
" y = func(X['amount'])\n",
" y.coords['variable'] = \"amount__\" + name\n",
" res.append(y)\n",
"F_C = xr.concat(res, dim='variable').to_dataframe()['amount'].unstack(0)"
]
},
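{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same pattern extends to other features that reduce over the `t` dimension. A hedged sketch (these simple reductions are intended to mirror the corresponding tsfresh calculators, but the equivalence should be verified before relying on them):\n",
"\n",
"```python\n",
"def maximum(X):\n",
"    # largest observed value per uid\n",
"    return X.max('t')\n",
"\n",
"def mean(X):\n",
"    # average value per uid\n",
"    return X.mean('t')\n",
"\n",
"def standard_deviation(X):\n",
"    # population standard deviation (ddof=0) per uid\n",
"    return X.std('t')\n",
"```\n",
"\n",
"They could be appended to the `(name, func)` list above to build a wider feature matrix."
]
},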
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00c052da3b927126b4553e96c4083c</th>\n",
" <td>1.236324e+07</td>\n",
" <td>68395.37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0195b9bdf14913d11d95d294e1afd9</th>\n",
" <td>1.026265e+07</td>\n",
" <td>61450.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>01f06920099b3c97bbc5b7e6c0dcb4</th>\n",
" <td>2.046349e+07</td>\n",
" <td>105852.06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>020c7a57c3393ea13d6a0c30eee62e</th>\n",
" <td>1.015636e+07</td>\n",
" <td>49664.44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0227d2368280e2754ce05a15c32f5c</th>\n",
" <td>1.253965e+07</td>\n",
" <td>93912.65</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"id \n",
"00c052da3b927126b4553e96c4083c 1.236324e+07 \n",
"0195b9bdf14913d11d95d294e1afd9 1.026265e+07 \n",
"01f06920099b3c97bbc5b7e6c0dcb4 2.046349e+07 \n",
"020c7a57c3393ea13d6a0c30eee62e 1.015636e+07 \n",
"0227d2368280e2754ce05a15c32f5c 1.253965e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"id \n",
"00c052da3b927126b4553e96c4083c 68395.37 \n",
"0195b9bdf14913d11d95d294e1afd9 61450.25 \n",
"01f06920099b3c97bbc5b7e6c0dcb4 105852.06 \n",
"020c7a57c3393ea13d6a0c30eee62e 49664.44 \n",
"0227d2368280e2754ce05a15c32f5c 93912.65 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_A.head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00c052da3b927126b4553e96c4083c</th>\n",
" <td>1.387087e+07</td>\n",
" <td>57329.05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0195b9bdf14913d11d95d294e1afd9</th>\n",
" <td>1.143035e+07</td>\n",
" <td>56149.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>01f06920099b3c97bbc5b7e6c0dcb4</th>\n",
" <td>1.865184e+07</td>\n",
" <td>77300.61</td>\n",
" </tr>\n",
" <tr>\n",
" <th>020c7a57c3393ea13d6a0c30eee62e</th>\n",
" <td>9.911486e+06</td>\n",
" <td>45632.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0227d2368280e2754ce05a15c32f5c</th>\n",
" <td>1.201602e+07</td>\n",
" <td>67769.26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"id \n",
"00c052da3b927126b4553e96c4083c 1.387087e+07 \n",
"0195b9bdf14913d11d95d294e1afd9 1.143035e+07 \n",
"01f06920099b3c97bbc5b7e6c0dcb4 1.865184e+07 \n",
"020c7a57c3393ea13d6a0c30eee62e 9.911486e+06 \n",
"0227d2368280e2754ce05a15c32f5c 1.201602e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"id \n",
"00c052da3b927126b4553e96c4083c 57329.05 \n",
"0195b9bdf14913d11d95d294e1afd9 56149.97 \n",
"01f06920099b3c97bbc5b7e6c0dcb4 77300.61 \n",
"020c7a57c3393ea13d6a0c30eee62e 45632.03 \n",
"0227d2368280e2754ce05a15c32f5c 67769.26 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_B.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>variable</th>\n",
" <th>amount__abs_energy</th>\n",
" <th>amount__absolute_sum_of_changes</th>\n",
" </tr>\n",
" <tr>\n",
" <th>uid</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00c052da3b927126b4553e96c4083c</th>\n",
" <td>1.387087e+07</td>\n",
" <td>57329.05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0195b9bdf14913d11d95d294e1afd9</th>\n",
" <td>1.143035e+07</td>\n",
" <td>56149.97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>01f06920099b3c97bbc5b7e6c0dcb4</th>\n",
" <td>1.865184e+07</td>\n",
" <td>77300.61</td>\n",
" </tr>\n",
" <tr>\n",
" <th>020c7a57c3393ea13d6a0c30eee62e</th>\n",
" <td>9.911486e+06</td>\n",
" <td>45632.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0227d2368280e2754ce05a15c32f5c</th>\n",
" <td>1.201602e+07</td>\n",
" <td>67769.26</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"variable amount__abs_energy \\\n",
"uid \n",
"00c052da3b927126b4553e96c4083c 1.387087e+07 \n",
"0195b9bdf14913d11d95d294e1afd9 1.143035e+07 \n",
"01f06920099b3c97bbc5b7e6c0dcb4 1.865184e+07 \n",
"020c7a57c3393ea13d6a0c30eee62e 9.911486e+06 \n",
"0227d2368280e2754ce05a15c32f5c 1.201602e+07 \n",
"\n",
"variable amount__absolute_sum_of_changes \n",
"uid \n",
"00c052da3b927126b4553e96c4083c 57329.05 \n",
"0195b9bdf14913d11d95d294e1afd9 56149.97 \n",
"01f06920099b3c97bbc5b7e6c0dcb4 77300.61 \n",
"020c7a57c3393ea13d6a0c30eee62e 45632.03 \n",
"0227d2368280e2754ce05a15c32f5c 67769.26 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"F_C.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check that approaches B and C produce identical result"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"assert ((np.abs(F_B - F_C) / F_B) < 1e-9).values.all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"\n",
"The total run time for method A is ~200 ms. Comparing with the runtime of method B, it seems likely that ~100-150ms are spent on the feature extraction proper.\n",
"\n",
"\n",
"The cost of converting to xarray is ~100ms. If we use the vectorized implementations, computing the `abs_energy` and `absolute_sum_of_changes` is ~10x faster. \n",
"\n",
"On much a much larger dataset (i.e. 14M rows instead of 0.2 M rows) this same operations have the following run time,\n",
" * the conversion to xarray takes 6.6s\n",
" * method A: 10.8 s\n",
" * method B: 5.5 s\n",
" * method C: 160 ms\n",
" \n",
"so on purely on the feature extraction we seem to get an improvement of ~30x. Conversion to xarray has some fixed cost but when computing hundreds of features, it will be negligible with respect to running feature extraction in tsfresh."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}