@natashawatkins
Last active May 29, 2017 13:04
HDF5 Files with Pandas
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HDF5 Tables with Pandas\n",
"\n",
"HDF5 is a file format designed for storing very large and complex datasets\n",
"\n",
"In pandas, it essentially lets us store DataFrames, Series, etc. in a dictionary-like object\n",
"\n",
"#### Resources:\n",
"\n",
"* Here are the docs for HDF5 in pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables\n",
"\n",
"* This is a nice article for pandas with hdf5: http://glowingpython.blogspot.com.au/2014/08/quick-hdf5-with-pandas.html\n",
"\n",
"* This article looks at its speed in relation to pickle: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization\n",
"\n",
"Ideally, `pytables` > 3.2 should be used"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
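{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, for a single DataFrame the whole round trip can be done without managing an `HDFStore` object explicitly, via `DataFrame.to_hdf()` and `pd.read_hdf()` (a sketch with a toy DataFrame and a made-up filename `toy.h5`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write a toy DataFrame to HDF5 and read it straight back\n",
"toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})\n",
"toy.to_hdf('toy.h5', key='toy', mode='w')\n",
"pd.read_hdf('toy.h5', key='toy')"
]
},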
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an HDF5 file and open it for writing\n",
"\n",
"**Note**: by default (`mode='a'`), data is appended to an existing file. If you plan on re-running notebooks, I'd recommend setting `mode='w'` so the original file is overwritten and you don't build up old data. Just a suggestion..."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"my_hdf = pd.HDFStore('my_hdf.h5', mode='w')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.io.pytables.HDFStore"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(my_hdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import some DataFrames to store in HDF5 format"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = pd.read_stata('ace1.dta')\n",
"df2 = pd.read_stata('ace2.dta')\n",
"df4 = pd.read_stata('ace4.dta')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add the DataFrames to the HDF5 file using `.put()`\n",
"\n",
"The `key` is what we'll use to access each piece of data we store"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"my_hdf.put(key='df1', value=df1)\n",
"my_hdf.put(key='df2', value=df2)\n",
"my_hdf.put(key='df4', value=df4)"
]
},
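{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, `HDFStore` supports dictionary-style assignment, which stores the object in the default (`fixed`) format:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same as my_hdf.put(key='df1', value=df1)\n",
"my_hdf['df1'] = df1"
]
},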
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the contents of our HDF5 file, along with the type (i.e. frame, series) and shape of each stored object"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'pandas.io.pytables.HDFStore'>\n",
"File path: my_hdf.h5\n",
"/df1 frame (shape->[163,13])\n",
"/df2 frame (shape->[163,9]) \n",
"/df4 frame (shape->[163,10])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the keys available"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/df1', '/df2', '/df4']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrames can be accessed by key, using dictionary-like syntax"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>shortnam</th>\n",
" <th>euro1900</th>\n",
" <th>excolony</th>\n",
" <th>avexpr</th>\n",
" <th>logpgp95</th>\n",
" <th>cons1</th>\n",
" <th>cons90</th>\n",
" <th>democ00a</th>\n",
" <th>cons00a</th>\n",
" <th>extmort4</th>\n",
" <th>logem4</th>\n",
" <th>loghjypl</th>\n",
" <th>baseco</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AFG</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>93.699997</td>\n",
" <td>4.540098</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AGO</td>\n",
" <td>8.000000</td>\n",
" <td>1.0</td>\n",
" <td>5.363636</td>\n",
" <td>7.770645</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>280.000000</td>\n",
" <td>5.634789</td>\n",
" <td>-3.411248</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ARE</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>7.181818</td>\n",
" <td>9.804219</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ARG</td>\n",
" <td>60.000004</td>\n",
" <td>1.0</td>\n",
" <td>6.386364</td>\n",
" <td>9.133459</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>68.900002</td>\n",
" <td>4.232656</td>\n",
" <td>-0.872274</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ARM</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>7.682482</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \\\n",
"0 AFG 0.000000 1.0 NaN NaN 1.0 2.0 1.0 \n",
"1 AGO 8.000000 1.0 5.363636 7.770645 3.0 3.0 0.0 \n",
"2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN \n",
"3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 \n",
"4 ARM 0.000000 0.0 NaN 7.682482 NaN NaN NaN \n",
"\n",
" cons00a extmort4 logem4 loghjypl baseco \n",
"0 1.0 93.699997 4.540098 NaN NaN \n",
"1 1.0 280.000000 5.634789 -3.411248 1.0 \n",
"2 NaN NaN NaN NaN NaN \n",
"3 3.0 68.900002 4.232656 -0.872274 1.0 \n",
"4 NaN NaN NaN NaN NaN "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf['df1'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read data from the HDF5 file into a DataFrame using `pd.read_hdf()`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>shortnam</th>\n",
" <th>euro1900</th>\n",
" <th>excolony</th>\n",
" <th>avexpr</th>\n",
" <th>logpgp95</th>\n",
" <th>cons1</th>\n",
" <th>cons90</th>\n",
" <th>democ00a</th>\n",
" <th>cons00a</th>\n",
" <th>extmort4</th>\n",
" <th>logem4</th>\n",
" <th>loghjypl</th>\n",
" <th>baseco</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AFG</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>93.699997</td>\n",
" <td>4.540098</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AGO</td>\n",
" <td>8.000000</td>\n",
" <td>1.0</td>\n",
" <td>5.363636</td>\n",
" <td>7.770645</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>280.000000</td>\n",
" <td>5.634789</td>\n",
" <td>-3.411248</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ARE</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>7.181818</td>\n",
" <td>9.804219</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ARG</td>\n",
" <td>60.000004</td>\n",
" <td>1.0</td>\n",
" <td>6.386364</td>\n",
" <td>9.133459</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>68.900002</td>\n",
" <td>4.232656</td>\n",
" <td>-0.872274</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ARM</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>7.682482</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \\\n",
"0 AFG 0.000000 1.0 NaN NaN 1.0 2.0 1.0 \n",
"1 AGO 8.000000 1.0 5.363636 7.770645 3.0 3.0 0.0 \n",
"2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN \n",
"3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 \n",
"4 ARM 0.000000 0.0 NaN 7.682482 NaN NaN NaN \n",
"\n",
" cons00a extmort4 logem4 loghjypl baseco \n",
"0 1.0 93.699997 4.540098 NaN NaN \n",
"1 1.0 280.000000 5.634789 -3.411248 1.0 \n",
"2 NaN NaN NaN NaN NaN \n",
"3 3.0 68.900002 4.232656 -0.872274 1.0 \n",
"4 NaN NaN NaN NaN NaN "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_dataframe = pd.read_hdf('my_hdf.h5', key='df1')\n",
"my_dataframe.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(my_dataframe)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far we have been using the default `fixed` format to store our data. This format is fast to write and read, but less flexible.\n",
"\n",
"The more flexible `table` format, together with `data_columns=True`, allows us to subset the data with conditions on its columns"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>shortnam</th>\n",
" <th>euro1900</th>\n",
" <th>excolony</th>\n",
" <th>avexpr</th>\n",
" <th>logpgp95</th>\n",
" <th>cons1</th>\n",
" <th>cons90</th>\n",
" <th>democ00a</th>\n",
" <th>cons00a</th>\n",
" <th>extmort4</th>\n",
" <th>logem4</th>\n",
" <th>loghjypl</th>\n",
" <th>baseco</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ARE</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>7.181818</td>\n",
" <td>9.804219</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ARG</td>\n",
" <td>60.000004</td>\n",
" <td>1.0</td>\n",
" <td>6.386364</td>\n",
" <td>9.133459</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>68.900002</td>\n",
" <td>4.232656</td>\n",
" <td>-0.872274</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>AUS</td>\n",
" <td>98.000000</td>\n",
" <td>1.0</td>\n",
" <td>9.318182</td>\n",
" <td>9.897972</td>\n",
" <td>7.0</td>\n",
" <td>7.0</td>\n",
" <td>10.0</td>\n",
" <td>7.0</td>\n",
" <td>8.550000</td>\n",
" <td>2.145931</td>\n",
" <td>-0.170788</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>AUT</td>\n",
" <td>100.000000</td>\n",
" <td>0.0</td>\n",
" <td>9.727273</td>\n",
" <td>9.974877</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>-0.343900</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>BEL</td>\n",
" <td>100.000000</td>\n",
" <td>0.0</td>\n",
" <td>9.681818</td>\n",
" <td>9.992871</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>-0.179127</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \\\n",
"2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN \n",
"3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 \n",
"5 AUS 98.000000 1.0 9.318182 9.897972 7.0 7.0 10.0 \n",
"6 AUT 100.000000 0.0 9.727273 9.974877 NaN NaN NaN \n",
"9 BEL 100.000000 0.0 9.681818 9.992871 NaN NaN NaN \n",
"\n",
" cons00a extmort4 logem4 loghjypl baseco \n",
"2 NaN NaN NaN NaN NaN \n",
"3 3.0 68.900002 4.232656 -0.872274 1.0 \n",
"5 7.0 8.550000 2.145931 -0.170788 1.0 \n",
"6 NaN NaN NaN -0.343900 NaN \n",
"9 NaN NaN NaN -0.179127 NaN "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.put(key='df1_table', value=df1, format='table', data_columns=True)\n",
"pd.read_hdf('my_hdf.h5', key='df1_table', where=['logpgp95>8']).head()"
]
},
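{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the store is already open, `HDFStore.select()` runs the same kind of `where` query without re-opening the file (a quick sketch, using the `avexpr` column stored above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the open store directly; this requires format='table'\n",
"my_hdf.select('df1_table', where='avexpr > 7').head()"
]
},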
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keys can be organised hierarchically within the file, so objects can be stored and accessed in groups"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"my_hdf.put(key='data1/fixed/df1', value=df1)\n",
"my_hdf.put(key='data2/fixed/df2', value=df2)\n",
"my_hdf.put(key='data4/fixed/df4', value=df4)\n",
"my_hdf.put(key='data1/tables/df1', value=df1, format='table', data_columns=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see all those under `data1` as follows"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"/data1 (Group) ''\n",
" children := ['fixed' (Group), 'tables' (Group)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.root.data1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And those under `data1/fixed` ..."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"/data1/fixed (Group) ''\n",
" children := ['df1' (Group)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.root.data1.fixed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can delete everything under the `data1` node with `.remove()`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/df1', '/df1_table', '/df2', '/df4', '/data4/fixed/df4', '/data2/fixed/df2']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.remove(key='data1')\n",
"my_hdf.keys()"
]
},
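{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, a single key can also be removed with dictionary-style `del`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent to my_hdf.remove(key='df2'); shown as an example, not run here\n",
"del my_hdf['df2']\n",
"my_hdf.keys()"
]
},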
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we will close the file and stop writing to it"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'pandas.io.pytables.HDFStore'>\n",
"File path: my_hdf.h5\n",
"File is CLOSED"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.close()\n",
"my_hdf"
]
},
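{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid forgetting to call `.close()`, `HDFStore` can also be used as a context manager, which closes the file automatically on exit:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The file is closed automatically when the with-block exits\n",
"with pd.HDFStore('my_hdf.h5', mode='r') as store:\n",
"    df1_again = store['df1']"
]
},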
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}