@natashawatkins
Last active May 29, 2017 13:04
HDF5 Files with Pandas
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HDF5 Tables with Pandas\n",
"\n",
"HDF5 is a file format designed for storing very large and complex datasets\n",
"\n",
"In pandas, it essentially lets us store DataFrames, Series, etc. in a dictionary-like object\n",
"\n",
"#### Resources:\n",
"\n",
"* Here are the docs for HDF5 in pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables\n",
"\n",
"* This is a nice article for pandas with hdf5: http://glowingpython.blogspot.com.au/2014/08/quick-hdf5-with-pandas.html\n",
"\n",
"* This article looks at its speed in relation to pickle: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization\n",
"\n",
"Ideally, `pytables` > 3.2 should be used"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
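{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, for a single DataFrame the whole round trip can be done without managing an `HDFStore` object explicitly, via `DataFrame.to_hdf()` and `pd.read_hdf()` (a sketch with a toy DataFrame and a made-up filename `toy.h5`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write a toy DataFrame to HDF5 and read it straight back\n",
"toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})\n",
"toy.to_hdf('toy.h5', key='toy', mode='w')\n",
"pd.read_hdf('toy.h5', key='toy')"
]
},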
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an HDF5 file and open it for writing\n",
"\n",
"**Note**: by default (`mode='a'`), data is appended to an existing file. If you plan on re-running notebooks, I'd recommend setting `mode='w'` so the original file is overwritten and you don't build up old data. Just a suggestion..."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"my_hdf = pd.HDFStore('my_hdf.h5', mode='w')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.io.pytables.HDFStore"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(my_hdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import some DataFrames to store in HDF5 format"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = pd.read_stata('ace1.dta')\n",
"df2 = pd.read_stata('ace2.dta')\n",
"df4 = pd.read_stata('ace4.dta')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add the DataFrames to the HDF5 file using `.put()`\n",
"\n",
"The `key` is what we'll use to access each piece of data we store"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"my_hdf.put(key='df1', value=df1)\n",
"my_hdf.put(key='df2', value=df2)\n",
"my_hdf.put(key='df4', value=df4)"
]
},
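{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, `HDFStore` supports dictionary-style assignment, which stores the object in the default (`fixed`) format:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same as my_hdf.put(key='df1', value=df1)\n",
"my_hdf['df1'] = df1"
]
},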
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the contents of our HDF5 file, along with the type (i.e. frame, series) and shape of each stored object"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'pandas.io.pytables.HDFStore'>\n",
"File path: my_hdf.h5\n",
"/df1 frame (shape->[163,13])\n",
"/df2 frame (shape->[163,9]) \n",
"/df4 frame (shape->[163,10])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see the keys available"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/df1', '/df2', '/df4']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrames can be accessed by key, using dictionary-like syntax"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>shortnam</th>\n",
" <th>euro1900</th>\n",
" <th>excolony</th>\n",
" <th>avexpr</th>\n",
" <th>logpgp95</th>\n",
" <th>cons1</th>\n",
" <th>cons90</th>\n",
" <th>democ00a</th>\n",
" <th>cons00a</th>\n",
" <th>extmort4</th>\n",
" <th>logem4</th>\n",
" <th>loghjypl</th>\n",
" <th>baseco</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AFG</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>93.699997</td>\n",
" <td>4.540098</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AGO</td>\n",
" <td>8.000000</td>\n",
" <td>1.0</td>\n",
" <td>5.363636</td>\n",
" <td>7.770645</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>280.000000</td>\n",
" <td>5.634789</td>\n",
" <td>-3.411248</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ARE</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>7.181818</td>\n",
" <td>9.804219</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ARG</td>\n",
" <td>60.000004</td>\n",
" <td>1.0</td>\n",
" <td>6.386364</td>\n",
" <td>9.133459</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>68.900002</td>\n",
" <td>4.232656</td>\n",
" <td>-0.872274</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ARM</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>7.682482</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \\\n",
"0 AFG 0.000000 1.0 NaN NaN 1.0 2.0 1.0 \n",
"1 AGO 8.000000 1.0 5.363636 7.770645 3.0 3.0 0.0 \n",
"2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN \n",
"3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 \n",
"4 ARM 0.000000 0.0 NaN 7.682482 NaN NaN NaN \n",
"\n",
" cons00a extmort4 logem4 loghjypl baseco \n",
"0 1.0 93.699997 4.540098 NaN NaN \n",
"1 1.0 280.000000 5.634789 -3.411248 1.0 \n",
"2 NaN NaN NaN NaN NaN \n",
"3 3.0 68.900002 4.232656 -0.872274 1.0 \n",
"4 NaN NaN NaN NaN NaN "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf['df1'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read data from the HDF5 file into a DataFrame using `pd.read_hdf()`"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>shortnam</th>\n",
" <th>euro1900</th>\n",
" <th>excolony</th>\n",
" <th>avexpr</th>\n",
" <th>logpgp95</th>\n",
" <th>cons1</th>\n",
" <th>cons90</th>\n",
" <th>democ00a</th>\n",
" <th>cons00a</th>\n",
" <th>extmort4</th>\n",
" <th>logem4</th>\n",
" <th>loghjypl</th>\n",
" <th>baseco</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AFG</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>93.699997</td>\n",
" <td>4.540098</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AGO</td>\n",
" <td>8.000000</td>\n",
" <td>1.0</td>\n",
" <td>5.363636</td>\n",
" <td>7.770645</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>280.000000</td>\n",
" <td>5.634789</td>\n",
" <td>-3.411248</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ARE</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>7.181818</td>\n",
" <td>9.804219</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ARG</td>\n",
" <td>60.000004</td>\n",
" <td>1.0</td>\n",
" <td>6.386364</td>\n",
" <td>9.133459</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>68.900002</td>\n",
" <td>4.232656</td>\n",
" <td>-0.872274</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ARM</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>7.682482</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \\\n",
"0 AFG 0.000000 1.0 NaN NaN 1.0 2.0 1.0 \n",
"1 AGO 8.000000 1.0 5.363636 7.770645 3.0 3.0 0.0 \n",
"2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN \n",
"3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 \n",
"4 ARM 0.000000 0.0 NaN 7.682482 NaN NaN NaN \n",
"\n",
" cons00a extmort4 logem4 loghjypl baseco \n",
"0 1.0 93.699997 4.540098 NaN NaN \n",
"1 1.0 280.000000 5.634789 -3.411248 1.0 \n",
"2 NaN NaN NaN NaN NaN \n",
"3 3.0 68.900002 4.232656 -0.872274 1.0 \n",
"4 NaN NaN NaN NaN NaN "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_dataframe = pd.read_hdf('my_hdf.h5', key='df1')\n",
"my_dataframe.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(my_dataframe)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So far we have been using the default `fixed` format to store our data. This format is fast to write and read, but less flexible.\n",
"\n",
"The more flexible `table` format, together with `data_columns=True`, allows us to subset the data with conditions on its columns"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>shortnam</th>\n",
" <th>euro1900</th>\n",
" <th>excolony</th>\n",
" <th>avexpr</th>\n",
" <th>logpgp95</th>\n",
" <th>cons1</th>\n",
" <th>cons90</th>\n",
" <th>democ00a</th>\n",
" <th>cons00a</th>\n",
" <th>extmort4</th>\n",
" <th>logem4</th>\n",
" <th>loghjypl</th>\n",
" <th>baseco</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ARE</td>\n",
" <td>0.000000</td>\n",
" <td>1.0</td>\n",
" <td>7.181818</td>\n",
" <td>9.804219</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ARG</td>\n",
" <td>60.000004</td>\n",
" <td>1.0</td>\n",
" <td>6.386364</td>\n",
" <td>9.133459</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>68.900002</td>\n",
" <td>4.232656</td>\n",
" <td>-0.872274</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>AUS</td>\n",
" <td>98.000000</td>\n",
" <td>1.0</td>\n",
" <td>9.318182</td>\n",
" <td>9.897972</td>\n",
" <td>7.0</td>\n",
" <td>7.0</td>\n",
" <td>10.0</td>\n",
" <td>7.0</td>\n",
" <td>8.550000</td>\n",
" <td>2.145931</td>\n",
" <td>-0.170788</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>AUT</td>\n",
" <td>100.000000</td>\n",
" <td>0.0</td>\n",
" <td>9.727273</td>\n",
" <td>9.974877</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>-0.343900</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>BEL</td>\n",
" <td>100.000000</td>\n",
" <td>0.0</td>\n",
" <td>9.681818</td>\n",
" <td>9.992871</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>-0.179127</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" shortnam euro1900 excolony avexpr logpgp95 cons1 cons90 democ00a \\\n",
"2 ARE 0.000000 1.0 7.181818 9.804219 NaN NaN NaN \n",
"3 ARG 60.000004 1.0 6.386364 9.133459 1.0 6.0 3.0 \n",
"5 AUS 98.000000 1.0 9.318182 9.897972 7.0 7.0 10.0 \n",
"6 AUT 100.000000 0.0 9.727273 9.974877 NaN NaN NaN \n",
"9 BEL 100.000000 0.0 9.681818 9.992871 NaN NaN NaN \n",
"\n",
" cons00a extmort4 logem4 loghjypl baseco \n",
"2 NaN NaN NaN NaN NaN \n",
"3 3.0 68.900002 4.232656 -0.872274 1.0 \n",
"5 7.0 8.550000 2.145931 -0.170788 1.0 \n",
"6 NaN NaN NaN -0.343900 NaN \n",
"9 NaN NaN NaN -0.179127 NaN "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.put(key='df1_table', value=df1, format='table', data_columns=True)\n",
"pd.read_hdf('my_hdf.h5', key='df1_table', where=['logpgp95>8']).head()"
]
},
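{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the store is already open, `HDFStore.select()` runs the same kind of `where` query without re-opening the file (a quick sketch, using the `avexpr` column stored above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the open store directly; this requires format='table'\n",
"my_hdf.select('df1_table', where='avexpr > 7').head()"
]
},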
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Keys can be organised hierarchically within the file, so objects can be stored and accessed in groups"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"my_hdf.put(key='data1/fixed/df1', value=df1)\n",
"my_hdf.put(key='data2/fixed/df2', value=df2)\n",
"my_hdf.put(key='data4/fixed/df4', value=df4)\n",
"my_hdf.put(key='data1/tables/df1', value=df1, format='table', data_columns=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see all those under `data1` as follows"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"/data1 (Group) ''\n",
" children := ['fixed' (Group), 'tables' (Group)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.root.data1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And those under `data1/fixed` ..."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"/data1/fixed (Group) ''\n",
" children := ['df1' (Group)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.root.data1.fixed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can delete everything under the `data1` node with `.remove()`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/df1', '/df1_table', '/df2', '/df4', '/data4/fixed/df4', '/data2/fixed/df2']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.remove(key='data1')\n",
"my_hdf.keys()"
]
},
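{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, a single key can also be removed with dictionary-style `del`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent to my_hdf.remove(key='df2'); shown as an example, not run here\n",
"del my_hdf['df2']\n",
"my_hdf.keys()"
]
},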
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we will close the file and stop writing to it"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<class 'pandas.io.pytables.HDFStore'>\n",
"File path: my_hdf.h5\n",
"File is CLOSED"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_hdf.close()\n",
"my_hdf"
]
},
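{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid forgetting to call `.close()`, `HDFStore` can also be used as a context manager, which closes the file automatically on exit:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The file is closed automatically when the with-block exits\n",
"with pd.HDFStore('my_hdf.h5', mode='r') as store:\n",
"    df1_again = store['df1']"
]
},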
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}