@chrisburr
Created November 22, 2016 09:17
Pytables traversal performance
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading keys from HDF files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TLDR\n",
"Pytables is optimised for the use case where the file is traversed many times. As a result it takes a significant performance hit for first traversal."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import h5py\n",
"import tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make an example file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if os.path.isfile('test.h5'):\n",
" os.remove('test.h5')\n",
"for i in range(10):\n",
" for j in range(10):\n",
" df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})\n",
" df.to_hdf('test.h5', key='a{i}/b{j}'.format(i=i, j=j), format='fixed')"
]
},
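{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file now holds 100 small fixed-format frames under two-level groups (`a0/b0` through `a9/b9`). As an unexecuted sketch (the variable name `layout` is just illustrative), the resulting group layout can be listed directly with PyTables' `walk_groups`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: list every group created above using PyTables directly\n",
"with tables.open_file('test.h5', mode='r') as f:\n",
"    layout = [g._v_pathname for g in f.walk_groups()]\n",
"print(len(layout), layout[:5])"
]
},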
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pytables caching options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### In the case the cache is too small so every call to `keys` takes a long time"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 529 ms, sys: 11 ms, total: 540 ms\n",
"Wall time: 539 ms\n",
"*** Second access ***\n",
"CPU times: user 524 ms, sys: 0 ns, total: 524 ms\n",
"Wall time: 524 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5') as f:\n",
" print('*** First access ***')\n",
" %time pd_keys = f.keys()\n",
" print('*** Second access ***')\n",
" %time pd_keys = f.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### If we use a bigger LRU cache subsiquent calls are faster"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 568 ms, sys: 9 ms, total: 577 ms\n",
"Wall time: 576 ms\n",
"*** Second access ***\n",
"CPU times: user 67 ms, sys: 0 ns, total: 67 ms\n",
"Wall time: 67.3 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=1000) as f:\n",
" print('*** First access ***')\n",
" %time pd_keys = f.keys()\n",
" print('*** Second access ***')\n",
" %time pd_keys = f.keys()"
]
},
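{
"cell_type": "markdown",
"metadata": {},
"source": [
"`NODE_CACHE_SLOTS` is a PyTables parameter (defined in `tables/parameters.py`) that `pd.HDFStore` forwards to `tables.open_file`, so the same setting can be used when opening the file with PyTables directly. A minimal, unexecuted sketch (`pt_keys` is just an illustrative name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: pass the caching parameter straight to tables.open_file\n",
"with tables.open_file('test.h5', mode='r', NODE_CACHE_SLOTS=1000) as f:\n",
"    pt_keys = [g._v_pathname for g in f.walk_groups()]"
]
},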
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using no cache is a little faster than the LRU cache, but it's badly optimised for this use case"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 536 ms, sys: 3 ms, total: 539 ms\n",
"Wall time: 538 ms\n",
"*** Second access ***\n",
"CPU times: user 527 ms, sys: 999 µs, total: 528 ms\n",
"Wall time: 526 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=0) as f:\n",
" print('*** First access ***')\n",
" %time pd_keys = f.keys()\n",
" print('*** Second access ***')\n",
" %time pd_keys = f.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using an unlimted sized cache makes everything faster, especially on subsiqent calls"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First access took\n",
"CPU times: user 534 ms, sys: 1 ms, total: 535 ms\n",
"Wall time: 535 ms\n",
"Second access took\n",
"CPU times: user 37 ms, sys: 0 ns, total: 37 ms\n",
"Wall time: 37.4 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=-1000) as f:\n",
" print('First access took')\n",
" %time pd_keys = f.keys()\n",
" print('Second access took')\n",
" %time pd_keys = f.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### If we monkey patch `_NoCache` to be an unlimited size dictionary its slightly faster again (and could definitely be optimised)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First access took\n",
"CPU times: user 517 ms, sys: 3 ms, total: 520 ms\n",
"Wall time: 519 ms\n",
"Second access took\n",
"CPU times: user 34 ms, sys: 0 ns, total: 34 ms\n",
"Wall time: 34.7 ms\n"
]
}
],
"source": [
"old_NoCache = tables.file._NoCache\n",
"\n",
"tables.file._NoCache = dict\n",
"\n",
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=0) as f:\n",
" print('First access took')\n",
" %time pd_keys = f.keys()\n",
" print('Second access took')\n",
" %time pd_keys = f.keys()\n",
"\n",
"tables.file._NoCache = old_NoCache"
]
},
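{
"cell_type": "markdown",
"metadata": {},
"source": [
"Restoring `_NoCache` by hand works, but the restore is skipped if an exception is raised inside the block. As an alternative sketch, `unittest.mock.patch.object` from the standard library scopes the patch so it is always undone (`keys_patched` is just an illustrative name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from unittest import mock\n",
"\n",
"# Sketch: scope the _NoCache patch so it is undone even on error\n",
"with mock.patch.object(tables.file, '_NoCache', dict):\n",
"    with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=0) as f:\n",
"        keys_patched = f.keys()"
]
},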
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using h5py much is faster than pytables, provided pytables hasn't cached the result"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 80 ms, sys: 999 µs, total: 81 ms\n",
"Wall time: 80.8 ms\n",
"*** Second access ***\n",
"CPU times: user 74 ms, sys: 0 ns, total: 74 ms\n",
"Wall time: 72.9 ms\n"
]
}
],
"source": [
"def get_keys(key, obj):\n",
" # If the object corresponding to this key is a `Dataset` and the key ends in\n",
" # table then the pytables key is contained within `key` as follows:\n",
" # key = \"/{pytables_key}/table\"\n",
" if isinstance(obj, h5py._hl.dataset.Dataset) and key.split('/')[-1] in ['table', 'pandas_type']:\n",
" get_keys.result.append('/'.join([''] + key.split('/')[:-1]))\n",
"\n",
"with h5py.File('test.h5', mode='r') as h5py_f:\n",
" print('*** First access ***')\n",
" get_keys.result = []\n",
" %time h5py_f.visititems(get_keys)\n",
" print('*** Second access ***')\n",
" get_keys.result = []\n",
" %time h5py_f.visititems(get_keys)\n",
"h5_keys = get_keys.result"
]
},
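{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both listings should, in principle, contain the same paths (`HDFStore.keys()` returns them with a leading `/`, as does `get_keys` above). A quick, unexecuted sketch to compare them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: compare the two key listings\n",
"print(len(pd_keys), sorted(pd_keys)[:5])\n",
"print(len(h5_keys), sorted(h5_keys)[:5])\n",
"print(sorted(pd_keys) == sorted(h5_keys))"
]
},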
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:analysis]",
"language": "python",
"name": "conda-env-analysis-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}