@chrisburr
Created November 22, 2016 09:17
Pytables traversal performance
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading keys from HDF files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TLDR\n",
"Pytables is optimised for the use case where the file is traversed many times. As a result it takes a significant performance hit for first traversal."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import h5py\n",
"import tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make an example file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"if os.path.isfile('test.h5'):\n",
" os.remove('test.h5')\n",
"for i in range(10):\n",
" for j in range(10):\n",
" df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})\n",
" df.to_hdf('test.h5', key='a{i}/b{j}'.format(i=i, j=j), format='fixed')"
]
},
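{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file now holds 100 small fixed-format frames under two-level groups (`a0/b0` through `a9/b9`). As an unexecuted sketch (the variable name `layout` is just illustrative), the resulting group layout can be listed directly with PyTables' `walk_groups`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: list every group created above using PyTables directly\n",
"with tables.open_file('test.h5', mode='r') as f:\n",
"    layout = [g._v_pathname for g in f.walk_groups()]\n",
"print(len(layout), layout[:5])"
]
},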
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pytables caching options"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### In the case the cache is too small so every call to `keys` takes a long time"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 529 ms, sys: 11 ms, total: 540 ms\n",
"Wall time: 539 ms\n",
"*** Second access ***\n",
"CPU times: user 524 ms, sys: 0 ns, total: 524 ms\n",
"Wall time: 524 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5') as f:\n",
" print('*** First access ***')\n",
" %time pd_keys = f.keys()\n",
" print('*** Second access ***')\n",
" %time pd_keys = f.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### If we use a bigger LRU cache subsiquent calls are faster"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 568 ms, sys: 9 ms, total: 577 ms\n",
"Wall time: 576 ms\n",
"*** Second access ***\n",
"CPU times: user 67 ms, sys: 0 ns, total: 67 ms\n",
"Wall time: 67.3 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=1000) as f:\n",
" print('*** First access ***')\n",
" %time pd_keys = f.keys()\n",
" print('*** Second access ***')\n",
" %time pd_keys = f.keys()"
]
},
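{
"cell_type": "markdown",
"metadata": {},
"source": [
"`NODE_CACHE_SLOTS` is a PyTables parameter (defined in `tables/parameters.py`) that `pd.HDFStore` forwards to `tables.open_file`, so the same setting can be used when opening the file with PyTables directly. A minimal, unexecuted sketch (`pt_keys` is just an illustrative name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: pass the caching parameter straight to tables.open_file\n",
"with tables.open_file('test.h5', mode='r', NODE_CACHE_SLOTS=1000) as f:\n",
"    pt_keys = [g._v_pathname for g in f.walk_groups()]"
]
},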
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using no cache is a little faster than the LRU cache, but it's badly optimised for this use case"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 536 ms, sys: 3 ms, total: 539 ms\n",
"Wall time: 538 ms\n",
"*** Second access ***\n",
"CPU times: user 527 ms, sys: 999 µs, total: 528 ms\n",
"Wall time: 526 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=0) as f:\n",
" print('*** First access ***')\n",
" %time pd_keys = f.keys()\n",
" print('*** Second access ***')\n",
" %time pd_keys = f.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using an unlimted sized cache makes everything faster, especially on subsiqent calls"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First access took\n",
"CPU times: user 534 ms, sys: 1 ms, total: 535 ms\n",
"Wall time: 535 ms\n",
"Second access took\n",
"CPU times: user 37 ms, sys: 0 ns, total: 37 ms\n",
"Wall time: 37.4 ms\n"
]
}
],
"source": [
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=-1000) as f:\n",
" print('First access took')\n",
" %time pd_keys = f.keys()\n",
" print('Second access took')\n",
" %time pd_keys = f.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### If we monkey patch `_NoCache` to be an unlimited size dictionary its slightly faster again (and could definitely be optimised)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First access took\n",
"CPU times: user 517 ms, sys: 3 ms, total: 520 ms\n",
"Wall time: 519 ms\n",
"Second access took\n",
"CPU times: user 34 ms, sys: 0 ns, total: 34 ms\n",
"Wall time: 34.7 ms\n"
]
}
],
"source": [
"old_NoCache = tables.file._NoCache\n",
"\n",
"tables.file._NoCache = dict\n",
"\n",
"with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=0) as f:\n",
" print('First access took')\n",
" %time pd_keys = f.keys()\n",
" print('Second access took')\n",
" %time pd_keys = f.keys()\n",
"\n",
"tables.file._NoCache = old_NoCache"
]
},
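{
"cell_type": "markdown",
"metadata": {},
"source": [
"Restoring `_NoCache` by hand works, but the restore is skipped if an exception is raised inside the block. As an alternative sketch, `unittest.mock.patch.object` from the standard library scopes the patch so it is always undone (`keys_patched` is just an illustrative name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from unittest import mock\n",
"\n",
"# Sketch: scope the _NoCache patch so it is undone even on error\n",
"with mock.patch.object(tables.file, '_NoCache', dict):\n",
"    with pd.HDFStore('test.h5', NODE_CACHE_SLOTS=0) as f:\n",
"        keys_patched = f.keys()"
]
},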
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using h5py much is faster than pytables, provided pytables hasn't cached the result"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"*** First access ***\n",
"CPU times: user 80 ms, sys: 999 µs, total: 81 ms\n",
"Wall time: 80.8 ms\n",
"*** Second access ***\n",
"CPU times: user 74 ms, sys: 0 ns, total: 74 ms\n",
"Wall time: 72.9 ms\n"
]
}
],
"source": [
"def get_keys(key, obj):\n",
" # If the object corresponding to this key is a `Dataset` and the key ends in\n",
" # table then the pytables key is contained within `key` as follows:\n",
" # key = \"/{pytables_key}/table\"\n",
" if isinstance(obj, h5py._hl.dataset.Dataset) and key.split('/')[-1] in ['table', 'pandas_type']:\n",
" get_keys.result.append('/'.join([''] + key.split('/')[:-1]))\n",
"\n",
"with h5py.File('test.h5', mode='r') as h5py_f:\n",
" print('*** First access ***')\n",
" get_keys.result = []\n",
" %time h5py_f.visititems(get_keys)\n",
" print('*** Second access ***')\n",
" get_keys.result = []\n",
" %time h5py_f.visititems(get_keys)\n",
"h5_keys = get_keys.result"
]
},
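{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both listings should, in principle, contain the same paths (`HDFStore.keys()` returns them with a leading `/`, as does `get_keys` above). A quick, unexecuted sketch to compare them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: compare the two key listings\n",
"print(len(pd_keys), sorted(pd_keys)[:5])\n",
"print(len(h5_keys), sorted(h5_keys)[:5])\n",
"print(sorted(pd_keys) == sorted(h5_keys))"
]
},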
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:analysis]",
"language": "python",
"name": "conda-env-analysis-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}