@tacaswell
Last active September 18, 2019 14:26
Slides for h5py update
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# h5py update\n",
"\n",
"## HDF5 European Workshop for Science and Industry\n",
"## ESRF, Grenoble, 2019-09-18"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# History, Particurals & Usage\n",
"\n",
" - Started in 2008 by Andrew Collette\n",
" - Now maintained by community\n",
" - https://github.com/h5py/h5py\n",
" - https://h5py.readthedocs.io/en/stable/\n",
" - 129th most downlodaded package on pypi (mostil CI machines)\n",
" - used by keras / tensorflow"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Basic Philosophy\n",
"\n",
" - Provides a \"pythonic\" wrapping of `libhdf5`\n",
" - less opnionated about use cases than `pytables`\n",
" - less tuned that `pytables`\n",
" \n",
"## Core Analogies\n",
"\n",
"- `dict` <-> {`h5py.File`, `h5py.Group`}\n",
" - `g['key']` access to children (groups or datasets)\n",
"- `np.array` <-> `h5py.Dataset`\n",
" - `Dateset` object support array protocol, slicing\n",
" - only pulls data from disk on demand"
]
},
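{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal sketch of the dict-style API; the file name `analogy_demo.h5` is a placeholder:\n",
"\n",
"```python\n",
"import h5py\n",
"import numpy as np\n",
"\n",
"with h5py.File('analogy_demo.h5', 'w') as f:\n",
"    f['data'] = [0, 1, 2]\n",
"    f['nested/twoD'] = np.ones((2, 2))\n",
"\n",
"with h5py.File('analogy_demo.h5', 'r') as f:\n",
"    print(list(f.keys()))   # child names, like a dict\n",
"    print('data' in f)      # membership tests work too\n",
"    # recursively visit every group and dataset\n",
"    f.visititems(lambda name, obj: print(name, obj))\n",
"```"
]
},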
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Write some data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np\n",
"\n",
"with h5py.File('example.h5', 'w') as fout:\n",
" # do the right thing in simple cases\n",
" fout['data'] = [0, 1, 2, 3, 4]\n",
" fout['nested/twoD'] = np.array([[1, 2], [3, 4]])\n",
" # method provides access to all of the dataset creation knobs\n",
" fout.create_dataset('data_B', \n",
" data=np.arange(10).reshape(2, 5),\n",
" chunks=(1, 5))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Read some data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 file \"example.h5\" (mode r)>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fin = h5py.File('example.h5', 'r')\n",
"# the File object\n",
"fin"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 group \"/\" (3 members)>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# root group\n",
"fin['/']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['data', 'data_B', 'nested']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(fin['/'])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 dataset \"data\": shape (5,), type \"<i8\">"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# a Dateset, has not read any data yet\n",
"fin['data']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### numpy-stlye slicing on datasets"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pull data from disk to an array\n",
"fin['data'][:]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pull part of the dataset\n",
"fin['data'][1:3]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 3],\n",
" [6, 8]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# handles numpy-style strided ND slicing\n",
"fin['data_B'][:, 1::2]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 3, 4])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fancy slicing\n",
"fin['data'][[0, 3, 4]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Accessing Nested Groups/Datasets"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 dataset \"twoD\": shape (2, 2), type \"<i8\">"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# acess nested groups / datasets via repeated []\n",
"fin['nested']['twoD']"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 dataset \"twoD\": shape (2, 2), type \"<i8\">"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Or use file-path like access\n",
"fin['nested/twoD']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Close the file"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Closed HDF5 file>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# if not using a context manager, remember to clean up!\n",
"fin.close()\n",
"fin"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# New is h5py 2.8\n",
"\n",
" - register new file drivers\n",
" - track object creation order\n",
" - lots of bug fixes!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# New in `h5py` 2.9\n",
"\n",
" - high-level API for creating virtual datasets\n",
" - passing in python \"file-like\" objects to `h5py.File`\n",
" - control chunk cache when creating `h5py.File`\n",
" - `create_dataset_like` method\n",
" - track creation order of attributes\n",
" - bug fixes!"
]
},
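{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A short sketch of two of these features, `create_dataset_like` and the chunk-cache knobs; the file name `like_demo.h5` is a placeholder:\n",
"\n",
"```python\n",
"import h5py\n",
"\n",
"with h5py.File('like_demo.h5', 'w') as f:\n",
"    src = f.create_dataset('a', shape=(100,), dtype='f4',\n",
"                           chunks=(10,), compression='gzip')\n",
"    # copy shape / dtype / chunks / filters from an existing dataset\n",
"    f.create_dataset_like('b', src)\n",
"\n",
"# tune the per-file chunk cache: total bytes, hash slots, eviction policy\n",
"with h5py.File('like_demo.h5', 'r', rdcc_nbytes=16 * 1024**2,\n",
"               rdcc_nslots=100003, rdcc_w0=0.75) as f:\n",
"    print(f['b'].chunks)\n",
"```"
]
},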
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## High level API for Virtual Datasets \n",
"\n",
"\n",
"- Work stared by Aaron Parsons at DLS\n",
"- continued by Thomas Caswell at NSLS-II\n",
"- finished by Thomas Kluyver at EuXFEL\n",
"\n",
"\n",
"low-level API has been availble from h5py 2.6"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Create some data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# create some sample data\n",
"data = np.arange(0, 100).reshape(1, 100) + np.arange(1, 5).reshape(4, 1)\n",
"\n",
"# Create source files (0.h5 to 3.h5)\n",
"for n in range(4):\n",
" with h5py.File(f\"{n}.h5\", \"w\") as f:\n",
" d = f.create_dataset(\"data\", (100,), \"i4\", data[n])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Create the Virtual Dataset"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# Assemble virtual dataset\n",
"layout = h5py.VirtualLayout(shape=(4, 100), dtype=\"i4\")\n",
"for n in range(4):\n",
" layout[n] = h5py.VirtualSource(f\"{n}.h5\", \"data\", shape=(100,))\n",
"\n",
"# Add virtual dataset to output file\n",
"with h5py.File(\"VDS.h5\", \"w\", libver=\"latest\") as f:\n",
" # the virtual dataset\n",
" f.create_virtual_dataset(\"data_A\", layout, fillvalue=-5)\n",
" # normal dataset with identical values\n",
" f.create_dataset(\"data_B\", data=data, dtype='i4')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Read it back"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Virtual dataset: <HDF5 dataset \"data_A\": shape (4, 100), type \"<i4\">\n",
"[[ 1 11 21 31 41 51 61 71 81 91]\n",
" [ 2 12 22 32 42 52 62 72 82 92]\n",
" [ 3 13 23 33 43 53 63 73 83 93]\n",
" [ 4 14 24 34 44 54 64 74 84 94]]\n",
"Normal dataset : <HDF5 dataset \"data_B\": shape (4, 100), type \"<i4\">\n",
"[[ 1 11 21 31 41 51 61 71 81 91]\n",
" [ 2 12 22 32 42 52 62 72 82 92]\n",
" [ 3 13 23 33 43 53 63 73 83 93]\n",
" [ 4 14 24 34 44 54 64 74 84 94]]\n"
]
}
],
"source": [
"# read data back\n",
"# virtual dataset is transparent for reader!\n",
"with h5py.File(\"VDS.h5\", \"r\") as f:\n",
" print(f\"Virtual dataset: {f['data_A']}\")\n",
" print(f[\"data_A\"][:, ::10])\n",
" print(f\"Normal dataset : {f['data_B']}\")\n",
" print(f[\"data_B\"][:, ::10])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Pass Python file-like objects to `h5py.File`\n",
"\n",
" - contributed by Andrey Paramonov (Андрей Парамонов)\n",
" - can pass in object returned by `open` or a `BytesIO` object"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Creat a `BtyesIO` object and write data to it"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from io import BytesIO\n",
"\n",
"obj = BytesIO()\n",
"with h5py.File(obj, 'w') as fout:\n",
" fout['data'] = np.linspace(0, 30, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Read the data back"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the frist 5 bytse: b'\\x89HDF\\r'\n"
]
}
],
"source": [
"obj.seek(0)\n",
"print(f\"the frist 5 bytse: {obj.read(5)}\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<HDF5 dataset \"data\": shape (10,), type \"<f8\">\n"
]
}
],
"source": [
"obj.seek(0)\n",
"with h5py.File(obj, 'r') as fin:\n",
" print(fin['data'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Write buffer to disk"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"obj.seek(0)\n",
"with open('test_out.h5', 'wb') as fout:\n",
" fout.write(obj.getbuffer())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Read back with hdf5 opening the file"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<HDF5 dataset \"data\": shape (10,), type \"<f8\">\n"
]
}
],
"source": [
"with h5py.File('test_out.h5', 'r') as fin:\n",
" print(fin['data'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Use `open` to read the file"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<HDF5 dataset \"data\": shape (10,), type \"<f8\">\n"
]
}
],
"source": [
"with open('test_out.h5', 'rb') as raw_file:\n",
" with h5py.File(raw_file, 'r') as fin:\n",
" print(fin['data'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Better KeysView repr"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<KeysViewHDF5 ['data', 'data_B', 'nested']>\n"
]
}
],
"source": [
"with h5py.File('example.h5', 'r') as fin:\n",
" print(fin.keys())\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# New in h5py 2.10\n",
"\n",
"- Better support for reading bit fields\n",
"- deprecate implicit file mode\n",
"- better tab-completion out-of-the-box in IPython\n",
"- add `Dataset.make_scale` helper\n",
"- improve handling of spcial data types\n",
"- expose `H5PL` functions\n",
"- expose `H5Dread_chunk` and `h5d.read_direct_chunk`"
]
},
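{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A sketch of the new low-level chunk access; the file name `chunk_demo.h5` is a placeholder:\n",
"\n",
"```python\n",
"import h5py\n",
"import numpy as np\n",
"\n",
"with h5py.File('chunk_demo.h5', 'w') as f:\n",
"    f.create_dataset('d', data=np.arange(10, dtype='i8').reshape(2, 5),\n",
"                     chunks=(1, 5))\n",
"\n",
"with h5py.File('chunk_demo.h5', 'r') as f:\n",
"    # raw (still filtered) bytes of the chunk whose corner is (0, 0)\n",
"    filter_mask, raw = f['d'].id.read_direct_chunk((0, 0))\n",
"    # no filters here, so the bytes decode directly\n",
"    print(np.frombuffer(raw, dtype='i8'))\n",
"```"
]
},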
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Require file mode (so we can change the default next release)\n",
"\n",
"- the current default mode is \"open append, or create if needed\"\n",
"- this is dangerous as users may accindentally mutate files they did not want to!\n",
"- does not match behivor of `open`\n",
"- for back-compatibliity did not want to change default in one step"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/tcaswell/mc3/envs/dd37/lib/python3.7/site-packages/ipykernel_launcher.py:1: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"with h5py.File('blahblah.h5') as fout:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"h5py.get_config().default_file_mode = 'r'\n",
"with h5py.File('blahblah.h5') as fout:\n",
" pass\n",
"# put it back to default just to be tidy!\n",
"h5py.get_config().default_file_mode = None"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## `make_scale` helper"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"with h5py.File(\"with_scale.h5\", 'w') as fout:\n",
" fout['data'] = range(10)\n",
" fout['pos'] = np.arange(10) + 5\n",
" fout['pos'].make_scale(\"pos\")\n",
" fout['data'].dims[0].attach_scale(fout['pos'])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HDF5 \"with_scale.h5\" {\r\n",
"DATASET \"data\" {\r\n",
" DATATYPE H5T_STD_I64LE\r\n",
" DATASPACE SIMPLE { ( 10 ) / ( 10 ) }\r\n",
" DATA {\r\n",
" (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9\r\n",
" }\r\n",
" ATTRIBUTE \"DIMENSION_LIST\" {\r\n",
" DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}\r\n",
" DATASPACE SIMPLE { ( 1 ) / ( 1 ) }\r\n",
" DATA {\r\n",
" (0): (DATASET 1400 /pos )\r\n",
" }\r\n",
" }\r\n",
"}\r\n",
"}\r\n"
]
}
],
"source": [
"!h5dump --dataset=data with_scale.h5"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Plan for h5py 3.0\n",
"\n",
" - drop python 2.7, 3.4, 3.5 support\n",
" - open with 'r' mode by default\n",
" - improve string handling\n",
" - improve date / timestamp handling\n",
" - MPI improvements (?)\n",
" - vlen improvements (?)\n",
" \n",
" ## h5py Code Camp Thursday/Friday\n",
" \n",
" work needed at all levels (docs, performance tuning, bug fixes, libhdf5 reflection, API design)\n",
" \n",
" https://github.com/h5py/h5py/projects/1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Questions?"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"h5py: 2.10.0\n"
]
}
],
"source": [
"import h5py\n",
"print(f\"h5py: {h5py.__version__}\")"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}