@tacaswell
Last active September 18, 2019 14:26
Slides for h5py update
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# h5py update\n",
"\n",
"## HDF5 European Workshop for Science and Industry\n",
"## ESRF, Grenoble, 2019-09-18"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# History, Particurals & Usage\n",
"\n",
" - Started in 2008 by Andrew Collette\n",
" - Now maintained by community\n",
" - https://github.com/h5py/h5py\n",
" - https://h5py.readthedocs.io/en/stable/\n",
" - 129th most downlodaded package on pypi (mostil CI machines)\n",
" - used by keras / tensorflow"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Basic Philosophy\n",
"\n",
" - Provides a \"pythonic\" wrapping of `libhdf5`\n",
" - less opnionated about use cases than `pytables`\n",
" - less tuned that `pytables`\n",
" \n",
"## Core Analogies\n",
"\n",
"- `dict` <-> {`h5py.File`, `h5py.Group`}\n",
" - `g['key']` access to children (groups or datasets)\n",
"- `np.array` <-> `h5py.Dataset`\n",
" - `Dateset` object support array protocol, slicing\n",
" - only pulls data from disk on demand"
]
},
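{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal sketch of the dict-style API; the file name `analogy_demo.h5` is a placeholder:\n",
"\n",
"```python\n",
"import h5py\n",
"import numpy as np\n",
"\n",
"with h5py.File('analogy_demo.h5', 'w') as f:\n",
"    f['data'] = [0, 1, 2]\n",
"    f['nested/twoD'] = np.ones((2, 2))\n",
"\n",
"with h5py.File('analogy_demo.h5', 'r') as f:\n",
"    print(list(f.keys()))   # child names, like a dict\n",
"    print('data' in f)      # membership tests work too\n",
"    # recursively visit every group and dataset\n",
"    f.visititems(lambda name, obj: print(name, obj))\n",
"```"
]
},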
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Write some data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"import h5py\n",
"import numpy as np\n",
"\n",
"with h5py.File('example.h5', 'w') as fout:\n",
" # do the right thing in simple cases\n",
" fout['data'] = [0, 1, 2, 3, 4]\n",
" fout['nested/twoD'] = np.array([[1, 2], [3, 4]])\n",
" # method provides access to all of the dataset creation knobs\n",
" fout.create_dataset('data_B', \n",
" data=np.arange(10).reshape(2, 5),\n",
" chunks=(1, 5))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Read some data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 file \"example.h5\" (mode r)>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fin = h5py.File('example.h5', 'r')\n",
"# the File object\n",
"fin"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 group \"/\" (3 members)>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# root group\n",
"fin['/']"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['data', 'data_B', 'nested']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(fin['/'])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 dataset \"data\": shape (5,), type \"<i8\">"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# a Dateset, has not read any data yet\n",
"fin['data']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### numpy-stlye slicing on datasets"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pull data from disk to an array\n",
"fin['data'][:]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# pull part of the dataset\n",
"fin['data'][1:3]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 3],\n",
" [6, 8]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# handles numpy-style strided ND slicing\n",
"fin['data_B'][:, 1::2]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 3, 4])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fancy slicing\n",
"fin['data'][[0, 3, 4]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Accessing Nested Groups/Datasets"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 dataset \"twoD\": shape (2, 2), type \"<i8\">"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# acess nested groups / datasets via repeated []\n",
"fin['nested']['twoD']"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<HDF5 dataset \"twoD\": shape (2, 2), type \"<i8\">"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Or use file-path like access\n",
"fin['nested/twoD']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Close the file"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Closed HDF5 file>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# if not using a context manager, remember to clean up!\n",
"fin.close()\n",
"fin"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# New is h5py 2.8\n",
"\n",
" - register new file drivers\n",
" - track object creation order\n",
" - lots of bug fixes!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# New in `h5py` 2.9\n",
"\n",
" - high-level API for creating virtual datasets\n",
" - passing in python \"file-like\" objects to `h5py.File`\n",
" - control chunk cache when creating `h5py.File`\n",
" - `create_dataset_like` method\n",
" - track creation order of attributes\n",
" - bug fixes!"
]
},
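{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A short sketch of two of these features, `create_dataset_like` and the chunk-cache knobs; the file name `like_demo.h5` is a placeholder:\n",
"\n",
"```python\n",
"import h5py\n",
"\n",
"with h5py.File('like_demo.h5', 'w') as f:\n",
"    src = f.create_dataset('a', shape=(100,), dtype='f4',\n",
"                           chunks=(10,), compression='gzip')\n",
"    # copy shape / dtype / chunks / filters from an existing dataset\n",
"    f.create_dataset_like('b', src)\n",
"\n",
"# tune the per-file chunk cache: total bytes, hash slots, eviction policy\n",
"with h5py.File('like_demo.h5', 'r', rdcc_nbytes=16 * 1024**2,\n",
"               rdcc_nslots=100003, rdcc_w0=0.75) as f:\n",
"    print(f['b'].chunks)\n",
"```"
]
},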
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## High level API for Virtual Datasets \n",
"\n",
"\n",
"- Work stared by Aaron Parsons at DLS\n",
"- continued by Thomas Caswell at NSLS-II\n",
"- finished by Thomas Kluyver at EuXFEL\n",
"\n",
"\n",
"low-level API has been availble from h5py 2.6"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Create some data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# create some sample data\n",
"data = np.arange(0, 100).reshape(1, 100) + np.arange(1, 5).reshape(4, 1)\n",
"\n",
"# Create source files (0.h5 to 3.h5)\n",
"for n in range(4):\n",
" with h5py.File(f\"{n}.h5\", \"w\") as f:\n",
" d = f.create_dataset(\"data\", (100,), \"i4\", data[n])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Create the Virtual Dataset"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"# Assemble virtual dataset\n",
"layout = h5py.VirtualLayout(shape=(4, 100), dtype=\"i4\")\n",
"for n in range(4):\n",
" layout[n] = h5py.VirtualSource(f\"{n}.h5\", \"data\", shape=(100,))\n",
"\n",
"# Add virtual dataset to output file\n",
"with h5py.File(\"VDS.h5\", \"w\", libver=\"latest\") as f:\n",
" # the virtual dataset\n",
" f.create_virtual_dataset(\"data_A\", layout, fillvalue=-5)\n",
" # normal dataset with identical values\n",
" f.create_dataset(\"data_B\", data=data, dtype='i4')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Read it back"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Virtual dataset: <HDF5 dataset \"data_A\": shape (4, 100), type \"<i4\">\n",
"[[ 1 11 21 31 41 51 61 71 81 91]\n",
" [ 2 12 22 32 42 52 62 72 82 92]\n",
" [ 3 13 23 33 43 53 63 73 83 93]\n",
" [ 4 14 24 34 44 54 64 74 84 94]]\n",
"Normal dataset : <HDF5 dataset \"data_B\": shape (4, 100), type \"<i4\">\n",
"[[ 1 11 21 31 41 51 61 71 81 91]\n",
" [ 2 12 22 32 42 52 62 72 82 92]\n",
" [ 3 13 23 33 43 53 63 73 83 93]\n",
" [ 4 14 24 34 44 54 64 74 84 94]]\n"
]
}
],
"source": [
"# read data back\n",
"# virtual dataset is transparent for reader!\n",
"with h5py.File(\"VDS.h5\", \"r\") as f:\n",
" print(f\"Virtual dataset: {f['data_A']}\")\n",
" print(f[\"data_A\"][:, ::10])\n",
" print(f\"Normal dataset : {f['data_B']}\")\n",
" print(f[\"data_B\"][:, ::10])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Pass Python file-like objects to `h5py.File`\n",
"\n",
" - contributed by Andrey Paramonov (Андрей Парамонов)\n",
" - can pass in object returned by `open` or a `BytesIO` object"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Creat a `BtyesIO` object and write data to it"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"from io import BytesIO\n",
"\n",
"obj = BytesIO()\n",
"with h5py.File(obj, 'w') as fout:\n",
" fout['data'] = np.linspace(0, 30, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Read the data back"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the frist 5 bytse: b'\\x89HDF\\r'\n"
]
}
],
"source": [
"obj.seek(0)\n",
"print(f\"the frist 5 bytse: {obj.read(5)}\")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<HDF5 dataset \"data\": shape (10,), type \"<f8\">\n"
]
}
],
"source": [
"obj.seek(0)\n",
"with h5py.File(obj, 'r') as fin:\n",
" print(fin['data'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Write buffer to disk"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"obj.seek(0)\n",
"with open('test_out.h5', 'wb') as fout:\n",
" fout.write(obj.getbuffer())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Read back with hdf5 opening the file"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<HDF5 dataset \"data\": shape (10,), type \"<f8\">\n"
]
}
],
"source": [
"with h5py.File('test_out.h5', 'r') as fin:\n",
" print(fin['data'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### Use `open` to read the file"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<HDF5 dataset \"data\": shape (10,), type \"<f8\">\n"
]
}
],
"source": [
"with open('test_out.h5', 'rb') as raw_file:\n",
" with h5py.File(raw_file, 'r') as fin:\n",
" print(fin['data'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Better KeysView repr"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<KeysViewHDF5 ['data', 'data_B', 'nested']>\n"
]
}
],
"source": [
"with h5py.File('example.h5', 'r') as fin:\n",
" print(fin.keys())\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# New in h5py 2.10\n",
"\n",
"- Better support for reading bit fields\n",
"- deprecate implicit file mode\n",
"- better tab-completion out-of-the-box in IPython\n",
"- add `Dataset.make_scale` helper\n",
"- improve handling of spcial data types\n",
"- expose `H5PL` functions\n",
"- expose `H5Dread_chunk` and `h5d.read_direct_chunk`"
]
},
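{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A sketch of the new low-level chunk access; the file name `chunk_demo.h5` is a placeholder:\n",
"\n",
"```python\n",
"import h5py\n",
"import numpy as np\n",
"\n",
"with h5py.File('chunk_demo.h5', 'w') as f:\n",
"    f.create_dataset('d', data=np.arange(10, dtype='i8').reshape(2, 5),\n",
"                     chunks=(1, 5))\n",
"\n",
"with h5py.File('chunk_demo.h5', 'r') as f:\n",
"    # raw (still filtered) bytes of the chunk whose corner is (0, 0)\n",
"    filter_mask, raw = f['d'].id.read_direct_chunk((0, 0))\n",
"    # no filters here, so the bytes decode directly\n",
"    print(np.frombuffer(raw, dtype='i8'))\n",
"```"
]
},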
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Require file mode (so we can change the default next release)\n",
"\n",
"- the current default mode is \"open append, or create if needed\"\n",
"- this is dangerous as users may accindentally mutate files they did not want to!\n",
"- does not match behivor of `open`\n",
"- for back-compatibliity did not want to change default in one step"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/tcaswell/mc3/envs/dd37/lib/python3.7/site-packages/ipykernel_launcher.py:1: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"with h5py.File('blahblah.h5') as fout:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"h5py.get_config().default_file_mode = 'r'\n",
"with h5py.File('blahblah.h5') as fout:\n",
" pass\n",
"# put it back to default just to be tidy!\n",
"h5py.get_config().default_file_mode = None"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## `make_scale` helper"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"with h5py.File(\"with_scale.h5\", 'w') as fout:\n",
" fout['data'] = range(10)\n",
" fout['pos'] = np.arange(10) + 5\n",
" fout['pos'].make_scale(\"pos\")\n",
" fout['data'].dims[0].attach_scale(fout['pos'])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HDF5 \"with_scale.h5\" {\r\n",
"DATASET \"data\" {\r\n",
" DATATYPE H5T_STD_I64LE\r\n",
" DATASPACE SIMPLE { ( 10 ) / ( 10 ) }\r\n",
" DATA {\r\n",
" (0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9\r\n",
" }\r\n",
" ATTRIBUTE \"DIMENSION_LIST\" {\r\n",
" DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}\r\n",
" DATASPACE SIMPLE { ( 1 ) / ( 1 ) }\r\n",
" DATA {\r\n",
" (0): (DATASET 1400 /pos )\r\n",
" }\r\n",
" }\r\n",
"}\r\n",
"}\r\n"
]
}
],
"source": [
"!h5dump --dataset=data with_scale.h5"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Plan for h5py 3.0\n",
"\n",
" - drop python 2.7, 3.4, 3.5 support\n",
" - open with 'r' mode by default\n",
" - improve string handling\n",
" - improve date / timestamp handling\n",
" - MPI improvements (?)\n",
" - vlen improvements (?)\n",
" \n",
" ## h5py Code Camp Thursday/Friday\n",
" \n",
" work needed at all levels (docs, performance tuning, bug fixes, libhdf5 reflection, API design)\n",
" \n",
" https://github.com/h5py/h5py/projects/1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Questions?"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"h5py: 2.10.0\n"
]
}
],
"source": [
"import h5py\n",
"print(f\"h5py: {h5py.__version__}\")"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}