Last active: April 22, 2024 14:18
How-to use JPEG2000 compression with HDF5 from Python using blosc2&grok - Binder: https://mybinder.org/v2/gist/t20100/80960ec46abd3a863e85876c013834bb/HEAD?labpath=hdf5-jpeg2000-codec-with-blosc2-grok.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5cd1042a-8279-47c6-90ff-0d479f2af798",
   "metadata": {},
   "source": [
    "# How-to use JPEG2000 compression with HDF5 from Python using blosc2&grok\n",
    "\n",
    "**Goal**: Write and read [JPEG2000](https://jpeg.org/jpeg2000/)-compressed data in an HDF5 file from Python with [blosc2](https://www.blosc.org/c-blosc2/c-blosc2.html) and [grok](https://github.com/GrokImageCompression/grok).\n",
    "\n",
    "[HDF5](https://www.hdfgroup.org/) (Hierarchical Data Format) is a file format designed to store and organize large amounts of data.\n",
    "[hdf5plugin](http://www.silx.org/doc/hdf5plugin/latest/) provides some [HDF5 compression filters](https://portal.hdfgroup.org/documentation/hdf5-docs/registered_filter_plugins.html) - including the blosc2 filter - and makes them usable from [h5py](https://docs.h5py.org/en/stable/), a Pythonic interface to the HDF5 binary data format.\n",
    "\n",
    "[blosc2](https://www.blosc.org/c-blosc2/c-blosc2.html) is a \"meta\"-compressor optimized for binary data that supports different compressors and filters, as well as external plugins.\n",
    "[blosc2-grok](https://pypi.org/project/blosc2-grok/) is one of these plugins: it enables the [JPEG2000](https://jpeg.org/jpeg2000/) codec thanks to the [grok library](https://github.com/GrokImageCompression/grok).\n",
    "\n",
    "Notebook license: [CC-0](https://creativecommons.org/public-domain/cc0/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9dcefbd8-29d8-4091-bc7c-979e4d2779ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "!pip install blosc2 blosc2_grok h5py hdf5plugin b2h5py jupyterlab_h5web"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f9f12f43-8188-4aa0-b89f-ee500406532c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import blosc2\n",
    "import blosc2_grok\n",
    "import h5py\n",
    "import hdf5plugin\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2a858cd-8d96-40e2-868c-037ca46cdd0d",
   "metadata": {},
   "source": [
    "## Write a stack of images as an HDF5 dataset compressed with JPEG2000\n",
    "\n",
    "To write a dataset compressed with JPEG2000 using blosc2&grok, one has to compress the data with blosc2 and write it using HDF5's direct chunk write.\n",
    "\n",
    "Indeed, as of today, it is not possible to create a dataset compressed with blosc2&grok using h5py's [Group.create_dataset](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset).\n",
    "\n",
    "We define a function ``b2_grok_compress_stack`` which compresses a numpy array with blosc2&grok, and a function ``create_blosc2_grok_stack_dataset`` which uses it to write the compressed data to an HDF5 dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8efde572-be71-46e8-ba15-dd1d835a3bd4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def b2_grok_compress_stack(data: np.ndarray, rate: float) -> blosc2.NDArray:\n",
    "    \"\"\"Compress a 3D array with blosc2&grok as a stack of JPEG2000 images.\n",
    "\n",
    "    :param data: 3D array of data\n",
    "    :param rate: The requested compression ratio\n",
    "    \"\"\"\n",
    "    blosc2_grok.set_params_defaults(\n",
    "        cod_format=blosc2_grok.GrkFileFmt.GRK_FMT_JP2,\n",
    "        quality_mode=\"rates\",\n",
    "        quality_layers=np.array([rate], dtype=np.float64),\n",
    "    )\n",
    "    return blosc2.asarray(\n",
    "        data,\n",
    "        chunks=data.shape,\n",
    "        blocks=(1,) + data.shape[1:],  # Compress slice by slice\n",
    "        cparams={\n",
    "            'codec': blosc2.Codec.GROK,\n",
    "            'filters': [],\n",
    "            'splitmode': blosc2.SplitMode.NEVER_SPLIT,\n",
    "        },\n",
    "    )\n",
    "\n",
    "\n",
    "def create_blosc2_grok_stack_dataset(\n",
    "    group: h5py.Group,\n",
    "    h5path: str,\n",
    "    data: np.ndarray,\n",
    "    rate: float,\n",
    ") -> h5py.Dataset:\n",
    "    \"\"\"Store data compressed with blosc2&grok in a new dataset: group[h5path]\n",
    "\n",
    "    :param group: The root group in which to create the dataset\n",
    "    :param h5path: The path of the new dataset in the group\n",
    "    :param data: The stack data to compress\n",
    "    :param rate: The requested compression ratio\n",
    "    \"\"\"\n",
    "    dataset = group.create_dataset(  # Create the HDF5 dataset\n",
    "        h5path,\n",
    "        shape=data.shape,\n",
    "        dtype=data.dtype,\n",
    "        chunks=data.shape,\n",
    "        allow_unknown_filter=True,\n",
    "        compression=hdf5plugin.Blosc2(),\n",
    "    )\n",
    "    blosc2_array = b2_grok_compress_stack(data, rate)  # Compress the data with blosc2 & grok\n",
    "    # Write the compressed data to HDF5 using direct chunk write\n",
    "    dataset.id.write_direct_chunk((0, 0, 0), blosc2_array.schunk.to_cframe())\n",
    "    return dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9069fcf-8215-45a9-b576-fee64318f2d6",
   "metadata": {},
   "source": [
    "### Example with dummy data\n",
    "\n",
    "Compress a stack of 10 images of 1024x1024 pixels with a compression rate of 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "9c7e38dd-4c04-4858-8fb8-2635abd863bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "shape = 10, 1024, 1024\n",
    "data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)\n",
    "\n",
    "with h5py.File(\"blosc2-grok.h5\", \"w\") as h5f:\n",
    "    create_blosc2_grok_stack_dataset(h5f, \"data\", data, rate=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "28aeb4fd-b9fa-49a2-84d2-760133ad0919",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "blosc2-grok.h5 file size: 43707 bytes\n"
     ]
    }
   ],
   "source": [
    "print(f\"blosc2-grok.h5 file size: {os.path.getsize('blosc2-grok.h5')} bytes\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e90ffd1-c40a-4569-88e7-ff3a7df2b777",
   "metadata": {},
   "source": [
    "## Read an HDF5 dataset compressed with JPEG2000\n",
    "\n",
    "Provided that the hdf5plugin and blosc2-grok Python packages are installed, it is possible to read back the written data with h5py."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "093adb17-590f-46df-8bce-6d40aaa4c2dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
    "    read_data = h5f[\"data\"][()]"
   ]
  },
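  {
   "cell_type": "markdown",
   "id": "a7c4e1f0-1111-4aaa-8bbb-readbackcheck",
   "metadata": {},
   "source": [
    "With a requested compression ratio of 10, the JPEG2000 encoding is lossy, so ``read_data`` only approximates the original ``data`` array. A minimal sketch of a reconstruction-error check (the exact error values depend on the codec settings):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8d5f2a1-2222-4ccc-8ddd-readbackcode",
   "metadata": {},
   "outputs": [],
   "source": [
    "# JPEG2000 at rate 10 is lossy: compare the decompressed data with the original.\n",
    "# Cast to a signed type before subtracting to avoid uint16 wrap-around.\n",
    "abs_error = np.abs(read_data.astype(np.int64) - data.astype(np.int64))\n",
    "print(f\"max abs error: {abs_error.max()}\")\n",
    "print(f\"mean abs error: {abs_error.mean():.3f}\")"
   ]
  },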
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e403b693-3c42-4913-a9d7-c6d7697a00f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from jupyterlab_h5web import H5Web\n",
    "\n",
    "H5Web(\"blosc2-grok.h5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "475abe53-6831-4117-897d-118f13c3c7df",
   "metadata": {},
   "source": [
    "### Slice access time\n",
    "\n",
    "Accessing data this way requires decompressing complete HDF5 chunks, even when accessing only a slice.\n",
    "\n",
    "For instance, in this case, accessing a single frame requires decompressing the entire HDF5 chunk:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "bffd6590-f80e-46de-87ad-f125836cfa0d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "211 ms ± 7.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "\n",
    "# Read one frame\n",
    "with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
    "    x = h5f[\"data\"][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d1d2d005-503c-42fa-9c83-a8de51013925",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "230 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "\n",
    "# Read all frames\n",
    "with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
    "    x = h5f[\"data\"][()]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "198bbeff-b00e-4cb6-a861-d410f2aa55eb",
   "metadata": {},
   "source": [
    "### Optimised slice reading with b2h5py\n",
    "\n",
    "[b2h5py](https://pypi.org/project/b2h5py) provides h5py with optimized reading of n-dimensional slices of Blosc2-compressed datasets.\n",
    "This optimized slicing leverages direct chunk access and 2-level partitioning into chunks and then smaller blocks (so that less data is actually decompressed).\n",
    "\n",
    "Example: Read the first slice with ``b2h5py``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "03d3d3dd-27b8-43f9-b7f3-1351fc7b69a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import b2h5py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "5c6513ff-0477-405c-adf7-a7acd8a06eaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "28.3 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "\n",
    "# With b2h5py\n",
    "with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
    "    b2h5py_data = b2h5py.B2Dataset(h5f[\"data\"])[0]"
   ]
  },
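  {
   "cell_type": "markdown",
   "id": "c9e6a3b2-3333-4eee-8fff-b2h5pycheck",
   "metadata": {},
   "source": [
    "The optimized slicing is only a faster access path: it decodes the same compressed stream as plain h5py slicing, so both reads should return exactly the same values. A quick sanity check, assuming the ``blosc2-grok.h5`` file written above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0f7b4c3-4444-4aaa-9bbb-b2h5pycode",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check that the b2h5py optimized read matches the plain h5py read\n",
    "with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
    "    plain_frame = h5f[\"data\"][0]\n",
    "    optimized_frame = b2h5py.B2Dataset(h5f[\"data\"])[0]\n",
    "assert np.array_equal(plain_frame, optimized_frame)"
   ]
  },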
  {
   "cell_type": "markdown",
   "id": "b2f63014-91d5-433a-97b7-d2c532dfa149",
   "metadata": {},
   "source": [
    "## Example with tomography radios\n",
    "\n",
    "First, download the raw data: http://www.silx.org/pub/leaps-innov/tomography/lung_raw_2000-2100.h5\n",
    "\n",
    "Read the raw data and compress it with a 10x compression ratio:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "cd430d70-41d3-41af-a10d-e3c87d782e2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5py.File(\"lung_raw_2000-2100.h5\", \"r\") as h5f:\n",
    "    images = h5f[\"data\"][()]\n",
    "\n",
    "with h5py.File(\"lung_raw-blosc2-grok.h5\", \"w\") as h5f:\n",
    "    create_blosc2_grok_stack_dataset(h5f, \"data\", images, rate=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1af9ad9c-d180-49cf-af92-19978a903f05",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File sizes:\n",
      "- lung_raw-blosc2-grok.h5: 41936825 bytes\n",
      "- lung_raw_2000-2100.h5: 419438624 bytes\n"
     ]
    }
   ],
   "source": [
    "print(\"File sizes:\")\n",
    "print(f\"- lung_raw-blosc2-grok.h5: {os.path.getsize('lung_raw-blosc2-grok.h5')} bytes\")\n",
    "print(f\"- lung_raw_2000-2100.h5: {os.path.getsize('lung_raw_2000-2100.h5')} bytes\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80097d3e-15cc-462c-b7e3-19c11912bdf5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
b2h5py
blosc2>=2.5.1
blosc2_grok>=0.2.2
h5py
hdf5plugin>=4.4.0
jupyterlab_h5web