@t20100
Last active April 22, 2024 14:18
How-to use JPEG2000 compression with HDF5 from Python using blosc2&grok - Binder: https://mybinder.org/v2/gist/t20100/80960ec46abd3a863e85876c013834bb/HEAD?labpath=hdf5-jpeg2000-codec-with-blosc2-grok.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "5cd1042a-8279-47c6-90ff-0d479f2af798",
"metadata": {},
"source": [
"# How to use JPEG2000 compression with HDF5 from Python using blosc2&grok\n",
"\n",
"**Goal**: Writing and reading [JPEG2000](https://jpeg.org/jpeg2000/) compressed data in an HDF5 file from Python with [blosc2](https://www.blosc.org/c-blosc2/c-blosc2.html) and [grok](https://github.com/GrokImageCompression/grok).\n",
"\n",
"[HDF5](https://www.hdfgroup.org/) (Hierarchical Data Format) is a file format designed to store and organize large amounts of data.\n",
"[hdf5plugin](http://www.silx.org/doc/hdf5plugin/latest/) provides some [HDF5 compression filters](https://portal.hdfgroup.org/documentation/hdf5-docs/registered_filter_plugins.html) - including the blosc2 filter - and makes them usable from [h5py](https://docs.h5py.org/en/stable/), a Pythonic interface to the HDF5 binary data format.\n",
"\n",
"[blosc2](https://www.blosc.org/c-blosc2/c-blosc2.html) is a \"meta\"-compressor optimized for binary data supporting different compressors and filters with support for external plugins.\n",
"[blosc2-grok](https://pypi.org/project/blosc2-grok/) is one of the blosc2 plugins which enables the use of [JPEG2000](https://jpeg.org/jpeg2000/) codec thanks to the [grok library](https://github.com/GrokImageCompression/grok).\n",
"\n",
"Notebook license: [CC-0](https://creativecommons.org/public-domain/cc0/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9dcefbd8-29d8-4091-bc7c-979e4d2779ad",
"metadata": {},
"outputs": [],
"source": [
"# Install required packages\n",
"!pip install blosc2 blosc2_grok h5py hdf5plugin b2h5py jupyterlab_h5web"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f9f12f43-8188-4aa0-b89f-ee500406532c",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import blosc2\n",
"import blosc2_grok\n",
"import h5py\n",
"import hdf5plugin\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "c2a858cd-8d96-40e2-868c-037ca46cdd0d",
"metadata": {},
"source": [
"## Write a stack of images as an HDF5 dataset compressed with JPEG2000\n",
"\n",
"To write a dataset compressed with JPEG2000 using blosc2&grok, one has to compress the data with blosc2 and write it using HDF5's direct chunk write.\n",
"\n",
"Indeed, as of today, it is not possible to create a dataset compressed with blosc2&grok using h5py's [Group.create_dataset](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset).\n",
"\n",
"We define a function ``b2_grok_compress_stack``, which compresses a numpy array with blosc2&grok, and a function ``create_blosc2_grok_stack_dataset``, which uses it and writes the compressed data to an HDF5 dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8efde572-be71-46e8-ba15-dd1d835a3bd4",
"metadata": {},
"outputs": [],
"source": [
"def b2_grok_compress_stack(data: np.ndarray, rate: float) -> blosc2.NDArray:\n",
"    \"\"\"Compress a 3D array with blosc2&grok as a stack of JPEG2000 images.\n",
"\n",
"    :param data: 3D array of data\n",
"    :param rate: The requested compression ratio\n",
"    \"\"\"\n",
"    blosc2_grok.set_params_defaults(\n",
"        cod_format=blosc2_grok.GrkFileFmt.GRK_FMT_JP2,\n",
"        quality_mode=\"rates\",\n",
"        quality_layers=np.array([rate], dtype=np.float64),\n",
"    )\n",
"    return blosc2.asarray(\n",
"        data,\n",
"        chunks=data.shape,\n",
"        blocks=(1,) + data.shape[1:],  # Compress slice by slice\n",
"        cparams={\n",
"            'codec': blosc2.Codec.GROK,\n",
"            'filters': [],\n",
"            'splitmode': blosc2.SplitMode.NEVER_SPLIT,\n",
"        },\n",
"    )\n",
"\n",
"\n",
"def create_blosc2_grok_stack_dataset(\n",
"    group: h5py.Group,\n",
"    h5path: str,\n",
"    data: np.ndarray,\n",
"    rate: float,\n",
") -> h5py.Dataset:\n",
"    \"\"\"Store data compressed with blosc2&grok in a new dataset: group[h5path]\n",
"\n",
"    :param group: The root group where to create the dataset\n",
"    :param h5path: The path of the new dataset in the group\n",
"    :param data: The stack data to compress\n",
"    :param rate: The requested compression ratio\n",
"    \"\"\"\n",
"    dataset = group.create_dataset(  # Create the HDF5 dataset\n",
"        h5path,\n",
"        shape=data.shape,\n",
"        dtype=data.dtype,\n",
"        chunks=data.shape,\n",
"        allow_unknown_filter=True,\n",
"        compression=hdf5plugin.Blosc2(),\n",
"    )\n",
"    blosc2_array = b2_grok_compress_stack(data, rate)  # Compress the data with blosc2 & grok\n",
"    # Write the compressed data to HDF5 using direct chunk write\n",
"    dataset.id.write_direct_chunk((0, 0, 0), blosc2_array.schunk.to_cframe())\n",
"    return dataset"
]
},
{
"cell_type": "markdown",
"id": "e9069fcf-8215-45a9-b576-fee64318f2d6",
"metadata": {},
"source": [
"### Example with dummy data\n",
"\n",
"Compress a stack of 10 images of 1024x1024 with a compression rate of 10."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9c7e38dd-4c04-4858-8fb8-2635abd863bc",
"metadata": {},
"outputs": [],
"source": [
"shape = 10, 1024, 1024\n",
"data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)\n",
"\n",
"with h5py.File(\"blosc2-grok.h5\", \"w\") as h5f:\n",
"    create_blosc2_grok_stack_dataset(h5f, \"data\", data, rate=10)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "28aeb4fd-b9fa-49a2-84d2-760133ad0919",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"blosc2-grok.h5 file size: 43707 bytes\n"
]
}
],
"source": [
"print(f\"blosc2-grok.h5 file size: {os.path.getsize('blosc2-grok.h5')} bytes\")"
]
},
{
"cell_type": "markdown",
"id": "2e90ffd1-c40a-4569-88e7-ff3a7df2b777",
"metadata": {},
"source": [
"## Read HDF5 dataset compressed with JPEG2000\n",
"\n",
"Provided that the hdf5plugin and blosc2-grok Python packages are installed, it is possible to read back the written data with h5py."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "093adb17-590f-46df-8bce-6d40aaa4c2dc",
"metadata": {},
"outputs": [],
"source": [
"with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
"    read_data = h5f[\"data\"][()]"
]
},
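{
"cell_type": "code",
"execution_count": null,
"id": "compare-read-vs-original",
"metadata": {},
"outputs": [],
"source": [
"# JPEG2000 rate-based compression is lossy, so the data read back is an\n",
"# approximation of the original. A quick sanity check (added for illustration,\n",
"# assuming `data` and `read_data` from the cells above are still in scope):\n",
"print(\"shapes match:\", read_data.shape == data.shape)\n",
"print(\"max absolute error:\", np.abs(read_data.astype(np.int64) - data.astype(np.int64)).max())"
]
},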
{
"cell_type": "code",
"execution_count": null,
"id": "e403b693-3c42-4913-a9d7-c6d7697a00f5",
"metadata": {},
"outputs": [],
"source": [
"from jupyterlab_h5web import H5Web\n",
"\n",
"H5Web(\"blosc2-grok.h5\")"
]
},
{
"cell_type": "markdown",
"id": "475abe53-6831-4117-897d-118f13c3c7df",
"metadata": {},
"source": [
"### Slice access time\n",
"\n",
"Accessing data this way requires decompressing HDF5 chunks entirely, even when accessing only a slice.\n",
"\n",
"For instance, in this case, accessing one frame requires decompressing the entire HDF5 chunk:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bffd6590-f80e-46de-87ad-f125836cfa0d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"211 ms ± 7.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"# Read one frame\n",
"with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
"    x = h5f[\"data\"][0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d1d2d005-503c-42fa-9c83-a8de51013925",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"230 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"# Read all frames\n",
"with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
"    x = h5f[\"data\"][()]"
]
},
{
"cell_type": "markdown",
"id": "198bbeff-b00e-4cb6-a861-d410f2aa55eb",
"metadata": {},
"source": [
"### Optimised slice reading with b2h5py\n",
"\n",
"[b2h5py](https://pypi.org/project/b2h5py) provides h5py with optimized reading of n-dimensional slices of Blosc2-compressed datasets.\n",
"This optimized slicing leverages direct chunk access and 2-level partitioning into chunks and then smaller blocks (so that less data is actually decompressed).\n",
"\n",
"Example: Read the first slice with ``b2h5py``:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "03d3d3dd-27b8-43f9-b7f3-1351fc7b69a4",
"metadata": {},
"outputs": [],
"source": [
"import b2h5py"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "5c6513ff-0477-405c-adf7-a7acd8a06eaf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"28.3 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"# With b2h5py\n",
"with h5py.File(\"blosc2-grok.h5\", \"r\") as h5f:\n",
"    b2h5py_data = b2h5py.B2Dataset(h5f[\"data\"])[0]"
]
},
{
"cell_type": "markdown",
"id": "b2f63014-91d5-433a-97b7-d2c532dfa149",
"metadata": {},
"source": [
"## Example with tomography radiographs\n",
"\n",
"First, download the raw data: http://www.silx.org/pub/leaps-innov/tomography/lung_raw_2000-2100.h5\n",
"\n",
"Read the raw data and compress it with a 10x compression ratio:"
]
},
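{
"cell_type": "code",
"execution_count": null,
"id": "download-raw-data",
"metadata": {},
"outputs": [],
"source": [
"# Convenience sketch: download the raw data file if it is not already present.\n",
"# (You can also download it manually from the URL above.)\n",
"import urllib.request\n",
"\n",
"if not os.path.exists(\"lung_raw_2000-2100.h5\"):\n",
"    urllib.request.urlretrieve(\n",
"        \"http://www.silx.org/pub/leaps-innov/tomography/lung_raw_2000-2100.h5\",\n",
"        \"lung_raw_2000-2100.h5\",\n",
"    )"
]
},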
{
"cell_type": "code",
"execution_count": 11,
"id": "cd430d70-41d3-41af-a10d-e3c87d782e2b",
"metadata": {},
"outputs": [],
"source": [
"with h5py.File(\"lung_raw_2000-2100.h5\", \"r\") as h5f:\n",
"    images = h5f[\"data\"][()]\n",
"\n",
"with h5py.File(\"lung_raw-blosc2-grok.h5\", \"w\") as h5f:\n",
"    create_blosc2_grok_stack_dataset(h5f, \"data\", images, rate=10)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "1af9ad9c-d180-49cf-af92-19978a903f05",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File sizes:\n",
"- lung_raw-blosc2-grok.h5: 41936825 bytes\n",
"- lung_raw_2000-2100.h5: 419438624 bytes\n"
]
}
],
"source": [
"print(\"File sizes:\")\n",
"print(f\"- lung_raw-blosc2-grok.h5: {os.path.getsize('lung_raw-blosc2-grok.h5')} bytes\")\n",
"print(f\"- lung_raw_2000-2100.h5: {os.path.getsize('lung_raw_2000-2100.h5')} bytes\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
b2h5py
blosc2>=2.5.1
blosc2_grok>=0.2.2
h5py
hdf5plugin>=4.4.0
jupyterlab_h5web