@moble
Created October 17, 2018 18:20
Comparison of npz and h5 compression
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import h5py\n",
"from scipy.special import comb as nCk\n",
"from tempfile import TemporaryDirectory"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Simulate an array of spinors in a six-dimensional Clifford algebra stored naively\n",
"d = 6\n",
"n_spinors = 1000\n",
"\n",
"shape1, shape2 = n_spinors, 2**d\n",
"tmp = np.zeros((shape1, shape2), dtype=float)\n",
"\n",
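"# Column indices of the even-grade basis elements (grades 0, 2, 4, 6).\n",
"# The 2**d columns are taken to be ordered by grade, so j is the offset of the\n",
"# first grade-i element and nCk(d, i) is the number of elements of that grade.\n",
"spinor_indices = [k for i in range(0, d+1, 2)\n",
"                  for j in [int(sum([nCk(d, l) for l in range(i)]))]\n",
"                  for k in range(j, j+int(nCk(d, i)))]\n",
"\n",
"# Fill only the even-grade columns with random numbers; the rest stay zero\n",
"for i in spinor_indices:\n",
"    tmp[:, i] = np.random.rand(shape1)"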
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 6936\n",
"-rw-r--r-- 1 boyle staff 514048 Oct 17 14:12 tmp.h5\n",
"-rw-r--r-- 1 boyle staff 512236 Oct 17 14:12 tmp.npz\n",
"-rw-r--r-- 1 boyle staff 248660 Oct 17 14:12 tmp_compressed.npz\n",
"-rw-r--r-- 1 boyle staff 255118 Oct 17 14:12 tmp_compressed1.h5\n",
"-rw-r--r-- 1 boyle staff 266751 Oct 17 14:12 tmp_compressed1_shuffled.h5\n",
"-rw-r--r-- 1 boyle staff 230814 Oct 17 14:12 tmp_compressed1_shuffled_transposed.h5\n",
"-rw-r--r-- 1 boyle staff 253798 Oct 17 14:12 tmp_compressed4.h5\n",
"-rw-r--r-- 1 boyle staff 265700 Oct 17 14:12 tmp_compressed4_shuffled.h5\n",
"-rw-r--r-- 1 boyle staff 230047 Oct 17 14:12 tmp_compressed4_shuffled_transposed.h5\n",
"-rw-r--r-- 1 boyle staff 253634 Oct 17 14:12 tmp_compressed9.h5\n",
"-rw-r--r-- 1 boyle staff 265342 Oct 17 14:12 tmp_compressed9_shuffled.h5\n",
"-rw-r--r-- 1 boyle staff 229817 Oct 17 14:12 tmp_compressed9_shuffled_transposed.h5\n"
]
}
],
"source": [
"# Write the array to disk with various compression settings\n",
"\n",
"with TemporaryDirectory() as directory:\n",
"    np.savez(directory+'/tmp', a=tmp)\n",
"    np.savez_compressed(directory+'/tmp_compressed', a=tmp)\n",
"    with h5py.File(directory+'/tmp.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp)\n",
"    with h5py.File(directory+'/tmp_compressed1.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=1, shuffle=False)\n",
"    with h5py.File(directory+'/tmp_compressed4.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=4, shuffle=False)\n",
"    with h5py.File(directory+'/tmp_compressed9.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=9, shuffle=False)\n",
"    with h5py.File(directory+'/tmp_compressed1_shuffled.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=1, shuffle=True)\n",
"    with h5py.File(directory+'/tmp_compressed4_shuffled.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=4, shuffle=True)\n",
"    with h5py.File(directory+'/tmp_compressed9_shuffled.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=9, shuffle=True)\n",
"    with h5py.File(directory+'/tmp_compressed1_shuffled_transposed.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp.T, compression=\"gzip\", compression_opts=1, shuffle=True)\n",
"    with h5py.File(directory+'/tmp_compressed4_shuffled_transposed.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp.T, compression=\"gzip\", compression_opts=4, shuffle=True)\n",
"    with h5py.File(directory+'/tmp_compressed9_shuffled_transposed.h5', \"w\") as f:\n",
"        dset_coefs = f.create_dataset(\"a\", data=tmp.T, compression=\"gzip\", compression_opts=9, shuffle=True)\n",
"\n",
"    ! ls -l {directory}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spinor array has half of its elements set to zero and the other half set to random numbers, so we can't reasonably expect compression to do much better than about 50% of the original size, no matter how clever we are about storing only the nonzero elements. Unsurprisingly, all of the compressed options above land close to that: between about 45% and 52% of the uncompressed size. This suggests that there's no use being clever about sparsity, because compression does very well on its own.\n",
"\n",
"I'm a little surprised to see that `npz` compression is noticeably better than the naive `h5` compression (roughly 2 to 2.5% smaller than the unshuffled `h5` files), especially since both are based on the same DEFLATE algorithm used by ZIP and gzip; I guess whatever it's doing is pretty clever.\n",
"\n",
"As expected, the compression level really doesn't make much difference: going from level 1 to level 4 (the default level) shrinks the files by only ~0.3 to 0.5%, and going on to level 9 gains just another ~0.06 to 0.14%.\n",
"\n",
"Shuffling is a very clever trick that improves compression of reasonably continuous data at essentially no computational cost. I've used it very successfully in other scenarios. Here, it actually makes compression *worse* (files about 4.6% larger) if applied in the natural way, but gives the smallest files of all (about 9.5% smaller than the unshuffled ones) if the array is transposed first. The reason is that the data *on disk* are more continuous if stored with all those columns of zeros next to each other. Note that the transposed case was only run with shuffling on, so these numbers can't separate the effect of the transpose from that of the shuffle. I guess it would be a good idea to add an option to use shuffling, but maybe just leave it off by default."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
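
The shuffle-and-transpose effect discussed in the notebook's last cell can be explored without HDF5. What follows is only a rough sketch, not what h5py does internally: it uses the standard library alone, applies a byte shuffle to the whole array at once (HDF5 applies its filters chunk by chunk), and compresses with `zlib` at level 4. The helper names `to_bytes`, `byte_shuffle`, and `deflated_size` are made up for this illustration, and the sizes it prints will not match the file sizes listed above. It also includes the transposed-but-unshuffled case that the notebook leaves out.

```python
import random
import struct
import zlib
from math import comb

d = 6
n_spinors = 1000
n_cols = 2**d

# Even-grade column indices, assuming the columns are ordered by grade
# (the same layout the notebook's spinor_indices expression implies).
even_cols = set()
offset = 0
for grade in range(d + 1):
    width = comb(d, grade)
    if grade % 2 == 0:
        even_cols.update(range(offset, offset + width))
    offset += width

# Random values in the even-grade columns, zeros everywhere else
rng = random.Random(0)
rows = [[rng.random() if c in even_cols else 0.0 for c in range(n_cols)]
        for _ in range(n_spinors)]


def to_bytes(table):
    """Pack a list of rows as little-endian float64 in row-major order."""
    flat = [x for row in table for x in row]
    return struct.pack(f"<{len(flat)}d", *flat)


def byte_shuffle(raw, itemsize=8):
    """Put byte 0 of every element first, then byte 1, and so on.

    This is the idea behind a shuffle filter, applied here to the whole
    buffer rather than per chunk (a simplifying assumption).
    """
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def deflated_size(raw, level=4):
    """Size in bytes after DEFLATE compression at the given level."""
    return len(zlib.compress(raw, level))


natural = to_bytes(rows)
transposed = to_bytes([list(col) for col in zip(*rows)])

sizes = {
    "raw": len(natural),
    "natural": deflated_size(natural),
    "natural + shuffle": deflated_size(byte_shuffle(natural)),
    "transposed": deflated_size(transposed),
    "transposed + shuffle": deflated_size(byte_shuffle(transposed)),
}
for name, size in sizes.items():
    print(f"{name:22s} {size:8d}  ({size / sizes['raw']:.1%} of raw)")
```

Comparing the "transposed" and "transposed + shuffle" rows gives a hint as to how much of the gain comes from the memory layout alone and how much from the shuffle, at least for this whole-buffer approximation.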