Created October 17, 2018 18:20
Comparison of npz and h5 compression
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import h5py\n", | |
"from scipy.special import comb as nCk\n", | |
"from tempfile import TemporaryDirectory" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Simulate an array of spinors in a six-dimensional Clifford algebra stored naively\n", | |
"d = 6\n", | |
"n_spinors = 1000\n", | |
"\n", | |
"shape1, shape2 = n_spinors, 2**d\n", | |
"tmp = np.zeros((shape1, shape2), dtype=float)\n", | |
"\n", | |
"# Indices of the even-grade components: grade i starts at offset j = sum of nCk(d, l) for l < i\n", | |
"spinor_indices = [k for i in range(0, d+1, 2)\n", | |
"                  for j in [int(sum([nCk(d, l) for l in range(i)]))]\n", | |
"                  for k in range(j, j+int(nCk(d, i)))]\n", | |
"\n", | |
"# Fill only the even-grade columns with random numbers; all other columns stay zero\n", | |
"for i in spinor_indices:\n", | |
"    tmp[:, i] = np.random.rand(shape1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"total 6936\n", | |
"-rw-r--r-- 1 boyle staff 514048 Oct 17 14:12 tmp.h5\n", | |
"-rw-r--r-- 1 boyle staff 512236 Oct 17 14:12 tmp.npz\n", | |
"-rw-r--r-- 1 boyle staff 248660 Oct 17 14:12 tmp_compressed.npz\n", | |
"-rw-r--r-- 1 boyle staff 255118 Oct 17 14:12 tmp_compressed1.h5\n", | |
"-rw-r--r-- 1 boyle staff 266751 Oct 17 14:12 tmp_compressed1_shuffled.h5\n", | |
"-rw-r--r-- 1 boyle staff 230814 Oct 17 14:12 tmp_compressed1_shuffled_transposed.h5\n", | |
"-rw-r--r-- 1 boyle staff 253798 Oct 17 14:12 tmp_compressed4.h5\n", | |
"-rw-r--r-- 1 boyle staff 265700 Oct 17 14:12 tmp_compressed4_shuffled.h5\n", | |
"-rw-r--r-- 1 boyle staff 230047 Oct 17 14:12 tmp_compressed4_shuffled_transposed.h5\n", | |
"-rw-r--r-- 1 boyle staff 253634 Oct 17 14:12 tmp_compressed9.h5\n", | |
"-rw-r--r-- 1 boyle staff 265342 Oct 17 14:12 tmp_compressed9_shuffled.h5\n", | |
"-rw-r--r-- 1 boyle staff 229817 Oct 17 14:12 tmp_compressed9_shuffled_transposed.h5\n" | |
] | |
} | |
], | |
"source": [ | |
"# Write the array to disk with various compression settings\n", | |
"\n", | |
"with TemporaryDirectory() as directory:\n", | |
"    # NumPy archives: uncompressed and compressed\n", | |
"    np.savez(directory+'/tmp', a=tmp)\n", | |
"    np.savez_compressed(directory+'/tmp_compressed', a=tmp)\n", | |
"    # HDF5 without compression\n", | |
"    with h5py.File(directory+'/tmp.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp)\n", | |
"    # HDF5 with gzip at levels 1, 4, and 9, no shuffling\n", | |
"    with h5py.File(directory+'/tmp_compressed1.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=1, shuffle=False)\n", | |
"    with h5py.File(directory+'/tmp_compressed4.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=4, shuffle=False)\n", | |
"    with h5py.File(directory+'/tmp_compressed9.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=9, shuffle=False)\n", | |
"    # Same gzip levels with the shuffle filter\n", | |
"    with h5py.File(directory+'/tmp_compressed1_shuffled.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=1, shuffle=True)\n", | |
"    with h5py.File(directory+'/tmp_compressed4_shuffled.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=4, shuffle=True)\n", | |
"    with h5py.File(directory+'/tmp_compressed9_shuffled.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp, compression=\"gzip\", compression_opts=9, shuffle=True)\n", | |
"    # Same gzip levels with the shuffle filter, applied to the transposed array\n", | |
"    with h5py.File(directory+'/tmp_compressed1_shuffled_transposed.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp.T, compression=\"gzip\", compression_opts=1, shuffle=True)\n", | |
"    with h5py.File(directory+'/tmp_compressed4_shuffled_transposed.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp.T, compression=\"gzip\", compression_opts=4, shuffle=True)\n", | |
"    with h5py.File(directory+'/tmp_compressed9_shuffled_transposed.h5', \"w\") as f:\n", | |
"        dset_coefs = f.create_dataset(\"a\", data=tmp.T, compression=\"gzip\", compression_opts=9, shuffle=True)\n", | |
"\n", | |
"    ! ls -l {directory}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The spinor array has half of its elements set to zero (only the 32 even-grade components out of `2**6 = 64` are filled) and the other half set to random numbers. Random numbers are essentially incompressible, so we can't reasonably expect the file to shrink much below about 50% of its uncompressed size, no matter how clever we are about storing only the nonzero elements. Unsurprisingly, every compressed option above lands at about 50% or slightly better (between roughly 45% and 52% of the uncompressed size). This suggests that there's no use being clever about sparsity, because compression does very well on its own.\n", | |
"\n", | |
"I'm a little surprised to see that `npz` compression is better than the naive `h5` compression (about 2% smaller than the best unshuffled `h5` file), especially since both use the same DEFLATE algorithm. My guess is that the difference comes from how the data are fed to the compressor (HDF5 compresses each chunk separately, while `npz` compresses the whole array as a single stream) rather than from anything cleverer in the algorithm itself.\n", | |
"\n", | |
"As expected, the compression level doesn't make much difference. For the unshuffled files, going from level 1 to level 4 (the `h5py` default) shrinks the file by ~0.5%, and going on to level 9 shrinks it by only an additional ~0.06%.\n", | |
"\n", | |
"Shuffling is a very clever trick that improves compression of reasonably continuous data with essentially no computational cost: the bytes of each element are regrouped so that all the first bytes are stored together, then all the second bytes, and so on. I've used it very successfully in other scenarios. Here, it actually makes compression *worse* if applied in the natural way (files about 4.5% larger than unshuffled), but significantly improves the results if the array is transposed first (files about 9.5% smaller than unshuffled). The reason is that the data *on disk* are more continuous after transposing: each column of zeros becomes one contiguous block, so zeros and random values form long runs instead of alternating every few elements. I guess it would be a good idea to add an option to use shuffling, but maybe just leave it off by default." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python [default]", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
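
The layout argument in the last paragraph of the notebook can be checked without HDF5 at all. The following is only a rough sketch under several assumptions: it uses the standard library's `zlib` instead of `h5py`, and it shuffles the whole buffer at once, whereas HDF5 applies its shuffle filter chunk by chunk, so the byte counts will not match the listing above. The helpers `to_bytes` and `byte_shuffle` and the random seed are made up for this illustration and are not part of the notebook.

```python
import random
import zlib
from array import array
from math import comb  # Python 3.8+

d, n_spinors = 6, 1000
n_components = 2**d

# Indices of the even-grade components: grade i occupies comb(d, i) slots,
# starting after all lower grades (same layout as spinor_indices above).
even = set()
for grade in range(0, d + 1, 2):
    start = sum(comb(d, l) for l in range(grade))
    even.update(range(start, start + comb(d, grade)))

random.seed(0)  # arbitrary seed, only to make the sketch repeatable
rows = [[random.random() if j in even else 0.0 for j in range(n_components)]
        for _ in range(n_spinors)]


def to_bytes(values):
    """Pack an iterable of floats as raw 8-byte doubles (hypothetical helper)."""
    return array("d", values).tobytes()


def byte_shuffle(raw, itemsize=8):
    """Store byte 0 of every element first, then byte 1, and so on.

    A whole-buffer stand-in for the idea behind HDF5's shuffle filter.
    """
    return b"".join(raw[k::itemsize] for k in range(itemsize))


natural = to_bytes(x for row in rows for x in row)            # spinor by spinor
transposed = to_bytes(x for col in zip(*rows) for x in col)   # component by component

level = 4  # zlib level, mirroring compression_opts=4 in the notebook
sizes = {
    "plain": len(zlib.compress(natural, level)),
    "shuffled": len(zlib.compress(byte_shuffle(natural), level)),
    "shuffled_transposed": len(zlib.compress(byte_shuffle(transposed), level)),
}

print(f"raw: {len(natural)} bytes, {len(even)} of {n_components} components nonzero")
for name, size in sizes.items():
    print(f"{name:>20}: {size} bytes ({size / len(natural):.1%} of raw)")
```

If the reasoning in the notebook holds, the transposed-and-shuffled buffer should come out smallest, because each byte plane then consists of long blocks of random bytes separated by long runs of zeros rather than many short alternating runs.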