moble/CliffordCompression.ipynb

## CliffordCompression.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import h5py\n",
    "from scipy.special import comb as nCk\n",
    "from tempfile import TemporaryDirectory"
   ]
  },
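  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate an array of spinors in a six-dimensional Clifford algebra stored naively\n",
    "d = 6\n",
    "n_spinors = 1000\n",
    "\n",
    "shape1, shape2 = n_spinors, 2**d\n",
    "tmp = np.zeros((shape1, shape2), dtype=float)\n",
    "\n",
    "# Column indices of the even-grade components (grades 0, 2, ..., d), which fill half of the 2**d columns;\n",
    "# j is the offset of the first grade-i component, with components ordered by grade\n",
    "spinor_indices = [k for i in range(0, d+1, 2)\n",
    "                  for j in [int(sum([nCk(d, l) for l in range(i)]))]\n",
    "                  for k in range(j, j+int(nCk(d, i)))]\n",
    "\n",
    "# Fill only the even-grade columns with random values; every other column stays zero\n",
    "for i in spinor_indices:\n",
    "    tmp[:, i] = np.random.rand(shape1)"
   ]
  },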
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 6936\n",
      "-rw-r--r--  1 boyle  staff  514048 Oct 17 14:12 tmp.h5\n",
      "-rw-r--r--  1 boyle  staff  512236 Oct 17 14:12 tmp.npz\n",
      "-rw-r--r--  1 boyle  staff  248660 Oct 17 14:12 tmp_compressed.npz\n",
      "-rw-r--r--  1 boyle  staff  255118 Oct 17 14:12 tmp_compressed1.h5\n",
      "-rw-r--r--  1 boyle  staff  266751 Oct 17 14:12 tmp_compressed1_shuffled.h5\n",
      "-rw-r--r--  1 boyle  staff  230814 Oct 17 14:12 tmp_compressed1_shuffled_transposed.h5\n",
      "-rw-r--r--  1 boyle  staff  253798 Oct 17 14:12 tmp_compressed4.h5\n",
      "-rw-r--r--  1 boyle  staff  265700 Oct 17 14:12 tmp_compressed4_shuffled.h5\n",
      "-rw-r--r--  1 boyle  staff  230047 Oct 17 14:12 tmp_compressed4_shuffled_transposed.h5\n",
      "-rw-r--r--  1 boyle  staff  253634 Oct 17 14:12 tmp_compressed9.h5\n",
      "-rw-r--r--  1 boyle  staff  265342 Oct 17 14:12 tmp_compressed9_shuffled.h5\n",
      "-rw-r--r--  1 boyle  staff  229817 Oct 17 14:12 tmp_compressed9_shuffled_transposed.h5\n"
     ]
    }
   ],
   "source": [
    "# Write the array to disk with various compression settings\n",
    "\n",
    "with TemporaryDirectory() as directory:\n",
    "    np.savez(directory+'/tmp', a=tmp)\n",
    "    np.savez_compressed(directory+'/tmp_compressed', a=tmp)\n",
    "    with h5py.File(directory+'/tmp.h5', \"w\") as f:\n",
    "        f.create_dataset(\"a\", data=tmp)\n",
    "    # gzip levels 1, 4, and 9, each written unshuffled, shuffled, and shuffled after transposing\n",
    "    variants = [('', tmp, False), ('_shuffled', tmp, True), ('_shuffled_transposed', tmp.T, True)]\n",
    "    for level in (1, 4, 9):\n",
    "        for suffix, data, shuffle in variants:\n",
    "            with h5py.File(directory+'/tmp_compressed'+str(level)+suffix+'.h5', \"w\") as f:\n",
    "                f.create_dataset(\"a\", data=data, compression=\"gzip\", compression_opts=level, shuffle=shuffle)\n",
    "\n",
    "    ! ls -l {directory}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The spinor array has half of its elements set to zero and the other half set to random numbers.  Random values are essentially incompressible, so no matter how clever we are about storing only the nonzero elements, we can't reasonably expect to get much below about 50% of the original size.  Unsurprisingly, every compressed option above lands close to that figure, at between roughly 45% and 52% of the uncompressed file.  This suggests that there's no use being clever about sparsity, because compression does very well on its own.\n",
    "\n",
    "I'm a little surprised to see that `npz` compression beats the plain (unshuffled) `h5` compression, by about 2% even at gzip level 9, especially since both are based on the same DEFLATE algorithm used by ZIP; I guess whatever it's doing is pretty clever.\n",
    "\n",
    "As expected, the compression level really doesn't make much difference: going from level 1 to level 4 (the default level) shrinks the unshuffled file by ~0.5%, and going on to level 9 gains only an additional ~0.06%.\n",
    "\n",
    "Shuffling is a very clever trick that improves compression of reasonably continuous data with essentially no computational cost: it regroups the bytes of the stored values so that the first bytes of all elements sit together, then the second bytes, and so on.  I've used it very successfully in other scenarios.  Here, it actually makes compression *worse* (by about 5%) if applied in the natural way, but significantly improves the results (by about 9%) if the array is transposed first.  The reason is that the data *on disk* are more continuous when each all-zero column is stored as one contiguous run instead of being interleaved with the random values.  I guess it would be a good idea to add an option to use shuffling, but maybe just leave it off by default."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
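
The shuffle-versus-transpose effect described in the notebook's final markdown cell can be explored without h5py. The following is a minimal sketch, not the notebook's method: it uses only the standard library, applies a byte shuffle to the whole buffer at once (HDF5 applies its shuffle filter per chunk, so the numbers will not match the listing above), and compresses with `zlib` at level 4. The names `pack`, `byte_shuffle`, and `deflate_size` are hypothetical helpers introduced for illustration, and the seed is arbitrary.

```python
import random
import struct
import zlib
from math import comb

d, n_spinors = 6, 1000
rng = random.Random(0)  # arbitrary seed, only for repeatability

# Columns holding even-grade components (grades 0, 2, 4, 6): 32 of the 64 columns.
offsets = [sum(comb(d, l) for l in range(g)) for g in range(d + 2)]
even_cols = {k for g in range(0, d + 1, 2)
             for k in range(offsets[g], offsets[g + 1])}

# Row-major table: one row per spinor, zeros in the odd-grade columns.
rows = [[rng.random() if k in even_cols else 0.0 for k in range(2**d)]
        for _ in range(n_spinors)]
cols = [list(c) for c in zip(*rows)]  # the same data, transposed


def pack(table):
    """Serialize a 2-D list as little-endian float64 in row-major order."""
    flat = [x for row in table for x in row]
    return struct.pack(f"<{len(flat)}d", *flat)


def byte_shuffle(buf, itemsize=8):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(buf[k::itemsize] for k in range(itemsize))


def deflate_size(buf, level=4):
    """Length in bytes of the DEFLATE-compressed buffer."""
    return len(zlib.compress(buf, level))


raw_rows, raw_cols = pack(rows), pack(cols)
sizes = {
    "row-major": deflate_size(raw_rows),
    "row-major+shuffled": deflate_size(byte_shuffle(raw_rows)),
    "transposed": deflate_size(raw_cols),
    "transposed+shuffled": deflate_size(byte_shuffle(raw_cols)),
}
for name, size in sizes.items():
    print(f"{name:22s} {size:7d} bytes ({size / len(raw_rows):.1%} of raw)")
```

Under these assumptions the transposed and shuffled layout should come out smallest, because every byte plane then consists of long runs that are either all zero or all drawn from the same byte position of the random values. How the row-major shuffled layout compares with the unshuffled one depends on the compressor and on chunking, so it is worth checking against real HDF5 output before drawing conclusions.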