@martindurant
Last active January 18, 2017 15:57
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import dask.array as da \n",
"import xarray as xr\n",
"import zarr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets get some real data: bathimetry from netCDF examples at https://www.unidata.ucar.edu/software/netcdf/\n",
"\n",
"\n",
" $ wget https://www.unidata.ucar.edu/software/netcdf/examples/smith_sandwell_topo_v8_2.nc\n",
"\n",
"Using the `chunks` parameter references the data as a dask array."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"arr = xr.open_dataset('smith_sandwell_topo_v8_2.nc', chunks={'latitude': 6336//11, 'longitude': 10800//15}).ROSE\n",
"darr = arr.data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>\n",
"dask.array<smith_s..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>\n",
"Coordinates:\n",
" * latitude (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...\n",
" * longitude (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...\n",
"Attributes:\n",
" valid_range: [-32766 32767]\n",
" long_name: Topography and Bathymetry ( 8123m -> -10799m)\n",
" units: meters\n",
" unpacked_missing_value: -32767.0 \n",
"\n",
" dask.array<smith_s..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>\n"
]
}
],
"source": [
"print(arr, '\\n\\n', darr)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(-2184.878420854962)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example operation\n",
"arr.mean().values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First a couple of functions to do zarr<->dask-array. This is already mentioned in the [docs](http://dask.pydata.org/en/latest/array-creation.html), but these would feel very natural as methods."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def to_zarr(darr, url, compressor='default', ret=False):\n",
" for c in darr.chunks:\n",
" if len(set(c)) != 1:\n",
" # I believe arbitrary chunking is not possible\n",
" raise ValueError(\"Must use regular chunking for zarr; %s\"\n",
" % darr.chunks)\n",
" chunks = [c[0] for c in darr.chunks]\n",
" out = zarr.open_array(url, mode='w', shape=darr.shape,\n",
" chunks=chunks, dtype=darr.dtype,\n",
" compressor=compressor)\n",
" da.store(darr, out)\n",
" if ret:\n",
" return out\n",
"\n",
"def from_zarr(url, retstore=False):\n",
" # or perhaps \"read_zarr\"\n",
" d = zarr.open_array(url)\n",
" out = da.from_array(d, chunks=d.chunks)\n",
" if retstore:\n",
" return out, d.store\n",
" return out"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# array to zarr\n",
"to_zarr(darr, 'out.zarr', compressor=zarr.Blosc())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-2184.8784208549619"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# zarr to dask array, simple computation\n",
"from_zarr('out.zarr').mean().compute()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Array((6336, 10800), float64, chunks=(576, 720), order=C)\n",
" nbytes: 522.1M; nbytes_stored: 103.8M; ratio: 5.0; initialized: 165/165\n",
" compressor: Blosc(cname='lz4', clevel=5, shuffle=1)\n",
" store: DirectoryStore"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# or can use the zarr directory directly \n",
"zarr.open_array('out.zarr')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, a shim for xarray"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"def xarray_to_zarr(arr, url, **kwargs):\n",
" coorddict = [(name, arr.coords[name].values) for name in arr.dims]\n",
" z = to_zarr(arr.data, url, compressor=kwargs.get('compressor', 'default'),\n",
" ret=True)\n",
" z.store['.xarray'] = pickle.dumps({'coords': coorddict, 'attrs': arr.attrs,\n",
" 'name': arr.name, 'dims': arr.dims})\n",
"\n",
"def xarray_from_zarr(url):\n",
" z, store = from_zarr(url, True)\n",
" meta = pickle.loads(store['.xarray'])\n",
" out = xr.DataArray(z, coords=meta['coords'], name=meta['name'],\n",
" attrs=meta['attrs'], dims=meta['dims'])\n",
" return out"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# array to zarr\n",
"xarray_to_zarr(arr, 'out.xarr')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>\n",
"dask.array<array-b..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>\n",
"Coordinates:\n",
" * latitude (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...\n",
" * longitude (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...\n",
"Attributes:\n",
" valid_range: [-32766 32767]\n",
" long_name: Topography and Bathymetry ( 8123m -> -10799m)\n",
" units: meters\n",
" unpacked_missing_value: -32767.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# zarr to zarray\n",
"out = xarray_from_zarr('out.xarr')\n",
"out"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dask.array<array-b..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The data is a dask array \n",
"out.data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(-2184.878420854962)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"out.mean().values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Further thoughts:\n",
"- this is storing a single data array, not a data set. zarr does have the concept groups, but writing a directory containing many zarrs would be easy enough to implement\n",
"- as hinted in the functions, the location could be an arbitrary URL with backends such as s3fs and hdfs3 using the same model as in dask's read and write functions\n",
"- the coordinate arrays and metadata are simply pickled, because they need to be loaded back all in one go, so there seems no reason to worry too hard about packing this information carefully."
]
}
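,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of the first point above: a data *set* could be stored as a directory of zarr arrays, one per variable, reusing the functions defined earlier. The names `dataset_to_zarr` and `dataset_from_zarr` are hypothetical, and this assumes a local filesystem and dask-backed variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"\n",
"def dataset_to_zarr(ds, url, **kwargs):\n",
"    # one subdirectory of zarr data per variable\n",
"    for name in ds.data_vars:\n",
"        xarray_to_zarr(ds[name], os.path.join(url, name), **kwargs)\n",
"\n",
"def dataset_from_zarr(url):\n",
"    # every subdirectory is assumed to hold exactly one variable\n",
"    arrays = {name: xarray_from_zarr(os.path.join(url, name))\n",
"              for name in os.listdir(url)}\n",
"    return xr.Dataset(arrays)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the second point: zarr accepts any `MutableMapping` as a store, so a remote backend such as s3fs slots in directly through its key-value interface. The bucket name below is a placeholder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import s3fs\n",
"\n",
"# a key-value view of a bucket prefix; zarr treats it like any other store\n",
"s3 = s3fs.S3FileSystem()\n",
"store = s3fs.S3Map('mybucket/out.zarr', s3=s3)\n",
"z = zarr.open_array(store, mode='w', shape=darr.shape,\n",
"                    chunks=[c[0] for c in darr.chunks],\n",
"                    dtype=darr.dtype)\n",
"da.store(darr, z)"
]
}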
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
@martindurant (Author)

@shoyer, @alimanfoo
Since I had been working on fastparquet as standard storage for tabular data, I have also been thinking about a standard format for array data for dask. netCDF and HDF are good legacy archival formats, but they don't play nicely with parallel access across a cluster or from an archive store like S3. zarr is certainly non-standard, but it would make a very nice internal store for intermediates. This gist is a simple demonstration that we could use zarr not only for dask but for xarray too, without too much expenditure of effort.

@martindurant (Author)

@mrocklin, updated with some real data.
