@martindurant
Last active January 18, 2017 15:57
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import dask.array as da \n",
"import xarray as xr\n",
"import zarr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets get some real data: bathimetry from netCDF examples at https://www.unidata.ucar.edu/software/netcdf/\n",
"\n",
"\n",
" $ wget https://www.unidata.ucar.edu/software/netcdf/examples/smith_sandwell_topo_v8_2.nc\n",
"\n",
"Using the `chunks` parameter references the data as a dask array."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"arr = xr.open_dataset('smith_sandwell_topo_v8_2.nc', chunks={'latitude': 6336//11, 'longitude': 10800//15}).ROSE\n",
"darr = arr.data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>\n",
"dask.array<smith_s..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>\n",
"Coordinates:\n",
" * latitude (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...\n",
" * longitude (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...\n",
"Attributes:\n",
" valid_range: [-32766 32767]\n",
" long_name: Topography and Bathymetry ( 8123m -> -10799m)\n",
" units: meters\n",
" unpacked_missing_value: -32767.0 \n",
"\n",
" dask.array<smith_s..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>\n"
]
}
],
"source": [
"print(arr, '\\n\\n', darr)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(-2184.878420854962)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example operation\n",
"arr.mean().values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First a couple of functions to do zarr<->dask-array. This is already mentioned in the [docs](http://dask.pydata.org/en/latest/array-creation.html), but these would feel very natural as methods."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def to_zarr(darr, url, compressor='default', ret=False):\n",
" for c in darr.chunks:\n",
" if len(set(c)) != 1:\n",
" # I believe arbitrary chunking is not possible\n",
" raise ValueError(\"Must use regular chunking for zarr; %s\"\n",
" % darr.chunks)\n",
" chunks = [c[0] for c in darr.chunks]\n",
" out = zarr.open_array(url, mode='w', shape=darr.shape,\n",
" chunks=chunks, dtype=darr.dtype,\n",
" compressor=compressor)\n",
" da.store(darr, out)\n",
" if ret:\n",
" return out\n",
"\n",
"def from_zarr(url, retstore=False):\n",
" # or perhaps \"read_zarr\"\n",
" d = zarr.open_array(url)\n",
" out = da.from_array(d, chunks=d.chunks)\n",
" if retstore:\n",
" return out, d.store\n",
" return out"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# array to zarr\n",
"to_zarr(darr, 'out.zarr', compressor=zarr.Blosc())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-2184.8784208549619"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# zarr to dask array, simple computation\n",
"from_zarr('out.zarr').mean().compute()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Array((6336, 10800), float64, chunks=(576, 720), order=C)\n",
" nbytes: 522.1M; nbytes_stored: 103.8M; ratio: 5.0; initialized: 165/165\n",
" compressor: Blosc(cname='lz4', clevel=5, shuffle=1)\n",
" store: DirectoryStore"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# or can use the zarr directory directly \n",
"zarr.open_array('out.zarr')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, a shim for xarray"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"def xarray_to_zarr(arr, url, **kwargs):\n",
" coorddict = [(name, arr.coords[name].values) for name in arr.dims]\n",
" z = to_zarr(arr.data, url, compressor=kwargs.get('compressor', 'default'),\n",
" ret=True)\n",
" z.store['.xarray'] = pickle.dumps({'coords': coorddict, 'attrs': arr.attrs,\n",
" 'name': arr.name, 'dims': arr.dims})\n",
"\n",
"def xarray_from_zarr(url):\n",
" z, store = from_zarr(url, True)\n",
" meta = pickle.loads(store['.xarray'])\n",
" out = xr.DataArray(z, coords=meta['coords'], name=meta['name'],\n",
" attrs=meta['attrs'], dims=meta['dims'])\n",
" return out"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# array to zarr\n",
"xarray_to_zarr(arr, 'out.xarr')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>\n",
"dask.array<array-b..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>\n",
"Coordinates:\n",
" * latitude (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...\n",
" * longitude (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...\n",
"Attributes:\n",
" valid_range: [-32766 32767]\n",
" long_name: Topography and Bathymetry ( 8123m -> -10799m)\n",
" units: meters\n",
" unpacked_missing_value: -32767.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# zarr to zarray\n",
"out = xarray_from_zarr('out.xarr')\n",
"out"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dask.array<array-b..., shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The data is a dask array \n",
"out.data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(-2184.878420854962)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"out.mean().values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Further thoughts:\n",
"- this is storing a single data array, not a data set. zarr does have the concept groups, but writing a directory containing many zarrs would be easy enough to implement\n",
"- as hinted in the functions, the location could be an arbitrary URL with backends such as s3fs and hdfs3 using the same model as in dask's read and write functions\n",
"- the coordinate arrays and metadata are simply pickled, because they need to be loaded back all in one go, so there seems no reason to worry too hard about packing this information carefully."
]
}
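,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of the first point above: a data *set* could be stored as a directory of zarr arrays, one per variable, reusing the functions defined earlier. The names `dataset_to_zarr` and `dataset_from_zarr` are hypothetical, and this assumes a local filesystem and dask-backed variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"\n",
"def dataset_to_zarr(ds, url, **kwargs):\n",
"    # one subdirectory of zarr data per variable\n",
"    for name in ds.data_vars:\n",
"        xarray_to_zarr(ds[name], os.path.join(url, name), **kwargs)\n",
"\n",
"def dataset_from_zarr(url):\n",
"    # every subdirectory is assumed to hold exactly one variable\n",
"    arrays = {name: xarray_from_zarr(os.path.join(url, name))\n",
"              for name in os.listdir(url)}\n",
"    return xr.Dataset(arrays)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the second point: zarr accepts any `MutableMapping` as a store, so a remote backend such as s3fs slots in directly through its key-value interface. The bucket name below is a placeholder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import s3fs\n",
"\n",
"# a key-value view of a bucket prefix; zarr treats it like any other store\n",
"s3 = s3fs.S3FileSystem()\n",
"store = s3fs.S3Map('mybucket/out.zarr', s3=s3)\n",
"z = zarr.open_array(store, mode='w', shape=darr.shape,\n",
"                    chunks=[c[0] for c in darr.chunks],\n",
"                    dtype=darr.dtype)\n",
"da.store(darr, z)"
]
}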
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
@martindurant (Author)

@shoyer, @alimanfoo
Since I had been working on fastparquet as standard storage for tabular data, I have also been thinking about a standard format for array data for dask. netCDF and HDF are good legacy archival formats, but they don't play nicely with parallel access across a cluster or from an archive store like S3. zarr is certainly non-standard, but it would make a very nice internal store for intermediates. This gist is a simple demonstration that we could use zarr not only for dask but for xarray too, without too much expenditure of effort.

@martindurant (Author)

@mrocklin, updated with some real data.
