@rsignell-usgs
Created January 18, 2020 15:34
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring MUR data storage options "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import xarray as xr\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"a sample original NetCDF4/HDF5 data file from [NASA PODAAC](https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2003/001):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"ncfile = '20030101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"356.798394"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_size_in_mb = os.stat(ncfile).st_size/1e6\n",
"file_size_in_mb"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'zlib': True,\n",
" 'shuffle': True,\n",
" 'complevel': 7,\n",
" 'fletcher32': False,\n",
" 'contiguous': False,\n",
" 'chunksizes': (1, 1023, 2047),\n",
" 'source': '/home/jovyan/notebooks/20030101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',\n",
" 'original_shape': (1, 17999, 36000),\n",
" 'dtype': dtype('int16'),\n",
" '_FillValue': -32768,\n",
" 'scale_factor': 0.001,\n",
" 'add_offset': 298.15,\n",
" 'coordinates': 'lon lat'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds = xr.open_dataset(ncfile)\n",
"ds['analysed_sst'].encoding"
]
},
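{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check (added here for context, not in the original run): each int16 chunk of shape (1, 1023, 2047) is about 4 MB uncompressed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# uncompressed bytes per chunk: product of the chunk dims times 2 bytes per int16 value\n",
"1 * 1023 * 2047 * 2 / 1e6"
]
},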
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Open dataset with variable types and chunking used in NetCDF4/HDF5 file\n",
"ds = xr.open_dataset(ncfile, mask_and_scale=False, decode_times=False, decode_cf=False, \n",
" chunks={'time':1, 'lat':1023, 'lon':2047})"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3888.0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uncompressed_size_in_mb = ds.nbytes/1e6\n",
"uncompressed_size_in_mb"
]
},
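{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a sketch, not in the original run), the total can be broken down by variable; the variable list depends on the MUR file contents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# per-variable uncompressed size in MB; these should sum to roughly ds.nbytes above\n",
"{v: ds[v].nbytes/1e6 for v in ds.data_vars}"
]
},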
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### try writing to zarr using default compression"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.backends.zarr.ZarrStore at 0x7f6129c69ca8>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.to_zarr('zarr_test1', consolidated=True, mode='w')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"ds1 = xr.open_zarr('zarr_test1', consolidated=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'chunks': (1, 1023, 2047),\n",
" 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),\n",
" 'filters': None,\n",
" '_FillValue': -32768,\n",
" 'scale_factor': 0.001,\n",
" 'add_offset': 298.15,\n",
" 'dtype': dtype('int16'),\n",
" 'coordinates': 'lon lat'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds1['analysed_sst'].encoding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### try writing to zarr using same compression as in NetCDF4/HDF5 file"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import zarr\n",
"compressor = zarr.Blosc(cname='zlib', clevel=7, shuffle=1)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# set compression for all data variables (the coordinate variables will be tiny in aggregated dataset)\n",
"encoding={}\n",
"for v in ds.data_vars:\n",
" encoding[v]={'compressor':compressor}"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.backends.zarr.ZarrStore at 0x7f6158277990>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.to_zarr('zarr_test2', consolidated=True, encoding=encoding, mode='w')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'chunks': (1, 1023, 2047),\n",
" 'compressor': Blosc(cname='zlib', clevel=7, shuffle=SHUFFLE, blocksize=0),\n",
" 'filters': None,\n",
" '_FillValue': -32768,\n",
" 'scale_factor': 0.001,\n",
" 'add_offset': 298.15,\n",
" 'dtype': dtype('int16'),\n",
" 'coordinates': 'lon lat'}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds2 = xr.open_zarr('zarr_test2', consolidated=True)\n",
"ds2['analysed_sst'].encoding"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.25 s ± 34.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit ds1['analysed_sst'].mean(dim=['lon','lat','time']).compute()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"468M\tzarr_test1\n"
]
}
],
"source": [
"!du -sbh zarr_test1"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.15 s ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit ds2['analysed_sst'].mean(dim=['lon','lat','time']).compute()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"348M\tzarr_test2\n"
]
}
],
"source": [
"!du -sbh zarr_test2"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.3109243697478992"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"468/357"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.4"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"3.15/2.25"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the default compression scheme for Zarr results in dataset size that is 31% larger than the original NetCDF file. If we use the same compression options as the original file, the dataset size is actually slightly smaller than the original NetCDF. But it takes the user 40% longer to access the data. "
]
},
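{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible next experiment (a sketch, not run here) would be a compressor aimed at a better size/speed trade-off, e.g. Blosc with zstd instead of zlib or lz4, following the same pattern as above. 'zarr_test3' is a hypothetical output path; size and read time would need to be measured as before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch (not run): Blosc+zstd often compresses better than lz4 at comparable speed\n",
"compressor3 = zarr.Blosc(cname='zstd', clevel=5, shuffle=1)\n",
"encoding3 = {v: {'compressor': compressor3} for v in ds.data_vars}\n",
"ds.to_zarr('zarr_test3', consolidated=True, encoding=encoding3, mode='w')"
]
}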
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}