@rsignell-usgs
Created January 18, 2020 15:34
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring MUR data storage options "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import xarray as xr\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"a sample original NetCDF4/HDF5 data file from [NASA PODAAC](https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2003/001):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"ncfile = '20030101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"356.798394"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_size_in_mb = os.stat(ncfile).st_size/1e6\n",
"file_size_in_mb"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'zlib': True,\n",
" 'shuffle': True,\n",
" 'complevel': 7,\n",
" 'fletcher32': False,\n",
" 'contiguous': False,\n",
" 'chunksizes': (1, 1023, 2047),\n",
" 'source': '/home/jovyan/notebooks/20030101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',\n",
" 'original_shape': (1, 17999, 36000),\n",
" 'dtype': dtype('int16'),\n",
" '_FillValue': -32768,\n",
" 'scale_factor': 0.001,\n",
" 'add_offset': 298.15,\n",
" 'coordinates': 'lon lat'}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds = xr.open_dataset(ncfile)\n",
"ds['analysed_sst'].encoding"
]
},
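{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check (added here for context, not in the original run): each int16 chunk of shape (1, 1023, 2047) is about 4 MB uncompressed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# uncompressed bytes per chunk: product of the chunk dims times 2 bytes per int16 value\n",
"1 * 1023 * 2047 * 2 / 1e6"
]
},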
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Open dataset with variable types and chunking used in NetCDF4/HDF5 file\n",
"ds = xr.open_dataset(ncfile, mask_and_scale=False, decode_times=False, decode_cf=False, \n",
" chunks={'time':1, 'lat':1023, 'lon':2047})"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3888.0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uncompressed_size_in_mb = ds.nbytes/1e6\n",
"uncompressed_size_in_mb"
]
},
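{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a sketch, not in the original run), the total can be broken down by variable; the variable list depends on the MUR file contents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# per-variable uncompressed size in MB; these should sum to roughly ds.nbytes above\n",
"{v: ds[v].nbytes/1e6 for v in ds.data_vars}"
]
},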
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### try writing to zarr using default compression"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.backends.zarr.ZarrStore at 0x7f6129c69ca8>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.to_zarr('zarr_test1', consolidated=True, mode='w')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"ds1 = xr.open_zarr('zarr_test1', consolidated=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'chunks': (1, 1023, 2047),\n",
" 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),\n",
" 'filters': None,\n",
" '_FillValue': -32768,\n",
" 'scale_factor': 0.001,\n",
" 'add_offset': 298.15,\n",
" 'dtype': dtype('int16'),\n",
" 'coordinates': 'lon lat'}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds1['analysed_sst'].encoding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### try writing to zarr using same compression as in NetCDF4/HDF5 file"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import zarr\n",
"compressor = zarr.Blosc(cname='zlib', clevel=7, shuffle=1)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# set compression for all data variables (the coordinate variables will be tiny in aggregated dataset)\n",
"encoding={}\n",
"for v in ds.data_vars:\n",
" encoding[v]={'compressor':compressor}"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.backends.zarr.ZarrStore at 0x7f6158277990>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.to_zarr('zarr_test2', consolidated=True, encoding=encoding, mode='w')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'chunks': (1, 1023, 2047),\n",
" 'compressor': Blosc(cname='zlib', clevel=7, shuffle=SHUFFLE, blocksize=0),\n",
" 'filters': None,\n",
" '_FillValue': -32768,\n",
" 'scale_factor': 0.001,\n",
" 'add_offset': 298.15,\n",
" 'dtype': dtype('int16'),\n",
" 'coordinates': 'lon lat'}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds2 = xr.open_zarr('zarr_test2', consolidated=True)\n",
"ds2['analysed_sst'].encoding"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.25 s ± 34.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit ds1['analysed_sst'].mean(dim=['lon','lat','time']).compute()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"468M\tzarr_test1\n"
]
}
],
"source": [
"!du -sbh zarr_test1"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.15 s ± 67.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit ds2['analysed_sst'].mean(dim=['lon','lat','time']).compute()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"348M\tzarr_test2\n"
]
}
],
"source": [
"!du -sbh zarr_test2"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.3109243697478992"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"468/357"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.4"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"3.15/2.25"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the default compression scheme for Zarr results in dataset size that is 31% larger than the original NetCDF file. If we use the same compression options as the original file, the dataset size is actually slightly smaller than the original NetCDF. But it takes the user 40% longer to access the data. "
]
},
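{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible next experiment (a sketch, not run here) would be a compressor aimed at a better size/speed trade-off, e.g. Blosc with zstd instead of zlib or lz4, following the same pattern as above. 'zarr_test3' is a hypothetical output path; size and read time would need to be measured as before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch (not run): Blosc+zstd often compresses better than lz4 at comparable speed\n",
"compressor3 = zarr.Blosc(cname='zstd', clevel=5, shuffle=1)\n",
"encoding3 = {v: {'compressor': compressor3} for v in ds.data_vars}\n",
"ds.to_zarr('zarr_test3', consolidated=True, encoding=encoding3, mode='w')"
]
}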
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}