@arbennett
Created August 26, 2022 20:28
{
"cells": [
{
"cell_type": "markdown",
"id": "3ff4a436-7732-4571-bd9d-e6d3d778d661",
"metadata": {},
"source": [
"# Using data from Pangeo-Forge\n",
"\n",
"I've been following the development of [Pangeo-Forge](https://pangeo-forge.org/) for a while and use a lot of their data/code stack in my research. This year at SciPy, Ryan Abernathey gave a [talk](https://www.youtube.com/watch?v=sY20UpYCAEE) showing an example of actually using it, and I was pretty blown away by how easy it has become. So, I reproduced his example and extended it a bit. Briefly, this notebook pulls in [NOAA Daily Global Precipitation data](https://www.ncei.noaa.gov/products/climate-data-records/precipitation-gpcp-daily), computes a seasonal climatology, and extracts a timeseries for the gridcell closest to Tucson.\n",
"\n",
"First we do some installs and imports."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2beb4c6c-3b13-4d7d-8f84-b3b99887a914",
"metadata": {},
"outputs": [],
"source": [
"!pip install xarray dask zarr fsspec aiohttp requests matplotlib\n",
"\n",
"import xarray as xr\n",
"import matplotlib.pyplot as plt\n",
"from dask.diagnostics import ProgressBar"
]
},
{
"cell_type": "markdown",
"id": "0ab8adaa-5d7b-4d70-9f06-b9ac0671d5b8",
"metadata": {},
"source": [
"# Accessing the data\n",
"To access the data, all we need to do is pass the URL of the dataset on Pangeo Forge straight into xarray and we are good to go. Note this dataset is about 2.5 GB, yet it opens essentially instantly. This is because xarray is lazy and only reads the metadata; data is pulled over the network only when we actually compute something with it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19e5efe6-ca1c-41bd-99ed-87cb5b449dd7",
"metadata": {},
"outputs": [],
"source": [
"store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'\n",
"ds = xr.open_dataset(store, engine='zarr', chunks={})\n",
"ds"
]
},
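{
"cell_type": "markdown",
"id": "f0c1d2e3-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"As a quick sanity check of the lazy-loading claim (a small addition of mine, not from the talk), we can report the dataset's total size and its chunk layout without transferring any of the actual data, using standard xarray/dask attributes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0c1d2e3-1111-4111-8111-000000000002",
"metadata": {},
"outputs": [],
"source": [
"# Total size if fully loaded, vs. the chunked layout dask will stream\n",
"print(f\"{ds.nbytes / 1e9:.2f} GB total\")\n",
"print(ds['precip'].chunks)"
]
},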
{
"cell_type": "markdown",
"id": "63679ebb-ee58-49ab-8567-1856aa28b202",
"metadata": {
"tags": []
},
"source": [
"# Calculating the seasonal climatology\n",
"\n",
"Next we pull the data and calculate the mean precipitation for each season. We have to mask out some values because of data quality issues in this dataset, which Ryan mentioned in his talk; I just followed his masking scheme. This computes in ~25 seconds. Not bad for a global dataset on a Binder instance with only 2 GB of memory!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85aae2eb-5789-4146-8db3-be59f6996b84",
"metadata": {},
"outputs": [],
"source": [
"with ProgressBar():\n",
"    # Mask out physically impossible values (data quality issues noted in Ryan's talk)\n",
"    mask = (ds['precip'] >= 0) & (ds['precip'] < 1000)\n",
"    climatology = (ds['precip'].where(mask)\n",
"                   .groupby(ds['time'].dt.season)\n",
"                   .mean(dim='time')\n",
"                   .compute())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8097afe1-474f-4ca1-a718-769c57810240",
"metadata": {},
"outputs": [],
"source": [
"climatology.plot(col='season', col_wrap=2, cmap='turbo', figsize=(12, 8))"
]
},
{
"cell_type": "markdown",
"id": "28458928-1bc5-49e7-a990-b8a590365886",
"metadata": {},
"source": [
"# Pulling a single timeseries\n",
"\n",
"Next I wanted to see how easy it is to pull a single pixel's timeseries, so naturally I chose Tucson. Again this takes ~20 s to pull the data, and then I made some plots."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8123837a-af84-4527-8e0f-4e2e0c9495b8",
"metadata": {},
"outputs": [],
"source": [
"# Tucson sits at 32.2540 N, 110.9742 W; this dataset's longitudes run 0-360, so convert\n",
"lat, lon = 32.2540, 360 - 110.9742\n",
"tucson_precip = ds['precip'].sel(latitude=lat, longitude=lon, method='nearest')\n",
"\n",
"with ProgressBar():\n",
"    mask = (tucson_precip >= 0) & (tucson_precip < 1000)\n",
"    tucson_precip = tucson_precip.where(mask, other=0.0)\n",
"    tucson_precip = tucson_precip.compute()\n",
"    tucson_smoothed_precip = tucson_precip.rolling(time=30).mean().compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4047d593-c997-4610-aeb5-c7c80c45797c",
"metadata": {},
"outputs": [],
"source": [
"tucson_precip.plot(label='daily', figsize=(12, 4))\n",
"tucson_smoothed_precip.plot(label='30-day moving average', linewidth=2)\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"id": "d6e7824f-b3a6-4bdd-88a5-e83977bb8b87",
"metadata": {},
"source": [
"# Other datasets\n",
"\n",
"I looked around and it seems there are about [10 datasets available](https://pangeo-forge.org/catalog) at the time of writing, so it's still a pretty new archive. But supposedly it's relatively straightforward to contribute datasets if you have a way to publicly access the data you want to contribute."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}