Pangeo OSN Data Guide
This is a rough guide to using OSN with Pangeo tools.
Open Storage Network
Open Storage Network (OSN) is a distributed data service to support active data sharing and transfer between academic institutions, leveraging existing NSF-funded resources. It is funded by NSF and the Schmidt Futures Foundation.
Our allocation is on the OSN pod at NCSA in Illinois. This pod has a very high-bandwidth internet connection and should provide good transfer rates to all Pangeo cloud hubs.
Cloud Object Storage
OSN is a cloud object storage service. Object storage is a technology that allows data to be read and written via HTTP calls. For a basic primer on object storage, this document may be a useful reference.
OSN is configured to be compatible with the most common object storage API: Amazon S3. Therefore, the Amazon S3 Documentation is also a useful reference. To configure your computer to talk to OSN, you need a few key details: the endpoint URL (https://ncsa.osn.xsede.org) and a pair of access credentials (an access key ID and a secret access key).
The credentials are required for writing but not reading.
All data will be stored under the bucket Pangeo.
In contrast to more user-friendly cloud storage services like Google Drive, Dropbox, etc., there is no pretty website for browsing the data. You interact with OSN through command-line tools and code.
To work effectively with object storage, it is desirable to use a "cloud-optimized" data format. For us, this means transforming our NetCDF data into Zarr. (This AGU Talk provides a simple introduction to Zarr.) The Pangeo Guide to Preparing Cloud-Optimized Data is a useful overview of the process of creating Zarr stores from NetCDF data and putting them in the cloud, although some aspects of that guide have to be changed to work with OSN.
Guide for Usage of OSN
Here we attempt to provide a compact practical guide for users interacting with OSN. Please feel free to suggest changes / additions where anything is not clear.
Step 1: Create the Zarr Data
This step is run wherever the original data live (e.g. on a GFDL supercomputer). For this part, you can follow the Pangeo Guide steps 1 and 2 exactly. We recommend using the Python package Xarray for this step. (The Xarray Zarr Documentation may also be helpful here.) The most important parameter you will have to consider is the chunk size / shape of the data variables. Aim for chunks of roughly 100 MB in size. It is most common to apply chunking along the time dimension. Chunking in space can also be used for very high-resolution datasets. When this step is complete, you will have one or more Zarr stores on disk. Note that Zarr also makes an excellent on-disk analysis-ready format. Many users prefer to transform all their NetCDF data to Zarr, even if not using the cloud.
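Once the grid size is known, the 100 MB target translates into a number of time steps per chunk. The sketch below does that arithmetic. It is only an illustration: time_steps_per_chunk is a hypothetical helper (not part of Xarray or Zarr), it assumes each chunk spans the full horizontal grid, and it takes 100 MB to mean 100 * 2**20 bytes.

```python
def time_steps_per_chunk(ny, nx, itemsize=4, target_bytes=100 * 2**20):
    """Number of time steps per chunk so one chunk is at most about target_bytes.

    Assumes each chunk covers the full (ny, nx) grid. itemsize is the number
    of bytes per value (4 for float32, 8 for float64).
    """
    bytes_per_step = ny * nx * itemsize
    # Always keep at least one time step, even if a single step exceeds the target.
    return max(1, target_bytes // bytes_per_step)


# Hypothetical 1/4-degree global grid stored as float32.
n = time_steps_per_chunk(ny=1080, nx=1440)
print(n)
```

The result can then be handed to Xarray, for example ds.chunk({"time": n}) before calling ds.to_zarr(...), assuming the time dimension is named "time".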
Step 2: Upload to OSN
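This step uses the AWS command line interface (the aws command) together with the awscli-plugin-endpoint plugin. First install the plugin:
$ pip install awscli-plugin-endpoint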
Then enable the plugin and create a configuration profile for OSN.
You do this by editing the file $HOME/.aws/config and adding a section as follows:
[plugins]
endpoint = awscli_plugin_endpoint

[profile osn]
aws_access_key_id=<email Ryan for secrets>
aws_secret_access_key=<email Ryan for secrets>
s3 =
    endpoint_url = https://ncsa.osn.xsede.org
s3api =
    endpoint_url = https://ncsa.osn.xsede.org
You can now upload your data to OSN with a command like the following:
$ aws s3 --profile osn cp --recursive /local/path s3://Pangeo/<dataset name>
<dataset name> is a unique identifier for your dataset. It can contain / characters in order to organize the data into sub-directories.
We have not yet worked out how to organize the data within the bucket, so for now just use your best judgement.
Note: there is only one account for the entire bucket. That means that, if you have the read-write credentials, you can potentially delete / overwrite data created by other people. Please be very careful!
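Because everyone shares one set of credentials, it may be worth checking that a dataset name is unused before uploading to it. The sketch below is one possible guard, not an established part of this workflow: existing_keys and check_before_upload are hypothetical helper names, and fs is assumed to be an fsspec-style filesystem (for example an s3fs.S3FileSystem pointed at the OSN endpoint). Only its exists() and find() methods are used.

```python
def existing_keys(fs, target):
    """Return the object keys already stored under target (empty list if none).

    fs is any fsspec-style filesystem object; target is a path such as
    'Pangeo/<dataset name>'.
    """
    if not fs.exists(target):
        return []
    return sorted(fs.find(target))


def check_before_upload(fs, target):
    """Raise instead of silently overwriting data that is already there."""
    keys = existing_keys(fs, target)
    if keys:
        raise FileExistsError(
            f"{target} already holds {len(keys)} object(s); "
            "choose another dataset name"
        )
```

Calling check_before_upload(fs, 'Pangeo/<dataset name>') before running the aws s3 cp command would then stop with an error if anything is already stored under that name.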
Step 3: Verify your upload
First we check that the files are there using the CLI:
$ aws s3 --profile osn ls --recursive s3://Pangeo/<dataset name>
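For stores with many chunk files, comparing the listing by eye is tedious. The following sketch compares the files on disk with a list of remote keys and reports anything that did not make it. The helper names are hypothetical, and remote_keys is assumed to be a list of full key paths such as 'Pangeo/<dataset name>/.zgroup' (for example as returned by an s3fs filesystem's find() method).

```python
import os


def local_relative_paths(local_root):
    """Collect every file under local_root as a '/'-separated relative path."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), local_root)
            paths.add(rel.replace(os.sep, "/"))
    return paths


def missing_from_remote(local_root, remote_keys, remote_prefix):
    """Return local files that have no counterpart under remote_prefix."""
    prefix = remote_prefix.rstrip("/") + "/"
    # Strip the prefix so remote keys are comparable with local relative paths.
    remote = {k[len(prefix):] for k in remote_keys if k.startswith(prefix)}
    return sorted(local_relative_paths(local_root) - remote)
```

An empty list from missing_from_remote('/local/path', remote_keys, 'Pangeo/<dataset name>') suggests that every local file has a remote counterpart. It compares names only, not file contents.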
Next we check that our uploaded data is readable from Python. For this we need the following Python packages installed: xarray, zarr, and s3fs.
Here is some code to open a dataset from OSN.
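import s3fs
import xarray as xr
import zarr

endpoint_url = 'https://ncsa.osn.xsede.org'
# anonymous access is enough for reading; credentials are only needed for writing
fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={'endpoint_url': endpoint_url},
)
mapper = fs.get_mapper('Pangeo/<dataset name>')
# open the data from Zarr (more low-level)
zarr_group = zarr.open_consolidated(mapper)
# open the data from Xarray (recommended)
ds = xr.open_zarr(mapper, consolidated=True)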
Inspect ds and ensure it looks right.
Note that not all the data is actually downloaded when you open a dataset.
It is downloaded "lazily", i.e. only when needed for computation or plotting, or when explicitly requested via