@hevgyrt
Last active August 24, 2022 12:26
Tutorial for how to create CF and ACDD compliant NetCDF files in the context of the Nansen Legacy project
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial for creating NetCDF files relevant for the Nansen Legacy project\n",
"_By [Trygve Halsne](https://www.met.no/en/search-result?q=trygve+halsne)_\n",
"\n",
"## Table of contents:\n",
"1. Introduction\n",
"2. NetCDF and metadata\n",
"3. Wave buoy example (creating a file)\n",
"4. Dataset granularity\n",
"5. Wrap-up\n",
"\n",
"# 1. Introduction\n",
"This tutoral will focus on creating NetCDF files compliant with the [Climate and Forecast](http://cfconventions.org/) (CF) convention and the [Attribute Convention for Data Discovery](http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_1-3) (ACDD). In particular, the focus will be on relevant datasets for the Nansen Legacy project. We will also briefly touch upon the discussion on how to structure a dataset in terms of granulatiry.\n",
"\n",
"## 1.1 Requirements to run the following jupyter-notebook\n",
"In this tutorial, we will basically do two things: <br>\n",
"1. Stepwise, create a CF and ACDD compliant NetCDF file using data from an already existing wave buoy dataset <br>\n",
"2. Modify an existing NetCDF file to make it comliant with the standards. \n",
" \n",
"The data for the first step is fetched by means of OPeNDAP (ie. streaming of data) and hence __no download prior to doing the excersize is needed__. OPeNDAP is one of the benefits when distributing CF compliant datasets.\n",
"\n",
"Before running this tutorial, you might also need some python packages. To create a conda environment with all the necessary packages, use the following command:\n",
"\n",
"*conda create -n nc_cf_acdd python=3.7 netCDF4=1.4.0 numpy -c anaconda xarray*\n",
"\n",
"In your terminal, activate the environment and run jupyter-notebook like:\n",
"\n",
"__conda activate nc_cf_acdd__ <br>\n",
"__jupyter-notebook__\n",
"\n",
"\n",
"# 2 NetCDF and metadata\n",
"[NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is a very convenient and powerful file format in terms of data storage and data dissemination. However, describing the actual content of the file is crucial in order to be used correctly by others. Moreover, the dataset could also be made self-describing and machine readable and thus compliant with widely used international standards. The latter is important in order to make your data visible and accessible for a wide(r) range of users. You will also contribute to make your data [FAIR](https://www.nature.com/articles/sdata201618) (i.e. Findable, Accessible, Interoperable, Reuseable), which are the guiding principles for scientific data management and stewardship. \n",
"\n",
"In order to be precise when talking about metadata, we split types of metadata into two categories: discovery metadata and use metadata:\n",
"\n",
"- __Discovery metadata__ describes e.g. the who, what, where and when about the products as well as the interfaces and access points to the data. Examples of discovery metadata standards are the GCMD DIF and ISO19115. If a NetCDF file follows ACDD, the file is compliant to the above mentioned standards which thus can be extracted from the file.\n",
"- __Use metadata__ provides a definitive description of what each variable in the dataset represents. Use metadata serves the purpose of describing the actual content of the data themselves allowing users to understand and correctly use the datasets. Examples of use metadata are units, missing values and spatio-temporal properties of the data. If a NetCDF file follows the CF convention, enough information is in place to make the file self-describing.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Wave buoy dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Create NetCDF file with minimal metadata\n",
"We use data from an already existing wave buoy dataset available from thredds.met.no, for which we create our own subset of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# importing packages\n",
"import netCDF4\n",
"from netCDF4 import Dataset\n",
"import numpy as np\n",
"import datetime\n",
"import xarray as xa\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# OPeNDAP URL to wave buoy data operated by the Norwegian Costal Administration\n",
"url = \"https://thredds.met.no/thredds/dodsC/obs/kystverketbuoy/2019/01/201901_Kystverket-Smartbuoy-Fauskane_AanderaaMotusSensor.nc\"\n",
"\n",
"# Specifying time subset\n",
"t0,tn = 0,300\n",
"\n",
"# Reading the data reading a subset of the variables\n",
"ncin = Dataset(url, 'r')\n",
"time = ncin['time'][t0:tn]\n",
"longitude = ncin['longitude'][t0:tn]\n",
"latitude = ncin['latitude'][t0:tn]\n",
"hm0 = ncin['Significant_Wave_Height_Hm0'][t0:tn]\n",
"ncin.close()\n",
"\n",
"# Creating output\n",
"test_fname = 'subset_bouy_data.nc'\n",
"\n",
"with (netCDF4.Dataset(test_fname, 'w', format='NETCDF4')) as ncout:\n",
" dim_time = ncout.createDimension('time',None) # None denotes unlimited time. Makes it possible to add data later.\n",
"\n",
" nctime = ncout.createVariable('time','f4',('time',))\n",
" nctime[:] = time[:]\n",
" \n",
" nclat = ncout.createVariable('latitude','f4',('time',))\n",
" nclon = ncout.createVariable('longitude','f4',('time',))\n",
" nclat[:]=latitude\n",
" nclon[:]=longitude\n",
"\n",
" # add variable\n",
" varout = ncout.createVariable('Hm0',np.float32, ('time',))\n",
" varout[:] = hm0\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the current state, it is evident that the file is not well described. We lack data describing the content like reference time and units. (You can check this by means of using software like ncdump). "
]
},
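{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note on the `time` variable before its metadata is added: the values are later declared as seconds since 1970 (section 3.3), which are numbers of the order 1.5e9. A 32-bit float ('f4') has a 24-bit mantissa, so values of that size are rounded to a multiple of 128 seconds, whereas a 64-bit float ('f8') stores them exactly. The following is a minimal sketch using only the Python standard library; the helper name `roundtrip_float32` and the timestamp are made-up illustrations, not values taken from the buoy file:\n",
"\n",
"```python\n",
"import struct\n",
"\n",
"def roundtrip_float32(value):\n",
"    # Pack as IEEE 754 single precision and unpack again,\n",
"    # which mimics storing the value in an 'f4' variable\n",
"    return struct.unpack('<f', struct.pack('<f', value))[0]\n",
"\n",
"# Hypothetical timestamp: 2019-01-01 00:00:30 UTC as seconds since 1970\n",
"t = 1546300830.0\n",
"t32 = roundtrip_float32(t)\n",
"\n",
"# Rounding error in seconds introduced by single precision\n",
"print(t32 - t)\n",
"```\n",
"\n",
"If the printed error matters for your sampling interval, a 64-bit type for the time variable avoids it."
]
},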
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 CF convention\n",
"Let's start with the [CF convention](http://cfconventions.org/). CF is designed to *promote the processing and sharing of files created with the NetCDF API*. It is very useful to read through the [documentation](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html). This can, however, become a bit cumbersome so we will try to cover the most important bits for our particular dataset which is:\n",
"\n",
"- use metadata for the variables,\n",
"- description of feature type,\n",
"- global attributes.\n",
"\n",
"We will go through these stepwise in the following sections."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Use metadata for variables\n",
"Use metadata can involve a number of things like flags, units, valid range of data, and scale factors depening on you product. We will, however, restric this to a minimum according to product we are dealing with which will be units, standard_name and long_name. \n",
"\n",
"To add standard_name, we should use the [*CF standard name table*](http://cfconventions.org/Data/cf-standard-names/69/build/cf-standard-name-table.html). For some specific variables, you may not find an entry in this table. Then you could contact the CF community for advice and fill the other attributes as best as you can.\n",
"\n",
"In order to add these attributes in the various variables, we do the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Re-opening the dataset\n",
"ncout = Dataset(test_fname, mode='r+') # r+ for append mode\n",
"\n",
"nctime = ncout.variables['time']\n",
"nctime.long_name = 'Time of measurements'\n",
"nctime.standard_name = 'time'\n",
"nctime.units = 'seconds since 1970-01-01 00:00:00 UTC'\n",
"nctime.calendar = 'standard' #standard = gregorian\n",
"\n",
"\n",
"nclat = ncout.variables['latitude']\n",
"nclat.standard_name = 'latitude'\n",
"nclat.units = 'degrees_north'\n",
"nclat.long_name = 'latitude'\n",
"nclat.valid_min =\"-90\"\n",
"nclat.valid_max =\"90\"\n",
"\n",
"nclon = ncout.variables['longitude']\n",
"nclon.long_name = 'longitude'\n",
"nclon.units = 'degrees_east'\n",
"nclon.standard_name = 'longitude'\n",
"nclon.valid_min =\"-180\"\n",
"nclon.valid_max =\"180\"\n",
"\n",
"hm0 = ncout.variables['Hm0']\n",
"hm0.units = \"m\"\n",
"hm0.standard_name = \"sea_surface_wave_significant_height\"\n",
"hm0.long_name = \"Significant Wave Height Hm0 estimate from spectrum\"\n",
"\n",
"# in order to explain the variable a bit more, we add the following\n",
"hm0.valid_range = np.array([0.0,30.0],dtype=np.float32)\n",
"\n",
"#ncout.variables\n",
"ncout.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Investigating the output file, it is obvious that the what and where of the data is much more clear. In order to be more precise, we should add more variables describing e.g. measurement interval etc. Have a look at the original file to see examples of this."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.4 Feature type\n",
"In general, raster data must be georeferenced in some kind of coordinate system. You can read more about that [here](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#coordinate-system) in the CF convention document. For point measurements, we already know from the lat/lon values where we are located. However, in order to be more precise on what kind of dataset we have, the CF convention has established [feature types](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_features_and_feature_types) for discrete sampling geometries. We can thus specify the discrete sampling geometry of our data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ncout = Dataset(test_fname, mode='r+') # r+ for append mode\n",
"\n",
"globalAttribs = {}\n",
"globalAttribs['featureType'] = \"timeSeries\"\n",
"\n",
"\n",
"ncout.setncatts(globalAttribs)\n",
"ncout.sync()\n",
"\n",
"ncout.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are also other feature types supporting other measurements like profiles (for e.g. weather baloons and CTDs) and time series profiles (for e.g.ADCP)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.5 Global attributes\n",
"The CF convention requires some global attributes describing the product. You can read more about this [here](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#_attributes) in the CF convention document. In the following code, we will add attributes in a new way compared with above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nowstr = datetime.datetime.utcnow().isoformat()\n",
"\n",
"ncout = Dataset(test_fname, mode='r+') # r+ for append mode\n",
"\n",
"globalAttribs = {}\n",
"globalAttribs['title'] = \"Wave buoy measurments from ...\"\n",
"globalAttribs['Conventions'] = \"CF-1.8\"\n",
"globalAttribs['institution'] = \"Norwegian Coastal Administration\"\n",
"globalAttribs['source'] = \"surface observation\"\n",
"globalAttribs['history'] = \"Subset of wave buoy data created {}.\".format(nowstr)\n",
"globalAttribs['references'] = \"http://www.datawell.nl/Portals/0/Documents/Brochures/datawell_brochure_dwr-mk3_b-09-09.pdf\"\n",
"globalAttribs['comment'] = \"Test creating NetCDF/CF data\"\n",
"\n",
"ncout.setncatts(globalAttribs)\n",
"ncout.sync()\n",
"\n",
"ncout.close()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.6 ACDD and discovery metadata\n",
"Now, the file should be CF compliant and hence be both machine readable and self-describing. However, in order to make you data discoverable (i.e. to describe the who, what, where and when for the data), the data should follow the [ACDD](http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_1-3). ACDD defines a number of global attributes grouped as __highly recommended__, __recommended__ and __suggested__. It also suggests some highly recommended variable attributes. We will encourage you to at least follow the higly recommended global attributes, but also the attributes listed [here](https://adc.met.no/node/4). Below, we show how to include some of the attributes: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ncout = Dataset(test_fname, mode='r+') # r+ for append mode\n",
"dt = float(ncout['time'][0].data)\n",
"\n",
"globalAttribs['id'] = 'preferably a UUID' #URLs, URNs, DOIs,\n",
"globalAttribs['date_created'] = datetime.datetime.utcnow().isoformat()\n",
"globalAttribs['geospatial_lat_min'] = ncout['latitude'][:].min()\n",
"globalAttribs['geospatial_lat_max'] = ncout['latitude'][:].max()\n",
"globalAttribs['geospatial_lon_min'] = ncout['longitude'][:].min()\n",
"globalAttribs['geospatial_lon_max'] = ncout['longitude'][:].max()\n",
"globalAttribs['time_coverage_start'] = (datetime.datetime(1970,1,1, 0,0,0) + datetime.timedelta(0, dt)).isoformat()\n",
"globalAttribs['Conventions'] = \"CF-1.8, ACDD-1.3\"\n",
"globalAttribs['keywords'] = ['Earth Science > Oceans > Ocean Waves']\n",
"globalAttribs['keywords_vocabulary'] = \"GCMD Science Keywords\"\n",
"\n",
"globalAttribs['license'] = \"Freely Distributed\"\n",
"globalAttribs['standard_name_vocabulary'] = 'CF Standard Name Table v70'\n",
"\n",
"ncout.setncatts(globalAttribs)\n",
"ncout.sync()\n",
"\n",
"hm0 = ncout.variables['Hm0']\n",
"hm0.coverage_content_type = \"physicalMeasurement\"\n",
"ncout.close()"
]
},
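{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above sets `time_coverage_start` only and leaves `id` as a placeholder. As a minimal sketch, assuming the time values are seconds since 1970-01-01 UTC as declared in section 3.3, the matching ACDD attribute `time_coverage_end` and a random UUID for `id` could be derived as follows. The function name `acdd_time_coverage` and the example timestamps are hypothetical:\n",
"\n",
"```python\n",
"import datetime\n",
"import uuid\n",
"\n",
"EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)\n",
"\n",
"def acdd_time_coverage(first_seconds, last_seconds):\n",
"    # Convert seconds since 1970-01-01 UTC to ISO 8601 strings\n",
"    start = EPOCH + datetime.timedelta(seconds=float(first_seconds))\n",
"    end = EPOCH + datetime.timedelta(seconds=float(last_seconds))\n",
"    return {\n",
"        'time_coverage_start': start.strftime('%Y-%m-%dT%H:%M:%SZ'),\n",
"        'time_coverage_end': end.strftime('%Y-%m-%dT%H:%M:%SZ'),\n",
"    }\n",
"\n",
"# Made-up example values: first and last timestamp of a file\n",
"attrs = acdd_time_coverage(1546300800, 1546387200)\n",
"attrs['id'] = str(uuid.uuid4())  # random UUID as dataset identifier\n",
"print(attrs)\n",
"```\n",
"\n",
"With the file from the cell above open, the two inputs would be `ncout['time'][0]` and `ncout['time'][-1]`, and the resulting dictionary could be passed to `ncout.setncatts`."
]
},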
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x7f5b070bd7b8>]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# NOW, LET'S PLOT THE DATA:\n",
"\n",
"dataset = xa.open_dataset('subset_bouy_data.nc')\n",
"dataset.Hm0.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.7 Edit existing files using NCO\n",
"Instead of creating a NetCDF file from scratch, somtimes we want to edit an existing file in order to make it compliant with the above mentioned standards. A very convenient tool for this is [NCO](http://nco.sourceforge.net/) (easy to install on at least Ubuntu OS). To give an example on __profile__ data, we use an already existing dataset from the [Ice-Tethered Profiler](http://www.whoi.edu/itp/). We have datasets from this instrument on [thredds.met.no](https://thredds.met.no/thredds/dodsC/data/met.no/itp06/itp06_itp6grd1261.nc.html). In this particular case, we would like to edit the NetCDF file adding e.g. global and variable attributes. \n",
"\n",
"We use the __nco__ attribute editor tool [ncatted](http://nco.sourceforge.net/nco.html#ncatted-netCDF-Attribute-Editor) (in a terminal/bash)."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ncatted -a featureType,global,a,c,profile -a positive,pres,a,c,down -a axis,o2,o,c,n_levels -a coordinates,,a,c,pres itp06_itp6grd1317.nc output.nc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we added: \n",
"1. featureType *profile* to the global attributes\n",
"2. postitive direction of *pressure*\n",
"3. *pressure* as coordinate variable for the other variables\n",
"\n",
"Let's have a look at the plot:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x7f5b032af198>]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"itp_ds = xa.open_dataset('output.nc')\n",
"itp_ds.temp.plot(y='pres')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please note that the NCO has loads of capabilities like e.g. [reversing dimensions](http://nco.sourceforge.net/nco.html#ncpdq-netCDF-Permute-Dimensions-Quickly) and [changing variable types](http://nco.sourceforge.net/nco.html#ncap2-netCDF-Arithmetic-Processor)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.8. Check CF and/or ACDD compliance\n",
"You can now check your dataset in online compliance checkers like [this one](https://pumatest.nerc.ac.uk/cgi-bin/cf-checker.pl). "
]
},
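{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before uploading a file to an online checker, a quick local look at the global attributes can catch obvious gaps. The sketch below is not a compliance checker: it only tests whether the four global attributes that ACDD 1.3 lists as highly recommended (`title`, `summary`, `keywords`, `Conventions`) are present and non-empty. The function name `missing_acdd_attributes` is hypothetical, and the example dictionary mirrors a few of the attributes set in sections 3.5 and 3.6:\n",
"\n",
"```python\n",
"ACDD_HIGHLY_RECOMMENDED = ('title', 'summary', 'keywords', 'Conventions')\n",
"\n",
"def missing_acdd_attributes(global_attrs):\n",
"    # Return the highly recommended names that are absent or empty\n",
"    missing = []\n",
"    for name in ACDD_HIGHLY_RECOMMENDED:\n",
"        value = global_attrs.get(name)\n",
"        if value is None or len(value) == 0:\n",
"            missing.append(name)\n",
"    return missing\n",
"\n",
"# Example input with the same keys as used earlier in this tutorial\n",
"example_attrs = {\n",
"    'title': 'Wave buoy measurements from ...',\n",
"    'Conventions': 'CF-1.8, ACDD-1.3',\n",
"    'keywords': ['Earth Science > Oceans > Ocean Waves'],\n",
"}\n",
"print(missing_acdd_attributes(example_attrs))\n",
"```\n",
"\n",
"For a file on disk, the dictionary could be built from an open `netCDF4.Dataset` called `ds` with `{name: ds.getncattr(name) for name in ds.ncattrs()}`."
]
},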
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Dataset granularity\n",
"How should I structure my datasets? Or, what is actually sa dataset? There are several ways to answer these questions depending on e.g. scientific branch, but also a lot of opinions. From a data management perspective, it is important to structure the data in a way that the users can easily find and use their data of interest. Basically, we want the granularity of the data to fit with *a decent amount of use cases*. In that contetxt, it means that __it should not__ be necessary for the user to download all the data from a particluar instrument (all the years of recordings, locations, etc) before filtering out the data of interest. This should be organized by the data provider. Here we show a couple of examples:\n",
"\n",
"### Parent - child relationship examples:\n",
"\n",
"__Ex1:__<br>\n",
"Parent (Level-1): Radiosonde measurements from instrument XXXXXX<br>\n",
"Child (Level-2): Globally scattered TimeSeriesProfile datasets from intstrument XXXXXX.\n",
"\n",
"__Ex2:__<br>\n",
"Parent (Level-1): Copernicus Sentinel-1A SAR EW GRDM NTC products<br>\n",
"Child (Level-2): Single Sentinel-1A SAR EW GRDM NTC products acquired all over the world\n"
]
},
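{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of choosing a granularity, the buoy file used in section 3 appears to be organized per month (its path contains `2019/01` and its name starts with `201901`). A minimal sketch, assuming timestamps in seconds since 1970-01-01 UTC, of how a long record could be split into such monthly child datasets is given below. The function name `group_by_month` and the sample values are hypothetical:\n",
"\n",
"```python\n",
"import datetime\n",
"from collections import defaultdict\n",
"\n",
"def group_by_month(timestamps):\n",
"    # Map 'YYYYMM' to the list of indices into the original record\n",
"    groups = defaultdict(list)\n",
"    for index, seconds in enumerate(timestamps):\n",
"        t = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)\n",
"        groups[t.strftime('%Y%m')].append(index)\n",
"    return dict(groups)\n",
"\n",
"# Made-up sample: two records in January 2019, one in February 2019\n",
"sample = [1546300800, 1546387200, 1548979200]\n",
"print(group_by_month(sample))\n",
"```\n",
"\n",
"Each group of indices could then be written to its own child file, with the parent dataset describing the instrument record as a whole."
]
},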
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Wrap-up\n",
"Well-documented data contributes to a higher degree of unambiguous interpretation and thus enhance usability for a wider range of users. Moreover, it makes the data more findable, accessible and reusable. Thank you for contributing to FAIR data!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}