Better understanding how Argo data could be easily subsetted from the "cloud"

"Is there already an interface that allows for subsetting of the data or do you imagine that additional tools will be developed on top of the structure as is done with the current https GDAC service? "
In short:

  1. Additional tools are required mostly for Matlab users; Python users are already very well equipped.
  2. If subsetting were the only issue to address, an API/database approach would be an alternative to changing the file format. But that is not the case: we also want to improve raw data access and structure through an alternative file format to netcdf.

With regard to better understanding how Argo data could be easily subsetted from the "cloud", this is an opportunity for me to summarise some key points below. Apologies in advance for any shortcomings and for the verbosity of this email!

Key points

In more detail:

1- With the existing system (netcdf files via an https access point), you are limited to accessing/downloading data file by file

For a single float (or a few) this is not really a big deal, BUT the limitation is that you cannot access a subset of such a file (because netcdf does not support "range"-like access via http, you cannot download a byte range, i.e. a piece of the binary file).
In other words: you have to download an entire file (hence all variables) even if you only want to access DOXY, for instance. We're not even talking about QC flags and Data Mode here. Another example is if you want to access a single cycle from a multi-profile file, or a small space/time domain within a multi-profile file from the "geo" folder.
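
To make this concrete, here is a minimal sketch of the current workflow in Python (the DAC and float WMO in the URL are only illustrative): even to read a single variable, the whole file has to be transferred first.

# Illustrative GDAC https path; the DAC and WMO number are just examples:
import urllib.request
import xarray as xr

url = "https://data-argo.ifremer.fr/dac/coriolis/6902746/6902746_prof.nc"

# We must download the entire file, with all its variables...
urllib.request.urlretrieve(url, "6902746_prof.nc")

# ... before we can look at just one of them:
ds = xr.open_dataset("6902746_prof.nc")
print(ds["TEMP"])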

2- A cloud-optimised file format provides a solution to the current limitations in subsetting

Parquet and zarr (and others) support "range"-like access over http. This means that you can access/download only what you want from inside such files (and like netcdf, they also combine readable metadata with binary-compressed data components).
"Opening" such a file can give you the metadata of its full content (e.g. number of variables, dimension sizes), while you download/fetch only a smaller part. This dramatically reduces data web traffic.

3- Such formats are furthermore designed to handle very large datasets efficiently

We could have the entire Argo dataset in a single file (provided that we design it properly). That's where the cloud access point comes in.
Indeed, in practice we could "simply" convert netcdf to zarr or parquet and continue to distribute data through an http web server. But this would only partially improve performance when working with the data.

In fact, Parquet and zarr are natively designed to internally "chunk" the data into small pieces (it is in their structure to cut data into smaller pieces). And a cloud access point is also natively designed to allow for direct access to these chunks of data. (That is no surprise, because these technologies have been developed for exactly that purpose.)

Such access points include: Amazon S3, Google Cloud Storage (GCS), etc.
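
As a sketch of what "chunking" means in practice (file names, dimension name and chunk size below are purely illustrative), this is how a netcdf multi-profile file could be converted to a chunked zarr store with xarray:

import xarray as xr

# Open a (hypothetical) aggregated multi-profile netcdf file:
ds = xr.open_dataset("ArgoProfiles_2023.nc")

# Cut the data into pieces of 1000 profiles along the N_PROF dimension:
ds = ds.chunk({"N_PROF": 1000})

# Each chunk becomes its own small compressed object in the zarr store:
ds.to_zarr("ArgoProfiles_2023.zarr", mode="w")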

The combination/coordination of such file formats and access points makes analysing large datasets very efficient, because it is easy to parallelise the analysis along the chunked dimension (e.g. we could chunk the dataset by float, by profile, by day, etc.).
In this case, many CPUs can access and work with chunks of data in parallel. This can be done from a laptop using a few CPUs, as well as from High Performance Computing facilities and clusters of computers.
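
A sketch of what this parallel workflow could look like with xarray and dask (the bucket path is hypothetical, and gcsfs must be installed for gs:// paths to work):

import xarray as xr
from dask.distributed import Client

# Start a local cluster: a few CPUs on a laptop, or many more on an HPC facility
client = Client()

# Open the (hypothetical) chunked store lazily, without downloading data:
ds = xr.open_zarr("gs://some-bucket/ArgoProfiles.zarr")

# The reduction is computed chunk by chunk, in parallel across workers:
mean_temp = ds["TEMP"].mean(dim="N_PROF").compute()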

4- In the end, we kind of have 2 dimensions to the problem

  1. accessing a finely tuned grain of data (dim = easy subsetting)
  2. leveraging parallelisation to improve analysis performance (dim = parallelisation)

I think we tend to get confused with these two when talking about "cloud-based" solutions.

It is also easy to grasp that how you design the chunking will be critical to performance. We now tend to see as many chunking schemes as use-cases. This leads to several "versions" of the same dataset, organized/chunked differently according to the target use-case. For Argo, I guess the main split is between WMO-based vs space/time-based chunking.
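
As a loose sketch of what "several versions of the same dataset" could mean (paths, variable names and chunk sizes are only illustrative), the same content could be written twice with different orderings and chunkings:

import xarray as xr

ds = xr.open_dataset("ArgoProfiles.nc")  # hypothetical aggregated dataset

# Version 1: keep the native per-float ordering, chunked by blocks of profiles,
# efficient for users working float by float (WMO-based use-case):
ds.chunk({"N_PROF": 1000}).to_zarr("argo_wmo_chunked.zarr", mode="w")

# Version 2: reorder the same profiles by date before chunking,
# so that space/time extractions touch as few chunks as possible:
ds.sortby(ds["JULD"]).chunk({"N_PROF": 1000}).to_zarr("argo_time_chunked.zarr", mode="w")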

Examples

Now in practice, how does this work? I can give you an example in Python.

For instance, let's consider the EN4 dataset, an optimal interpolation of all in-situ data. For a class I give to master's students, I aggregated all the netcdf files of the native dataset into a single zarr file that I uploaded to a Google Cloud Storage access point. The file is about 50 GB.

# We import the required libraries:
import gcsfs
import xarray as xr

# Then we define our cloud file system access point:
fs = gcsfs.GCSFileSystem(project='alert-ground-261008', token='anon', access='read_only')

# Now we can open the zarr file:
gcsmap = fs.get_mapper("opendata_bdo2020/EN.4.2.1.f.analysis.g10.zarr")
ds = xr.open_zarr(gcsmap)

xarray will read all the meta-data from the remote file and return a "view" of the file content:

<xarray.Dataset>
Dimensions:                          (depth: 42, time: 832, bnds: 2, lat: 173,
                                      lon: 360)
Coordinates:
  * depth                            (depth) float32 5.022 15.08 ... 5.35e+03
  * lat                              (lat) float32 -83.0 -82.0 ... 88.0 89.0
  * lon                              (lon) float32 1.0 2.0 3.0 ... 359.0 360.0
  * time                             (time) datetime64[ns] 1950-01-16T12:00:0...
Dimensions without coordinates: bnds
Data variables:
    depth_bnds                       (time, depth, bnds) float32 ...
    salinity                         (time, depth, lat, lon) float32 ...
    salinity_observation_weights     (time, depth, lat, lon) float32 ...
    salinity_uncertainty             (time, depth, lat, lon) float32 ...
    temperature                      (time, depth, lat, lon) float32 ...
    temperature_observation_weights  (time, depth, lat, lon) float32 ...
    temperature_uncertainty          (time, depth, lat, lon) float32 ...
    time_bnds                        (time, bnds) datetime64[ns] ...
Attributes: (12/21)
    Conventions:            CF-1.0
    DSD_entry_id:           UKMO-L4UHFnd-GLOB-v01
    GDS_version_id:         v1.7
    contact:                Simon Good - simon.good@metoffice.gov.uk
    creation_date:          2017-04-21 21:12:08.123 -00:00
    easternmost_longitude:  362.5
    ...                     ...
    start_date:             2001-01-01 UTC
    start_time:             00:00:00 UTC
    stop_date:              2001-01-01 UTC
    stop_time:              00:00:00 UTC
    title:                  Temperature and salinity analysis
    westernmost_longitude:  0.5

we can "see" everything, but at this point we didn't actually downloaded any data, just the meta-data that allow to "describe" the file content

From here, it is easy to subset the data, for instance for a given time and depth:

ds.sel(time='2002-01-16T12:00').isel(depth=0)

we get the "view":

<xarray.Dataset>
Dimensions:                          (bnds: 2, lat: 173, lon: 360)
Coordinates:
    depth                            float32 5.022
  * lat                              (lat) float32 -83.0 -82.0 ... 88.0 89.0
  * lon                              (lon) float32 1.0 2.0 3.0 ... 359.0 360.0
    time                             datetime64[ns] 2002-01-16T12:00:00
Dimensions without coordinates: bnds
Data variables:
    depth_bnds                       (bnds) float32 ...
    salinity                         (lat, lon) float32 ...
    salinity_observation_weights     (lat, lon) float32 ...
    salinity_uncertainty             (lat, lon) float32 ...
    temperature                      (lat, lon) float32 ...
    temperature_observation_weights  (lat, lon) float32 ...
    temperature_uncertainty          (lat, lon) float32 ...
    time_bnds                        (bnds) datetime64[ns] ...
Attributes: (12/21)
    Conventions:            CF-1.0
    DSD_entry_id:           UKMO-L4UHFnd-GLOB-v01
    GDS_version_id:         v1.7
    contact:                Simon Good - simon.good@metoffice.gov.uk
    creation_date:          2017-04-21 21:12:08.123 -00:00
    easternmost_longitude:  362.5
    ...                     ...
    start_date:             2001-01-01 UTC
    start_time:             00:00:00 UTC
    stop_date:              2001-01-01 UTC
    stop_time:              00:00:00 UTC
    title:                  Temperature and salinity analysis
    westernmost_longitude:  0.5

where all variables have been squeezed to the requested datetime and first depth level (check the time and depth dimension content). Note that the attributes have also been retained. We still have not downloaded any data at this point, just the metadata.

If we want to plot or analyse salinity, for instance, then we can trigger the data download/transfer with a "compute" or "plot":

ds.sel(time='2002-01-16T12:00').isel(depth=0)['salinity'].plot()
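
For reference, the same transfer can be triggered without plotting, by calling compute() explicitly to bring the values into memory:

# Explicitly download and load the selected 2D salinity field into memory:
sal = ds.sel(time='2002-01-16T12:00').isel(depth=0)['salinity'].compute()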

It's easy to imagine much the same mechanism with the Argo dataset!

In argopy, we provide such a "structure" for data manipulation. But it is not based on a native cloud dataset; we assemble it internally on the user side, from miscellaneous resources.
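
For illustration only (the WMO number and region bounds below are arbitrary examples), this kind of high-level access in argopy looks roughly like:

from argopy import DataFetcher

# Fetch all profiles of one float, assembled on the user side into an xarray dataset:
ds_float = DataFetcher().float(6902746).to_xarray()

# Or fetch a space/time/pressure box:
ds_box = DataFetcher().region([-75, -45, 20, 30, 0, 100, '2011-01', '2011-06']).to_xarray()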

A note about Matlab

Matlab is used by a large audience in research and in the Argo community.
But there is no zarr reader for Matlab. There is a Parquet reader, though. Moreover, I'm not aware of any high-level Matlab library that would provide roughly the same service as "xarray" in Python. Therefore, although that is already the case today with the current system, the Matlab user base will not have easy high-level access to the new services we shall provide.
