@gmaze
Last active May 17, 2018 09:27
OBIDAM: dataset fast access issue

To run data mining algorithms on large ocean datasets, we need to optimise access to datasets with up to 6 dimensions.

A generalised 6-dimensional dataset is [X,Y,Z,T,V,E] where:

  • X,Y,Z,T are the space/time dimensions,
  • V is the variable dimension (eg: temperature, salinity, zonal velocity), and
  • E is the ensemble dimension (list of realisations or members).

Running data mining algorithms on this dataset mostly means rearranging the 6 dimensions into a 2-dimensional array whose two dimensions are, in statistics vocabulary, the "sampling" and "features" dimensions. The sampling dimension runs along rows, the features along columns. A large dataset can have billions of rows and hundreds of columns.

For example:

  • a collection of timeseries: sampling is [X,Y,Z,V,E] and features is [T]
  • a collection of profiles: sampling is [X,Y,T,V,E] and features is [Z]
  • a collection of maps: sampling is [Z,T,V,E] and features is [X,Y] (here the X and Y dimensions are stacked)

Note that the V variable dimension may be moved to the feature dimensions.
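As a minimal sketch of this rearrangement, assuming the whole dataset fits in memory as a NumPy array with axes ordered [X,Y,Z,T,V,E], the helper below (the name `to_sampling_features` is hypothetical) stacks every dimension not listed as a feature into the sampling dimension. The array sizes are made up for illustration.

```python
import numpy as np

DIMS = ("X", "Y", "Z", "T", "V", "E")  # axis order assumed for this sketch

def to_sampling_features(data, feature_dims, dims=DIMS):
    """Return a 2-D (samples, features) array built from a 6-D array."""
    sampling_dims = [d for d in dims if d not in feature_dims]
    # Put sampling axes first and feature axes last, then flatten each group.
    order = [dims.index(d) for d in sampling_dims + list(feature_dims)]
    moved = np.transpose(data, order)
    n_features = int(np.prod([data.shape[dims.index(d)] for d in feature_dims]))
    return moved.reshape(-1, n_features)

# Tiny synthetic dataset: 4 x 3 x 5 grid, 6 time steps, 2 variables, 2 members.
shape = (4, 3, 5, 6, 2, 2)
data = np.arange(np.prod(shape), dtype=float).reshape(shape)

timeseries = to_sampling_features(data, ["T"])       # shape (240, 6)
profiles = to_sampling_features(data, ["Z"])         # shape (288, 5)
maps = to_sampling_features(data, ["X", "Y"])        # shape (120, 12)
# Moving V to the feature side: each row holds the profiles of both variables.
profiles_v = to_sampling_features(data, ["Z", "V"])  # shape (144, 10)
```

On a real dataset the transpose-and-reshape step is exactly what cannot be done in memory, which is why the on-disk layout matters.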

The problem is that, most of the time, the dataset is stored on disk in NetCDF files as follows:

  • one file for each T, V and E instance
  • in each file: [X,Y] maps are stacked along [Z], ie data are stored as [X,Y,Z]

With this storage layout, accessing a map for a single variable/timestep is fast.

But all other situations (eg: timeseries, profiles) are not optimised because they require constructing the feature dimension by reading data scattered across several files and spatial locations.
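To make the imbalance concrete, here is a rough cost model for building a single row of the 2-d array. It assumes that each [X,Y] map is contiguous inside a file and that one contiguous run costs one read; real NetCDF chunking may behave differently. The function name `access_cost` and the sizes used are hypothetical.

```python
def access_cost(layout, n_z, n_t):
    """Files opened and contiguous reads needed to build ONE row.

    Assumes one file per (T, V, E) instance, each storing [X, Y] maps
    stacked along Z, with every map contiguous on disk.
    """
    if layout == "map":
        # One [X, Y] slab at a single depth level of a single file.
        return {"files": 1, "reads": 1}
    if layout == "profile":
        # One value per depth level, all within the same file.
        return {"files": 1, "reads": n_z}
    if layout == "timeseries":
        # One value from each time-step file.
        return {"files": n_t, "reads": n_t}
    raise ValueError(f"unknown layout: {layout!r}")

# Illustrative sizes only: 50 depth levels, 365 time steps.
costs = {name: access_cost(name, n_z=50, n_t=365)
         for name in ("map", "profile", "timeseries")}
```

Under these assumptions a map row needs one read, while a timeseries row needs as many file openings as there are time steps, and this cost is paid again for every row.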

The challenge is thus to efficiently transform 6-d datasets into 2-d arrays of sampling/feature dimensions.
