To run data mining algorithms on large ocean datasets, we need to optimise access to datasets with up to six dimensions.
A generalised 6-dimensional dataset is [X,Y,Z,T,V,E] where:
- X,Y,Z,T are the space/time dimensions,
- V is the variable dimension (e.g. temperature, salinity, zonal velocity) and
- E is the ensemble dimension (the list of realisations or members).
Running data mining algorithms on such a dataset mostly requires re-arranging the 6 dimensions into 2-dimensional arrays with, following the statistics vocabulary, a "sampling" and a "features" dimension. The sampling dimension runs along rows, the features along columns. A large dataset can have billions of rows and hundreds of columns.
E.g.:
- a collection of timeseries: sampling is [X,Y,Z,V,E] and features is [T]
- a collection of profiles: sampling is [X,Y,V,E] and features is [Z]
- a collection of maps: sampling is [Z,V,E] and features is [X,Y] (here the X and Y dimensions are stacked)
Note that the variable dimension V may also be moved to the features dimension.
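The re-arrangements above can be sketched with numpy on a small synthetic array (the sizes below are hypothetical, chosen only for illustration; the profiles and maps examples use a single, fixed timestep, as in the list above):

```python
import numpy as np

# Hypothetical small sizes for a synthetic [X, Y, Z, T, V, E] dataset
X, Y, Z, T, V, E = 4, 3, 5, 8, 2, 3
data = np.arange(X * Y * Z * T * V * E, dtype=float).reshape(X, Y, Z, T, V, E)

# Collection of timeseries: sampling = [X, Y, Z, V, E], features = [T].
# Move T to the last axis, then flatten all other dimensions into rows.
timeseries = np.moveaxis(data, 3, -1).reshape(-1, T)      # (X*Y*Z*V*E, T)

# Collection of profiles at a fixed timestep t=0:
# sampling = [X, Y, V, E], features = [Z].
snapshot = data[:, :, :, 0, :, :]                          # shape (X, Y, Z, V, E)
profiles = np.moveaxis(snapshot, 2, -1).reshape(-1, Z)     # (X*Y*V*E, Z)

# Collection of maps at a fixed timestep t=0:
# sampling = [Z, V, E], features = [X, Y] stacked into one axis.
maps = np.moveaxis(snapshot, (0, 1), (-2, -1)).reshape(-1, X * Y)  # (Z*V*E, X*Y)
```

Each result is a plain 2-D array directly usable by standard data mining libraries; note that `moveaxis` returns a view, so the memory cost comes only from the final `reshape` when a copy is needed.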
The problem is that, most of the time, the dataset is stored on disk in netCDF files as follows:
- one file per (T, V, E) instance,
- in each file, [X,Y] maps are stacked along [Z], i.e. data are stored as [X,Y,Z].
With this storage layout, accessing a map for a single variable/timestep is fast.
But all other access patterns (e.g. timeseries, profiles) are not optimised, because they would involve constructing the features dimension by reading data scattered across several files and storage locations.
The challenge is thus to transform 6-d datasets into the 2-d sampling/features layout efficiently.
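The scattered-read problem can be illustrated with a minimal sketch in which a Python dict stands in for the collection of netCDF files (one entry per (T, V, E) instance, each holding an [X,Y,Z] array); the file names and helper functions below are hypothetical, not part of any real I/O library:

```python
import numpy as np

# Hypothetical sizes; each dict entry stands in for one netCDF file on disk,
# keyed by its (t, v, e) instance and storing an [X, Y, Z] array.
X, Y, Z, T, V, E = 4, 3, 5, 8, 2, 3
rng = np.random.default_rng(0)
files = {(t, v, e): rng.standard_normal((X, Y, Z))
         for t in range(T) for v in range(V) for e in range(E)}

def read_map(t, v, e, z):
    """Fast path: one [X,Y] map comes from exactly one file."""
    return files[(t, v, e)][:, :, z]

def read_timeseries(x, y, z, v, e):
    """Slow path: one timeseries at a grid point touches all T files,
    reading a single scattered value from each."""
    return np.array([files[(t, v, e)][x, y, z] for t in range(T)])

one_map = read_map(0, 0, 0, z=2)          # 1 file opened
one_series = read_timeseries(1, 2, 3, 0, 0)  # T files opened
```

For realistically sized datasets the slow path multiplies by the number of sampling points, which is what motivates re-organising the data on disk before running the algorithms.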