To run data mining algorithms on large ocean datasets, we need to optimise access to datasets with up to six dimensions.
A generalised 6-dimensional dataset is [X,Y,Z,T,V,E] where:
- X,Y,Z,T are the space/time dimensions,
- V is the variable dimension (e.g. temperature, salinity, zonal velocity) and,
- E is the ensemble dimension (list of realisations or members).
Running data mining algorithms on such a dataset mostly requires re-arranging the 6 dimensions into 2-dimensional arrays with, following the statistics vocabulary, a "sampling" dimension and a "features" dimension. The sampling dimension is along rows, the features along columns. A large dataset can have billions of rows and hundreds of columns.
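As a minimal sketch of this re-arrangement, assuming a small synthetic array with the [X,Y,Z,T,V,E] layout described above (the shapes and variable names are illustrative, not from an actual dataset), NumPy can collapse the space/time/ensemble axes into the sampling dimension and keep the variable axis as features:

```python
import numpy as np

# Hypothetical small 6-D dataset [X, Y, Z, T, V, E]:
# a 4 x 3 x 2 grid, 5 time steps, 3 variables, 2 ensemble members.
data = np.random.rand(4, 3, 2, 5, 3, 2)

# Move the variable axis (V, axis 4) to the end, then flatten all
# remaining axes into a single sampling dimension (rows), keeping
# the variables as the features dimension (columns).
samples = np.moveaxis(data, 4, -1).reshape(-1, data.shape[4])

print(samples.shape)  # (240, 3): 4*3*2*5*2 samples, 3 features
```

The resulting 2-D array is what typical data mining libraries expect as input, with one row per space/time/ensemble point and one column per variable.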
E.g.: