To run data mining algorithms on large ocean datasets, we need to optimise access to datasets with up to six dimensions.
A generalised 6-dimensional dataset is [X,Y,Z,T,V,E] where:
- X,Y,Z,T are the space/time dimensions,
- V is the variable dimension (e.g. temperature, salinity, zonal velocity) and
- E is the ensemble dimension (the list of realisations or members).
Running data mining algorithms on such a dataset mostly requires re-arranging the 6 dimensions into 2-dimensional arrays with, following the statistics vocabulary, a "sampling" and a "features" dimension. The sampling dimension runs along rows, the features along columns. A large dataset can have billions of rows and hundreds of columns.
E.g.:
- a collection of timeseries: sampling is [X,Y,Z,V,E] and features is [T]
- a collection of profiles: sampling is [X,Y,V,E] and features is [Z]
- a collection of maps: sampling is [Z,V,E] and features is [X,Y] (here the X and Y dimensions are stacked)
Note that the variable dimension V may also be moved to the features dimension.
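The re-arrangements above can be sketched with numpy on a small synthetic array (the sizes below are hypothetical, chosen only for illustration; the profiles and maps examples use a single, fixed timestep, as in the list above):

```python
import numpy as np

# Hypothetical small sizes for a synthetic [X, Y, Z, T, V, E] dataset
X, Y, Z, T, V, E = 4, 3, 5, 8, 2, 3
data = np.arange(X * Y * Z * T * V * E, dtype=float).reshape(X, Y, Z, T, V, E)

# Collection of timeseries: sampling = [X, Y, Z, V, E], features = [T].
# Move T to the last axis, then flatten all other dimensions into rows.
timeseries = np.moveaxis(data, 3, -1).reshape(-1, T)      # (X*Y*Z*V*E, T)

# Collection of profiles at a fixed timestep t=0:
# sampling = [X, Y, V, E], features = [Z].
snapshot = data[:, :, :, 0, :, :]                          # shape (X, Y, Z, V, E)
profiles = np.moveaxis(snapshot, 2, -1).reshape(-1, Z)     # (X*Y*V*E, Z)

# Collection of maps at a fixed timestep t=0:
# sampling = [Z, V, E], features = [X, Y] stacked into one axis.
maps = np.moveaxis(snapshot, (0, 1), (-2, -1)).reshape(-1, X * Y)  # (Z*V*E, X*Y)
```

Each result is a plain 2-D array directly usable by standard data mining libraries; note that `moveaxis` returns a view, so the memory cost comes only from the final `reshape` when a copy is needed.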
The problem is that, most of the time, the dataset is stored on disk in netCDF files as follows:
- one file per (T, V, E) instance,
- in each file, [X,Y] maps are stacked along [Z], i.e. data are stored as [X,Y,Z].
With this storage layout, accessing a map for a single variable/timestep is fast.
But all other access patterns (e.g. timeseries, profiles) are not optimised, because they would involve constructing the features dimension by reading data scattered across several files and storage locations.
The challenge is thus to transform 6-d datasets into the 2-d sampling/features layout efficiently.
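The scattered-read problem can be illustrated with a minimal sketch in which a Python dict stands in for the collection of netCDF files (one entry per (T, V, E) instance, each holding an [X,Y,Z] array); the file names and helper functions below are hypothetical, not part of any real I/O library:

```python
import numpy as np

# Hypothetical sizes; each dict entry stands in for one netCDF file on disk,
# keyed by its (t, v, e) instance and storing an [X, Y, Z] array.
X, Y, Z, T, V, E = 4, 3, 5, 8, 2, 3
rng = np.random.default_rng(0)
files = {(t, v, e): rng.standard_normal((X, Y, Z))
         for t in range(T) for v in range(V) for e in range(E)}

def read_map(t, v, e, z):
    """Fast path: one [X,Y] map comes from exactly one file."""
    return files[(t, v, e)][:, :, z]

def read_timeseries(x, y, z, v, e):
    """Slow path: one timeseries at a grid point touches all T files,
    reading a single scattered value from each."""
    return np.array([files[(t, v, e)][x, y, z] for t in range(T)])

one_map = read_map(0, 0, 0, z=2)          # 1 file opened
one_series = read_timeseries(1, 2, 3, 0, 0)  # T files opened
```

For realistically sized datasets the slow path multiplies by the number of sampling points, which is what motivates re-organising the data on disk before running the algorithms.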