Data format for storing N-dimensional data in HDF5

N-Dimensional Spectroscopy and Imaging Data (NSID) Format

Preface

Several years ago, when I was a postdoctoral researcher at CNMS, a few fellow researchers and I explored multiple methods for representing measurement data. Long story short, we decided to go ahead with what is now the Universal Spectroscopy and Imaging Data (USID) model, where the data are written into Hierarchical Data Format (HDF5) files. USID's ability to express all kinds of data, especially niche cases like compressed sensing and spiral scans (neither polar nor Cartesian grids), makes it perhaps unnecessarily complicated for data that do have an N-dimensional form. When data have a clear N-dimensional form, one can leverage many of HDF5's inherent capabilities to easily represent and use N-dimensional datasets. For this reason, I had explored the possibility of using USID only when an N-dimensional form for the data did not exist, and had developed rudimentary scripts and PowerPoint presentations to express this idea several years ago. This is my attempt to revive the idea of a simpler representation for N-dimensional data in light of the experience of developing and supporting USID. For the sake of simplicity, I will call this representation model the N-Dimensional Spectroscopy and Imaging Data (NSID) format.

Why not just use h5py?

h5py does indeed provide all the functionality necessary to support NSID. However, a layer of convenience and standardization is still useful / necessary for a few reasons:

  1. To ensure that data (in memory) are always stored in the same standardized fashion. This would be a function like pyUSID.hdf_utils.write_main_dataset() or a class like pyUSID.ArrayTranslator.
  2. To make it easier to access relevant ancillary information from HDF5 datasets, such as the dimensions, units, scales, etc., without needing to write a lot of h5py code. I anticipate that this may look like a class along the lines of pyUSID.USIDataset. However, this class may extend a dask.array object instead of an h5py.Dataset object for simplicity. xarray apparently extends pandas, which is inappropriate for this application. However, packages like pint should certainly be used.
  3. To simplify certain ancillary tasks like identifying all NSID datasets in a given file, seamlessly reusing or copying datasets that represent dimensions, and verifying whether a dataset is indeed NSID or not (see the sketch after this list).
  4. To facilitate embarrassingly parallel computations on datasets along the lines of pyUSID.Process. I would love to use dask to handle parallelization. However, HDF5 datasets are still not pickle-able, so Dask cannot operate on them directly. It is likely that this framework would rely on lower-level libraries like mpi4py.
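
The verification task mentioned in point 3 could, for example, be a small helper built on plain h5py. Below is a minimal sketch, assuming the strawman attribute names listed later in this document; the function name is hypothetical and not part of any released package.

```python
import h5py

# Attributes the strawman spec requires on a Main Dataset (see below)
REQUIRED_ATTRS = ('quantity', 'units', 'data_type', 'modality', 'source', 'nsid_version')

def is_nsid_main(dset):
    """Return True if dset appears to follow the NSID Main Dataset spec."""
    if not isinstance(dset, h5py.Dataset):
        return False
    # All required attributes must be present
    if not all(attr in dset.attrs for attr in REQUIRED_ATTRS):
        return False
    # Every dimension must have at least one Dimension Scale attached
    return all(len(dim) > 0 for dim in dset.dims)
```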

I expect the package supporting NSID to be far simpler than pyUSID, since h5py inherently provides the majority of the functionality.

Strawman NSID specifications

Originally, the NSID model was envisioned to be similar to USID in that it too would have a Main Dataset supported by Ancillary Datasets that provide reference information about each dimension, with the ancillary datasets attached to the Main Dataset using HDF5 Dimension Scales. However, I have since learnt that HDF5's Dimension Scales can themselves capture the information that would have been stored in these ancillary datasets and can be attached directly to the Main Dataset.

Main Dataset

The main data will be stored in an HDF5 dataset:

  • shape: Arbitrary - matching the dimensionality of the data
  • dtype: basic types like integer, float, and complex only. I am told that compound-valued datasets are not well supported in languages other than Python. Therefore, such data should be broken up into datasets of simpler dtypes.
  • chunks: Leave as default / do not specify anything.
  • compression: Preferably do not use anything. If compression is indeed necessary, consider using gzip.
  • Dimension scales: Every dimension needs to have at least one scale attached to it, with the name(s) of the dimension(s) as the label(s) for the scale. Normally, only one dataset would be attached to each dimension. However, if one of the reference axes were, for example, a color (a tuple of three integers), we would need to attach three datasets to the scale for that dimension.
  • Required Attributes:
    • quantity: `string`: Physical quantity that is contained in this dataset
    • units: `string`: Units for this physical quantity
    • data_type: `string`: What kind of data this is. Example - image, image stack, video, hyperspectral image, etc.
    • modality: `string`: Experimental / simulation modality - scientific meaning of the data. Example - photograph, TEM micrograph, SPM Force-Distance spectroscopy.
    • source: `string`: Source of the dataset, such as the kind of instrument. One could go very deep here into either the algorithmic details, if this is a result from analysis, or the exact configuration of the instrument that generated this dataset. I am inclined to remove this attribute and have this expressed in the metadata alone.
    • nsid_version: `string`: Version of the abstract NSID model.

Note that we should take guidance from experts in schemas and ontologies on how best to represent the data_type and modality information.

Ancillary Datasets

Each of the N dimensions corresponding to the N-dimensional Main Dataset would be described by an HDF5 dataset (a combined sketch covering the Main and Ancillary Datasets follows the list below):

  • shape - 1D only
  • dtype - Simple data types like int, float, complex
  • Required attributes -
    • quantity: `string`: Physical quantity that is contained in this dataset
    • units: `string`: units for the physical quantity
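
To make the two specifications above concrete, here is a minimal sketch, using only plain h5py (the Dimension Scale API requires h5py 2.9 or newer), of writing a small 3D Main Dataset together with its dimension datasets. The file name, group names, dimension names, and values are invented purely for illustration.

```python
import h5py
import numpy as np

data = np.random.rand(2, 3, 128)  # e.g. a 2 x 3 grid of 128-point spectra

with h5py.File('nsid_example.h5', 'w') as h5_file:
    h5_group = h5_file.create_group('Measurement_000/Channel_000')

    # Main Dataset carrying the required attributes
    h5_main = h5_group.create_dataset('Raw_Data', data=data)
    h5_main.attrs.update({'quantity': 'Current', 'units': 'nA',
                          'data_type': 'spectral_image',
                          'modality': 'STS', 'source': 'STM',
                          'nsid_version': '0.0.1'})

    # One 1D dataset per dimension, each with its own quantity and units
    dims = [('Y', np.linspace(0, 10, 2), 'um'),
            ('X', np.linspace(0, 10, 3), 'um'),
            ('Bias', np.linspace(-1, 1, 128), 'V')]
    for index, (name, values, units) in enumerate(dims):
        h5_dim = h5_group.create_dataset(name, data=values)
        h5_dim.attrs.update({'quantity': name, 'units': units})
        # Attach the dimension dataset to the Main Dataset as a Dimension Scale
        h5_dim.make_scale(name)
        h5_main.dims[index].label = name
        h5_main.dims[index].attach_scale(h5_dim)
```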

Metadata

Strawman solution - store the hierarchical metadata in hierarchical HDF5 groups within the same file as the Main Dataset and link the parent group that provides the metadata to the Main Dataset. Again, this requires feedback from experts in schemas and ontologies.
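
As a rough illustration of this strawman, a nested dictionary of metadata could be written into nested HDF5 groups with a short recursive helper. The function and key names below are hypothetical, and linking the metadata group to the Main Dataset (e.g. via an HDF5 object reference) is left out for brevity.

```python
import h5py

def write_nested_metadata(h5_group, metadata):
    """Recursively write a nested dict into nested HDF5 groups."""
    for key, value in metadata.items():
        if isinstance(value, dict):
            # Nested dictionaries become nested sub-groups
            write_nested_metadata(h5_group.create_group(key), value)
        else:
            # Simple values become attributes of the current group
            h5_group.attrs[key] = value

metadata = {'instrument': {'vendor': 'Acme', 'bias_V': 0.5},
            'sample': {'material': 'graphene', 'thickness_nm': 3}}

with h5py.File('nsid_example.h5', 'a') as h5_file:
    h5_meta = h5_file.require_group('Measurement_000/Metadata')
    write_nested_metadata(h5_meta, metadata)
```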

Multiple measurements in same file

A single HDF5 file can contain multiple HDF5 datasets, and it is not necessary that all datasets be NSID-specific. Similarly, the hierarchical nature of HDF5 allows the storage of multiple NSID measurements within the same HDF5 file. Strict restrictions will not be placed on how the datasets should be arranged, but users are recommended to follow the same guidelines of Measurement Groups and Channels as defined in USID.

Data processing results in same file

We defined a possible solution for capturing provenance between the source dataset and the results datasets. Briefly, results would be stored in a group whose name would be formatted as SourceDataset-ProcessingAlgorithmName_NumericIndex (a sketch of this naming convention follows the list below). However, this solution does not work elegantly in certain situations:

  • if multiple source datasets were used to produce a set of results datasets.
  • if results are written into a different file.
  • In general, the algorithm name was loosely defined.
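
For reference, the naming convention itself is straightforward to implement. A minimal sketch (the helper name is hypothetical) that creates the next available results group beside the source dataset could look like this:

```python
import h5py

def create_results_group(h5_main, process_name):
    """Create the next free 'SourceDataset-ProcessName_index' group beside the source dataset."""
    parent = h5_main.parent
    source_name = h5_main.name.split('/')[-1]
    index = 0
    # Increment the numeric index until an unused group name is found
    while f'{source_name}-{process_name}_{index:03d}' in parent:
        index += 1
    return parent.create_group(f'{source_name}-{process_name}_{index:03d}')

# e.g. create_results_group(h5_main, 'Clustering') -> group 'Raw_Data-Clustering_000'
```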

Do get in touch if you know of a better solution.

Existing solutions

These are the requirements for the materials characterization domain. I am not sure whether something like NSID, or a python API like the to-be-developed pyNSID, already exists. We would need to survey the web for existing solutions, both to avoid duplicating efforts and to support an existing central effort.

@keknight

In looking at data_type and modality, I'm wondering if you could make use of PROV-DM. It's a data model (W3C) that describes the core elements and relationships that can be used to represent any process as a directed graph by linking three structures (entities, activities, and agents) using a specific set of relationships. Additional constraints are needed to make PROV specific for a given domain, but there are extension points that allow inclusion of domain-specific semantics.

The PROV-DM Core Structures are
1) entity - a physical, digital, conceptual, or other kind of thing with some fixed aspects (can be concrete or abstract)
2) activity - something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities
3) agent - something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
4) relationships, which are: a) wasDerivedFrom, b) used, c) wasGeneratedBy, d) wasInformedBy, e) wasAssociatedWith, f) actedOnBehalfOf, g) wasAttributedTo

http://www.w3.org/TR/prov-dm/ 

@ssomnath
Author

Interesting. I will look into this. Thank you, @keknight
