Data format for storing N-dimensional data in HDF5

N-Dimensional Spectroscopy and Imaging Data (NSID) Format

Preface

Several years ago, when I was a postdoctoral researcher at CNMS, a few fellow researchers and I explored multiple methods for representing measurement data. Long story short, we decided to go ahead with what is now the Universal Spectroscopy and Imaging Data (USID) model, where the data are written into Hierarchical Data Format (HDF5) files. USID's ability to express all kinds of data, especially niche cases like compressed sensing and spiral scans (neither polar nor Cartesian grids), makes it perhaps unnecessarily complicated for data that do have an N-dimensional form. When data have a clear N-dimensional form, one can leverage many of HDF5's inherent capabilities to easily represent and use N-dimensional datasets. For this reason, I had explored the possibility of using USID only when an N-dimensional form for the data did not exist, and had developed rudimentary scripts and PowerPoint presentations to express this idea several years ago. This is my attempt to revive the idea of a simpler representation for N-dimensional data in light of the experience of developing and supporting USID. For the sake of simplicity, I will call this representation model the N-Dimensional Spectroscopy and Imaging Data (NSID) format.

Why not just use h5py?

h5py does indeed provide all the functionality necessary to support NSID. However, a layer of convenience and standardization is still useful / necessary for a few reasons:

  1. To ensure that data (in memory) are always stored in the same standardized fashion. This would be a function like pyUSID.hdf_utils.write_main_dataset() or a class like pyUSID.ArrayTranslator.
  2. To make it easier to access relevant ancillary information from HDF5 datasets, such as the dimensions, units, scales, etc., without needing to write a lot of h5py code. I anticipate that this may look like a class along the lines of pyUSID.USIDataset. However, this class may extend a dask.array object instead of an h5py.Dataset object for simplicity. xarray apparently extends pandas, which is inappropriate for this application. However, packages like pint should certainly be used.
  3. To simplify certain ancillary tasks like identifying all NSID datasets in a given file, seamlessly reusing or copying datasets that represent dimensions, and verifying whether a dataset is indeed NSID or not (see the sketch after this list).
  4. To facilitate embarrassingly parallel computations on datasets along the lines of pyUSID.Process. I would love to use dask to handle parallelization. However, HDF5 datasets are still not pickle-able, so Dask cannot operate on them directly. It is likely that this framework would rely on lower-level libraries like mpi4py.
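
The verification task mentioned in point 3 could, for example, be a small helper built on plain h5py. Below is a minimal sketch, assuming the strawman attribute names listed later in this document; the function name is hypothetical and not part of any released package.

```python
import h5py

# Attributes the strawman spec requires on a Main Dataset (see below)
REQUIRED_ATTRS = ('quantity', 'units', 'data_type', 'modality', 'source', 'nsid_version')

def is_nsid_main(dset):
    """Return True if dset appears to follow the NSID Main Dataset spec."""
    if not isinstance(dset, h5py.Dataset):
        return False
    # All required attributes must be present
    if not all(attr in dset.attrs for attr in REQUIRED_ATTRS):
        return False
    # Every dimension must have at least one Dimension Scale attached
    return all(len(dim) > 0 for dim in dset.dims)
```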

I expect the package supporting NSID to be far simpler than pyUSID, since h5py inherently provides the majority of the functionality.

Strawman NSID specifications

Originally, the NSID model was envisioned to be similar to USID in that it too would have a Main Dataset supported by Ancillary Datasets that provide reference information about each dimension, with the ancillary datasets attached to the Main Dataset using HDF5 Dimension Scales. However, I have since learnt that HDF5's Dimension Scales can themselves capture the information that would have been stored in these ancillary datasets and can be attached directly to the Main Dataset.

Main Dataset

The main data will be stored in an HDF5 dataset:

  • shape: Arbitrary - matching the dimensionality of the data
  • dtype: basic types like integer, float, and complex only. I am told that compound-valued datasets are not well supported in languages other than Python. Therefore, such data should be broken up into datasets of simpler dtypes.
  • chunks: Leave as default / do not specify anything.
  • compression: Preferably do not use anything. If compression is indeed necessary, consider using gzip.
  • Dimension scales: Every dimension needs to have at least one scale attached to it, with the name(s) of the dimension(s) as the label(s) for the scale. Normally, only one dataset would be attached to each dimension. However, if one of the reference axes were, for example, a color (a tuple of three integers), we would need to attach three datasets to the scale for that dimension.
  • Required Attributes:
    • quantity: `string`: Physical quantity that is contained in this dataset
    • units: `string`: Units for this physical quantity
    • data_type: `string`: What kind of data this is. Example - image, image stack, video, hyperspectral image, etc.
    • modality: `string`: Experimental / simulation modality - scientific meaning of the data. Example - photograph, TEM micrograph, SPM Force-Distance spectroscopy.
    • source: `string`: Source of the dataset, such as the kind of instrument. One could go very deep here into either the algorithmic details, if this is a result from analysis, or the exact configuration of the instrument that generated this dataset. I am inclined to remove this attribute and have this expressed in the metadata alone.
    • nsid_version: `string`: Version of the abstract NSID model.

Note that we should take guidance from experts in schemas and ontologies on how best to represent the data_type and modality information.

Ancillary Datasets

Each of the N dimensions corresponding to the N-dimensional Main Dataset would be described by an HDF5 dataset (a combined sketch covering the Main and Ancillary Datasets follows the list below):

  • shape - 1D only
  • dtype - Simple data types like int, float, complex
  • Required attributes -
    • quantity: `string`: Physical quantity that is contained in this dataset
    • units: `string`: units for the physical quantity
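
To make the two specifications above concrete, here is a minimal sketch, using only plain h5py (the Dimension Scale API requires h5py 2.9 or newer), of writing a small 3D Main Dataset together with its dimension datasets. The file name, group names, dimension names, and values are invented purely for illustration.

```python
import h5py
import numpy as np

data = np.random.rand(2, 3, 128)  # e.g. a 2 x 3 grid of 128-point spectra

with h5py.File('nsid_example.h5', 'w') as h5_file:
    h5_group = h5_file.create_group('Measurement_000/Channel_000')

    # Main Dataset carrying the required attributes
    h5_main = h5_group.create_dataset('Raw_Data', data=data)
    h5_main.attrs.update({'quantity': 'Current', 'units': 'nA',
                          'data_type': 'spectral_image',
                          'modality': 'STS', 'source': 'STM',
                          'nsid_version': '0.0.1'})

    # One 1D dataset per dimension, each with its own quantity and units
    dims = [('Y', np.linspace(0, 10, 2), 'um'),
            ('X', np.linspace(0, 10, 3), 'um'),
            ('Bias', np.linspace(-1, 1, 128), 'V')]
    for index, (name, values, units) in enumerate(dims):
        h5_dim = h5_group.create_dataset(name, data=values)
        h5_dim.attrs.update({'quantity': name, 'units': units})
        # Attach the dimension dataset to the Main Dataset as a Dimension Scale
        h5_dim.make_scale(name)
        h5_main.dims[index].label = name
        h5_main.dims[index].attach_scale(h5_dim)
```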

Metadata

Strawman solution - store the hierarchical metadata in hierarchical HDF5 groups within the same file as the Main Dataset and link the parent group that provides the metadata to the Main Dataset. Again, this requires feedback from experts in schemas and ontologies.
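
As a rough illustration of this strawman, a nested dictionary of metadata could be written into nested HDF5 groups with a short recursive helper. The function and key names below are hypothetical, and linking the metadata group to the Main Dataset (e.g. via an HDF5 object reference) is left out for brevity.

```python
import h5py

def write_nested_metadata(h5_group, metadata):
    """Recursively write a nested dict into nested HDF5 groups."""
    for key, value in metadata.items():
        if isinstance(value, dict):
            # Nested dictionaries become nested sub-groups
            write_nested_metadata(h5_group.create_group(key), value)
        else:
            # Simple values become attributes of the current group
            h5_group.attrs[key] = value

metadata = {'instrument': {'vendor': 'Acme', 'bias_V': 0.5},
            'sample': {'material': 'graphene', 'thickness_nm': 3}}

with h5py.File('nsid_example.h5', 'a') as h5_file:
    h5_meta = h5_file.require_group('Measurement_000/Metadata')
    write_nested_metadata(h5_meta, metadata)
```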

Multiple measurements in same file

A single HDF5 file can contain multiple HDF5 datasets, and it is not necessary that all datasets be NSID-specific. Similarly, the hierarchical nature of HDF5 allows the storage of multiple NSID measurements within the same HDF5 file. Strict restrictions will not be placed on how the datasets should be arranged, but users are recommended to follow the same guidelines of Measurement Groups and Channels as defined in USID.

Data processing results in same file

We defined a possible solution for capturing provenance between the source dataset and the results datasets. Briefly, results would be stored in a group whose name would be formatted as SourceDataset-ProcessingAlgorithmName_NumericIndex (a sketch of this naming convention follows the list below). However, this solution does not work elegantly in certain situations:

  • if multiple source datasets were used to produce a set of results datasets.
  • if results are written into a different file.
  • In general, the algorithm name was loosely defined.
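
For reference, the naming convention itself is straightforward to implement. A minimal sketch (the helper name is hypothetical) that creates the next available results group beside the source dataset could look like this:

```python
import h5py

def create_results_group(h5_main, process_name):
    """Create the next free 'SourceDataset-ProcessName_index' group beside the source dataset."""
    parent = h5_main.parent
    source_name = h5_main.name.split('/')[-1]
    index = 0
    # Increment the numeric index until an unused group name is found
    while f'{source_name}-{process_name}_{index:03d}' in parent:
        index += 1
    return parent.create_group(f'{source_name}-{process_name}_{index:03d}')

# e.g. create_results_group(h5_main, 'Clustering') -> group 'Raw_Data-Clustering_000'
```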

Do get in touch if you know of a better solution.

Existing solutions

These are the requirements for the materials characterization domain. I am not sure whether something like NSID, or a python API like the to-be-developed pyNSID, already exists. We would need to survey the web for existing solutions, both to avoid duplicating efforts and to support an existing central effort.

@keknight

In looking at data_type and modality, I'm wondering if you could make use of PROV-DM. It's a data model (W3C) that describes the core elements and relationships that can be used to represent any process as a directed graph by linking three structures (entities, activities, and agents) using a specific set of relationships. Additional constraints are needed to make PROV specific for a given domain, but there are extension points that allow inclusion of domain-specific semantics.

The PROV-DM Core Structures are
1) entity - a physical, digital, conceptual, or other kind of thing with some fixed aspects (can be concrete or abstract)
2) activity - something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities
3) agent - something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
4) relationships, which are: a) wasDerivedFrom, b) used, c) wasGeneratedBy, d) wasInformedBy, e) wasAssociatedWith, f) actedOnBehalfOf, g) wasAttributedTo

http://www.w3.org/TR/prov-dm/ 

@ssomnath
Author

Interesting. I will look into this. Thank you, @keknight
