@lomereiter
Last active June 15, 2016 08:51

Summary of the problem, quoted from the mz5 paper (it concerns .mzML, but is just as true for .imzML):

Although based on excellent ontologies, relying on the extended markup language (XML) for the straightforward implementation of mzData, mzXML, and mzML makes for a major efficiency bottleneck. XML was designed to be a human readable, textual data format with considerable inherent verbosity and redundancy. XML was not designed for efficient bulk data storage, and the general modus operandi requires reading complete files to construct the XML parse tree. The mzXML and mzML formats partly circumvent these limitations by using base-64 encoding and (optional) compression of the raw MS scan data in combination with an application-specific indexing system. Despite the improvements gained from these efforts, vendor formats in general outperform mzXML and mzML in terms of space requirements, as well as in read and write efficiency.
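To make the overhead mentioned in the quote concrete, here is a minimal sketch of the base-64 plus optional zlib scheme, using only the Python standard library. The function names `encode_array` and `decode_array` are made up for illustration, and the little-endian float64 layout is an assumption, not something read from an actual .mzML file.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian float64, optionally zlib-compress, then base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse of encode_array: base64-decode, optionally decompress, unpack float64 values."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical m/z axis with 1000 points.
mz = [100.0 + 0.01 * i for i in range(1000)]

raw_size = 8 * len(mz)                                # bytes of plain binary float64
plain_b64_size = len(encode_array(mz, compress=False))
zlib_b64_size = len(encode_array(mz, compress=True))

print(raw_size, plain_b64_size, zlib_b64_size)
```

Without compression, base-64 alone inflates the binary data by about a third, and the whole array has to be decoded before any single value can be read. How much zlib wins back depends on the data.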

HDF5-based: mz5

SQLite-based: mzDB

Designed for LC-MS data, but extending it to imaging MS data should be easy.

  • JSON was considered for metadata but rejected
  • instead, metadata can be stored as XML, although there are also tables for metadata
  • centroided and profile data cannot be stored in the same file
  • data is organized into chunks
  • range queries are implemented with the R*Tree index that is built into SQLite
  • SQLite does all the indexing, although setting up the chunking and multiple indices is not trivial
  • compression (MS-Numpress) is planned for the next version
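A rough sketch of how chunk lookup via SQLite's R*Tree module can work, using Python's `sqlite3`. The table name `chunk_index`, its columns, and the chunk boundaries are invented for illustration and are not the actual mzDB schema. It also assumes the SQLite library behind `sqlite3` was compiled with R*Tree support, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical index: one bounding box (m/z range x retention-time range) per chunk.
conn.execute(
    """
    CREATE VIRTUAL TABLE chunk_index USING rtree(
        id,
        min_mz, max_mz,
        min_rt, max_rt
    )
    """
)

chunks = [
    (1, 400.0, 405.0, 0.0, 60.0),
    (2, 405.0, 410.0, 0.0, 60.0),
    (3, 400.0, 405.0, 60.0, 120.0),
    (4, 405.0, 410.0, 60.0, 120.0),
]
conn.executemany("INSERT INTO chunk_index VALUES (?, ?, ?, ?, ?)", chunks)


def chunks_overlapping(conn, mz_lo, mz_hi, rt_lo, rt_hi):
    """Return ids of chunks whose bounding box overlaps the query rectangle."""
    rows = conn.execute(
        """
        SELECT id FROM chunk_index
        WHERE max_mz >= ? AND min_mz <= ?
          AND max_rt >= ? AND min_rt <= ?
        ORDER BY id
        """,
        (mz_lo, mz_hi, rt_lo, rt_hi),
    )
    return [row[0] for row in rows]


hits = chunks_overlapping(conn, 401.0, 403.0, 50.0, 70.0)
print(hits)
```

Only the chunks returned by the index need to be read and decoded, which is the point of combining chunking with a spatial index.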

OpenMSI data format (HDF5-based)

  • Designed only for imaging MS data, not for LC-MS
  • Supports only profile-mode data (binning is performed on centroided data)
  • By default stores two copies of data for fast access to both spectra and images.
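For illustration, binning centroided peaks onto a fixed m/z axis can be sketched as below. This is a generic fixed-width binning, not OpenMSI's actual code; the function name `bin_centroids` and the example peaks are made up.

```python
def bin_centroids(peaks, mz_min, mz_max, bin_width):
    """Sum centroid intensities into equal-width m/z bins covering [mz_min, mz_max)."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    profile = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Clamp to the last bin to guard against floating-point edge cases.
            idx = min(int((mz - mz_min) / bin_width), n_bins - 1)
            profile[idx] += intensity
    return profile


# Hypothetical centroided spectrum: (m/z, intensity) pairs.
peaks = [(100.2, 10.0), (100.3, 5.0), (101.7, 2.0), (250.0, 1.0)]
profile = bin_centroids(peaks, mz_min=100.0, mz_max=104.0, bin_width=1.0)
print(profile)
```

Once every spectrum shares the same axis, the dataset becomes a dense (x, y, m/z) array, so a second copy with a different layout is enough to make image slices as cheap to read as spectra.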

Some closed formats also store two copies of the data in profile mode: msiQuant and the SCiLS Lab .sl format. In msiQuant, centroided data is not binned but converted to profile mode via resolution estimation and Gaussian smoothing.
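The centroid-to-profile idea can be sketched as follows. This is not msiQuant's algorithm (which is closed); it assumes the resolution estimate is already available as a single constant resolving power R, takes the peak FWHM as m/z divided by R, and draws each centroid as a Gaussian of that width. The name `centroids_to_profile` and all numbers are illustrative.

```python
import math


def centroids_to_profile(peaks, mz_axis, resolving_power):
    """Render each centroid as a Gaussian whose FWHM is mz / resolving_power."""
    profile = [0.0] * len(mz_axis)
    for mz, intensity in peaks:
        fwhm = mz / resolving_power
        # Convert FWHM to the Gaussian standard deviation.
        sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
        for i, x in enumerate(mz_axis):
            profile[i] += intensity * math.exp(-0.5 * ((x - mz) / sigma) ** 2)
    return profile


# Hypothetical axis: 201 points with 0.001 spacing around m/z 500.
mz_axis = [499.9 + 0.001 * i for i in range(201)]
profile = centroids_to_profile([(500.0, 100.0)], mz_axis, resolving_power=20000)

apex = profile.index(max(profile))
print(apex, round(profile[apex], 3))
```

Unlike binning, this keeps the peak position at its centroid value instead of snapping it to a bin, at the cost of a much denser profile array.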
