Serialization: best practices

(This document focuses mainly on data storage in scientific applications, not on web protocols.)

Traditional approaches

  • XML:
    • slow to parse
    • schemas (.xsd) are human-readable but hard to edit without special software
    • tooling for generating code for reading/writing is limited (mostly to Java)
    • not suited for binary data
    • more on XML: http://c2.com/cgi/wiki?XmlSucks
  • JSON:
    • much simpler than XML, but also more limited
    • shares most of XML's disadvantages
    • web-friendly
  • HDF5:
    • designed for storing groups of huge N-dimensional arrays
    • insanely complex format specification
      • "The reference implementation of the HDF5 File Format and I/O Library (http://hdf.ncsa.uiuc.edu/HDF5/) consists of approximately 2073 files or about 917,000 lines of the source code."
      • there's no full implementation beyond the reference one
    • includes chunking and compression of arrays
      • needs parameter tweaking to get good performance (see the h5py sketch after this list)
    • NASA uses it to store Earth observation data
    • no built-in indexing support
  • SQLite:
    • single-file relational database with a stable, well-documented file format
    • ubiquitous, battle-tested implementation; its developers recommend it as an application file format
    • built-in indexing and querying via SQL
    • binary data goes into BLOBs, with no built-in chunking or compression for large arrays (see the sqlite3 sketch after this list)
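
As an illustration of the parameter tweaking HDF5 requires, here is a minimal h5py sketch; the chunk shape and compression level are illustrative assumptions, not tuned recommendations:

    import h5py
    import numpy as np

    data = np.random.rand(4_000, 4_000)

    with h5py.File("data.h5", "w") as f:
        f.create_dataset("matrix", data=data,
                         chunks=(500, 500),    # chunk shape: an assumption to tune
                         compression="gzip",   # zlib-based filter built into HDF5
                         compression_opts=4)   # compression level 0-9

    with h5py.File("data.h5", "r") as f:
        block = f["matrix"][:500, :500]  # reads only the chunks it overlaps

And a minimal sketch of SQLite used as an application file format, via Python's built-in sqlite3 module (the table and file names are made up for the example):

    import sqlite3

    con = sqlite3.connect("results.db")  # a single portable file
    con.execute("CREATE TABLE IF NOT EXISTS measurements (sample TEXT, value REAL)")
    con.executemany("INSERT INTO measurements VALUES (?, ?)",
                    [("s1", 0.42), ("s2", 1.37)])
    con.commit()

    for sample, value in con.execute("SELECT sample, value FROM measurements"):
        print(sample, value)
    con.close()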

Modern tools for serialization

  • generate reading/writing code for multiple programming languages
  • allow flexible evolution of schemas (adding/removing/deprecating fields)

Mainly used nowadays: Google Protocol Buffers and tools of similar scope (e.g. Apache Thrift, Apache Avro), which encode data compactly but require a full decoding step on read.
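
For contrast with the zero-copy tools below, a minimal Protocol Buffers sketch in Python; it assumes a module person_pb2 generated by protoc from a hypothetical person.proto defining a Person message with name and id fields:

    # person_pb2 is assumed to be generated by:  protoc --python_out=. person.proto
    import person_pb2

    p = person_pb2.Person()
    p.name = "Alice"
    p.id = 1
    blob = p.SerializeToString()   # compact varint-based encoding

    q = person_pb2.Person()
    q.ParseFromString(blob)        # a full decode pass, unlike zero-copy formats
    print(q.name, q.id)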

Very recent tools (zero-copy decoding: a file can simply be mmap-ed and accessed with minimal overhead):

  • Google Flatbuffers:
    • shaky prospects (Google doesn't use it much)
    • supports directed acyclic graphs of objects
    • intended mostly for storing/loading game scenes
    • languages other than C++ are second-class citizens
    • scientific use: in Feather format for storing metadata
  • Cap'n Proto:
    • its author worked at Google, wrote Protocol Buffers v2, and learned all its downsides firsthand
    • supports only trees of objects, not DAGs; using integer IDs is the suggested workaround
    • pycapnp is easier to use and has 10x more downloads on PyPI than flatbuffers
    • intended for interprocess communication (=> support for multiple languages)
    • dogfooded by the awesome Sandstorm project
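
A minimal pycapnp sketch (the schema file name and struct are made up for the example); note how new fields can later be appended with the next ordinal number without breaking old readers:

    import capnp
    capnp.remove_import_hook()

    # Assumes a file person.capnp containing, e.g.:
    #   @0xdbb9ad1f14bf0b36;
    #   struct Person {
    #     name @0 :Text;
    #     age  @1 :UInt32;
    #     # a later schema revision could add: email @2 :Text;
    #   }
    person_capnp = capnp.load("person.capnp")

    msg = person_capnp.Person.new_message(name="Alice", age=30)
    with open("person.bin", "wb") as f:
        msg.write(f)

    with open("person.bin", "rb") as f:
        person = person_capnp.Person.read(f)  # lazy, near-zero-copy access
        print(person.name, person.age)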

Summary

  • creating new formats is easy with Protocol Buffers and the like
    • an evolving, human-readable schema instead of a hand-written specification
    • generated file format is immediately usable with many programming languages
  • storing multidimensional arrays needs a bit of extra work:
    • chunking (trivial, especially with array libraries like numpy)
    • compression (stable libraries: zlib/bzip2 for long-term storage, LZ4 for speed); see the sketch after this list
    • in principle, the two points above can be solved in one shot with Blosc
      • the format of 1.x version is stable but lacks a formal specification
  • SQLite/HDF5 are flexible and often good enough, but:
    • for large datasets, careful choice of parameters and settings is required
    • complicated data structures have to be fit into the supported data model (relational and hierarchical, respectively), which requires extra effort
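
A minimal sketch of chunking plus compression for a 1-D numpy array, using zlib from the standard library (the chunk size is an arbitrary assumption); Blosc exposes essentially the same operation through blosc.compress/blosc.decompress:

    import zlib
    import numpy as np

    arr = np.arange(1_000_000, dtype=np.float64)
    chunk_len = 65_536  # elements per chunk: a tunable assumption

    # Compress each chunk independently so chunks can later be read selectively.
    chunks = [zlib.compress(arr[i:i + chunk_len].tobytes(), 6)
              for i in range(0, arr.size, chunk_len)]

    # A real format would also store dtype, shape, and chunk offsets alongside.
    restored = np.frombuffer(b"".join(zlib.decompress(c) for c in chunks),
                             dtype=arr.dtype)
    assert np.array_equal(arr, restored)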