Serialization: best practices

(This document focuses mainly on data storage in scientific applications, not on web protocols.)

Traditional approaches

  • XML:
    • slow to parse
    • schemas (.xsd) are human-readable but hard to edit without special software
    • tooling for generating code for reading/writing is limited (mostly to Java)
    • not suited for binary data
    • more on XML: http://c2.com/cgi/wiki?XmlSucks
  • JSON:
    • much simpler than XML, but also more limited
    • shares most of XML's disadvantages
    • web-friendly
  • HDF5:
    • designed for storing groups of huge N-dimensional arrays
    • insanely complex format specification
      • "The reference implementation of the HDF5 File Format and I/O Library (http://hdf.ncsa.uiuc.edu/HDF5/) consists of approximately 2073 files or about 917,000 lines of the source code."
      • there's no full implementation beyond the reference one
    • includes chunking and compression of arrays
      • needs parameter tweaking to get good performance (see the h5py sketch after this list)
    • NASA uses it to store Earth observation data
    • no built-in indexing support
  • SQLite:
    • single-file relational database with a stable, well-documented file format
    • ubiquitous, battle-tested implementation; its developers recommend it as an application file format
    • built-in indexing and querying via SQL
    • binary data goes into BLOBs, with no built-in chunking or compression for large arrays (see the sqlite3 sketch after this list)
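
As an illustration of the parameter tweaking HDF5 requires, here is a minimal h5py sketch; the chunk shape and compression level are illustrative assumptions, not tuned recommendations:

    import h5py
    import numpy as np

    data = np.random.rand(4_000, 4_000)

    with h5py.File("data.h5", "w") as f:
        f.create_dataset("matrix", data=data,
                         chunks=(500, 500),    # chunk shape: an assumption to tune
                         compression="gzip",   # zlib-based filter built into HDF5
                         compression_opts=4)   # compression level 0-9

    with h5py.File("data.h5", "r") as f:
        block = f["matrix"][:500, :500]  # reads only the chunks it overlaps

And a minimal sketch of SQLite used as an application file format, via Python's built-in sqlite3 module (the table and file names are made up for the example):

    import sqlite3

    con = sqlite3.connect("results.db")  # a single portable file
    con.execute("CREATE TABLE IF NOT EXISTS measurements (sample TEXT, value REAL)")
    con.executemany("INSERT INTO measurements VALUES (?, ?)",
                    [("s1", 0.42), ("s2", 1.37)])
    con.commit()

    for sample, value in con.execute("SELECT sample, value FROM measurements"):
        print(sample, value)
    con.close()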

Modern tools for serialization

  • generate reading/writing code for multiple programming languages
  • allow flexible evolution of schemas (adding/removing/deprecating fields)

Mainly used nowadays: Google Protocol Buffers and tools of similar scope (e.g. Apache Thrift, Apache Avro), which encode data compactly but require a full decoding step on read.
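
For contrast with the zero-copy tools below, a minimal Protocol Buffers sketch in Python; it assumes a module person_pb2 generated by protoc from a hypothetical person.proto defining a Person message with name and id fields:

    # person_pb2 is assumed to be generated by:  protoc --python_out=. person.proto
    import person_pb2

    p = person_pb2.Person()
    p.name = "Alice"
    p.id = 1
    blob = p.SerializeToString()   # compact varint-based encoding

    q = person_pb2.Person()
    q.ParseFromString(blob)        # a full decode pass, unlike zero-copy formats
    print(q.name, q.id)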

Very recent tools (zero-copy decoding: a file can simply be mmap-ed and accessed with minimal overhead):

  • Google Flatbuffers:
    • shaky prospects (Google doesn't use it much)
    • supports directed acyclic graphs of objects
    • intended mostly for storing/loading game scenes
    • languages other than C++ are second-class citizens
    • scientific use: in Feather format for storing metadata
  • Cap'n Proto:
    • its author worked at Google, wrote Protocol Buffers v2, and learned all its downsides firsthand
    • supports only trees of objects, not DAGs; using integer IDs is the suggested workaround
    • pycapnp is easier to use and has 10x more downloads on PyPI than flatbuffers
    • intended for interprocess communication (=> support for multiple languages)
    • dogfooded by the awesome Sandstorm project
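
A minimal pycapnp sketch (the schema file name and struct are made up for the example); note how new fields can later be appended with the next ordinal number without breaking old readers:

    import capnp
    capnp.remove_import_hook()

    # Assumes a file person.capnp containing, e.g.:
    #   @0xdbb9ad1f14bf0b36;
    #   struct Person {
    #     name @0 :Text;
    #     age  @1 :UInt32;
    #     # a later schema revision could add: email @2 :Text;
    #   }
    person_capnp = capnp.load("person.capnp")

    msg = person_capnp.Person.new_message(name="Alice", age=30)
    with open("person.bin", "wb") as f:
        msg.write(f)

    with open("person.bin", "rb") as f:
        person = person_capnp.Person.read(f)  # lazy, near-zero-copy access
        print(person.name, person.age)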

Summary

  • creating new formats is easy with Protocol Buffers and the like
    • an evolving, human-readable schema instead of a hand-written specification
    • generated file format is immediately usable with many programming languages
  • storing multidimensional arrays needs a bit of extra work:
    • chunking (trivial, especially with array libraries like numpy)
    • compression (stable libraries: zlib/bzip2 for long-term storage, LZ4 for speed); see the sketch after this list
    • in principle, the two points above can be solved in one shot with Blosc
      • the format of 1.x version is stable but lacks a formal specification
  • SQLite/HDF5 are flexible and often good enough, but:
    • for large datasets, careful choice of parameters and settings is required
    • complicated data structures have to be fit into the supported data model (relational and hierarchical, respectively), which requires extra effort
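
A minimal sketch of chunking plus compression for a 1-D numpy array, using zlib from the standard library (the chunk size is an arbitrary assumption); Blosc exposes essentially the same operation through blosc.compress/blosc.decompress:

    import zlib
    import numpy as np

    arr = np.arange(1_000_000, dtype=np.float64)
    chunk_len = 65_536  # elements per chunk: a tunable assumption

    # Compress each chunk independently so chunks can later be read selectively.
    chunks = [zlib.compress(arr[i:i + chunk_len].tobytes(), 6)
              for i in range(0, arr.size, chunk_len)]

    # A real format would also store dtype, shape, and chunk offsets alongside.
    restored = np.frombuffer(b"".join(zlib.decompress(c) for c in chunks),
                             dtype=arr.dtype)
    assert np.array_equal(arr, restored)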