@rossant
Created October 9, 2015 08:49
Problems with HDF5

  • Corruption: for reasons that are not clear, files sometimes get corrupted during a session and users lose all their work, whether automatic or manual, which may correspond to days of computer time or, worse, human time. Corruption is more likely because libhdf5 is a very complex piece of software, and a crash or sudden kill is likely to corrupt the file completely. This would be much rarer with flat binary or text files, where at least you would be able to recover part of the data.

  • Not possible to delete arrays, but that might be fixed in the future (not today though...).

  • Various bugs with strings on Windows and h5py: users may need to downgrade h5py in order to use their files, otherwise a nasty segfault occurs. Not a good sign...

  • There is a single implementation of HDF5 in the world, so we depend critically on it. It is almost impossible to contribute to such a complex piece of code, since it is really low-level (in C). There are bugs and performance issues with it, and there is nothing we can do about them.

  • HDF5 recreates a file system hierarchy within a file. It is a FS within a FS, just less powerful, more buggy, less efficient, etc. Just... why? Why not just use the rock-solid file system..? File systems today are extremely stable, fast, optimized, robust, etc. There is nothing wrong with having multiple files instead of a huge one, on the contrary.

  • Performance: random access in a big array is much slower in HDF5 than with a simple memmap on a flat binary array.

  • Opacity of files: you have an HDF5 file. What is in there? How is the data organized? You need dedicated software to find out, which depends on the single libhdf5 implementation which is hard to compile. This is an order of magnitude harder, slower, and more opaque than a text file or a flat binary file. If some archaeologists from 2070 find a .kwik file somewhere, they may not be able to open it, whereas chances are that they will be able to read a text or flat binary file.

  • It is difficult to change metadata values, change the contents of an array, add an array, or remove an array (currently it is impossible). You need to go through libhdf5 and you need dedicated software for this. Same story if you want to extract one array out of your HDF5 file.

  • Limitation in the data types: support of Unicode is poor, so you have to stick with ASCII (it is 2015 by the way). With h5py, it is easy to incorrectly save some metadata in an unsupported data type (for example a tuple of strings maybe), which will be silently converted to an opaque pickle binary blob that is impossible to read without (possibly the same version of) h5py.

  • HDF5 was not designed to work on distributed architectures like Spark. It was designed way before that time. Parallel HDF5 might work with MPI, but MPI is fundamentally different from Spark (see http://www.dursi.ca/hpc-is-dying-and-mpi-is-killing-it/). It seems that the only way currently to make HDF5 work with Spark is to use multiple HDF5 files, which de facto removes all benefits of HDF5. By contrast, using a flat binary file in parallel on Spark is trivial.

  • Parallel reading/writing of HDF5 files is limited, tricky, slow, and somewhat buggy.

  • The only "advantage" I can see of HDF5 is that it gives you a way to put multiple arrays in a single file. But I am not sure this is really an advantage: there is nothing wrong with having multiple, smaller files, really... Would you want to put all of your photos in a single gigantic multi-TB file? It is like putting all your eggs in one basket: it is good if you live dangerously. One corrupted byte in that file and you lose all your photo library!
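To make the corruption point concrete, here is a minimal sketch of partial recovery from a flat binary file that was cut off mid-write. The layout (raw little-endian float64 values with no header) and the `recover` helper are assumptions made for illustration, not something the gist specifies.

```python
import os
import struct
import tempfile

ITEM_SIZE = 8  # assumed layout: raw little-endian float64, no header


def recover(path):
    """Return every complete float64 in the file, ignoring a torn last item."""
    with open(path, "rb") as f:
        raw = f.read()
    n = len(raw) // ITEM_SIZE  # number of complete items that survived
    return list(struct.unpack("<%dd" % n, raw[: n * ITEM_SIZE]))


path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<5d", 1.0, 2.0, 3.0, 4.0, 5.0))

# Simulate a crash in the middle of the last write: drop the final 3 bytes.
with open(path, "r+b") as f:
    f.truncate(5 * ITEM_SIZE - 3)

recovered = recover(path)  # the first four values are still readable
```

Only the torn item is lost here; everything written before the crash can still be read with a few lines of code.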
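The random-access point can be sketched with a memory map over a flat binary file. This version uses the standard-library `mmap` module rather than NumPy's memmap so that it runs without dependencies; the file layout and the helper names `write_flat` and `read_item` are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: little-endian float64 values back to back, no header.
ITEM = struct.Struct("<d")


def write_flat(path, values):
    """Write the values as raw float64, one after another."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(v))


def read_item(path, index):
    """Read one element by byte offset without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ITEM.unpack_from(mm, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "data.bin")
write_flat(path, [i * 0.5 for i in range(1000)])
value = read_item(path, 123)
```

The position of element `i` is simply `i * 8` bytes into the file, so there is no index structure or chunk lookup between the request and the data. This sketch does not measure anything; it only shows why the access path is short.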
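The points about opacity, metadata, and removing arrays can be illustrated with one possible alternative: a folder holding one flat binary file per array plus a UTF-8 JSON file for metadata. The folder layout, file names, and helper functions below are all assumptions for the sake of the example.

```python
import json
import os
import struct
import tempfile


def save_array(folder, name, values):
    """Store one array as raw little-endian float64 in its own file."""
    with open(os.path.join(folder, name + ".bin"), "wb") as f:
        f.write(struct.pack("<%dd" % len(values), *values))


def load_array(folder, name):
    """Load an array back; the float64 dtype is assumed, not stored."""
    with open(os.path.join(folder, name + ".bin"), "rb") as f:
        raw = f.read()
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


def save_meta(folder, meta):
    """Write metadata as human-readable UTF-8 JSON."""
    with open(os.path.join(folder, "meta.json"), "w", encoding="utf-8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)


def load_meta(folder):
    with open(os.path.join(folder, "meta.json"), encoding="utf-8") as f:
        return json.load(f)


folder = tempfile.mkdtemp()
save_array(folder, "spikes", [0.1, 0.2, 0.3])
save_array(folder, "waveforms", [1.0, 2.0])
save_meta(folder, {"experimenter": "Zoë", "channels": ["a", "b"]})

# Change a metadata value: read, edit, write back (or use a text editor).
meta = load_meta(folder)
meta["experimenter"] = "José"
save_meta(folder, meta)

# Remove an array: delete its file.
os.remove(os.path.join(folder, "waveforms.bin"))
remaining = sorted(os.listdir(folder))
```

Listing the folder answers "what is in there?", and extracting one array means copying one file. The trade-off is that dtype and shape are not self-describing in a raw file, so in practice they would need to be recorded in the metadata.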
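As a rough illustration of why parallel access to a flat binary file is simple, the sketch below splits a file into byte ranges and lets each worker read only its own range. It uses a local thread pool as a stand-in; it is not Spark code, and the helper names and float64 layout are assumptions.

```python
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor

ITEM_SIZE = 8  # assumed layout: raw little-endian float64, no header


def chunk_sum(path, start, count):
    """Open the file independently, read `count` items from `start`, sum them."""
    with open(path, "rb") as f:
        f.seek(start * ITEM_SIZE)
        raw = f.read(count * ITEM_SIZE)
    n = len(raw) // ITEM_SIZE
    return sum(struct.unpack("<%dd" % n, raw))


def parallel_sum(path, n_workers=4):
    """Split the file into contiguous item ranges and sum them concurrently."""
    n_items = os.path.getsize(path) // ITEM_SIZE
    step = max(1, -(-n_items // n_workers))  # ceiling division, at least 1
    ranges = [(s, min(step, n_items - s)) for s in range(0, n_items, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(lambda r: chunk_sum(path, *r), ranges))


path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000d", *[float(i) for i in range(1000)]))

total = parallel_sum(path)
```

Each worker needs only a path, an offset, and a length, which is the same information a distributed framework would hand to a partition. No shared library state or file-level lock is involved.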

@xkortex
xkortex commented Sep 4, 2020

👏 Well-put.
