Last active Nov 13, 2022
A Straightforward Way To Extend CSV With Metadata

Pekka Väänänen, Aug 19 2021.

This proposal is a response to It's Time to Retire the CSV by Alex Rasmussen and the discussion on Don't take it too seriously.

CSV (comma-separated values) files are great but sometimes difficult to parse, because everybody seems to have a slightly different idea of what CSV means. The obvious solution is to transmit some metadata that tells the reader what to expect, but where do you put it? Well, how about a ZIP archive?

An archive with two files. The first file, say format.txt, has the metadata inside and the second one is the original CSV file unchanged. This is still readable by non-technical users because ZIP files are natively supported by both Windows and macOS. People can double click on them like a directory and then double click again on the CSV to open it up in Excel.

I know it sounds simplistic but if there's a lesson to be learned from the history of computing, it's that stupid ideas often win. By making this extended CSV format at least somewhat backwards compatible, it's possible (in theory) to switch to it without enraging your customers.

The Spec

Let's try to sketch something just for the sake of discussion. Let there be two formats.

The File Format. A ZIP archive, either uncompressed or compressed with the DEFLATE algorithm. The archive contains at least two files:

  • format.txt, the metadata file
  • *.csv, a CSV file

There can be multiple CSV files but they must all respect format.txt.
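
As a rough illustration of the file format, here is a minimal Python sketch that packs and unpacks such an archive with the standard zipfile module. The function names write_archive and read_archive are made up for this example, and treating format.txt as UTF-8 is an assumption (the proposal only requires the first line to be ASCII). CSV members are returned as raw bytes because their encoding is exactly the kind of thing the metadata is supposed to describe.

```python
from __future__ import annotations

import io
import zipfile


def write_archive(metadata: str, csv_files: dict[str, bytes]) -> bytes:
    """Pack format.txt plus one or more CSV files into an in-memory ZIP."""
    buf = io.BytesIO()
    # ZIP_DEFLATED selects DEFLATE; ZIP_STORED would give the uncompressed variant.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("format.txt", metadata)
        for name, data in csv_files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_archive(blob: bytes) -> tuple[str, dict[str, bytes]]:
    """Return (metadata text, {csv name: raw bytes}) from an archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        names = zf.namelist()
        if "format.txt" not in names:
            raise ValueError("archive has no format.txt")
        csv_names = [n for n in names if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive has no CSV file")
        # Assumption: the metadata file is UTF-8 (ASCII is a subset of it).
        metadata = zf.read("format.txt").decode("utf-8")
        return metadata, {n: zf.read(n) for n in csv_names}


blob = write_archive("CSV Dialect v1.2\n{}", {"a.csv": b"x;y\r\n1;2\r\n"})
metadata, csvs = read_archive(blob)
```

Anything in the archive that is neither format.txt nor a .csv file is simply ignored here, which matches the "at least two files" wording.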

The Metadata Format. Very loose. The first line of format.txt must contain an ASCII encoded metadata type name, terminated by a linefeed. The rest of the file is then interpreted according to that name.

For example, if we'd like to use the CSV Dialect, then format.txt could say this:

CSV Dialect v1.2
{
  "dialect": {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "commentChar": "#"
  }
}

This way different metadata formats could evolve without breaking the overall scheme.
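
To make the metadata format concrete, here is a hedged Python sketch of how a reader might split format.txt into its type name and body, and then map a CSV Dialect body onto the standard csv module. The helper names parse_format and reader_from_dialect are hypothetical, the body is assumed to be a UTF-8 JSON object with a top-level "dialect" key, and only the options that csv.reader understands are mapped.

```python
import csv
import io
import json


def parse_format(raw: bytes):
    """Split format.txt into (metadata type name, remaining body)."""
    first, sep, rest = raw.partition(b"\n")
    if not sep:
        raise ValueError("the first line must be terminated by a linefeed")
    type_name = first.decode("ascii")  # raises UnicodeDecodeError if not ASCII
    return type_name, rest.decode("utf-8")  # assumption: the body is UTF-8


def reader_from_dialect(body: str, csv_text: str):
    """Build a csv.reader from a CSV Dialect JSON body."""
    d = json.loads(body)["dialect"]
    # csv.reader has no counterpart for commentChar and it ignores
    # lineTerminator when reading, so those keys are not mapped here.
    return csv.reader(
        io.StringIO(csv_text, newline=""),
        delimiter=d.get("delimiter", ","),
        quotechar=d.get("quoteChar", '"'),
        doublequote=d.get("doubleQuote", True),
        skipinitialspace=d.get("skipInitialSpace", False),
    )


raw = (
    b'CSV Dialect v1.2\n'
    b'{"dialect": {"delimiter": ";", "quoteChar": "\\"", '
    b'"doubleQuote": true, "skipInitialSpace": true}}'
)
type_name, body = parse_format(raw)
if type_name != "CSV Dialect v1.2":
    raise ValueError(f"unsupported metadata type: {type_name}")
rows = list(reader_from_dialect(body, 'a; b\r\n1; "x;y"\r\n'))
```

Dispatching on the type name before touching the body is what lets other metadata formats coexist: a reader that does not recognize the first line can stop there.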

Would this scheme really help me parse CSVs?

Maybe, but possibly not enough. A mismatch between the metadata and the CSV can still happen, and there's nothing we can do about it as long as CSV is editable by anyone with a text editor. Also, the maximum file size limit of the ZIP format is unfortunate.


  • Q: Why not use a tarball?
    • They are incomprehensible to Windows users.
  • Q: How do you store CSV files larger than the ZIP's maximum file size of 2^32-1 bytes?
    • Save the archive as ZIP64. It's supported by Windows Explorer since Vista, but macOS seems to ship too old a version of unzip. Not a great solution.
  • Q: How do you do random access?
    • Save the ZIP file uncompressed and put some kind of index in the metadata.
  • Q: Have you seen that XKCD comic about standards?
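
The random-access answer above can be sketched in Python as follows. This is only an illustration under stated assumptions: the "Row Offsets v0" metadata type and the read_range and stored_data_offset helpers are invented for this example, and the index is simply a list of byte offsets into the CSV. Because a stored (uncompressed) member sits in the archive byte for byte, its data can be read directly once the local file header has been skipped.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIZE = 30  # fixed-size part of a ZIP local file header


def stored_data_offset(fileobj, info):
    """Absolute offset of the first data byte of an uncompressed member."""
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f"{info.filename} is compressed; offsets do not apply")
    fileobj.seek(info.header_offset)
    header = fileobj.read(LOCAL_HEADER_SIZE)
    if len(header) != LOCAL_HEADER_SIZE or header[:4] != b"PK\x03\x04":
        raise ValueError("bad local file header")
    # The file name and the extra field follow the fixed part; their lengths
    # are the last two little-endian 16-bit fields of the header.
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    return info.header_offset + LOCAL_HEADER_SIZE + name_len + extra_len


def read_range(fileobj, member, start, length):
    """Read `length` bytes at offset `start` inside a stored member."""
    with zipfile.ZipFile(fileobj) as zf:
        info = zf.getinfo(member)
    if start < 0 or length < 0 or start + length > info.file_size:
        raise ValueError("range falls outside the member")
    fileobj.seek(stored_data_offset(fileobj, info) + start)
    return fileobj.read(length)


csv_bytes = b"id;name\r\n1;a\r\n2;b\r\n"
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    # Hypothetical index: byte offsets of the data rows within data.csv.
    zf.writestr("format.txt", "Row Offsets v0\n9 14\n")
    zf.writestr("data.csv", csv_bytes)

second_row = read_range(archive, "data.csv", 14, 3)
```

The same offsets become useless as soon as the member is DEFLATE-compressed, which is why the answer insists on storing the file uncompressed.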
sterlinm commented Aug 20, 2021

a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Python has CSV Sniffer that does something like this. Presumably Pandas is also doing this and could expose the inferred dialect without actually reading the rest of the file.
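
For reference, a small sketch of what the standard library's csv.Sniffer does. The sample data is made up, and restricting the candidate delimiters is optional. The sniffer only guesses from a sample, so it can be wrong on unusual files.

```python
import csv

# Made-up sample; in practice this would be the first few KB of the file.
sample = "id;name;score\r\n1;Alice;9.5\r\n2;Bob;7.0\r\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample, delimiters=";,\t|")  # guesses ";" here
has_header = sniffer.has_header(sample)  # heuristic based on column types

rows = list(csv.reader(sample.splitlines(), dialect))
```

The inferred dialect object can be passed straight to csv.reader, as in the last line.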

seece commented Aug 20, 2021

Have you seen the BagIt file format? It's used in computational biology platforms but is generic enough to apply to any computational workload. It works on the same idea of using ZIP and separate files for metadata.

I haven't, it seems pretty much the same idea but with more complexity.

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be.

One can dream, right?

This is common in the 3rd party data vendor industry as well, where a schema text file sits alongside the data files in a folder that is zipped up and transmitted.

Yeah I expected something like that :)

denis-bz commented Oct 17, 2022

One file is better than 2: the metadata stays with the data. How about

# header e.g. metadata
# header ...
# ...
topline  e.g. column names in a .csv
... the rest

A .csv like this can be read with NO changes, ignoring the header, in python pandas:

pd.read_csv( filename, comment="#", sep= ... )

There are simple hacks read_metacsv( csvin ) --> df, headerlines
and write_metacsv( csvout: str, df: pd.DataFrame, header: list[str] )
which I could put up on gist.github for anyone who'd want to test them.
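
The helpers mentioned in this comment were not posted, so the following is only a guess at their shape, not the commenter's code. It is a standard-library sketch that works on lists of rows instead of a pandas DataFrame and treats only the leading comment lines as the header; the parameter names are assumptions.

```python
import csv
import io


def write_metacsv(out, rows, header, comment="#"):
    """Write header lines as comments, then the rows as ordinary CSV."""
    for line in header:
        out.write(f"{comment} {line}\n")
    csv.writer(out, lineterminator="\n").writerows(rows)


def read_metacsv(src, comment="#"):
    """Return (rows, header lines); only leading comment lines are header."""
    header, body = [], []
    in_header = True
    for line in src:
        if in_header and line.startswith(comment):
            header.append(line[len(comment):].strip())
        else:
            in_header = False
            body.append(line)
    return list(csv.reader(body)), header


buf = io.StringIO()
write_metacsv(buf, [["id", "name"], ["1", "a"]], ["delimiter: ,", "source: test"])
buf.seek(0)
rows, header = read_metacsv(buf)
```

One difference from the pandas call above: pd.read_csv with comment="#" drops everything after a "#" anywhere in the file, while this sketch leaves "#" characters after the header block untouched.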

