Skip to content

Instantly share code, notes, and snippets.

@seece
Last active November 13, 2022 01:07
Show Gist options
  • Star 11 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save seece/97d02d6518db8b9e4c5b81fc686553e2 to your computer and use it in GitHub Desktop.
Save seece/97d02d6518db8b9e4c5b81fc686553e2 to your computer and use it in GitHub Desktop.
A Straightforward Way To Extend CSV With Metadata

A Straightforward Way To Extend CSV With Metadata

Pekka Väänänen, Aug 19 2021.

This proposal is a response to It's Time to Retire the CSV by Alex Rasmussen and the discussion on lobste.rs. Don't take it too seriously.

CSV files (comma-separated values) are great but sometimes difficult to parse because everybody seems to have a slightly different idea what CSV means. The obvious solution is to transmit some metadata that tells what to expect but where do you put it? Well, how about a ZIP archive?

An archive with two files. The first file, say format.txt, has the metadata inside and the second one is the original CSV file unchanged. This is still readable by non-technical users because ZIP files are natively supported by both Windows and macOS. People can double click on them like a directory and then double click again on the CSV to open it up in Excel.

I know it sounds simplistic but if there's a lesson to be learned from the history of computing, it's that stupid ideas often win. By making this extended CSV format at least somewhat backwards compatible, it's possible (in theory) to switch to it without enraging your customers.

The Spec

Let's try to sketch something just for the sake of discussion. Let there be two formats.

The File Format. A ZIP archive, either uncompressed or compressed with the DEFLATE algorithm. The archive contains at least two files:

  • format.txt, the metadata file
  • *.csv, a CSV file

There can be multiple CSV files but they must all respect format.txt.

The Metadata Format. Very loose. The first line of format.txt must contain an ASCII encoded metadata type name, terminated by a linefeed. The rest of the file is then interpreted according to that name.

For example if we'd like to use the CSV Dialect then format.txt could say this:

CSV Dialect v1.2
{
  "dialect": {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "commentChar": "#"
  }
}

This way different metadata formats could evolve without breaking the overall scheme.

Would this scheme really help me parse CSVs?

Maybe but possibly not enough. A mismatch between metadata and the CSV can still happen and there's nothing we can do about it as long as CSV is editable by anyone with a text editor. Also, the maximum file size limit of the ZIP format is unfortunate.

FAQ

  • Q: Why not use a tarball?
    • They are incomprehensible to Windows users.
  • Q: How do you store CSV files larger than the ZIP's maximum file size of 2^32-1 bytes?
    • Save the archive as ZIP64. It's supported by Windows Explorer since Vista but macOS seems to have too old of a version of unzip. Not a great solution.
  • Q: How do you do random access?
    • Save the ZIP file uncompressed and put some kind of index in the metadata.
  • Q: Have you seen that XKCD comic about standards?
@waffletower
Copy link

waffletower commented Aug 19, 2021

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be. I think a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Such a tool would not be restricted to returning pedestrian JSON with camelCase identifiers either 👃

@djedr
Copy link

djedr commented Aug 19, 2021

Here is a minimalist alternative as a parser in JavaScript: https://gist.github.com/djedr/8349415de904eeade9bb1cb5e860e3ec

You can define custom separator characters for a simple format for tabular data which doesn't have the escaping issues of CSV.

@drtbiswas
Copy link

This is common in the 3rd party data vendor industry as well where there would be a schema text file along with the data files in the folder overall that is zipped up transmitted.

@sterlinm
Copy link

a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Python has CSV Sniffer that does something like this. Presumably Pandas is also doing this and could expose the inferred dialect without actually reading the rest of the file.

@seece
Copy link
Author

seece commented Aug 20, 2021

Have you seen the bagit file format Its used in computational biology platforms but is generic enough to apply to any computational workload. It works with the same idea of using zip and seperate files for metadata.

I haven't, it seems pretty much the same idea but with more complexity.

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be.

One can dream, right?

This is common in the 3rd party data vendor industry as well where there would be a schema text file along with the data files in the folder overall that is zipped up transmitted.

Yeah I expected something like that :)

@denis-bz
Copy link

One file is better than 2: the metadata stays with the data. How about

# header e.g. metadata
# header ...
# ...
topline  e.g. column names in a .csv
... the rest

A .csv like this can be read with NO changes, ignoring the header, in python pandas:

pd.read_csv( filename, comment="#", sep= ... )

There are simple hacks read_metacsv( csvin ) --> df, headerlines
and write_metacsv( csvout: str, df: pd.DataFrame, header: list[str] )
which I could put up on gist.github for anyone who'd want to test them.

xkcd-standards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment