Skip to content

Instantly share code, notes, and snippets.

@seece
Last active Aug 20, 2021
Embed
What would you like to do?
A Straightforward Way To Extend CSV With Metadata

A Straightforward Way To Extend CSV With Metadata

Pekka Väänänen, Aug 19 2021.

This proposal is a response to It's Time to Retire the CSV by Alex Rasmussen and the discussion on lobste.rs. Don't take it too seriously.

CSV files (comma-separated values) are great but sometimes difficult to parse because everybody seems to have a slightly different idea what CSV means. The obvious solution is to transmit some metadata that tells what to expect but where do you put it? Well, how about a ZIP archive?

An archive with two files. The first file, say format.txt, has the metadata inside and the second one is the original CSV file unchanged. This is still readable by non-technical users because ZIP files are natively supported by both Windows and macOS. People can double click on them like a directory and then double click again on the CSV to open it up in Excel.

I know it sounds simplistic but if there's a lesson to be learned from the history of computing, it's that stupid ideas often win. By making this extended CSV format at least somewhat backwards compatible, it's possible (in theory) to switch to it without enraging your customers.

The Spec

Let's try to sketch something just for the sake of discussion. Let there be two formats.

The File Format. A ZIP archive, either uncompressed or compressed with the DEFLATE algorithm. The archive contains at least two files:

  • format.txt, the metadata file
  • *.csv, a CSV file

There can be multiple CSV files but they must all respect format.txt.

The Metadata Format. Very loose. The first line of format.txt must contain an ASCII encoded metadata type name, terminated by a linefeed. The rest of the file is then interpreted according to that name.

For example if we'd like to use the CSV Dialect then format.txt could say this:

CSV Dialect v1.2
{
  "dialect": {
    "csvddfVersion": 1.2,
    "delimiter": ";",
    "doubleQuote": true,
    "lineTerminator": "\r\n",
    "quoteChar": "\"",
    "skipInitialSpace": true,
    "header": true,
    "commentChar": "#"
  }
}

This way different metadata formats could evolve without breaking the overall scheme.

Would this scheme really help me parse CSVs?

Maybe but possibly not enough. A mismatch between metadata and the CSV can still happen and there's nothing we can do about it as long as CSV is editable by anyone with a text editor. Also, the maximum file size limit of the ZIP format is unfortunate.

FAQ

  • Q: Why not use a tarball?
    • They are incomprehensible to Windows users.
  • Q: How do you store CSV files larger than the ZIP's maximum file size of 2^32-1 bytes?
    • Save the archive as ZIP64. It's supported by Windows Explorer since Vista but macOS seems to have too old of a version of unzip. Not a great solution.
  • Q: How do you do random access?
    • Save the ZIP file uncompressed and put some kind of index in the metadata.
  • Q: Have you seen that XKCD comic about standards?
@mlev71

This comment has been minimized.

Copy link

@mlev71 mlev71 commented Aug 19, 2021

Love this!
Have you seen the bagit file format Its used in computational biology platforms but is generic enough to apply to any computational workload. It works with the same idea of using zip and seperate files for metadata.

Here is the rfc

@cjohnson318

This comment has been minimized.

Copy link

@cjohnson318 cjohnson318 commented Aug 19, 2021

I really like that link to Frictionless Data. It'd be nice to have a pandas function to read a CSV dialect in a zip archive like that and just know what to do with it instead of opening the CSV in vim or something, figuring out what delimiters there are, adding that argument to pandas.read_csv(...), trying to open it, failing, opening the CSV with vim again, etc.

@markstos

This comment has been minimized.

Copy link

@markstos markstos commented Aug 19, 2021

A precedent, the GTFS spec is a zip file of CSVS, some of which could be considered metadata:

https://gtfs.org/

This is an industry standard for transit feeds, championed by Google.

In practice, it's OK to work with, as there are plenty of tools that work with zip files and CSV.

@yakovsh

This comment has been minimized.

Copy link

@yakovsh yakovsh commented Aug 19, 2021

(I am the author of the CSV RFC / RFC 4180)

The W3C tried something similar as well:
https://www.w3.org/2013/csvw/wiki/Main_Page

@waffletower

This comment has been minimized.

Copy link

@waffletower waffletower commented Aug 19, 2021

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be. I think a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Such a tool would not be restricted to returning pedestrian JSON with camelCase identifiers either 👃

@djedr

This comment has been minimized.

Copy link

@djedr djedr commented Aug 19, 2021

Here is a minimalist alternative as a parser in JavaScript: https://gist.github.com/djedr/8349415de904eeade9bb1cb5e860e3ec

You can define custom separator characters for a simple format for tabular data which doesn't have the escaping issues of CSV.

@drtbiswas

This comment has been minimized.

Copy link

@drtbiswas drtbiswas commented Aug 20, 2021

This is common in the 3rd party data vendor industry as well where there would be a schema text file along with the data files in the folder overall that is zipped up transmitted.

@sterlinm

This comment has been minimized.

Copy link

@sterlinm sterlinm commented Aug 20, 2021

a more plausible idea would be the creation of a tool that could infer all or a subset of these metadata parameters from a given CSV file.

Python has CSV Sniffer that does something like this. Presumably Pandas is also doing this and could expose the inferred dialect without actually reading the rest of the file.

@seece

This comment has been minimized.

Copy link
Owner Author

@seece seece commented Aug 20, 2021

Have you seen the bagit file format Its used in computational biology platforms but is generic enough to apply to any computational workload. It works with the same idea of using zip and seperate files for metadata.

I haven't, it seems pretty much the same idea but with more complexity.

The multitudes that generate CSV files will not soon, nor perhaps ever, adopt such a standard -- no matter how effective and noble it may be.

One can dream, right?

This is common in the 3rd party data vendor industry as well where there would be a schema text file along with the data files in the folder overall that is zipped up transmitted.

Yeah I expected something like that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment