Comment on the blog post "Research data - what does it really look like?"

I agree with Gareth's suggestion to give FILE a try. One possible catch is that FILE tries to identify the format solely based on the byte signature. In my experience, lots of research data formats are essentially text-based, and signature-based detection doesn't work well for those. The identification results for such formats will often be very general (e.g. "text/plain"), which isn't all that helpful.

Because of this I would suggest giving Apache Tika a try as well. Unlike FILE, Tika uses a combination of format signatures, file extensions and container-aware detection, and in my experience Tika often gives you a better or more specific identification than FILE.
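To illustrate why combining signatures with file extensions helps for text-based formats, here is a minimal Python sketch. It is not how FILE or Tika work internally; the `SIGNATURES` and `EXTENSIONS` tables and both function names are made up for this example, and real tools use far larger rule sets.

```python
import os

# Hypothetical, deliberately tiny lookup tables (illustration only).
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",
}
EXTENSIONS = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".xml": "application/xml",
}

def guess_by_signature(head: bytes) -> str:
    """Signature-only guess from the first bytes of a file."""
    for magic, mime in SIGNATURES.items():
        if head.startswith(magic):
            return mime
    # No known magic number: all we can say is "text" or "binary".
    try:
        head.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"

def guess_combined(filename: str, head: bytes) -> str:
    """Use the extension to refine a generic signature result."""
    by_signature = guess_by_signature(head)
    extension = os.path.splitext(filename)[1].lower()
    by_extension = EXTENSIONS.get(extension)
    if by_signature in ("text/plain", "application/octet-stream") and by_extension:
        return by_extension
    return by_signature
```

With these toy tables, a CSV file comes out as "text/plain" from the signature alone, while the combined guess gives "text/csv". A file with a recognised signature keeps that result even if its extension disagrees.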

Another good resource is Andy Jackson's Format registry aggregator:

http://www.digipres.org/formats/

Just enter some of your problematic file extensions, and it will tell you which tools are able to detect them (though I think the underlying data haven't been updated in quite a while).

Finally, some time ago I did a little investigation of the most prevalent file formats in the KB's e-Depot (based on file extensions). As this also includes many research formats you might find it useful. Blog link here:

Top 50 file formats in the KB e-Depot

As part of that work we also produced a list of all file extensions in our e-Depot, which you can find here:

File extensions in KB e-Depot
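If you want a similar extension-based overview of your own collection, something along these lines could be a starting point. This is only a rough sketch and not the script used for the e-Depot analysis; the function name `count_extensions` and the "(none)" label are my own choices for this example.

```python
from collections import Counter
from pathlib import Path

def count_extensions(root: str) -> Counter:
    """Count files per (lower-cased) extension below a directory."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_extensions("/path/to/data").most_common(50)` would then give the 50 most frequent extensions with their counts, where the path is a placeholder for your own data directory.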
