bitsgalore/commentBlogResearchData.md

## commentBlogResearchData.md

      
Comment on the blog post "Research data - what does it really look like?"
I agree with Gareth's suggestion to give FILE a try. One possible catch is that FILE tries to identify a format solely from its byte signature. In my experience, lots of research data formats are essentially text-based, and signature-based detection doesn't work well for those. The identification results for such formats will often be very general (e.g. "text/plain"), which isn't all that helpful.
Because of this I would suggest giving Apache Tika a try as well. Unlike FILE, Tika uses a combination of format signatures, file extensions and container-aware detection, and in my experience it often gives a better or more specific identification than FILE.
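To make the difference concrete, here is a minimal Python sketch of the general idea. It is not how FILE or Tika are actually implemented: the two lookup tables, the function names and the sample file name are made up for illustration, and real tools use far larger signature sets. It only shows why a text-based format falls back to a generic result under signature-only detection, and how an extension hint can refine it.

```python
from pathlib import Path

# Illustrative signature table (a tiny, hypothetical subset).
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",
}

# Illustrative extension hints, used only to refine generic results.
EXTENSION_HINTS = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".xml": "application/xml",
}


def detect_by_signature(data: bytes) -> str:
    """Identify a format from its leading bytes only."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    # No known signature: the best we can say is "text" or "unknown binary".
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


def detect_combined(name: str, data: bytes) -> str:
    """Refine a generic signature result using the file extension."""
    mime = detect_by_signature(data)
    if mime in ("text/plain", "application/octet-stream"):
        return EXTENSION_HINTS.get(Path(name).suffix.lower(), mime)
    return mime


sample = b"site,depth_m,temp_c\nA1,10,4.2\n"
print(detect_by_signature(sample))                 # text/plain
print(detect_combined("cruise_2014.csv", sample))  # text/csv
```

Note that a specific signature match still wins over the extension here, so a PDF that has been renamed to .csv is still reported as a PDF.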
Another good resource is Andy Jackson's Format registry aggregator:
http://www.digipres.org/formats/
Just enter some of your problematic file extensions, and it will tell you which tools are able to detect them (though I think the underlying data haven't been updated in quite a while).
Finally, some time ago I did a little investigation of the most prevalent file formats in the KB's e-Depot (based on file extensions). As this also includes many research formats you might find it useful. Blog link here:
Top 50 file formats in the KB e-Depot
As part of that work we also produced a list of all file extensions in our e-Depot, which you can find here:
File extensions in KB e-Depot
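
If you want a similar extension-based overview of your own collection, something along these lines could be a starting point. This is only a rough sketch and not the method used for the e-Depot analysis; the function name is made up, and it simply counts lower-cased file extensions in a directory tree.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root: str) -> Counter:
    """Count files per lower-cased extension under root, recursively."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_extensions("/path/to/your/data").most_common(50)` then gives the 50 most frequent extensions with their counts (the path is a placeholder).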