@bitsgalore
Last active August 29, 2015 14:19
50 most prevalent formats in KB e-Depot by file extension, based on March 2014 count.
Extension Number of files in e-Depot ID(s) Tika Remarks
gif 34499095 - image/gif GIF image
xml 12913388 - application/xml XML (mostly metadata)
jpg 8197415 N/A* JPEG image
sml 7744829 - image/gif GIF image with unusual extension
pdf 7577414 - application/pdf PDF
raw 2045662 - text/plain Text file
tif 715509 - image/tiff TIFF image
oa3 296101 - text/plain Looks like SGML (oases, Kluwer). See also: Publisher Data Formats. Metadata.
doc 134732 - application/msword MS Word document
htm 103009 - text/html HTML
html 52016 - application/x-bzip2 - text/html HTML. One .html file in dataset is actually a BZIP2 file.
wav 46796 - audio/x-wav Waveform Audio File Format audio
mp3 41931 - audio/mpeg MP3 audio
docx 40342 - application/vnd.openxmlformats-officedocument.wordprocessingml.document Office Open XML document
txt 40239 - text/plain Plain text
bmp 39927 - image/x-ms-bmp Windows Bitmap
swf 32816 - application/x-shockwave-flash Shockwave Flash
xls 31181 - application/vnd.ms-excel - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet MS Excel and Office Open XML Spreadsheet
lmx 25429 - application/xml XML (metadata)
zip 20008 - application/zip ZIP
class 19091 - application/java-vm Java Class File
epq 15238 - text/plain Looks like RTF fragments wrapped in something else
js 13441 - application/javascript JavaScript
mp4 13353 - video/mp4 MP4 video
xlsx 13088 - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet Office Open XML Spreadsheet
rtf 12289 - application/rtf Rich Text Format
suppl 10926 - text/plain - text/html Looks like malformed HTML, sometimes with an XML declaration. Files refer to supplemental files through hyperlinks.
mov 9904 - video/quicktime - video/mpeg MPEG video
png 9037 - image/png PNG image
abg 8910 - application/octet-stream Strange format; only 1 sample in dataset. Possible relation with Abakt backup tool, which used the same extension: http://www.gorearicayparinacota.cl/softwarelibre/Abakt/Abakt.html
avi 8662 - video/x-msvideo - audio/x-wav - video/mp4 Audio Video Interleave (AVI)
ppt 8324 - application/vnd.ms-powerpoint MS PowerPoint
dat 7870 - application/octet-stream - text/plain Diverse mix: checked files include measurement data in plain text, binary data and some system/configuration files.
mpg 7599 - video/mpeg - video/x-msvideo MPEG video
aif 6655 - audio/x-aiff AIFF audio
cab 6611 - application/vnd.ms-cab-compressed - text/plain Microsoft Cabinet file: http://en.wikipedia.org/wiki/Cabinet_%28file_format%29. One file in the test dataset contains plain text.
page 5672 - application/xml XML
tiff 5623 - image/tiff TIFF image
exe 3724 - application/x-dosexec - application/x-msdownload; format=pe32 - text/plain Windows executables. One file in test dataset actually contains plain text.
pptx 3400 - application/vnd.openxmlformats-officedocument.presentationml.presentation Office Open XML PowerPoint presentation
dll 3332 - application/x-msdownload; format=pe32 - application/x-msdownload Windows Dynamic-link library
jpeg 3107 - image/jpeg JPEG image
ini 2896 - text/x-ini Mostly Windows configuration files
dib 2381 - image/x-ms-bmp Windows Bitmap image
x32 2171 - application/x-msdownload - application/x-msdownload; format=pe32 Windows executable or DLL. The .x32 extension is also used for Macromedia plugins.
db 2021 - application/octet-stream - application/x-tika-msoffice The Unix file command identifies most of these files as application/CDFV2-corrupt. They might be (possibly corrupted) thumbs.db files, which are also based on the OLE2 container format (corresponding to application/x-tika-msoffice). See also: http://apps.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1481
fig 1822 - application/x-xfig - image/jpeg One JPEG image aside, these files are all identified as application/x-xfig (Fig format), but this is incorrect! Actually they are MATLAB Figure files.
inf 1759 - text/plain Samples are all Autorun or Setup files, see: http://fileformats.archiveteam.org/wiki/Ext:inf
drv 1656 - application/x-msdownload - application/octet-stream - application/x-msdownload; format=pe32 Most likely Windows system driver files: http://fileformats.archiveteam.org/wiki/Dynamic-link_library_%28Windows%29
phd 1593 - text/html Based on the tags in the sample file, this looks more like SGML.
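
Several rows above show that an extension and the actual content can disagree (for example .sml files that are GIF images, or an .html file that is really BZIP2). As an illustration only, the following Python sketch tallies files by extension and checks the leading bytes of a file against a handful of well-known signatures. It is not the tooling behind this table, which relied on Apache Tika and the Unix file command; the names count_extensions, sniff_signature and SIGNATURES are hypothetical, and the signature list is deliberately tiny.

```python
from collections import Counter
from pathlib import Path

# Leading bytes ("magic numbers") for a few formats that appear in the table.
# This list is an illustrative assumption, not a complete registry.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"BZh", "application/x-bzip2"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def count_extensions(root):
    """Count files under `root` by lower-cased extension (no leading dot).

    Files without an extension are counted under the empty string.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower().lstrip(".")] += 1
    return counts


def sniff_signature(path):
    """Return a MIME type guessed from the first bytes, or None if unknown."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

Calling count_extensions(root).most_common(50) on a directory tree would give a ranking shaped like the table above. Note that ZIP-based containers such as docx and xlsx would be reported as application/zip by this simple check, which is one reason a full identification tool is needed for real work.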