Extension | Number of files in e-Depot | ID(s) Tika | Remarks |
---|---|---|---|
gif | 34499095 | - image/gif | GIF image |
xml | 12913388 | - application/xml | XML (mostly metadata) |
jpg | 8197415 | N/A* | JPEG image |
sml | 7744829 | - image/gif | GIF image with unusual extension |
pdf | 7577414 | - application/pdf | |
raw | 2045662 | - text/plain | Text file |
tif | 715509 | - image/tiff | TIFF image |
oa3 | 296101 | - text/plain | Looks like SGML (oases, Kluwer). See also: Publisher Data Formats. Metadata. |
doc | 134732 | - application/msword | MS Word document |
htm | 103009 | - text/html | HTML |
html | 52016 | - application/x-bzip2 - text/html | HTML. One .html file in dataset is actually a BZIP2 file. |
wav | 46796 | - audio/x-wav | Waveform Audio File Format audio |
mp3 | 41931 | - audio/mpeg | MP3 audio |
docx | 40342 | - application/vnd.openxmlformats-officedocument.wordprocessingml.document | Office Open XML document |
txt | 40239 | - text/plain | Plain text |
bmp | 39927 | - image/x-ms-bmp | Windows Bitmap |
swf | 32816 | - application/x-shockwave-flash | Shockwave Flash |
xls | 31181 | - application/vnd.ms-excel - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | MS Excel and Office Open XML Spreadsheet |
lmx | 25429 | - application/xml | XML (metadata) |
zip | 20008 | - application/zip | ZIP |
class | 19091 | - application/java-vm | Java Class File |
epq | 15238 | - text/plain | Looks like RTF fragments wrapped in something else |
js | 13441 | - application/javascript | JavaScript |
mp4 | 13353 | - video/mp4 | MP4 video |
xlsx | 13088 | - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Office Open XML Spreadsheet |
rtf | 12289 | - application/rtf | Rich Text Format |
suppl | 10926 | - text/plain - text/html | Looks like malformed html, sometimes with XML declaration. Files refer to supplemental files through hyperlinks. |
mov | 9904 | - video/quicktime - video/mpeg | MPEG video |
png | 9037 | - image/png | PNG image |
abg | 8910 | - application/octet-stream | Strange format; only 1 sample in dataset. Possible relation with Abakt backup tool, which used the same extension: http://www.gorearicayparinacota.cl/softwarelibre/Abakt/Abakt.html |
avi | 8662 | - video/x-msvideo - audio/x-wav - video/mp4 | Audio Video Interleave (AVI) |
ppt | 8324 | - application/vnd.ms-powerpoint | MS PowerPoint |
dat | 7870 | - application/octet-stream - text/plain | Diverse mix: checked files include measurement data in plain text, binary data and some system/configuration files. |
mpg | 7599 | - video/mpeg - video/x-msvideo | MPEG video |
aif | 6655 | - audio/x-aiff | AIFF audio |
cab | 6611 | - application/vnd.ms-cab-compressed - text/plain | Microsoft Cabinet file: http://en.wikipedia.org/wiki/Cabinet_%28file_format%29; 1 file in test dataset contains plain text |
page | 5672 | - application/xml | XML |
tiff | 5623 | - image/tiff | TIFF image |
exe | 3724 | - application/x-dosexec - application/x-msdownload; format=pe32 - text/plain | Windows executables. One file in test dataset actually contains plain text. |
pptx | 3400 | - application/vnd.openxmlformats-officedocument.presentationml.presentation | Office Open XML Powerpoint presentation |
dll | 3332 | - application/x-msdownload; format=pe32 - application/x-msdownload | Windows Dynamic-link library |
jpeg | 3107 | - image/jpeg | JPEG image |
ini | 2896 | - text/x-ini | Mostly Windows configuration files |
dib | 2381 | - image/x-ms-bmp | Windows Bitmap image |
x32 | 2171 | - application/x-msdownload - application/x-msdownload; format=pe32 | Windows executable or DLL. The .x32 extension is also used for Macromedia plugins. |
db | 2021 | - application/octet-stream - application/x-tika-msoffice | Unix File identifies most of these files as application/CDFV2-corrupt. Might be (possibly corrupted) thumbs.db files (which are also based on the OLE2 container format, corresponding to application/x-tika-msoffice). See also: http://apps.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1481 |
fig | 1822 | - application/x-xfig - image/jpeg | One JPEG image aside, these files are all identified as application/x-xfig (Fig format), but this is incorrect! Actually they are MATLAB Figure files. |
inf | 1759 | - text/plain | Samples are all Autorun or Setup files, see: http://fileformats.archiveteam.org/wiki/Ext:inf |
drv | 1656 | - application/x-msdownload - application/octet-stream - application/x-msdownload; format=pe32 | Most likely Windows system driver files: http://fileformats.archiveteam.org/wiki/Dynamic-link_library_%28Windows%29 |
phd | 1593 | - text/html | Based on the tags in the sample file, this looks more like SGML than HTML. |
Last active
August 29, 2015 14:19
50 most prevalent formats in KB e-Depot by file extension, based on March 2014 count.
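
Several rows in the table show files whose content does not match their extension (for example the .html file that is really BZIP2, or the .sml files that are GIF images). The sketch below illustrates the general idea of comparing an extension against leading magic bytes. It is not how Tika works internally, and the signature table, the extension mapping and the function names are all hypothetical and deliberately incomplete.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical, deliberately small signature table: leading bytes -> MIME type.
# Note: ZIP-based formats such as docx/xlsx would also match the ZIP signature.
MAGIC_SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"BZh", "application/x-bzip2"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]

# Hypothetical mapping: the type one would expect from the extension alone.
EXPECTED_BY_EXTENSION = {
    "gif": "image/gif",
    "html": "text/html",
    "pdf": "application/pdf",
    "zip": "application/zip",
}


def sniff_mime(header: bytes) -> Optional[str]:
    """Return a MIME type if the header starts with a known signature."""
    for magic, mime in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return mime
    return None


def check_extension(filename: str, header: bytes) -> Tuple[Optional[str], Optional[bool]]:
    """Return (detected type, agrees with extension?).

    The second item is None when either the content or the extension
    is not covered by the small tables above.
    """
    ext = Path(filename).suffix.lstrip(".").lower()
    expected = EXPECTED_BY_EXTENSION.get(ext)
    detected = sniff_mime(header)
    if detected is None or expected is None:
        return detected, None
    return detected, detected == expected


if __name__ == "__main__":
    # An .html file whose first bytes carry the BZIP2 signature.
    print(check_extension("page.html", b"BZh91AY&SY"))
    # A GIF image stored under an extension the mapping does not know.
    print(check_extension("image.sml", b"GIF89a"))
```

In practice the header would come from reading the first few bytes of each file; a real identification tool covers far more formats and also handles text-based formats that have no fixed signature.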