@bitsgalore
Last active January 29, 2020 13:42
JPEG 2000 differences shared-mime-info vs Unix File

Shared-mime-info

Downloaded latest sources from:

http://freedesktop.org/~hadess/shared-mime-info-1.6.tar.xz

Then look at the JPEG 2000 magic patterns in the file freedesktop.org.xml (I edited out the comments):

  <mime-type type="image/jp2">
    <alias type="image/jpeg2000"/>
    <alias type="image/jpx"/>
    <alias type="image/jpeg2000-image"/>
    <alias type="image/x-jpeg2000-image"/>
    <magic priority="50">
      <match value="\xFF\x4F\xFF\x51\x00" type="string" offset="0"/>
      <match value="0x0c6a5020" type="big32" offset="3"/>
      <match value="jp2" type="string" offset="20"/>
    </magic>
    <glob pattern="*.jp2"/>
    <glob pattern="*.jpx"/>
    <glob pattern="*.jpf"/>
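
Read as a matcher, the magic block above behaves roughly like the following Python sketch. It assumes that sibling `<match>` elements inside one `<magic>` block are alternatives, so any single hit is enough. The function names and the synthetic 24-byte headers are made up for illustration; only the byte values come from the XML.

```python
import struct

SIGNATURE_BOX = b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A"


def matches_shared_mime_info_jp2(data: bytes) -> bool:
    """Rough emulation of the three alternative matches for image/jp2."""
    # Match 1: raw codestream start (SOC and SIZ markers, then a zero byte)
    if data[0:5] == b"\xFF\x4F\xFF\x51\x00":
        return True
    # Match 2: big-endian 32-bit value 0x0c6a5020 at offset 3,
    # i.e. the middle of the 12-byte signature box
    if len(data) >= 7 and struct.unpack(">I", data[3:7])[0] == 0x0C6A5020:
        return True
    # Match 3: literal "jp2" at offset 20, the start of the brand field
    if data[20:23] == b"jp2":
        return True
    return False


def synthetic_header(brand: bytes) -> bytes:
    """Build a fake header: signature box, File Type box header, brand."""
    # Bytes 12-19 are the File Type box length and type; the length
    # value here is only a placeholder for the illustration.
    return SIGNATURE_BOX + b"\x00\x00\x00\x14" + b"ftyp" + brand


for brand in (b"jp2 ", b"jpx ", b"jpm ", b"mjp2"):
    print(brand, matches_shared_mime_info_jp2(synthetic_header(brand)))
```

Every brand prints `True`, because the match at offset 3 fires on the signature box before the brand field is ever consulted.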

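The pattern doesn't distinguish between any of the sub-formats: anything that starts with the JPEG 2000 signature box, and even a raw codestream, is reported as image/jp2. Now compare with Unix file below: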

Unix File

Downloaded latest source from:

ftp://ftp.astron.com/pub/file/file-5.25.tar.gz

Then look at the JPEG 2000 magic patterns in /magic/Magdir/jpeg:

# From: David Santinoli <david@santinoli.com>
0	string		\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A	JPEG 2000
# From: Johan van der Knijff <johan.vanderknijff@kb.nl>
# Added sub-entries for JP2, JPX, JPM and MJ2 formats; added mimetypes
# https://github.com/bitsgalore/jp2kMagic
#
# Now read value of 'Brand' field, which yields a few possibilities:
>20	string		\x6a\x70\x32\x20	Part 1 (JP2)
!:mime	image/jp2
>20	string		\x6a\x70\x78\x20	Part 2 (JPX)
!:mime	image/jpx
>20	string		\x6a\x70\x6d\x20	Part 6 (JPM)
!:mime	image/jpm
>20	string		\x6d\x6a\x70\x32	Part 3 (MJ2)
!:mime	video/mj2

# Type: JPEG 2000 codesream
# From: Mathieu Malaterre <mathieu.malaterre@gmail.com>
0	belong		0xff4fff51						JPEG 2000 codestream
45	beshort		0xff52

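This covers all sub-formats: the 'Brand' field at offset 20 separates JP2, JPX, JPM and MJ2, and raw codestreams get their own entry.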
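
The same two-step logic (check the 12-byte signature, then read the 4-byte brand at offset 20) can be sketched in Python. This is only an illustration of the rules quoted above, not how file itself is implemented; the function and table names are hypothetical, and the extra `0xff52` test at offset 45 in the codestream entry is left out.

```python
JP2_SIGNATURE = b"\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A"

# Brand value at offset 20 -> (description suffix, MIME type),
# copied from the magic entries quoted above
BRANDS = {
    b"jp2 ": ("Part 1 (JP2)", "image/jp2"),
    b"jpx ": ("Part 2 (JPX)", "image/jpx"),
    b"jpm ": ("Part 6 (JPM)", "image/jpm"),
    b"mjp2": ("Part 3 (MJ2)", "video/mj2"),
}


def identify_jpeg2000(data: bytes):
    """Return (description, mime_type), or None if no rule matches."""
    if data.startswith(JP2_SIGNATURE):
        brand = data[20:24]
        if brand in BRANDS:
            part, mime = BRANDS[brand]
            return ("JPEG 2000 " + part, mime)
        # Signature present but brand not in the table: generic label only
        return ("JPEG 2000", None)
    if data[0:4] == b"\xFF\x4F\xFF\x51":
        # The quoted codestream entry assigns no MIME type
        return ("JPEG 2000 codestream", None)
    return None
```

Calling `identify_jpeg2000()` on the first 24 bytes of a file is enough for these rules, since nothing past the brand field is inspected.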

Apache Tika

From: https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

This also has signatures for every sub-format:

<mime-type type="image/jp2">
<sub-class-of type="image/x-jp2-container" />
<acronym>JP2</acronym>
<_comment>JPEG 2000 Part 1 (JP2)</_comment>
<magic priority="50">
  <match value="0x0000000C6A5020200D0A870A" type="string" offset="0">
    <match value="0x6a703220" type="string" offset="20"/>
  </match>
</magic>
<glob pattern="*.jp2"/>
</mime-type>

<mime-type type="image/jpm">
<alias type="video/jpm"/>
<sub-class-of type="image/x-jp2-container" />
<acronym>JP2</acronym>
<_comment>JPEG 2000 Part 6 (JPM)</_comment>
<magic priority="50">
  <match value="0x0000000C6A5020200D0A870A" type="string" offset="0">
    <match value="0x6a706d20" type="string" offset="20"/>
  </match>
</magic>
<glob pattern="*.jpm"/>
<glob pattern="*.jpgm"/>
</mime-type>

<mime-type type="image/jpx">
<sub-class-of type="image/x-jp2-container" />
<acronym>JP2</acronym>
<_comment>JPEG 2000 Part 2 (JPX)</_comment>
<magic priority="50">
  <match value="0x0000000C6A5020200D0A870A" type="string" offset="0">
    <match value="0x6a707820" type="string" offset="20"/>
  </match>
</magic>
<glob pattern="*.jpf"/>
</mime-type>

<mime-type type="image/x-jp2-codestream">
<_comment>JPEG 2000 Codestream</_comment>
<magic priority="25">
  <match value="0xff4fff51" type="string" offset="0"/>
</magic>
<glob pattern="*.j2c"/>
</mime-type>

<mime-type type="video/mj2">
<sub-class-of type="image/x-jp2-container" />
<acronym>MJ2</acronym>
<_comment>JPEG 2000 Part 3 (Motion JPEG, MJ2)</_comment>
<magic priority="50">
  <match value="0x0000000C6A5020200D0A870A" type="string" offset="0">
    <match value="0x6d6a7032" type="string" offset="20"/>
  </match>
</magic>
<glob pattern="*.mj2"/>
<glob pattern="*.mjp2"/>
</mime-type>
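
To show how these nested definitions can be evaluated, here is a small Python sketch that parses a cut-down copy of two of the entries and applies them to a byte string. It rests on a few assumptions that should be checked against Tika itself: a `0x` prefix on a `string` value is treated as raw hex bytes, a nested `<match>` only counts if its parent also matches, and offsets are single integers. The `<mime-info>` wrapper, the function names and the sample headers are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Two of the entries above, reduced to their magic sections
TIKA_FRAGMENT = """
<mime-info>
  <mime-type type="image/jp2">
    <magic priority="50">
      <match value="0x0000000C6A5020200D0A870A" type="string" offset="0">
        <match value="0x6a703220" type="string" offset="20"/>
      </match>
    </magic>
  </mime-type>
  <mime-type type="video/mj2">
    <magic priority="50">
      <match value="0x0000000C6A5020200D0A870A" type="string" offset="0">
        <match value="0x6d6a7032" type="string" offset="20"/>
      </match>
    </magic>
  </mime-type>
</mime-info>
"""


def match_applies(match, data: bytes) -> bool:
    """True if this match hits and, when it has children, one child hits too."""
    value = match.get("value")
    # Assumption: "0x..." means hex-encoded bytes, anything else is literal text
    if value.startswith("0x"):
        pattern = bytes.fromhex(value[2:])
    else:
        pattern = value.encode("latin-1")
    offset = int(match.get("offset", "0"))
    if data[offset:offset + len(pattern)] != pattern:
        return False
    children = match.findall("match")
    return not children or any(match_applies(c, data) for c in children)


def detect(data: bytes, xml_text: str = TIKA_FRAGMENT):
    """Return the first MIME type whose magic matches, else None."""
    root = ET.fromstring(xml_text)
    for mime_type in root.findall("mime-type"):
        for magic in mime_type.findall("magic"):
            if any(match_applies(m, data) for m in magic.findall("match")):
                return mime_type.get("type")
    return None


signature = bytes.fromhex("0000000C6A5020200D0A870A")
header = signature + b"\x00\x00\x00\x14" + b"ftyp" + b"mjp2"
print(detect(header))
```

Because the brand test is nested inside the signature test, a header with a brand that the fragment does not list matches nothing at all, instead of falling back to a generic JPEG 2000 type.
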
@richardlehane

You could do this as a MIME info pattern ... but you'd need to create four sigs. They use the MIME type as the unique identifier for a sig.

Probably the cleanest way would be to use their "mask" type for the signatures, with match elements that look something like this:

@bitsgalore

OK, only saw your comment after I added the corresponding Tika definitions. But I think this is exactly what you're suggesting, or not?

@richardlehane

Yep same but their way is nicer. With the sf implementation you can choose Tika or freedesktop or both at same time (with pronom too if you like). I haven't delved that deeply into the differences but they certainly have different coverage.

@anjackson

I was quite surprised how low the common coverage is. I compared 5 format ID sources and out of thousands of formats, only 77 appeared to be in all five (based on file extension), and every system contained unique file extensions unknown to the others: http://www.digipres.org/formats/overlaps/