Andy Jackson anjackson

@anjackson
anjackson / Nanite on DROID 6.1.3 branch.
Last active Mar 19, 2017
Nanite basic comparison with Skeleton Suite
View Nanite on DROID 6.1.3 branch.
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-894-signature-id-1241.gjf
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-899-signature-id-1249.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-900-signature-id-1251.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-411-signature-id-198.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-451-signature-id-243.skb
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-453-signature-id-242.ttf
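The failing skeleton files appear to encode the PRONOM identifier and signature ID in their file names. As a rough sketch, assuming that naming scheme holds for every file (it is inferred only from the paths listed here), the failures could be summarised by PUID like this. The function name `parse_fail_line` is hypothetical.

```python
import re

# Assumed naming scheme, inferred only from the paths in the log above:
#   <fmt|x-fmt>-<number>-signature-id-<id>.<ext>
SKELETON_NAME = re.compile(r"(x-fmt|fmt)-(\d+)-signature-id-(\d+)\.\w+$")


def parse_fail_line(line):
    """Return (puid, signature_id) for a FAIL log line, or None if it does not match."""
    match = SKELETON_NAME.search(line.strip())
    if match is None:
        return None
    namespace, number, signature_id = match.groups()
    return f"{namespace}/{number}", int(signature_id)


line = (
    "FAIL: File could not be identified! src/test/resources/skeleton-suite-test/"
    "skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-411-signature-id-198.exe"
)
print(parse_fail_line(line))
```

Grouping the returned PUIDs would show which formats the branch misses, independent of the long test-resource paths.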
View fmts-pdfa-droid-results.tsv
application/pdf application/pdf application/pdf; version=1a 2009 1
null application/pdf application/pdf; version=1a 2008 2
null application/pdf application/pdf; version=1a 2009 1
application/pdf application/pdf application/pdf; version=1a 2009 7
null application/pdf application/pdf; version=1a 2009 4
application/pdf application/pdf application/pdf; version=1a 2008 1
application/pdf application/pdf application/pdf; version=1a 2009 2
null application/pdf application/pdf; version=1a 2008 7
application/pdf application/pdf application/pdf; version=1a 2009 2
null application/pdf application/pdf; version=1a 2008 1
@anjackson
anjackson / droid-binsig-example.txt
Created Feb 24, 2016
Running Droid from the CLI
View droid-binsig-example.txt
$ droid q -Ns ~/.droid6/signature_files/DROID_SignatureFile_V84.xml -Nr ~/Downloads/DDM_SR_20160443_197-312_935.pdf
19:22:23,212 INFO [main] ReflectionServiceFactoryBean:399 - Creating Service {http://pronom.nationalarchives.gov.uk}PronomServiceService from class uk.gov.nationalarchives.pronom.PronomService
DROID 6.1.5 No Profile mode: Runtime Information
Selected folder or file: /Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf
Binary signature file: /Users/andy/.droid6/signature_files/DROID_SignatureFile_V84.xml
Container signature file: None
Open archives: False
/Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf,fmt/18
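The final line of the output above pairs the file path with its PUID. A minimal sketch for pulling those apart, assuming every result line has the unquoted `path,PUID` shape seen here (how DROID reports multiple or missing identifications is not shown in this sample, so that is not handled). The helper name `parse_droid_result` is hypothetical.

```python
def parse_droid_result(line):
    """Split a 'path,PUID' result line on its last comma, so commas in the path survive."""
    path, _, puid = line.rstrip("\n").rpartition(",")
    return path, puid


result = parse_droid_result("/Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf,fmt/18")
print(result)
```

Splitting on the last comma rather than the first is a deliberate choice here, since file names may themselves contain commas while the PUIDs in this output do not.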
@anjackson
anjackson / sha1b32.sh
Created Nov 18, 2015
Calculate the Base32-encoded SHA-1 digest of a file at the command line.
View sha1b32.sh
# Usage: sha1b32.sh FILE
openssl dgst -sha1 -binary "$1" | python3 -c "import base64,sys; print(base64.b32encode(sys.stdin.buffer.read()).decode('ascii'))"
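The same digest can be computed without `openssl`. The following is a sketch using only the Python standard library; the function name `sha1_base32` and the chunk size are illustrative choices, not part of the original script.

```python
import base64
import hashlib


def sha1_base32(path, chunk_size=1024 * 1024):
    """Return the Base32-encoded SHA-1 digest of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b32encode(digest.digest()).decode("ascii")
```

Calling `sha1_base32("some-file.bin")` (a placeholder path) should give the same 32-character string as the shell pipeline, which is the form used in WARC digest headers such as `WARC-Block-Digest`.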
@anjackson
anjackson / breakdown.md
Last active Nov 9, 2015
Ideal WARC ID result?
View breakdown.md

The ideal identification result when analysing example.warc.gz, which contains an HTML response that was gzip-encoded:

  • application/warc
    • application/gzip
      (outer gzip chunk)
      • application/warc; version="1.0", type=response
        (The whole WARC Record)
        • application/http; msgtype=response
          (WARC Record content, i.e. HTTP headers and entity body)
          • application/gzip
            (i.e. the entity body is compressed)
            • text/html; version=5
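One way to make the nesting above concrete is to hold it as a tree and walk it. This is only an illustrative sketch: the tuple layout `(media_type, note, children)` and the `walk` helper are assumptions, and no identification tool is claimed to emit this structure.

```python
# Each node is (media_type, note, children); the values are copied from the list above.
BREAKDOWN = (
    "application/warc", None, [
        ("application/gzip", "outer gzip chunk", [
            ('application/warc; version="1.0", type=response', "The whole WARC Record", [
                ("application/http; msgtype=response",
                 "WARC Record content, i.e. HTTP headers and entity body", [
                    ("application/gzip", "i.e. the entity body is compressed", [
                        ("text/html; version=5", None, []),
                    ]),
                ]),
            ]),
        ]),
    ],
)


def walk(node, depth=0):
    """Yield (depth, media_type) pairs in document order."""
    media_type, _note, children = node
    yield depth, media_type
    for child in children:
        yield from walk(child, depth + 1)


for depth, media_type in walk(BREAKDOWN):
    print("  " * depth + media_type)
```

Using a list of children per node leaves room for a WARC file with many records, or a record with several embedded parts, even though this example has a single chain.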
@anjackson
anjackson / crawler-beans.cxml
Created Sep 25, 2015
Example H3 crawler beans from one of our domain crawler instances.
View crawler-beans.cxml
<?xml version="1.0" encoding="UTF-8"?>
<!--
HERITRIX 3 CRAWL JOB CONFIGURATION FILE
This is a relatively minimal configuration suitable for many crawls.
Commented-out beans and properties are provided as an example; values
shown in comments reflect the actual defaults which are in effect
if not otherwise specified. (To change from the default
behavior, uncomment AND alter the shown values.)
@anjackson
anjackson / results.csv
Created Sep 2, 2015
Files with WSD extension in the 1996-2013 collection
View results.csv
wayback_date,url,resourcename,content_length,content_ffb,content_type,crawl_year
20040828045513,http://www.tei-c.org.uk:80/WSDs/iso646ss.wsd,iso646ss.wsd,12028,3c21444f,text/plain,2004
20020327071047,http://www.hcu.ox.ac.uk:80/TEI/WSDs/teien.wsd,teien.wsd,1411,3c21444f,text/plain,2002
20040830232050,http://www.tei-c.org.uk:80/WSDs/iso8859a.wsd,iso8859a.wsd,29554,3c777269,text/plain,2004
20040830232046,http://www.tei-c.org.uk:80/WSDs/iso88599.wsd,iso88599.wsd,28787,3c777269,text/plain,2004
20030722032619,http://www.tei-c.org.uk:80/WSDs/teigk2.wsd,teigk2.wsd,49037,3c777269,text/plain,2003
20030722032319,http://www.tei-c.org.uk:80/WSDs/iso8859a.wsd,iso8859a.wsd,29554,3c777269,text/plain,2003
20040830232014,http://www.tei-c.org.uk:80/WSDs/iso88592.wsd,iso88592.wsd,29456,52657475,message/rfc822,2004
20040830232042,http://www.tei-c.org.uk:80/WSDs/iso88597.wsd,iso88597.wsd,27467,3c777269,text/plain,2004
20000110235101,http://src.doc.ic.ac.uk:80/Mirrors/ftp.ifi.uio.no/pub/SGML/TEI/ISO646IR.WSD,ISO646IR.WSD,13512,5265
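The `content_ffb` column looks like a hex dump of the first few bytes of each payload (an assumption based on the column name and values). A small sketch for turning those values back into readable text, with a hypothetical helper `decode_ffb`:

```python
def decode_ffb(hex_prefix):
    """Decode a hex byte prefix to text, replacing non-printable bytes with '.'."""
    raw = bytes.fromhex(hex_prefix)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Values taken from the rows above.
for ffb in ("3c21444f", "3c777269", "52657475"):
    print(ffb, decode_ffb(ffb))
```

For the three values in the loop this gives `<!DO`, `<wri` and `Retu`. The first is consistent with an SGML declaration, and the last with a file starting `Return-Path`, which would explain the `message/rfc822` content type on that row.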
@anjackson
anjackson / humans.warc
Created Aug 1, 2015
Example WARC from wget
View humans.warc
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2015-07-31T16:32:22Z
WARC-Record-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Filename: humans.warc.gz
WARC-Block-Digest: sha1:AARITJBDT4LFDLBOUU63IJAD2MK7WFL3
Content-Length: 241
software: Wget/1.16.3 (darwin14.1.0)
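A WARC record starts with a version line, then `Name: value` header lines, then a blank line before the record block. As a sketch of reading that header section (not a full WARC parser: it ignores continuation lines, repeated headers and gzip members), with a hypothetical function `parse_warc_headers` and a sample abbreviated from the record above:

```python
def parse_warc_headers(lines):
    """Parse a version line followed by 'Name: value' lines; stop at the first blank line."""
    iterator = iter(lines)
    version = next(iterator).strip()
    headers = {}
    for line in iterator:
        line = line.strip()
        if not line:
            break  # a blank line separates the headers from the record block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


sample = [
    "WARC/1.0",
    "WARC-Type: warcinfo",
    "Content-Type: application/warc-fields",
    "WARC-Date: 2015-07-31T16:32:22Z",
    "Content-Length: 241",
    "",
    "software: Wget/1.16.3 (darwin14.1.0)",
]
version, headers = parse_warc_headers(sample)
print(version, headers["WARC-Type"], int(headers["Content-Length"]))
```

Splitting on the first colon only keeps values such as the `WARC-Date` timestamp intact. The `software:` line belongs to the `application/warc-fields` block rather than the record headers, so it is not returned.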
@anjackson
anjackson / gist:06971ff43e50645e3f2f
Last active Aug 29, 2015
OpenWayback Stacktrace Analysis
View gist:06971ff43e50645e3f2f

Most of the threads look like this - waiting for something to do:

"http-8080-198" daemon prio=10 tid=0x00007f06cc18e800 nid=0x10ca in Object.wait() [0x00007f073b6f5000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000703eeafa8> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
        - locked <0x0000000703eeafa8> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
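To check how many threads are idle like the one above, the thread states in a dump can be tallied. This is a rough sketch that assumes each thread reports a `java.lang.Thread.State:` line as in the excerpt; the function name `count_thread_states` is hypothetical.

```python
from collections import Counter


def count_thread_states(dump_text):
    """Count occurrences of each java.lang.Thread.State value in a thread dump."""
    prefix = "java.lang.Thread.State:"
    states = Counter()
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            parts = line[len(prefix):].split()
            if parts:
                # Keep only the state name, dropping details such as "(on object monitor)".
                states[parts[0]] += 1
    return states


dump = '''"http-8080-198" daemon prio=10 tid=0x00007f06cc18e800 nid=0x10ca in Object.wait() [0x00007f073b6f5000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
'''
print(count_thread_states(dump))
```

Run over a full dump, a large `WAITING` count relative to the other states would support the reading that most worker threads are simply idle.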
@anjackson
anjackson / gist:cb4a10711da9a1f9617a
Created May 5, 2015
JHOVE 'validating' a bytestream
View gist:cb4a10711da9a1f9617a
$ jhove /Users/andy/Documents/workspace/format-corpus/3rd-party/systems-showcase-files/MVI_0943.mp4
May 5, 2015 1:30:33 PM edu.harvard.hul.ois.jhove.JhoveBase init
SEVERE: Testing SEVERE level
Jhove (Rel. 1.11, 2013-09-29)
Date: 2015-05-05 13:30:34 BST
RepresentationInformation: /Users/andy/Documents/workspace/format-corpus/3rd-party/systems-showcase-files/MVI_0943.mp4
ReportingModule: BYTESTREAM, Rel. 1.3 (2007-04-10)
LastModified: 2015-01-21 03:12:23 GMT
Size: 583677
Format: bytestream