Andy Jackson anjackson

@anjackson
anjackson / ext-woff2.csv
Created December 14, 2017 10:13
Resources with WOFF2 file extensions, sorted by the first four bytes (hex encoded)
crawl_date content_ffb url
2016-08-11T12:56:31Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Italic.woff2
2016-08-11T12:56:47Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Bold.woff2
2016-08-11T12:56:08Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Regular.woff2
2016-08-11T12:56:27Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-BoldItalic.woff2
2016-05-01T11:35:39Z 0a090a20 http://www.mind.org.uk/fonts/street_corner-webfont.woff2
2016-09-08T15:48:03Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:22:11Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-05-01T11:31:44Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:27:41Z 0a090a20 http://www.mind.org.uk/fonts/street_corner-webfont.woff2
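The `content_ffb` column holds the first four bytes of each resource, hex encoded. The sketch below shows one way such a value could be computed and read back, assuming the payload is already available as bytes; the function names are hypothetical and not taken from the crawl index tooling. For comparison, a genuine WOFF2 file begins with the ASCII signature `wOF2` (`774f4632`), whereas `00010000` is the sfnt version tag of a TrueType-flavoured font and `0a090a20` decodes to whitespace only.

```python
def first_four_bytes_hex(payload: bytes) -> str:
    """Return the first four bytes of a payload as lowercase hex.

    Hypothetical helper: mirrors the shape of the content_ffb column,
    not the indexer's actual implementation.
    """
    return payload[:4].hex()


def ffb_to_bytes(ffb: str) -> bytes:
    """Turn a content_ffb value back into raw bytes for inspection."""
    return bytes.fromhex(ffb)


if __name__ == "__main__":
    print(first_four_bytes_hex(b"wOF2\x00\x01\x00\x00"))  # 774f4632
    print(ffb_to_bytes("00010000"))  # b'\x00\x01\x00\x00'
    print(ffb_to_bytes("0a090a20"))  # b'\n\t\n '
```

Comparing the decoded bytes against the expected signature is a quick way to spot resources whose extension does not match their content.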
anjackson / puid-fmt-111.csv
Created December 14, 2017 09:53
Current results for fmt/111 from the frequent crawl index
crawl_date,content_type_ext,content_type_tika,content_ffb,url
2016-01-21T04:02:00Z,fla,application/x-tika-msoffice,d0cf11e0,http://www.w3.org/WAI/WCAG20/Techniques/working-examples/FLASH31/datagrid_with_caption_as3.fla
2016-10-05T05:44:59Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Licensing%20Committee/20020613/Agenda/Thumbs.db
2016-01-14T17:04:22Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.webarchive.org.uk/wayback/archive/20081011214118/http://www.kereve.com/kernewek/index06_files/oledata.mso
2016-06-23T10:41:17Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.scan.org.uk/aboutus/Reports/Post_Project_Report_Files/200106_files/oledata.mso
2016-10-05T05:33:48Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Planning/20040204/Agenda/Thumbs.db
2016-06-01T16:53:36Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.monmouthshire.gov.uk/Data/Adults%20Select%20Committee/20150616/Agenda/Thumbs.db
2016-01-10T17:43:39Z,mso,applicat
anjackson / puid-fmt-583.csv
Last active December 14, 2017 15:07
Current results for fmt/583 from the frequent crawl index
crawl_date content_type_ext content_type_tika content_ffb url
2016-04-18T18:15:10Z shtml application/xhtml+xml; charset=windows-1252 0d0a0d0a http://www.southportreporter.com/484/484-7.shtml
2016-08-11T22:48:34Z htm text/html; charset=windows-1252 3c68746d https://www.iwight.com/Meetings/committees/Policy%20Commission%20for%20Economy/4-4-07/minutes.htm
2016-01-06T05:23:06Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.t.www.rspb.org.uk/discoverandenjoynature/discoverandlearn/birdguide/name/w/wheatear/videos.aspx
2016-01-06T04:20:01Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.t.www.rspb.org.uk/discoverandenjoynature/seenature/events/details.aspx?id=tcm:9-404023
2016-09-08T19:26:46Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.rspb.org.uk/discoverandenjoynature/seenature/reserves/tags.aspx?tag=flowers
2016-01-06T05:10:16Z application/xhtml+xml; charset=UTF-8 3c21444f http://www.information.www.rspb.org.uk/vacancies/details/411612-membership-development-office
anjackson / Nanite on DROID 6.1.3 branch.
Last active March 19, 2017 22:40
Nanite basic comparison with Skeleton Suite
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-894-signature-id-1241.gjf
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-899-signature-id-1249.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-900-signature-id-1251.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-411-signature-id-198.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-451-signature-id-243.skb
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-453-signature-id-242.ttf
application/pdf application/pdf application/pdf; version=1a 2009 1
null application/pdf application/pdf; version=1a 2008 2
null application/pdf application/pdf; version=1a 2009 1
application/pdf application/pdf application/pdf; version=1a 2009 7
null application/pdf application/pdf; version=1a 2009 4
application/pdf application/pdf application/pdf; version=1a 2008 1
application/pdf application/pdf application/pdf; version=1a 2009 2
null application/pdf application/pdf; version=1a 2008 7
application/pdf application/pdf application/pdf; version=1a 2009 2
null application/pdf application/pdf; version=1a 2008 1
anjackson / droid-binsig-example.txt
Created February 24, 2016 19:24
Running Droid from the CLI
$ droid q -Ns ~/.droid6/signature_files/DROID_SignatureFile_V84.xml -Nr ~/Downloads/DDM_SR_20160443_197-312_935.pdf
19:22:23,212 INFO [main] ReflectionServiceFactoryBean:399 - Creating Service {http://pronom.nationalarchives.gov.uk}PronomServiceService from class uk.gov.nationalarchives.pronom.PronomService
DROID 6.1.5 No Profile mode: Runtime Information
Selected folder or file: /Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf
Binary signature file: /Users/andy/.droid6/signature_files/DROID_SignatureFile_V84.xml
Container signature file: None
Open archives: False
/Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf,fmt/18
anjackson / sha1b32.sh
Created November 18, 2015 22:03
Calculate the Base32-encoded SHA-1 digest of a file at the command line.
openssl dgst -sha1 -binary "$1" | python3 -c "import base64,sys; print(base64.b32encode(sys.stdin.buffer.read()).decode('ascii'))"
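The same Base32-encoded SHA-1 digest can also be computed without `openssl`. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file variant reads in chunks on the assumption that inputs may be large.

```python
import base64
import hashlib


def sha1_base32_bytes(data: bytes) -> str:
    """Base32-encode the raw (binary) SHA-1 digest of a byte string."""
    return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def sha1_base32_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Same digest for a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b32encode(digest.digest()).decode("ascii")
```

A SHA-1 digest is 20 bytes, so the Base32 form is always 32 characters with no `=` padding. For an empty input, `sha1_base32_bytes(b"")` gives `3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ`.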
anjackson / breakdown.md
Last active November 9, 2015 00:27
Ideal WARC ID result?

When analysing example.warc.gz containing a HTML response that was GZip encoded.

  • application/warc
    • application/gzip
      (outer gzip chunk)
      • application/warc; version="1.0", type=response
        (The whole WARC Record)
        • application/http; msgtype=response
          (WARC Record content, i.e. HTTP headers and entity body)
          • application/gzip
            (i.e. the entity body is compressed)
            • text/html; version=5
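The nested breakdown above can be modelled as a small tree, where each layer records its identified type and a list of the layers found inside it. This is only an illustrative sketch: `FormatNode` and `render` are hypothetical names, not part of any existing identification tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormatNode:
    """One identified layer, plus whatever was identified inside it."""
    mime_type: str
    note: str = ""
    children: List["FormatNode"] = field(default_factory=list)


def render(node: FormatNode, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, one bullet per layer."""
    indent = "  " * depth
    lines = [f"{indent}• {node.mime_type}"]
    if node.note:
        lines.append(f"{indent}  ({node.note})")
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


ideal = FormatNode("application/warc", children=[
    FormatNode("application/gzip", "outer gzip chunk", [
        FormatNode('application/warc; version="1.0", type=response',
                   "The whole WARC Record", [
            FormatNode("application/http; msgtype=response",
                       "WARC Record content, i.e. HTTP headers and entity body", [
                FormatNode("application/gzip",
                           "i.e. the entity body is compressed", [
                    FormatNode("text/html; version=5"),
                ]),
            ]),
        ]),
    ]),
])

if __name__ == "__main__":
    print("\n".join(render(ideal)))
```

Keeping each layer as its own node means a consumer can ask for the innermost type (`text/html; version=5`) or for the full chain of containers, depending on what the analysis needs.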
anjackson / crawler-beans.cxml
Created September 25, 2015 14:55
Example H3 crawler beans from one of our domain crawler instances.
<?xml version="1.0" encoding="UTF-8"?>
<!--
HERITRIX 3 CRAWL JOB CONFIGURATION FILE
This is a relatively minimal configuration suitable for many crawls.
Commented-out beans and properties are provided as an example; values
shown in comments reflect the actual defaults which are in effect
if not otherwise specified. (To change from the default
behavior, uncomment AND alter the shown values.)
anjackson / results.csv
Created September 2, 2015 11:23
Files with WSD extension in the 1996-2013 collection
wayback_date,url,resourcename,content_length,content_ffb,content_type,crawl_year
20040828045513,http://www.tei-c.org.uk:80/WSDs/iso646ss.wsd,iso646ss.wsd,12028,3c21444f,text/plain,2004
20020327071047,http://www.hcu.ox.ac.uk:80/TEI/WSDs/teien.wsd,teien.wsd,1411,3c21444f,text/plain,2002
20040830232050,http://www.tei-c.org.uk:80/WSDs/iso8859a.wsd,iso8859a.wsd,29554,3c777269,text/plain,2004
20040830232046,http://www.tei-c.org.uk:80/WSDs/iso88599.wsd,iso88599.wsd,28787,3c777269,text/plain,2004
20030722032619,http://www.tei-c.org.uk:80/WSDs/teigk2.wsd,teigk2.wsd,49037,3c777269,text/plain,2003
20030722032319,http://www.tei-c.org.uk:80/WSDs/iso8859a.wsd,iso8859a.wsd,29554,3c777269,text/plain,2003
20040830232014,http://www.tei-c.org.uk:80/WSDs/iso88592.wsd,iso88592.wsd,29456,52657475,message/rfc822,2004
20040830232042,http://www.tei-c.org.uk:80/WSDs/iso88597.wsd,iso88597.wsd,27467,3c777269,text/plain,2004
20000110235101,http://src.doc.ic.ac.uk:80/Mirrors/ftp.ifi.uio.no/pub/SGML/TEI/ISO646IR.WSD,ISO646IR.WSD,13512,5265