Andy Jackson anjackson

🧐
View GitHub Profile
@anjackson
anjackson / current_droid_results.csv
Created March 15, 2019 16:12
Sample work-in-progress format results
year content_type_droid count
0 2016 text/html; version=5 302988660
1 2016 application/xhtml+xml; version=1.0 75434142
2 2016 image/jpeg; version=1.01 36520960
3 2016 text/html 33316347
4 2016 application/xml; version=1.0 24319951
5 2016 application/octet-stream 13247847
6 2016 image/jpeg; version=1.02 8928862
7 2016 text/html; version=4.01 8237526
8 2016 image/gif; version=89a 6932182
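The rows above carry a format version as a parameter on the MIME type, so one base type (for example text/html) appears on several lines. A minimal sketch of how such rows could be rolled up by base type, assuming the columns are a row index, the year, the DROID content type, and the count, separated by whitespace. The function names are hypothetical and the sample is only the nine rows shown here, not the full result set.

```python
from collections import defaultdict

# The nine sample rows from the listing above: index, year, content type, count.
SAMPLE = """\
0 2016 text/html; version=5 302988660
1 2016 application/xhtml+xml; version=1.0 75434142
2 2016 image/jpeg; version=1.01 36520960
3 2016 text/html 33316347
4 2016 application/xml; version=1.0 24319951
5 2016 application/octet-stream 13247847
6 2016 image/jpeg; version=1.02 8928862
7 2016 text/html; version=4.01 8237526
8 2016 image/gif; version=89a 6932182
"""


def parse_row(line):
    """Split one row into (year, content_type, count).

    The content type may contain a space after the semicolon, so it is
    rebuilt from every field between the year and the trailing count.
    """
    parts = line.split()
    year = int(parts[1])
    count = int(parts[-1])
    content_type = " ".join(parts[2:-1])
    return year, content_type, count


def base_type(content_type):
    """Drop any parameters such as '; version=5'."""
    return content_type.split(";", 1)[0].strip()


def totals_by_base_type(text):
    """Sum the counts of all rows that share a base MIME type."""
    totals = defaultdict(int)
    for line in text.splitlines():
        if not line.strip():
            continue
        _, content_type, count = parse_row(line)
        totals[base_type(content_type)] += count
    return dict(totals)


totals = totals_by_base_type(SAMPLE)
for name, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(name, total)
```

Splitting on the first semicolon keeps the sketch independent of which parameters are present, so it would also cope with the charset parameters seen in the Tika columns of the other files.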
anjackson / SimpleWARCAnalyser.java
Created October 2, 2018 23:24
Checking CC WARC
/**
*
*/
package uk.bl.wa.analyser;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
anjackson / hits.csv
Created March 9, 2018 10:11
0x0baddeed in the JISC/IA Historical Archive
id,source_file,source_file_offset,url,url_type,resourcename,content_type_ext,host,domain,public_suffix,server,content_length,hash,crawl_date,crawl_year,wayback_date,parse_error,content_type,content_ffb,content_type_droid,content_type_tika,content_type_full,content_type_norm,_version_,content_type_served
Q2lQuCTsZJo5clOVhSEebQ==/19980124072756,DOTUK-HISTORICAL-1996-2010-GROUP-AB-XAAYBI-20110428000000-00000.arc.gz,11984270,http://www-centrim.bus.bton.ac.uk:80/projects/esrcitm/proc.inn/6/6.ppt,normal,6.ppt,ppt,www-centrim.bus.bton.ac.uk,bton.ac.uk,ac.uk,AppleShareIP/5.0.235.0.2,110428,sha1:ETNTURZ5RCU37HYG74QAS77EKQXGGYS2,1998-01-24T07:27:56Z,1998,19980124072756,"org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x03000000EDDEAD0B\, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document",application/vnd.ms-powerpoint,0baddeed,application/octet-stream,application/vnd.ms-powerpoint,application/vnd.ms-powerpoint,powerpoint,1507428472951144448,
lmmHMoCuYDE0
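The parse_error value in the row above reports the header as 0x03000000EDDEAD0B, while content_ffb records the first four bytes as 0baddeed. The two agree if the first eight bytes of the file are read as a little-endian 64-bit integer, which is what the expected value 0xE11AB1A1E011CFD0 also suggests when compared with the d0cf11e0 prefix of real OLE2 files. A small sketch of that check follows. The trailing four bytes 00000003 are inferred from the error message rather than taken from content_ffb, and the helper name is hypothetical.

```python
def as_little_endian_long(first_eight_bytes_hex):
    """Interpret eight hex-encoded bytes as a little-endian 64-bit integer."""
    raw = bytes.fromhex(first_eight_bytes_hex)
    if len(raw) != 8:
        raise ValueError("expected exactly eight bytes")
    return int.from_bytes(raw, "little")


# OLE2 compound document signature, in file order.
OLE2_MAGIC_HEX = "d0cf11e0a1b11ae1"

# content_ffb (0baddeed) followed by four bytes inferred from the error message.
OBSERVED_HEX = "0baddeed00000003"

expected = as_little_endian_long(OLE2_MAGIC_HEX)
observed = as_little_endian_long(OBSERVED_HEX)

print(f"expected 0x{expected:016X}")
print(f"observed 0x{observed:016X}")
```

Under that assumption the file really does start with the bytes 0b ad de ed, and the reversed order in the error message comes only from the integer interpretation.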
anjackson / puid-fmt-111.csv
Created December 14, 2017 09:53
Current results for fmt/111 from the frequent crawl index
crawl_date,content_type_ext,content_type_tika,content_ffb,url
2016-01-21T04:02:00Z,fla,application/x-tika-msoffice,d0cf11e0,http://www.w3.org/WAI/WCAG20/Techniques/working-examples/FLASH31/datagrid_with_caption_as3.fla
2016-10-05T05:44:59Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Licensing%20Committee/20020613/Agenda/Thumbs.db
2016-01-14T17:04:22Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.webarchive.org.uk/wayback/archive/20081011214118/http://www.kereve.com/kernewek/index06_files/oledata.mso
2016-06-23T10:41:17Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.scan.org.uk/aboutus/Reports/Post_Project_Report_Files/200106_files/oledata.mso
2016-10-05T05:33:48Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Planning/20040204/Agenda/Thumbs.db
2016-06-01T16:53:36Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.monmouthshire.gov.uk/Data/Adults%20Select%20Committee/20150616/Agenda/Thumbs.db
2016-01-10T17:43:39Z,mso,applicat
anjackson / puid-fmt-583.csv
Last active December 14, 2017 15:07
Current results for fmt/583 from the frequent crawl index
crawl_date content_type_ext content_type_tika content_ffb url
2016-04-18T18:15:10Z shtml application/xhtml+xml; charset=windows-1252 0d0a0d0a http://www.southportreporter.com/484/484-7.shtml
2016-08-11T22:48:34Z htm text/html; charset=windows-1252 3c68746d https://www.iwight.com/Meetings/committees/Policy%20Commission%20for%20Economy/4-4-07/minutes.htm
2016-01-06T05:23:06Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.t.www.rspb.org.uk/discoverandenjoynature/discoverandlearn/birdguide/name/w/wheatear/videos.aspx
2016-01-06T04:20:01Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.t.www.rspb.org.uk/discoverandenjoynature/seenature/events/details.aspx?id=tcm:9-404023
2016-09-08T19:26:46Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.rspb.org.uk/discoverandenjoynature/seenature/reserves/tags.aspx?tag=flowers
2016-01-06T05:10:16Z application/xhtml+xml; charset=UTF-8 3c21444f http://www.information.www.rspb.org.uk/vacancies/details/411612-membership-development-office
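The content_ffb column holds the first four bytes of each resource, hex encoded, which is hard to read at a glance. A minimal sketch of turning those values back into something legible, assuming the column is always an even-length hex string. The function name is hypothetical, and non-printable bytes are shown as escapes rather than decoded.

```python
def describe_ffb(hex_string):
    """Render hex-encoded bytes as ASCII, escaping anything non-printable."""
    raw = bytes.fromhex(hex_string)
    return "".join(
        chr(byte) if 0x20 <= byte < 0x7F else f"\\x{byte:02x}"
        for byte in raw
    )


# The three distinct content_ffb values in the rows above.
for ffb in ("0d0a0d0a", "3c68746d", "3c21444f"):
    print(ffb, describe_ffb(ffb))
```

For these rows the values decode to the start of an html tag, the start of a DOCTYPE declaration, and two CRLF pairs, which fits resources that are plain HTML regardless of the served or detected content type.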
anjackson / ext-woff2.csv
Created December 14, 2017 10:13
Resources with WOFF2 file extensions, sorted by the first four bytes (hex encoded)
crawl_date content_ffb url
2016-08-11T12:56:31Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Italic.woff2
2016-08-11T12:56:47Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Bold.woff2
2016-08-11T12:56:08Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Regular.woff2
2016-08-11T12:56:27Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-BoldItalic.woff2
2016-05-01T11:35:39Z 0a090a20 http://www.mind.org.uk/fonts/street_corner-webfont.woff2
2016-09-08T15:48:03Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:22:11Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-05-01T11:31:44Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:27:41Z 0a090a20 http://www.mind.org.uk/fonts/street_corner-webfont.woff2
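None of the rows above start with the WOFF2 signature, which is the ASCII string wOF2 (774f4632). A minimal sketch of labelling a content_ffb value against a few well-known font signatures follows. The table is a small illustrative subset, not a replacement for a signature registry such as PRONOM, and the function name and labels are hypothetical.

```python
# A few well-known four-byte font signatures, hex encoded.
KNOWN_SIGNATURES = {
    "774f4632": "WOFF2 ('wOF2')",
    "774f4646": "WOFF 1.0 ('wOFF')",
    "00010000": "TrueType outlines (sfnt version 1.0)",
    "4f54544f": "OpenType with CFF outlines ('OTTO')",
}


def classify_ffb(ffb):
    """Return a rough label for the first four bytes of a resource."""
    ffb = ffb.lower()
    if ffb in KNOWN_SIGNATURES:
        return KNOWN_SIGNATURES[ffb]
    if bytes.fromhex(ffb).isspace():
        # Leading whitespace suggests a text response, not a binary font.
        return "whitespace only (probably text, not a font)"
    return "unrecognised"


for ffb in ("00010000", "0a090a20", "774f4632"):
    print(ffb, "->", classify_ffb(ffb))
```

On that reading, the iod.com files look like TrueType data served under a .woff2 name, and the mind.org.uk responses begin with whitespace, which would be consistent with an HTML page being returned at the font URL.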
anjackson / Nanite on DROID 6.1.3 branch.
Last active March 19, 2017 22:40
Nanite basic comparison with Skeleton Suite
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-894-signature-id-1241.gjf
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-899-signature-id-1249.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-900-signature-id-1251.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-411-signature-id-198.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-451-signature-id-243.skb
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-453-signature-id-242.ttf
application/pdf application/pdf application/pdf; version=1a 2009 1
null application/pdf application/pdf; version=1a 2008 2
null application/pdf application/pdf; version=1a 2009 1
application/pdf application/pdf application/pdf; version=1a 2009 7
null application/pdf application/pdf; version=1a 2009 4
application/pdf application/pdf application/pdf; version=1a 2008 1
application/pdf application/pdf application/pdf; version=1a 2009 2
null application/pdf application/pdf; version=1a 2008 7
application/pdf application/pdf application/pdf; version=1a 2009 2
null application/pdf application/pdf; version=1a 2008 1
anjackson / droid-binsig-example.txt
Created February 24, 2016 19:24
Running Droid from the CLI
$ droid q -Ns ~/.droid6/signature_files/DROID_SignatureFile_V84.xml -Nr ~/Downloads/DDM_SR_20160443_197-312_935.pdf
19:22:23,212 INFO [main] ReflectionServiceFactoryBean:399 - Creating Service {http://pronom.nationalarchives.gov.uk}PronomServiceService from class uk.gov.nationalarchives.pronom.PronomService
DROID 6.1.5 No Profile mode: Runtime Information
Selected folder or file: /Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf
Binary signature file: /Users/andy/.droid6/signature_files/DROID_SignatureFile_V84.xml
Container signature file: None
Open archives: False
/Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf,fmt/18
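In the output above, only the final line is a result: the file path, a comma, and the PRONOM identifier (PUID). A minimal sketch of picking result lines out of such output follows, assuming every result has the path,PUID shape shown here. Lines for unidentified files or multiple matches may look different and are not handled, and the function name is hypothetical.

```python
import re

# PUIDs look like fmt/18 or x-fmt/411.
PUID_PATTERN = re.compile(r"(x-)?fmt/\d+")


def parse_droid_line(line):
    """Return (path, puid) for a result line, or None for anything else."""
    path, separator, puid = line.rstrip("\n").rpartition(",")
    if not separator or not PUID_PATTERN.fullmatch(puid):
        return None
    return path, puid


output = [
    "DROID 6.1.5 No Profile mode: Runtime Information",
    "Container signature file: None",
    "/Users/andy/Downloads/DDM_SR_20160443_197-312_935.pdf,fmt/18",
]
for line in output:
    result = parse_droid_line(line)
    if result is not None:
        print(result)
```

Splitting on the last comma rather than the first keeps paths that contain commas intact, and the PUID check stops log lines with a comma in the timestamp from being mistaken for results.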
anjackson / gist:2888380
Created June 7, 2012 11:47
Making a Bottle app that routes to a proxy
import bottle
from wsgiproxy.app import WSGIProxyApp
# Remove "hop-by-hop" headers (as defined by RFC 2616, Section 13.5.1)
# since they are not allowed by the WSGI standard.
FILTER_HEADERS = [
    'Connection',
    'Keep-Alive',
    'Proxy-Authenticate',
    'Proxy-Authorization',