Andy Jackson anjackson

@anjackson
anjackson / govuk-council-hosts.csv
Last active Sep 11, 2019
A list of unique GOV.UK domains with the word 'council' in the text, as discovered by UKWA crawlers since 2018-01-01
View govuk-council-hosts.csv
host title
tendringdc.gov.uk Contact Council Tax | Tendring District Council
thanet.gov.uk Thanet District Council
northampton.gov.uk Northampton Borough Council Homepage
kettering.gov.uk Kettering Borough Council Homepage
sholland.gov.uk South Holland District Council - South Holland District Council
ryedale.gov.uk Ryedale District Council - working with you to make a difference
sevenoaks.gov.uk Sevenoaks District Council homepage
barmouthtowncouncil.gov.uk Barmouth Town Council - Barmouth Town Council
testvalley.gov.uk Home | Test Valley Borough Council
View current_elements_used.csv
year elements_used count
0 2016 a 2787170
1 2016 html 2758395
2 2016 head 2754630
3 2016 title 2753436
4 2016 meta 2717990
5 2016 script 2715418
6 2016 link 2701135
7 2016 div 2695886
8 2016 img 2670980
View current_tika_results.csv
year content_type_tika count
0 2016 text/html; charset=UTF-8 278514582
1 2016 application/xhtml+xml; charset=UTF-8 117731771
2 2016 image/jpeg 48557044
3 2016 application/rss+xml 16497156
4 2016 text/html; charset=ISO-8859-1 13856782
5 2016 application/xhtml+xml; charset=ISO-8859-1 12076356
6 2016 image/gif 7153120
7 2016 text/html; charset=windows-1252 6334265
8 2016 application/xhtml+xml; charset=windows-1252 5859147
anjackson / current_droid_results.csv
Created Mar 15, 2019
Sample work-in-progress format results
View current_droid_results.csv
year content_type_droid count
0 2016 text/html; version=5 302988660
1 2016 application/xhtml+xml; version=1.0 75434142
2 2016 image/jpeg; version=1.01 36520960
3 2016 text/html 33316347
4 2016 application/xml; version=1.0 24319951
5 2016 application/octet-stream 13247847
6 2016 image/jpeg; version=1.02 8928862
7 2016 text/html; version=4.01 8237526
8 2016 image/gif; version=89a 6932182
View SimpleWARCAnalyser.java
/**
*
*/
package uk.bl.wa.analyser;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
anjackson / hits.csv
Created Mar 9, 2018
0x0baddeed in the JISC/IA Historical Archive
View hits.csv
id,source_file,source_file_offset,url,url_type,resourcename,content_type_ext,host,domain,public_suffix,server,content_length,hash,crawl_date,crawl_year,wayback_date,parse_error,content_type,content_ffb,content_type_droid,content_type_tika,content_type_full,content_type_norm,_version_,content_type_served
Q2lQuCTsZJo5clOVhSEebQ==/19980124072756,DOTUK-HISTORICAL-1996-2010-GROUP-AB-XAAYBI-20110428000000-00000.arc.gz,11984270,http://www-centrim.bus.bton.ac.uk:80/projects/esrcitm/proc.inn/6/6.ppt,normal,6.ppt,ppt,www-centrim.bus.bton.ac.uk,bton.ac.uk,ac.uk,AppleShareIP/5.0.235.0.2,110428,sha1:ETNTURZ5RCU37HYG74QAS77EKQXGGYS2,1998-01-24T07:27:56Z,1998,19980124072756,"org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x03000000EDDEAD0B\, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document",application/vnd.ms-powerpoint,0baddeed,application/octet-stream,application/vnd.ms-powerpoint,application/vnd.ms-powerpoint,powerpoint,1507428472951144448,
lmmHMoCuYDE0
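The parse_error in the row above reports the header as 0x03000000EDDEAD0B while content_ffb records 0baddeed. The two agree if the first eight bytes are read as a little-endian long, which is what the message format suggests. The following is a minimal sketch of that relationship, not code from the analyser: the class name, method name and the eight header bytes are assumptions, with the bytes reconstructed from the error message itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeaderSignatureDemo {

    /** Read the first eight bytes as a little-endian long and format it like the error message. */
    static String littleEndianSignature(byte[] data) {
        long value = ByteBuffer.wrap(data, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
        return String.format("0x%016X", value);
    }

    public static void main(String[] args) {
        // Assumed leading bytes of the .ppt resource, reconstructed from the parse_error text.
        byte[] header = {0x0b, (byte) 0xad, (byte) 0xde, (byte) 0xed, 0x00, 0x00, 0x00, 0x03};
        // Leading bytes of an OLE2 container (d0cf11e0 also appears as content_ffb elsewhere).
        byte[] ole2 = {(byte) 0xd0, (byte) 0xcf, 0x11, (byte) 0xe0,
                       (byte) 0xa1, (byte) 0xb1, 0x1a, (byte) 0xe1};

        System.out.println(littleEndianSignature(header)); // value reported as "read"
        System.out.println(littleEndianSignature(ole2));   // value reported as "expected"
    }
}
```

Read this way, the byte order in the message is simply the reverse of the order on disk, so a search on content_ffb for 0baddeed finds the same resources that produce this error.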
anjackson / ext-woff2.csv
Created Dec 14, 2017
Resources with WOFF2 file extensions, sorted by the first four bytes (hex encoded)
View ext-woff2.csv
crawl_date content_ffb url
2016-08-11T12:56:31Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Italic.woff2
2016-08-11T12:56:47Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Bold.woff2
2016-08-11T12:56:08Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Regular.woff2
2016-08-11T12:56:27Z 00010000 https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-BoldItalic.woff2
2016-05-01T11:35:39Z 0a090a20 http://www.mind.org.uk/fonts/street_corner-webfont.woff2
2016-09-08T15:48:03Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:22:11Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-05-01T11:31:44Z 0a090a20 http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:27:41Z 0a090a20 http://www.mind.org.uk/fonts/street_corner-webfont.woff2
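None of the content_ffb values in the rows above match the WOFF2 signature, which is the ASCII string 'wOF2' (774f4632 in hex). The sketch below shows one way such a four-byte prefix could be produced and roughly labelled. It is an illustration only: the class and method names are hypothetical, and the label table covers just the prefixes seen in this sample.

```java
public class Woff2SignatureCheck {

    /** Hex-encode up to the first four bytes, mirroring the content_ffb column. */
    static String firstFourBytesHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(4, data.length); i++) {
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    /** Rough label for a content_ffb value; only a few well-known prefixes are covered. */
    static String describe(String ffb) {
        if (ffb.equals("774f4632")) {
            return "WOFF2 ('wOF2' signature)";
        }
        if (ffb.equals("00010000")) {
            return "sfnt version 1.0 (TrueType outlines)";
        }
        if (ffb.matches("(09|0a|0d|20)+")) {
            return "ASCII whitespace (likely a text response)";
        }
        return "unrecognised";
    }

    public static void main(String[] args) {
        // A genuine WOFF2 file starts with the bytes 'w', 'O', 'F', '2'.
        byte[] woff2Start = {'w', 'O', 'F', '2', 0, 1, 0, 0};
        System.out.println(firstFourBytesHex(woff2Start) + " -> "
                + describe(firstFourBytesHex(woff2Start)));

        // Prefixes taken from the rows listed above.
        for (String ffb : new String[] {"00010000", "0a090a20"}) {
            System.out.println(ffb + " -> " + describe(ffb));
        }
    }
}
```

On this reading, the iod.com files look like TrueType fonts served with a .woff2 extension, and the mind.org.uk responses begin with whitespace, which would be consistent with a text or HTML page rather than a font.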
anjackson / puid-fmt-111.csv
Created Dec 14, 2017
Current results for fmt/111 from the frequent crawl index
View puid-fmt-111.csv
crawl_date,content_type_ext,content_type_tika,content_ffb,url
2016-01-21T04:02:00Z,fla,application/x-tika-msoffice,d0cf11e0,http://www.w3.org/WAI/WCAG20/Techniques/working-examples/FLASH31/datagrid_with_caption_as3.fla
2016-10-05T05:44:59Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Licensing%20Committee/20020613/Agenda/Thumbs.db
2016-01-14T17:04:22Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.webarchive.org.uk/wayback/archive/20081011214118/http://www.kereve.com/kernewek/index06_files/oledata.mso
2016-06-23T10:41:17Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.scan.org.uk/aboutus/Reports/Post_Project_Report_Files/200106_files/oledata.mso
2016-10-05T05:33:48Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Planning/20040204/Agenda/Thumbs.db
2016-06-01T16:53:36Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.monmouthshire.gov.uk/Data/Adults%20Select%20Committee/20150616/Agenda/Thumbs.db
2016-01-10T17:43:39Z,mso,applicat
anjackson / puid-fmt-583.csv
Last active Dec 14, 2017
Current results for fmt/583 from the frequent crawl index
View puid-fmt-583.csv
crawl_date content_type_ext content_type_tika content_ffb url
2016-04-18T18:15:10Z shtml application/xhtml+xml; charset=windows-1252 0d0a0d0a http://www.southportreporter.com/484/484-7.shtml
2016-08-11T22:48:34Z htm text/html; charset=windows-1252 3c68746d https://www.iwight.com/Meetings/committees/Policy%20Commission%20for%20Economy/4-4-07/minutes.htm
2016-01-06T05:23:06Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.t.www.rspb.org.uk/discoverandenjoynature/discoverandlearn/birdguide/name/w/wheatear/videos.aspx
2016-01-06T04:20:01Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.t.www.rspb.org.uk/discoverandenjoynature/seenature/events/details.aspx?id=tcm:9-404023
2016-09-08T19:26:46Z aspx application/xhtml+xml; charset=UTF-8 3c21444f http://www.rspb.org.uk/discoverandenjoynature/seenature/reserves/tags.aspx?tag=flowers
2016-01-06T05:10:16Z application/xhtml+xml; charset=UTF-8 3c21444f http://www.information.www.rspb.org.uk/vacancies/details/411612-membership-development-office
anjackson / Nanite on DROID 6.1.3 branch.
Last active Mar 19, 2017
Nanite basic comparison with Skeleton Suite
View Nanite on DROID 6.1.3 branch.
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-894-signature-id-1241.gjf
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-899-signature-id-1249.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/fmt/fmt-900-signature-id-1251.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-411-signature-id-198.exe
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-451-signature-id-243.skb
FAIL: File could not be identified! src/test/resources/skeleton-suite-test/skeleton-suite-pronom-export-2017-03-08-sig-file-v89/x-fmt/x-fmt-453-signature-id-242.ttf