Andy Jackson anjackson

anjackson / rabbithole.md
Last active Jan 4, 2021
2021-01-03 US General Election Voting Data Problem

While digging too deep into Twitter and following some conspiracy tweets about Trump's election loss, I came across this odd site: https://hereistheevidence.com/

This site claims to provide tools and links to data to show alleged voting irregularities, and gives examples like this: https://twitter.com/indio007/status/1331828590552428544

Returned BEFORE ballot was mailed
23305 ballots pic.twitter.com/t0O5mUMWKh

— noone special (@indio007) November 26, 2020

Out of curiosity, I thought I'd see if I could reproduce the alleged irregularities.

The here-is-the-evidence site provides tools to download, but I'm not going anywhere near those: installing software from a site like this would be very risky. And anyway, basic tools like grep should be enough.
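
The kind of check alleged in the tweet (a ballot recorded as returned before it was mailed) needs very little code. A minimal sketch in Python, assuming a hypothetical CSV export with `ballot_mailed` and `ballot_returned` columns holding ISO dates; the real files may well use different column names and date formats:

```python
import csv
import io
from datetime import date


def returned_before_mailed(rows, mailed_col="ballot_mailed", returned_col="ballot_returned"):
    """Return the rows whose return date is earlier than their mail date.

    The column names are hypothetical placeholders, not taken from any real
    voter file. Rows with a blank or unparseable date are skipped.
    """
    flagged = []
    for row in rows:
        try:
            mailed = date.fromisoformat(row[mailed_col].strip())
            returned = date.fromisoformat(row[returned_col].strip())
        except (KeyError, ValueError, AttributeError):
            continue
        if returned < mailed:
            flagged.append(row)
    return flagged


# Tiny made-up sample, for illustration only.
SAMPLE = """voter_id,ballot_mailed,ballot_returned
1,2020-10-01,2020-10-12
2,2020-10-05,2020-09-28
3,2020-10-05,
"""

sample_flagged = returned_before_mailed(csv.DictReader(io.StringIO(SAMPLE)))
print([row["voter_id"] for row in sample_flagged])
```

Flagged rows are only a starting point; what the dates actually mean depends on how the data source defines each column.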

anjackson / govuk-council-hosts.csv
Last active Sep 11, 2019
A list of unique GOV.UK domains with the word 'council' in the text, as discovered by UKWA crawlers since 2018-01-01
host,title
tendringdc.gov.uk,Contact Council Tax | Tendring District Council
thanet.gov.uk,Thanet District Council
northampton.gov.uk,Northampton Borough Council Homepage
kettering.gov.uk,Kettering Borough Council Homepage
sholland.gov.uk,South Holland District Council - South Holland District Council
ryedale.gov.uk,Ryedale District Council - working with you to make a difference
sevenoaks.gov.uk,Sevenoaks District Council homepage
barmouthtowncouncil.gov.uk,Barmouth Town Council - Barmouth Town Council
testvalley.gov.uk,Home | Test Valley Borough Council
View current_elements_used.csv
,year,elements_used,count
0,2016,a,2787170
1,2016,html,2758395
2,2016,head,2754630
3,2016,title,2753436
4,2016,meta,2717990
5,2016,script,2715418
6,2016,link,2701135
7,2016,div,2695886
8,2016,img,2670980
View current_tika_results.csv
,year,content_type_tika,count
0,2016,text/html; charset=UTF-8,278514582
1,2016,application/xhtml+xml; charset=UTF-8,117731771
2,2016,image/jpeg,48557044
3,2016,application/rss+xml,16497156
4,2016,text/html; charset=ISO-8859-1,13856782
5,2016,application/xhtml+xml; charset=ISO-8859-1,12076356
6,2016,image/gif,7153120
7,2016,text/html; charset=windows-1252,6334265
8,2016,application/xhtml+xml; charset=windows-1252,5859147
anjackson / current_droid_results.csv
Created Mar 15, 2019
Sample work-in-progress format results
,year,content_type_droid,count
0,2016,text/html; version=5,302988660
1,2016,application/xhtml+xml; version=1.0,75434142
2,2016,image/jpeg; version=1.01,36520960
3,2016,text/html,33316347
4,2016,application/xml; version=1.0,24319951
5,2016,application/octet-stream,13247847
6,2016,image/jpeg; version=1.02,8928862
7,2016,text/html; version=4.01,8237526
8,2016,image/gif; version=89a,6932182
View SimpleWARCAnalyser.java
/**
*
*/
package uk.bl.wa.analyser;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
anjackson / hits.csv
Created Mar 9, 2018
0x0baddeed in the JISC/IA Historical Archive
id,source_file,source_file_offset,url,url_type,resourcename,content_type_ext,host,domain,public_suffix,server,content_length,hash,crawl_date,crawl_year,wayback_date,parse_error,content_type,content_ffb,content_type_droid,content_type_tika,content_type_full,content_type_norm,_version_,content_type_served
Q2lQuCTsZJo5clOVhSEebQ==/19980124072756,DOTUK-HISTORICAL-1996-2010-GROUP-AB-XAAYBI-20110428000000-00000.arc.gz,11984270,http://www-centrim.bus.bton.ac.uk:80/projects/esrcitm/proc.inn/6/6.ppt,normal,6.ppt,ppt,www-centrim.bus.bton.ac.uk,bton.ac.uk,ac.uk,AppleShareIP/5.0.235.0.2,110428,sha1:ETNTURZ5RCU37HYG74QAS77EKQXGGYS2,1998-01-24T07:27:56Z,1998,19980124072756,"org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x03000000EDDEAD0B\, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document",application/vnd.ms-powerpoint,0baddeed,application/octet-stream,application/vnd.ms-powerpoint,application/vnd.ms-powerpoint,powerpoint,1507428472951144448,
lmmHMoCuYDE0
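
The `0baddeed` in `content_ffb` and the `0x03000000EDDEAD0B` in the parse error look like the same bytes seen two ways: the error message appears to print the first eight bytes as a little-endian 64-bit integer. A short Python sketch of that reading, assuming the file starts with `0b ad de ed 00 00 00 03` (the last four bytes are inferred from the error message, not read from the file itself):

```python
import struct

# OLE2 (Compound File) magic bytes, in on-disk order.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

# Assumed first eight bytes of the failing .ppt, reconstructed from the error message.
ASSUMED_HEADER = bytes.fromhex("0baddeed00000003")


def first_four_bytes_hex(data: bytes) -> str:
    """Hex-encode the first four bytes, in the style of the content_ffb column."""
    return data[:4].hex()


def le_signature(data: bytes) -> str:
    """Format the first eight bytes as a little-endian 64-bit value."""
    (value,) = struct.unpack("<Q", data[:8])
    return f"0x{value:016X}"


print(first_four_bytes_hex(ASSUMED_HEADER))  # 0baddeed
print(le_signature(ASSUMED_HEADER))          # 0x03000000EDDEAD0B
print(le_signature(OLE2_MAGIC))              # 0xE11AB1A1E011CFD0
```

Read the same way, the standard OLE2 magic bytes give the "expected" value quoted in the error, which supports the byte-order interpretation.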
anjackson / ext-woff2.csv
Created Dec 14, 2017
Resources with WOFF2 file extensions, sorted by the first four bytes (hex encoded)
crawl_date,content_ffb,url
2016-08-11T12:56:31Z,00010000,https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Italic.woff2
2016-08-11T12:56:47Z,00010000,https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Bold.woff2
2016-08-11T12:56:08Z,00010000,https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-Regular.woff2
2016-08-11T12:56:27Z,00010000,https://www.iod.com/Portals/_default/Skins/IoD/fonts/ID00/ID00Serif-BoldItalic.woff2
2016-05-01T11:35:39Z,0a090a20,http://www.mind.org.uk/fonts/street_corner-webfont.woff2
2016-09-08T15:48:03Z,0a090a20,http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:22:11Z,0a090a20,http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-05-01T11:31:44Z,0a090a20,http://www.mind.org.uk/fonts/2FBE41_0_0.woff2
2016-04-01T11:27:41Z,0a090a20,http://www.mind.org.uk/fonts/street_corner-webfont.woff2
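
Sorting by the first four bytes makes mislabelled files easy to spot, because a genuine WOFF2 file starts with the signature `wOF2` (`774f4632`). A rough Python sketch of that check; the `classify_ffb` helper and its labels are illustrative and not part of the crawl tooling:

```python
# Known four-byte prefixes, keyed by their hex encoding.
KNOWN_PREFIXES = {
    "774f4632": "WOFF2 (signature 'wOF2')",
    "774f4646": "WOFF 1.0 (signature 'wOFF')",
    "00010000": "TrueType/OpenType sfnt (version 1.0)",
}


def classify_ffb(ffb_hex: str) -> str:
    """Give a best-guess label for a hex-encoded four-byte prefix."""
    key = ffb_hex.lower()
    if key in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[key]
    raw = bytes.fromhex(key)
    # bytes.strip() with no argument removes ASCII whitespace.
    if raw.strip() == b"":
        return "whitespace only (probably text, not a font)"
    return "unknown"


for ffb in ("00010000", "0a090a20", "774f4632"):
    print(ffb, "->", classify_ffb(ffb))
```

On the rows above, `00010000` reads as a plain sfnt font served under a `.woff2` name, and `0a090a20` is just line feeds, a tab and a space, which suggests a text response rather than font data.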
anjackson / puid-fmt-111.csv
Created Dec 14, 2017
Current results for fmt/111 from the frequent crawl index
crawl_date,content_type_ext,content_type_tika,content_ffb,url
2016-01-21T04:02:00Z,fla,application/x-tika-msoffice,d0cf11e0,http://www.w3.org/WAI/WCAG20/Techniques/working-examples/FLASH31/datagrid_with_caption_as3.fla
2016-10-05T05:44:59Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Licensing%20Committee/20020613/Agenda/Thumbs.db
2016-01-14T17:04:22Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.webarchive.org.uk/wayback/archive/20081011214118/http://www.kereve.com/kernewek/index06_files/oledata.mso
2016-06-23T10:41:17Z,mso,application/x-tika-msoffice,d0cf11e0,http://www.scan.org.uk/aboutus/Reports/Post_Project_Report_Files/200106_files/oledata.mso
2016-10-05T05:33:48Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.cambridge.gov.uk/Data/Planning/20040204/Agenda/Thumbs.db
2016-06-01T16:53:36Z,db,application/x-tika-msoffice,d0cf11e0,http://democracy.monmouthshire.gov.uk/Data/Adults%20Select%20Committee/20150616/Agenda/Thumbs.db
2016-01-10T17:43:39Z,mso,applicat
anjackson / puid-fmt-583.csv
Last active Dec 14, 2017
Current results for fmt/583 from the frequent crawl index
crawl_date,content_type_ext,content_type_tika,content_ffb,url
2016-04-18T18:15:10Z,shtml,application/xhtml+xml; charset=windows-1252,0d0a0d0a,http://www.southportreporter.com/484/484-7.shtml
2016-08-11T22:48:34Z,htm,text/html; charset=windows-1252,3c68746d,https://www.iwight.com/Meetings/committees/Policy%20Commission%20for%20Economy/4-4-07/minutes.htm
2016-01-06T05:23:06Z,aspx,application/xhtml+xml; charset=UTF-8,3c21444f,http://www.t.www.rspb.org.uk/discoverandenjoynature/discoverandlearn/birdguide/name/w/wheatear/videos.aspx
2016-01-06T04:20:01Z,aspx,application/xhtml+xml; charset=UTF-8,3c21444f,http://www.t.www.rspb.org.uk/discoverandenjoynature/seenature/events/details.aspx?id=tcm:9-404023
2016-09-08T19:26:46Z,aspx,application/xhtml+xml; charset=UTF-8,3c21444f,http://www.rspb.org.uk/discoverandenjoynature/seenature/reserves/tags.aspx?tag=flowers
2016-01-06T05:10:16Z,,application/xhtml+xml; charset=UTF-8,3c21444f,http://www.information.www.rspb.org.uk/vacancies/details/411612-membership-development-office