Andy Jackson anjackson

@anjackson
anjackson / watch_v_Hnrdfb6HiK0.html
Created August 3, 2022 12:02
Example file with UTF-8 that is not being detected by Tika
<!DOCTYPE html><html style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" lang="en-GB" system-icons typography typography-spacing><head><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta http-equiv="origin-trial" content="At2B4ABoBE3kiyFJp5tVx/Zi81HAk2cn2zjA0NcVurqsrwLHavE/fe86HDn71lPLg+o1Rf7jkyRD7QdT4TS8+g0AAABteyJvcmlnaW4iOiJodHRwczovL3lvdXR1YmUuY29tOjQ0MyIsImZlYXR1cmUiOiJQcml2YWN5U2FuZGJveEFkc0FQSXMiLCJleHBpcnkiOjE2NjEyOTkxOTksImlzU3ViZG9tYWluIjp0cnVlfQ=="/><script nonce="zVi1d0BoBdL0_JsEzzYI1g">var ytcfg={d:function(){return window.yt&&yt.config_||ytcfg.data_||(ytcfg.data_={})},get:function(k,o){return k in ytcfg.d()?ytcfg.d()[k]:o},set:function(){var a=arguments;if(a.length>1)ytcfg.d()[a[0]]=a[1];else for(var k in a[0])ytcfg.d()[k]=a[0][k]}};
window.ytcfg.set('EMERGENCY_BASE_URL', '\/error_204?t\x3djserror\x26level\x3dERROR\x26client.name\x3d1\x26client.version\x3d2.20220801.00.00');</script><script nonce="zVi1d0BoBdL0_JsEzzYI1g">(function(){window.yterr=window.yterr||true;window.unha
2022-07-13T09:58:56.201Z 200 1039 https://www.vectisradio.com/wp-sitemap.xml I https://www.vectisradio.com/robots.txt application/xml #127 20220713095855289+777 sha1:VFT6NNA5DW5POM6PGF7ERSJYBV6LDNAJ tid:97213:https://www.vectisradio.com/schedule/ isSitemap,launchTimestamp:20220713095854,duplicate:digest,ip:185.151.30.133 {"contentSize":1483,"warcFilename":"BL-NPLD-20220713084656856-19144-80~npld-heritrix3-worker-1~8443.warc.gz","warcFileOffset":380231951,"scopeDecision":"ACCEPT by rule #1 WatchedFileSurtPrefixedDecideRule","warcFileRecordLength":1623}
2022-07-13T09:59:06.496Z -5002 - https://healthwatchwarrington.co.uk/get-involved/ - https://healthwatchwarrington.co.uk/get-involved/ unknown #061 20220713095152869+433395 - tid:1673:http://www.healthwatchwarrington.co.uk/ launchTimestamp:20220713090000,WebRenderStatus:200,resetQuotas,WebRenderCount:1 {"warcPrefix":"BL-NPLD-WEBRENDER-frequent-npld-20220606093552","scopeDecision":"ACCEPT by rule #1 WatchedFileSurtPrefixedDecideRule"}
2022-07-13T
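The log lines above appear to follow the Heritrix 3 crawl.log layout: twelve whitespace-separated fields followed by a JSON blob of extra information. As a rough sketch, assuming that layout holds for every line, they could be split like this in Python. The names in FIELDS and the function parse_crawl_log_line are labels chosen here for illustration; they are not taken from Heritrix or from the gist.

```python
import json

# Assumed field order for a Heritrix 3 crawl.log line (labels are illustrative).
FIELDS = [
    "timestamp", "status", "size", "url", "hop_path", "referrer",
    "mime_type", "thread", "fetch_start_and_duration", "digest",
    "source", "annotations",
]

def parse_crawl_log_line(line):
    """Split one crawl log line into a dict, keeping the trailing JSON intact."""
    # Limit the number of splits so that spaces inside the JSON blob survive.
    parts = line.strip().split(None, len(FIELDS))
    record = dict(zip(FIELDS, parts))
    # Negative status codes (such as -5002) are crawler-internal, not HTTP.
    record["status"] = int(record["status"])
    # The JSON blob is treated as optional here.
    record["extra"] = json.loads(parts[len(FIELDS)]) if len(parts) > len(FIELDS) else {}
    return record

# The second complete log line shown above, reproduced as a test input.
SAMPLE = (
    '2022-07-13T09:59:06.496Z -5002 - '
    'https://healthwatchwarrington.co.uk/get-involved/ - '
    'https://healthwatchwarrington.co.uk/get-involved/ unknown #061 '
    '20220713095152869+433395 - '
    'tid:1673:http://www.healthwatchwarrington.co.uk/ '
    'launchTimestamp:20220713090000,WebRenderStatus:200,resetQuotas,WebRenderCount:1 '
    '{"warcPrefix":"BL-NPLD-WEBRENDER-frequent-npld-20220606093552",'
    '"scopeDecision":"ACCEPT by rule #1 WatchedFileSurtPrefixedDecideRule"}'
)

record = parse_crawl_log_line(SAMPLE)
print(record["status"], record["url"], record["extra"]["scopeDecision"])
```

Splitting with a maximum split count, rather than on every space, is what keeps values such as "ACCEPT by rule #1 ..." inside the JSON from being broken apart.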
anjackson / aot-collection-example.json
Created June 8, 2022 15:15
An example of 'back end' collection data.
{
"ttype": "collections",
"id": 4028,
"url": "act-4028",
"created_at": "2021-10-13 09:25:25.153",
"name": "Public Health Discourse",
"description": "Writing and other materials reflecting on health, from the open web.\n\nPlease get in touch with the leads about this collection: Cui Cui (cui.cui@bodleian.ox.ac.uk); Alice Doyle (adoyle2@exseed.ed.ac.uk) and Leontien Talboom (lkt39@cam.ac.uk).",
"publish": false,
"parents_all": "",
"revision": "Archive of Tomorrow collection. This project runs from February 2022 until the beginning of 2023. Eilidh MacGlone 29/03/2022.",
anjackson / rabbithole.md
Last active January 4, 2021 21:12
2020-01-03 US General Election Voting Data Problem

When digging too deep into Twitter, following some conspiracy tweets about Trump's election loss, I came to this odd site: https://hereistheevidence.com/

This site claims to provide tools and links to data to show alleged voting irregularities, and gives examples like this: https://twitter.com/indio007/status/1331828590552428544

Returned BEFORE ballot was mailed
23305 ballots pic.twitter.com/t0O5mUMWKh

— noone special (@indio007) November 26, 2020

Out of curiosity, I thought I'd see if I could reproduce the alleged irregularities.

The here-is-the-evidence site provides tools to download, but I'm not going to go anywhere near those. Installing software from a site like this would be very risky. And anyway, basic tools like grep would be enough

anjackson / govuk-council-hosts.csv
Last active September 11, 2019 09:41
A list of unique GOV.UK domains with the word 'council' in the text, as discovered by UKWA crawlers since 2018-01-01
host title
tendringdc.gov.uk Contact Council Tax | Tendring District Council
thanet.gov.uk Thanet District Council
northampton.gov.uk Northampton Borough Council Homepage
kettering.gov.uk Kettering Borough Council Homepage
sholland.gov.uk South Holland District Council - South Holland District Council
ryedale.gov.uk Ryedale District Council - working with you to make a difference
sevenoaks.gov.uk Sevenoaks District Council homepage
barmouthtowncouncil.gov.uk Barmouth Town Council - Barmouth Town Council
testvalley.gov.uk Home | Test Valley Borough Council
year elements_used count
0 2016 a 2787170
1 2016 html 2758395
2 2016 head 2754630
3 2016 title 2753436
4 2016 meta 2717990
5 2016 script 2715418
6 2016 link 2701135
7 2016 div 2695886
8 2016 img 2670980
year content_type_tika count
0 2016 text/html; charset=UTF-8 278514582
1 2016 application/xhtml+xml; charset=UTF-8 117731771
2 2016 image/jpeg 48557044
3 2016 application/rss+xml 16497156
4 2016 text/html; charset=ISO-8859-1 13856782
5 2016 application/xhtml+xml; charset=ISO-8859-1 12076356
6 2016 image/gif 7153120
7 2016 text/html; charset=windows-1252 6334265
8 2016 application/xhtml+xml; charset=windows-1252 5859147
anjackson / current_droid_results.csv
Created March 15, 2019 16:12
Sample work-in-progress format results
year content_type_droid count
0 2016 text/html; version=5 302988660
1 2016 application/xhtml+xml; version=1.0 75434142
2 2016 image/jpeg; version=1.01 36520960
3 2016 text/html 33316347
4 2016 application/xml; version=1.0 24319951
5 2016 application/octet-stream 13247847
6 2016 image/jpeg; version=1.02 8928862
7 2016 text/html; version=4.01 8237526
8 2016 image/gif; version=89a 6932182
anjackson / SimpleWARCAnalyser.java
Created October 2, 2018 23:24
Checking CC WARC
/**
*
*/
package uk.bl.wa.analyser;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
anjackson / hits.csv
Created March 9, 2018 10:11
0x0baddeed in the JISC/IA Historical Archive
id,source_file,source_file_offset,url,url_type,resourcename,content_type_ext,host,domain,public_suffix,server,content_length,hash,crawl_date,crawl_year,wayback_date,parse_error,content_type,content_ffb,content_type_droid,content_type_tika,content_type_full,content_type_norm,_version_,content_type_served
Q2lQuCTsZJo5clOVhSEebQ==/19980124072756,DOTUK-HISTORICAL-1996-2010-GROUP-AB-XAAYBI-20110428000000-00000.arc.gz,11984270,http://www-centrim.bus.bton.ac.uk:80/projects/esrcitm/proc.inn/6/6.ppt,normal,6.ppt,ppt,www-centrim.bus.bton.ac.uk,bton.ac.uk,ac.uk,AppleShareIP/5.0.235.0.2,110428,sha1:ETNTURZ5RCU37HYG74QAS77EKQXGGYS2,1998-01-24T07:27:56Z,1998,19980124072756,"org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x03000000EDDEAD0B\, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document",application/vnd.ms-powerpoint,0baddeed,application/octet-stream,application/vnd.ms-powerpoint,application/vnd.ms-powerpoint,powerpoint,1507428472951144448,
lmmHMoCuYDE0
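The parse_error in the first row above reports a header signature of 0x03000000EDDEAD0B where 0xE11AB1A1E011CFD0 was expected, while content_ffb records the first four bytes as 0baddeed. The two agree if the eight-byte signature is read as a little-endian integer. The following Python sketch illustrates that relationship; the function names classify_header and as_little_endian_hex are hypothetical, and the eight-byte header used in the example is reconstructed from the error message rather than read from the original file.

```python
# Standard OLE2 (Compound File) signature, in on-disk byte order.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# The four leading bytes recorded in content_ffb for the hits in this file.
BADDEED_MAGIC = bytes.fromhex("0BADDEED")

def classify_header(header: bytes) -> str:
    """Label a file by its leading bytes (illustrative categories only)."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(BADDEED_MAGIC):
        return "0baddeed"
    return "unknown"

def as_little_endian_hex(header: bytes) -> str:
    """Render the first eight bytes as a little-endian 64-bit hex value."""
    value = int.from_bytes(header[:8], "little")
    return f"0x{value:016X}"

# Header bytes reconstructed from the error message (an assumption).
suspect = bytes.fromhex("0BADDEED00000003")

print(classify_header(suspect))          # 0baddeed
print(as_little_endian_hex(suspect))     # 0x03000000EDDEAD0B
print(as_little_endian_hex(OLE2_MAGIC))  # 0xE11AB1A1E011CFD0
```

A check like this only looks at the leading bytes, so it can separate these files from genuine OLE2 documents but says nothing about what format they actually are.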