Ed Summers edsu

View check.py
#!/usr/bin/env python3
#
# This demonstrates an inconsistency in results from the Internet Archive CDX
# API when querying by scopeType=domain vs scopeType=prefix. For context see:
#
# https://inkdroid.org/2022/09/24/pdfs/
#
# Note: you'll need to
#
View check_cdx.py
#!/usr/bin/env python3
import json
druids = ['bj330fg0526', 'bp312sd3142', 'bs648dv9357', 'bz893jg7695', 'bz922hc1158', 'cc095kz3027', 'ch908dt6803', 'cp809cz8166', 'cv292vs5727', 'dn752dz0508', 'dy271hk6968', 'fd892fn4310', 'fj109wp2130', 'fn912wb3725', 'fp815hx3553', 'fs415vb1264', 'fv812yp9241', 'fw782ks7983', 'gf100kp6588', 'gj901jn9353', 'hf001pb6273', 'hh929wg3298', 'hn217tx5368', 'hq140wy0905', 'hv642nf7717', 'hv698ks1475', 'hw434pj6642', 'hw645gv7743', 'jb739pj9696', 'jg940ts4575', 'jh597wr5998', 'jz331hr5976', 'kw186hs7975', 'kx196rt8122', 'ky214ft2956', 'ky357nb9554', 'mg249dy7051', 'mk879xr0461', 'mv110pd4781', 'mv300dt6569', 'mx349xb4098', 'mz415jv3453', 'nd087pt9085', 'nk906ht6735', 'nn453zz9250', 'nr015ch1092', 'nv773xq7981', 'pf139tj8228', 'pn628yn6194', 'pq169jd6716', 'px611qw1504', 'qd726vf4177', 'qk039cf4369', 'qw725qm9638', 'qx771bj6775', 'rv306cp2774', 'sd725cc2793', 'sk583gg2589', 'sn506gj4859', 'sq394vr6558', 'sq694nb4696', 'st474bt2800', 'tk364rs5190', 'tw357sy1852', 'tx189sh1771', 't
View wacz-images.py
#!/usr/bin/env python3
#
# usage: wacz-images.py <wacz_file>
#
# This program will extract images from the WARC files contained in a WACZ
# file and write them to the current working directory using the image's URL
# as a file location.
#
# You will need to `pip install warcio` for it to work.
View titles.py
# print out the url and title of web pages in a WARC file
import bs4
import sys
from warcio.archiveiterator import ArchiveIterator
warc_file = sys.argv[1]
records = ArchiveIterator(open(warc_file, 'rb'))
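The title-extraction step can be illustrated separately from the WARC reading. The sketch below swaps in the standard library's `html.parser` for `bs4` so that it runs without extra installs; `TitleParser` and `extract_title` are hypothetical names, and the input is assumed to be an already decoded HTML string taken from a response record.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the whitespace-normalized page title, or None if absent."""
    parser = TitleParser()
    parser.feed(html)
    title = " ".join("".join(parser.parts).split())
    return title or None
```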
edsu / find.rb
Last active Jun 8, 2022
An example of using an enumerable with parallel, but which gets flattened into a list by parallel prior to processing.
View find.rb
require 'pathname'
require 'parallel'
# a directory to traverse
dir = ARGV[0]
# files is an Enumerator
files = Pathname.new(dir).find
results = Parallel.map(files, processes: 3) do |f|
View x.rb
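def stuff()
  # Return an Enumerator when no block is given, so take(2) can be chained.
  return to_enum(:stuff) unless block_given?
  yield 1
  yield 2
  yield 3
  yield 4
  yield 5
end

stuff.take(2).each do |i|
  puts i
end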
View gif
#!/bin/sh
# Turn a video file into an animated GIF
USAGE="usage: gif video_file [gif_file]"
video_file=$1
if [ "$video_file" = "" ]; then
    echo "$USAGE"
View check_wayback.py
#!/usr/bin/env python3
# This is an example of seeing what unique HTML webpages there are in the
# Wayback Machine for the http://myshtetl.org/ website after 2022-03-01.
from wayback import WaybackClient
wb = WaybackClient()
pages = set()
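The deduplication idea behind the `pages` set can be sketched without any network access. The `Capture` tuple below is a hypothetical stand-in for whatever record type the Wayback client returns; the field names `url`, `timestamp`, and `mime_type` are assumptions made for this illustration and should be checked against the `wayback` package before wiring the two together.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical stand-in for a capture record; field names are assumptions.
Capture = namedtuple("Capture", ["url", "timestamp", "mime_type"])


def unique_html_urls(records, since):
    """Return the set of URLs captured as text/html at or after `since`."""
    pages = set()
    for rec in records:
        if rec.mime_type == "text/html" and rec.timestamp >= since:
            pages.add(rec.url)
    return pages


# Example with hand-made records (not real Wayback data):
since = datetime(2022, 3, 1, tzinfo=timezone.utc)
sample = [
    Capture("http://myshtetl.org/", datetime(2022, 4, 1, tzinfo=timezone.utc), "text/html"),
    Capture("http://myshtetl.org/", datetime(2022, 5, 1, tzinfo=timezone.utc), "text/html"),
    Capture("http://myshtetl.org/a.jpg", datetime(2022, 4, 2, tzinfo=timezone.utc), "image/jpeg"),
    Capture("http://myshtetl.org/old.html", datetime(2021, 1, 1, tzinfo=timezone.utc), "text/html"),
]
unique = unique_html_urls(sample, since)
```

Using a set means repeated captures of the same URL collapse to one entry, which is what "unique pages" asks for.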
edsu / README.md
Last active Apr 14, 2022
Debugging PyWB and Wayback
View README.md

I'm trying to figure out why this JavaScript file, when rendered through PyWB, throws an `Uncaught SyntaxError: missing formal parameter` in Firefox and an `Uncaught SyntaxError: Unexpected token 'function' (at pywb.js:15:5628639)` in Chrome, whereas it works fine when rendered through Archive-It Wayback.

curl http://localhost:8080/sul/20220225003837js_/https://prod.smassets.net/assets/anweb/anweb-shared-page-summary-bundle-min.58b903b5.js > pywb.js

curl https://wayback.archive-it.org/18713/20220225003837js_/https://prod.smassets.net/assets/anweb/anweb-shared-page-summary-bundle-min.58b903b5.js > wayback.js

Open wayback.html and pywb.html in your browser and check the developer console; the error appears only for pywb.html.
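Since Chrome reports a position (`pywb.js:15:5628639`), one way to dig further is to look at the text around that line and column, and to find where the two downloaded files first diverge. The following is a rough sketch under stated assumptions: `first_difference` and `snippet_at` are hypothetical helpers, the browser's line and column are treated as 1-based, and the two replay systems rewrite content differently, so the first differing byte may be unrelated to the syntax error.

```python
def first_difference(a, b):
    """Return the index of the first differing byte, or None if identical.

    If one input is a prefix of the other, the shorter length is returned.
    """
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return None if len(a) == len(b) else limit


def snippet_at(text, line, column, width=60):
    """Return the text around a 1-based line/column position."""
    target = text.split("\n")[line - 1]
    start = max(0, column - 1 - width)
    return target[start:column - 1 + width]


# Possible usage with the files fetched by the curl commands above:
#   pywb = open("pywb.js", "rb").read()
#   wayback = open("wayback.js", "rb").read()
#   print(first_difference(pywb, wayback))
#   print(snippet_at(pywb.decode("utf-8", "replace"), 15, 5628639))
```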