Ed Summers edsu
@edsu
edsu / check.py
Created August 3, 2023 14:51
Check a specific WARC file that is being discussed in IIPC Slack
#!/usr/bin/env python

from warcio.archiveiterator import ArchiveIterator

with open('archive/rec-20230722210008512613-81a34b41ee13.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        if record.rec_type == 'response':
            content = record.content_stream().read()
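When checking a WARC that is under discussion, a per-type tally of records can be a useful first look. The following is a minimal sketch, assuming only the warcio attributes the script above already uses (`rec_type` and `rec_headers.get_header`). The `summarize` and `summarize_warc` names are hypothetical and are not part of the gist.

```python
from collections import Counter


def summarize(records):
    """Count records by type and collect the target URIs of response records."""
    counts = Counter()
    response_uris = []
    for record in records:
        counts[record.rec_type] += 1
        if record.rec_type == 'response':
            response_uris.append(record.rec_headers.get_header('WARC-Target-URI'))
    return counts, response_uris


def summarize_warc(path):
    """Open a WARC file and summarize it (hypothetical helper, needs warcio)."""
    # Imported here so summarize() can be exercised without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, 'rb') as stream:
        return summarize(ArchiveIterator(stream))
```

Calling `summarize_warc()` with the path used above would return the counts and the list of response URIs for that file.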
@edsu
edsu / writer.py
Created July 26, 2023 16:01
A little example of writing files as resource records to a WARC file.
from warcio.warcwriter import WARCWriter

with open('test.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # write some metadata for the WARC as a warcinfo record
    rec = writer.create_warcinfo_record('test.warc.gz', {
        'software': 'warcio',
        'description': 'An example of packaging up two images in a WARC'
    })
@edsu
edsu / warc2mbox.py
Last active July 15, 2023 19:07
Convert Yahoo Groups WARC archive files to MBOX files: see https://archive.org/search?query=subject%3A%22yahoo+groups%22
#!/usr/bin/env python3
# run like this:
#
# $ python3 warc2mbox.py yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz
#
# and it will generate an mbox file for each Yahoo Group:
#
# $ ls -l mboxes
# -rw-r--r-- 1 edsummers staff 12522488 Jul 15 14:14 amicigranata.mbox
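The preview does not show how messages are pulled out of the WARC, so the sketch below covers only the last step: appending a message to one mbox file per group. It assumes the standard-library `mailbox` module is a reasonable fit. The `append_to_mbox` and `example_message` names, and the placeholder message, are hypothetical and not taken from the gist.

```python
import mailbox
from email.message import EmailMessage
from pathlib import Path


def append_to_mbox(mbox_dir, group, message):
    """Append one email message to <mbox_dir>/<group>.mbox, creating it if needed."""
    Path(mbox_dir).mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(Path(mbox_dir) / f"{group}.mbox"))
    try:
        box.add(message)
        box.flush()
    finally:
        box.close()


def example_message():
    """Build a placeholder message (hypothetical data, for illustration only)."""
    msg = EmailMessage()
    msg["From"] = "someone@example.com"
    msg["Subject"] = "hello"
    msg.set_content("A placeholder message body.")
    return msg
```

Calling `append_to_mbox("mboxes", "amicigranata", example_message())` would create or extend `mboxes/amicigranata.mbox`.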
@edsu
edsu / swap_check.py
Last active July 13, 2023 16:14
Reads a text file of URLs and writes out a CSV report of whether the URL is in swap.stanford.edu
#!/usr/bin/env python3
import csv
import sys
import json
import time
import requests

def get_snapshots(url):
    url = f"https://swap.stanford.edu/was/cdx?url={url}&output=json"
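One detail worth noting about the query URL: if the URL being checked has its own query string, interpolating it directly can confuse the CDX parameters. A minimal sketch of one way to handle that, plus the CSV reporting step, is shown here. The endpoint comes from the script; the `cdx_query_url` and `write_report` names and the CSV column names are assumptions, since the preview does not show the report layout or the response parsing.

```python
import csv
from urllib.parse import urlencode

CDX_ENDPOINT = "https://swap.stanford.edu/was/cdx"


def cdx_query_url(url):
    """Build the CDX query URL, percent-encoding the URL being looked up."""
    return f"{CDX_ENDPOINT}?{urlencode({'url': url, 'output': 'json'})}"


def write_report(results, out):
    """Write (url, snapshot_count) pairs as CSV; column names are hypothetical."""
    writer = csv.DictWriter(out, fieldnames=["url", "archived", "snapshots"])
    writer.writeheader()
    for url, count in results:
        writer.writerow({"url": url, "archived": count > 0, "snapshots": count})
```

The snapshot counts would come from whatever the CDX response contains; that parsing is left out because the response format is not shown in the preview.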
#!/usr/bin/env python

for n in range(0, 1002):
    with open("files/file-{:04n}.txt".format(n), "w") as fh:
        fh.write(str(n))
@edsu
edsu / add_to_pinboard.py
Last active June 4, 2023 09:45
A test of the Pinboard API
#!/usr/bin/env python3
import os
import dotenv
import requests
#
# You'll need to put your Pinboard API token in a .env file in the same directory as this program.
#
# PINBOARD_KEY=abc:123
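
The preview stops before the request is made. As a rough sketch, the token from the environment could be combined with a bookmark URL and title to form a call to Pinboard's `posts/add` endpoint. The `pinboard_add_url` name is hypothetical, and the choice of parameters here is an assumption about what the test sends, not a copy of the gist.

```python
import os
from urllib.parse import urlencode

PINBOARD_ADD = "https://api.pinboard.in/v1/posts/add"


def pinboard_add_url(url, title, token=None):
    """Build a posts/add request URL; falls back to PINBOARD_KEY in the environment."""
    token = token or os.environ.get("PINBOARD_KEY")
    if not token:
        raise ValueError("PINBOARD_KEY is not set")
    params = {
        "url": url,
        "description": title,  # Pinboard uses "description" for the bookmark title
        "auth_token": token,
        "format": "json",
    }
    return f"{PINBOARD_ADD}?{urlencode(params)}"
```

The resulting URL could then be fetched with `requests.get()`, which the script already imports.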
import csv
import time
from ipwhois import IPWhois

output = csv.DictWriter(open("blocks.csv", "w"), ["ip", "affected", "name", "country", "description"])
output.writeheader()

for line in open("blocks.txt"):
    line = line.strip()
@edsu
edsu / template-test.html
Created May 22, 2023 19:21
Small example of a template
<html>
<body>
<div id="root"></div>
<template id="my-template">
  <div>
    <input type="text" />
    <button>Remove</button>
  </div>
</template>
@base <https://id.loc.gov/resources/instances/7977519.nt> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://id.loc.gov/resources/instances/7977519>
  <http://id.loc.gov/ontologies/bibframe/adminMetadata> [
    <http://id.loc.gov/ontologies/bflc/encodingLevel> <http://id.loc.gov/vocabulary/menclvl/1> ;
    <http://id.loc.gov/ontologies/bibframe/assigner> <http://id.loc.gov/vocabulary/organizations/dlc> ;
    <http://id.loc.gov/ontologies/bibframe/changeDate> "2010-06-14T09:46:36"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    <http://id.loc.gov/ontologies/bibframe/creationDate> "1973-05-11"^^<http://www.w3.org/2001/XMLSchema#date> ;
    <http://id.loc.gov/ontologies/bibframe/descriptionAuthentication> <http://id.loc.gov/vocabulary/marcauthen/premarc> ;