Ed Summers edsu

@edsu
edsu / extract_images.py
Last active Apr 10, 2021
Extract images from a WARC file. usage: extract_images.py <warc_file>
View extract_images.py
#!/usr/bin/env python3
import sys
import pathlib
from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator
def save(url, stream):
    uri = urlparse(url)
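The preview above stops right after the URL is parsed, so the rest of `save` is not visible here. As a rough sketch only, and not the author's code, one way to turn a record's URL into a local file path with `urlparse` and `pathlib` might look like the following. The function name `local_path`, the `images` root directory, and the `index` fallback are all assumptions made for illustration.

```python
import pathlib
from urllib.parse import urlparse


def local_path(url, root="images"):
    """Map a URL to a path under root (hypothetical layout: root/host/path)."""
    uri = urlparse(url)
    # Drop the leading slash so the URL path is joined under root
    # instead of being treated as an absolute path.
    rel = uri.path.lstrip("/") or "index"
    return pathlib.Path(root) / uri.netloc / rel
```

A caller would typically create `local_path(url).parent` with `mkdir(parents=True, exist_ok=True)` before writing the record's content stream to that path.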
View journal
#!/bin/zsh
# journal is a little command to edit my markdown journal with vim. By default
# it will open the journal for today. Optionally supply a date (e.g. 2021-01-01)
# to edit an older entry.
journal_dir="/home/ed/Dropbox/Journal"
if [ "$1" ];
then
View searching-for-trust.md

From: James Lowry James.Lowry@qc.cuny.edu
To: James Lowry James.Lowry@qc.cuny.edu
Subject: Reminder - New Works - Victoria Lemieux reading this Wednesday
Date: Mon, 22 Mar 2021 15:47:12 +0000 (03/22/2021 11:47:12 AM)

The next reading in the New Works series arranged by the City University of New York’s Archival Technologies Lab will be:

Wednesday, March 24, 2021 11am ET, via Zoom

View MD_COVID19_TotalVaccinationsCountyFirstandSecondDose.csv
OBJECTID,VACCINATION_DATE,County,DailyFirstDose,CumulativeFirstDose,DailySecondDose,CumulativeSecondDose
1,1930/11/13 15:00:00+00,Kent,1,1,0,0
2,1966/02/12 15:00:00+00,Anne Arundel,1,1,0,0
3,1966/02/15 15:00:00+00,Anne Arundel,1,2,0,0
4,1972/10/13 15:00:00+00,Allegany,1,1,0,0
5,1972/12/16 15:00:00+00,Baltimore,1,1,0,0
6,2012/02/03 15:00:00+00,Baltimore,1,2,0,0
7,2020/01/01 15:00:00+00,Unknown,2,2,0,0
8,2020/01/01 15:00:00+00,Worcester,1,1,0,0
9,2020/01/03 15:00:00+00,Montgomery,1,1,0,0
View MD_COVID19_TotalVaccinationsCountyFirstandSecondDose.csv
OBJECTID,VACCINATION_DATE,County,DailyFirstDose,CumulativeFirstDose,DailySecondDose,CumulativeSecondDose
1,1930/11/13 15:00:00+00,Kent,1,1,0,0
2,1966/02/12 15:00:00+00,Anne Arundel,1,1,0,0
3,1966/02/15 15:00:00+00,Anne Arundel,1,2,0,0
4,2019/12/26 15:00:00+00,Frederick,1,1,0,0
5,2020/01/01 15:00:00+00,Howard,1,1,0,0
6,2020/01/01 15:00:00+00,Unknown,2,2,0,0
7,2020/01/01 15:00:00+00,Worcester,2,2,0,0
8,2020/01/03 15:00:00+00,Montgomery,1,1,0,0
9,2020/01/05 15:00:00+00,Allegany,1,1,0,0
View titles.py
#!/usr/bin/env python3
import requests
for e in requests.get('https://unlocking.netlify.app/data/episodes.json').json():
    title = e.get('title')
    if title is None:
        print('https://unlocking.netlify.app/episode/' + e.get('aapbId', ''))
View tags.csv
plantandosementes 67
quemmandoumatarmarielle 60
8m2021 59
mariellefranco 56
g1 46
mariellepresente 40
12marzo 35
brasile 35
8mars 35
whoorderedtheassassinationof 33
View pinboard-ask.txt
Hi there,
My name is Maciej, I run the bookmarking site Pinboard, and I’m writing to ask
for your help.
You joined the site back when there was a one-time signup fee. Back then,
charging for bookmarking online was unheard of, and the fee was more of an
anti-spam measure than a revenue model.
In 2015, I changed Pinboard over to a subscription site, where even “basic”
edsu / CC-MAIN-2021-04-hosts-sizes-top100.csv
Last active Feb 7, 2021
The top 100 hosts by WARC record sizes (bytes) in commoncrawl CC-MAIN-2021-04.
View CC-MAIN-2021-04-hosts-sizes-top100.csv
url_host_name length
d2y1pz2y630308.cloudfront.net 35553494127
photos.google.com 22829413806
www.download.p4c.philips.com 19523224128
quod.lib.umich.edu 18400799789
s3.amazonaws.com 17043193709
support.google.com 16945185389
www.wmagazine.com 15723224197
api.whatsapp.com 15241728762
www.thecut.com 15017634948
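The gist does not show how this table was produced. As an illustration only, a per-host total like the `length` column could be computed by summing record sizes grouped by host name. The function name `top_hosts_by_size` and the `(host, length)` input shape are assumptions, not taken from the gist or from any Common Crawl tooling.

```python
from collections import Counter


def top_hosts_by_size(records, n=100):
    """Sum record lengths per host and return the n largest totals.

    records is assumed to be an iterable of (host, length_in_bytes) pairs.
    """
    totals = Counter()
    for host, length in records:
        totals[host] += length
    # most_common sorts by total size, largest first.
    return totals.most_common(n)
```

For example, `top_hosts_by_size([("a.example", 10), ("b.example", 5), ("a.example", 7)], n=1)` gives `[("a.example", 17)]`.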
edsu / CC-MAIN-2021-04-host-counts-top100.csv
Last active Feb 7, 2021
The top 100 host counts in commoncrawl CC-MAIN-2021-04
View CC-MAIN-2021-04-host-counts-top100.csv
url_host_name total
getpocket.com 1640422
auth.webnode.com 1056353
telegram.me 543797
plus.google.com 472899
www.ncbi.nlm.nih.gov 433041
api.whatsapp.com 394818
web.skype.com 338835
www.amazon.com 296540
dx.doi.org 290895