Ed Summers edsu

import csv
import sys
from itertools import batched
import pyarrow
from pyarrow.parquet import ParquetWriter
csv.field_size_limit(sys.maxsize)
def csv_to_parquet(csv_file, parquet_file, batch_size=10_000):
@edsu
edsu / .gitignore
Last active July 18, 2024 14:48
A sloppy prototype for moving browsertrix WACZs to AWS S3.
.env
import requests
author_id = 'https://openalex.org/A5067004024'
url = 'https://api.openalex.org/works'
params = {
    'filter': f'author.id:{author_id}',
    'cursor': '*'
}
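The `'cursor': '*'` parameter above starts OpenAlex cursor paging: each response carries a `meta.next_cursor` value to send with the next request, and it comes back empty once the results run out. A minimal sketch of that loop follows. The names `fetch_json` and `iter_works` are hypothetical, and it uses the standard library's `urllib` rather than `requests` so that it runs without extra installs; it is not necessarily how the rest of this gist continues.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url, params):
    """Fetch one page of JSON from url with the given query parameters."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}") as response:
        return json.load(response)


def iter_works(url, params, fetch=fetch_json):
    """Yield every result, following meta.next_cursor until it is empty."""
    params = dict(params)  # copy so the caller's dict is not modified
    while params.get("cursor"):
        page = fetch(url, params)
        yield from page.get("results", [])
        # next_cursor is null on the last page, which ends the loop
        params["cursor"] = page.get("meta", {}).get("next_cursor")
```

Passing `fetch` as an argument is only there so the loop can be exercised with canned pages instead of live API calls; `for work in iter_works(url, params): ...` would use the real API.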
#!/usr/bin/env python3
"""
Run this program with an institution name and see the institutions and the count
of publications in OpenAlex.
$ ./openalex_counts "stanford"
Stanford University (I97018004): 430550
Stanford Medicine (I4210137306): 32576
edsu / mix.sh
Last active May 27, 2024 13:44
Concat two mp4 files with different resolutions.
# concatenate two videos with different resolution
ffmpeg -i part1.mp4 -i part2.mp4 -filter_complex "[0]scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2,setsar=1[v0];[1]scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2,setsar=1[v1];[v0][0:a:0][v1][1:a:0]concat=n=2:v=1:a=1[v][a]" -map "[v]" -map "[a]" out.mp4
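The filter graph in that command repeats the same scale, pad and `setsar` chain for each input before handing the labelled streams to `concat`. A small sketch that builds the same string for any number of inputs is shown here; `concat_filter` and `concat_command` are hypothetical helper names, and like the original command it assumes each input has at least one audio stream.

```python
import shlex


def concat_filter(n_inputs, width=1280, height=720):
    """Build a filter_complex string that scales and pads every input to
    width x height, then concatenates the video and first audio streams."""
    scale_pad = (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,setsar=1"
    )
    # One chain per input: [0]...[v0];[1]...[v1];...
    chains = [f"[{i}]{scale_pad}[v{i}]" for i in range(n_inputs)]
    # concat wants the streams interleaved: [v0][0:a:0][v1][1:a:0]...
    pairs = "".join(f"[v{i}][{i}:a:0]" for i in range(n_inputs))
    return ";".join(chains) + f";{pairs}concat=n={n_inputs}:v=1:a=1[v][a]"


def concat_command(inputs, output, width=1280, height=720):
    """Return the ffmpeg argument list for concatenating the given inputs."""
    cmd = ["ffmpeg"]
    for path in inputs:
        cmd += ["-i", path]
    cmd += ["-filter_complex", concat_filter(len(inputs), width, height)]
    cmd += ["-map", "[v]", "-map", "[a]", output]
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line rather than running ffmpeg.
    print(shlex.join(concat_command(["part1.mp4", "part2.mp4"], "out.mp4")))
```

With two inputs this produces the same filter string as the command above; the list from `concat_command` can be passed to `subprocess.run` if ffmpeg is installed.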
edsu / bagit.sh
Last active April 18, 2024 19:57
Remembering the original spirit of BagIt. https://twitter.com/justin_littman/status/778561421428793344
#!/bin/bash
#
# The simplest way to create a valid BagIt bag?
#
# Usage: bagit.sh <dir_to_bag> <bag_dir>
#
# Note: you'll need to have md5deep installed:
# brew install md5deep
# apt-get install md5deep
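For comparison, roughly the same idea can be sketched in Python with only the standard library: copy the directory into `data/`, write the `bagit.txt` declaration, and write an MD5 payload manifest. The function name `make_bag` and the declared version (0.97) are assumptions for illustration; this is not a translation of the shell script, whose body is not shown here.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(src_dir, bag_dir):
    """Copy src_dir into bag_dir/data and write the two required tag files."""
    src, bag = Path(src_dir), Path(bag_dir)
    data = bag / "data"
    # copytree creates bag_dir and bag_dir/data; it raises if data/ exists
    shutil.copytree(src, data)

    # Bag declaration (the version number here is an assumption)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<md5>  data/<relative path>" line per file
    lines = []
    for path in sorted(p for p in data.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

Reading each file fully into memory keeps the sketch short; for large payloads the digest would be better computed in chunks.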
edsu / en.wav
Last active March 29, 2024 12:22
This seems to cause whisper to segfault on my MacBook Pro 2.4 GHz 8-Core Intel Core i9, Sonoma 14.4.1, Python 3.12.0
edsu / response.json
Last active March 20, 2024 16:54
Looking at the HTTP request that happens when you click on a citation link in a PDF when using Google Scholar's PDF extension for Chrome. You will need to be logged into Google to see the response, which comes back with the wrong Content-Type: https://scholar.google.com/scholar?oi=gsr-r&q=Ben-David%20A%20and%20Amram%20A%20(2018)%20The%20internet…
{
"l": "1",
"p": "https://lh3.googleusercontent.com/-XdUIqdMkCWA/AAAAAAAAAAI/AAAAAAAAAAA/4252rscbv5M/s64-c-mo/photo.jpg",
"r": [
{
"t": "The Internet Archive and the socio-technical construction of historical facts",
"u": "https://scholar.google.com/scholar_url?url=https://www.tandfonline.com/doi/abs/10.1080/24701475.2018.1455412&hl=en&sa=T&oi=gsr-r&ct=res&cd=0&d=3272375975175528132&ei=YBH7ZeXNA4Cb6rQPmrOdoA8&scisig=AFWwaeb_dRhXurIfWX0NXA2y4G9I",
"x": "",
"m": "A Ben-David, A Amram - Internet Histories, 2018",
"s": "This article analyses the socio-technical epistemic processes behind the construction of historical facts by the Internet Archive Wayback Machine (IAWM). Grounded in theoretical debates in Science and Technology Studies about digital and algorithmic platforms as “black boxes”, this article uses provenance information and other data traces provided by the IAWM to uncover specific epistemic processes embedded at its back-end, through a case study on the archiv
filename count
data.zip 22397
data_EPSG_4326.zip 22397
preview.jpg 22397
index_map.json 147
Beechey_WGS.tif.xml 1
Beechey_WGS-iso19139.xml 1
Beechey_WGS-fgdc.xml 1
bathy20.txt 1
edsu / wacz-images.py
Last active February 19, 2024 03:08
#!/usr/bin/env python3
#
# usage: wacz-images.py <wacz_file>
#
# This program will extract images from the WARC files contained in a WACZ
# file and write them to the current working directory using the image's URL
# as a file location.
#
# You will need to `pip install warcio` for it to work.
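The approach described in that header can be sketched as follows, since a WACZ file is a ZIP archive that holds WARC files. This is a guess at the shape of the program, not its actual body: `url_to_path` and `extract_images` are hypothetical names, the URL-to-path mapping is one possible scheme, and `warcio` is imported inside the function so the rest of the module loads without it.

```python
import zipfile
from pathlib import Path
from urllib.parse import urlparse


def url_to_path(url):
    """Map an image URL to a relative file path (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Drop empty, "." and ".." segments so paths stay under the output dir
    parts = [p for p in parsed.path.split("/") if p not in ("", ".", "..")]
    if not parts or parsed.path.endswith("/"):
        parts.append("index")
    return Path(parsed.netloc, *parts)


def extract_images(wacz_file, out_dir="."):
    """Write every image response found in the WACZ's WARC files to out_dir."""
    # Third-party dependency: pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    written = []
    with zipfile.ZipFile(wacz_file) as wacz:
        for name in wacz.namelist():
            if not name.endswith((".warc", ".warc.gz")):
                continue
            with wacz.open(name) as stream:
                for record in ArchiveIterator(stream):
                    if record.rec_type != "response" or record.http_headers is None:
                        continue
                    content_type = record.http_headers.get_header("Content-Type") or ""
                    if not content_type.startswith("image/"):
                        continue
                    url = record.rec_headers.get_header("WARC-Target-URI")
                    target = Path(out_dir) / url_to_path(url)
                    target.parent.mkdir(parents=True, exist_ok=True)
                    target.write_bytes(record.content_stream().read())
                    written.append(target)
    return written
```

Calling `extract_images("crawl.wacz")` (a placeholder filename) would write files such as `example.com/img/a.png` under the current working directory.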