Downloading and Syncing Archive.org Collections

Following are instructions on how to use the Internet Archive command-line tool, "ia", to download a collection from Archive.org and keep it synced. The only requirements are Python 2 and a Unix-like operating system (e.g. Mac OS X or Linux).

Downloading and Configuring the ia Command-Line Tool

  1. Download the latest binary of the ia command-line tool; an equivalent download-and-sync workflow using the internetarchive Python library is sketched below.
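
A minimal sketch of that download-and-sync workflow using the internetarchive Python library (the same library behind ia). The collection identifier 'example_collection' is a placeholder, and the sketch assumes the library is installed (pip install internetarchive) and configured with your Archive.org credentials (for example via ia configure).

#!/usr/bin/env python
"""Sketch: download an Archive.org collection and keep it synced.

'example_collection' is a placeholder; replace it with a real collection
identifier.
"""
from internetarchive import download, search_items

COLLECTION = 'example_collection'

# Find every item in the collection and download it. checksum=True skips
# files that already exist locally and are unchanged, so re-running this
# script only fetches new or modified files.
for result in search_items('collection:{0}'.format(COLLECTION)):
    download(result['identifier'], verbose=True, checksum=True)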

jjjake / convert_xls_to_utf8_csv.py
This script converts a Microsoft Excel spreadsheet to a UTF-8 CSV file.
#!/usr/bin/env python
"""Convert a Microsoft Excel spreadsheet to a UTF-8 csv.
Usage:
# Make sure requirements are installed.
$ sudo pip install xlrd backports.csv
# Run script.
$ python convert_xls_to_utf8_csv.py <spreadsheet>
"""
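
The body of the script is truncated above. Below is a minimal sketch of how such a conversion is typically done with xlrd and backports.csv; it is an illustrative reconstruction (the function name xls_to_utf8_csv is an assumption, not necessarily the original script's), written for Python 2 to match the backports.csv requirement.

import io
import sys

import xlrd
from backports import csv


def xls_to_utf8_csv(xls_path, csv_path):
    """Write the first worksheet of xls_path to csv_path as UTF-8 CSV."""
    book = xlrd.open_workbook(xls_path)
    sheet = book.sheet_by_index(0)
    with io.open(csv_path, 'w', encoding='utf-8', newline='') as fh:
        writer = csv.writer(fh)
        for i in range(sheet.nrows):
            # Cell values may be floats, dates, etc.; coerce everything to text.
            writer.writerow([unicode(cell.value) for cell in sheet.row(i)])


if __name__ == '__main__':
    spreadsheet = sys.argv[1]
    xls_to_utf8_csv(spreadsheet, spreadsheet.rsplit('.', 1)[0] + '.csv')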

GENERAL TODO:

  • The examples are all over the place. They need to be more consistent.
  • Check that x-archive-queue-derive header. I just skimmed it and it doesn't seem right.
  • Investigate getting an "ias3support@archive.org" address for support requests
  • Some of the standard metadata fields are repeatable, some are not. State this in the descriptions.
  • Excellent Hank idea: Quick Start (TL;DR) section to avoid all the gory details
  • Dang, but this damn thing is hard to read. Will that get better when it gets converted to the PHP wrapper? I have my doubts. May need some quick George love to give tips for better readability.
  • All the other 'foo' (read: green) bits below
#!/usr/bin/env python
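# Fragment: pysrt parses SubRip (.srt) subtitle files, and
# internetarchive.get_item fetches Archive.org item objects, so this script
# appears to work with subtitle/caption files attached to Archive.org items.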
import sys
import os
import datetime
import time
import pysrt
from internetarchive import get_item

jjjake / audit_gb_shipment.py
Audit GB Shipment
import json
def get_gb_counts(tsv):
    counts = dict()
    for line in open(tsv):
        barcode = line.split('\t')[0].lower()
        # Skip header row.
        if barcode == 'barcode':
            continue
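        # Hypothetical continuation (the original preview ends above): tally
        # how many times each barcode appears in the shipment manifest.
        counts[barcode] = counts.get(barcode, 0) + 1
    return counts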

jjjake / iamine.go
iamine in golang
package main
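// Fragment of a Go take on ia-mine (concurrent retrieval of Archive.org
// metadata); github.com/sethgrid/pester provides an HTTP client with
// retries and backoff.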
import (
    "sync"
    "bytes"
    "time"
    "bufio"
    "fmt"
    "github.com/sethgrid/pester"
    "os"

#!/usr/bin/env python
"""Parse audfprint .out files.
example input:
Fri Jan 8 00:07:47 2016 Reading hash table /1/2015/db-dem3/dem3-debate-aa.db
NOMATCH precomp/1/2015/mp3s/ALJAZAM_20151219_000000_News.afpt 3659.9 sec 299066 raw hashes
Matched 2.9 s starting at 35.1 s in precomp/1/2015/mp3s/ALJAZAM_20151220_040000_Weekend_News.afpt to time 0.8 s in /1/2015/dem3-mp4/2015-12-19-D-Debate-0050.mp4 with 76 of 1264 common hashes at rank 5
"""
#!/usr/bin/env python
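# Fragment: pymarc's MARCReader parses binary MARC records and record_to_xml
# serializes them to MARC XML; write_marc_xml appears to save the XML for a
# given record id to '<id>_marc.xml'.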
from sys import argv
from lxml import etree
from pymarc import MARCReader, record_to_xml
def write_marc_xml(id, marc_xml):
    f = '{id}_marc.xml'.format(id=id)

try:
    from gevent import monkey, queue, spawn
    monkey.patch_all(thread=False)
except ImportError:
    raise ImportError(
        """No module named gevent
This feature requires the gevent networking library. gevent
and all of its dependencies can be installed with pip:
\tpip install cython git+git://github.com/surfly/gevent.git@1.0rc2#egg=gevent
""")