@jjjake
Last active August 29, 2015 14:24
The Internet Archive

Bits in Bits Out

The users and contributors of the Internet Archive are what make Archive.org what it is today. Without contributions from our users we would have nothing, and without users accessing our digital materials it would mean nothing.

This document gives a brief overview of how to get data into, and out of, Archive.org.

The Internet Archive Command-line interface

https://github.com/jjjake/internetarchive is a Python library and command-line interface to Archive.org. It is a tool for getting data into and out of the Internet Archive.

Binaries for the CLI are available here: https://archive.org/details/ia-pex

It can also be installed via pip install internetarchive.

Installation and Configuration

To get started, simply download a binary and configure the ia command.

$ curl -L https://archive.org/download/ia-pex/ia-0.8.2-py2.pex > ia
$ chmod +x ia
$ ./ia configure

You will be prompted to enter your Archive.org credentials. After doing so, a config file will be saved to your computer with everything you need to start uploading and modifying metadata via the ia command.

Uploading

There are about 15 million public items on Archive.org. Over 2 million of those items have been uploaded using the internetarchive library. Below is a brief overview of uploading using the CLI.

$ ./ia upload <identifier> <files>... --metadata=collection:test_collection --metadata='title:My Title'

See youtube2ia.sh for a more advanced example of how you might use ia upload in a bash script to mirror a YouTube channel to Archive.org.

You can also use a spreadsheet for uploading a batch of files. See metadata.csv for an example of the required format.

$ ./ia upload --spreadsheet=metadata.csv
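
Before handing a spreadsheet to ia upload, it can help to check its header row. The following is a minimal sketch, assuming that identifier and file are the columns every spreadsheet needs (as in the metadata.csv example). The function name check_spreadsheet_header is hypothetical and not part of the ia tool, and the plain comma matching does not handle quoted header fields.

```bash
#!/bin/bash
# Hypothetical helper (not part of ia): verify that a CSV header row
# contains the columns used in the metadata.csv example.
check_spreadsheet_header() {
    local csv_file="$1" header="" column
    # Read only the first line; accept a file without a trailing newline.
    IFS= read -r header < "$csv_file" || [ -n "$header" ] || return 1
    # Strip a trailing carriage return left by Windows-style line endings.
    header="${header%$'\r'}"
    for column in identifier file; do
        # Wrap both sides in commas so "file" does not match "filename".
        case ",$header," in
            *",$column,"*) ;;
            *) echo "missing column: $column" >&2; return 1 ;;
        esac
    done
}
```

For example, check_spreadsheet_header metadata.csv && ./ia upload --spreadsheet=metadata.csv runs the upload only when both columns are present.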

Downloading

Downloading files via the ia command is easy:

$ ./ia download nasa

And flexible:

$ ./ia download Sita_Sings_the_Blues --format="Ogg Vorbis" --destdir="$HOME/Downloads"

You can even glob for files to download:

$ ./ia download OTRR_X_Minus_One_Singles --glob="*mp3"
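
To preview which file names a pattern such as *mp3 would select, the matching can be approximated locally. This is only a sketch: it assumes --glob follows ordinary shell-style wildcard rules, which this document does not spell out, and filter_by_glob is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the names read from stdin that match a
# shell-style glob pattern given as the first argument.
filter_by_glob() {
    local pattern="$1" name
    while IFS= read -r name; do
        # An unquoted $pattern in a case arm is interpreted as a glob.
        case "$name" in
            $pattern) printf '%s\n' "$name" ;;
        esac
    done
}
```

Piping a list of file names through filter_by_glob '*mp3' keeps only the names that end in mp3.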

Metadata

Modifying and retrieving metadata for items can be done with the ia metadata command.

Retrieving the metadata for an Archive.org item in JSON is as easy as:

$ ./ia metadata nasa

To modify the metadata for an item, you could use a command such as the following:

$ ./ia metadata iacli-test-item60 --modify='title:My New Title' --modify='foo:bar'

Data Mining Archive.org

https://github.com/jjjake/iamine is a Python library and command-line tool for mining Archive.org metadata and search results.

Binaries for the CLI are available here: https://archive.org/details/iamine-pex

It can also be installed via pip install iamine. iamine requires Python 3.

Alternatively, download the binary and configure it:

$ curl -L https://archive.org/download/iamine-pex/ia-mine-0.3.0-py3.pex > ia-mine
$ chmod +x ia-mine
$ ./ia-mine --configure

With ia-mine you can do things like...

Concurrently download an entire Archive.org collection using GNU Parallel:

$ ./ia-mine --search 'collection:freemusicarchive' --itemlist | parallel './ia download {}'
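
Where GNU Parallel is not installed, xargs -P (available in GNU and BSD xargs, though not required by POSIX) offers a similar fan-out. The following is a minimal sketch, assuming --itemlist prints one identifier per line. The names download_items, IA_CMD, and JOBS are hypothetical and exist only for this illustration.

```bash
#!/bin/bash
# Hypothetical wrapper: run "<ia> download <identifier>" for every
# identifier read from stdin, several at a time.
download_items() {
    local ia_cmd="${IA_CMD:-./ia}" jobs="${JOBS:-4}"
    # -n 1 passes one identifier per invocation;
    # -P runs up to $jobs invocations concurrently.
    xargs -n 1 -P "$jobs" "$ia_cmd" download
}
```

It would be used as ./ia-mine --search 'collection:freemusicarchive' --itemlist | download_items. Setting IA_CMD=echo prints the commands instead of running them, which is a cheap way to check the item list first.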

ia-mine is especially powerful when used with a command-line JSON parsing tool, such as jq:

$ ./ia-mine --all --mine-ids | jq -r '.metadata.identifier as $id | .files | map("\($id)\t\(.name)\t\(.sha1)") | join("\n")'

Find all of the EXE files that are on Archive.org:

$ ./ia-mine --search 'format:exe' --mine-ids | jq -r '.metadata.identifier as $i | .files | map(select(.format == "exe") | "https://archive.org/download/\($i)/\(.name)") | join("\n")'

Monitor progress and transfer rate with Pipe Viewer:

$ ./ia-mine --search 'collection:usenet' --mine-ids 2> errors.json | pv -acbrl > usenet-metadata.json

metadata.csv

identifier,file,collection,title,subject[0],subject[1],subject[2],date,foo
my-test-item-2015-06-30,file1.txt,test_collection,Test Item,foo,bar,baz,,
,file2.txt,,,,,,,
My-test-item2-2015-06-30,file3.txt,test_collection,Test Item 2,foo,bar,,,
My-test-item3-2015-06-30,file4.txt,test_collection,Test Item 2,foo,baz,bar,2015-06-30,
My-test-item4-2015-06-30,file4.txt,test_collection,Test Item 2,foo,,,2015-06-30,bar

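youtube2ia.sh

#!/bin/bash
# Upload a video and its youtube-dl metadata file (.info.json) to Archive.org.
function upload_video() {
    JSON_FILE="$1"
    # The video is assumed to sit next to the JSON file with an .mp4 extension.
    VIDEO_FILE="$(echo "$JSON_FILE" | sed 's/info\.json$/mp4/')"
    # Build the identifier from the first 80 characters of the file name.
    ID="digital-freedom-test-$(echo "$JSON_FILE" | sed 's/\.info\.json$//' | cut -c 1-80)"
    title="$(jq -r '.title' < "$JSON_FILE")"
    contributor="$(jq -r '.uploader' < "$JSON_FILE")"
    description="$(jq -r '.description' < "$JSON_FILE")"
    yt_date="$(jq -r '.upload_date' < "$JSON_FILE")"
    # Convert YYYYMMDD to YYYY-MM-DD. Note: -jf is BSD/macOS date syntax.
    date="$(date -jf '%Y%m%d' "$yt_date" '+%Y-%m-%d')"
    ia upload --log --quiet "$ID" "$JSON_FILE" "$VIDEO_FILE" -m "title:$title" \
        -m "date:$date" \
        -m "contributor:$contributor" \
        -m "description:$description" \
        -m "collection:test_collection"
}

# Download YouTube videos.
#youtube-dl --restrict-filenames "${1}"

# Upload to Archive.org.
# Export the function so the shells spawned by GNU Parallel can call it.
export -f upload_video
find . -type f -name '*info.json' -exec basename {} \; | parallel upload_video "{}"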
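
The date -jf call in the script uses BSD/macOS syntax and fails with GNU date on Linux. Because the script treats upload_date as a fixed-width YYYYMMDD string, one portable alternative is plain substring expansion in bash. The following is a sketch, with to_iso_date as a hypothetical helper name. It only checks that the input is eight digits, not that it is a valid calendar date.

```bash
#!/bin/bash
# Hypothetical helper: convert YYYYMMDD to YYYY-MM-DD without calling date(1).
to_iso_date() {
    local raw="$1"
    # Reject anything that is not exactly eight digits.
    case "$raw" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unexpected date: $raw" >&2; return 1 ;;
    esac
    # Slice out year, month, and day and join them with hyphens.
    printf '%s-%s-%s\n' "${raw:0:4}" "${raw:4:2}" "${raw:6:2}"
}
```

Inside upload_video, the conversion line could then read date="$(to_iso_date "$yt_date")".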