jjjake/bits-in-bits-out.rst

## bits-in-bits-out.rst

      
    The Internet Archive

Bits in Bits Out

The users and contributors of the Internet Archive are what make Archive.org what it is today. Without contributions from our users we would have nothing, and without users accessing our digital materials it would mean nothing.
This document gives a brief overview of how to get data into, and out of, Archive.org.

Table of Contents:

- The Internet Archive Command-line interface

  - Configuring and Installation
  - Uploading
  - Downloading
  - Metadata

- Data Mining Archive.org


The Internet Archive Command-line interface

internetarchive (https://github.com/jjjake/internetarchive) is a Python library and command-line interface for getting data into and out of Archive.org.
Binaries for the CLI are available here: https://archive.org/details/ia-pex
It can also be installed with pip: pip install internetarchive

Configuring and Installation

To get started, simply download a binary and configure the ia command.

.. code:: bash

    $ curl -L https://archive.org/download/ia-pex/ia-0.8.2-py2.pex > ia
    $ chmod +x ia
    $ ./ia configure

You will be prompted to enter your Archive.org credentials. After doing so, a config file will be saved to your computer with everything you need to start uploading and modifying metadata via the ia command.

Uploading

There are about 15 million public items on Archive.org. Over 2 million of those items have been uploaded using the internetarchive library. Below is a brief overview of uploading using the CLI.
$ ./ia upload <identifier> <files>... --metadata=collection:test_collection --metadata='title:My Title'
See youtube2ia.sh for a more advanced example of how you might use ia upload in a bash script to mirror a YouTube channel to Archive.org.
You can also use a spreadsheet for uploading a batch of files. See metadata.csv for an example of the required format.
$ ./ia upload --spreadsheet=metadata.csv
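
To make the spreadsheet layout concrete, the sketch below writes a small file using column names from metadata.csv and lists which item each file would land in. It assumes that a row with an empty identifier column adds its file to the item named in the previous row, which is how the second row of metadata.csv reads. `list_uploads` is a hypothetical helper, not part of ia, and its naive comma splitting does not handle quoted fields.

```shell
# Sketch only: list_uploads is a hypothetical helper, not an ia subcommand.

# Write a small spreadsheet that follows the metadata.csv layout.
workdir="$(mktemp -d)"
cat > "$workdir/metadata.csv" <<'EOF'
identifier,file,collection,title
my-test-item-2015-06-30,file1.txt,test_collection,Test Item
,file2.txt
EOF

# Print "identifier<TAB>file" for every data row.
# Assumption: an empty identifier means "same item as the previous row".
# Naive comma splitting; quoted fields containing commas are not handled.
list_uploads() {
    awk -F',' 'NR > 1 { if ($1 != "") id = $1; print id "\t" $2 }' "$1"
}

list_uploads "$workdir/metadata.csv"
```

Running a check like this before the real upload is a cheap way to catch a misplaced row or a missing identifier.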

Downloading

Downloading files via the ia command is easy:
$ ./ia download nasa
And flexible:
$ ./ia download Sita_Sings_the_Blues --format="Ogg Vorbis" --destdir="$HOME/Downloads"
You can even glob for files to download:
$ ./ia download OTRR_X_Minus_One_Singles --glob="*mp3"
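
A quick way to preview which file names a pattern selects is to test it locally before downloading. The sketch below assumes that --glob follows shell-style wildcard rules when matching file names. `matches_glob` and the sample names are made up for illustration.

```shell
# Sketch only: matches_glob is a hypothetical helper, and the file names are invented.
# Assumption: --glob follows shell-style wildcard rules.

# Succeed when NAME ($1) matches the shell PATTERN ($2).
matches_glob() {
    # $2 is left unquoted on purpose so that case treats it as a pattern.
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

for name in episode01.mp3 episode01.ogg episode01_files.xml notes-mp3; do
    if matches_glob "$name" '*mp3'; then
        echo "match: $name"
    else
        echo "skip:  $name"
    fi
done
```

Because `*mp3` has no dot, it also matches names that merely end in "mp3"; `*.mp3` is the stricter pattern.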

Metadata

Modifying and retrieving metadata for items can be done with the ia metadata command.
Retrieving the metadata for an Archive.org item in JSON is as easy as:
$ ./ia metadata nasa
To modify the metadata for an item, you could use a command such as the following:
$ ./ia metadata iacli-test-item60 --modify='title:My New Title' --modify='foo:bar'

Data Mining Archive.org

iamine (https://github.com/jjjake/iamine) is a Python library and command-line tool for mining Archive.org metadata and search results.
Binaries for the CLI are available here: https://archive.org/details/iamine-pex
It can also be installed with pip: pip install iamine. iamine requires Python 3.
To get started, download a binary and configure the ia-mine command:
$ curl -L https://archive.org/download/iamine-pex/ia-mine-0.3.0-py3.pex > ia-mine
$ chmod +x ia-mine
$ ./ia-mine --configure
With ia-mine you can do things like...
Concurrently download an entire Archive.org collection using GNU Parallel:
$ ./ia-mine --search 'collection:freemusicarchive' --itemlist | parallel 'ia download {}'
ia-mine is especially powerful when used with a command-line JSON parsing tool, such as jq:
$ ./ia-mine --all --mine-ids | jq -r '.metadata.identifier as $id | .files | map("\($id)\t\(.name)\t\(.sha1)") | join("\n")'
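
To see what that jq filter produces, it can be run against a single hand-written record. The sketch below assumes each record emitted by ia-mine is a JSON object with a metadata.identifier field and a files array whose entries carry name and sha1, which is what the filter itself expects. The sample record and its values are invented, and jq must be installed.

```shell
# Sketch only: the record below is invented sample data, not real ia-mine output.
sample='{"metadata":{"identifier":"example-item"},"files":[{"name":"a.txt","sha1":"abc123"},{"name":"b.txt","sha1":"def456"}]}'

# Emit one "identifier<TAB>name<TAB>sha1" line per file for each record on stdin.
files_tsv() {
    jq -r '.metadata.identifier as $id | .files | map("\($id)\t\(.name)\t\(.sha1)") | join("\n")'
}

# Requires jq; skip the demonstration when it is not installed.
if command -v jq >/dev/null 2>&1; then
    printf '%s\n' "$sample" | files_tsv
fi
```

The filter binds the identifier once per record and repeats it on every file line, so the output is a flat tab-separated listing that tools such as sort or cut can consume.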
Find all of the EXE files that are on Archive.org:
$ ./ia-mine --search 'format:exe' --mine-ids | jq -r '.metadata.identifier as $i | .files | map(select(.format == "exe") | "https://archive.org/download/\($i)/\(.name)") | join("\n")'
Monitor progress and transfer rate with Pipe Viewer:
$ ./ia-mine --search 'collection:usenet' --mine-ids 2> errors.json | pv -acbrl > usenet-metadata.json


## metadata.csv

          
identifier,file,collection,title,subject[0],subject[1],subject[2],date,foo
my-test-item-2015-06-30,file1.txt,test_collection,Test Item,foo,bar,baz
,file2.txt
My-test-item2-2015-06-30,file3.txt,test_collection,Test Item 2,foo,bar
My-test-item3-2015-06-30,file4.txt,test_collection,Test Item 2,foo,baz,bar,2015-06-30
My-test-item4-2015-06-30,file4.txt,test_collection,Test Item 2,foo,,,2015-06-30,bar

## youtube2ia.sh
#!/bin/bash
# Upload youtube-dl downloads in the current directory to Archive.org.
# Each video is expected as <name>.mp4 alongside <name>.info.json.

function upload_video() {
    JSON_FILE="$1"
    BASE="${JSON_FILE%.info.json}"
    VIDEO_FILE="${BASE}.mp4"
    # Identifier: fixed prefix plus the first 80 characters of the file name.
    ID="digital-freedom-test-${BASE:0:80}"

    title="$(jq -r '.title' < "$JSON_FILE")"
    contributor="$(jq -r '.uploader' < "$JSON_FILE")"
    description="$(jq -r '.description' < "$JSON_FILE")"

    # upload_date is YYYYMMDD; reformat it as YYYY-MM-DD with bash substring
    # expansion, which works on both GNU and BSD systems.
    yt_date="$(jq -r '.upload_date' < "$JSON_FILE")"
    date="${yt_date:0:4}-${yt_date:4:2}-${yt_date:6:2}"

    ia upload --log --quiet "$ID" "$JSON_FILE" "$VIDEO_FILE" \
        -m "title:$title" \
        -m "date:$date" \
        -m "contributor:$contributor" \
        -m "description:$description" \
        -m "collection:test_collection"
}

# Download youtube videos (the .info.json files require --write-info-json).
#youtube-dl --restrict-filenames "${1}"

# Upload to Archive.org, one parallel job per info.json file.
export -f upload_video
find . -type f -name '*info.json' -exec basename {} \; | parallel upload_video "{}"