@Asparagirl
Last active February 14, 2024 19:56
Have a WARC that you would like to upload to the Internet Archive so that it can eventually be included in their Wayback Machine? Here's how to upload it from the command line.

Do you have a WARC file of a website all downloaded and ready to be added to the Internet Archive? Great! You can do that with the Internet Archive's web-based uploader, but it's not ideal and it can't handle really big uploads. Here's how you can upload your WARC files to the IA from the command line, and without worrying about a size restriction.

First, you need to get your Access Key and Secret Key for the Internet Archive's S3-like API. Get them for your IA account at http://archive.org/account/s3.php and don't share them with other people!

Here's their documentation file about how to use it, if you need some extra help: http://archive.org/help/abouts3.txt

Next, copy the following lines into a text file, edit them as needed, and then paste them into your command line:

export IA_S3_ACCESS_KEY="YOUR-ACCESS-KEY-FROM-THE-IA-GOES-HERE"
export IA_S3_SECRET_KEY="YOUR-SECRET-KEY-FROM-THE-IA-GOES-HERE"
export ITEM_TITLE="Your item title goes here"
export ITEM_DESCRIPTION="Your item description goes here.  You can add a link in your description like this: <a href=http://www.example.com/ >http://www.example.com/</a>.  Note how there is a space before the closing angle bracket of the opening a tag in that link; don't remove that, it's needed. "
export ITEM_KEYWORDS="keyword;another keyword;yet another keyword;separated by semicolons awww yeaaah"
export FILE_LOCAL_PATH="/home/archiveteam/no/trailing/slash/on/this"
export FILE_LOCAL_NAME="example.com_YYYY-MM-DD_some-descriptive-text.warc.gz"
export FILE_LOCAL_FULL="$FILE_LOCAL_PATH/$FILE_LOCAL_NAME"
export FILE_IA_DIRECTORY="example.com_YYYY-MM-DD_some-more-descriptive-text-if-you-want-it"
export FILE_IA="http://s3.us.archive.org/$FILE_IA_DIRECTORY/$FILE_LOCAL_NAME"
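
Before uploading, it can help to confirm that the variables above are actually set and that the WARC file exists. The following is a minimal sketch, not part of the original instructions; `check_upload_vars` is a hypothetical helper name, and the list of variables it checks is an assumption about which ones you consider required.

```shell
# Hypothetical pre-flight check (an assumption, not an IA tool):
# verify the required variables are non-empty and the WARC file exists.
check_upload_vars() {
  for _name in IA_S3_ACCESS_KEY IA_S3_SECRET_KEY ITEM_TITLE FILE_LOCAL_FULL FILE_IA; do
    # Look up the value of the variable whose name is stored in $_name
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "Missing variable: $_name" >&2
      return 1
    fi
  done
  # The file to upload must be a regular file at the given path
  if [ ! -f "$FILE_LOCAL_FULL" ]; then
    echo "No such file: $FILE_LOCAL_FULL" >&2
    return 1
  fi
  echo "Ready to upload $FILE_LOCAL_FULL to $FILE_IA"
}
```

After pasting the export lines, running `check_upload_vars` should either print the "Ready to upload" line or name the first thing that is missing.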

Those variables hold all the metadata about your item, which the upload command sends along so the IA deriver will know what to do with the file when it finishes uploading.

Finally, you need to start the actual upload from your computer or server to the IA. This uses a command line program called curl, so you'll need to have that installed on your system (it probably already is). Copy this line as-is, no need to edit it, paste it into the same command line session where you set the variables, and hit enter. The upload may take a little time, depending on how fast your connection is and how big your item is.

curl --location --header "x-amz-auto-make-bucket:1" --header "x-archive-meta01-collection:opensource" --header "x-archive-meta-mediatype:web" --header "x-archive-meta-title:$ITEM_TITLE" --header "x-archive-meta-description:$ITEM_DESCRIPTION" --header "x-archive-meta-subject:$ITEM_KEYWORDS" --header "authorization: LOW $IA_S3_ACCESS_KEY:$IA_S3_SECRET_KEY" --upload-file "$FILE_LOCAL_FULL" "$FILE_IA"
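
If the one-line command is hard to read or check, the same upload can be wrapped in a shell function with one header per line. This is only a sketch: `upload_warc` is a hypothetical name, it assumes the variables exported earlier are set in the current shell, and the `--fail` option is an addition not present in the original command (it makes curl exit with a non-zero status when the server answers with an HTTP error).

```shell
# Sketch: the same curl upload, split across lines for readability.
# upload_warc is a hypothetical name; it relies on the exported variables.
upload_warc() {
  # --fail (an addition) makes curl return non-zero on HTTP status 400 or above
  curl --location --fail \
    --header "x-amz-auto-make-bucket:1" \
    --header "x-archive-meta01-collection:opensource" \
    --header "x-archive-meta-mediatype:web" \
    --header "x-archive-meta-title:$ITEM_TITLE" \
    --header "x-archive-meta-description:$ITEM_DESCRIPTION" \
    --header "x-archive-meta-subject:$ITEM_KEYWORDS" \
    --header "authorization: LOW $IA_S3_ACCESS_KEY:$IA_S3_SECRET_KEY" \
    --upload-file "$FILE_LOCAL_FULL" \
    "$FILE_IA"
}

# Usage (commented out so that pasting this block does not start an upload):
# if upload_warc; then echo "Upload finished"; else echo "Upload failed" >&2; fi
```

Checking the exit status this way is a design choice for unattended runs: a long upload that ends in an error is easier to notice from a script than from scrolling terminal output.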

You can optionally add this header to the curl command too, to help the Internet Archive deal with the ingestion of really, really big file uploads. Set the number to the estimated file size, in BYTES, not MB or GB:

--header "x-archive-size-hint:18565000000"
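
Rather than typing the byte count by hand, you can read it from the file itself. A minimal sketch, assuming `FILE_LOCAL_FULL` is set as described earlier; `size_in_bytes` and `SIZE_HINT` are hypothetical names introduced here for illustration.

```shell
# Sketch: print a file's size in bytes.
# wc -c counts bytes; tr removes the padding some wc versions add.
size_in_bytes() {
  wc -c < "$1" | tr -d '[:space:]'
}

# Usage (commented out; SIZE_HINT is a hypothetical variable name):
# SIZE_HINT=$(size_in_bytes "$FILE_LOCAL_FULL")
# ...then add this to the curl command:
#   --header "x-archive-size-hint:$SIZE_HINT"
```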

If you're uploading a WARC that should be included in the ArchiveTeam collection, you'll also need to contact one of their admins through IRC or e-mail to let them know about your new item, so they can move it into the right collection and set the right mediatype on it (which should be 'web' instead of 'texts').

@notevenaperson

"Have a WARC that you would like to upload to the Internet Archive so that it can eventually be included in their Wayback Machine?"

Is this really possible? I thought users were forbidden from uploading WARCs because they could tamper with history.

@ShadowJonathan

ShadowJonathan commented Sep 4, 2023

I got confirmation from someone on #archiveteam that they don't accept third-party WARCs anymore for inclusion in the Wayback Machine.

@notevenaperson

Lame, but understandable. Wish there was a way to verify if a web page snapshot was tampered with
