Want to help Archive Team do a "panic grab" of a website, so that you can later upload it to the Internet Archive for inclusion in its Wayback Machine? Here's the code!

Want to grab a copy of your favorite website, using wget in the command line, and saving it in WARC format? Then this is the gist for you. Read on!

First, copy the following lines into a text file and edit them as needed. Then paste them into your command line and hit enter:

export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="www.example.com"
export SPECIFIC_HOSTNAMES_TO_INCLUDE="example1.com,example2.com,images.example2.com"
export FILES_AND_PATHS_TO_EXCLUDE="/path/to/ignore"
export WARC_NAME="example.com-20130810-panicgrab"
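Before starting a long crawl, it can be worth confirming that each variable is actually set, since a typo in an export silently produces an empty value. A minimal sketch (the values shown are the placeholders from above):

```shell
# Quick sanity check: print each variable so a typo is caught
# before the crawl starts. Values are the placeholders from above.
export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="www.example.com"
export WARC_NAME="example.com-20130810-panicgrab"

for var in USER_AGENT DOMAIN_NAME_TO_SAVE WARC_NAME; do
  eval "val=\$$var"
  if [ -z "$val" ]; then
    echo "WARNING: $var is not set"
  else
    echo "$var=$val"
  fi
done
```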

Then, choose whichever of the following situations matches the website you're saving, copy that command into your command line, and hit enter. You only need to run one of these options.

(use this for grabbing a single domain:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing a single domain recursively, with the spider following links up to 10 levels deep:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing a single domain recursively, with the spider following links up to 20 levels deep:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=20 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing a single domain recursively, following links up to 10 levels deep, but EXCLUDING a certain file or path:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" -X "$FILES_AND_PATHS_TO_EXCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing a single domain recursively, following links up to 10 levels deep, but NOT crawling upwards into the parent directory:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --no-parent --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

Note that all of these commands explicitly ignore the website's robots.txt file; the ethics of doing so are left to your discretion.
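When the crawl finishes, wget will have written two files in the current directory: `$WARC_NAME.warc.gz` (WARC output is gzip-compressed by default) and `$WARC_NAME.cdx` (the index requested by `--warc-cdx`). A quick way to confirm the grab produced both, sketched here with the example WARC_NAME from above:

```shell
# Hypothetical post-crawl check: confirm the WARC archive and its CDX
# index exist, then peek at the start of the archive. The first WARC
# record is a "warcinfo" record describing the crawl itself.
WARC_NAME="example.com-20130810-panicgrab"   # same value as exported above

for f in "$WARC_NAME.warc.gz" "$WARC_NAME.cdx"; do
  if [ -f "$f" ]; then
    echo "found: $f"
  else
    echo "missing: $f"
  fi
done

# If the archive exists, show the first few lines of the first record.
if [ -f "$WARC_NAME.warc.gz" ]; then
  zcat "$WARC_NAME.warc.gz" | head -n 10
fi
```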
