Want to grab a copy of your favorite website using wget on the command line and save it in WARC format? Then this is the gist for you. Read on!

First, copy the following lines into a text file and edit them as needed. Then paste them into your command line and hit enter:

export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="www.example.com"
export SPECIFIC_HOSTNAMES_TO_INCLUDE="example1.com,example2.com,images.example2.com"
export FILES_AND_PATHS_TO_EXCLUDE="/path/to/ignore"
export WARC_NAME="example.com-20130810-panicgrab"
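Before you run wget, you may want to confirm that the variables were exported correctly. A quick sanity check (assuming a bash-style shell) could look like this:

# Print the values wget will actually see
echo "Saving:    $DOMAIN_NAME_TO_SAVE"
echo "Hosts:     $SPECIFIC_HOSTNAMES_TO_INCLUDE"
echo "Excluding: $FILES_AND_PATHS_TO_EXCLUDE"
echo "WARC name: $WARC_NAME"

If any of these print out blank, re-export that variable before continuing.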

Then, choose whichever of these situations best matches the website you're saving. Copy and paste that line into your command line and hit enter. You only need to choose one of these options.

(use this for grabbing single domain names:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing single domain names recursively, and have the spider follow links up to 10 levels deep:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing single domain names recursively, and have the spider follow links up to 20 levels deep:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=20 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing single domain names recursively, and have the spider follow links up to 10 levels deep, but EXCLUDE a certain file or path:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" -X "$FILES_AND_PATHS_TO_EXCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

(use this for grabbing single domain names recursively, and have the spider follow links up to 10 levels deep, but do NOT crawl upwards and grab stuff from the parent directory:)

wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --no-parent --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"

Note that all of these commands explicitly ignore the website's robots.txt file; the ethics of that are left to your discretion.
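When the crawl finishes, wget writes the archive to "$WARC_NAME.warc.gz" (WARC output is gzip-compressed by default) and, thanks to --warc-cdx, a CDX index alongside it. As a rough sketch, you could confirm that the grab actually captured responses with something like:

# Check that the WARC and its CDX index were written
ls -lh "$WARC_NAME.warc.gz" "$WARC_NAME.cdx"

# Count the HTTP response records inside the WARC
zcat "$WARC_NAME.warc.gz" | grep -c "WARC-Type: response"

If that count is 0 (or only a handful), the crawl probably never got past the first page, and you may need to revisit your hostname and exclusion settings.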
