@phillipsm
Created January 24, 2014 16:43
wget command
# Construct wget command
command = 'wget '
command = command + '--quiet ' # turn off wget's output
command = command + '--tries=' + str(settings.NUMBER_RETRIES) + ' ' # number of retries (assuming no 404 or the like)
command = command + '--wait=' + str(settings.WAIT_BETWEEN_TRIES) + ' ' # number of seconds between requests (lighten the load on a page that has a lot of assets)
command = command + '--quota=' + str(settings.ARCHIVE_QUOTA) + ' ' # download quota for the whole archive (wget checks it between files, so a single file can exceed it)
command = command + '--random-wait ' # vary the delay randomly between 0.5 and 1.5 times the --wait value
command = command + '--limit-rate=' + str(settings.ARCHIVE_LIMIT_RATE) + ' ' # we'll be performing multiple archives at once, so cap the bandwidth of each stream
command = command + '--adjust-extension ' # if an HTML page is served at e.g. .asp, append .html to the saved file name (this flag replaces the older --html-extension)
command = command + '--span-hosts ' # sometimes things like images are hosted at a CDN. let's span-hosts to get those
command = command + '--convert-links ' # rewrite links in downloaded source so they can be viewed in our local version
command = command + '-e robots=off ' # we're not crawling, just viewing the page exactly as you would in a web-browser.
command = command + '--page-requisites ' # get the things required to render the page later. things like images.
command = command + '--no-directories ' # when downloading, flatten the source. we don't need a bunch of dirs.
command = command + '--no-check-certificate ' # We don't care too much about busted certs
command = command + '--user-agent="' + user_agent + '" ' # pass through our user's user agent
command = command + '--directory-prefix=' + directory + ' ' # store our downloaded source in this directory
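
The string built above has to go through a shell, so a user agent containing a double quote would break the quoting. A minimal sketch of one alternative, assuming the command is run with `subprocess`: build the same flags as a list of arguments. The names `build_wget_args` and `run_wget`, the `settings` values, and the trailing `url` argument are hypothetical and not part of the original snippet.

```python
import subprocess
from types import SimpleNamespace

# Hypothetical stand-in for the `settings` module used above; values are illustrative only.
settings = SimpleNamespace(
    NUMBER_RETRIES=3,
    WAIT_BETWEEN_TRIES=1,
    ARCHIVE_QUOTA="20m",
    ARCHIVE_LIMIT_RATE="100k",
)


def build_wget_args(url, directory, user_agent, settings=settings):
    """Return the wget invocation as a list, one element per argument."""
    return [
        "wget",
        "--quiet",
        "--tries=" + str(settings.NUMBER_RETRIES),
        "--wait=" + str(settings.WAIT_BETWEEN_TRIES),
        "--quota=" + str(settings.ARCHIVE_QUOTA),
        "--random-wait",
        "--limit-rate=" + str(settings.ARCHIVE_LIMIT_RATE),
        "--adjust-extension",
        "--span-hosts",
        "--convert-links",
        "-e", "robots=off",
        "--page-requisites",
        "--no-directories",
        "--no-check-certificate",
        # No manual quoting: each list element reaches wget as a single argument.
        "--user-agent=" + user_agent,
        "--directory-prefix=" + directory,
        url,  # assumption: the original snippet appends the target URL elsewhere
    ]


def run_wget(url, directory, user_agent):
    # With a list and the default shell=False, no shell parses the arguments.
    return subprocess.call(build_wget_args(url, directory, user_agent))
```

Calling `run_wget("http://example.com/", "/tmp/archive", user_agent)` would then return wget's exit code. The flags are the same as in the string version; only the way they are passed changes.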