Archiving a website with wget

The command I use to archive a single website

wget -mpck --html-extension --user-agent="" -e robots=off --wait 1 -P . www.foo.com

Explanation of the parameters used

  • -m (Mirror) Turns on mirror-friendly settings: recursion with infinite depth, timestamping, and so on.
  • -c (Continue) Resumes a partially downloaded transfer.
  • -p (Page requisites) Downloads everything a page needs to display properly, such as images and style sheets.
  • -k (Convert links) After all files have been retrieved, rewrites links to files that were downloaded as relative links, and rewrites relative links to files that weren’t downloaded as absolute, external links. In a nutshell: makes your website archive work locally.
  • --html-extension Appends .html to downloaded HTML files whose names don’t already end in it, so the archive plays nicely on whatever system you’re going to view it on. Newer versions of Wget call this option --adjust-extension (-E).
  • --user-agent="" Sometimes websites single out certain agents, like web crawlers (e.g. GoogleBot) and Wget, and block them. An empty value tells Wget not to send a User-Agent header at all, so the server can’t identify it that way. You could alternatively pass a web browser’s user-agent string to make Wget look like a browser, but it probably doesn’t matter.
  • -e robots=off Sometimes you’ll run into a site with a robots.txt that blocks everything. In these cases, this setting tells Wget to ignore it. Like the user-agent, I usually leave this on for the sake of convenience.
  • --wait 1 Tells Wget to wait 1 second between retrievals. This makes the download a bit less taxing on the server.
  • -P . Sets the download directory. I left it at “.” (which means “here” and is also the default), but this is where you could pass in a directory path to tell Wget where to save the archived site. Handy if you’re doing this on a regular basis (say, as a cron job).
  • www.foo.com The URL of the site to download. You’ll want to change this.
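
For readability, here is a sketch of the same command spelled out with long options and wrapped in a function. The names archive_site and WGET are made up for this example, and --adjust-extension stands in for --html-extension (the name newer Wget versions use); the remaining options are meant to map one-to-one onto the flags explained above.

```shell
# Sketch only: archive_site and the WGET override are hypothetical names,
# not part of Wget or of the original command.
archive_site() {
  site=${1:?usage: archive_site URL [DIR]}
  dest=${2:-.}   # download directory, defaults to "here"

  # Set WGET=echo to print the arguments instead of downloading anything.
  "${WGET:-wget}" \
    --mirror \
    --continue \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --user-agent="" \
    --execute robots=off \
    --wait=1 \
    --directory-prefix="$dest" \
    "$site"
}
```

Calling archive_site www.foo.com ~/archives would mirror the site into ~/archives. With WGET=echo set, the function only prints the arguments, which is a cheap way to check them before putting the call into a cron job.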

@Ham5ter commented Oct 4, 2016

Thank you.

@Mayank-1234-cmd commented Feb 4, 2021

What I use:

wget --recursive --convert-links -mpck --html-extension --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36" -e robots=off site.com

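Changes:

  • removed the wait, so it runs faster
  • the user agent makes it look like I'm using an actual browser (fixes 403 errors)
  • --convert-links converts the links (it is the long form of -k, which -mpck already includes)
  • --recursive makes it recursive (-m already turns this on)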
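
Since -m already turns on recursion and -k is the short form of --convert-links, the command in this comment should behave the same without the two extra long options. A minimal sketch under that assumption: UA, browser_mirror and the WGET override are names invented for this example, and the user-agent string is the one from the comment without its trailing period.

```shell
# Sketch only: UA, browser_mirror and WGET are hypothetical names.
UA="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36"

browser_mirror() {
  # -m implies recursion and -k is --convert-links, so neither is repeated.
  # Set WGET=echo to print the arguments instead of downloading anything.
  "${WGET:-wget}" -mpck --html-extension --user-agent="$UA" -e robots=off "$1"
}
```

browser_mirror site.com would then start the mirror. There is no --wait here, so requests go out back to back, as in the comment.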