@stvhwrd
Last active March 13, 2024 17:05
Download an entire website for offline use with wget. Internal links will be rewritten so that the entire downloaded site works as it did online.

The best way to download a website for offline use, using wget

There are two ways. The first is a single command that runs in the foreground; the second runs in the background in a separate process, so you can log out of your SSH session and the download will keep going.

First make a folder to download the websites to, then begin downloading. (Note: if you download www.SOME_WEBSITE.com, you will end up with a folder like ~/websitedl/www.SOME_WEBSITE.com/.)


STEP 1:

mkdir ~/websitedl/
cd ~/websitedl/

Now choose for Step 2 whether you want to download it simply (1st way) or if you want to get fancy (2nd way).


STEP 2:

1st way:

wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://www.SOME_WEBSITE.com

2nd way:

TO RUN IN THE BACKGROUND:
nohup wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://www.SOME_WEBSITE.com &
THEN TO VIEW OUTPUT (there will be a nohup.out file in whichever directory you ran the command from):
tail -f nohup.out
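The background pattern is easiest to try with a harmless stand-in command; the quoted echo below is just a placeholder for the long wget line, so this sketch is safe to run anywhere:

```shell
# Run a command immune to hangups, capture its output, then inspect it.
# (The quoted echo stands in for the long-running wget command.)
nohup sh -c 'echo "download finished"' > nohup.out 2>&1 &
wait $!            # in real use you would log out here instead of waiting
tail nohup.out     # later, check progress (tail -f follows it live)
```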

WHAT DO ALL THE SWITCHES MEAN:

--limit-rate=200k limits the download speed to 200 KB/sec

--no-clobber don't overwrite existing files (useful if the download is interrupted and resumed)

--convert-links convert links so that they work locally, offline, instead of pointing to the online site

--random-wait wait a random interval between requests, since sites don't appreciate being scraped at a steady, obviously automated rate

-r recursive: follow links and download the full website

-p (same as --page-requisites) download everything the pages need: images, CSS, and so on

-E (same as --adjust-extension) save files with the proper extension (e.g. .html); without it, many pages end up with no extension at all

-e robots=off ignore robots.txt, since many sites disallow crawlers other than the big search engines

-U mozilla send a Mozilla user-agent string, so the site sees something browser-like instead of wget
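In scripts, the short switches above can be spelled out with their documented long-form equivalents (-r is --recursive, -p is --page-requisites, -E is --adjust-extension, -U is --user-agent, -e is --execute). A sketch that only echoes the expanded command, so it can be inspected before running:

```shell
# Long-form equivalent of the one-liner above; echoed rather than executed.
# Drop the echo to actually start the download.
WGET_ARGS="--limit-rate=200k --no-clobber --convert-links --random-wait \
--recursive --page-requisites --adjust-extension \
--execute robots=off --user-agent=mozilla"

echo wget $WGET_ARGS http://www.SOME_WEBSITE.com
```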

PURPOSELY DIDN'T INCLUDE THE FOLLOWING:

-o /websitedl/wget1.txt (long form --output-file=) would log everything to that file. Didn't do this because it gave me no output on the screen, and I don't like that.

-b runs wget in the background, but then I can't see progress... I like "nohup &" better

--domains=steviehoward.com restricts the crawl to the listed domains. Didn't include it because this site is hosted by Google, so wget might need to step into Google's domains.

--restrict-file-names=windows modifies filenames so that they also work on Windows. Seems to work okay without this.



Tested with zsh 5.0.5 (x86_64-apple-darwin14.0) on an Apple MacBook Pro (Late 2011) running OS X 10.10.3.

credit


JohnDotOwl commented Nov 14, 2018

This doesn't download asset files (CSS / JS).

The asset files are located on an external link; it's like a CDN.


AcostArichA commented Apr 19, 2020

Super Montana! Very simple, fast, and clear. Thanks!
P.S. Some words about the --no-parent option would be useful, I think.
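A quick note on --no-parent: it stops wget from ascending to the parent directory, so a crawl started at a subdirectory stays inside it. A sketch with a placeholder URL, echoed rather than executed:

```shell
# --no-parent keeps the crawl at or below /blog/, never climbing to the site root.
CMD="wget --no-parent -r -p -E --convert-links http://www.SOME_WEBSITE.com/blog/"
echo "$CMD"
```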


sndao commented Oct 12, 2022

Very useful!


BradKML commented Apr 10, 2023

Cross-posting some other findings from another gist! https://gist.github.com/crittermike/fe02c59fed1aeebd0a9697cf7e9f5c0c
Primarily --random-wait for better slowdowns, -e robots=off for bypassing robots.txt, and -l inf with --recursive instead of --mirror to control the recursion depth. --no-parent may be useful as well.
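On the --mirror point: per the wget manual, --mirror expands to -r -N -l inf --no-remove-listing, so replacing it with an explicit --recursive and -l caps the depth that --mirror leaves infinite. A sketch, echoed rather than executed, with a placeholder URL; the depth of 3 is arbitrary:

```shell
# Depth-limited alternative to --mirror: recurse at most 3 levels.
CMD="wget --recursive -l 3 --no-parent --page-requisites --adjust-extension \
--convert-links http://www.SOME_WEBSITE.com"
echo "$CMD"
```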

@Flower7C3

It should be --page-requisites not --page-requsites
