@noscripter
Forked from stvhwrd/website-dl.md
Created October 20, 2021 08:44
Download an entire website for offline use with wget. Internal links will be corrected so that the downloaded site works just as it did online.

The best way to download a website for offline use, using wget

There are two ways. The first is a single command that runs plainly in front of you; the second runs in the background in a separate instance, so you can exit your ssh session and the download will continue.

First, make a folder to download the websites to, and begin your download. (Note: if you download www.SOME_WEBSITE.com, you will end up with a folder like ~/websitedl/www.SOME_WEBSITE.com/.)


STEP 1:

mkdir ~/websitedl/
cd ~/websitedl/

Now choose for Step 2 whether you want to download it simply (1st way) or if you want to get fancy (2nd way).


STEP 2:

1st way:

wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://www.SOME_WEBSITE.com

2nd way:

TO RUN IN THE BACKGROUND:
nohup wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://www.SOME_WEBSITE.com &
THEN TO VIEW OUTPUT (there will be a nohup.out file in whichever directory you ran the command from):
tail -f nohup.out
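The nohup pattern above is generic, so you can try the mechanics with a harmless stand-in command before pointing it at a real download. A minimal sketch (the sleep/echo command is just a placeholder for wget):

```shell
# Same pattern as the 2nd way, with a stand-in command instead of wget:
nohup sh -c 'echo "starting download"; sleep 1; echo "done"' > nohup.out 2>&1 &

# Follow the output live (Ctrl-C stops tail, not the background job):
# tail -f nohup.out

# Later, to stop a real background wget after logging back in,
# find its process and kill it:
# pgrep -f wget | xargs kill
```

Because nohup detaches the job from the terminal, it keeps running even after you close the ssh session; `pgrep`/`kill` is one way to reclaim control later.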

WHAT DO ALL THE SWITCHES MEAN:

--limit-rate=200k limit the download speed to 200 KB/s, so you don't hammer the server or saturate your own connection

--no-clobber don't overwrite any existing files (useful in case the download is interrupted and resumed)

--convert-links convert links so that they work locally, offline, instead of pointing back to the website online

--random-wait vary the delay between requests; site owners don't like their whole site being downloaded, and a fixed interval is easy to detect

-r recursive: follow links and download the full website

-p download page requisites (same as --page-requisites): the images, CSS, and other files each page needs to display properly

-E add the proper extension (e.g. .html) to saved files, since many pages are served without one

-e robots=off ignore robots.txt, i.e. don't behave like a polite crawler; websites don't like robots/crawlers unless they are Google or another famous search engine

-U mozilla send a Mozilla user-agent string, so the site sees what looks like a browser looking at a page instead of a crawler like wget
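Put together, the 1st-way command is easier to read with one switch per line. In bash or zsh you can collect the switches in an array so each one carries its own comment (www.SOME_WEBSITE.com is still a placeholder for the site you actually want):

```shell
# Build the switch list once, with a comment per switch:
opts=(
  --limit-rate=200k   # cap download speed at 200 KB/s
  --no-clobber        # skip files that already exist
  --convert-links     # rewrite links for offline browsing
  --random-wait       # vary the delay between requests
  -r                  # recurse through the whole site
  -p                  # fetch page requisites (images, CSS, ...)
  -E                  # add .html where the server omits extensions
  -e robots=off       # ignore robots.txt
  -U mozilla          # identify as a Mozilla browser
)
wget "${opts[@]}" http://www.SOME_WEBSITE.com
```

This is exactly the same command as above; the array just makes it easier to comment switches in and out while you experiment.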

PURPOSELY DIDN'T INCLUDE THE FOLLOWING:

-o /websitedl/wget1.txt log everything to /websitedl/wget1.txt. Didn't do this because it gave me no output on the screen, and I don't like that. (Note: wget's syntax is -o FILE, not -o=FILE.)

-b run in the background. Didn't use it because I can't see progress; I like "nohup ... &" better.

--domains=steviehoward.com restrict the crawl to that domain. Didn't include it because this site is hosted by Google, so wget might need to step into Google's domains.

--restrict-file-names=windows modify filenames so that they will also work on Windows. Seems to work okay without this.
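If you do want a log file without giving up screen output, you can combine the two: wget's -o writes its log to a file, and tail -f follows that file live. A sketch under the same setup as above (placeholder URL, same switches):

```shell
# Log to a file instead of the terminal, run in the background,
# then follow the log so you still see progress on screen:
wget -o wget_log.txt --limit-rate=200k --no-clobber --convert-links \
     --random-wait -r -p -E -e robots=off -U mozilla \
     http://www.SOME_WEBSITE.com &
tail -f wget_log.txt
```

(-a/--append-output is the appending variant of -o, handy when resuming an interrupted download.)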



tested with zsh 5.0.5 (x86_64-apple-darwin14.0) on Apple MacBook Pro (Late 2011) running OS X 10.10.3

credit
