@crittermike
Last active March 26, 2024 22:49
Download an entire website with wget, along with assets.
# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com
# Explained (the inline comments below are annotations only; strip them before running, or use the one-liner above)
wget \
--recursive \ # Download the whole site.
--page-requisites \ # Get all assets/elements (CSS/JS/images).
--adjust-extension \ # Save files with .html on the end.
--span-hosts \ # Include necessary assets from offsite as well.
--convert-links \ # Update links to still work in the static version.
--restrict-file-names=windows \ # Modify filenames to work in Windows as well.
--domains yoursite.com \ # Do not follow links outside this domain.
--no-parent \ # Don't follow links outside the directory you pass in.
yoursite.com/whatever/path # The URL to download
@realowded

sudo apt-get update

@Celestine-Nelson

Hello, good afternoon... I still don't know how to use it to download the entire website.

@Veracious

Veracious commented Aug 19, 2019

Hello, good afternoon... I still don't know how to use it to download the entire website.

This is just wget; look up how to use wget. There are tons of examples online.

Either way, make sure you have wget installed already:
Debian/Ubuntu:
sudo apt-get install wget

CentOS/RHEL:
yum install wget

Here are some usage examples to download an entire site:
convert links for local viewing:
wget --mirror --convert-links --page-requisites ----no-parent -P /path/to/download/to https://example-domain.com

without converting:
wget --mirror --page-requisites ----no-parent -P /path/to/download/to https://example-domain.com

One more example to download an entire site with wget:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org

Explanation of the various flags:

--mirror – Makes (among other things) the download recursive.
--convert-links – convert all the links (also to stuff like CSS stylesheets) to relative, so it will be suitable for offline viewing.
--adjust-extension – Adds suitable extensions to filenames (html or css) depending on their content-type.
--page-requisites – Download things like CSS style-sheets and images required to properly display the page offline.
--no-parent – When recursing, do not ascend to the parent directory. It is useful for restricting the download to only a portion of the site.

Alternatively, the command above may be shortened:
wget -mkEpnp http://example.org
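For reference, the short flags map to the long options listed above:

# -m  = --mirror
# -k  = --convert-links
# -E  = --adjust-extension
# -p  = --page-requisites
# -np = --no-parent
wget -mkEpnp http://example.org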

If you still insist on running this as a script, it is a Bash script, so first make it executable:
chmod u+x wget.sh

and then this to run the script:
./wget.sh

If you still can't run the script, edit it by adding this as the first line:
#!/bin/sh

Also you need to specify the site in the script that you want to download. At this point you are really better off just using wget outright.

@vasili111

vasili111 commented Nov 17, 2019

@Veracious

  1. What about --span-hosts? Should I use it?
  2. Why use --mirror instead of --recursive?

@cdamken

cdamken commented Feb 17, 2020

Regarding question 2, from the wget manual:

‘--mirror’ – Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to ‘-r -N -l inf --no-remove-listing’.
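So, as a quick sketch of that equivalence (example.org is just a placeholder):

# These two commands are equivalent; --mirror is shorthand for the flags spelled out below.
wget --mirror https://example.org
wget --recursive --timestamping --level=inf --no-remove-listing https://example.org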

@YubinXie

Thanks for the tips. After I download the website, every time I open the file, it links back to its original website. Any idea how to solve this? Thanks!

@tloudon

tloudon commented May 2, 2020

@mikecrittenden 👋

@vasili111

@YubinXie

Maybe you need the --convert-links option?
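For example, a minimal sketch (the URL is a placeholder), re-running with link conversion so the saved pages point at the local copies instead of the live site:

# --convert-links rewrites links in the downloaded pages to reference the local files.
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.org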

@polly4you

Hi! If I am wrong you can virtually shoot me, but the no-parent option above may have been hit by a typo: when I tried it with ----no-parent the option was not recognized, but after some surgery I ended up with --no-parent and it worked. So if I am right, cool; if I am wrong, I am sorry.

YS: polly4you

@Ornolfr

Ornolfr commented Oct 22, 2020

What if the website requires authorization of some sort? How do we specify some cookies to wget?

@jan-martinek

--no-parent requires trailing slash, otherwise it works from the parent dir

as quoted from docs:

Note that, for HTTP (and HTTPS), the trailing slash is very important to ‘--no-parent’. HTTP has no concept of a “directory”—Wget relies on you to indicate what’s a directory and what isn’t. In ‘http://foo/bar/’, Wget will consider ‘bar’ to be a directory, while in ‘http://foo/bar’ (no trailing slash), ‘bar’ will be considered a filename (so ‘--no-parent’ would be meaningless, as its parent is ‘/’).
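A quick sketch of the difference (placeholder URLs):

# With the trailing slash, 'bar' is treated as a directory and --no-parent keeps the crawl inside /bar/.
wget --recursive --no-parent http://foo/bar/
# Without it, 'bar' is treated as a file whose parent is '/', so --no-parent effectively does nothing.
wget --recursive --no-parent http://foo/bar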

@imharvol

imharvol commented Mar 5, 2021

What if the website requires authorization of some sort? How do we specify some cookies to wget?

Add

--header='Cookie: KEY=VALUE; KEY=VALUE'

and so on, with the credentials.
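For example (the cookie names, values, and URL below are placeholders; wget can also read a browser-exported cookies.txt via --load-cookies):

# Pass session cookies copied from the browser's developer tools.
wget --mirror --page-requisites --convert-links --no-parent \
  --header='Cookie: sessionid=abc123; csrftoken=xyz789' \
  https://example.org/members/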

@abdallahoraby

What if the website requires authorization of some sort? How do we specify some cookies to wget?

Add

--header='Cookie: KEY=VALUE; KEY=VALUE'

and so on, with the credentials.

Worked like a charm, thanks!

@swport

swport commented May 11, 2021

How do I also download lazily loaded static chunks (CSS/JS files that are not loaded on the initial page load but are requested after the page finishes loading)?

@fengshansi

What if the page contains an external link that I don't want to clone?
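Not answered in the thread, but as a rough sketch: wget has --exclude-domains (and, in newer versions, --reject-regex) for skipping hosts or URL patterns you don't want; the domain names below are placeholders:

# Mirror the site but never follow links to unwanted.example.com.
wget --mirror --page-requisites --convert-links --no-parent \
  --exclude-domains unwanted.example.com \
  https://example.org/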

@sineausr931

It never occurred to me that wget could do this. Thank you for the slap in the face; it saved me from using httrack or something else unnecessarily.

@BradKML

BradKML commented Aug 25, 2021

@jeffory-orrok

If you're going to use --recursive, then you need to use --level, and you should probably be polite and use --wait
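Something like this, as a sketch (the depth and delay values are arbitrary examples):

# Cap the recursion depth and pause between requests so the server isn't hammered.
wget --recursive --level=3 --wait=2 --page-requisites --convert-links --no-parent https://example.org/docs/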

@fazlearefin

Aggregating this command with others from blog posts around the internet, I ended up using:

wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}

@BradKML

BradKML commented Jan 3, 2022

@fazlearefin thanks

@dillfrescott

My file names end with @ver=xx. How do I fix this?

@iceguru

iceguru commented Feb 17, 2022

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm

will download the PDFs.

But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm

it doesn't go down and download the PDFs.

Could someone tell me why that is? I'm trying to download all the PDFs.

@641i130

641i130 commented Mar 31, 2023

@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how they have it set up:
https://github.com/hartator/wayback-machine-downloader
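For reference, basic usage of that tool looks roughly like this (it is a Ruby gem; options may vary by version, so treat this as a sketch):

# Install the gem, then fetch the most recent archived snapshot of a site.
gem install wayback_machine_downloader
wayback_machine_downloader https://ps-survival.com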

@BradKML

BradKML commented Apr 2, 2023

I am also eyeing this repo, since it can be hooked up directly to LLMs instead of using wget indirectly: https://pypi.org/project/pywebcopy/

@BradKML

BradKML commented Apr 2, 2023

Recently discovered --random-wait as an option from here; it should be included to make the crawl look less suspicious: https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
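For example, added to the mirror command (a sketch; the one-second base delay is arbitrary):

# --random-wait varies the delay between requests (based on --wait), making the crawl look less bot-like.
wget --mirror --page-requisites --convert-links --no-parent --wait=1 --random-wait https://example.org/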

@BradKML

BradKML commented Apr 10, 2023

Just realized that --no-clobber and --mirror conflict, so one should use -l inf --recursive instead? https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
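In other words, something like this (a sketch: --mirror replaced by its non-conflicting pieces so --no-clobber can stay):

# --mirror implies -N (timestamping), which conflicts with --no-clobber,
# so spell out the recursion flags instead.
wget --recursive --level=inf --no-clobber --page-requisites --adjust-extension \
  --span-hosts --convert-links --restrict-file-names=windows \
  --domains example.org --no-parent https://example.org/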
