@crittermike
Last active March 26, 2024 22:49
Download an entire website with wget, along with assets.
# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains yoursite.com --no-parent yoursite.com
# Explained
# Bash/zsh array form: a comment cannot follow a line-continuation backslash.
wget_opts=(
  --recursive                    # Download the whole site (to the default depth of 5 levels).
  --page-requisites              # Get all assets/elements (CSS/JS/images).
  --adjust-extension             # Save HTML files with .html on the end.
  --span-hosts                   # Include necessary assets from offsite as well.
  --convert-links                # Update links to still work in the static version.
  --restrict-file-names=windows  # Modify filenames to work in Windows as well.
  --domains yoursite.com         # Do not follow links outside this domain.
  --no-parent                    # Don't follow links outside the directory you pass in.
)
wget "${wget_opts[@]}" yoursite.com/whatever/path  # The URL to download
@swport

swport commented May 11, 2021

How can I also download lazily loaded static chunks (such as CSS and JS files that are not loaded on the initial page load but are requested after the page has finished loading)?
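
wget does not execute JavaScript, so it cannot discover requests that a page makes after it has loaded. One possible workaround, sketched here under the assumption that you have already collected the extra URLs yourself (for example from the browser's network panel) into a text file, is to feed that list to wget. The function name, the file name, and the `WGET` override variable are hypothetical; only the wget flags are real.

```shell
# Sketch: download an explicit list of URLs into an existing mirror directory.
# $1 = text file with one URL per line, $2 = directory that holds the mirror.
# WGET is a hypothetical override hook (e.g. WGET=echo for a dry run); it defaults to wget.
fetch_url_list() {
  "${WGET:-wget}" --force-directories --no-clobber \
    --input-file="$1" --directory-prefix="$2"
}

# Usage (hypothetical file and directory names):
#   fetch_url_list extra-urls.txt mirror
```

`--force-directories` recreates the host/path hierarchy under the target directory, and `--no-clobber` leaves files that are already there untouched.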

@fengshansi

What if the page contains an external link that I don't want to clone?
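
In the gist's command, `--domains` already limits which hosts wget follows. For finer control wget also offers `--exclude-domains` (skip specific hosts) and `--reject-regex` (skip URLs matching a pattern). A minimal sketch using `--exclude-domains` follows; the function name, the example host names, and the `WGET` override variable are assumptions made for illustration.

```shell
# Sketch: mirror a site while skipping specific hosts.
# $1 = domains to follow (comma-separated)
# $2 = domains to skip (comma-separated)
# $3 = start URL
# WGET is a hypothetical override hook (e.g. WGET=echo for a dry run); it defaults to wget.
mirror_excluding() {
  "${WGET:-wget}" \
    --recursive --page-requisites --adjust-extension --span-hosts \
    --convert-links --restrict-file-names=windows \
    --domains "$1" --exclude-domains "$2" \
    --no-parent "$3"
}

# Usage (hypothetical host names):
#   mirror_excluding yoursite.com ads.yoursite.com yoursite.com/whatever/path
```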

@sineausr931

It never occurred to me that wget could do this. Thank you for the slap in the face; it saved me from using httrack or something else unnecessarily.

@BradKML

BradKML commented Aug 25, 2021

@jeffory-orrok

If you're going to use --recursive, you should also set --level (the default maximum depth is 5), and you should probably be polite and use --wait.
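
A minimal sketch of that advice: the recursive download with an explicit depth and a pause between requests. The function name, the default values (depth 3, 1 second), and the `WGET` override variable are assumptions, not recommendations from the gist.

```shell
# Sketch: recursive download with an explicit depth limit and a delay between requests.
# $1 = domain to stay on
# $2 = start URL
# $3 = maximum recursion depth (assumed default: 3)
# $4 = seconds to wait between requests (assumed default: 1)
# WGET is a hypothetical override hook (e.g. WGET=echo for a dry run); it defaults to wget.
polite_mirror() {
  "${WGET:-wget}" \
    --recursive --level="${3:-3}" --wait="${4:-1}" \
    --page-requisites --adjust-extension --convert-links \
    --domains "$1" --no-parent "$2"
}

# Usage:
#   polite_mirror yoursite.com yoursite.com/whatever/path 5 2
```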

@fazlearefin

Aggregating this command with other blog posts on the internet, I ended up using:

wget --mirror --no-clobber --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --domains {{DOMAINS}} --no-parent {{URL}}

@BradKML

BradKML commented Jan 3, 2022

@fazlearefin thanks

@dillfrescott

My file names end with @ver=xx. How do I fix this?
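
The likely cause is `--restrict-file-names=windows`: in that mode wget writes `@` where the URL had `?`, so `style.css?ver=xx` is saved as `style.css@ver=xx`. One possible cleanup, sketched below, renames such files after the download. The function names are hypothetical, and the sketch assumes file names without newlines.

```shell
# Sketch: strip the "@query" suffix that wget adds in Windows file-name mode.

# Print a file name with everything from its last "@" removed.
strip_query_suffix() {
  printf '%s\n' "${1%@*}"
}

# Rename every file under directory $1 whose base name contains "@".
rename_versioned_files() {
  find "$1" -type f -name '*@*' | while IFS= read -r path; do
    dir=$(dirname "$path")
    base=$(basename "$path")
    new="$dir/$(strip_query_suffix "$base")"
    # Skip the rename if the target already exists, to avoid overwriting a file.
    [ -e "$new" ] || mv "$path" "$new"
  done
}

# Usage (hypothetical directory name):
#   rename_versioned_files yoursite.com
```

Links inside the already converted HTML would still point at the old names, so they would need the same substitution for the pages to keep working offline.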

@iceguru

iceguru commented Feb 17, 2022

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://web.archive.org/web/20210628062523/https://www.ps-survival.com/PS/Hydro-Power/index.htm

This will download the .pdf.

But if I change the URL to https://web.archive.org/web/20220118034512/https://ps-survival.com/PS/index.htm

it doesn't descend into the subdirectories and download the PDFs.

Could someone tell me why that is? I'm trying to download all the PDFs.

@641i130

641i130 commented Mar 31, 2023

@iceguru I'd try using an archive downloader. Wget doesn't play nicely with how they have it set up:
https://github.com/hartator/wayback-machine-downloader

@BradKML

BradKML commented Apr 2, 2023

I am also eyeing this package, since it can be hooked up to LLMs directly instead of going through wget: https://pypi.org/project/pywebcopy/

@BradKML

BradKML commented Apr 2, 2023

I recently discovered the --random-wait option in this gist; it should be included to make the crawl look less suspicious: https://gist.github.com/stvhwrd/985dedbe1d3329e68d70
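
According to the wget manual, `--random-wait` varies the pause between requests from 0.5 to 1.5 times the value given to `--wait`, so it only has an effect alongside `--wait`. A minimal sketch; the function name, the 2-second default, and the `WGET` override variable are assumptions.

```shell
# Sketch: recursive download with a randomized delay between requests.
# $1 = start URL
# $2 = base wait in seconds (assumed default: 2); actual pauses vary around it
# WGET is a hypothetical override hook (e.g. WGET=echo for a dry run); it defaults to wget.
jittered_mirror() {
  "${WGET:-wget}" --recursive --page-requisites --convert-links \
    --wait="${2:-2}" --random-wait --no-parent "$1"
}

# Usage:
#   jittered_mirror yoursite.com/whatever/path 3
```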

@BradKML

BradKML commented Apr 10, 2023

I just realized that --no-clobber and --mirror conflict, so should one use -l inf --recursive instead? https://stackoverflow.com/questions/13092229/cant-resume-wget-mirror-with-no-clobber-c-f-b-unhelpful
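
The wget manual describes `--mirror` as equivalent to `-r -N -l inf --no-remove-listing`, and wget refuses to combine `-N` (timestamping) with `--no-clobber`. Spelling out `--recursive -l inf` together with `--no-clobber` is one way around that. Another, sketched here, is to keep `--mirror` and drop `--no-clobber`, relying on timestamping to skip files that have not changed. The function name and the `WGET` override variable are assumptions.

```shell
# Sketch: the aggregated command from this thread without --no-clobber,
# so that --mirror's implied timestamping (-N) no longer conflicts.
# $1 = domains to follow (comma-separated), $2 = start URL.
# WGET is a hypothetical override hook (e.g. WGET=echo for a dry run); it defaults to wget.
mirror_timestamped() {
  "${WGET:-wget}" \
    --mirror --page-requisites --adjust-extension --span-hosts \
    --convert-links --restrict-file-names=windows \
    --domains "$1" --no-parent "$2"
}

# Usage:
#   mirror_timestamped yoursite.com yoursite.com/whatever/path
```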
