inian/download.sh

Last active Oct 8, 2018
Download webpage using wget
wget -E -H -k -K -p -t 2 -T 30 --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" -e robots=off http://example.com

@muralisr muralisr commented Aug 3, 2015

‘-E’
‘--adjust-extension’
If a file of type ‘application/xhtml+xml’ or ‘text/html’ is downloaded and the URL does not end with the regexp ‘\.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to be appended to the local filename. This is useful, for instance, when you’re mirroring a remote site that uses ‘.asp’ pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you’re downloading CGI-generated materials. A URL like ‘http://site.com/article.cgi?25’ will be saved as ‘article.cgi?25.html’.
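
The naming rule above can be approximated in plain shell. This is only a sketch of the rule as quoted, not Wget's own code: `adjusted_name` is a made-up helper, and it assumes the response was already identified as HTML.

```shell
# Hypothetical helper: approximate the -E naming rule for an HTML response.
adjusted_name() {
  name="${1##*/}"                      # keep only the last URL component
  case "$name" in
    *.[Hh][Tt][Mm] | *.[Hh][Tt][Mm][Ll])
      printf '%s\n' "$name" ;;         # already ends in .htm or .html
    *)
      printf '%s\n' "$name.html" ;;    # otherwise append the suffix
  esac
}

adjusted_name 'http://site.com/article.cgi?25'
adjusted_name 'http://site.com/index.htm'
```

The first call prints the same name as the manual's example; the second is left alone because the trailing ‘l’ is optional in the regexp.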

‘-H’
‘--span-hosts’
Enable spanning across hosts when doing recursive retrieving (see Spanning Hosts).

‘-k’
‘--convert-links’
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

‘-K’
‘--backup-converted’
When converting a file, back up the original version with a ‘.orig’ suffix. Affects the behavior of ‘-N’ (see HTTP Time-Stamping Internals).
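
Since ‘-K’ leaves the pre-conversion file next to the converted one, the two can be compared to see exactly what ‘-k’ rewrote. A rough sketch, assuming the mirror lives in a directory passed as the first argument; `show_conversions` is a made-up helper name, not part of Wget.

```shell
# Hypothetical helper: diff every .orig backup against its converted file.
show_conversions() {
  find "$1" -type f -name '*.orig' | while IFS= read -r orig; do
    # diff exits non-zero when the files differ, which is expected here
    diff -u "$orig" "${orig%.orig}" || true
  done
}
```

Calling it as `show_conversions example.com` after a download should list the links that were rewritten for local viewing, one unified diff per converted file.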

‘-p’
‘--page-requisites’
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

‘-e command’
‘--execute command’
Execute command as if it were a part of .wgetrc (see Startup File). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. If you need to specify more than one wgetrc command, use multiple instances of ‘-e’.
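
To make the relation between ‘-e’ and the startup file concrete, here is a small sketch that writes three wgetrc commands to a scratch file and shows the matching ‘-e’ form in comments. The scratch file and the `rc_file` variable are assumptions for illustration; the wget calls are left commented out so nothing touches the network.

```shell
# Hypothetical scratch startup file holding three wgetrc commands.
rc_file="$(mktemp)"
cat > "$rc_file" <<'EOF'
robots = off
tries = 2
timeout = 30
EOF

# One-off equivalent: one -e per wgetrc command.
#   wget -e robots=off -e tries=2 -e timeout=30 http://example.com
# Or use the scratch file for a single run (WGETRC names the user startup file):
#   WGETRC="$rc_file" wget http://example.com
```

Anything given with ‘-e’ is applied after the startup file, so adding `-e tries=5` to the second form should override the `tries = 2` line.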

robots = on/off
Specify whether the norobots convention is respected by Wget, “on” by default. This switch controls both the /robots.txt and the ‘nofollow’ aspect of the spec. See Robot Exclusion, for more details about this. Be sure you know what you are doing before turning this off.
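
Read together, the short flags in the gist map onto the long option names quoted above. The sketch below spells the same command out in long form; ‘-t’ and ‘-T’ are not covered in this comment and correspond to ‘--tries’ and ‘--timeout’. The `fetch_page` function and the `WGET` override are assumptions added for a dry run, not part of the gist.

```shell
# Hypothetical wrapper around the gist's command, spelled with long options.
fetch_page() {
  # WGET is an assumed override; WGET=echo prints the arguments instead of downloading.
  "${WGET:-wget}" \
    --adjust-extension \
    --span-hosts \
    --convert-links \
    --backup-converted \
    --page-requisites \
    --tries=2 \
    --timeout=30 \
    --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" \
    --execute robots=off \
    "$1"
}

# Dry run: show what would be passed to wget without contacting the network.
WGET=echo fetch_page http://example.com
```

Dropping the `WGET=echo` prefix runs the real download. Keep in mind the manual's warning about `robots=off` before pointing this at a site you do not control.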
