@2bj
Forked from iAugur/wget.txt
Created July 2, 2020 10:36
wget spider cache warmer
wget --spider -o wget.log -e robots=off -r -l 5 -p -S -T3 --header="X-Bypass-Cache: 1" -H --domains=live-mysite.mydomain.com --show-progress live-mysite.mydomain.com
# Options explained
# --spider: Crawl the site without saving the pages to disk
# -o wget.log: Write the log to wget.log
# -e robots=off: Ignore robots.txt
# -r: specify recursive download
# -l 5: Maximum recursion depth. E.g. 1 means 'crawl the start page and the pages it links to'; 2 goes one level further, and so on.
# -p: get all images, etc. needed to display HTML page
# -S: print server response (to the log)
# --delete-after: delete each file once it is downloaded (we are only warming caches after all, not mirroring the site)
# -T 3: Time out after 3 seconds (the default read timeout is 900 seconds)
# --header="X-Bypass-Cache: 1": Set a header (this one bypasses the Varnish cache, assuming your VCL is configured to honour it)
# Also useful for specifying a Host header - see below
# or setting a User Agent:
# --header="User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"
# Multiple --header options can be provided
# --show-progress: show progress on the terminal for the URLs we are warming (the rest of the output goes to the log)
# -H: Span hosts (only used so we can limit with -D/--domains)
# --domains=live-mysite.mydomain.com: stay on the same domain (we aren't warming our friends); requires -H
# --max-redirect=0: don't follow redirects (not in the example above but useful)
# --limit-rate=20k: limit the download rate if you don't want to hammer your site
# live-mysite.mydomain.com: URL to start crawling
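# The two optional flags above can be added to the first command as follows.
# This is only a sketch: warm_cache_gently is a hypothetical helper name and
# SITE is a placeholder host, neither is part of the original command.

```shell
# Placeholder host; replace with the site you want to warm.
SITE="live-mysite.mydomain.com"

# Same crawl as the first command, plus --max-redirect=0 and --limit-rate=20k.
# $1 is the host to crawl; it is also used to restrict the crawl via --domains.
warm_cache_gently() {
  wget --spider -o wget.log -e robots=off -r -l 5 -p -S -T3 \
    --header="X-Bypass-Cache: 1" \
    -H --domains="$1" \
    --max-redirect=0 \
    --limit-rate=20k \
    --show-progress "$1"
}

# Usage (runs the crawl): warm_cache_gently "$SITE"
```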
# On some systems (macOS) wget may fail to resolve hosts specified in /etc/hosts,
# which makes it hard to spider local dev sites.
# To work around this, specify a Host header and spider the localhost address:
wget --spider -o wget.log -e robots=off -r -l 5 -p -T3 --header "Host: testsite.dev.local" -H --domains=127.0.0.1 --show-progress 127.0.0.1 --delete-after
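# After a run with -S (as in the first command), the log holds one status line
# per server response. A rough way to see how many 200s, 404s, etc. the crawl
# hit is sketched below. summarize_statuses is a hypothetical helper, and it
# assumes status lines of the form "  HTTP/1.1 200 OK" in the log.

```shell
# Count HTTP status codes found in a wget log, most frequent first.
# $1 is the log file; defaults to wget.log.
summarize_statuses() {
  log_file="${1:-wget.log}"
  grep -oE 'HTTP/[0-9.]+ [0-9]{3}' "$log_file" \
    | awk '{ print $2 }' \
    | sort \
    | uniq -c \
    | sort -rn
}

# Usage: summarize_statuses wget.log
```

# Each output line is a count followed by a status code.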