Skip to content

Instantly share code, notes, and snippets.

@pixelbrackets
Last active November 13, 2022 03:46
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pixelbrackets/ab2bf9106f4ad0aaaa69 to your computer and use it in GitHub Desktop.
Save pixelbrackets/ab2bf9106f4ad0aaaa69 to your computer and use it in GitHub Desktop.
wget crawler (e.g. simple cache warmup)
### crawls all links, writes the result to /tmp/ and deletes the files right away
wget --spider --force-html -r -l10 --directory-prefix=/tmp/ --no-directories --delete-after http://www.example.com
@pixelbrackets
Copy link
Author

pixelbrackets commented Jul 12, 2021

# Wait one second after each request to avoid hitting rate limits
wget --spider --force-html -r -l10 --directory-prefix=/tmp/ --no-directories --delete-after --wait=1 https://www.example.com

# Only follow a-link tags (ignore img, script, link…), ignore links to assets, identify as cache crawler
wget --spider --force-html -r -l10 --directory-prefix=/tmp/ --no-directories --delete-after --wait=1 --follow-tags=a --reject gif,jpg,png,pdf,xls --user-agent="Cache Crawler" https://www.example.com

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment