Crawl a site's links and report on the responses in a log file
# --spider checks that pages exist without downloading them (HEAD requests where possible)
# --no-verbose gives minimal output (~1/4 the lines)
# -o sends the log output to a file
# -e robots=off ignores robots.txt (but you should play nice -- so don't use this)
# -w 1 waits one second between requests
# --random-wait uses -w to vary the wait between 0.5 and 1.5 * wait seconds
# -r recursive retrieval (this is the crawler part)
# -nd don't create a hierarchy of directories when retrieving recursively (prevents inode issues)
# -p fetches all page requisites (images, CSS, etc.)
wget --spider --no-verbose -o ~/example-crawl.log -w 1 --random-wait -r -nd -p http://www.example.com
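
# Afterwards, a minimal sketch for reporting on the bad responses from the log
# (the exact log wording varies by wget version, so treat the pattern as an assumption):
# with --spider, unreachable URLs are flagged as broken links in the log, e.g.
grep -i 'broken link' ~/example-crawl.log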
To crawl pages behind a login, first POST the credentials and save the session cookies:

wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login
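
A minimal sketch combining the two commands above (assuming the cookies.txt saved by the login step): reuse the session cookies for the spider crawl with --load-cookies, wget's flag for reading a saved cookies file.

wget --load-cookies cookies.txt --spider --no-verbose -o ~/example-crawl.log -w 1 --random-wait -r -nd -p http://www.example.com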
