Crawl a site's links and report on the responses in a log file.
# --spider check that pages exist without saving them (HEAD where possible; HTML is still fetched to find links when recursing)
# --no-verbose gives minimal output (~1/4 the lines)
# -o send the log output to a file
# -e robots=off ignores robots.txt (but you should play nice -- so don't use this)
# -w 1 waits one second between requests
# --random-wait uses -w to vary the pause between 0.5 and 1.5 * wait seconds
# -r recursive (this is the crawler part)
# -nd don't create a hierarchy of directories when retrieving recursively (prevents inode issues)
# -p also request all page requisites (images, CSS, scripts)
wget --spider --no-verbose -o ~/example-crawl.log -w 1 --random-wait -r -nd -p http://www.example.com
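
Once the crawl finishes, the log is just text, so the report can be pulled out with grep. A sketch, assuming the status strings GNU wget writes in --no-verbose/spider mode (the exact wording varies a little between versions):

# show every logged response that wasn't a 200 OK
grep -v '200 OK' ~/example-crawl.log

# in spider mode wget also calls out dead URLs explicitly;
# -B1 includes the line above, which names the offending URL
grep -B1 'broken link' ~/example-crawl.log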
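To crawl pages that sit behind a login, first POST the credentials and keep the session cookies around for later requests. Here user, password, and the login URL are stand-ins for whatever the target site's form actually uses (cookies are on by default, so --cookies=on only makes that explicit):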
wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login
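
With cookies.txt saved, the same spider crawl from above can reach the protected pages by loading the cookie jar first; the member-area URL below is hypothetical:

# reuse the saved session for the authenticated crawl
wget --spider --no-verbose -o ~/example-crawl.log -w 1 --random-wait -r -nd -p \
     --load-cookies cookies.txt http://example.com/members/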