Use wget to find broken links.

Found this resource on Created By Pete.

Set Up

First, you'll need to make sure you have Wget installed; on OS X you can install it with Homebrew.

brew install wget
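
You can confirm the install worked by checking the version:

wget --version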

Command

wget --spider -o ~/wget.log -e robots=off -w 1 -r -p http://www.example.com

Breakdown

  • --spider, this tells Wget not to download anything; since we only want a report, it issues a HEAD request rather than a GET wherever it can.
  • -o ~/wget.log, log messages to the declared file, in this case a file called wget.log saved to your home directory. You can change this to a more convenient location and filename.
  • -e robots=off, this tells Wget to ignore the robots.txt file. Learn more about robots.txt.
  • -w 1, adds a 1-second wait between requests. This slows Wget down to a more consistent rate, minimising stress on the hosting server so you don't get back any false positives.
  • -r, this means recursive, so Wget will keep following links deeper into your site until it can find no more; a depth-limited variation is sketched after this list.
  • -p, get all page requisites, such as images, needed to display the HTML page, so we can find broken image links too.
  • http://www.example.com, finally, the URL of the website to start from.
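
On a large site a full recursive crawl can take a long time. A minimal variation, if you only want to check a few levels deep: -l caps the recursion depth (the 3 here is just an example value) and --no-parent keeps Wget from climbing above the starting directory.

wget --spider -o ~/wget.log -e robots=off -w 1 -r -l 3 --no-parent -p http://www.example.com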

Reading the Log

grep -B 2 '404' ~/wget.log

This will show you any references to pages that returned a 404 error. It won't show you the pages the links originated on, but it gives you a starting point. If you'd like to find other errors, you can substitute 404 with 500 and so on.
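
The -B 2 flag prints two lines of leading context, because in Wget's log the requested URL appears a line or two before the status line rather than on it. To pull out just the failing URLs, you can extend the pipeline; a minimal sketch, assuming the usual log layout (it can vary slightly between Wget versions):

grep -B 2 '404' ~/wget.log | grep -o 'http[^ ]*' | sort -u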

Here is the manual for wget.
