@robmiller
Last active August 27, 2016 11:08
Use wget to spider a site and output which URLs (of pages or resources within those pages, such as stylesheets or images) returned a 404 status.
#!/bin/zsh
#
# 404s
#
# Usage:
#
#   1. Download this script, rename it to 404s, and put it somewhere in
#      your $PATH (~/bin is a good place)
#   2. Run the script:
#
#      $ 404s http://example.com
#
#   To get just the URLs and not progress updates, silence stderr:
#
#      $ 404s http://example.com 2>/dev/null
SITE="$1"
HOSTNAME=$(echo "$SITE" | ruby -ruri -ne 'puts URI.parse($_.chomp).hostname rescue nil')
if [ -z "$HOSTNAME" ]; then
	echo "Invalid URL specified" 1>&2
	exit 1
fi
LOG_FILE="$HOSTNAME.log"
echo "Spidering site..." 1>&2
wget --spider -o "$LOG_FILE" -e robots=off -w 0.1 -r -p -nd --delete-after "$SITE"
echo "URLs that returned 404s:" 1>&2
grep -B2 '404 Not Found' "$LOG_FILE" | grep -E 'https?:' | ruby -pe 'gsub(/^--.+-- /, "")'
rm "$LOG_FILE"
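The script shells out to Ruby only to pull the hostname out of the URL. If you'd rather avoid the Ruby dependency, a rough equivalent using plain shell parameter expansion might look like this (a sketch that assumes a well-formed http(s) URL, with no validation of odd inputs):

```shell
# url_hostname: extract the hostname from an http(s) URL using only
# parameter expansion -- a hedged stand-in for the ruby one-liner above.
url_hostname() {
	local url="${1#*://}"        # strip the scheme, e.g. "http://"
	url="${url%%/*}"             # drop the path, if any
	printf '%s\n' "${url%%:*}"   # drop a ":port" suffix, if any
}

url_hostname "http://example.com/some/page"
url_hostname "https://example.com:8080/other"
```

The other Ruby call, which strips wget's `--timestamp-- ` prefix from matched log lines, could similarly be replaced with `sed 's/^--.*-- //'` if you want a Ruby-free pipeline.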