@MikeNGarrett
Created May 7, 2014 03:54
Bring some sanity to a wget log by pulling out the URLs of all files that return 404.
# Working with data like this:
# HTTP request sent, awaiting response... .--2014-05-06 16:41:58-- http://xxx.com/xxx.jpg
# Resolving xxx.com... 1.1.1.1
# Connecting to xxx.com|1.1.1.1|:80... ...............connected.
# HTTP request sent, awaiting response... .. .......... .... ....... .......--2014-05-06 16:41:58-- http://xxx.com/xxx.jpg
# Resolving xxx.com... .... .1.1.1.1
# Connecting to xxx.com|1.1.1.1|:80... .........404 Not Found
# 2014-05-06 16:41:58 ERROR 404: Not Found.
#
# .....200 OK
# Find each "404 Not Found" line and return it with the 2 lines before it, then
# Keep only the lines that contain http://, then
# Split each line into fields on "--  " (two dashes, two spaces) and return the 2nd field, the URL, then
# Sort the URLs and drop duplicates, then
# Write the results to 404s.txt
grep -B 2 "404 Not Found" log.txt | grep http:// | awk -F'--  ' '{print $2}' | sort -u > 404s.txt
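To try the pipeline without a real wget run, here is a minimal sketch that feeds it a small hand-written log. The log lines are a simplified imitation of the layout shown above, and the file names `a.jpg` and `b.jpg` and the temporary file are assumptions made up for illustration, not taken from a real log.

```shell
# Hypothetical sample log: a.jpg fails with 404 twice, b.jpg succeeds.
log=$(mktemp)
printf '%s\n' \
  'HTTP request sent, awaiting response... .--2014-05-06 16:41:58--  http://xxx.com/a.jpg' \
  'Resolving xxx.com... 1.1.1.1' \
  'Connecting to xxx.com|1.1.1.1|:80... .........404 Not Found' \
  '2014-05-06 16:41:58 ERROR 404: Not Found.' \
  '' \
  'HTTP request sent, awaiting response... .--2014-05-06 16:41:59--  http://xxx.com/b.jpg' \
  'Resolving xxx.com... 1.1.1.1' \
  'Connecting to xxx.com|1.1.1.1|:80... .....200 OK' \
  'HTTP request sent, awaiting response... .--2014-05-06 16:42:00--  http://xxx.com/a.jpg' \
  'Resolving xxx.com... 1.1.1.1' \
  'Connecting to xxx.com|1.1.1.1|:80... .........404 Not Found' \
  '2014-05-06 16:42:00 ERROR 404: Not Found.' \
  > "$log"

# Same pipeline as above, reading the sample log and capturing the output.
# The awk separator is two dashes followed by two spaces.
result=$(grep -B 2 "404 Not Found" "$log" | grep http:// | awk -F'--  ' '{print $2}' | sort -u)

echo "$result"
rm -f "$log"
```

The repeated `a.jpg` failure appears only once in the output because of `sort -u`, and `b.jpg` is left out because its request did not return 404. If a log also contains HTTPS URLs, replacing `grep http://` with `grep -E 'https?://'` should keep those lines as well.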