@pvdb
Last active March 4, 2017 09:01
Shell script to extract and validate external URLs found in Seedy content files
#!/usr/bin/env sh
#
# install GNU grep (`/usr/local/bin/ggrep`)...
#
# brew tap homebrew/dupes; brew install grep
#
# usage - ensure `${PROJECTS_HOME}` is set...
#
# find ${PROJECTS_HOME}/seedy/content -name '*.md' | ./validate_external_urls.sh > seedy_external_urls.tsv
#
# header row: the single quotes keep the column names literal
printf '${http_status}\t${external_url}\t${content_file}\n' ;
while read -r content_file ; do
  for external_url in $( ggrep -shoP 'https?://[^)"\] ]+' "${content_file}" ) ; do
    # pass the URL as an argument, so a quote inside it can't break the Ruby one-liner
    if ruby -r uri -e 'URI(ARGV[0]) rescue exit(1)' "${external_url}" ; then
      http_status=$( curl -s -o /dev/null -I -w '%{http_code}' "${external_url}" ) ;
    else
      http_status="666" ; # ad-hoc status code to indicate invalid URL syntax
    fi
    printf '%s\t%s\t%s\n' "${http_status}" "${external_url}" "${content_file}";
  done
done
# That's all Folks!
pvdb commented Apr 12, 2016

Possible improvements to this initial, rudimentary version of the script:

  • make GET instead of HEAD requests (as certain sites don't respond properly to HEAD requests)
  • also capture and output the Location: HTTP header for redirects, i.e. 30x responses
  • only generate output for "problematic" HTTP status codes (e.g. don't generate output for 20x responses)
  • investigate and work around site-specific issues (e.g. 999 HTTP status code used by LinkedIn)
  • follow 30x redirects, to ensure the redirect chain results in a 200 response
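
A few of these ideas (GET instead of HEAD, following redirects, and only reporting non-20x results) could be sketched roughly as below. This is not part of the script above: the function names `is_problematic`, `check_url` and `report_problematic`, as well as the redirect and timeout limits, are assumptions made purely for illustration.

```shell
#!/usr/bin/env sh
# Sketch only: function names and the redirect/timeout limits are illustrative assumptions.

# Succeeds (exit 0) for statuses worth reporting, i.e. anything that is not 2xx.
is_problematic() {
  case "$1" in
    2[0-9][0-9]) return 1 ;;
    *)           return 0 ;;
  esac
}

# GET the URL (no -I), follow redirects (-L), and print
# "<final status>\t<number of redirects>\t<final URL>".
check_url() {
  curl -s -o /dev/null -L --max-redirs 10 --max-time 20 \
       -w '%{http_code}\t%{num_redirects}\t%{url_effective}' "$1"
}

# Read URLs from stdin, one per line, and print a row only for problematic ones.
report_problematic() {
  tab=$( printf '\t' )
  while read -r external_url ; do
    result=$( check_url "${external_url}" )
    http_status=${result%%"${tab}"*} # keep everything before the first tab
    if is_problematic "${http_status}" ; then
      printf '%s\t%s\n' "${result}" "${external_url}"
    fi
  done
}
```

With the functions defined in the current shell, something like `ggrep -shoP 'https?://[^)"\] ]+' some_file.md | report_problematic` would list only the URLs whose redirect chain does not end in a 20x response. Site-specific cases such as LinkedIn's 999 would still show up as problematic and need separate handling.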
