Last active
March 4, 2017 09:01
-
-
Save pvdb/79c86898753713084e0eef22a4b6ebd3 to your computer and use it in GitHub Desktop.
Shell script to extract and validate external URLs found in Seedy content files
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/env sh | |
# | |
# install GNU grep (`/usr/local/bin/ggrep`)... | |
# | |
# brew tap homebrew/dupes; brew install grep | |
# | |
# usage - ensure `${PROJECTS_HOME}` is set... | |
# | |
# find ${PROJECTS_HOME}/seedy/content -name '*.md' | ./validate_external_urls.sh > seedy_external_urls.tsv | |
# | |
printf '${http_status}\t${external_url}\t${content_file}\n' ; | |
while read -r content_file ; do | |
for external_url in $( ggrep -shoP 'https?://[^)"\] ]+' "${content_file}" ) ; do | |
if ruby -r uri -e "URI('${external_url}') rescue exit(1)" ; then | |
http_status=$( curl -s -o /dev/null -I -w '%{http_code}' "${external_url}" ) ; | |
else | |
http_status="666" ; # ad-hoc status code to indicate invalid URL syntax | |
fi | |
printf '%s\t%s\t%s\n' "${http_status}" "${external_url}" "${content_file}"; | |
done | |
done | |
# That's all Folks! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Possible improvements to this initial, rudimentary version of the script:
GET
instead ofHEAD
requests (as certain sites don't respond properly toHEAD
requests)Location:
HTTP header for redirects, ie.30x
responses20x
responses)999
HTTP status code used by LinkedIn)30x
redirects, to ensure the redirect chain results in a200
response