@isaacsanders
Created August 7, 2012 21:03
Crawls all absolute links on a page
# If this:
# (fetch articles.html, pull every href attribute out of the markup, keep
#  only the absolute *.html paths, and prefix each one with localhost:3000)
URLS=`curl $HOST/articles.html |\
  grep -o 'href="\([^"]*\)' |\
  grep -o '/[^/]*/\?.*html' |\
  sed 's/\(.*\)/localhost:3000\1/'`
for url in $URLS
do
  echo "$url"
done
# Produces this:
localhost:3000/articles.html
localhost:3000/topics/6-lorem.html
localhost:3000/topics/7-parody.html
localhost:3000/articles/9-the-original.html
localhost:3000/topics/5-latin.html
localhost:3000/topics/6-lorem.html
localhost:3000/articles/10-gangsta-ipsum.html
localhost:3000/topics/6-lorem.html
localhost:3000/topics/7-parody.html
localhost:3000/articles/11-veggie-ipsum.html
localhost:3000/articles/15-hipster-ipsum.html
localhost:3000/articles/17-tuna-ipsum.html
# Then shouldn't this:
URLS=`curl $HOST/articles.html |\
  grep -o 'href="\([^"]*\)' |\
  grep -o '/[^/]*/\?.*html' |\
  sed 's/\(.*\)/localhost:3000\1/'`
for url in $URLS
do
  curl "$url"
done
# `curl` each of those URLs?
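
# In principle, yes: the second loop invokes `curl` once for every entry in
# $URLS, exactly as the first loop invokes `echo`. A slightly more defensive
# sketch (assuming $HOST is set and the same app is answering on
# localhost:3000) pipes the extracted paths straight into a while-read loop,
# quotes "$url", and prints each response's HTTP status so failures show up
# instead of scrolling by silently:
curl -s $HOST/articles.html |\
  grep -o 'href="\([^"]*\)' |\
  grep -o '/[^/]*/\?.*html' |\
  sed 's/\(.*\)/localhost:3000\1/' |\
  while read -r url
  do
    # -sS: silent but still report errors; -o /dev/null: discard the body;
    # -w: print the status code next to the URL that produced it
    curl -sS -o /dev/null -w "%{http_code} $url\n" "$url"
  done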