@jarvist
Created April 25, 2012 21:28
#Strip emails from mediauk via some horrific re hacks + direct parsing of HTML.
# Their site breaks Beautiful Soup :(
# B wants all the email addresses from the pages linked to from this page
wget -O - "http://www.mediauk.com/tags/local" | sed 's/>/\n/g' | grep "href" | cut -f2 -d\' | grep "^\/" > urls.list
#I think I manually deleted the false ones in the top + bottom of urls.list
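# Fetch each listed page and print "URL email", one pair per line.
while read -r URL
do
    printf '%s ' "${URL}"
    # The address is the quoted value in the third space-separated field of the og:email tag.
    email=$(wget -q -O - "http://www.mediauk.com${URL}" | sed 's/>/\n/g' | grep "og:email" | cut -f3 -d' ' | cut -f2 -d\")
    echo "${email}" # ensures end of line, even when no address was found
done < urls.list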
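
The field-counting `cut` calls above only work while the og:email tag keeps its attributes in the same order. A minimal sketch of a slightly less position-dependent extraction follows. It assumes the address sits in an attribute named `content`; the script above only relies on it being the third field. The helper name `extract_og_email` and the sample markup are made up for illustration and are not taken from the real site.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the value of the
# content attribute of the first tag that mentions og:email.
extract_og_email() {
    # tr splits the markup at every '>' so each tag lands on its own line;
    # unlike sed 's/>/\n/g' this does not depend on GNU sed.
    tr '>' '\n' | grep 'og:email' | head -n 1 |
        sed -n 's/.*content=["'\'']\([^"'\'']*\)["'\''].*/\1/p'
}

# Made-up sample page; the real pages may differ.
sample='<html><head><meta property="og:email" content="studio@example.org" /></head></html>'
printf '%s\n' "$sample" | extract_og_email
```

Inside the loop, this helper could take the place of the `sed | grep | cut | cut` chain by piping the `wget -O -` output into it. It still treats HTML as text, so it remains a hack rather than a parser.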