Created April 25, 2012 21:28
#Strip emails from mediauk via some horrific re hacks + direct parsing of HTML.
# Strip emails from mediauk via some horrific re hacks + direct parsing of HTML.
# Their site breaks Beautiful Soup :(
# B wants all the email addresses from the pages this page links to.

# Collect the site-relative links (those starting with "/") from the tag index.
# Note: "\n" in the sed replacement needs GNU sed.
wget -q -O - "http://www.mediauk.com/tags/local" | sed 's/>/\n/g' | grep "href" | cut -f2 -d\' | grep '^/' > urls.list

# I think I manually deleted the false ones at the top + bottom of urls.list

# Print each URL followed by the og:email value found on that page (blank if none).
while read -r URL
do
    email=$(wget -q -O - "http://www.mediauk.com${URL}" | sed 's/>/\n/g' | grep "og:email" | cut -f3 -d' ' | cut -f2 -d'"')
    echo "${URL} ${email}" # always ends the line, even when no email was found
done < urls.list
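
The `cut -f3` step in the loop only works while `content="..."` is the third space-separated field of the `og:email` tag. A minimal sketch of a less position-dependent extraction follows. The function name `extract_og_email` and the sample markup are made up for illustration, and `grep -o` is assumed to be available (it is in GNU and BSD grep, but is not POSIX).

```shell
# Hypothetical helper, not part of the original script: read HTML on stdin and
# print the content="..." value of the first <meta> tag mentioning og:email.
extract_og_email() {
    tr '\n' ' ' |                                    # join lines so a tag split across lines still matches
    grep -o '<meta[^>]*og:email[^>]*>' |             # keep only the og:email meta tag(s)
    sed -n 's/.*content="\([^"]*\)".*/\1/p' |        # pull out the content attribute, wherever it sits in the tag
    head -n 1                                        # one address per page
}

# Made-up sample markup, only to show the call.
sample='<html><head><meta property="og:email" content="studio@example.org" /></head></html>'
printf '%s\n' "$sample" | extract_og_email   # prints studio@example.org
```

Inside the loop this would stand in for the `sed | grep | cut | cut` chain, e.g. `email=$(wget -q -O - "http://www.mediauk.com${URL}" | extract_og_email)`. It is still regex-on-HTML, so it inherits the same fragility as the rest of the script.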