@mankuthimma
Created November 20, 2011 09:47
Scrape a set of HTML files and generate a pipe-separated values file from selected content
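from BeautifulSoup import BeautifulSoup

# List of file names (without the .html extension), one per line
afp = open('afp.txt', 'rb')
aes = afp.readlines()
afp.close()

# Element ids to extract from each page, in output column order
list_o_attrs = ['nm', 'ttl', 'co_nm', 'eml', 'co_url', 'linkedin_url', 'twitter_url', 'bio']

for person in aes:
    person = person.strip()
    if not person:
        continue  # skip blank lines in the list
    person_det = []
    with open('dump/' + person + '.html', 'rb') as page:
        fh = page.read()
    bs = BeautifulSoup(fh)
    for attr in list_o_attrs:
        av = bs.find(id=attr)
        # Prefer the text of a nested link, then the element's own text
        tval = av.find('a') if av is not None else None
        if tval is not None and tval.string is not None:
            val = repr(tval.string)
        elif av is not None and av.string is not None:
            val = repr(av.string)
        else:
            val = ''  # element missing, or it has no single text node
        person_det.append(val)
    print '|'.join(person_det)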
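
The script above joins fields with '|' directly, so a value that itself contains a pipe (a bio, for instance) would shift the columns of that row. A minimal sketch of one way around this, assuming Python 3 and the standard csv module: the helper name write_psv and the sample rows are hypothetical and are not part of the original gist.

```python
import csv
import io


def write_psv(rows, out):
    """Write rows (lists of strings) to `out` as pipe-separated values.

    csv.writer quotes any field that contains the delimiter, so a
    literal '|' inside a value does not break the column layout.
    """
    writer = csv.writer(out, delimiter='|', lineterminator='\n')
    for row in rows:
        writer.writerow(row)


# Hypothetical sample data; the second bio contains a literal pipe.
rows = [
    ['Ada', 'Engineer', 'ada@example.com', 'Writes code'],
    ['Bob', 'Editor', '', 'Likes a|b testing'],
]

buf = io.StringIO()
write_psv(rows, buf)
psv_text = buf.getvalue()
print(psv_text, end='')
```

Reading the file back with csv.reader(..., delimiter='|') should recover the original fields, including the quoted one. The trade-off is that consumers that split naively on '|' would need to understand the quoting.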