@rajasankar
Created June 10, 2015 07:02
This code extracts the text from an HTML file after removing the elements that carry specific class attributes ('pno' and 'subhead').
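from bs4 import BeautifulSoup

inputfilename = 'file.html'

# Read and parse the local HTML file
with open(inputfilename) as f:
    soup = BeautifulSoup(f, 'html.parser')

# Re-parse the prettified markup so that each text node sits on its own line
soup = BeautifulSoup(soup.prettify(), 'html.parser')

# Remove every element whose class is 'pno' or 'subhead'
for cls in ('pno', 'subhead'):
    for t in soup.find_all(attrs={'class': cls}):
        t.extract()

# Print the remaining text nodes, stripped of surrounding whitespace and tabs
for s in soup.find_all(string=True):
    s = s.strip().replace('\t', '')
    print(s)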
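
The same idea can be sketched without any third-party dependency, which may be handy for trying it on a small string. This is not the gist's method: it swaps BeautifulSoup for the standard library's html.parser, and the names ClassFilteredText, extract_text, VOID_TAGS and the sample markup are hypothetical, made up for illustration. It assumes reasonably well-formed markup (every non-void tag is closed) and, unlike the code above, it drops empty text nodes instead of printing blank lines.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class ClassFilteredText(HTMLParser):
    """Collect text nodes, skipping everything inside an excluded element."""

    def __init__(self, skip_classes):
        super().__init__()
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # A class attribute may hold several space-separated names, or no value.
        classes = (dict(attrs).get('class') or '').split()
        if self.skip_depth or self.skip_classes.intersection(classes):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip().replace('\t', '')
        if text and not self.skip_depth:
            self.lines.append(text)


def extract_text(html, skip_classes=('pno', 'subhead')):
    """Return the text nodes of html that lie outside the excluded classes."""
    parser = ClassFilteredText(skip_classes)
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical sample input, not taken from the gist.
sample = ('<div><span class="pno">12</span><h2 class="subhead">Intro</h2>'
          '<p>First paragraph.</p><p>\tSecond <b>one</b>.</p></div>')
print(extract_text(sample))
```

With the sample above, the page number and the subhead are skipped and the remaining text comes back as ['First paragraph.', 'Second', 'one', '.'], one entry per text node. Returning a list rather than printing makes the result easy to join, filter, or write to a file.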