@ktibb
Created March 5, 2012 05:37
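import re
import urllib

import BeautifulSoup  # BeautifulSoup 3 (Python 2)

html = urllib.urlopen('http://www.oprah.com/relationships/What-Kind-of-Woman-Watches-Porn-Researchers-Find-Answers').read()
soup = BeautifulSoup.BeautifulSoup(html)

# Restrict the search to the article body instead of the whole page.
#texts = soup.findAll(text=True)
body = soup.find("div", {"class": "arial14"})
# Collect text nodes only, so every item supports string methods;
# fall back to an empty list if the div is missing.
texts = body.findAll(text=True) if body is not None else []


def visible(element):
    """Return True for text nodes a reader would actually see."""
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # re.DOTALL lets the pattern match comments that span several lines.
    elif re.match('<!--.*-->', str(element), re.DOTALL):
        return False
    return True


visible_texts = filter(visible, texts)
print visible_texts
for line in visible_texts:
    if line not in ['\n', ' <br /> ', '']:
        print "-----"
        # strip() returns a new string, so keep the result.
        line = line.strip(";:#-?.,")
        print line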
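
For readers on Python 3 without BeautifulSoup, the same "visible text only" idea can be sketched with the standard library's `html.parser`. This is a rough equivalent under stated assumptions, not the gist's own method: `VisibleTextParser`, `visible_text`, and the sample HTML are hypothetical names and data introduced for illustration, and the sketch does not restrict itself to the `arial14` div the way the script above does.

```python
from html.parser import HTMLParser

# Tags whose text content a browser does not render as page text.
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside a hidden tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # number of hidden tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments go to handle_comment, so they never arrive here.
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text chunks of an HTML string, in document order."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample page, used only to exercise the sketch.
sample = (
    "<html><head><title>T</title><style>p {}</style></head>"
    "<body><!-- note --><div class='arial14'>Hello<br />world.</div>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(sample))  # prints ['Hello', 'world.']
```

Tracking a depth counter rather than inspecting each node's parent is a design choice forced by the streaming parser: it never builds a tree, so "is this text inside a hidden tag" has to be remembered as the tags open and close.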