The following code scrapes content from websites and extracts just the words, which is useful for feeding web content into other processes.
This can be accomplished in a three-step process.
- Get the raw HTML content using the `requests` library.
- Feed the `.text` result of step one into `BeautifulSoup` and extract the text with `.get_text()`. This strips all the HTML from the content and returns an unstructured string.
- The returned string needs some heavy sanitization.
  - Strip the blank lines with a Python filter: `lines = filter(lambda x: x.strip(), text.splitlines())`. In Python 3 this returns a lazy iterator, so wrap it in `list()` if you need a list.
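
The three steps above can be sketched end to end as follows. This is a minimal sketch rather than a definitive implementation: the function names (`fetch_html`, `html_to_text`, `clean_lines`, `words_from_html`), the 10-second timeout, the choice of the built-in `html.parser` backend, and the final whitespace split into words are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: download the raw HTML for a page (hypothetical helper)."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def html_to_text(html):
    """Step 2: strip the markup and return one unstructured string."""
    # "html.parser" ships with Python; another parser could be swapped in.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()


def clean_lines(text):
    """Step 3: drop blank and whitespace-only lines, as in the filter above."""
    return list(filter(lambda x: x.strip(), text.splitlines()))


def words_from_html(html):
    """Combine steps 2 and 3, then split the remaining lines on whitespace."""
    lines = clean_lines(html_to_text(html))
    return " ".join(lines).split()
```

With a real page, the pieces would chain together as `words_from_html(fetch_html(url))`. Keeping the download separate from the parsing means the text-cleaning steps can be tried on a saved HTML string without touching the network.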