@codekiln
Last active March 16, 2018 08:44
get_text_from_html using beautifulsoup; skip comments in style tags from MS Word
from bs4 import BeautifulSoup


def get_text_from_html(html_str):
    r"""
    Given a string of HTML, return its text content,
    removing HTML tags and style artifacts.

    This function works around an issue with HTML pasted from Word:
    <style> tags can contain HTML comments that BeautifulSoup 4
    doesn't skip over when calling get_text().

    It also collapses each run of adjacent whitespace to a single space;
    \r\n[space][tab][space][space] would become [space].

    :param html_str: string of HTML
    :return: text string with whitespace runs collapsed to one space
    """
    # Name the parser explicitly so the result does not depend on which
    # parsers are installed (and to avoid bs4's "no parser specified" warning).
    soup = BeautifulSoup(html_str, "html.parser")
    # Remove every <style> element, including any comments inside it.
    for style in soup.find_all("style"):
        style.extract()
    text = soup.get_text()
    if text:
        # split() with no arguments splits on any run of whitespace.
        return " ".join(text.split())
    return ""
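For comparison, here is a minimal sketch of the same idea using only the standard library's `html.parser`, in case BeautifulSoup is not available. This is not part of the original gist: the names `StyleSkippingTextExtractor` and `get_text_stdlib` and the sample input are hypothetical, chosen for illustration. It relies on `HTMLParser` delivering the contents of a `<style>` element to `handle_data`, so a flag is enough to drop them.

```python
from html.parser import HTMLParser


class StyleSkippingTextExtractor(HTMLParser):
    """Collect text nodes, ignoring everything inside <style> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Text inside <style> (including Word's "<!-- ... -->" wrappers)
        # arrives here as plain data, so skip it while the flag is set.
        if not self._in_style:
            self._chunks.append(data)

    def text(self):
        # Concatenate the chunks, then collapse whitespace runs to one space.
        return " ".join("".join(self._chunks).split())


def get_text_stdlib(html_str):
    """Hypothetical stdlib-only counterpart to get_text_from_html."""
    parser = StyleSkippingTextExtractor()
    parser.feed(html_str)
    parser.close()
    return parser.text()


# Hypothetical sample resembling HTML pasted from Word.
sample = (
    "<html><head><style><!-- p.MsoNormal {margin: 0in;} --></style></head>"
    "<body><p>Hello,\r\n \t  world</p>\n<p>Second   paragraph</p></body></html>"
)
print(get_text_stdlib(sample))  # prints: Hello, world Second paragraph
```

As with `get_text()` called without a separator, text from adjacent elements is concatenated directly, so words are only separated where the markup itself contains whitespace between them.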