@morrisalp
Created April 7, 2020 11:45
Sane text extraction from an HTML string, using BeautifulSoup
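from bs4 import BeautifulSoup as bs


def html2text(html):
    # Requires the lxml parser to be installed.
    soup = bs(html, features='lxml')
    # Drop script and style elements so their contents do not leak into the text.
    for script in soup(["script", "style"]):
        script.decompose()
    # Turn explicit line breaks into newline characters.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    return soup.get_text(separator=' ').strip()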
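A minimal usage sketch, to show what the function does with a small document. The sample markup and the `parser` argument are assumptions added for illustration: the gist hard-codes `lxml`, while this sketch defaults to the standard-library `html.parser` so it runs without lxml installed.

```python
from bs4 import BeautifulSoup


def html2text(html, parser="html.parser"):
    # `parser` is a hypothetical extra argument; the gist itself always uses 'lxml'.
    soup = BeautifulSoup(html, features=parser)
    # Remove script and style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Replace each <br> with a newline character.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    return soup.get_text(separator=" ").strip()


# Made-up sample input, not taken from the gist.
sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>alert(1)</script></head>"
    "<body><p>Hello<br>world</p></body></html>"
)

text = html2text(sample)
print(repr(text))
```

The script and style contents should be absent from the result, and the `<br>` should survive as a newline between the two words. Because `get_text` is called with `separator=' '`, a space also appears on each side of that newline; collapse whitespace afterwards if that matters for your use.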