@ram0973
Created May 2, 2018 08:40
bs4 parse tree example
from bs4 import BeautifulSoup, Comment
SKIP_TAGS = ['style', 'script', 'meta', 'code', 'pre']
html = """
"Hello, world!"<span class="black">
<div class="c1">division
<p>"Hello - this is me.
(c) passage in division"
<b>"bold in passage "</b>
</p>
My phone:
(+7) 999-999-99-99
</div>
<!-- Comment -->
<pre>It's a pre.</pre>
"""
def parse_HTML(html_text):
    """Print each non-comment text node, then return the prettified document."""
    if not html_text:
        return None
    soup = BeautifulSoup(html_text, 'html.parser')
    # Collect every text node except HTML comments.
    tags_texts = soup.find_all(string=lambda txt: not isinstance(txt, Comment))
    for tag_text in tags_texts:
        # Skip bare newlines and text whose direct parent is in SKIP_TAGS.
        if not tag_text or tag_text == '\n' or tag_text.find_parent().name in SKIP_TAGS:
            continue
        print(tag_text)
    return soup.prettify()


if __name__ == "__main__":
    print(parse_HTML(html))
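
A possible variation on the snippet above, offered as a sketch rather than as part of the original gist: it returns the text nodes as a list instead of printing them, strips surrounding whitespace, and checks every ancestor (not only the direct parent) against the skip list, so text nested deeper inside a `pre` or `code` element is also ignored. The names `visible_strings` and `sample` are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup, Comment

SKIP_TAGS = {'style', 'script', 'meta', 'code', 'pre'}


def visible_strings(html_text, skip_tags=SKIP_TAGS):
    """Return stripped text nodes, ignoring comments and skipped tags.

    Hypothetical helper: not part of the original gist.
    """
    soup = BeautifulSoup(html_text or '', 'html.parser')
    result = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue
        # Check every ancestor, not only the direct parent.
        if any(parent.name in skip_tags for parent in node.parents):
            continue
        text = node.strip()
        if text:  # drop whitespace-only strings
            result.append(text)
    return result


# Illustrative input only; the <b> inside <pre> is skipped via its ancestor.
sample = ('<div>division<p>passage <b>bold</b></p></div>'
          '<!-- c --><pre><b>kept out</b></pre>')
print(visible_strings(sample))
```

Returning a list keeps the filtering separate from the output, which makes the function easier to reuse and to test.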