@tkmru
Created July 19, 2013 19:40
This program extracts text from an HTML file.
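# coding: UTF-8
import re


def extract_text(html):
    """Return the text of an HTML string with comments and tags removed."""
    # First, remove comments. Doing this before the tags keeps a ">" inside
    # a comment from ending the match early and leaving debris behind.
    cleaned = re.sub(r"(?s)<!--.*?-->\n?", "", html)
    # Then remove the remaining tags, including tags that span several lines.
    cleaned = re.sub(r"(?s)<.*?>", "", cleaned)
    # Finally, decode the common character entities. "&amp;" goes last so
    # that text such as "&amp;lt;" is not decoded twice.
    cleaned = cleaned.replace("&nbsp;", " ")
    cleaned = cleaned.replace("&quot;", "\"")
    cleaned = cleaned.replace("&lt;", "<")
    cleaned = cleaned.replace("&gt;", ">")
    cleaned = cleaned.replace("&amp;", "&")
    return cleaned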
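
The regex approach above leaves the contents of <script> and <style> elements in the output and only decodes five entities. As a rough sketch of one alternative (not part of the original gist), the standard library's html.parser module can do the same job in Python 3. The names TextExtractor and extract_text_with_parser are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>.

    Hypothetical helper for illustration; not part of the original gist.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True asks the parser to decode entities
        # such as &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text_with_parser(html):
    """Return the visible text of an HTML string (assumed helper name)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = "<p>a &amp; b</p><!-- x > y --><script>var z = 1;</script>"
    print(extract_text_with_parser(sample))
```

One difference from extract_text: the parser decodes &nbsp; to a non-breaking space (U+00A0) rather than to a plain space, so callers that need plain spaces would have to replace that character themselves.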