@bdewilde
Last active December 15, 2015 22:29
basic procedure for cleaning text in preparation for natural language processing
import re

from bs4 import BeautifulSoup


def clean_text(text):
    """Normalize raw text for NLP: strip markup, urls, digits, and non-ascii."""
    # strip html markup; nltk.clean_html raises NotImplementedError in NLTK 3+,
    # which points to BeautifulSoup's get_text() as the replacement
    text = BeautifulSoup(text, 'html.parser').get_text(' ')
    # remove any patterns matching standard url format; this runs before digit
    # removal so that urls containing digits are still matched whole
    url_pattern = r'((http|ftp|https):\/\/)?[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
    text = re.sub(url_pattern, ' ', text)
    # remove digits with regular expression
    text = re.sub(r'\d', ' ', text)
    # remove all non-ascii characters
    text = ''.join(character for character in text if ord(character) < 128)
    # standardize white space
    text = re.sub(r'\s+', ' ', text)
    # drop capitalization
    text = text.lower()
    return text
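
The markup-stripping step relies on a third-party helper. As a minimal sketch, assuming only the Python 3 standard library is available, the same step could be approximated with html.parser. The names _TextExtractor and strip_html are hypothetical and not part of the original gist, and this version will be less forgiving of malformed markup than a dedicated HTML library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script>/<style> bodies."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only text that is not inside a script or style element
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of an html string, with chunks joined by spaces."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(strip_html("<p>Fish &amp; chips</p><script>var x = 1;</script>"))
```

The result of strip_html(text) can then be passed through the url, digit, non-ascii, white space, and lowercase steps unchanged.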