@bera5186
Created December 20, 2019 08:37
Text data pre-processing and cleaning for ML
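from gensim import utils
import gensim.parsing.preprocessing as gsp

# Filters are applied in order; each takes a string and returns a string.
filters = [
    gsp.strip_tags,                  # remove HTML/XML tags
    gsp.strip_punctuation,           # replace punctuation with spaces
    gsp.strip_multiple_whitespaces,  # collapse runs of whitespace
    gsp.strip_numeric,               # remove digits
    gsp.remove_stopwords,            # drop common English stopwords
    gsp.strip_short,                 # drop tokens shorter than 3 characters (default minsize)
    gsp.stem_text,                   # Porter-stem each remaining token
]


def clean_text(s):
    """Lowercase a string and run it through each filter in turn."""
    s = utils.to_unicode(s)  # decode bytes to str before any string operations
    s = s.lower()
    for f in filters:
        s = f(s)
    return s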
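
The snippet above depends on gensim. To make the filter-chain pattern easier to see in isolation, here is a rough standard-library sketch of the same idea. Every function below is a simplified, hypothetical stand-in written for illustration; none of it is gensim's actual implementation, and the stopword and stemming steps are left out on purpose.

```python
import re
from functools import reduce


# Simplified stand-ins for a few of the filters (assumptions, not gensim code).
def strip_tags(s):
    return re.sub(r"<[^>]*>", " ", s)          # replace anything tag-like with a space


def strip_punctuation(s):
    return re.sub(r"[^\w\s]", " ", s)          # replace non-word, non-space characters


def strip_numeric(s):
    return re.sub(r"\d+", "", s)               # delete runs of digits


def strip_multiple_whitespaces(s):
    return re.sub(r"\s+", " ", s).strip()      # collapse whitespace and trim the ends


def strip_short(s, minsize=3):
    return " ".join(w for w in s.split() if len(w) >= minsize)


SKETCH_FILTERS = [
    strip_tags,
    strip_punctuation,
    strip_numeric,
    strip_multiple_whitespaces,
    strip_short,
]


def apply_filters(s, filters=SKETCH_FILTERS):
    """Lowercase s, then feed it through each filter from left to right."""
    return reduce(lambda text, f: f(text), filters, s.lower())


print(apply_filters("<b>Hello</b>, World 42 is ok!"))
```

Because each filter maps a string to a string, the list order is the processing order, and a step can be added, removed, or moved without touching the loop itself.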