@BernardOng
Created August 22, 2016 00:15
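import re
import string

import nltk
from nltk.corpus import stopwords

# Methods of a text-cleaning class; the class itself, including
# remove_non_ascii, is not shown in this gist.
def cleanupTitle(self, s):
    # Lowercase, strip digits, tokenize, and drop single-character punctuation tokens
    stopset = set(stopwords.words('english'))
    punctuations = set(string.punctuation)
    tokens = [i for i in nltk.word_tokenize(re.sub(r'\d+', '', s.lower())) if i not in punctuations]
    # remove stopwords
    cleanup = " ".join(word for word in tokens if word not in stopset)
    cleanup = self.remove_non_ascii(cleanup)
    # Strip ellipses, possessive 's, tokenizer quote marks (`` and ''), hyphens, and stray apostrophes
    cleanup = cleanup.replace('...', '')
    cleanup = cleanup.replace("'s", '')
    cleanup = cleanup.replace("''", '')
    cleanup = cleanup.replace("``", '')
    cleanup = cleanup.replace("-", '')
    cleanup = cleanup.replace("'", '')
    return cleanup

def cleanTokens(self, s):
    # Re-tokenize the cleaned title into a list of words
    return nltk.word_tokenize(self.cleanupTitle(s))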
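
cleanupTitle calls self.remove_non_ascii, which is not included in this gist. The block below is a minimal sketch of what such a helper might look like, assuming it simply drops every character outside the ASCII range. The body is a guess, not the author's code, and it is written as a plain function rather than a method so it can be tried on its own.

```python
def remove_non_ascii(text):
    """Hypothetical stand-in for the helper the gist references.

    Assumption: non-ASCII characters are dropped, not transliterated.
    """
    # Encode to ASCII, silently skipping anything that cannot be represented,
    # then decode back to a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    print(remove_non_ascii("naïve café"))
```

Dropping characters is the simplest choice. If accented letters should map to their base letters instead, unicodedata.normalize('NFKD', text) could be applied before encoding.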