@sharma-ji
Created July 16, 2018 06:17
Pipeline for text cleaning
# Split text into words (`text` is assumed to hold the raw input string)
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# Convert words to lower case
tokens = [w.lower() for w in tokens]
# Remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# Remove tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# Filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
# Stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
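
The steps above can also be wrapped into a single function. This is only a minimal sketch: `clean_text` and its parameters are hypothetical names, not part of the gist or of NLTK. The defaults (`str.split`, an empty stop-word set, no stemmer) are stand-ins so the block runs without NLTK data; passing `word_tokenize`, `set(stopwords.words('english'))` and `PorterStemmer().stem` should approximate the pipeline above.

```python
import string

def clean_text(text, tokenize=str.split, stop_words=frozenset(), stem=None):
    """Hypothetical helper: lower-case, strip punctuation, filter, then stem."""
    # Split into tokens and normalise case
    tokens = [w.lower() for w in tokenize(text)]
    # Delete every ASCII punctuation character from each token
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Keep purely alphabetic tokens that are not stop words
    words = [w for w in stripped if w.isalpha() and w not in stop_words]
    # Stem the filtered words (not the raw tokens) when a stemmer is supplied
    return [stem(w) for w in words] if stem else words

print(clean_text("The Cats, are RUNNING!", stop_words={"the", "are"}))
```

Keeping the tokenizer, stop-word set and stemmer as arguments makes each stage easy to swap or test in isolation.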