@Navjotbians
Created May 22, 2021 19:41
Process text
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Note: the NLTK tokenizer, stopwords and WordNet data must be downloaded
# beforehand with nltk.download(). clean() is not defined in this snippet
# and has to be provided elsewhere.

def process_txt(input, stemm=False, lemm=True):
    ### Clean input data
    processed_text = clean(input)
    ### Tokenization
    processed_text = word_tokenize(processed_text)
    ### Remove stop words (build the set once instead of reloading the list for every word)
    stop_words = set(stopwords.words('english'))
    processed_text = [word for word in processed_text if word not in stop_words]
    ### Stemming
    if stemm:
        ps = nltk.stem.porter.PorterStemmer()
        processed_text = [ps.stem(word) for word in processed_text]
    ### Lemmatization
    if lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        processed_text = [lem.lemmatize(word) for word in processed_text]
    ### Join the tokens back into a single string
    text = " ".join(processed_text)
    return text
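
The function above calls clean(), which this gist does not define. The block below is a minimal sketch of what such a helper might look like, using only the standard library. The steps it performs (lowercasing, dropping URLs, replacing punctuation and digits, collapsing whitespace) are assumptions for illustration, not the author's actual implementation.

```python
import re
import string


def clean(text):
    """Hypothetical stand-in for the clean() helper used by process_txt."""
    # Normalize case so that stop-word matching is case-insensitive.
    text = text.lower()
    # Replace URLs with a space (assumed cleaning step).
    text = re.sub(r"https?://\S+", " ", text)
    # Replace punctuation and digits with spaces (assumed cleaning step).
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean("Hello, World! Visit https://example.com 123"))
```

Lowercasing before tokenization matters here because the NLTK English stop-word list is lowercase, so a capitalized "The" would otherwise slip through the stop-word filter.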