@shayaf84
Last active February 19, 2022 19:48
# Import NLTK to preprocess the text (tokenize, lemmatize, remove stopwords)
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
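# Download the tokenizer models, WordNet data, and stopword lists
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')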
# Tokenize each title (each title becomes a list of word tokens)
# `data` is assumed to be a pandas DataFrame with a string 'title' column, loaded earlier
data['title'] = data['title'].apply(nltk.word_tokenize)
# Create the lemmatizer (e.g. "walks" becomes "walk")
lemmatizer = WordNetLemmatizer()
# Lemmatize every token in a list (lemmatize() treats words as nouns by default)
def lemma(tokens):
    return [lemmatizer.lemmatize(w) for w in tokens]
# Apply to the tokenized title column
data['title'] = data['title'].apply(lemma)
# Load the English stopwords (it, was, for, etc.) into a set for fast lookups
stop = set(stopwords.words('english'))
# Remove them from the tokenized titles
data['title'] = data['title'].apply(lambda x: [i for i in x if i not in stop])
data.head()
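
The stopword filter above compares tokens exactly as written, and NLTK's English stopword list is all lowercase, so a capitalized token such as "The" would pass through. Below is a minimal sketch of a case-insensitive variant. The `STOP` set is a small hypothetical stand-in for `stopwords.words('english')`, and `remove_stopwords` is an illustrative helper name, not part of NLTK.

```python
# Hypothetical stand-in for stopwords.words('english'), which is all lowercase
STOP = {"the", "it", "was", "for", "a", "of"}

def remove_stopwords(tokens, stop=STOP):
    """Drop tokens whose lowercase form is a stopword; other tokens keep their casing."""
    return [t for t in tokens if t.lower() not in stop]

tokens = ["The", "walk", "was", "long"]
print(remove_stopwords(tokens))  # ['walk', 'long']
```

With the real list, the same idea could be applied to the dataframe as `data['title'].apply(lambda x: [i for i in x if i.lower() not in stop])`, assuming `stop` is defined as in the snippet above.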