@amitrani6
Created October 4, 2019 02:23
Functions for processing raw text
# Import the necessary libraries
import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus; download it once if it is missing
# nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize a list of tokens and return them joined into a single string.
# lemmatize() treats each word as a noun unless a pos argument is given.
def lemmatize_text(tokenized_text):
    return ' '.join(lemmatizer.lemmatize(w) for w in tokenized_text)

# Tie all of the steps together for one file.
# open_file, cleaned_episode, and tokenize are not defined in this gist.
def process_text(file_name):
    raw_episode_text = open_file(file_name)
    clean_episode_text = cleaned_episode(raw_episode_text)
    tokenize_episode_text = tokenize(clean_episode_text)
    lemmatize_episode_text = lemmatize_text(tokenize_episode_text)
    return lemmatize_episode_text

# Apply the pipeline to every file path in the data frame
# (df is assumed to have a file_path column)
df['lemmatize_text'] = df.file_path.apply(process_text)
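
process_text calls three helpers (open_file, cleaned_episode, tokenize) whose bodies do not appear in this gist. The sketch below is one hypothetical way they might look, using only the standard library. The cleaning rules (lowercasing, keeping only letters and apostrophes) and the whitespace tokenizer are assumptions for illustration, not the author's actual implementation.

```python
import re

# Hypothetical helper: read a whole text file into a string (UTF-8 assumed).
def open_file(file_name):
    with open(file_name, encoding="utf-8") as f:
        return f.read()

# Hypothetical helper: lowercase the text, replace everything except
# letters, apostrophes, and whitespace with spaces, then collapse whitespace.
def cleaned_episode(raw_text):
    text = raw_text.lower()
    text = re.sub(r"[^a-z\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical helper: split the cleaned text on whitespace.
def tokenize(clean_text):
    return clean_text.split()
```

With helpers along these lines, tokenize(cleaned_episode(text)) produces the token list that lemmatize_text expects. A tokenizer such as nltk.word_tokenize could be swapped in for the whitespace split if punctuation-aware tokens are needed.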