# Clean the review comments: keep letters only, lower-case, tokenize,
# and drop Portuguese stopwords.
# Assumes review_data is a pandas DataFrame loaded earlier and that
# review_comment_message contains no missing values.
import re
import nltk
from nltk.corpus import stopwords

comments = []
stop_words = set(stopwords.words('portuguese'))
for message in review_data['review_comment_message']:
    # Replace digits, punctuation, and underscores with spaces. Unlike "[^a-zA-Z]",
    # this keeps accented letters (ã, ç, é), so words such as "não" stay intact.
    only_letters = re.sub(r"[\W\d_]+", " ", message)
    tokens = nltk.word_tokenize(only_letters.lower(), language='portuguese')  # lower-case, then tokenize
    filtered_result = [t for t in tokens if t not in stop_words]  # remove stopwords
    comments.append(' '.join(filtered_result))
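
The cleaning steps above can be tried on a single string without the dataset or the NLTK downloads. The following is a minimal sketch, not the gist's own code: `clean_comment` and `SAMPLE_STOP_WORDS` are hypothetical names introduced for illustration, `str.split` stands in for `nltk.word_tokenize`, and the pattern is a Unicode-aware one rather than an ASCII-only character class, so that accented Portuguese letters survive.

```python
import re

# Hypothetical mini stopword list for illustration only; the real pipeline
# would use NLTK's full list via stopwords.words('portuguese').
SAMPLE_STOP_WORDS = {"o", "a", "e", "de", "não", "muito"}


def clean_comment(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep letters only, lower-case, split on whitespace, drop stopwords."""
    # [\W\d_]+ matches runs of non-letters (punctuation, digits, underscores),
    # so accented letters such as "Ã" are preserved.
    only_letters = re.sub(r"[\W\d_]+", " ", text)
    tokens = only_letters.lower().split()  # stand-in for nltk.word_tokenize
    return " ".join(t for t in tokens if t not in stop_words)


sample = "Produto NÃO chegou, 2 dias de atraso!"
cleaned = clean_comment(sample)
print(cleaned)

# For comparison, an ASCII-only class splits accented words apart:
ascii_only = re.sub("[^a-zA-Z]", " ", "não")
print(repr(ascii_only))
```

With an ASCII-only class, "não" becomes "n o" and no longer matches the stopword entry, which is why the sketch uses the Unicode-aware pattern.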