@Johnne32
Created November 30, 2020 07:31
# Clean the review comments: keep only letters, tokenize, lowercase, and
# remove Portuguese stopwords so the text is consistent for later analysis.
import re
import nltk
from nltk.corpus import stopwords

# Assumes review_data is a pandas DataFrame and that the NLTK 'stopwords'
# and 'punkt' resources have already been downloaded.
comments = []
stop_words = set(stopwords.words('portuguese'))
for message in review_data['review_comment_message'].dropna():  # skip missing comments
    # [^a-zA-Z] would also strip accented letters (e.g. "não" -> "n o"),
    # so replace anything that is not a Unicode letter instead.
    only_letters = re.sub(r"[\W\d_]+", " ", message)
    tokens = nltk.word_tokenize(only_letters)  # tokenize the sentence
    lower_case = [token.lower() for token in tokens]  # convert all letters to lower case
    filtered_result = [token for token in lower_case if token not in stop_words]  # remove stopwords
    comments.append(' '.join(filtered_result))
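
The cleaning steps can be tried on a couple of sample comments without the NLTK data downloads. This is a minimal sketch under stated assumptions: the function name `clean_comment`, the tiny stopword set, and the sample strings are illustrative and not part of the original gist, and `str.split` stands in for `nltk.word_tokenize`.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('portuguese').
SAMPLE_STOP_WORDS = {"o", "a", "e", "de", "do", "da", "que", "em", "muito"}


def clean_comment(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    only_letters = re.sub(r"[\W\d_]+", " ", text)  # non-letters become spaces
    tokens = only_letters.lower().split()          # stand-in for nltk.word_tokenize
    return " ".join(t for t in tokens if t not in stop_words)


# Illustrative sample comments (not taken from the dataset).
samples = [
    "Recebi o produto antes do prazo!!!",
    "Não recomendo, chegou com 2 dias de atraso.",
]
cleaned = [clean_comment(s) for s in samples]
print(cleaned)
```

Because the pattern matches on Unicode letters rather than `a-zA-Z`, accented words such as "Não" survive intact. The full NLTK Portuguese list is much larger than this sample set, so the real pipeline will drop more words.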