@makispl
Created April 11, 2020 22:13
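import pandas as pd

# Create the corpus (assumes each SMS is already a list of words,
# so summing the column concatenates those lists)
corpus = training_set['SMS'].sum()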
# Create the vocabulary
temp_set = set(corpus)
vocabulary = list(temp_set)
# Create a dictionary
len_training_set = len(training_set['SMS'])
word_counts_per_sms = {unique_word: [0] * len_training_set for unique_word in vocabulary}
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
# Convert to dataframe
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()
# Concatenate with the original training set
# (pd.concat aligns rows on the index, so training_set needs a default 0..n-1 index)
training_set_final = pd.concat([training_set, word_counts], axis=1)
training_set_final.head()
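
The same word-count construction can be sketched on a tiny, made-up dataset to show the shape of the result. The names `toy_set`, `toy_vocabulary`, `toy_counts` and `toy_final`, and the three messages, are hypothetical and not part of the original gist. The sketch assumes each SMS has already been cleaned and split into a list of words.

```python
import pandas as pd

# Hypothetical toy data: every SMS is already a list of words
toy_set = pd.DataFrame({
    "Label": ["ham", "spam", "ham"],
    "SMS": [["hello", "there"], ["win", "cash", "win"], ["hello", "cash"]],
})

# Vocabulary: every unique word across all messages
# (sorted here only to get a stable column order)
toy_vocabulary = sorted({word for sms in toy_set["SMS"] for word in sms})

# For each word, one counter per message, all starting at zero
toy_counts = {word: [0] * len(toy_set) for word in toy_vocabulary}
for index, sms in enumerate(toy_set["SMS"]):
    for word in sms:
        toy_counts[word][index] += 1

# One row per message, one extra column per vocabulary word
toy_final = pd.concat([toy_set, pd.DataFrame(toy_counts)], axis=1)
print(toy_final)
```

Building the vocabulary with a set comprehension avoids summing the lists one by one, which may be noticeably faster on a large training set. This is a design choice of the sketch, not something the gist does.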