@chricke
Last active April 18, 2019 10:31
Preprocess the data
from collections import Counter

# Map each punctuation symbol to a placeholder token so that it survives
# text.split() as its own "word".
tokenized_punctuation = {
    '.': '||Period||',
    ',': '||Comma||',
    '"': '||Quotation_Mark||',
    ';': '||Semicolon||',
    '!': '||Exclamation_Mark||',
    '?': '||Question_Mark||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '--': '||Dash||',
    '\n': '||Return||'
}
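# clean_text is assumed to be a list of strings defined earlier (not shown here).
text = "\n".join(clean_text)
for key, token in tokenized_punctuation.items():
    # Pad the token with spaces so split() separates it from adjacent words.
    text = text.replace(key, ' {} '.format(token))
# Lowercasing after the replacement also lowercases the tokens (e.g. '||period||').
text = text.lower()
text = text.split()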
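# Build the vocabulary ordered by frequency, so the most common word gets id 0.
word_counts = Counter(text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
# Lookup tables in both directions: id -> word and word -> id.
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
# Encode the whole text as a list of integer ids.
int_text = [vocab_to_int[word] for word in text]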
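The steps above can be wrapped in a function and tried on a tiny input. This is only a minimal sketch: the function name `preprocess`, the shortened token map, and the sample lines are made up for illustration and are not part of the gist. It also lowercases the text before inserting the tokens, a small deviation from the snippet above, so the tokens keep their original casing.

```python
from collections import Counter

# Hypothetical, shortened token map; the gist's full dict could be passed instead.
PUNCTUATION_TOKENS = {
    '.': '||Period||',
    ',': '||Comma||',
    '!': '||Exclamation_Mark||',
    '\n': '||Return||',
}


def preprocess(lines, token_map=PUNCTUATION_TOKENS):
    """Return (int_text, vocab_to_int, int_to_vocab) for a list of strings."""
    # Lowercase first so the placeholder tokens are not lowercased (assumption:
    # this differs from the snippet above, which lowercases afterwards).
    text = "\n".join(lines).lower()
    for symbol, token in token_map.items():
        text = text.replace(symbol, ' {} '.format(token))
    words = text.split()

    # Most frequent word gets id 0.
    counts = Counter(words)
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    int_text = [vocab_to_int[word] for word in words]
    return int_text, vocab_to_int, int_to_vocab


# Made-up sample input.
sample_lines = ["Hello, world!", "Hello again."]
int_text, vocab_to_int, int_to_vocab = preprocess(sample_lines)

# Decoding the ids gives back the tokenized words.
decoded = [int_to_vocab[ii] for ii in int_text]
print(int_text)
print(decoded)
```

Mapping the ids back through `int_to_vocab` is a quick sanity check that the two lookup tables are inverses of each other.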