
@amankharwal
Created December 20, 2020 10:49
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization: build the word index from the corpus
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1

    ## convert each line to a token sequence, then expand it into
    ## all n-gram prefixes (length 2 up to the full sequence)
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
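To see what the n-gram expansion produces without needing TensorFlow installed, here is a minimal pure-Python sketch of the same logic. The tiny `corpus` and the hand-rolled `word_index` below are made-up stand-ins for the Keras `Tokenizer` (which would order its index by word frequency instead of first appearance):

```python
# Made-up example corpus; the gist assumes `corpus` is defined elsewhere.
corpus = ["the cat sat", "the cat ran"]

# Hand-rolled 1-based word index, standing in for Tokenizer.word_index
# (insertion order here rather than frequency order, for simplicity).
word_index = {}
for line in corpus:
    for word in line.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

def expand_ngrams(corpus, word_index):
    # For each line, emit every prefix of length 2..len(tokens),
    # mirroring the loop in get_sequence_of_tokens above.
    sequences = []
    for line in corpus:
        tokens = [word_index[w] for w in line.lower().split()]
        for i in range(1, len(tokens)):
            sequences.append(tokens[:i + 1])
    return sequences

print(expand_ngrams(corpus, word_index))
# → [[1, 2], [1, 2, 3], [1, 2], [1, 2, 4]]
```

Each sequence's last token becomes the prediction target once the sequences are padded, which is why every prefix of every line is kept.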