# Count how many distinct words are present in the corpus
import nltk  # requires the 'punkt' tokenizer models: nltk.download('punkt')

word_dict = {}
for doc in corpus:
    words = nltk.word_tokenize(doc)
    for word in words:
        if word not in word_dict:
            word_dict[word] = 1
        else:
            word_dict[word] += 1
len(word_dict)
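
# As a quick sanity check, the same vocabulary size can be computed with
# collections.Counter. A minimal equivalent sketch, assuming `corpus` is the
# same list of document strings used above:
from collections import Counter

word_counts = Counter(word for doc in corpus for word in nltk.word_tokenize(doc))
len(word_counts)  # should equal len(word_dict)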

# Tokenise the texts: build a word-to-integer index from the corpus and
# convert each document into a sequence of integer word indices
from tensorflow import keras  # or standalone: import keras

tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(corpus)
corpus_tokens = tokenizer.texts_to_sequences(corpus)
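
# The resulting integer sequences vary in length, so before feeding them to a
# model they are typically padded to a common length. A minimal sketch: the
# padding length and the `padded`/`vocab_size` names are illustrative, not
# part of the original gist.
max_len = max(len(seq) for seq in corpus_tokens)
padded = keras.preprocessing.sequence.pad_sequences(
    corpus_tokens, maxlen=max_len, padding='post')
vocab_size = len(tokenizer.word_index) + 1  # index 0 is reserved for padding
print(padded.shape, vocab_size)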