Adding padding to the tokenization process
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Custom sentences to tokenize
sentences = [
    "Apples are red",
    "Apples are round",
    "Oranges are round",
    "Grapes are sour, oranges are sweet"
]

# Tokenize the sentences
myTokenizer = Tokenizer(num_words=100)
myTokenizer.fit_on_texts(sentences)
sequences = myTokenizer.texts_to_sequences(sentences)
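# Note (not in the original gist): words never seen by fit_on_texts are
# silently dropped when new text is converted; passing oov_token="<OOV>" to
# Tokenizer would map them to a reserved index instead. For example, "and"
# below is absent from the word index, so it simply disappears.
test_seq = myTokenizer.texts_to_sequences(["Apples are sweet and sour"])
print("\nUnseen-word Sequence = ", test_seq)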
# Pad every sequence to the length of the longest sentence (6 words);
# shorter sequences are filled with zeros so the batch becomes a rectangular array
padded = pad_sequences(sequences, maxlen=len(sentences[3].split(" ")))

# Display the output; index 0 is reserved for padding and maps to no word,
# so it is skipped when decoding the padded sequences back to text
print("\nWord Index = ", myTokenizer.word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)
print("\nOriginal Sentences: \n", [x for x in myTokenizer.sequences_to_texts_generator(padded)])
# Pre- vs. post-padding: pad_sequences pads at the front ("pre") by default
padded = pad_sequences(sequences)
print("\nPre Padded Sequences:")
print(padded)

# padding="post" appends the zeros after the sequence instead
padded = pad_sequences(sequences, padding="post")
print("\nPost Padded Sequences:")
print(padded)
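A natural next step, not shown in the original gist, is to feed the padded batch into an Embedding layer. A minimal sketch, reusing the variables above with an illustrative embedding width; mask_zero=True tells mask-aware downstream layers to ignore the padding index 0:

import tensorflow as tf

# Highest word index + 1, so that index 0 stays reserved for padding
vocab_size = len(myTokenizer.word_index) + 1

model = tf.keras.Sequential([
    # output_dim=8 is an illustrative embedding width, not from the gist
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),  # averages only the unmasked positions
])
print(model(padded).shape)  # (4, 8): one 8-dim vector per sentence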