@Shivam-316
Created November 8, 2020 08:44
This function wraps target sentences in <sos>/<eos> markers when needed, fits a tokenizer on the text, and converts the sentences into padded integer sequences (tensors).
import numpy as np
import tensorflow as tf

def preprocess_and_tokenize(language, vocab_size, oov_token, is_input=False, is_output=False):
    """Tokenizes a list of sentences into padded integer sequences and returns (tensor, tokenizer)."""
    if is_output:
        # Wrap target sentences with start/end-of-sequence markers.
        lang = np.array(['<sos> ' + text + ' <eos>' for text in language])
    else:
        lang = language
    # Keep only the vocab_size most frequent words; map everything else to oov_token.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(lang)
    tensor = tokenizer.texts_to_sequences(lang)
    if is_output:
        # Pad target sequences at the end so every sequence still starts with <sos>.
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post', value=0)
    if is_input:
        # Pad input sequences at the front, then reverse them (a common seq2seq
        # trick that places the start of the source sentence next to the decoder).
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='pre', value=0)
        tensor = tensor[:, ::-1]
    return tensor, tokenizer
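As a quick sanity check, the function can be called separately on the source and target sides of a corpus. The toy sentence lists, vocabulary size, and the '<unk>' OOV marker below are illustrative placeholders, not values from the original gist.

# Hypothetical toy corpus; any parallel lists of source/target sentences work.
input_sentences  = ['how are you', 'i am fine']
target_sentences = ['como estas', 'estoy bien']

input_tensor, input_tokenizer = preprocess_and_tokenize(
    input_sentences, vocab_size=5000, oov_token='<unk>', is_input=True)
target_tensor, target_tokenizer = preprocess_and_tokenize(
    target_sentences, vocab_size=5000, oov_token='<unk>', is_output=True)

# Each tensor has shape (num_sentences, max_sequence_length) for its side.
print(input_tensor.shape, target_tensor.shape)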