@phisad
Created March 15, 2019 09:12
How to encode and pad texts for machine learning using Keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import hashing_trick


def preprocessing(questions, questions_max_length, vocabulary_size):
    """
    Stateless preprocessing of text questions: each question is hash-encoded
    into a sequence of integer word indices and padded to the maximal question
    length. The hash space is 1.3 times the vocabulary size to reduce hash
    collisions. Padding zeros are appended at the end of each question
    ("post" padding).

    @param questions: the text questions as a list of strings
    @param questions_max_length: the (global) maximal question length in words
    @param vocabulary_size: the (global) number of known words
    @return: the encoded and padded questions as a 2-D array
    """
    encoded_questions = [hashing_trick(question,
                                       round(vocabulary_size * 1.3),
                                       hash_function="md5")
                         for question in questions]
    return pad_sequences(encoded_questions,
                         maxlen=questions_max_length,
                         padding="post")
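To see what the two Keras helpers are doing without pulling in TensorFlow, here is a minimal, dependency-free sketch. It assumes whitespace tokenization and mirrors the md5-based hashing trick (map each word into the range 1..n-1, reserving 0 for padding) followed by "post" padding; the function and variable names are illustrative, not part of the Keras API.

```python
import hashlib


def hash_word(word, hash_space):
    # md5-based hashing trick: map a word to an integer in [1, hash_space - 1],
    # keeping index 0 free for the padding value
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (hash_space - 1) + 1


def preprocess_sketch(questions, max_length, hash_space):
    # encode each question as a list of hashed word indices
    encoded = [[hash_word(w, hash_space) for w in q.lower().split()]
               for q in questions]
    # truncate to max_length, then append zeros at the end ("post" padding)
    return [q[:max_length] + [0] * (max_length - len(q[:max_length]))
            for q in encoded]


padded = preprocess_sketch(["what color is the cat", "is it raining"],
                           max_length=6, hash_space=130)
```

Every row now has the same length, with zeros trailing the shorter question, which is exactly the shape a Keras embedding layer (with `mask_zero=True`) expects.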