@eileen-code4fun
Created January 21, 2022 06:03
Translation Preprocessing
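import tensorflow as tf
import tensorflow_text as tf_text


def standardize(text):
    # Split accented characters into a base letter plus a combining mark (NFKD).
    text = tf_text.normalize_utf8(text, 'NFKD')
    text = tf.strings.lower(text)
    # Keep space, a to z, and select punctuation; this also drops the combining marks.
    text = tf.strings.regex_replace(text, '[^ a-z.?!,¿]', '')
    # Add spaces around punctuation so each mark becomes its own token.
    text = tf.strings.regex_replace(text, '[.?!,¿]', r' \0 ')
    # Strip leading and trailing whitespace.
    text = tf.strings.strip(text)
    # Wrap each sentence in start and end tokens.
    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text


# eng_dataset and spa_dataset are assumed to be tf.data.Dataset objects of raw strings defined elsewhere.
# Build a vocabulary of at most 5000 tokens per language.
eng_text_processor = tf.keras.layers.TextVectorization(standardize=standardize, max_tokens=5000)
spa_text_processor = tf.keras.layers.TextVectorization(standardize=standardize, max_tokens=5000)
eng_text_processor.adapt(eng_dataset.batch(128))
spa_text_processor.adapt(spa_dataset.batch(128))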
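
To see what standardize does to a single sentence without loading TensorFlow, the same steps can be sketched with the Python standard library. This is an illustrative approximation, not part of the gist: the name standardize_py is hypothetical, and Python's str.lower and re module may differ from tf.strings.lower and RE2 on unusual Unicode input.

```python
import re
import unicodedata


def standardize_py(text: str) -> str:
    """Approximate, pure-Python mirror of the TensorFlow standardize() steps."""
    # Split accented characters into base letter + combining mark.
    text = unicodedata.normalize('NFKD', text)
    text = text.lower()
    # Keep space, a to z, and select punctuation (drops the combining marks).
    text = re.sub(r'[^ a-z.?!,¿]', '', text)
    # Add spaces around punctuation; \g<0> is Python's spelling of RE2's \0.
    text = re.sub(r'[.?!,¿]', r' \g<0> ', text)
    text = text.strip()
    return ' '.join(['[START]', text, '[END]'])


print(standardize_py('¿Todavía está en casa?'))
# [START] ¿ todavia esta en casa ? [END]
```

Punctuation that is already followed by a space (for example a comma mid-sentence) ends up with two consecutive spaces. That should be harmless here because TextVectorization splits on whitespace by default.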