@noahtren
Last active March 23, 2023 13:13
HuggingFace Tokenizer -> TF.Text
import tensorflow as tf
import tensorflow_text as text
from transformers import AutoTokenizer


def get_tf_tokenizer(hf_model_name, do_test=False):
    # use_fast=False loads the slow, SentencePiece-backed tokenizer; the fast
    # variants (e.g. AlbertTokenizerFast) do not expose `sp_model`.
    hf_tokenizer = AutoTokenizer.from_pretrained(hf_model_name, use_fast=False)
    model_proto = hf_tokenizer.sp_model.serialized_model_proto()
    tf_tokenizer = text.SentencepieceTokenizer(model=model_proto, out_type=tf.int32)

    if do_test:
        # Compare both tokenizers on a sample string. The TF input is lowercased
        # because the HF ALBERT tokenizer lowercases its input by default.
        test_string = "This is a testtt, hah! reaaly cool :)"
        hf_result = hf_tokenizer.encode(test_string, add_special_tokens=False)
        tf_result = tf_tokenizer.tokenize(tf.strings.lower(test_string))
        assert tf.reduce_all(tf_result == hf_result)

    return tf_tokenizer


if __name__ == "__main__":
    tf_tokenizer = get_tf_tokenizer("albert-base-v2", do_test=True)
@zachmayer

When I run this script, I get AttributeError: 'AlbertTokenizerFast' object has no attribute 'sp_model'

@ayalaall

When I run this script, I get AttributeError: 'RobertaTokenizer' object has no attribute 'sp_model'