@opparco
Created August 18, 2023 11:20
Debug the tokenizer of matsuo-lab/weblab-10b
# Only the tokenizer is needed here; the model class was imported but never used.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("matsuo-lab/weblab-10b")
dot = tokenizer.encode(".")
print(dot)
# [15]
nemureru = tokenizer.encode("眠れる")
print(nemureru)
# [20827, 243, 9345, 5832]
# Decode each token ID on its own and show the resulting UTF-8 bytes.
# print(tokenizer.decode(dot))
for i in dot:
    print(tokenizer.decode([i]).encode('utf-8'))
# b'.'
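# print(tokenizer.decode(nemureru))
for i in nemureru:
    print(tokenizer.decode([i]).encode('utf-8'))
# The first two IDs each decode on their own to U+FFFD (b'\xef\xbf\xbd'),
# the Unicode replacement character, not to valid text.
# b'\xef\xbf\xbd'
# b'\xef\xbf\xbd'
# b'\xe3\x82\x8c'
# b'\xe3\x82\x8b'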
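The output above is consistent with one character being spread over more than one token, so that no single token holds a complete UTF-8 sequence. The sketch below reproduces that effect with the standard library only, without loading the model. The 2 + 1 byte split of the first character is a hypothetical choice for illustration; the actual byte boundaries of IDs 20827 and 243 are not known from this output.

```python
import codecs

text = "眠れる"
raw = text.encode("utf-8")  # 9 bytes, 3 per character

# Hypothetical split: the first character's 3 bytes are divided 2 + 1 to mimic
# a byte-level tokenizer. The real token boundaries are an assumption here.
pieces = [raw[0:2], raw[2:3], raw[3:6], raw[6:9]]

# Decoding each piece alone: an incomplete sequence is replaced by U+FFFD.
alone = [p.decode("utf-8", errors="replace") for p in pieces]
for s in alone:
    print(s.encode("utf-8"))

# An incremental decoder buffers incomplete bytes until the character is complete.
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = [decoder.decode(p) for p in pieces]
print(streamed)
```

Decoding the whole ID list in one call, as in the commented-out `tokenizer.decode(nemureru)` line, avoids the problem for the same reason the incremental decoder does: the bytes are joined before they are interpreted as text.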