@BramVanroy
Created June 15, 2022 13:44
Get original words of tokens in HF Tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

text = "It 's a pre-tokenized , silly sentence !"
words = text.split()

# is_split_into_words=True tells the tokenizer that the input is already a list of words,
# so encoded.word_ids() maps every subword token back to the index of the word it came from.
encoded = tokenizer(words, is_split_into_words=True)

for token, wordid in zip(encoded.tokens(), encoded.word_ids()):
    if wordid is not None:  # special tokens such as [CLS]/[SEP] have no word id
        print(token, words[wordid])
"""
# Output (subword unit - original word)
It It
' 's
s 's
a a
pre pre-tokenized
- pre-tokenized
token pre-tokenized
##ized pre-tokenized
, ,
silly silly
sentence sentence
! !
"""