@xenova
Last active May 10, 2024 00:59
Convert tiktoken tokenizers to the Hugging Face tokenizers format
@david-waterworth

Shouldn't 'tokenizer_class' be 'GPT2Tokenizer' in all cases? That is the concrete Hugging Face class that gets instantiated, i.e. with this setting you can use

 hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')

rather than GPT2TokenizerFast (which then generates a warning).
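
For anyone who wants to try this on an already converted tokenizer, here is a minimal sketch of the change. It assumes the converted files were saved to a local directory that contains a tokenizer_config.json with a 'tokenizer_class' entry; the helper name set_tokenizer_class is hypothetical and is not part of the gist or of transformers.

```python
import json
from pathlib import Path


def set_tokenizer_class(config_path, class_name="GPT2Tokenizer"):
    """Hypothetical helper: set `tokenizer_class` in a tokenizer_config.json.

    Assumes `config_path` points at an existing JSON file; every other
    key in the file is left untouched.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["tokenizer_class"] = class_name
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

After running it on the saved config, reloading the directory with AutoTokenizer.from_pretrained should show whether the warning is gone.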
