Skip to content

Instantly share code, notes, and snippets.

@jneuff
jneuff / fix-tokenizer.rs
Created October 17, 2023 11:35
Fix a huggingface tokenizer to which tokens have been added after training
/// Fix a huggingface tokenizer to which tokens have been added after training.
///
/// Adding tokens after training via `add_special_tokens` leads to them being added to the
/// `added_tokens` section but not to the `model.vocab` section. This yields warnings like:
/// ```
/// [2023-10-17T07:54:05Z WARN tokenizers::tokenizer::serialization] Warning: Token '<|empty_usable_token_space_1023|>' was expected to have ID '129023' but was given ID 'None'
/// ```
/// The code in this file ensures that all tokens from `added_tokens` are also placed into
/// `model.vocab`. This fixes the warning and does not change the tokenizer's behavior.