@maldevide
Last active March 11, 2024
Tokenizer Notes

Praxis Maldevide - Draft A

Introduction

This document is a collection of thoughts and observations about the tokenizers used in llama-rooted large language models.

The Tokenizer

Most llama-rooted language models use the LlamaTokenizer class from the Hugging Face transformers library.

BOS, EOS, and PAD Tokens

The BOS token marks the beginning of a sequence. What counts as a sequence varies from model to model: generally the <s> token appears once at the very beginning of the prompt, but some models use it to mark the start of each new <s>[INST] instruction.

The EOS token marks the end of a sequence. This varies much more than the use of the BOS token. A lot of models will override the default EOS token with a custom token, such as chatml's <|im_end|>.

The pad token is used to pad the sequences in a batch to the same length so the matrix math works. LlamaTokenizer ships without a default pad token, so configs often assign one, commonly token 0, <unk>, or the EOS token. Padded positions are masked out by the attention mask and ignored by the model.

Generation halts when the EOS token is predicted (some setups also stop on the pad token).
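As a minimal sketch of what padding does (pad_batch below is a hypothetical helper, not a transformers API): shorter sequences in a batch are filled with the pad ID up to the batch maximum, and an attention mask records which positions carry real tokens.

```python
# Illustrative token IDs; 0 plays the role of the pad token here.
def pad_batch(sequences, pad_id=0, padding_side="right"):
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        if padding_side == "right":
            input_ids.append(seq + pad)
            attention_mask.append([1] * len(seq) + [0] * len(pad))
        else:  # "left"
            input_ids.append(pad + seq)
            attention_mask.append([0] * len(pad) + [1] * len(seq))
    return input_ids, attention_mask

ids, mask = pad_batch([[1, 5, 7], [1, 9]])
# ids  -> [[1, 5, 7], [1, 9, 0]]
# mask -> [[1, 1, 1], [1, 1, 0]]
```

This is also why padding_side matters: decoder-only models generating from the end of the sequence are usually padded on the left.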

Custom Tokens

Several models (like chatml-style) use special tokens to mark the beginning and end of a conversation turn. Chatml models like OpenHermes and Bagel use <|im_start|> and <|im_end|> to wrap each actor.

Tokenizer Options

There are a lot of different options that can be used to tweak the tokenizer:

  • add_bos_token - This is generally set to true, but some instruct models set it to false.
  • add_eos_token - This is generally set to false, but some instruct models set it to true.
  • add_prefix_space - Whether to add a leading space to the input, so the first word is treated like any other word.
  • model_max_length - This is generally set to a very large sentinel value (1000000000000000019884624838656, i.e. int(1e30)), meaning effectively unlimited.
  • padding_side - You can specify whether the padding token is added to the left or right of the sequence.
  • trust_remote_code - Set this to allow running custom tokenizer code shipped with the model.
  • use_fast - Whether to use the Rust-backed "fast" implementation from the tokenizers library.
  • use_default_system_prompt - Llama has a default system prompt, which facebookresearch suggests disabling.
  • chat_template - This is used to specify how the conversation is formatted. It is a Jinja2 template. (See below)
  • add_generation_prompt - Passed to the template to add a generation prompt.
  • fast_tokenizer - Appears in some configs; presumably this also selects the Rust-backed tokenizers implementation.
  • additional_special_tokens - A list of additional special tokens added to the tokenizer. (They need to be in added_tokens_decoder as well.)
  • spaces_between_special_tokens - This is used to add spaces between special tokens.
  • clean_up_tokenization_spaces - This cleans up extra spaces after decoding.
  • sp_model_kwargs - This directly sets the SentencePiece processor options.

Applying Chat Templates

To apply the chat template specified in the tokenizer, you can reference the following code:

!pip install -qU transformers accelerate

from transformers import AutoTokenizer

model = "hf_model/name"  # replace with the model repo you want to load
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

add_generation_prompt should be false for training, and true for generation.

Message

The message is a list of objects, each with a role and content. The role is the actor in the conversation, and the content is the message. The role can be user, assistant, or system.

chat = [
   {"role": "user", "content": "Hello, how are you?"},
   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
   {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

Example

tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing great. How can I help you today?<|im_end|>
<|im_start|>user
I'd like to show off how chat templating works!<|im_end|>
"""

tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing great. How can I help you today?<|im_end|>
<|im_start|>user
I'd like to show off how chat templating works!<|im_end|>
<|im_start|>assistant
"""

From Author on Reddit

When you're formatting data for training, we'd suggest add_generation_prompt=False, to make sure you don't add those extra tokens onto the end.

We generally recommend that you don't use add_special_tokens=True in apply_chat_template. Instead, we recommend just adding the tokens you want into the template itself! Because templates are so flexible, they can absolutely handle things like BOS or EOS tokens themselves. We think it'd be quite confusing for users to have to remember which models need add_special_tokens, but with templates they're guaranteed that if the template is right then they're getting all the special tokens they need.

Jinja2 Templates

The chat_template is a Jinja2 template. This allows for a lot of flexibility in how the conversation is formatted. The following is an example of a chat template:

{% for message in messages %}
{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}
{% endfor %}
{% if add_generation_prompt %}
{{ '<|im_start|>assistant\n' }}
{% endif %}
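To see what such a template produces, it can be rendered directly with jinja2. Here the template is written on a single line, as the tokenizer configs do, so no stray newlines from the template source leak into the output:

```python
from jinja2 import Template

# The ChatML-style template from above, collapsed onto one line.
CHATML = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

rendered = Template(CHATML).render(
    messages=[{"role": "user", "content": "Hi there!"}],
    add_generation_prompt=True,
)
print(rendered)
# <|im_start|>user
# Hi there!<|im_end|>
# <|im_start|>assistant
```

This is only a sketch of the mechanism; apply_chat_template does the equivalent rendering internally, with access to variables like bos_token and eos_token as well.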

Formats

So, we have these tokens; how can we use them to achieve some objective? Since there are many different use cases, each one needs different treatment. How a model was trained, combined with how the tokenizer chooses to tokenize the input, can dramatically change what the next predicted token will be.

Instruct

Instruct is a basic format that looks like this:

<s>[INST] Do something.[/INST]This is the model response.</s>

There are reports that setting add_prefix_space to false can make the model generate more coherent responses, because the model was trained to expect only the <s> token at the beginning of the prompt.
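As a sketch, the instruct format above can be assembled in plain Python, mirroring the Mistral-style chat template reproduced later in these notes (build_instruct_prompt is a hypothetical helper, not a library function; exact spacing around [INST] varies between models):

```python
def build_instruct_prompt(messages, bos="<s>", eos="</s>"):
    # Turns must alternate user/assistant, starting with user,
    # just as the Mistral chat template enforces.
    out = bos
    for i, m in enumerate(messages):
        if (m["role"] == "user") != (i % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant")
        if m["role"] == "user":
            out += "[INST] " + m["content"] + " [/INST]"
        else:
            out += m["content"] + eos
    return out

prompt = build_instruct_prompt([
    {"role": "user", "content": "Do something."},
    {"role": "assistant", "content": "This is the model response."},
])
# <s>[INST] Do something. [/INST]This is the model response.</s>
```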

ChatML

ChatML is a more complex format that allows for multiple actors in a conversation. It uses the following format:

<|im_start|>system
This is a system message.<|im_end|>
<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I am doing well, thank you.<|im_end|>

Setting <|im_end|> as the EOS token lets the model stop before the next actor's turn in the conversation. I suspect that <s> should still be prepended to the beginning of the conversation.

Alpaca

Alpaca is rooted in llama and uses the following format:

<s>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
%instruction%

### Input:
%input%

### Response:
%output%

Since Llama-family tokenizers ship without a pad token, the pad token is often set to the EOS token for generation:

tokenizer.pad_token = tokenizer.eos_token

Alpaca models rely heavily on newlines, and on add_prefix_space=False, to separate the different parts of the conversation.
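A sketch of assembling the Alpaca prompt (build_alpaca_prompt is a hypothetical helper; the <s> BOS token is normally added by the tokenizer itself via add_bos_token, so it is omitted here):

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_alpaca_prompt(instruction, input_text=""):
    # For training, the target output is appended after "### Response:\n".
    return ALPACA_TEMPLATE.format(instruction=instruction, input=input_text)

prompt = build_alpaca_prompt("Summarize the text.", "Tokenizers split text into IDs.")
```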

Zephyr

Zephyr is a little different, in that it doesn't use dedicated special tokens to delimit the conversation roles. Instead, it uses the following format:

<s><|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.</s>

What would a custom chat template look like?

Role Playing / Chat

My current chat template works like this:

Player (to Storyteller): Hello, how are you?
Storyteller (to Player, happy): I am doing well, thank you.
Narrator (describing the scene): The sun is shining, and the birds are singing.

Let's say we wanted to implement a new chat format that lets us specify a few things about each turn: who is talking, who they are talking to, and their directive (generally an action or mood). We will also add a player, a storyteller, and a narrator.

{% for message in messages %}
{% if "to" in message and "directive" in message %}
{{message['name'] + ' (to ' + message['to'] + ', ' + message["directive"] + '): ' + message['content'] + '\n'}}
{% elif "to" in message %}
{{message['name'] + ' (to ' + message['to'] + '): ' + message['content'] + '\n'}}
{% elif "directive" in message %}
{{message['name'] + ' (' + message['directive'] + '): ' + message['content'] + '\n'}}
{% else %}
{{message['name'] + ': ' + message['content'] + '\n'}}
{% endif %}
{% endfor %}
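For clarity, the same branching logic in plain Python (format_turn is a hypothetical helper mirroring the template above):

```python
def format_turn(message):
    # Optional "to" target and "directive" annotation, matching the
    # four branches of the Jinja template.
    name = message["name"]
    to = message.get("to")
    directive = message.get("directive")
    if to and directive:
        prefix = f"{name} (to {to}, {directive})"
    elif to:
        prefix = f"{name} (to {to})"
    elif directive:
        prefix = f"{name} ({directive})"
    else:
        prefix = name
    return f"{prefix}: {message['content']}\n"

format_turn({"role": "assistant", "name": "Storyteller", "to": "player",
             "directive": "happy", "content": "I am doing well, thank you."})
# 'Storyteller (to player, happy): I am doing well, thank you.\n'
```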

Messages

Messages don't have to be limited to just the role and content. They can also include the mood and who the message is directed to. This allows for a lot of flexibility in how the conversation is formatted.

[
    {"role": "user", "name": "Player", "content": "Hello, how are you?"},
    {"role": "assistant", "name": "Storyteller", "to": "player", "directive": "happy", "content": "I am doing well, thank you."},
    {"role": "assistant", "name": "Narrator", "directive": "describing the scene", "content": "The sun is shining, and the birds are singing."},
    {"role": "user", "name": "Player", "directive": "excited", "content": "I'd like to show off how chat templating works!"},
    {"role": "assistant", "name": "Alice", "to": "Bob", "directive": "curious", "content": "Hello, how are you?"},
    {"role": "assistant", "name": "Bob", "to": "Alice", "directive": "pleased", "content": "I am doing great, thank you."}
]

This is semi-compatible with the current chat templates, but will probably error due to non-alternating turns for instruct-type models.

Book Writing

I have a big need for creative writing using an LLM, but I'm not sure any of the current formats really get it right. A custom format could add <|chapter|>, <|p|>, and some kind of <|notes|> user tag. It would probably still need <|user|>, <|system|>, and <|assistant|> too, so extending Zephyr or Merlinite would be best.

{% for message in messages %}
{% if message['role'] == 'user' %}
{{'<|user|>' + message['content'] + '</s>\n'}}
{% elif message['role'] == 'system' %}
{{'<|system|>' + message['content'] + '</s>\n'}}
{% elif message['role'] == 'assistant' and 'type' not in message %}
{{'<|assistant|>' + message['content'] + '</s>\n'}}
{% elif message['role'] == 'assistant' and message['type'] == "chapter" %}
{{'<|chapter|>' + message['content'] + '</s>\n'}}
{% elif message['role'] == 'assistant' and message['type'] == "p" %}
{{'<|p|>' + message['content'] + '</s>\n'}}
{% elif message['role'] == 'assistant' and message['type'] == "notes" %}
{{'<|notes|>' + message['content'] + '</s>\n'}}
{% else %}
{{'<|assistant|>' + message['content'] + '</s>\n'}}
{% endif %}
{% endfor %}
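A plain-Python equivalent of the book-writing template above, for clarity (format_message is a hypothetical helper; note that unknown roles fall through to <|assistant|>, matching the template's else branch):

```python
# Assistant messages can carry a "type" selecting a writing-specific tag.
TYPE_TAGS = {"chapter": "<|chapter|>", "p": "<|p|>", "notes": "<|notes|>"}

def format_message(message):
    role = message["role"]
    if role == "user":
        tag = "<|user|>"
    elif role == "system":
        tag = "<|system|>"
    elif role == "assistant":
        tag = TYPE_TAGS.get(message.get("type"), "<|assistant|>")
    else:
        tag = "<|assistant|>"
    return tag + message["content"] + "</s>\n"

format_message({"role": "assistant", "type": "chapter", "content": "Chapter One"})
# '<|chapter|>Chapter One</s>\n'
```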

ChatML

{% for message in messages %}
{% if message['role'] == 'user' %}
{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n'}}
{% elif message['role'] == 'system' %}
{{'<|im_start|>system\n' + message['content'] + '<|im_end|>\n'}}
{% elif message['role'] == 'assistant' and 'type' not in message %}
{{'<|im_start|>assistant\n'  + message['content'] + '<|im_end|>\n'}}
{% elif message['role'] == 'assistant' and message['type'] == "chapter" %}
{{'<|im_start|>chapter\n' + message['content'] + '<|im_end|>\n'}}
{% elif message['role'] == 'assistant' and message['type'] == "p" %}
{{'<|im_start|>p\n' + message['content'] + '<|im_end|>\n'}}
{% elif message['role'] == 'assistant' and message['type'] == "notes" %}
{{'<|im_start|>notes\n' + message['content'] + '<|im_end|>\n'}}
{% else %}
{{'<|im_start|>assistant\n' + message['content'] + '<|im_end|>\n'}}
{% endif %}
{% endfor %}

Appendix

LlamaTokenizer

Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is
no padding token in the original model.

Args:
    vocab_file (`str`):
        Path to the vocabulary file.
    unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<unk>"`):
        The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
        token instead.
    bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<s>"`):
        The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
    eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"</s>"`):
        The end of sequence token.
    pad_token (`str` or `tokenizers.AddedToken`, *optional*):
        A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by
        attention mechanisms or loss computation.
    sp_model_kwargs (`Dict[str, Any]`, `Optional`, *optional*):
        Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
        SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
        to set:

        - `enable_sampling`: Enable subword regularization.
        - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

            - `nbest_size = {0,1}`: No sampling is performed.
            - `nbest_size > 1`: samples from the nbest_size results.
            - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
            using forward-filtering-and-backward-sampling algorithm.

        - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
            BPE-dropout.

    add_bos_token (`bool`, *optional*, defaults to `True`):
        Whether or not to add an `bos_token` at the start of sequences.
    add_eos_token (`bool`, *optional*, defaults to `False`):
        Whether or not to add an `eos_token` at the end of sequences.
    clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
        Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like
        extra spaces.
    use_default_system_prompt (`bool`, *optional*, defaults to `False`):
        Whether or not the default system prompt for Llama should be used.
    spaces_between_special_tokens (`bool`, *optional*, defaults to `False`):
        Whether or not to add spaces between special tokens.
    legacy (`bool`, *optional*):
        Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622
        and #25224 which includes fixes to properly handle tokens that appear after special tokens. A simple
        example:

        - `legacy=True`:
        ```python
        >>> from transformers import T5Tokenizer

        >>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base", legacy=True)
        >>> tokenizer.encode("Hello <extra_id_0>.")
        [8774, 32099, 3, 5, 1]
        ```
        - `legacy=False`:
        ```python
        >>> from transformers import T5Tokenizer

        >>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base", legacy=False)
        >>> tokenizer.encode("Hello <extra_id_0>.")  # the extra space `[3]` is no longer here
        [8774, 32099, 5, 1]
        ```
        Checkout the [pull request](https://github.com/huggingface/transformers/pull/24565) for more details.
    add_prefix_space (`bool`, *optional*, defaults to `True`):
        Whether or not to add an initial space to the input. This allows to treat the leading word just as any
        other word.

Tokenizer Configurations

Mistral

[Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/tokenizer_config.json)

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}

Mistral Instruct

Mistral Instruct v0.2

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false,
  "chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
}

Zephyr

zephyr-7b-beta

{
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<unk>",
    "<s>",
    "</s>"
  ],
  "bos_token": "<s>",
  "chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "</s>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "truncation_side": "left",
  "unk_token": "<unk>",
  "use_default_system_prompt": true
}

Gemma

zephyr-7b-gemma-v0.1

{
  "add_bos_token": false,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<eos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "<bos>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "106": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "107": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>"
  ],
  "bos_token": "<bos>",
  "chat_template": "{% if messages[0]['role'] == 'user' or messages[0]['role'] == 'system' %}{{ bos_token }}{% endif %}{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% elif messages[-1]['role'] == 'assistant' %}{{ eos_token }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<eos>",
  "legacy": null,
  "model_max_length": 2048,
  "pad_token": "<pad>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "GemmaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}

OpenHermes

OpenHermes

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32000": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "32001": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "</s>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "trust_remote_code": false,
  "unk_token": "<unk>",
  "use_default_system_prompt": true,
  "use_fast": true
}

Calme

Calme

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 32768,
  "pad_token": "<unk>",
  "padding_side": "right",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false,
  "use_fast": true
}

Merlinite

Merlinite

{
  "add_bos_token": false,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32000": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32001": {
      "content": "<|pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32002": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32003": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32004": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|pad|>",
    "<|user|>",
    "<|assistant|>",
    "<|system|>"
  ],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "fast_tokenizer": true,
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<|pad|>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}

Bagel

Bagel

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32000": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32001": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32002": {
      "content": "<|special_0|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32003": {
      "content": "<|special_1|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32004": {
      "content": "<|special_2|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32005": {
      "content": "<|special_3|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32006": {
      "content": "<|special_4|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32007": {
      "content": "<|special_5|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32008": {
      "content": "<|special_6|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32009": {
      "content": "<|special_7|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32010": {
      "content": "<|special_8|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32011": {
      "content": "<|special_9|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32012": {
      "content": "<|special_10|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32013": {
      "content": "<|special_11|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32014": {
      "content": "<|special_12|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32015": {
      "content": "<|special_13|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32016": {
      "content": "<|special_14|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32017": {
      "content": "<|special_15|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32018": {
      "content": "<|special_16|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32019": {
      "content": "<|special_17|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32020": {
      "content": "<|special_18|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32021": {
      "content": "<|special_19|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32022": {
      "content": "<|special_20|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32023": {
      "content": "<|special_21|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32024": {
      "content": "<|special_22|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32025": {
      "content": "<|special_23|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32026": {
      "content": "<|special_24|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32027": {
      "content": "<|special_25|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32028": {
      "content": "<|special_26|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32029": {
      "content": "<|special_27|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32030": {
      "content": "<|special_28|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32031": {
      "content": "<|special_29|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32032": {
      "content": "<|special_30|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32033": {
      "content": "<|special_31|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32034": {
      "content": "<|special_32|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32035": {
      "content": "<|special_33|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32036": {
      "content": "<|special_34|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32037": {
      "content": "<|special_35|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32038": {
      "content": "<|special_36|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32039": {
      "content": "<|special_37|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32040": {
      "content": "<|special_38|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32041": {
      "content": "<|special_39|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32042": {
      "content": "<|special_40|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32043": {
      "content": "<|special_41|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32044": {
      "content": "<|special_42|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32045": {
      "content": "<|special_43|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32046": {
      "content": "<|special_44|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32047": {
      "content": "<|special_45|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32048": {
      "content": "<|special_46|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32049": {
      "content": "<|special_47|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32050": {
      "content": "<|special_48|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32051": {
      "content": "<|special_49|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32052": {
      "content": "<|special_50|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32053": {
      "content": "<|special_51|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32054": {
      "content": "<|special_52|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32055": {
      "content": "<|special_53|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32056": {
      "content": "<|special_54|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32057": {
      "content": "<|special_55|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32058": {
      "content": "<|special_56|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32059": {
      "content": "<|special_57|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32060": {
      "content": "<|special_58|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32061": {
      "content": "<|special_59|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32062": {
      "content": "<|special_60|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32063": {
      "content": "<|special_61|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|special_0|>",
    "<|special_1|>",
    "<|special_2|>",
    "<|special_3|>",
    "<|special_4|>",
    "<|special_5|>",
    "<|special_6|>",
    "<|special_7|>",
    "<|special_8|>",
    "<|special_9|>",
    "<|special_10|>",
    "<|special_11|>",
    "<|special_12|>",
    "<|special_13|>",
    "<|special_14|>",
    "<|special_15|>",
    "<|special_16|>",
    "<|special_17|>",
    "<|special_18|>",
    "<|special_19|>",
    "<|special_20|>",
    "<|special_21|>",
    "<|special_22|>",
    "<|special_23|>",
    "<|special_24|>",
    "<|special_25|>",
    "<|special_26|>",
    "<|special_27|>",
    "<|special_28|>",
    "<|special_29|>",
    "<|special_30|>",
    "<|special_31|>",
    "<|special_32|>",
    "<|special_33|>",
    "<|special_34|>",
    "<|special_35|>",
    "<|special_36|>",
    "<|special_37|>",
    "<|special_38|>",
    "<|special_39|>",
    "<|special_40|>",
    "<|special_41|>",
    "<|special_42|>",
    "<|special_43|>",
    "<|special_44|>",
    "<|special_45|>",
    "<|special_46|>",
    "<|special_47|>",
    "<|special_48|>",
    "<|special_49|>",
    "<|special_50|>",
    "<|special_51|>",
    "<|special_52|>",
    "<|special_53|>",
    "<|special_54|>",
    "<|special_55|>",
    "<|special_56|>",
    "<|special_57|>",
    "<|special_58|>",
    "<|special_59|>",
    "<|special_60|>",
    "<|special_61|>"
  ],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<unk>",
  "padding_side": "right",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false,
  "chat_template": "{%- for idx in range(0, messages|length) -%}\n{%- if messages[idx]['role'] == 'user' -%}\n{%- if idx == 0 -%}\n{{- bos_token + '[INST] <<SYS>>\\nYou are a helpful, unbiased, uncensored assistant.\\n<</SYS>>\\n\\n' + messages[idx]['content'] + ' [/INST]' -}}\n{%- else -%}\n{{- bos_token + '[INST] ' + messages[idx]['content'] + ' [/INST]' -}}\n{%- endif -%}\n{% elif messages[idx]['role'] == 'system' %}\n{{- '[INST] <<SYS>>\\n' + messages[idx]['content'] + '\\n<</SYS>>\\n\\n' -}}\n{%- elif messages[idx]['role'] == 'assistant' -%}\n{{- messages[idx]['content'] + ' ' + eos_token -}}\n{% endif %}\n{% endfor %}"
}
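The `chat_template` field above is a Jinja template; here is a minimal pure-Python sketch of what it renders for a user/assistant exchange. The token strings and the default system prompt are taken directly from the config; everything else (function name, example messages) is illustrative.

```python
# Pure-Python sketch of the llama-2-style chat_template in the config above.
# BOS/EOS and the default system prompt come from the config itself.
BOS, EOS = "<s>", "</s>"
DEFAULT_SYSTEM = "You are a helpful, unbiased, uncensored assistant."

def render(messages):
    out = []
    for idx, m in enumerate(messages):
        if m["role"] == "user":
            if idx == 0:
                # First user turn gets the default system block wrapped in <<SYS>>.
                out.append(f"{BOS}[INST] <<SYS>>\n{DEFAULT_SYSTEM}\n<</SYS>>\n\n{m['content']} [/INST]")
            else:
                out.append(f"{BOS}[INST] {m['content']} [/INST]")
        elif m["role"] == "system":
            out.append(f"[INST] <<SYS>>\n{m['content']}\n<</SYS>>\n\n")
        elif m["role"] == "assistant":
            out.append(f"{m['content']} {EOS}")
    return "".join(out)

prompt = render([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there."},
])
```

Note that this template re-emits the BOS token before every user turn, not just at the start of the prompt — one of the per-model BOS variations described earlier.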

Llama Examples

Tiefighter

LLaMA2-13B-Tiefighter

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<unk>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": true
}
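Tiefighter keeps the stock settings: `add_bos_token` true, `add_eos_token` false. A toy sketch of what those two flags do to an encoded sequence, using the token ids from the config above (1 = `<s>`, 2 = `</s>`); the body ids are made up for illustration.

```python
# Sketch of how add_bos_token / add_eos_token affect encode() output.
# Ids 1 (<s>) and 2 (</s>) match the added_tokens_decoder in the config.
def encode(body_ids, add_bos_token=True, add_eos_token=False):
    ids = list(body_ids)
    if add_bos_token:
        ids = [1] + ids   # prepend <s>
    if add_eos_token:
        ids = ids + [2]   # append </s>
    return ids

print(encode([15043, 3186]))                      # [1, 15043, 3186]
print(encode([15043, 3186], add_eos_token=True))  # [1, 15043, 3186, 2]
```

Training with `add_eos_token=True` is how some instruct models teach the model to emit `</s>` and stop; at inference time it is usually left off so the prompt is open-ended.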

CodeLlama Examples

CodeLlama

CodeLlama-13b-Instruct

{
  "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content | trim + ' ' + eos_token }}{% endif %}{% endfor %}",
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "legacy": null,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "CodeLlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
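The CodeLlama template above is stricter than most: it calls `raise_exception` unless roles alternate user/assistant after an optional leading system message. A sketch of that check, using the same index-parity condition as the Jinja source:

```python
# Sketch of the role-alternation check in the CodeLlama chat_template:
# after an optional leading system turn, even indices must be "user"
# and odd indices "assistant", mirroring the template's
# (role == 'user') != (index % 2 == 0) condition.
def check_roles(messages):
    loop = messages[1:] if messages and messages[0]["role"] == "system" else messages
    for i, m in enumerate(loop):
        if (m["role"] == "user") != (i % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
```

This means you cannot, for example, stack two user messages back to back — something more permissive templates (like chatml's) silently allow.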

WhiteRabbit

WhiteRabbitNeo-13B

{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32007": {
      "content": "▁<PRE>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32008": {
      "content": "▁<SUF>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32009": {
      "content": "▁<MID>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32010": {
      "content": "▁<EOT>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "▁<PRE>",
    "▁<MID>",
    "▁<SUF>",
    "▁<EOT>"
  ],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "eot_token": "▁<EOT>",
  "fill_token": "<FILL_ME>",
  "legacy": null,
  "middle_token": "▁<MID>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "</s>",
  "prefix_token": "▁<PRE>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "suffix_first": false,
  "suffix_token": "▁<SUF>",
  "tokenizer_class": "CodeLlamaTokenizer",
  "trust_remote_code": false,
  "unk_token": "<unk>",
  "use_default_system_prompt": true,
  "use_fast": true
}
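The `▁<PRE>`, `▁<SUF>`, and `▁<MID>` entries are CodeLlama's fill-in-the-middle markers: the prompt carries the code before and after the hole, and the model generates the middle after `▁<MID>`, stopping at `▁<EOT>`. A rough sketch of the prompt layout (`suffix_first` is false in this config, so prefix precedes suffix); the exact whitespace handling around the markers is an assumption here, and in practice the `<FILL_ME>` `fill_token` lets the tokenizer build this for you.

```python
# Sketch of a CodeLlama fill-in-the-middle prompt using the special tokens
# from the config above. Marker strings copied verbatim from the config;
# spacing around them is an illustrative assumption.
PRE, SUF, MID = "\u2581<PRE>", "\u2581<SUF>", "\u2581<MID>"

def fim_prompt(prefix, suffix):
    # The model completes the text after MID and should emit ▁<EOT> when done.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"

p = fim_prompt("def add(a, b):\n    return ", "\n")
```

Also worth noting: this config sets `pad_token` to `</s>` rather than the usual `<unk>` — another reminder that pad-token choice varies model to model.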