@thomwolf
Last active February 5, 2023 03:09
Add special tokens to our model
# We will use 5 special tokens:
# - <bos> to indicate the start of the sequence
# - <eos> to indicate the end of the sequence
# - <speaker1> to indicate the beginning and the tokens of an utterance from the user
# - <speaker2> to indicate the beginning and the tokens of an utterance from the bot
# - <pad> as a padding token to build batches of sequences
SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]
# We can add these special tokens to the vocabulary and the embeddings of the model
# (`tokenizer` and `model` are assumed to be loaded already):
tokenizer.set_special_tokens(SPECIAL_TOKENS)
model.set_num_special_tokens(len(SPECIAL_TOKENS))
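
To make concrete what "adding special tokens to the vocabulary and the embeddings" means, here is a minimal, library-free sketch. It is not the library's implementation: the `add_special_tokens` helper, the toy `vocab` dictionary, and the list-of-lists `embeddings` are hypothetical stand-ins, and it assumes token ids are contiguous from 0 so that `len(vocab)` is the next free id. The two calls above appear to target the older pytorch-pretrained-BERT API; in recent `transformers` releases the corresponding steps are `tokenizer.add_special_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random

SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]


def add_special_tokens(vocab, embeddings, special_tokens, seed=0):
    """Hypothetical helper: give each unseen special token a new id and a new embedding row.

    Assumes ids in `vocab` are contiguous (0..len(vocab)-1) and that
    `embeddings` has exactly one row per vocabulary entry.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    added = 0
    for token in special_tokens:
        if token in vocab:
            continue  # already known, nothing to add
        vocab[token] = len(vocab)  # next free id
        # New rows start as small random vectors; they are learned during fine-tuning.
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
        added += 1
    return added


# Toy vocabulary with 3 tokens and 4-dimensional embeddings.
vocab = {"hello": 0, "world": 1, "!": 2}
embeddings = [[0.0] * 4 for _ in vocab]

num_added = add_special_tokens(vocab, embeddings, SPECIAL_TOKENS)
print(num_added, vocab["<bos>"], len(embeddings))  # 5 3 8
```

The point of the sketch is that both structures must grow together: every new token id needs a matching embedding row, otherwise looking up the new id would index past the end of the embedding matrix.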

samtn4 commented Feb 5, 2023

Hi Thom, I'm trying to add the special tokens to a GPT-J model, which is based on the GPT-2 tokenizer, but I couldn't get it to work. Could you help me, please? I get this error:

AttributeError: 'GPT2Tokenizer' object has no attribute 'set_special_tokens'
