@intervitens
Last active January 26, 2024 22:20
Script for the InternLM2 tokenizer to add ChatML tokens and fix null token 354 for ggml conversion
# Launch with PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python tokenizer_fix.py
import sentencepiece.sentencepiece_model_pb2 as model
m = model.ModelProto()
# Read the original SentencePiece model; the context manager closes the file handle.
with open('./tokenizer.model', 'rb') as f:
    m.ParseFromString(f.read())
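# Rename pieces 92538-92543 to the ChatML and related special tokens.
m.pieces[92543].piece = '<|im_start|>'
m.pieces[92542].piece = '<|im_end|>'
m.pieces[92541].piece = '<|action_start|>'
m.pieces[92540].piece = '<|action_end|>'
m.pieces[92539].piece = '<|interpreter|>'
m.pieces[92538].piece = '<|plugin|>'
# Piece 354 is a null token that breaks ggml conversion; give it a unique printable name.
m.pieces[354].piece = "[ERROR_NULL_TOKEN_a76Y96a9eX7b]"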
# Write the patched model to a new file so the original stays untouched.
with open('tokenizer_fixed.model', 'wb') as f:
    f.write(m.SerializeToString())
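
A slightly more defensive variant of the renaming step is sketched below. It is not part of the original script. `rename_pieces` and `RENAMES` are hypothetical names introduced here for illustration, and the helper assumes only that the model object exposes an indexable `pieces` sequence whose items have a writable `piece` attribute, as the `ModelProto` used above does. It checks every index before changing anything and reports the old and new name of each piece, which may help confirm that the indices match the tokenizer file actually being patched.

```python
from types import SimpleNamespace

# Hypothetical table mirroring the indices and names used in the script above.
RENAMES = {
    92543: '<|im_start|>',
    92542: '<|im_end|>',
    92541: '<|action_start|>',
    92540: '<|action_end|>',
    92539: '<|interpreter|>',
    92538: '<|plugin|>',
    354: '[ERROR_NULL_TOKEN_a76Y96a9eX7b]',
}


def rename_pieces(model_proto, renames):
    """Rename pieces in place and return {index: (old_name, new_name)}.

    All indices are validated first, so a bad index leaves the model unmodified.
    """
    pieces = model_proto.pieces
    for index in renames:
        if not 0 <= index < len(pieces):
            raise IndexError(f'piece index {index} out of range (vocab size {len(pieces)})')

    changes = {}
    for index, new_name in renames.items():
        old_name = pieces[index].piece
        pieces[index].piece = new_name
        changes[index] = (old_name, new_name)
    return changes


if __name__ == '__main__':
    # Stand-in object for a quick dry run without a tokenizer file.
    fake = SimpleNamespace(pieces=[SimpleNamespace(piece=f'tok{i}') for i in range(4)])
    for index, (old, new) in rename_pieces(fake, {3: '<|im_start|>'}).items():
        print(index, repr(old), '->', repr(new))
```

With the real model, the call would be `rename_pieces(m, RENAMES)` between parsing and serializing. Printing the returned mapping shows what each piece was called before the change.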