intervitens / tokenizer_fix.py
Last active January 26, 2024 22:20
Script for the InternLM2 tokenizer that adds the ChatML tokens and fixes the null token at index 354 for ggml conversion
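# Launch with: PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python tokenizer_fix.py
import sentencepiece.sentencepiece_model_pb2 as model

# Load the SentencePiece model as a protobuf message so its pieces can be edited in place.
m = model.ModelProto()
with open('./tokenizer.model', 'rb') as f:  # close the file handle once it has been read
    m.ParseFromString(f.read())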
# Overwrite the pieces at indices 92540 to 92543 with the ChatML and action special tokens.
m.pieces[92543].piece = '<|im_start|>'
m.pieces[92542].piece = '<|im_end|>'
m.pieces[92541].piece = '<|action_start|>'
m.pieces[92540].piece = '<|action_end|>'