@humanely · May 10, 2024
Create CLIP Tokenizer: train a CLIP-style byte-level BPE tokenizer (with a "</w>" end-of-word suffix) on a Sanskrit corpus and save it in both the transformers and tokenizers formats.
from pathlib import Path

from tokenizers import (
    Tokenizer,
    models,
    pre_tokenizers,
    processors,
    trainers,
)
from transformers import PreTrainedTokenizerFast

# Collect every .txt file in the Sanskrit corpus (path kept as in the original).
paths = [str(x) for x in Path("../sacorpus/Sankrit_Corpus/").glob("**/*.txt")]

# CLIP-style BPE: each word-final token carries a "</w>" suffix.
tokenizer = Tokenizer(models.BPE(end_of_word_suffix="</w>"))

# Byte-level pre-tokenization, as used for GPT-2/CLIP vocabularies.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.ByteLevel(add_prefix_space=False)]
)

trainer = trainers.BpeTrainer(
    vocab_size=25000,
    min_frequency=1,
    special_tokens=["<|endoftext|>"],
    end_of_word_suffix="</w>",  # mirror the model setting so learned merges carry the suffix
    show_progress=True,
)
tokenizer.train(files=paths, trainer=trainer)

# Byte-level post-processing; keep the original character offsets.
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

# Wrap in a transformers-compatible fast tokenizer and save both formats.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)
wrapped_tokenizer.save_pretrained("cliptok")
tokenizer.save("cliptok.json")
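
For a quick sanity check, the trained tokenizer can be reloaded from either saved artifact. A minimal sketch, assuming the script above has already written cliptok/ and cliptok.json to the working directory; the sample sentence is illustrative only:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Reload the transformers-format tokenizer written by save_pretrained("cliptok").
tok = PreTrainedTokenizerFast.from_pretrained("cliptok")

# Encode an illustrative sample; the resulting IDs depend on the trained vocabulary.
enc = tok("namaste world")
print(enc.input_ids)
print(tok.convert_ids_to_tokens(enc.input_ids))

# Alternatively, reload the raw tokenizers-format file.
raw = Tokenizer.from_file("cliptok.json")
print(raw.encode("namaste world").tokens)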