mmmh I changed the regex to:
"(?i)'s|'t|'re|'ve|'m|'ll|'d|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Which I think should be setting the whole expression case-insensitive. However this breaks on a weird UTF character (https://unicode-explorer.com/c/0345). I sadly don't have an explanation for it 🤷
character = 'ͅ' # U+0345 fails
t = hfgpt4.encode(character)
print(t) # [137, 227] - what tiktoken also returns
o = hfgpt4_case_insensitive.encode(character)
print(o) # []
Otherwise it is equivalent for everything I tested.
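One possible lead, offered as an assumption rather than anything confirmed in this thread: U+0345 is not a letter itself, but Unicode case folding maps it to a Greek letter. An engine that applies case folding to character classes under a global (?i) might therefore classify it differently than when only the contraction group is case-insensitive. The standard-library snippet below only inspects the Unicode properties; it does not reproduce the tokenizer behaviour.

```python
import unicodedata

ch = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI, the character from the report above

category = unicodedata.category(ch)             # general category of the mark itself
folded = ch.casefold()                          # result of Unicode case folding
folded_category = unicodedata.category(folded)  # general category after folding

print(unicodedata.name(ch), category)
print(f"casefold -> U+{ord(folded):04X}", unicodedata.name(folded), folded_category)
```

If the mark is a non-letter that folds to a letter, that would at least be consistent with a negated class such as [^\s\p{L}\p{N}] behaving differently once the whole pattern is case-insensitive.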
Hey, thanks for sharing the solution and discussion! Do we have any conclusion on which Regex to use to fully replicate the tiktoken in Huggingface? Is this pre_tokenizer setting working? Does removing post_processor yield different results?
```python
"pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
        {
            "type": "Split",
            "pattern": {
                "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
            },
            "behavior": "Removed",
            "invert": True
        },
        {
            "type": "ByteLevel",
            "add_prefix_space": False,
            "trim_offsets": True,
            "use_regex": False
        }
    ]
}
```
Try it :)
Removing the post_processor shouldn't do anything since the decoder already handles byte level decoding. And as long as you adapted the Regex to the gpt4 version then it should work in python. I can't speak for JS.
Yeah I tried on 10M cases and only got 2 unmatched token sequences, which should be fine for common use cases. By the way, I noticed that the decoding sometimes does not yield same results between converted HF and TikToken. For example, I got different text string from this token sequence [12906, 224, 61196, 5619, 248, 15272, 113, 55884, 245, 5619, 111, 73414, 13, 15272, 250, 31584, 107, 24810, 15272, 113, 31584, 107, 65804, 31584, 97, 43411, 105, 5619, 99, 31584, 99, 92911, 15272, 228, 5619, 250, 84736, 86133, 80338, 31584, 107, 55884, 243, 32511, 248, 31584, 107, 24810, 92317, 61196, 32511, 97, 15272, 246, 12906, 225, 5619, 96, 24810, 11, 15272, 248, 44747, 5619, 94, 1174, 15272, 108, 32511, 245, 11, 15272, 99, 31584, 113, 55884, 115, 11, 15272, 107, 24810, 15272, 255, 32511, 113, 61196, 32511, 224, 5619, 244, 35470, 45279, 44747, 5619, 250, 48909, 32511, 117, 44747, 15272, 101, 32511, 117, 44747, 11, 15272, 107, 24810, 84736, 86133, 32511, 108, 31584, 114, 31584, 113, 5619, 255, 12906, 224, 88344, 44747, 5619, 113, 45279, 15272, 97, 31584, 107, 24810, 15272, 113, 31584, 107, 65804, 31584, 97, 44747, 5619, 248, 44747, 15272, 105, 32511, 107, 65804, 55675, 15272, 228, 5619, 96, 39951, 92317, 73753, 92911, 32511, 101, 35470, 85410, 35470, 84736, 73753, 79468, 31584, 97, 65804, 15272, 110, 43411, 117, 5619, 96, 31584, 107, 32511, 97, 85410, 24810, 84736, 73753, 5619, 95, 32511, 243, 32511, 108, 15272, 246, 31584, 107, 32511, 113, 24810, 11, 15272, 97, 31584, 107, 32511, 248, 73414, 15272, 228, 5619, 107, 73753, 5619, 115, 31584, 107, 15272, 110, 55675, 65804, 32511, 224, 79468, 88344, 55675, 45279, 92317, 32511, 224, 5619, 94, 5619, 96, 31584, 107, 32511, 248, 24810, 84736, 86133, 5619, 107, 80338, 31584, 101, 48909, 45279, 32511, 113, 24810, 11, 85410, 55884, 248, 15272, 113, 43411, 114, 55884, 115, 15272, 228, 95048, 35470, 13, 15272, 228, 5619, 96, 39951, 15272, 241, 79468, 32511, 106, 32511, 248, 24810, 15272, 114, 55884, 113, 5619, 253, 15272, 251, 32511, 110, 
31584, 107, 32511, 101, 73414, 80338, 45279, 15272, 227, 92911, 31584, 103, 32511, 113, 5619, 100, 44747, 80338, 5619, 248, 85410, 35470, 84736, 73753, 79468, 31584, 97, 65804]
Nvm, I found it is caused by setting clean_up_tokenization_spaces=True.
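For context, that flag makes decode post-process the text by removing spaces before some punctuation and English contractions, which tiktoken does not do. The snippet below is a simplified, hypothetical stand-in for that clean-up (the helper name clean_up_sketch and the exact replacement list are assumptions, not the transformers source), just to show how a decoded string can change.

```python
# Hypothetical, simplified model of the clean-up step; not the library's code.
REPLACEMENTS = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
    (" 've", "'ve"), (" 're", "'re"),
]

def clean_up_sketch(text: str) -> str:
    """Remove the space before punctuation and contractions, in list order."""
    for src, dst in REPLACEMENTS:
        text = text.replace(src, dst)
    return text

raw = "a , b ."                # text as the tokens literally decode
cleaned = clean_up_sketch(raw)  # text after the clean-up pass
print(repr(raw), "->", repr(cleaned))
```

Any token sequence that decodes to a space followed by a comma or period is altered by such a pass, so passing clean_up_tokenization_spaces=False to decode should keep the output byte-for-byte comparable with tiktoken.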
@xenova Thanks for posting this! For the purposes of adapting/incorporating into other projects, what's the license for this code? (Maybe add a note with license info to the comments at the top?)
Just an update on the issue with the case-insensitive group modifier (?i:), which causes issues with certain regex implementations (e.g., JS): I think it's reasonable to just replace the problematic section with a longer (but equivalent) version.
Original: (?i:'s|'t|'re|'ve|'m|'ll|'d)|
JS-friendly version: (?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))
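A quick way to sanity-check that the two forms agree is to compare them on every ASCII upper/lower-case spelling of the contractions. This is a minimal sketch using Python's standard re module (which supports the scoped (?i:...) group); the helper names are made up for illustration, and it only covers ASCII, since Unicode case folding is engine-dependent.

```python
import re
from itertools import product

# The two contraction sub-patterns from above, without the trailing "|".
ORIGINAL = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")
JS_FRIENDLY = re.compile(r"(?:'([sS]|[tT]|[rR][eE]|[vV][eE]|[mM]|[lL][lL]|[dD]))")

def ascii_case_variants(s: str) -> set[str]:
    """All upper/lower-case spellings of an ASCII string."""
    return {"".join(p) for p in product(*[(c.lower(), c.upper()) for c in s])}

def agree(text: str) -> bool:
    """True if both patterns accept or both reject the whole string."""
    return (ORIGINAL.fullmatch(text) is None) == (JS_FRIENDLY.fullmatch(text) is None)

candidates = set()
for suffix in ["s", "t", "re", "ve", "m", "ll", "d"]:
    candidates |= {"'" + v for v in ascii_case_variants(suffix)}
candidates |= {"'x", "'", "s", "'r", "'l", "'lle"}  # strings neither should accept

mismatches = sorted(c for c in candidates if not agree(c))
print("mismatches:", mismatches)
```

An empty mismatch list only shows agreement on these ASCII inputs; non-ASCII characters that case-fold to ASCII letters are outside what this checks.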
For the purposes of adapting/incorporating into other projects, what's the license for this code?
Do what you want with it :) In any case, my code is adapted from this comment, with a few modifications.
I actually forgot to update the gist with my new conversion script, which takes into account the new split pretokenization regex (thanks @gautierdag for pointing that out!).
It also sets the default clean_up_tokenization_spaces to False (thanks @binxuan for pointing that out).
So, now it's updated 🤗 👍 I've also validated the GPT-4 tokenizer on the entire XNLI dataset (all languages) with 100% compatibility (both encoding and decoding). 🔥 Code to validate:
import tqdm
from datasets import load_dataset
import tiktoken
from transformers import GPT2TokenizerFast

hf_tokenizer = GPT2TokenizerFast.from_pretrained('Xenova/gpt-4')
og_tokenizer = tiktoken.encoding_for_model('gpt-4')
dataset = load_dataset('xnli', 'all_languages')

for item in tqdm.tqdm(dataset['train']):
    for string in item['premise'].values():
        # Encoding must match token for token
        encoded1 = og_tokenizer.encode(string)
        encoded2 = hf_tokenizer.encode(string)
        assert encoded1 == encoded2, f'encoding "{string}" is incorrect. "{encoded1}" != "{encoded2}"'
        # Decoding must round-trip to the same text
        decoded1 = og_tokenizer.decode(encoded1)
        decoded2 = hf_tokenizer.decode(encoded2, skip_special_tokens=True)
        assert decoded1 == decoded2, f'decoding "{string}" is incorrect. "{decoded1}" != "{decoded2}"'
Shouldn't 'tokenizer_class' be 'GPT2Tokenizer' in all cases? This is the huggingface concrete class that's instantiated - i.e. by doing this you can use
hf_tokenizer = AutoTokenizer.from_pretrained('Xenova/gpt-4')
Rather than GPT2TokenizerFast (which then generates a warning).
No worries!
Right, I noticed that while playing around with it a bit more yesterday. I suppose the entire regex can be set to case-insensitive mode, no? Do you notice any difference in your tests if ?i: is removed, but the entire regex is set to case-insensitive (as opposed to that first group)?