@kretes
Created March 4, 2020 18:42
Reproduction of a tokenizers hang on encode_batch
from multiprocessing import Process
from tokenizers.implementations import ByteLevelBPETokenizer
import tokenizers

print(tokenizers.__version__)

# works in the parent process:
tok = ByteLevelBPETokenizer()
print(tok.encode_batch(['ala']))
print(tok.encode_batch(['ala', 'kot']))

def encode(name):
    # same calls, but inside a forked child process
    tok = ByteLevelBPETokenizer()
    print("single text")
    print(tok.encode_batch(['ala']))
    # the two-element batch below never returns
    print(tok.encode_batch(['ala', 'kot']))

p = Process(target=encode, args=('ala',))
p.start()
p.join()
Output:

0.6.0
[Encoding(num_tokens=0, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])]
[Encoding(num_tokens=0, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str]), Encoding(num_tokens=0, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])]
single text
[Encoding(num_tokens=0, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])]
... hangs here
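
Note that both batch sizes work in the parent process, and the single-element batch still works in the child; only the two-element batch in the forked child hangs. That pattern suggests encode_batch's internal parallelism interacts badly with fork(). Below is a hedged workaround sketch, not a confirmed fix, assuming that is the cause: start the child with the standard library's 'spawn' method so it gets a fresh interpreter instead of a forked copy of the parent's thread state. (Later tokenizers releases also expose a TOKENIZERS_PARALLELISM environment variable to disable internal parallelism; whether it applies to 0.6.0 is not verified here.)

# Hedged workaround sketch: avoid fork() by using the 'spawn' start method.
import multiprocessing as mp
from tokenizers.implementations import ByteLevelBPETokenizer

def encode(name):
    # a fresh tokenizer built inside the spawned child
    tok = ByteLevelBPETokenizer()
    print(tok.encode_batch(['ala', 'kot']))  # the batch that hung under fork

if __name__ == '__main__':  # required by 'spawn', which re-imports this module
    ctx = mp.get_context('spawn')  # clean child process instead of fork()
    p = ctx.Process(target=encode, args=('ala',))
    p.start()
    p.join()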