@bfarzin
Last active August 11, 2019 13:39

tanny411 commented Jun 7, 2019

Any idea why I may be getting this error?

KeyError                                  Traceback (most recent call last)

<ipython-input-90-08d0dd5c9f7c> in <module>()
----> 1 data.show_batch()

4 frames

/usr/local/lib/python3.6/dist-packages/fastai/text/transform.py in <listcomp>(.0)
    132     def textify(self, nums:Collection[int], sep=' ') -> List[str]:
    133         "Convert a list of `nums` to their tokens."
--> 134         return sep.join([self.itos[i] for i in nums]) if sep is not None else [self.itos[i] for i in nums]
    135 
    136     def __getstate__(self):

KeyError: tensor(219)
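(For context: a `KeyError` like this usually means the vocab's `itos` is a plain dict keyed by Python ints; indexing such a dict with a `torch.Tensor` fails because tensors hash by identity, not by value. A minimal illustration of the mismatch, separate from the gist code:)

import torch

itos = {219: '▁token'}      # itos as a dict keyed by Python ints
i = torch.tensor(219)
# itos[i] raises KeyError: tensor(219) because the tensor key never matches
print(itos[int(i)])         # works once the index is cast back to int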


bfarzin commented Jun 7, 2019

I am not too sure. I have moved past this example and now have better code for the custom tokenizer, which allows it to be saved and applies `EncodeAsPieces`, returning the sub-word pieces rather than their IDs (the numericalization!). See if this helps at all or if you get the same errors:

import sentencepiece as spm
from fastai.text import *   # BaseTokenizer, ListRules, ifnone, defaults, typing aliases

class SPTokenizer(BaseTokenizer):
    "Wrapper around a SentencePiece tokenizer to make it a `BaseTokenizer`."
    def __init__(self, model_prefix:str):
        self.tok = spm.SentencePieceProcessor()
        self.tok.load(f'{model_prefix}.model')

    def tokenizer(self, t:str) -> List[str]:
        # return sub-word pieces (strings), not ids; fastai handles numericalization
        return self.tok.EncodeAsPieces(t)
    
class CustomTokenizer():
    "Wrapper for a SentencePiece tokenizer to fit into fastai v1."
    def __init__(self, tok_func:Callable, model_prefix:str, pre_rules:ListRules=None):
        self.tok_func,self.model_prefix = tok_func,model_prefix
        self.pre_rules = ifnone(pre_rules, defaults.text_pre_rules)

    def __repr__(self) -> str:
        res = f'Tokenizer {self.tok_func.__name__} using `{self.model_prefix}` model with the following rules:\n'
        for rule in self.pre_rules: res += f' - {rule.__name__}\n'
        return res

    def process_text(self, t:str, tok:BaseTokenizer) -> List[str]:
        "Process one text `t` with tokenizer `tok`."
        for rule in self.pre_rules: t = rule(t)
        toks = tok.tokenizer(t)
        # post rules could be applied here if needed
        return toks

    def _process_all_1(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts` in one process."
        tok = self.tok_func(self.model_prefix)
        return [self.process_text(t,tok) for t in texts]

    def process_all(self, texts:Collection[str]) -> List[List[str]]:
        "Process a list of `texts`."
        return self._process_all_1(texts)
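For reference, a quick smoke test of the wrapper; the `spm` prefix below is hypothetical and assumes a SentencePiece model has already been trained so that `spm.model` exists:

mycust_tok = CustomTokenizer(SPTokenizer, 'spm')    # hypothetical model prefix
print(mycust_tok)                                   # repr lists the pre-rules
print(mycust_tok.process_all(['Hello world!']))     # e.g. [['▁Hello', '▁world', '!']]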


tanny411 commented Jun 7, 2019

I believe this is how I should use it:
mycust_tok = CustomTokenizer(SPTokenizer, model_prefix)
But I still get the error. Can you help with the full modified code?


bfarzin commented Jun 7, 2019

`itos` was wrong also. I updated the example above.
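For anyone hitting the same `KeyError`: the likely fix is to make `itos` a list indexed by id rather than a dict, built straight from the SentencePiece model. A sketch, reusing the imports above and the same hypothetical `spm` prefix:

sp = spm.SentencePieceProcessor()
sp.load('spm.model')    # hypothetical model file
# itos as a list: position i holds the sub-word piece for id i
itos = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
sp_vocab = Vocab(itos)
# pass tokenizer=mycust_tok and vocab=sp_vocab when building the DataBunch
# (e.g. TextLMDataBunch.from_df) so show_batch can textify the ids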


tanny411 commented Jun 7, 2019

Thank you so much. Sorry I didn't notice the change earlier. It works. Much appreciated.


bfarzin commented Jun 8, 2019

No problem. I am glad I cleaned it up for my own good!
