@VibhuJawa
Last active September 23, 2022 18:56
import cudf
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

# Build the perfect-hash vocabulary file from the raw BERT vocab
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')

cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
                                  do_lower_case=True)

str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)
tokenizer_output['input_ids']
tensor([[ 101, 1142, 1110, 1103,  102,    0,    0,    0],
        [ 101, 1436, 1520,  102,    0,    0,    0,    0]],
       device='cuda:0', dtype=torch.int32)

tokenizer_output['attention_mask']
tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0]],
       device='cuda:0', dtype=torch.int32)

tokenizer_output['metadata']
tensor([[0, 1, 3],
        [1, 1, 2]], device='cuda:0', dtype=torch.int32)
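For readers checking the output by hand, the attention_mask rows above can be decoded without a GPU: each 1 marks a real token (including the [CLS]/[SEP] specials added by the tokenizer) and each 0 marks padding. A minimal pure-Python sketch using the mask values printed above:

```python
# attention_mask rows copied from the output above: 1 = real token, 0 = padding
attention_mask = [
    [1, 1, 1, 1, 1, 0, 0, 0],  # "This is the" -> [CLS] this is the [SEP]
    [1, 1, 1, 1, 0, 0, 0, 0],  # "best book"   -> [CLS] best book [SEP]
]

# Unpadded sequence lengths are just the row sums
lengths = [sum(row) for row in attention_mask]
print(lengths)  # [5, 4]
```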
@RaeWallace10

Can you give a demo of your file before and after using perfect_hash, please?

@RaeWallace10

Thank you. I am also looking for more documentation on using BERT models in TF with the CUDA libraries. Do you have any repos showing the next steps after building the cuDF tokenizer?

@VibhuJawa
Author

I think you can follow the HuggingFace notebooks for that. You just need to switch the tokenizer to the RAPIDS one; everything else should remain the same.

https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb
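A minimal sketch of why the swap is drop-in (plain Python, no GPU required; the dict values below are invented toy numbers, not real tokenizer output): both the HuggingFace tokenizer and cuDF's SubwordTokenizer return a dict keyed by input_ids and attention_mask, so downstream model code that reads those keys is unchanged. cuDF adds an extra metadata key, which downstream code simply ignores:

```python
# Toy stand-ins for the two tokenizers' outputs (values are illustrative only)
hf_output = {"input_ids": [[101, 1142, 102]],
             "attention_mask": [[1, 1, 1]]}
cudf_output = {"input_ids": [[101, 1142, 102]],
               "attention_mask": [[1, 1, 1]],
               "metadata": [[0, 1, 1]]}  # extra key; downstream code ignores it

def run_model(batch):
    # A model forward pass only touches input_ids and attention_mask
    return batch["input_ids"], batch["attention_mask"]

# The same downstream call works for both tokenizers' outputs
assert run_model(hf_output) == run_model(cudf_output)
print("drop-in swap OK")
```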

@RaeWallace10

RaeWallace10 commented Oct 15, 2021

The tokenizer gives this for the input IDs:

cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
                                   do_lower_case=True)

Test_tokenizer_output = cudf_tokenizer(x_test1,
                                  max_length=seq_len,
                                  max_num_rows=len(x_test1),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)
print(Test_tokenizer_output['input_ids'])

tensor([[ 101, 6479, 1132,  ...,    0,    0,    0],
        [ 101, 1103, 8152,  ...,    0,    0,    0],
        [ 101, 1176, 1103,  ...,    0,    0,    0],
        ...,
        [ 101, 1169,  181,  ...,    0,    0,    0],
        [ 101,  170, 1647,  ...,    0,    0,    0],
        [ 101, 1107,  170,  ...,    0,    0,    0]], device='cuda:0',
       dtype=torch.int32)

It's in PyTorch; I want it in TensorFlow:

import tensorflow as tf
Test_tokenizer_output = cudf_tokenizer(x_test1,
                                  max_length=seq_len,
                                  max_num_rows=len(x_test1),
                                  padding='max_length',
                                  return_tensors='tf',
                                  truncation=True)
print(Test_tokenizer_output)

ModuleNotFoundError                       Traceback (most recent call last)
/tmp/ipykernel_4735/629598511.py in <module>
      5                                   padding='max_length',
      6                                   return_tensors='tf',
----> 7                                   truncation=True)
      8 print(Test_tokenizer_output)

~/miniconda3/envs/rtd/lib/python3.7/site-packages/cudf/core/subword_tokenizer.py in __call__(self, text, max_length, max_num_rows, add_special_tokens, padding, truncation, stride, return_tensors, return_token_type_ids)
    233         tokenizer_output = {
    234             k: _cast_to_appropriate_type(v, return_tensors)
--> 235             for k, v in tokenizer_output.items()
    236         }
    237 

~/miniconda3/envs/rtd/lib/python3.7/site-packages/cudf/core/subword_tokenizer.py in <dictcomp>(.0)
    233         tokenizer_output = {
    234             k: _cast_to_appropriate_type(v, return_tensors)
--> 235             for k, v in tokenizer_output.items()
    236         }
    237 

~/miniconda3/envs/rtd/lib/python3.7/site-packages/cudf/core/subword_tokenizer.py in _cast_to_appropriate_type(ar, cast_type)
     22 
     23     elif cast_type == "tf":
---> 24         from tf.experimental.dlpack import from_dlpack
     25 
     26     return from_dlpack(ar.astype("int32").toDlpack())

ModuleNotFoundError: No module named 'tf'

Weird
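The traceback actually explains the failure: the cuDF code shown above imports from a module literally named tf, but tf is only the conventional alias created by `import tensorflow as tf`; the TensorFlow wheel installs a package named tensorflow, so the import fails no matter which TensorFlow version is installed. A pure-Python sketch of the failure mode (no TensorFlow needed):

```python
# The failing line inside cudf's subword_tokenizer.py is effectively:
#     from tf.experimental.dlpack import from_dlpack
# "tf" is a conventional alias (import tensorflow as tf), not a real package
# name, so Python raises ModuleNotFoundError regardless of which TensorFlow
# version is installed.
try:
    from tf.experimental.dlpack import from_dlpack  # same import cudf attempts
except ModuleNotFoundError as exc:
    print(exc)  # No module named 'tf'
```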

@RaeWallace10

I tried multiple TensorFlow versions, and no luck.

@VibhuJawa
Author

Just replied here:
rapidsai/cudf#9447 (comment)
