```python
import cudf
from cudf.utils.hash_vocab_utils import hash_vocab

# Hash the raw BERT vocabulary into the format the GPU tokenizer expects
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')

from cudf.core.subword_tokenizer import SubwordTokenizer

cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
                                  do_lower_case=True)

str_series = cudf.Series(['This is the', 'best book'])
tokenizer_output = cudf_tokenizer(str_series,
                                  max_length=8,
                                  max_num_rows=len(str_series),
                                  padding='max_length',
                                  return_tensors='pt',
                                  truncation=True)

tokenizer_output['input_ids']
# tensor([[ 101, 1142, 1110, 1103,  102,    0,    0,    0],
#         [ 101, 1436, 1520,  102,    0,    0,    0,    0]],
#        device='cuda:0', dtype=torch.int32)

tokenizer_output['attention_mask']
# tensor([[1, 1, 1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)

tokenizer_output['metadata']
# tensor([[0, 1, 3],
#         [1, 1, 2]], device='cuda:0', dtype=torch.int32)
```
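As context for the ids above: 101 and 102 are the `[CLS]` and `[SEP]` special tokens in the bert-base-cased vocabulary, and the remaining ids come from WordPiece subword tokenization, which splits each word greedily into the longest vocabulary pieces. A minimal pure-Python sketch of that greedy longest-match rule, using a tiny illustrative vocabulary (not the real bert-base-cased vocab):

```python
# Toy WordPiece tokenizer: greedy longest-match-first over a vocabulary.
# Continuation pieces carry the "##" prefix, as in BERT.
def wordpiece(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate      # longest matching piece found
                break
            end -= 1
        if piece is None:              # no piece matches: whole word -> [UNK]
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##aff", "##able", "book"}
print(wordpiece("unaffable", toy_vocab))   # ['un', '##aff', '##able']
print(wordpiece("book", toy_vocab))        # ['book']
```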
Check out the file below, which has the required documentation.
Thank you. I am also looking for more documentation on using BERT models in TF with the CUDA libraries. Do you have any repos showing the next steps after building the cuDF tokenizer?
I think you can follow HuggingFace for that. You just need to switch the tokenizer to the RAPIDS one; everything else should remain the same.
https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb
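The swap works because the cuDF tokenizer returns a dict with the same `input_ids`/`attention_mask` keys that a HuggingFace model accepts as keyword arguments; just note that the extra `metadata` entry must be dropped before `**`-unpacking into the model. A minimal sketch of that call shape, with a stub standing in for the real model and plain lists standing in for CUDA tensors (so it runs without a GPU):

```python
def stub_model(input_ids, attention_mask):
    # Stand-in for a HuggingFace model's forward(); a real
    # BertForSequenceClassification takes the same keyword arguments.
    return {"logits": [[0.0, 0.0] for _ in input_ids]}

# Shaped like the cuDF tokenizer output above (lists stand in for tensors)
tokenizer_output = {
    "input_ids": [[101, 1142, 1110, 1103, 102, 0, 0, 0],
                  [101, 1436, 1520, 102, 0, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 1, 1, 0, 0, 0],
                       [1, 1, 1, 1, 0, 0, 0, 0]],
    "metadata": [[0, 1, 3], [1, 1, 2]],
}

# Drop the cuDF-specific 'metadata' key, then unpack as in the HF notebook
batch = {k: v for k, v in tokenizer_output.items() if k != "metadata"}
out = stub_model(**batch)   # same call shape as model(**batch)
print(len(out["logits"]))   # 2: one row of logits per input row
```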
The tokenizer gives this for the input IDs:
```python
cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
                                  do_lower_case=True)
Test_tokenizer_output = cudf_tokenizer(x_test1,
                                       max_length=seq_len,
                                       max_num_rows=len(x_test1),
                                       padding='max_length',
                                       return_tensors='pt',
                                       truncation=True)
print(Test_tokenizer_output['input_ids'])
# tensor([[ 101, 6479, 1132,  ...,    0,    0,    0],
#         [ 101, 1103, 8152,  ...,    0,    0,    0],
#         [ 101, 1176, 1103,  ...,    0,    0,    0],
#         ...,
#         [ 101, 1169,  181,  ...,    0,    0,    0],
#         [ 101,  170, 1647,  ...,    0,    0,    0],
#         [ 101, 1107,  170,  ...,    0,    0,    0]], device='cuda:0',
#        dtype=torch.int32)
```
It's in PyTorch; I want it in TensorFlow.
```python
import tensorflow as tf

Test_tokenizer_output = cudf_tokenizer(x_test1,
                                       max_length=seq_len,
                                       max_num_rows=len(x_test1),
                                       padding='max_length',
                                       return_tensors='tf',
                                       truncation=True)
print(Test_tokenizer_output)
```
```
ModuleNotFoundError                       Traceback (most recent call last)
/tmp/ipykernel_4735/629598511.py in <module>
      5                            padding='max_length',
      6                            return_tensors='tf',
----> 7                            truncation=True)
      8 print(Test_tokenizer_output)

~/miniconda3/envs/rtd/lib/python3.7/site-packages/cudf/core/subword_tokenizer.py in __call__(self, text, max_length, max_num_rows, add_special_tokens, padding, truncation, stride, return_tensors, return_token_type_ids)
    233         tokenizer_output = {
    234             k: _cast_to_appropriate_type(v, return_tensors)
--> 235             for k, v in tokenizer_output.items()
    236         }
    237

~/miniconda3/envs/rtd/lib/python3.7/site-packages/cudf/core/subword_tokenizer.py in <dictcomp>(.0)
    233         tokenizer_output = {
    234             k: _cast_to_appropriate_type(v, return_tensors)
--> 235             for k, v in tokenizer_output.items()
    236         }
    237

~/miniconda3/envs/rtd/lib/python3.7/site-packages/cudf/core/subword_tokenizer.py in _cast_to_appropriate_type(ar, cast_type)
     22
     23     elif cast_type == "tf":
---> 24         from tf.experimental.dlpack import from_dlpack
     25
     26     return from_dlpack(ar.astype("int32").toDlpack())

ModuleNotFoundError: No module named 'tf'
```
Weird.
I tried multiple TensorFlow versions and had no luck.
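For what it's worth, the traceback shows why no TensorFlow version helps: this cuDF version's `_cast_to_appropriate_type` does `from tf.experimental.dlpack import from_dlpack`, i.e. it imports a module literally named `tf` rather than `tensorflow`. Until that import is fixed upstream, one possible workaround (untested here, and dependent on how TensorFlow exposes its submodules) is to register TensorFlow under the name `tf` in `sys.modules` before calling the tokenizer; alternatively, request `return_tensors='pt'` and convert to TF via DLPack yourself. The aliasing trick itself is demonstrated below with the stdlib `json` module so the snippet runs without TensorFlow:

```python
import sys
import json

# Possible workaround shape on a machine with TensorFlow installed:
#   import sys, tensorflow
#   sys.modules['tf'] = tensorflow   # so cuDF's `from tf...` can resolve
#   Test_tokenizer_output = cudf_tokenizer(..., return_tensors='tf', ...)

# Demonstrate the aliasing mechanism with a stdlib module: the import
# system consults sys.modules first, so the alias name resolves.
sys.modules['json_alias'] = json
from json_alias import dumps        # resolves through the alias

print(dumps({"ok": True}))          # {"ok": true}
```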
Just replied here:
rapidsai/cudf#9447 (comment)
Can you give a demo of your file before and after using perfect_hash, please?