Created: July 18, 2019 09:52
import torch
from pytorch_pretrained_bert import BertTokenizer

# Load the pre-trained BERT tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Define sentence A and B indices associated with the 1st and 2nd sentences (see the BERT paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
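Hand-writing the `segments_ids` list is error-prone for longer inputs. A small sketch that derives the same list from the `[SEP]` positions instead; the helper name `build_segment_ids` is my own, not part of any library:

```python
def build_segment_ids(tokens):
    # Tokens up to and including the first [SEP] belong to sentence A (id 0);
    # every token after it belongs to sentence B (id 1).
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == '[SEP]':
            current = 1  # switch segments after the first [SEP]
    return segment_ids

tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
print(build_segment_ids(tokens))
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

This reproduces the hard-coded list above: seven 0s for the first sentence and seven 1s for the second.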
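The two tensors are ready to feed into a masked-language-model head. A minimal sketch of the prediction step, assuming the `pytorch_pretrained_bert` package (whose API this snippet matches) and that the pre-trained `bert-base-uncased` weights can be downloaded:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Rebuild the inputs from the snippet above so this block is self-contained
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0] * 7 + [1] * 7
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load the pre-trained masked-LM head and run a forward pass
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

# Pick the highest-scoring vocabulary entry at the masked position
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)
```

With the stock `bert-base-uncased` weights, the model recovers the masked surname `henson`, which is the round-trip the `# Mask a token` comment promises.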