@mohdsanadzakirizvi
Created July 18, 2019 09:52
# Setup so the snippet runs standalone (assumes the
# pytorch-pretrained-bert package and the 'bert-base-uncased' weights)
import torch
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Define sentence A and B indices associated with the 1st and 2nd sentences (see the BERT paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])