@charlieoneill11
Created May 1, 2022 05:53
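def tokenize(batch):
    # Pad to the longest example in the batch; truncate to the model's maximum length.
    return tokenizer(batch["text"], padding=True, truncation=True)

# Tokenize the first two training examples to inspect the output.
print(tokenize(offensive["train"][:2]))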
> {'input_ids': [[101, 1030, 5310, 23648, 1012, 1012, 1012, 2040, 14977, 1012, 2574, 2111, 2097, 3305, 2008, 2027, 5114, 2498, 2013, 2206, 1037, 6887, 16585, 8958, 1012, 2468, 1037, 3003, 1997, 2115, 2111, 2612, 2030, 2393, 1998, 2490, 2115, 3507, 2406, 3549, 1012, 102], [101, 1030, 5310, 2809, 2086, 1996, 10643, 6380, 8112, 1521, 1055, 11214, 1012, 7987, 20175, 8237, 7747, 19006, 2003, 2004, 6887, 16585, 2004, 2037, 8275, 2343, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
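In the output above, the second example is shorter than the first, so its `input_ids` end in a run of 0s and its `attention_mask` is 0 at those positions. The following is a minimal pure-Python sketch of that padding behaviour, not the tokenizer's actual implementation. The helper name `pad_batch`, the `pad_id=0` default, and the short token-id lists are assumptions chosen for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch.

    Hypothetical helper: mimics the shape of the printed output above,
    where padded positions get `pad_id` and an attention-mask value of 0.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their ids; padding is appended on the right.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids only; they are not taken from the dataset.
batch = pad_batch([[101, 7592, 2088, 999, 102], [101, 7592, 102]])
print(batch)
```

Both rows come out the same length, which is what lets a batch be stacked into a single rectangular tensor.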