@fxnnxc
Created August 13, 2024 03:05
def get_tokenized_dataset(dataset, tokenizer, batch_size, num_proc=None):
    # Pad on the left so each sequence ends at the same position,
    # which is what decoder-only models expect for batched generation.
    tokenizer.padding_side = "left"

    def process(samples):
        # Tokenize a batch of texts; each batch is padded to its
        # longest sequence, with no truncation.
        return tokenizer(
            samples["text"],
            max_length=None,
            padding=True,
            truncation=False,
            return_tensors="pt",
        )

    # Drop the original columns so only the tokenizer outputs remain.
    remove_columns = dataset.column_names
    dataset = dataset.map(
        process,
        batched=True,
        batch_size=batch_size,
        num_proc=num_proc,
        load_from_cache_file=False,
        remove_columns=remove_columns,
        desc="Tokenizing dataset...",
    )
    return dataset
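For illustration, here is a minimal, dependency-free sketch of what left padding does to a batch. The pad id of 0 and the toy token ids are assumptions for the sketch, not part of the gist; the real padding is performed by the Hugging Face tokenizer above.

```python
def left_pad(sequences, pad_id=0):
    # Pad every sequence on the left to the length of the longest one,
    # mirroring the effect of tokenizer.padding_side = "left" above.
    # pad_id=0 is an assumed placeholder; real tokenizers define their own.
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]

# Toy "token ids" for two texts of different lengths.
ids = [[5, 6, 7], [8, 9]]
print(left_pad(ids))  # [[5, 6, 7], [0, 8, 9]]
```

With left padding, the last real token of every sequence sits in the final column, so generation can append new tokens to all rows at once.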