@kasperjunge
Last active July 29, 2022 09:32
Tokenize Hugging Face Dataset or DatasetDict.
from typing import Union

from transformers import AutoTokenizer
from datasets import Dataset, DatasetDict


def tokenize_huggingface_dataset(
    ds: Union[Dataset, DatasetDict],
    tokenizer: AutoTokenizer,
    max_length: int = 512,
    truncation: bool = True,
) -> Union[Dataset, DatasetDict]:
    """Tokenize Hugging Face Dataset or DatasetDict.

    Args:
        ds (Union[Dataset, DatasetDict]): Hugging Face Dataset or DatasetDict.
        tokenizer (AutoTokenizer): Tokenizer.
        max_length (int): Max sequence length.
        truncation (bool): Whether to truncate sequences longer than max_length.

    Returns:
        Union[Dataset, DatasetDict]: Dataset with tokenized text.
    """

    def tokenize(example):
        return tokenizer(example["text"], max_length=max_length, truncation=truncation)

    return ds.map(tokenize, batched=True)