Created March 3, 2024 18:35
Benchmark HuggingFace `datasets` library for parsing and preprocessing large textual files
import argparse
import time

from datasets import load_dataset
from datasets import disable_caching

# Set up argument parser
parser = argparse.ArgumentParser(
    description="Benchmark HuggingFace datasets library for a large textual file.")
parser.add_argument("file_path", type=str,
    help="Path to the textual file to be parsed and chunked.")
parser.add_argument("--sample_by", type=str, default="line",
    help="How to split - by line or by paragraph.")

# Parse command line arguments
args = parser.parse_args()
file_path = args.file_path
sample_by = args.sample_by

# Function to benchmark dataset loading and chunking
def benchmark_datasets(file_path, chunk_size=10000):
    # Measure loading time
    start_time = time.time()
    dataset = load_dataset(
        "text",
        data_files=file_path,
        split="train",
        keep_in_memory=True,
        sample_by=sample_by,
    )
    loading_time = time.time() - start_time
    print(f"Time taken to load the dataset: {loading_time} seconds")

    # Measure chunking time
    start_time = time.time()
    _ = dataset.train_test_split(
        test_size=chunk_size,
        seed=42,
        shuffle=True,
    )
    chunking_time = time.time() - start_time
    print(f"Time taken to chunk the dataset into parts of size {chunk_size}: {chunking_time} seconds")

    # Return the total time taken
    return loading_time + chunking_time

# Disable caching so every run re-parses the file, then call the benchmark
disable_caching()
total_time = benchmark_datasets(file_path)
print(f"Total time taken: {total_time} seconds")
I was preparing some datasets for AI training and noticed that `datasets` by HuggingFace uses the conventional `open` mechanism to read the file and split it into chunks. I thought it could be significantly accelerated, and started with a benchmark:

$ pip install --upgrade --force-reinstall datasets
$ python benchmark_huggingface_datasets.py xlsum.csv
Generating train split: 1004598 examples [00:47, 21116.16 examples/s]
Time taken to load the dataset: 48.66838526725769 seconds
Time taken to chunk the dataset into parts of size 10000: 0.11466407775878906 seconds
Total time taken: 48.78304934501648 seconds
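For context, the stock text loader boils down to iterating over the file with Python's buffered `open`, either line by line or paragraph by paragraph. Below is a minimal sketch of that baseline approach; it is not the actual `datasets` source, and the function name is made up for illustration:

# Hypothetical baseline: how "line" / "paragraph" splitting looks with plain Python I/O.
def split_with_open(path: str, sample_by: str = "line"):
    with open(path, "r", encoding="utf-8") as handle:  # newline=None => universal newlines
        if sample_by == "line":
            for line in handle:                  # one Python string object per line
                yield line.rstrip("\n")
        elif sample_by == "paragraph":
            paragraph = []
            for line in handle:
                if line.strip():                 # accumulate non-empty lines
                    paragraph.append(line)
                elif paragraph:                  # a blank line closes the paragraph
                    yield "".join(paragraph)
                    paragraph = []
            if paragraph:
                yield "".join(paragraph)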
For benchmarks I've used a large CSV file with mixed UTF-8 content, most common in modern large-scale pre-training pipelines. I've later patched the `datasets` library to use `stringzilla`, which resulted in significantly lower memory consumption and a 2.9x throughput improvement on AWS `r7iz` instances. That's using slow SSDs mounted over the network; performance on local SSDs on something like a DGX-H100 should be even higher.

I've already pushed the patches to my fork, and would love to contribute them to the upstream repository.
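The core idea of the patch is to memory-map the file with StringZilla and split it with SIMD-accelerated searches instead of buffered `open` reads. A rough sketch, assuming the `stringzilla` PyPI package and its `Str`/`File` wrappers (the exact integration into `datasets` differs, and method signatures may not match the installed version):

from stringzilla import Str, File  # assumes the `stringzilla` package is installed

def split_with_stringzilla(path: str, sample_by: str = "line"):
    # Memory-map the file: no copy of the whole content into a Python str.
    text = Str(File(path))
    if sample_by == "line":
        return text.splitlines()       # zero-copy views over the mapped file
    elif sample_by == "paragraph":
        return text.split("\n\n")      # naive boundary; \r\n\r\n is not handled here
    raise ValueError(f"Unsupported sample_by: {sample_by}")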
All the tests pass, but they leave a couple of important questions open. The default Python `open(..., newline=None)` uses universal newlines, where `\n`, `\r`, and `\r\n` are all converted to `\n` on the fly. I am not sure that's a good idea for a general-purpose dataset-preparation pipeline.

I can simulate the same behavior (which I don't yet do) for the `"line"` splitter. Adjusting it for the `"paragraph"` splitter would be harder. Should we stick exactly to the old Pythonic behavior, or stay closer to how C and other programming languages handle newlines?
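To make the difference concrete, here is a small self-contained comparison of universal-newline translation versus raw reads, and why the paragraph case is trickier: a Windows-style blank line is `\r\n\r\n`, which a plain `\n\n` probe misses unless the bytes are normalized first. This is purely illustrative; the temporary file and its contents are made up:

import tempfile, os

# A file with mixed newline conventions (illustrative payload).
payload = b"first line\r\nsecond line\rthird line\n\r\nnext paragraph\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Universal newlines (newline=None, the default): \r, \r\n, and \n all become \n.
with open(path, "r", newline=None) as handle:
    translated = handle.read()
print(translated.count("\n"))                    # 5: every line break normalized

# Disable translation (newline=""): the raw \r and \r\n survive the read.
with open(path, "r", newline="") as handle:
    raw = handle.read()
print(raw.count("\r\n"), raw.count("\n"))        # 2 and 4: carriage returns preserved

# Paragraph splitting on the raw text misses the \r\n\r\n boundary, so a
# byte-level splitter must either normalize first or probe every newline flavor.
print(len(raw.split("\n\n")), len(translated.split("\n\n")))   # 1 vs 2

os.remove(path)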