@ashvardanian
Created March 3, 2024 18:35
Benchmark HuggingFace `datasets` library for parsing and preprocessing large textual files
import argparse
import time

from datasets import load_dataset, disable_caching

# Set up the argument parser
parser = argparse.ArgumentParser(description="Benchmark HuggingFace datasets library for a large textual file.")
parser.add_argument("file_path", type=str, help="Path to the textual file to be parsed and chunked.")
parser.add_argument("--sample_by", type=str, help="How to split - by line or by paragraph.", default="line")

# Parse command-line arguments
args = parser.parse_args()
file_path = args.file_path
sample_by = args.sample_by


# Function to benchmark dataset loading and chunking
def benchmark_datasets(file_path, chunk_size=10000):
    # Measure loading time
    start_time = time.time()
    dataset = load_dataset(
        "text",
        data_files=file_path,
        split="train",
        keep_in_memory=True,
        sample_by=sample_by,
    )
    loading_time = time.time() - start_time
    print(f"Time taken to load the dataset: {loading_time} seconds")

    # Measure chunking time
    start_time = time.time()
    _ = dataset.train_test_split(
        test_size=chunk_size,
        seed=42,
        shuffle=True,
    )
    chunking_time = time.time() - start_time
    print(f"Time taken to chunk the dataset into parts of size {chunk_size}: {chunking_time} seconds")

    # Return the total time taken
    return loading_time + chunking_time


# Call the benchmark function
disable_caching()
total_time = benchmark_datasets(file_path)
print(f"Total time taken: {total_time} seconds")
ashvardanian commented Mar 3, 2024

I was preparing some datasets for AI training and noticed that datasets by HuggingFace uses the conventional open mechanism to read the file and split it into chunks. I thought it could be significantly accelerated, so I started with a benchmark:

$ pip install --upgrade --force-reinstall datasets
$ python benchmark_huggingface_datasets.py xlsum.csv 
Generating train split: 1004598 examples [00:47, 21116.16 examples/s]
Time taken to load the dataset: 48.66838526725769 seconds
Time taken to chunk the dataset into parts of size 10000: 0.11466407775878906 seconds
Total time taken: 48.78304934501648 seconds

For the benchmarks I used a large CSV file with mixed UTF-8 content, typical of modern large-scale pre-training pipelines. I later patched the datasets library to use StringZilla, which resulted in significantly lower memory consumption and a 2.9x throughput improvement on AWS r7iz instances. That is with slow SSDs mounted over the network; performance on local SSDs, on something like a DGX-H100, should be even higher:

$ pip install -e .
$ python benchmark_huggingface_datasets.py xlsum.csv 
Generating train split: 1004598 examples [00:15, 64529.90 examples/s]
Time taken to load the dataset: 16.45028805732727 seconds
Time taken to chunk the dataset into parts of size 10000: 0.1291060447692871 seconds
Total time taken: 16.579394102096558 seconds
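
For reference, the core of the change is to memory-map the file and split it with StringZilla instead of iterating over Python file objects. Below is a minimal sketch of that idea, assuming the stringzilla Python package exposes Str, File, and splitlines as in its early-2024 releases; it is an illustration, not the actual patch from my fork:

from stringzilla import Str, File  # assumes the `stringzilla` PyPI package is installed

# Memory-map the file instead of reading it through Python's `open`
text = Str(File("xlsum.csv"))

# Split into lines without copying; the result is a collection of views into the mapped file
lines = text.splitlines()
print(f"Parsed {len(lines)} lines")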

I've already pushed the patches to my fork, and would love to contribute them to the upstream repository.


All the tests pass, but they leave a couple of important questions open. Python's default open(..., newline=None) uses universal newlines, where \n, \r, and \r\n are all converted to \n on the fly. I am not sure whether that is a good idea for a general-purpose dataset-preparation pipeline.
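
To make the question concrete, here is a small standard-library illustration of what universal newline translation does; the temporary file and its contents are made up for the example:

import os
import tempfile

# A file with mixed newline conventions
raw = b"alpha\nbeta\r\ngamma\rdelta"
path = os.path.join(tempfile.mkdtemp(), "mixed_newlines.txt")
with open(path, "wb") as f:
    f.write(raw)

# Text mode with newline=None (the default): \r and \r\n are translated to \n
with open(path, "r", newline=None) as f:
    print(f.read().splitlines())  # ['alpha', 'beta', 'gamma', 'delta']

# Binary mode preserves the original bytes, which is what a C-style reader would see
with open(path, "rb") as f:
    print(f.read())  # b'alpha\nbeta\r\ngamma\rdelta'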

I can simulate the same behavior for the "line" splitter (which I don't do yet). Adjusting it for the "paragraph" splitter would be harder. Should we stick exactly to the old Pythonic behavior, or stay closer to how C and other programming languages handle newlines?
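
One way to emulate the same behavior for paragraphs, sketched here in pure Python for illustration only (the patched code would presumably operate on StringZilla views rather than bytes), is to normalize the separators before splitting on blank lines:

def split_paragraphs_universal(data: bytes) -> list:
    # Normalize \r\n and bare \r to \n, mirroring Python's universal newlines,
    # then treat a blank line (\n\n) as the paragraph boundary
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return [p for p in normalized.split(b"\n\n") if p]

print(split_paragraphs_universal(b"first para\r\n\r\nsecond para\n\nthird\rpara"))
# [b'first para', b'second para', b'third\npara']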
