@uchidama
Created March 6, 2024 06:02
Load only 10 MB from the the-stack dataset and display the first sample
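# Load only about 10 MB from the the-stack dataset and display the first sample
from datasets import load_dataset

# Stream the dataset so the full corpus is never downloaded
dataset = load_dataset("bigcode/the-stack", split="train", streaming=True)

MAX_BYTES = 10 * 1024 * 1024  # 10 MB

data_subset = []
total_size = 0
for sample in dataset:
    # sys.getsizeof(sample) only measures the dict container, not the text it holds,
    # so measure the UTF-8 size of the source text instead (assumes a "content" field)
    sample_size = len(sample["content"].encode("utf-8"))
    if total_size + sample_size > MAX_BYTES:
        break
    data_subset.append(sample)
    total_size += sample_size

print(len(data_subset))  # number of samples collected
print(total_size)        # total bytes of source text collected
print(data_subset[0])    # first sample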
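
A note on measuring sample size: `sys.getsizeof` applied to a dict reports only the shallow size of the container, not the strings it references, so a byte budget based on it does not reflect how much text was actually read. The following is a minimal sketch of a budget based on the payload instead. It assumes each sample is a dict with its source text under a `"content"` key; the helper names `payload_bytes` and `take_until_budget` and the toy rows are hypothetical and exist only so the example runs without downloading anything.

```python
import sys
from typing import Any, Dict, Iterable, List, Tuple


def payload_bytes(sample: Dict[str, Any], key: str = "content") -> int:
    """Return the UTF-8 size in bytes of the text stored under `key`."""
    text = sample.get(key) or ""
    return len(text.encode("utf-8"))


def take_until_budget(
    samples: Iterable[Dict[str, Any]], budget_bytes: int, key: str = "content"
) -> Tuple[List[Dict[str, Any]], int]:
    """Collect samples in order until the next one would exceed the budget."""
    subset: List[Dict[str, Any]] = []
    total = 0
    for sample in samples:
        size = payload_bytes(sample, key)
        if total + size > budget_bytes:
            break
        subset.append(sample)
        total += size
    return subset, total


# Toy rows standing in for streamed dataset samples (assumption, not real data).
fake_rows = [{"content": "a" * 400}, {"content": "b" * 400}, {"content": "c" * 400}]
subset, total = take_until_budget(fake_rows, budget_bytes=1000)
print(len(subset), total)  # 2 800

# The shallow size of a dict does not change with the length of the text it holds.
small = sys.getsizeof({"content": "x"})
large = sys.getsizeof({"content": "x" * 100_000})
print(small == large)  # True
```

Because `take_until_budget` accepts any iterable of dicts, the streamed `dataset` object can be passed in place of `fake_rows`, with `budget_bytes=10 * 1024 * 1024` for a 10 MB limit.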