@MichelNivard
Created February 23, 2023 19:57
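# Shell step, run before the Python below: concatenate every manuscript text file into one file.
# cat author_manuscript_txt.incr.2022-12-19/*/*.txt > merged-file.txt
from datasets import load_dataset

# Load the merged file with the "text" builder: each line becomes one example in the "train" split.
dataset = load_dataset('text', data_files="merged-file.txt")
print(dataset)
# Keep only lines longer than 500 characters.
dataset2 = dataset.filter(lambda x: len(x["text"]) > 500)
print(dataset2)
# Slicing returns a dict of columns; [1:10] shows rows 1 through 9 (row 0 is skipped).
print(dataset2['train'][1:10])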
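For checking the merge-and-filter logic without the `datasets` library, here is a minimal standard-library sketch of the same two steps. It assumes one example per line and UTF-8 input; `merge_files` and `long_lines` are hypothetical helper names, not part of the gist or of any library, and `sorted()` only approximates the shell's glob ordering.

```python
import glob
from pathlib import Path


def merge_files(pattern, out_path):
    """Concatenate all files matching `pattern` into `out_path` (a rough stand-in for `cat`)."""
    with open(out_path, "w", encoding="utf-8") as out:
        # sorted() approximates the shell's glob order; the exact order may differ by locale.
        for name in sorted(glob.glob(pattern)):
            out.write(Path(name).read_text(encoding="utf-8"))


def long_lines(path, min_chars=500):
    """Return the lines of `path` that are strictly longer than `min_chars` characters."""
    with open(path, encoding="utf-8") as f:
        stripped = (line.rstrip("\n") for line in f)
        return [line for line in stripped if len(line) > min_chars]
```

Calling `merge_files("author_manuscript_txt.incr.2022-12-19/*/*.txt", "merged-file.txt")` followed by `long_lines("merged-file.txt")` should give roughly the lines that survive the `filter` call, as a plain list rather than a `Dataset`.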
@MichelNivard
Author

Just a copy-paste for myself.
