@GeorgeDittmar
Created December 19, 2020 05:57
Code to generate the training and eval data files
"""
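Now load the data line by line
"""
from sklearn.model_selection import train_test_split

# Prefix every line with the <|title|> marker token.
with open('<path to text file>', 'r') as data:
    dataset = ["<|title|>" + x.strip() for x in data.readlines()]

# Hold out 10% of the lines for evaluation; the fixed seed makes the split reproducible.
# The evaluation split is named eval_set so it does not shadow the built-in eval().
train, eval_set = train_test_split(dataset, train_size=.9, random_state=2020)

# len() returns an int, so convert it before concatenating it with a string.
print("Training size: " + str(len(train)))
print("Evaluation size: " + str(len(eval_set)))

# Write each split to its own file, separating examples with <|endoftext|>.
with open('train_tmp.txt', 'w') as file_handle:
    file_handle.write("<|endoftext|>".join(train))

with open('eval_tmp.txt', 'w') as file_handle:
    file_handle.write("<|endoftext|>".join(eval_set))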
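
To sanity-check the output files, one option is to read a file back and split it on the same separator. The following is a minimal sketch using only the standard library. The helper `read_examples`, the `SEPARATOR` constant, and the sample titles are hypothetical names and data introduced for illustration; they are not part of the original gist.

```python
import os
import tempfile

SEPARATOR = "<|endoftext|>"  # same separator the script writes between examples


def read_examples(path):
    """Read a file of separator-joined examples and return them as a list."""
    with open(path, "r") as file_handle:
        text = file_handle.read()
    # An empty file holds no examples; "".split(SEPARATOR) would return [""].
    return text.split(SEPARATOR) if text else []


# Demo on made-up titles written to a temporary file (hypothetical data).
sample = [
    "<|title|>First headline",
    "<|title|>Second headline",
    "<|title|>Third headline",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "train_tmp.txt")
    with open(sample_path, "w") as file_handle:
        file_handle.write(SEPARATOR.join(sample))
    recovered = read_examples(sample_path)

# The number of recovered examples should match the number written.
print(len(recovered))
```

Calling `read_examples('train_tmp.txt')` after running the script should return a list whose length equals the printed training size, assuming no input line itself contains the separator string.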