Created December 19, 2020 05:57 (gist GeorgeDittmar/08f48fcff7dabf01776eacf0c3b0265f)
Code to generate the training and evaluation text files
"""
Now load the data line by line
"""
from sklearn.model_selection import train_test_split

# Prefix every line with the <|title|> marker and strip surrounding whitespace.
with open('<path to text file>', 'r') as data:
    dataset = ["<|title|>" + x.strip() for x in data.readlines()]

# Keep 90% for training and 10% for evaluation; the fixed seed makes the split repeatable.
# Named eval_data rather than eval to avoid shadowing the built-in eval().
train, eval_data = train_test_split(dataset, train_size=.9, random_state=2020)

# len() returns an int, so convert it before concatenating with a string.
print("Training size: " + str(len(train)))
print("Evaluation size: " + str(len(eval_data)))

# Join the records with <|endoftext|> so each file is one stream of delimited examples.
with open('train_tmp.txt', 'w') as file_handle:
    file_handle.write("<|endoftext|>".join(train))

with open('eval_tmp.txt', 'w') as file_handle:
    file_handle.write("<|endoftext|>".join(eval_data))
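One way to sanity-check the files written above is to read one back and split it on the same delimiter. The sketch below is not part of the original gist: `read_records` is a hypothetical helper, and the two sample headlines are made-up stand-ins for real data, written to a temporary directory so nothing is overwritten. Note that `str.join` places `<|endoftext|>` between records only, so the last record in each file has no trailing delimiter.

```python
import os
import tempfile

END_TOKEN = "<|endoftext|>"
TITLE_TOKEN = "<|title|>"


def read_records(path):
    """Split a delimiter-joined file back into its individual records."""
    with open(path, "r") as handle:
        text = handle.read()
    # split() yields an empty string for an empty file, so drop empty pieces.
    return [record for record in text.split(END_TOKEN) if record]


# Made-up sample records, written the same way the script writes train_tmp.txt.
sample = [TITLE_TOKEN + "first headline", TITLE_TOKEN + "second headline"]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "train_tmp.txt")
    with open(sample_path, "w") as handle:
        handle.write(END_TOKEN.join(sample))
    records = read_records(sample_path)

print("Records read back: " + str(len(records)))
```

Pointing `read_records` at the real `train_tmp.txt` and comparing the count with the printed training size should confirm that no record was lost or merged, assuming no input line itself contains `<|endoftext|>`.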