@iamaaditya
Last active September 14, 2016 02:06
""" This program takes a file and splits its lines into a train set and a test set.
The given percentage of lines, selected at random, goes to the test set; the rest go to the train set.
USAGE: python randomize_split.py <file_name> <split_percentage_for_test_eg_10>
@author: aaditya prakash"""
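from __future__ import division
import os
import random
import sys

# Read the input file into a list of lines (newlines stripped).
with open(sys.argv[1]) as f:
    lines = f.read().splitlines()
split_percent = float(sys.argv[2])
assert 1 <= split_percent <= 100, "give percentage as a number from 1 to 100"
num_lines = len(lines)
# Randomly pick the line indices for the test split (kept in sampled order).
test_split_index = random.sample(range(num_lines), int(split_percent * num_lines / 100))
# All remaining indices form the train split, kept in original file order.
test_index_set = set(test_split_index)
train_split_index = [i for i in range(num_lines) if i not in test_index_set]
# Prefix only the file name, so inputs given with a directory path still work.
dir_name, base_name = os.path.split(sys.argv[1])
with open(os.path.join(dir_name, "train_" + base_name), 'w') as file_train:
    file_train.write('\n'.join(lines[i] for i in train_split_index))
with open(os.path.join(dir_name, "test_" + base_name), 'w') as file_test:
    file_test.write('\n'.join(lines[i] for i in test_split_index))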
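
The script draws a different split on every run, which makes results hard to reproduce. A minimal sketch of a seeded variant follows. The helper name `split_lines` and its `seed` parameter are hypothetical and not part of the original script. It keeps the same sampling rule, but returns both splits in original line order:

```python
import random


def split_lines(lines, test_percent, seed=None):
    """Hypothetical helper: return (train, test) lists of lines.

    `seed` is an assumed extra parameter; passing the same seed
    reproduces the same split.
    """
    rng = random.Random(seed)  # private RNG, leaves the global state alone
    n_test = int(test_percent * len(lines) / 100)
    test_idx = set(rng.sample(range(len(lines)), n_test))
    # Both splits keep the original line order.
    train = [line for i, line in enumerate(lines) if i not in test_idx]
    test = [line for i, line in enumerate(lines) if i in test_idx]
    return train, test
```

Calling `split_lines(lines, 10, seed=0)` twice on the same input should return identical splits, while `seed=None` behaves like the unseeded script.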