@noqqe
Last active August 29, 2015 13:58
Choose a random set of lines from a big dataset
#!/usr/bin/env python3
# usage
# ./randompopulation.py dataset.txt 300
# ./randompopulation.py dataset.txt 9001
import random
import sys
import linecache
# configuration
population=sys.argv[1]
samplesize=int(sys.argv[2])
# count lines of population file
def file_len(fname):
    # iterate over the file once; returns 0 for an empty file
    with open(fname) as f:
        return sum(1 for _ in f)
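# total number of lines in the population file
length = file_len(population)
# draw samplesize lines at random, with replacement (a line can repeat)
for _ in range(samplesize):
    # linecache line numbers are 1-based, so pick from 1..length inclusive
    y = random.randint(1, length)
    # use linecache to fetch that line and strip its trailing newline
    print(linecache.getline(population, y).rstrip('\n'))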
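
The script above samples with replacement, so the same line can be printed more than once, and linecache keeps the whole file in memory. If distinct lines are wanted from a file too large to cache, one possible alternative is single-pass reservoir sampling (Algorithm R). This is a minimal sketch, not part of the original gist; the names `reservoir_sample` and `sample_file` are hypothetical.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Pick up to k items at distinct positions, uniformly, from any iterable.

    Hypothetical helper (not from the original gist). Uses Algorithm R:
    only k items are held in memory at any time.
    """
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


def sample_file(path, k):
    """Hypothetical wrapper: sample k lines from a text file in one pass."""
    with open(path) as f:
        return [line.rstrip('\n') for line in reservoir_sample(f, k)]
```

Calling `sample_file("dataset.txt", 300)` would return up to 300 lines in no particular order. If the file has fewer lines than requested, every line is returned once.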