@felmoltor
Created June 1, 2015 16:26
Prints an evenly spaced sample of the lines of a big file
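#!/usr/bin/python
# Print an evenly spaced sample of the lines of a (possibly large) file.
import os
import sys

if len(sys.argv) < 3:
    print("Usage: %s <source file> <percentage>" % sys.argv[0])
    sys.exit(1)
if not os.path.isfile(sys.argv[1]):
    print("Provide a file from which to extract the sample")
    sys.exit(1)
try:
    percentage = float(sys.argv[2])
except ValueError:
    percentage = 0.0
if not (1 <= percentage <= 100):
    print("Provide a percentage of the sample to take (a number between 1 and 100)")
    sys.exit(1)
srcfile = sys.argv[1]
with open(srcfile, "r") as sf:
    # First pass: count the lines without loading the file into memory.
    nlines = sum(1 for _ in sf)
    # Number of lines to emit; at least 1 so the division below cannot fail.
    nresult = max(1, int(nlines * percentage / 100.0))
    # Keep every step-th line. The step is rounded down, so the sample
    # can be somewhat larger than the requested percentage.
    step = max(1, nlines // nresult)
    sf.seek(0)
    # Second pass: iterate lazily instead of calling readlines().
    for i, line in enumerate(sf, 1):
        if i % step == 0:
            print(line.strip())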
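
The every-step-th-line selection used above can also be sketched as a reusable generator, which makes it easier to try on small inputs. This is a minimal sketch for Python 3, not part of the original script: the function name `sample_every_nth` and its parameters are hypothetical, and it assumes the line count is already known from a first counting pass.

```python
import io


def sample_every_nth(lines, total, percentage):
    """Yield roughly `percentage` percent of `lines`, evenly spaced.

    `total` is the number of lines, known from an earlier counting pass.
    (Hypothetical helper, named here only for illustration.)
    """
    # Number of lines wanted; at least 1 so the division cannot fail.
    wanted = max(1, int(total * percentage / 100.0))
    # Distance between kept lines, rounded down and never below 1.
    step = max(1, total // wanted)
    for i, line in enumerate(lines, 1):
        if i % step == 0:
            yield line.rstrip("\n")


# Usage: an in-memory file of 10 lines stands in for a large file.
data = io.StringIO("".join("line %d\n" % n for n in range(1, 11)))
total = sum(1 for _ in data)  # first pass: count lines
data.seek(0)                  # rewind for the sampling pass
print(list(sample_every_nth(data, total, 30)))
# prints ['line 3', 'line 6', 'line 9']
```

Because the step is rounded down, the sample can come out larger than requested: asking for 40% of 10 lines gives a step of 2 and therefore 5 lines.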