@isoboroff
Created June 13, 2013 16:46
This is a Python script to draw random lines from text files. The key application is where those files are much bigger than RAM and when you really don't want to read the entire file. It works by randomly seeking around in the files, then outputting the next full line. I am concerned that random.randrange(), random.randint(), file.seek(), and fi…
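#!/usr/bin/env python2.7
# Print random lines from text files without reading the files in full.
import os
import random
import argparse

parser = argparse.ArgumentParser(description='Print random lines from a file')
parser.add_argument('-n', dest='sample_size', type=int, default=100,
                    help='number of lines to sample')
# nargs='+' makes argparse report a usage error when no file is given.
parser.add_argument('files', nargs='+', help='files to read from')
args = parser.parse_args()

ins = []    # open file handles
sizes = []  # file sizes in bytes
seen = []   # per file: offsets of line starts that were already printed
for name in args.files:
    sizes.append(os.stat(name).st_size)
    # Binary mode, so seek()/tell() offsets are plain byte positions.
    ins.append(open(name, 'rb'))
    seen.append(set())

i = 0
random.seed(None)  # seed from the OS entropy source or the clock
# Note: this loop never ends if -n exceeds the number of lines it can reach.
while i < args.sample_size:
    # Pick a file, then jump a random distance forward, wrapping at the end.
    w = random.randrange(len(ins))
    f = ins[w]
    size = sizes[w]
    off = (f.tell() + random.randrange(size)) % size
    f.seek(off)
    f.readline()  # skip partial line
    off = f.tell()
    if off in seen[w]:
        continue
    line = f.readline().strip()
    if len(line) == 0:
        # Hit end of file (or a blank line): fall back to the first line.
        if 0 in seen[w]:
            continue
        off = 0
        f.seek(off)
        line = f.readline().strip()
    seen[w].add(off)
    i += 1
    print line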
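
For readers on Python 3, the core seek-then-read-next-line step might be sketched as below. This is not part of the original script: the function name `sample_line`, its `rng` parameter, and the UTF-8 decoding are assumptions made for illustration. It opens the file in binary mode because arbitrary byte offsets are not valid seek targets for Python 3 text-mode files.

```python
import os
import random


def sample_line(path, rng=random):
    """Return one line from `path`, chosen by seeking to a random byte offset.

    Hypothetical helper for illustration; the file is never read in full.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot sample from an empty file")
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()            # discard the rest of the line we landed in
        line = f.readline()     # the next full line
        if not line:            # ran off the end: wrap around to the first line
            f.seek(0)
            line = f.readline()
    # Assumption: the file is UTF-8 text; undecodable bytes are replaced.
    return line.rstrip(b"\r\n").decode("utf-8", errors="replace")
```

Calling `sample_line("big.txt")` repeatedly gives a stream of sampled lines. With this seeking scheme, a line's chance of being picked is proportional to the byte length of the line before it, so the sample is only approximately uniform when line lengths are similar.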