@phreeza
Created December 5, 2016 09:42
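import glob

entries = []
# Each .txt file holds tab-separated records delimited by a line of ten dashes.
for n, fname in enumerate(glob.glob('/Users/tom/Downloads/data/*/*.txt')):
    with open(fname) as f:
        s = f.readlines()
    # Joining with ' ' puts a space after every newline, hence '\n ----------\n'.
    # The [:-1] slice discards whatever follows the last matched separator.
    x = [g.split('\t') for g in ' '.join(s).strip().split('\n ----------\n')][:-1]
    if n % 1000 == 0:
        print(n, fname)  # progress report every 1000 files
    for raw_entry in x:
        entry = {}
        entry['id'] = raw_entry[0].strip()
        entry['sections'] = raw_entry[1].strip().split(' ')
        # Drop line breaks, then collapse the double spaces they can leave behind.
        entry['title'] = raw_entry[2].strip().replace('\n', '').replace('  ', ' ')
        entry['abstract'] = raw_entry[3].strip().replace('\n', '')
        entry['date'] = fname.split('/')[-1][:-4]  # file name without '.txt'
        entries.append(entry)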
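
The parsing step can be tried without the data directory by wrapping it in a function that takes the file contents as a string. This is only a sketch under assumptions: `parse_entries`, the sample text, and the sample file name are all hypothetical, and the real data layout is not shown in the gist. The sample ends with a trailing line after the final separator because the `[:-1]` slice throws away the last chunk; without such a trailer, the last record in a file would be dropped.

```python
def parse_entries(text, fname):
    """Parse one file's contents into a list of entry dicts (hypothetical helper)."""
    # Same effect as ' '.join(f.readlines()): a space follows every newline.
    joined = ' '.join(text.splitlines(True)).strip()
    # Split on the dash line and discard the chunk after the last separator.
    chunks = joined.split('\n ----------\n')[:-1]
    parsed = []
    for chunk in chunks:
        fields = chunk.split('\t')
        parsed.append({
            'id': fields[0].strip(),
            'sections': fields[1].strip().split(' '),
            'title': fields[2].strip().replace('\n', '').replace('  ', ' '),
            'abstract': fields[3].strip().replace('\n', ''),
            'date': fname.split('/')[-1][:-4],  # file name without '.txt'
        })
    return parsed


# Made-up sample: two records, each closed by a dash line, plus a trailer.
sample = (
    "1612.00001\tcs.LG stat.ML\tA short\n"
    "title\tFirst line\n"
    "second line\n"
    "----------\n"
    "1612.00002\tcs.CL\tAnother title\tOnly line\n"
    "----------\n"
    "trailing text\n"
)

parsed = parse_entries(sample, 'data/2016/2016-12-05.txt')
for entry in parsed:
    print(entry['id'], entry['sections'], entry['title'])
```

Keeping the parsing in a function of plain strings makes it easy to check against a few hand-written records before running it over the whole directory.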