@jeremyjbowers
Forked from schwanksta/parse_scotus.py
Created March 27, 2013 20:52
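import re
import json

# Collapse any run of whitespace into a single space.
ws_re = re.compile(r"\s+")
# Transcript line numbers: whitespace, digits, then two or more whitespace characters.
line_num_re = re.compile(r"\s\d+\s{2,}", re.M)

# First, convert the PDF to text: pdftotext -layout <pdf> <text>
with open("12-307_jnt1.txt", "r") as f:
    data = f.read()

# Boilerplate strings to strip from the transcript.
exclude = (
    "Alderson Reporting Company",
    "Official - Subject to Final Review",
)

data = line_num_re.sub("", data)
for xc in exclude:
    data = data.replace(xc, "")
data = ws_re.sub(" ", data)

# Split on speaker labels such as "CHIEF JUSTICE ROBERTS:". The capture group
# keeps each speaker name in the resulting list.
data_split = re.split(r"([A-Z+.]{3,} [A-Z ]+):", data)
# Drop everything before the first speaker label.
del data_split[0]

# Pair each speaker with the remarks that follow. list() is needed on Python 3,
# where zip() returns an iterator that json.dumps() cannot serialize.
pairs = list(zip(data_split[0::2], data_split[1::2]))
js = json.dumps(pairs)

with open("sc-doma.json", "w") as f:
    f.write(js)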
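
To check the splitting logic without a transcript file, the same regular expressions can be run against a short in-memory sample. This is only a sketch: `split_speakers`, `speaker_re`, and the `sample` text are hypothetical names and data made up for illustration, and the `.strip()` on each remark is an addition that the script above does not perform.

```python
import re

# Same patterns as the script above.
line_num_re = re.compile(r"\s\d+\s{2,}", re.M)
ws_re = re.compile(r"\s+")
speaker_re = re.compile(r"([A-Z+.]{3,} [A-Z ]+):")


def split_speakers(text):
    """Return a list of (speaker, remarks) tuples from transcript text.

    Hypothetical helper: it wraps the same steps as the script, minus the
    file I/O and the boilerplate-string removal.
    """
    text = line_num_re.sub("", text)   # drop transcript line numbers
    text = ws_re.sub(" ", text)        # collapse whitespace and newlines
    parts = speaker_re.split(text)     # [preamble, speaker, remarks, ...]
    del parts[0]                       # discard text before the first speaker
    # strip() is an addition here; the original keeps surrounding spaces.
    return [(s, r.strip()) for s, r in zip(parts[0::2], parts[1::2])]


# Made-up sample laid out the way pdftotext -layout output might look.
sample = (
    " 1   CHIEF JUSTICE ROBERTS: We will hear argument\n"
    " 2   this morning in Case 12-307.\n"
    " 3   MR. CLEMENT: Mr. Chief Justice, and may it\n"
    " 4   please the Court:\n"
)

for speaker, remarks in split_speakers(sample):
    print(speaker, "->", remarks)
```

Because the line-number pattern removes any whitespace-digits-whitespace run, a number inside a remark that happens to be followed by two or more spaces would be removed as well, so it is worth spot-checking the output against the PDF.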