@nhfruchter
Created November 30, 2017 18:43
Rough comment citation extraction from FCC text
import string
import re
import requests
from collections import Counter
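# Download the FCC document and fail loudly on an HTTP error.
# The file is Windows-1252 encoded, so decode the raw bytes explicitly.
response = requests.get("https://apps.fcc.gov/edocs_public/attachmatch/DOC-347927A1.txt")
response.raise_for_status()
fcc = response.content.decode('windows-1252')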
# str.replace returns a new string, so the result has to be assigned back.
fcc = fcc.replace("”", "")
fcc = fcc.replace("“", "")
fcc = fcc.replace("et al", "")
# Match two preceding words followed by "comment", "comments", or "reply".
pattern = re.compile(r"(?:\w+\W+){2}(?:comments?|reply)", re.IGNORECASE|re.MULTILINE)
cites = pattern.findall(fcc)
cites = [c.upper() for c in cites]
# Collapse newlines and repeated spaces into single spaces so words stay separated.
cites = [" ".join(c.split()) for c in cites]
# When a match spans a delimiter, keep only the piece that follows the first one.
cites = [c.split(";")[1].strip() if ";" in c else c for c in cites]
cites = [c.split(",")[1].strip() if "," in c else c for c in cites]
cites = [c.split(".")[1].strip() if "." in c else c for c in cites]
# Drop a leading number; the `c and` guard skips strings left empty by the splits above.
cites = [" ".join(c.split(" ")[1:]).strip() if c and c[0] in string.digits else c for c in cites]
cites = [" ".join(c.split(" ")[1:]).strip() if c.startswith("SEE ") else c for c in cites]
cites = [" ".join(c.split(" ")[1:]).strip() if c.startswith("ALSO ") else c for c in cites]
cites = [(" ".join(c.split(" ")[0:-1]), c.split(" ")[-1]) for c in cites]
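# Tally citations per source; `kind` avoids shadowing the built-in `type`.
counts = Counter(source for source, kind in cites)
print(counts.most_common())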
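
The cleanup steps above can be tried offline on a short string. The following is a minimal sketch, not the gist author's code: `clean_citation`, `PATTERN`, and the sample sentence (with the made-up filer names "Acme" and "Globex") are hypothetical, and the pattern assumes `comments?` is the intended spelling of the regex.

```python
import re
from collections import Counter

# Assumption: "comments?" is the intended form of the gist's pattern.
PATTERN = re.compile(r"(?:\w+\W+){2}(?:comments?|reply)", re.IGNORECASE)


def clean_citation(raw):
    """Reduce one raw regex match to a (source, filing_type) pair."""
    # Upper-case, then collapse newlines and repeated spaces.
    text = " ".join(raw.upper().split())
    # If the match spans a delimiter, keep the piece after the first one.
    for sep in (";", ",", "."):
        if sep in text:
            text = text.split(sep)[1].strip()
    # Drop a leading number, then the signal words "SEE" and "ALSO".
    if text[:1].isdigit():
        text = text.partition(" ")[2]
    for signal in ("SEE ", "ALSO "):
        if text.startswith(signal):
            text = text[len(signal):]
    # The last word is the filing type; everything before it is the source.
    source, _, kind = text.rpartition(" ")
    return source, kind


# Invented sample text, not taken from the FCC document.
sample = "See Acme Comments at 12; Globex Reply at 4. See also Acme Reply at 7."
pairs = [clean_citation(m) for m in PATTERN.findall(sample)]
counts = Counter(source for source, kind in pairs)
print(counts)  # Counter({'ACME': 2, 'GLOBEX': 1})
```

Wrapping the steps in one function makes it easier to check a single awkward citation by hand before running the full document through the pipeline.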