@vdavez
Created October 10, 2013 01:48
Scrape the D&Fs for the dc-contracts
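#!/usr/bin/env python
import re
from subprocess import call

# Download a D&F PDF, convert it to text, and return the text.
# The goal was to return the dollar value, but that part doesn't work
# because the D&F formats are inconsistent.
def dandftext(url):
    # url is a backslash-separated path; its third component is used as the PDF file name
    url = re.split(r'\\', url)[2]
    call('wget http://app.ocp.dc.gov/intent_award/D_F/' + url, shell=True)
    call('pdftotext ' + url, shell=True)
    # pdftotext writes a .txt file with the same base name as the PDF
    url_text = re.split(r'\.pdf', url)[0] + '.txt'
    with open(url_text, 'r') as df:
        text = df.read()
    # This is the broken part
    # pdftext = re.findall(r"(^3.*?\n)(.*?)(^4.)", text, re.DOTALL|re.MULTILINE)[0][1]
    # Remove the downloaded PDF and the converted text file
    call('rm DF_*', shell=True)
    return text.strip()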
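
Because the section-based regex above breaks on inconsistent D&F layouts, one looser fallback is to list every dollar amount that appears in the converted text and choose among them afterwards. The sketch below is an assumption, not part of the original gist: `find_dollar_amounts` and `DOLLAR_RE` are hypothetical names, and the pattern only covers amounts written like `$1,250,000.00`, `$950`, or `$ 12,000`.

```python
import re

# Hypothetical helper, not part of the original gist.
# Matches a dollar sign, an optional space, digits with or without
# thousands separators, and an optional two-digit cents part.
DOLLAR_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def find_dollar_amounts(text):
    """Return every dollar amount found in text as a float, in order of appearance."""
    amounts = []
    for match in DOLLAR_RE.finditer(text):
        whole = match.group(1)
        cents = match.group(2) or ""
        # Drop the thousands separators before converting to a number
        amounts.append(float(whole.replace(",", "") + cents))
    return amounts
```

Calling `find_dollar_amounts(dandftext(url))` would return all candidate amounts from one D&F. Deciding which one is the contract value still depends on the layout of each document, so this does not by itself solve the inconsistency noted above.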