@ronan-mch
Created November 14, 2013 18:43
Scrape relator codes from the Library of Congress. Will only work on ScraperWiki.
import scraperwiki
import lxml.html

# Grab the page and turn it into an lxml object
html = scraperwiki.scrape("http://www.loc.gov/marc/relators/relaterm.html")
root = lxml.html.fromstring(html)

authorizedList = root.find_class("authorized")  # relator terms (class "authorized")
codeList = root.find_class("relator-code")  # relator codes (class "relator-code")

# Pair each term with its code and save one row per relator, keyed on the code
for authorized, code in zip(authorizedList, codeList):
    codeDict = {
        'type': authorized.text_content().lower(),
        'code': code.text_content().replace('[', '').replace(']', ''),
    }
    scraperwiki.sqlite.save(unique_keys=['code'], data=codeDict)
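
Since the script above depends on the scraperwiki module, here is a rough standard-library sketch of the same pairing and saving logic that might run elsewhere. It is not the original method: the names RelatorParser, parse_relators, save_relators, and the relators table are hypothetical, SAMPLE is an invented fragment that only imitates the class names used above, and the parser assumes a matching element does not contain a nested tag of the same name. Fetching the live page is left out.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical fragment imitating the markup; not copied from the live page.
SAMPLE = (
    '<dl>'
    '<dt><span class="authorized">Abridger</span> '
    '<span class="relator-code">[abr]</span></dt>'
    '<dt><span class="authorized">Actor</span> '
    '<span class="relator-code">[act]</span></dt>'
    '</dl>'
)


class RelatorParser(HTMLParser):
    """Collect the text of elements with class 'authorized' or 'relator-code'."""

    TARGETS = ("authorized", "relator-code")

    def __init__(self):
        super().__init__()
        self.found = {name: [] for name in self.TARGETS}
        self._current = None  # class name being captured, if any
        self._tag = None      # tag name that opened the capture
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return  # already inside a captured element
        classes = (dict(attrs).get("class") or "").split()
        for name in self.TARGETS:
            if name in classes:
                self._current, self._tag, self._buffer = name, tag, []
                break

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested tag of the same name inside the captured element.
        if self._current is not None and tag == self._tag:
            self.found[self._current].append("".join(self._buffer).strip())
            self._current = None


def parse_relators(html):
    """Return one {'type', 'code'} dict per term/code pair, in page order."""
    parser = RelatorParser()
    parser.feed(html)
    parser.close()
    pairs = zip(parser.found["authorized"], parser.found["relator-code"])
    return [
        {"type": term.lower(), "code": code.replace("[", "").replace("]", "")}
        for term, code in pairs
    ]


def save_relators(rows, db_path=":memory:"):
    """Insert or replace rows keyed on 'code', similar to unique_keys=['code']."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relators (code TEXT PRIMARY KEY, type TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO relators (code, type) VALUES (:code, :type)", rows
    )
    conn.commit()
    return conn


rows = parse_relators(SAMPLE)
conn = save_relators(rows)
```

Making code the primary key and using INSERT OR REPLACE is meant to approximate saving with unique_keys=['code']: rerunning the script updates existing rows instead of duplicating them.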