@edsu
Created August 2, 2011 01:43
Prints out PubMedCentral IDs and their associated license URL (if one is found in the article XML).
#!/usr/bin/env python
"""
Script to go through all the OAI-PMH records in the PubMedCentral database and
print out a tab delimited list of record identifiers and a license url (if one
is included).
You'll need lxml installed to run this.
"""
from lxml import etree
# namespaces we'll use
ns = {'oai': 'http://www.openarchives.org/OAI/2.0/',
      'nlm': 'http://dtd.nlm.nih.gov/2.0/xsd/archivearticle',
      'xlink': 'http://www.w3.org/1999/xlink'}
# initial url, for the first set of records
url = 'http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open'
while True:
    doc = etree.parse(url)

    # iterate through each record, printing out the id and license (if there)
    for record in doc.xpath('oai:ListRecords/oai:record', namespaces=ns):
        record_id = record.xpath('string(oai:header/oai:identifier)', namespaces=ns)
        # this path is relative to the current record, so it starts at oai:metadata
        license_url = record.xpath('string(oai:metadata/nlm:article/nlm:front/nlm:article-meta/nlm:permissions/nlm:license/@xlink:href)', namespaces=ns)
        print("%s\t%s" % (record_id, license_url))

    # use resumption token to construct the url for the next bunch of records
    t = doc.xpath('string(oai:ListRecords/oai:resumptionToken)', namespaces=ns)
    if not t:
        break
    url = "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=%s" % t
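
The per-record extraction can be tried offline against a small hand-written response. What follows is a minimal sketch, not the gist's own code: it swaps lxml for the standard library's `xml.etree.ElementTree` so it runs without extra installs, and the sample XML, the identifiers and token in it, and the `parse_page` helper are all made up for illustration. Real PubMedCentral records carry far more metadata and may not match this trimmed shape exactly.

```python
import xml.etree.ElementTree as ET

# Same namespaces as the script above.
NS = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'nlm': 'http://dtd.nlm.nih.gov/2.0/xsd/archivearticle',
    'xlink': 'http://www.w3.org/1999/xlink',
}

# Hypothetical, heavily trimmed ListRecords response (assumption, not real data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <article xmlns="http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"
                 xmlns:xlink="http://www.w3.org/1999/xlink">
          <front><article-meta><permissions>
            <license xlink:href="http://creativecommons.org/licenses/by/2.0/"/>
          </permissions></article-meta></front>
        </article>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example:2</identifier></header>
      <metadata>
        <article xmlns="http://dtd.nlm.nih.gov/2.0/xsd/archivearticle">
          <front><article-meta/></front>
        </article>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# Path to the license element, relative to a single <record>.
LICENSE_PATH = ('oai:metadata/nlm:article/nlm:front/nlm:article-meta/'
                'nlm:permissions/nlm:license')
# ElementTree addresses namespaced attributes in {uri}name form.
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'


def parse_page(xml_text):
    """Return ([(identifier, license_url), ...], resumption_token)."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall('oai:ListRecords/oai:record', NS):
        identifier = record.findtext('oai:header/oai:identifier',
                                     default='', namespaces=NS)
        license_el = record.find(LICENSE_PATH, NS)
        # Records without a license element get an empty string.
        license_url = license_el.get(XLINK_HREF, '') if license_el is not None else ''
        rows.append((identifier, license_url))
    token = root.findtext('oai:ListRecords/oai:resumptionToken',
                          default='', namespaces=NS)
    return rows, token.strip()


rows, token = parse_page(SAMPLE)
for identifier, license_url in rows:
    print("%s\t%s" % (identifier, license_url))
```

An empty token returned by `parse_page` corresponds to the `if not t: break` check in the script: there are no further pages to request.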