Analysis of Domain Names in the Crossref DOI System

by Kevin Bowrin

created 2017-06-14

The Crossref DOI registration system maintains redirects and metadata for approximately 50,000 academic journals. How healthy is that system?

I used Crossref's OAI-PMH endpoint and the excellent sickle library to do some analysis.

It would be very informative to know the number of HTTP redirects needed and the final HTTP status code for every DOI in the Crossref system. At that scale, checking every DOI is not practical. Instead, I focused on domain names.

  • DNS lookups are fast.
  • If the domain name resolves, a patron would hopefully find something when looking up a DOI.
  • The number of unique domain names in the Crossref system would hopefully be manageable.
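
As a rough illustration of the kind of DNS check this implies, here is a minimal sketch using only the standard library. The function name `resolves` and its `resolver` parameter are my own assumptions for illustration, not code from the analysis itself.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is a hypothetical injection point (defaults to
    socket.getaddrinfo) so the function can be exercised without a network.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved.
        return False
```

A call like `resolves("www.thieme-connect.de")` only says that the name has an address record at the moment of the lookup. It says nothing about whether a web server answers there.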

I tried to find a representative sample of the domain names which a journal's DOIs would redirect to.

For example, the DOI 10.1055/s-00023617 redirects to the URL http://www.thieme-connect.de/products/ejournals/journal/10.1055/s-00023617 which has the domain name www.thieme-connect.de. It's very likely that most of the DOIs for that journal would point to the same domain name.
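
Pulling the domain name out of a redirect target can be sketched with the standard library's URL parser. The helper name `domain_name` is an assumption for illustration; the original code may have done this differently.

```python
from urllib.parse import urlsplit


def domain_name(url):
    """Return the lower-cased host part of a URL, or None if it has none."""
    return urlsplit(url).hostname


url = "http://www.thieme-connect.de/products/ejournals/journal/10.1055/s-00023617"
print(domain_name(url))  # www.thieme-connect.de
```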

I worked backwards in time through the DOIs of each journal.

Some pseudocode:

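import datetime

import dateutil.relativedelta
import peewee
import sickle
from sickle.oaiexceptions import NoRecordsMatch

# `model` (peewee models Journal and DOI) and `get_entry` (resolves a DOI and
# records its domain name) are defined elsewhere and not shown here.

def backoff(current):
  # Months to look back on the next pass: 0 -> 1 -> 2 -> 6 -> 12, then +12.
  if current == 0:
    return 1
  if current == 1:
    return 2
  if current == 2:
    return 6
  if current == 6:
    return 12

  if current < 12:
    return 12

  return current + 12

crossrefOAI = sickle.Sickle('http://oai.crossref.org/OAIHandler', encoding='utf-8')
# Pick a random journal that has not been processed yet.
journal = model.Journal.select().where(model.Journal.processed == False).order_by(peewee.fn.Rand()).first()
today = datetime.date.today()
wayback_months = 0
while True:
  until_date = today - dateutil.relativedelta.relativedelta(months=wayback_months)
  try:
    identifiers = crossrefOAI.ListIdentifiers(metadataPrefix='cr_unixml', set=journal.spec, ignore_deleted=True, until=until_date.isoformat())
  except NoRecordsMatch:
    # sickle raises this when nothing matches the request.
    break
  identifier = next(identifiers, None)
  if identifier is None:
    break
  # Skip DOIs that were stored on an earlier pass and look further back.
  if model.DOI.select().where(model.DOI.identifier == identifier.identifier).exists():
    print("Already processed: ", identifier.identifier)
    wayback_months = backoff(wayback_months)
    continue
  doi_entry = get_entry(identifier)
  # Skip domain names that are already in the data set.
  if model.DOI.select().where(model.DOI.domainname == doi_entry.domainname).exists():
    print("Domain name already exists:", doi_entry.domainname)
    wayback_months = backoff(wayback_months)
    continue
  doi_entry.save()
  wayback_months = 1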

For each chunk of time, if the first DOI's domain name had not been seen before, the DOI entry was stored.
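
To make the size of those chunks concrete, the sketch below restates the backoff schedule from the pseudocode above in compact form and prints how many months back each pass looks during an unbroken run of already-seen DOIs or domain names. The `schedule` helper is hypothetical and added only for illustration.

```python
def backoff(current):
    """Same schedule as the pseudocode above, written with a lookup table."""
    steps = {0: 1, 1: 2, 2: 6, 6: 12}
    if current in steps:
        return steps[current]
    return 12 if current < 12 else current + 12


def schedule(start=0, passes=7):
    """Months looked back on each of `passes` consecutive passes."""
    months = [start]
    for _ in range(passes - 1):
        months.append(backoff(months[-1]))
    return months


print(schedule())  # [0, 1, 2, 6, 12, 24, 36]
```

In the pseudocode the counter is set back to 1 whenever a new domain name is stored, so a real run does not necessarily follow this sequence from start to finish.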

51,153 journals were analyzed.

11,320 DOI entries with unique domain names were found.

For example, the journal Biochemical Society Transactions has three DOIs in the dataset, which redirect to URLs with domain names biochemsoctrans.org, bst.portlandpress.com, and www.biochemsoctrans.org.

On the other hand, 40,451 journals only had DOIs which redirected to URLs with domain names already in the data set. This means many journals share domain names and the associated services for DOI resolution and/or hosting.

372 of those domain names were no longer in the DNS system, which is about 3%.

That's really good! Only a small percentage of domains in the system no longer resolve.

Another interesting thing:

852 URLs redirected to an issue page, not an article, which is about 7%.

3,565 URLs redirected to a journal page, not an article, which is about 31%.
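
The percentages can be rechecked with a few lines of arithmetic. This sketch assumes the denominator is the 11,320 stored DOI entries, which the text implies but does not state outright.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


total = 11_320  # assumed denominator: DOI entries with unique domain names
counts = {
    "no longer in DNS": 372,
    "redirect to issue": 852,
    "redirect to journal": 3_565,
}
for label, count in counts.items():
    print(f"{label}: {share(count, total)}%")
```

Under that assumption the shares come out at roughly 3.3%, 7.5%, and 31.5%, in line with the rounded-down figures quoted above.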

That is a larger percentage than I expected. I'm likely using the ListIdentifiers OAI-PMH verb incorrectly, or there was some other problem in the code I wrote. I'm curious how many article DOIs in the Crossref system don't resolve to article level.

tl;dr:

  • Most domains in Crossref still work.
  • More journals than I expected have DOIs which only point to the journal or issue.
  • A majority of journals use domain names shared with other journals for DOI resolution and/or hosting.