by Kevin Bowrin
created 2017-06-14
The Crossref DOI registration system maintains redirects and metadata for approximately 50,000 academic journals. How healthy is that system?
I used the Crossref OAI-PMH endpoint and the excellent sickle library to do some analysis.
It would be very informative to know the number of HTTP redirects needed and the final HTTP status code for every DOI in the Crossref system, but checking every DOI is not practical. Instead, I focused on domain names, for three reasons:
- DNS lookups are fast.
- If a domain name still resolves, a patron would hopefully find something when looking up a DOI.
- The number of unique domain names in the Crossref system would hopefully be manageable.
I tried to find a representative sample of the domain names which a journal's DOIs would redirect to.
For example, the DOI 10.1055/s-00023617 redirects to the URL http://www.thieme-connect.de/products/ejournals/journal/10.1055/s-00023617 which has the domain name www.thieme-connect.de. It's very likely that most of the DOIs for that journal would point to the same domain name.
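As a minimal sketch of that last step, the domain name can be pulled out of a resolved URL with the standard library. The helper name `domain_name` is hypothetical, and the sketch assumes the final URL has already been obtained by following the DOI's redirects, which is not shown here.

```python
from urllib.parse import urlsplit


def domain_name(url):
    """Return the lowercased host part of a URL, or None if it has none."""
    return urlsplit(url).hostname


# The example URL from the text above.
resolved_url = "http://www.thieme-connect.de/products/ejournals/journal/10.1055/s-00023617"
print(domain_name(resolved_url))  # www.thieme-connect.de
```

Using `hostname` rather than `netloc` drops any port number and normalizes case, so the same host is not counted twice.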
I worked backwards in time through the DOIs of each journal.
Some pseudocode:
    import datetime

    import dateutil.relativedelta
    import peewee
    import sickle

    import model  # assumed: local module with the peewee models Journal and DOI (not shown)


    def backoff(current):
        # Months to look back next: 0 -> 1 -> 2 -> 6 -> 12, then 12 more each time.
        if current == 0:
            return 1
        if current == 1:
            return 2
        if current == 2:
            return 6
        if current == 6:
            return 12
        if current < 12:
            return 12
        return current + 12


    crossrefOAI = sickle.Sickle('http://oai.crossref.org/OAIHandler', encoding='utf-8')
    # Pick a random journal that has not been processed yet.
    journal = model.Journal.select().where(model.Journal.processed == False).order_by(peewee.fn.Rand()).first()
    today = datetime.date.today()
    wayback_months = 0
    while True:
        until_date = today - dateutil.relativedelta.relativedelta(months=wayback_months)
        identifiers = crossrefOAI.ListIdentifiers(metadataPrefix='cr_unixml', set=journal.spec, ignore_deleted=True, until=until_date.isoformat())
        # Only the first identifier returned for this time window is examined.
        identifier = next(identifiers, None)
        if identifier is None:
            break
        if model.DOI.select().where(model.DOI.identifier == identifier.identifier).exists():
            print("Already processed: ", identifier.identifier)
            wayback_months = backoff(wayback_months)
            continue
        # get_entry (not shown) builds the DOI entry, including its domain name.
        doi_entry = get_entry(identifier)
        if model.DOI.select().where(model.DOI.domainname == doi_entry.domainname).exists():
            print("Domain name already exists:", doi_entry.domainname)
            wayback_months = backoff(wayback_months)
            continue
        doi_entry.save()
        wayback_months = 1
For each chunk of time, if the first DOI's domain name had not been seen before, the DOI entry was stored.
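To make the size of those chunks concrete, here is a small sketch that lists how far back the lookups reach when every lookup in a row hits a DOI or domain name that is already known. The `backoff` function restates the schedule from the pseudocode in table form; `lookback_months` is a hypothetical helper added for illustration.

```python
def backoff(current):
    """Same schedule as the pseudocode: 0, 1, 2, 6, 12, then 12-month steps."""
    steps = {0: 1, 1: 2, 2: 6, 6: 12}
    if current in steps:
        return steps[current]
    return 12 if current < 12 else current + 12


def lookback_months(count):
    """Return the first `count` look-back offsets, in months before today."""
    months = 0
    offsets = []
    for _ in range(count):
        offsets.append(months)
        months = backoff(months)
    return offsets


print(lookback_months(8))  # [0, 1, 2, 6, 12, 24, 36, 48]
```

The steps widen quickly, so a journal whose DOIs all share one domain name is dismissed after a handful of requests. Note that the pseudocode resets the offset to 1 whenever a new domain name is stored, so real runs will not always follow this uninterrupted sequence.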
51,153 journals were analyzed.
11,320 DOI entries with unique domain names were found.
For example, the journal Biochemical Society Transactions has three DOIs in the dataset, which redirect to URLs with domain names biochemsoctrans.org, bst.portlandpress.com, and www.biochemsoctrans.org.
On the other hand, 40,451 journals only had DOIs which redirected to URLs with domain names already in the data set. This means many journals share domain names and the associated services for DOI resolution and/or hosting.
372 of the domain names were no longer in the DNS system, which is about 3%.
That's really good! Only a small percentage of domains in the system no longer resolve.
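The post does not show how the DNS check itself was done. One plausible way, sketched here with the standard library only, is to ask the system resolver for an address and treat a lookup error as "no longer resolves". The function name `resolves` and the injectable `lookup` parameter are assumptions made for this illustration.

```python
import socket


def resolves(domain, lookup=socket.getaddrinfo):
    """Return True if the resolver returns at least one address for `domain`."""
    try:
        return bool(lookup(domain, None))
    except (socket.gaierror, UnicodeError):
        # gaierror: the name could not be resolved.
        # UnicodeError: the name is not a valid host name (e.g. an over-long label).
        return False


def unresolved(domains, lookup=socket.getaddrinfo):
    """Return the subset of `domains` that no longer resolve."""
    return [d for d in domains if not resolves(d, lookup)]
```

Calling `unresolved(domains)` over the stored domain names would give the list of dead ones. A check like this only says that the name exists in DNS, not that a web server answers or that the DOI's landing page is still there.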
Another interesting thing:
852 URLs redirected to the issue, not article, which is about 7%.
3,565 URLs redirected to the journal, not article, which is about 31%.
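For reference, the percentages can be recomputed from the counts given in this post. This sketch assumes the denominator for the DNS, issue, and journal figures is the 11,320 stored DOI entries, and that the shared-domain figure is taken over the 51,153 journals analyzed.

```python
doi_entries = 11_320  # DOI entries with unique domain names
journals = 51_153     # journals analyzed

counts = {
    "no longer in DNS": 372,
    "redirected to issue": 852,
    "redirected to journal": 3_565,
}

# Share of the stored DOI entries, in percent, rounded to one decimal place.
shares = {label: round(100 * n / doi_entries, 1) for label, n in counts.items()}
# Share of journals whose DOIs only pointed at already-seen domain names.
shared_domain_share = round(100 * 40_451 / journals, 1)

for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"journals with only already-seen domain names: {shared_domain_share}%")
```

Under those assumptions the shares come out at roughly 3.3%, 7.5%, and 31.5% of the DOI entries, and about 79.1% of the journals.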
That is a larger percentage than I expected. I'm likely using the ListIdentifiers OAI-PMH verb incorrectly, or there was some other problem in the code I wrote. I'm curious how many article DOIs in the Crossref system don't resolve to article level.
tl;dr:
- Most domains in Crossref still work.
- More journals than I expected have DOIs which only point to the journal or issue.
- A majority of journals use domain names shared with other journals for DOI resolution and/or hosting.