@dwinter
Created September 30, 2014 18:10
rentrez and pmc licenses

#What's the easiest way to extract license info from PMC?

As a test, start with a 50-paper request from PMC:

library(rentrez)
library(XML)  # provides xmlTreeParse() and xpathApply(), used below
search <- entrez_search(db="pmc", term="Tetrahymena", retmax=50)

Now, see how long it takes to fetch all 50 papers in each of the formats available for PMC, and how much memory each result takes up:

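# Fetch every record from the search in one format; report elapsed time and size
timed_fetch <- function(type, mode){
    time <- system.time(
        rec <- entrez_fetch(db="pmc", id=search$ids, rettype=type, retmode=mode)
    )
    size <- format(object.size(rec), units="auto")
    # time[3] is the elapsed (wall-clock) time in seconds
    return(list(time=time[3], size=size, rec=rec))
}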



pmc_formats <- list( xml=c("xml", "xml"),
                     medline=c("medline", "text"), 
                     summary=c("docsum", "xml") )

compare_formats <- lapply( pmc_formats, function(x) timed_fetch(x[1], x[2]) )
sapply(compare_formats, "[", 1:2)
##      xml      medline    summary  
## time 2.885    1.269      0.719    
## size "6.3 Mb" "144.3 Kb" "60.8 Kb"

So the summary data is the smallest and the quickest to fetch. Sadly, there is no license information in a docsum:

xmlrec <- xmlTreeParse(compare_formats$summary$rec, useInternalNodes=TRUE)
xpathApply(xmlrec, "//DocSum")[[1]]
## <DocSum>
##   <Id>4123682</Id>
##   <Item Name="PubDate" Type="Date">2014 Sep 19</Item>
##   <Item Name="EPubDate" Type="Date"/>
##   <Item Name="Source" Type="String">Philos Trans R Soc Lond B Biol Sci</Item>
##   <Item Name="AuthorList" Type="List">
##     <Item Name="Author" Type="String">Sereno MI</Item>
##   </Item>
##   <Item Name="Title" Type="String">Origin of symbol-using systems: speech, but not sign, without the semantic urge</Item>
##   <Item Name="Volume" Type="String">369</Item>
##   <Item Name="Issue" Type="String">1651</Item>
##   <Item Name="Pages" Type="String">20130303</Item>
##   <Item Name="ArticleIds" Type="List">
##     <Item Name="pmid" Type="String">25092671</Item>
##     <Item Name="doi" Type="String">10.1098/rstb.2013.0303</Item>
##     <Item Name="pmcid" Type="String">PMC4123682</Item>
##   </Item>
##   <Item Name="DOI" Type="String">10.1098/rstb.2013.0303</Item>
##   <Item Name="FullJournalName" Type="String">Philosophical Transactions of the Royal Society B: Biological Sciences</Item>
##   <Item Name="SO" Type="String">2014 Sep 19;369(1651):20130303</Item>
## </DocSum>

Medline is a text-based format that can contain copyright information in a CI field. But it seems none of these records contain CI fields:

ml_recs <- strsplit( compare_formats$medline$rec, "\n\n")[[1]]
cat(ml_recs[[1]], "\n")
## 
## PMC - PMC4123682
## PMID- 25092671
## IS  - 0962-8436 (Print)
## IS  - 1471-2970 (Electronic)
## VI  - 369
## IP  - 1651
## DP  - 2014 Sep 19
## TI  - Origin of symbol-using systems: speech, but not sign, without the semantic urge.
## LID - 20130303
## AB  - Natural language—spoken and signed—is a multichannel phenomenon, involving facial
##       and body expression, and voice and visual intonation that is often used in the
##       service of a social urge to communicate meaning. Given that iconicity seems
##       easier and less abstract than making arbitrary connections between sound and
##       meaning, iconicity and gesture have often been invoked in the origin of language 
##       alongside the urge to convey meaning. To get a fresh perspective, we critically
##       distinguish the origin of a system capable of evolution from the subsequent
##       evolution that system becomes capable of. Human language arose on a substrate of 
##       a system already capable of Darwinian evolution; the genetically supported
##       uniquely human ability to learn a language reflects a key contact point between
##       Darwinian evolution and language. Though implemented in brains generated by DNA
##       symbols coding for protein meaning, the second higher-level symbol-using system
##       of language now operates in a world mostly decoupled from Darwinian evolutionary 
##       constraints. Examination of Darwinian evolution of vocal learning in other
##       animals suggests that the initial fixation of a key prerequisite to language into
##       the human genome may actually have required initially side-stepping not only
##       iconicity, but the urge to mean itself. If sign languages came later, they would 
##       not have faced this constraint.
## FAU - Sereno, Martin I.
## AU  - Sereno MI
## AUID- ORCID: 0000-0002-7598-7829
## AD  - Experimental Psychology, University College London, London, WC1H 0AP, UK
## LA  - eng
## PT  - Journal Article
## PT  - Review
## TA  - Philos Trans R Soc Lond B Biol Sci
## JT  - Philosophical Transactions of the Royal Society B: Biological Sciences
## AID - 10.1098/rstb.2013.0303 [doi]
## AID - rstb20130303 [pii]
## SO  - Philos Trans R Soc Lond B Biol Sci. 2014 Sep 19;369(1651):.
##       doi:10.1098/rstb.2013.0303.
grep("CI\t", ml_recs)
## integer(0)
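
One caveat on the check above: the record printed earlier lays each field out as a tag padded to four characters followed by `- ` (for example `PMID- ` and `TI  - `), so a pattern that looks for a tab may never match even when a CI field is present. A minimal sketch of a stricter test, assuming every field follows that layout; `has_medline_field` and the toy records are hypothetical and not part of rentrez:

```r
# Hypothetical helper: TRUE for each record with a line that starts with `tag`.
# Assumes the "TAG - " layout seen above (tag left-justified, padded to 4 chars).
has_medline_field <- function(records, tag){
    padded <- formatC(tag, width=4, flag="-")   # "CI" becomes "CI  "
    pattern <- paste0("(^|\n)", padded, "- ")   # tag must open a line
    grepl(pattern, records)
}

# Toy records standing in for ml_recs
toy <- c("PMID- 1\nTI  - A title\nCI  - (c) Some Publisher",
         "PMID- 2\nTI  - Another title")
has_medline_field(toy, "CI")
has_medline_field(toy, "PMID")
```

Running `has_medline_field(ml_recs, "CI")` on the real records would show whether the empty result comes from the data or from the pattern.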

There are loads of PLOS papers in there, which should all be CC-BY, but the license is not mentioned in any of the records:

grepl("plos", ml_recs, ignore.case=TRUE)
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [23] FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE  TRUE FALSE FALSE  TRUE  TRUE
grepl("CC-BY", ml_recs, ignore.case=TRUE)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE

So that leaves us with the XML:

extract_license <- function(rec){
     parsed_xml <- xmlTreeParse(rec, useInternalNodes=TRUE)
     xpathApply(parsed_xml, "//article/front/article-meta/permissions")
}
system.time(
   licenses <- extract_license(compare_formats$xml$rec)
)
##    user  system elapsed 
##   0.584   0.008   0.592
licenses[[1]]
## <permissions>
##   <copyright-statement/>
##   <copyright-year>2014</copyright-year>
##   <license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0/">
##     <license-p>© 2014 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/3.0/">http://creativecommons.org/licenses/by/3.0/</ext-link>, which permits unrestricted use, provided the original author and source are credited.</license-p>
##   </license>
## </permissions>
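
To turn those `<permissions>` nodes into something easier to tabulate, one option is to pull out just the license URL for each article. The following is a rough sketch, assuming the URL always sits in the `href` attribute of `<license>` as it does in the record above; `license_urls` is a hypothetical helper, and `toy_xml` is made-up data shaped like a PMC response:

```r
library(XML)

# Hypothetical helper: one license URL (or NA) per <article> in a PMC XML string.
# Articles without a <license> element, or without an href on it, give NA.
license_urls <- function(rec){
    parsed <- xmlTreeParse(rec, useInternalNodes=TRUE, asText=TRUE)
    articles <- getNodeSet(parsed, "//article")
    vapply(articles, function(art){
        lic <- getNodeSet(art, "./front/article-meta/permissions/license")
        if (length(lic) == 0) return(NA_character_)
        attrs <- xmlAttrs(lic[[1]])
        # Match "href" with or without the "xlink:" prefix
        href <- attrs[grepl("href$", names(attrs))]
        if (length(href) == 0) NA_character_ else href[[1]]
    }, character(1))
}

# Toy input: one article with a license, one without
toy_xml <- paste0(
    '<pmc-articleset>',
    '<article xmlns:xlink="http://www.w3.org/1999/xlink"><front><article-meta>',
    '<permissions><license license-type="open-access" ',
    'xlink:href="http://creativecommons.org/licenses/by/3.0/"/></permissions>',
    '</article-meta></front></article>',
    '<article><front><article-meta/></front></article>',
    '</pmc-articleset>')
license_urls(toy_xml)
```

On the real data the call would be `license_urls(compare_formats$xml$rec)`. Returning one value per article keeps the result aligned with the IDs, which a flat `xpathApply` over all `<permissions>` nodes does not guarantee when some articles lack that element.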

##A possible work-around

The XML files are pretty large, and storing and parsing hundreds of them might not scale well. EUtils lets users limit searches by license. The first record in the above search is a CC-BY paper, so we can confirm that by searching:

pmc_id <- search$ids[1]
entrez_search(db="pmc", term=paste0(pmc_id, "[uid]", " cc by license[filter]"))
## Entrez search result with 1 IDs (max = 1 )

And also confirm that it's not under a more restrictive license:

pmc_id <- search$ids[1]
entrez_search(db="pmc", term=paste0(pmc_id, "[uid]", " cc by-nd license[filter]"))
## Entrez search result with 0 IDs (max = 0 )

This might let us get some license information from just an ID and not the whole XML file?
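
As a sketch of how that could work, the two searches above can be wrapped in a loop over license filters. This assumes other licenses can be queried with filters named in the same style as the two shown here. `license_filters`, `license_term` and `which_licenses` are hypothetical names, and the search function is passed in as an argument so the logic can be tried without network access:

```r
# Filters taken from the two searches above; extend the vector as needed
license_filters <- c("cc by license", "cc by-nd license")

# Build the search term for one ID and one filter, in the same form as above
license_term <- function(id, filter){
    paste0(id, "[uid] ", filter, "[filter]")
}

# Hypothetical helper: which of the filters match a given PMC ID?
# The default search_fn is only evaluated when no replacement is supplied.
which_licenses <- function(id, filters=license_filters,
                           search_fn=rentrez::entrez_search){
    hits <- vapply(filters, function(f){
        res <- search_fn(db="pmc", term=license_term(id, f))
        length(res$ids) > 0
    }, logical(1))
    filters[hits]
}

# Offline check with a stand-in search that only "finds" CC-BY papers
fake_search <- function(db, term){
    list(ids = if (grepl("cc by license", term, fixed=TRUE)) "4123682" else character(0))
}
which_licenses("4123682", search_fn=fake_search)
```

With rentrez installed, `which_licenses(search$ids[1])` would run the real searches. Note that this costs one request per filter per ID, so for many IDs it may not be cheaper than fetching the XML.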
