@sckott
Last active August 29, 2015 14:13

I've been working on the fulltext package lately, and started thinking:

Hmm, it's probably not that easy to extract text from articles for those not familiar with XML, XPath, etc.

So this happened recently:

Installation, loading

install.packages("devtools")
devtools::install_github("ropensci/fulltext")
library("fulltext")

chunks

I talked about chunks() in a recent gist, but just to remind you: chunks() makes it easy to pull out many different sections of an article by name. It returns an R list, which can then be easily converted to a data.frame, used to make some plots, etc. For example:

Get some PLOS articles

opts <- list(fq=list('doc_type:full',"article_type:\"research article\""))
x <- ft_search(query='ecology', from='plos', limit=2, plosopts = opts)$plos$data$id %>% 
  ft_get(from = "plos") 

Get abstracts

x %>% chunks("abstract")
#> $plos
#> $plos$`10.1371/journal.pone.0059813`
#> $plos$`10.1371/journal.pone.0059813`$abstract
#> [1] "It is thought that the science of ecology has experienced conceptual shifts in recent decades, chiefly from viewing nature as static and balanced to a conception of constantly changing, unpredictable, complex ecosystems. Here, we ask if these changes are reflected in actual ecological research over the last 30 years. We surveyed 750 articles from the entire pool of ecological literature and 750 articles from eight leading journals. Each article was characterized according to its type, ecological domain, and applicability, and major topics. We found that, in contrast to its common image, ecology is still mostly a study of single species (70% of the studies); while ecosystem and community studies together comprise only a quarter of ecological research. Ecological science is somewhat conservative in its topics of research (about a third of all topics changed significantly through time), as well as in its basic methodologies and approaches. However, the growing proportion of problem-solving studies (from 9% in the 1980s to 20% in the 2000 s) may represent a major transition in ecological science in the long run."
#> 
#> 
#> $plos$`10.1371/journal.pone.0001248`
#> $plos$`10.1371/journal.pone.0001248`$abstract
#> [1] "BackgroundSoil ecology has produced a huge corpus of results on relations between soil organisms, ecosystem processes controlled by these organisms and links between belowground and aboveground processes. However, some soil scientists think that soil ecology is short of modelling and evolutionary approaches and has developed too independently from general ecology. We have tested quantitatively these hypotheses through a bibliographic study (about 23000 articles) comparing soil ecology journals, generalist ecology journals, evolutionary ecology journals and theoretical ecology journals.FindingsWe have shown that soil ecology is not well represented in generalist ecology journals and that soil ecologists poorly use modelling and evolutionary approaches. Moreover, the articles published by a typical soil ecology journal (Soil Biology and Biochemistry) are cited by and cite low percentages of articles published in generalist ecology journals, evolutionary ecology journals and theoretical ecology journals.ConclusionThis confirms our hypotheses and suggests that soil ecology would benefit from an effort towards modelling and evolutionary approaches. This effort should promote the building of a general conceptual framework for soil ecology and bridges between soil ecology and general ecology. We give some historical reasons for the parsimonious use of modelling and evolutionary approaches by soil ecologists. We finally suggest that a publication system that classifies journals according to their Impact Factors and their level of generality is probably inadequate to integrate “particularity” (empirical observations) and “generality” (general theories), which is the goal of all natural sciences. Such a system might also be particularly detrimental to the development of a science such as ecology that is intrinsically multidisciplinary."

I added more section types today, for example:

Publishing history

x %>% chunks("history")
#> $plos
#> $plos$`10.1371/journal.pone.0059813`
#> $plos$`10.1371/journal.pone.0059813`$history
#> $plos$`10.1371/journal.pone.0059813`$history$received
#> [1] "2012-09-16"
#> 
#> $plos$`10.1371/journal.pone.0059813`$history$accepted
#> [1] "2013-02-19"
#> 
#> 
#> 
#> $plos$`10.1371/journal.pone.0001248`
#> $plos$`10.1371/journal.pone.0001248`$history
#> $plos$`10.1371/journal.pone.0001248`$history$received
#> [1] "2007-07-02"
#> 
#> $plos$`10.1371/journal.pone.0001248`$history$accepted
#> [1] "2007-11-06"
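
If you want to go from a nested list like this to a data.frame by hand, here is a minimal sketch in base R. It does not call fulltext at all: the `res` list is typed in by hand to mimic the shape of the output above, and the names `res`, `rows`, and `history_df` are just made up for illustration.

```r
# Hand-built list mimicking the shape of the chunks("history") output above.
# Typed in by hand so the example runs without fulltext or a network call.
res <- list(plos = list(
  `10.1371/journal.pone.0059813` = list(
    history = list(received = "2012-09-16", accepted = "2013-02-19")),
  `10.1371/journal.pone.0001248` = list(
    history = list(received = "2007-07-02", accepted = "2007-11-06"))
))

# Build one single-row data.frame per article: DOI plus the two history dates
rows <- lapply(names(res$plos), function(doi) {
  h <- res$plos[[doi]]$history
  data.frame(doi = doi, received = h$received, accepted = h$accepted,
             stringsAsFactors = FALSE)
})

# Stack the per-article rows into one data.frame
history_df <- do.call(rbind, rows)
history_df
```

This only works as written when every article has both a received and an accepted date; a missing element would need to be replaced with NA first.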

Acknowledgments

x %>% chunks("acknowledgments")
#> $plos
#> $plos$`10.1371/journal.pone.0059813`
#> $plos$`10.1371/journal.pone.0059813`$acknowledgments
#> [1] "Curtis Flather, Mark Burgman, Leon Blaustein, Yaacov Garb, Yaron Ziv and Daniel Statman have provided valuable comments on a draft of this manuscript."
#> 
#> 
#> $plos$`10.1371/journal.pone.0001248`
#> $plos$`10.1371/journal.pone.0001248`$acknowledgments
#> list()
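
Note that the second article has no acknowledgments, so you get an empty `list()` back. If you want one value per article, one way to handle that is to swap empty results for `NA`. A rough sketch in base R, assuming the list shape shown above; the `ack` list is hand-built, with a placeholder string standing in for the real acknowledgment text.

```r
# Hand-built stand-in for the chunks("acknowledgments") output above.
# The first entry uses placeholder text; the second is empty, as in the output.
ack <- list(
  `10.1371/journal.pone.0059813` = list(acknowledgments = "placeholder acknowledgment text"),
  `10.1371/journal.pone.0001248` = list(acknowledgments = list())
)

# Return NA when a section is empty, otherwise the text itself
ack_chr <- vapply(ack, function(a) {
  val <- a$acknowledgments
  if (length(val) == 0) NA_character_ else as.character(val)[1]
}, character(1))

# One row per article, with NA marking the missing section
ack_df <- data.frame(doi = names(ack_chr), acknowledgments = unname(ack_chr),
                     stringsAsFactors = FALSE)
ack_df
```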

Permissions

x %>% chunks("permissions")
#> $plos
#> $plos$`10.1371/journal.pone.0059813`
#> $plos$`10.1371/journal.pone.0059813`$permissions
#> $plos$`10.1371/journal.pone.0059813`$permissions$`copyright-year`
#> [1] "2013"
#> 
#> $plos$`10.1371/journal.pone.0059813`$permissions$`copyright-holder`
#> [1] "Carmel et al"
#> 
#> $plos$`10.1371/journal.pone.0059813`$permissions$license
#> [1] "This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited."
#> 
#> $plos$`10.1371/journal.pone.0059813`$permissions$license_url
#> [1] NA
#> 
#> 
#> 
#> $plos$`10.1371/journal.pone.0001248`
#> $plos$`10.1371/journal.pone.0001248`$permissions
#> $plos$`10.1371/journal.pone.0001248`$permissions$`copyright-year`
#> [1] "2007"
#> 
#> $plos$`10.1371/journal.pone.0001248`$permissions$`copyright-holder`
#> [1] "Barot et al"
#> 
#> $plos$`10.1371/journal.pone.0001248`$permissions$license
#> [1] "This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited."
#> 
#> $plos$`10.1371/journal.pone.0001248`$permissions$license_url
#> [1] NA

tabularize

I just added a new function today, tabularize(), to further help go from lists to data.frames. Note that converting to a data.frame may not make sense for some article sections.

Get some data

x <- ft_get(ids=c("10.3389/fnagi.2014.00130",'10.1155/2014/249309','10.1155/2014/162024'), 
    from='entrez')

Get doi and keywords

x %>% chunks(c("doi","keywords")) %>% tabularize()
#> $entrez
#>                        doi               keywords
#> 1      10.1155/2014/162024                   <NA>
#> 2 10.3389/fnagi.2014.00130                  aging
#> 3 10.3389/fnagi.2014.00130                 stroke
#> 4 10.3389/fnagi.2014.00130           cell therapy
#> 5 10.3389/fnagi.2014.00130                  G-CSF
#> 6 10.3389/fnagi.2014.00130 translational medicine
#> 7 10.3389/fnagi.2014.00130                 BM MSC
#> 8 10.3389/fnagi.2014.00130           angiogenesis
#> 9      10.1155/2014/249309                   <NA>
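
Once the output is a data.frame like this, the usual base R tools apply. For example, counting keywords per article could look something like the following. The `kw` data.frame is typed in by hand to match the table above (so nothing here depends on fulltext), and `n_kw` is just an illustrative name.

```r
# Hand-typed copy of the doi/keywords table above
kw <- data.frame(
  doi = c("10.1155/2014/162024",
          rep("10.3389/fnagi.2014.00130", 7),
          "10.1155/2014/249309"),
  keywords = c(NA, "aging", "stroke", "cell therapy", "G-CSF",
               "translational medicine", "BM MSC", "angiogenesis", NA),
  stringsAsFactors = FALSE
)

# Count the non-missing keywords for each DOI
n_kw <- tapply(!is.na(kw$keywords), kw$doi, sum)
n_kw
```

Articles without keywords show up with a count of 0 rather than being dropped, because the `NA` rows are kept and counted as `FALSE`.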

Authors

x %>% chunks("authors") %>% tabularize()
#> $entrez
#>   authors.given_names authors.surname authors.given_names.1
#> 1                Qing             Gao               Maojuan
#> 2        Adrian Tudor        Balseanu             Ana-Maria
#> 3             Xuefeng             Gao              J. Tyson
#>   authors.surname.1 authors.given_names.2 authors.surname.2
#> 1               Guo                Xijuan             Jiang
#> 2              Buga                Bogdan           Catalin
#> 3          McDonald                 Mamta             Naidu
#>   authors.given_names.3 authors.surname.3 authors.given_names.4
#> 1              Xiantong                Hu                Yijing
#> 2      Daniel-Christoph            Wagner              Johannes
#> 3                Philip         Hahnfeldt                  Lynn
#>   authors.surname.4 authors.given_names.5 authors.surname.5
#> 1              Wang             Yingchang               Fan
#> 2            Boltze             Ana-Maria           Zagrean
#> 3            Hlatky                  <NA>              <NA>
#>   authors.given_names.6 authors.surname.6 authors.given_names.7
#> 1                  <NA>              <NA>                  <NA>
#> 2                 Klaus           Reymann                  Wolf
#> 3                  <NA>              <NA>                  <NA>
#>   authors.surname.7 authors.given_names.8 authors.surname.8
#> 1              <NA>                  <NA>              <NA>
#> 2         Schaebitz                 Aurel       Popa-Wagner
#> 3              <NA>                  <NA>              <NA>

DOI, publisher, and history

x %>% chunks(c("doi","publisher","history")) %>% tabularize()
#> $entrez
#>                        doi                 publisher.name history.received
#> 1      10.1155/2014/162024 Hindawi Publishing Corporation       2014-01-10
#> 2 10.3389/fnagi.2014.00130           Frontiers Media S.A.       2014-04-06
#> 3      10.1155/2014/249309 Hindawi Publishing Corporation       2014-01-22
#>   history.rev.recd history.accepted
#> 1       2014-03-26       2014-04-16
#> 2             <NA>       2014-06-03
#> 3       2014-04-17       2014-04-17

Just history

x %>% chunks("history") %>% tabularize()
#> $entrez
#>   history.received history.rev.recd history.accepted
#> 1       2014-01-10       2014-03-26       2014-04-16
#> 2       2014-04-06             <NA>       2014-06-03
#> 3       2014-01-22       2014-04-17       2014-04-17
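
With the dates in a data.frame, something like time from submission to acceptance is a one-liner. A small sketch, again with the table above typed in by hand rather than pulled via fulltext. I'm assuming here that the date columns are plain strings in YYYY-MM-DD form; `hist_df` and `days_to_accept` are made-up names.

```r
# Hand-typed copy of the history table above
hist_df <- data.frame(
  history.received = c("2014-01-10", "2014-04-06", "2014-01-22"),
  history.rev.recd = c("2014-03-26", NA, "2014-04-17"),
  history.accepted = c("2014-04-16", "2014-06-03", "2014-04-17"),
  stringsAsFactors = FALSE
)

# Days between received and accepted, via base R Date arithmetic
hist_df$days_to_accept <- as.numeric(
  as.Date(hist_df$history.accepted) - as.Date(hist_df$history.received)
)
hist_df
```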

Hope you like it!

🐜 🐘 🐅
