DVCS in Europe PMC OA articles
dvcs base.url
GitHub github.com
BitBucket bitbucket.org
GoogleCode code.google.com
LaunchPad launchpad.net
SourceForge sourceforge.net

Aim

The aim of this post is to track the use of Open-Source software hosting facilities for disclosing DVCS repositories in life science research papers.

Data and Methods

Our sample consists of the following five hosting services, selected from Wikipedia's comparison of open-source software hosting facilities:

my.urls <- read.csv("dvcs_url.csv", header = TRUE, sep = ",")
print(my.urls)
##          dvcs        base.url
## 1      GitHub      github.com
## 2   BitBucket   bitbucket.org
## 3  GoogleCode code.google.com
## 4   LaunchPad   launchpad.net
## 5 SourceForge sourceforge.net

We used Europe PMC as the literature corpus and searched for the URL patterns within the subset of publications for which Europe PMC holds the full text. For this, we used the rebi package, provided by rOpenSci.

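library(rebi)
library(plyr)
# Query the Europe PMC full-text subset once per base URL
my.urls$base.url <- as.character(my.urls$base.url)
my.data <- lapply(my.urls$base.url, search_publications, dataset = c("fulltext"))
# Name each result by its base URL so that ldply() records it in the .id column
names(my.data) <- my.urls$base.url
my.data <- ldply(my.data, rbind)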

Results

We have found

length(unique(my.data$id))

[1] 3190

publications referencing at least one Open-Source software hosting service.

Table 1 ranks the hosting services by the number of PMC publications found.

id                 PMC Publications found
sourceforge.net    1495
code.google.com     914
github.com          832
bitbucket.org        88
launchpad.net        16

Figure 1 plots the yearly distribution of DVCS hosting services across PubMed Central publications. Please note that data were gathered on

[1] "2014-05-23 16:12:24 CEST"

library(ggplot2)

# Keep publication years 2009 to 2013
my.data <- my.data[my.data$pubYear > 2008 & my.data$pubYear < 2014, ]
# Order hosts by overall frequency so that the legend follows the ranking
my.data$.id <- factor(my.data$.id, levels = names(sort(table(my.data$.id), decreasing = TRUE)))

# Long format: one row per year (Var1) and host (Var2) with its count (Freq)
my.df <- as.data.frame(table(unlist(my.data$pubYear), my.data$.id))

# theme() and element_rect() replace the removed opts() and theme_rect()
ggplot(my.df, aes(Var1, Freq, group = Var2)) + geom_line(aes(colour = Var2)) + 
    geom_point() + theme_bw() + scale_colour_brewer(name = "DVCS Host", 
    palette = 2, type = "qual") + xlab("Year") + ylab("PMC article disclosure") + 
    theme(legend.key = element_rect(fill = "white", colour = "white"))

[Figure 1: PMC articles per year referencing each DVCS host, 2009 to 2013]

Discussion

We have found that GitHub is gaining in importance for data and code disclosure in the life sciences compared to other DVCS hosting services.

## Aim
The aim of this post is to track the use of Open-Source software hosting facilities for disclosing DVCS repositories in life science research papers.
## Data and Methods
Our sample consists of the following five hosting services, selected from [Wikipedia's comparison of open-source software hosting facilities](http://en.wikipedia.org/wiki/Comparison_of_open-source_software_hosting_facilities):
```{r, warning = FALSE, message = FALSE}
my.urls <- read.csv("dvcs_url.csv", header = TRUE, sep = ",")
print(my.urls)
```
We used Europe PMC as the literature corpus and searched for the URL patterns within the subset of publications for which Europe PMC holds the full text. For this, we used the [rebi](http://ropensci.github.io/rebi/) package, provided by [rOpenSci](http://ropensci.org).
```{r, warning = FALSE, message = FALSE}
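library(rebi)
library(plyr)
# Query the Europe PMC full-text subset once per base URL
my.urls$base.url <- as.character(my.urls$base.url)
my.data <- lapply(my.urls$base.url, search_publications, dataset = c("fulltext"))
# Name each result by its base URL so that ldply() records it in the .id column
names(my.data) <- my.urls$base.url
my.data <- ldply(my.data, rbind)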
```
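To make the combining step more concrete, here is a minimal base R sketch of roughly what `ldply(my.data, rbind)` does with a named list: each data frame is tagged with the name of its list element in a `.id` column, and the pieces are stacked row-wise. The `toy.results` list and its ids are made up for illustration and are not Europe PMC output.

```r
# Toy stand-ins for the per-service search results; the ids are made up.
toy.results <- list(
  "github.com"      = data.frame(id = c("PMC1", "PMC2"), pubYear = c(2012, 2013)),
  "sourceforge.net" = data.frame(id = c("PMC2", "PMC3"), pubYear = c(2013, 2011))
)

# Tag each data frame with the name of its list element (the base URL),
# then stack the tagged pieces row-wise into one data frame.
tagged <- Map(function(df, host) cbind(.id = host, df),
              toy.results, names(toy.results))
toy.data <- do.call(rbind, tagged)
rownames(toy.data) <- NULL

print(toy.data)
```

The resulting `.id` column is what the later ranking and plotting steps group by.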
## Results
We have found
```{r , results='asis', warning = FALSE, message = FALSE}
length(unique(my.data$id))
```
publications referencing at least one Open-Source software hosting service.
Table 1 ranks the hosting services by the number of PMC publications found.
```{r , results='asis', echo=FALSE, warning = FALSE, message = FALSE}
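# Build a one-column table with hosts as row names; works whether or not sort() keeps the table class
host.counts <- sort(table(my.data$.id), decreasing = TRUE)
my.table <- data.frame(as.vector(host.counts), row.names = names(host.counts))
colnames(my.table) <- "PMC Publications found"
knitr::kable(my.table, format = "markdown")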
```
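The per-service counts in Table 1 add up to 3345, which is more than the 3190 unique publications reported above. One likely explanation is that some articles reference more than one hosting service and are therefore counted once per service. The sketch below illustrates the effect with a made-up data frame; the ids are hypothetical.

```r
# Made-up example: PMC2 mentions both hosting services.
toy.data <- data.frame(
  .id = c("github.com", "github.com", "sourceforge.net", "sourceforge.net"),
  id  = c("PMC1", "PMC2", "PMC2", "PMC3"),
  stringsAsFactors = FALSE
)

per.host <- table(toy.data$.id)          # hits per hosting service
n.unique <- length(unique(toy.data$id))  # distinct publications

sum(per.host)  # 4 hits in total
n.unique       # 3 distinct publications
```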
Figure 1 plots the yearly distribution of DVCS hosting services across PubMed Central publications. Please note that data were gathered on
```{r , results='asis', echo=FALSE}
print(Sys.time())
```
```{r simpleplot, echo=TRUE, fig.height=8/2.54, fig.width=18/2.54, warning = FALSE, message = FALSE}
library(ggplot2)
# Keep publication years 2009 to 2013
my.data <- my.data[my.data$pubYear > 2008 & my.data$pubYear < 2014, ]
# Order hosts by overall frequency so that the legend follows the ranking
my.data$.id <- factor(my.data$.id, levels = names(sort(table(my.data$.id), decreasing = TRUE)))
# Long format: one row per year (Var1) and host (Var2) with its count (Freq)
my.df <- as.data.frame(table(unlist(my.data$pubYear), my.data$.id))
# theme() and element_rect() replace the removed opts() and theme_rect()
ggplot(my.df, aes(Var1, Freq, group = Var2)) +
  geom_line(aes(colour = Var2)) + geom_point() + theme_bw() +
  scale_colour_brewer(name = "DVCS Host", palette = 2, type = "qual") +
  xlab("Year") + ylab("PMC article disclosure") +
  theme(legend.key = element_rect(fill = "white", colour = "white"))
```
## Discussion
We have found that [GitHub](http://github.com) is gaining in importance for data and code disclosure in the life sciences compared to other DVCS hosting services.
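
One way to make "gaining in importance" measurable would be to look at each host's share of the yearly references rather than at raw counts. The following is only a sketch under that assumption: `toy.counts` holds hypothetical numbers, not the results of this analysis.

```r
# Hypothetical yearly counts (rows: years, columns: hosts); not real data.
toy.counts <- matrix(
  c(10, 30,
    20, 30,
    45, 30),
  ncol = 2, byrow = TRUE,
  dimnames = list(year = c("2011", "2012", "2013"),
                  host = c("github.com", "sourceforge.net"))
)

# Share of each host within a year (each row sums to 1).
yearly.share <- prop.table(toy.counts, margin = 1)
round(yearly.share, 2)
```

With the real data, the same call could be applied to `table(my.data$pubYear, my.data$.id)`.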