Skip to content

Instantly share code, notes, and snippets.

@timchurches
Created October 23, 2017 04:06
Show Gist options
  • Save timchurches/442a7aaaa31f03d918cbef3c46f1d88a to your computer and use it in GitHub Desktop.
Save timchurches/442a7aaaa31f03d918cbef3c46f1d88a to your computer and use it in GitHub Desktop.
Demo of second-order co-authship graphs using the rOpenSci Rorcid interface to ORCiD
---
title: "Visualising co-authorship networks using ORCID"
author: "Tim Churches"
date: "Mon 23rd October, 2017"
output: html_notebook
---
## Pre-amble
This R notebook re-uses code by Simon Goring described in [this blog entry](https://downwithtime.wordpress.com/2015/02/12/building-your-network-using-orcid-and-ropensci/) and published [here](https://github.com/SimonGoring/ORCiD_Network/blob/master/rorcid_network.R), and combines it with the [visNetwork package](http://datastorm-open.github.io/visNetwork/).
## Set-up
The following R packages are required:
```{r}
# library(devtools)
# install_github('ropensci/rorcid')
library(rorcid)
library(dplyr)
library(igraph)
library(visNetwork)
```
## Target an author
The idea is to fetch the ORCID record for an author and hence get all their papers, and all the papers of people they've published with - that is, a second-order authorship graph. This assumes completeness of ORCID records for the tragetted author and all of his or her co-authors, an assumption which is almost certainly false. Currently this script just ignores that problem.
The examole ORCID ID below belongs to # this Di Cook, one of the [rOpenSci ozunconf17](https://github.com/ropensci/ozunconf17) organisers
```{r}
# Just replace the ORCiD below with your own
target_orcid <- "0000-0002-3813-7155"
```
## Query ORCID for data
### Fetch author details from ORCID
```{r}
orcid.record <- orcid_id(orcid = target_orcid)
au <- paste(orcid.record[[1]]$`orcid-bio`$`personal-details`$`given-names`$value, orcid.record[[1]]$`orcid-bio`$`personal-details`$`family-name`$value)
```
```{r, echo=FALSE}
get_doi <- function(x){
# This pulls the DOIs out of the ORCiD record:
list.x <- x$'work-external-identifiers.work-external-identifier'
# We have to catch a few objects with NULL DOI information:
do.call(rbind.data.frame,lapply(list.x, function(x){
if(length(x) == 0 | (!'DOI' %in% x[,1])){
data.frame(value=NA)
} else{
data.frame(value = x[which(x[,1] %in% 'DOI'),2])
}
}))
}
get_papers <- function(x){
all.papers <- x[[1]]$works # this is where the papers are.
papers <- data.frame(title = all.papers$'work-title.title.value',
doi = get_doi(all.papers))
paper.doi <- lapply(1:nrow(papers), function(x){
if(!is.na(papers[x,2]))return(orcid_doi(dois = check_dois(papers[x,2])$good, fuzzy = FALSE))
return(NA)
})
your.papers <- lapply(1:length(paper.doi), function(x){
if(length(paper.doi[[x]]) == 0 || is.na(paper.doi[[x]])){
data.frame(doi=NA, orcid=NA, name=NA)
} else {
data.frame(doi = papers[x,2],
orcid = paper.doi[[x]][[1]]$'orcid-identifier.path',
name = paste(paper.doi[[x]][[1]]$'personal-details.given-names.value',
paper.doi[[x]][[1]]$'personal-details.family-name.value', sep = ' '),
stringsAsFactors = FALSE)
}})
do.call(rbind.data.frame, your.papers)
}
```
### Fetch details of papers and co-authors for the target author
```{r}
paper.set <- get_papers(orcid.record)
unique.orcids <- unique(paper.set$orcid)
```
There are r`length(unique.orcids` unique ORCID IDs identified from the target author's papers.
### Fetch details of each paper for each co-author
This is remarkably slow, I think because ORCID deliberately rate-limits API queries to prevent it from being overwhelmed or DDOSed by too many simultaneous queries.
```{r}
all.coauthors <- list()
for(i in 1:length(unique.orcids)){
cat(i, "of", length(unique.orcids), "orcid=", unique.orcids[i], "\n")
if (!is.na(unique.orcids[i])) {
all.coauthors[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
}
}
```
### Convert to a data frame and to an adjacency matrix
```{r}
all.df <- do.call(rbind.data.frame,all.coauthors)
all.df <- na.omit(all.df[!duplicated(all.df),])
all.pairs <- matrix(ncol = length(unique(all.df$name)),
nrow = length(unique(all.df$name)),
dimnames = list(unique(all.df$name),unique(all.df$name)), 0)
unique.dois <- unique(as.character(all.df$doi))
# remove duplicates
for(i in 1:length(unique.dois)){
doi <- unique.dois[i]
all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <-
all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1
}
all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]
all.pairs2 <- all.pairs
# all.pairs2[rownames(all.pairs2) == au,]
# remove self-adjacency
diag(all.pairs) <- 0
```
## Visualise the graph
### Visualise adjacency
```{r, fig.width=10}
author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
author.adj2 <- graph.adjacency(all.pairs2, mode = 'undirected', weighted = TRUE)
plot(author.adj, vertex.label.cex = 0.6, vertex.size=0.2, edge.width = E(author.adj2)$weight)
```
Hmmm, not exactly ideal...this might be better:
```{r, fig.width=10}
plot(author.adj, vertex.label.cex = 0.9, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
```
Or this:
```{r, fig.width=10}
plot(author.adj, vertex.label.cex = 0.5, vertex.shape="none", edge.width = 1)
```
Or this:
```{r, fig.width=10}
plot(author.adj, vertex.shape="none", vertex.label=V(author.adj)$name, vertex.label.font=2, vertex.label.color="gray40", vertex.label.cex=.7, edge.color="gray85")
```
The problem is that for many authors, there is just too much data in a second-order co-authorship graph to display effectively as a static image.
Let's try a different layout.
```{r, fig.width=10}
l <- layout.kamada.kawai(author.adj)
plot(author.adj, vertex.label.cex = 0.9, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight, layout=l)
```
Try pruning low weight edges from the graph:
```{r, fig.width=10}
# E(author.adj)$weight
author.adj.sp <- delete.edges(author.adj, E(author.adj)[weight<4])
plot(author.adj, vertex.label.cex = 0.9, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
```
OK, let's do some graph analysis! Find the largest cliques:
```{r}
largest_cliques(author.adj)
```
The first-order graph
```{r, fig.width=10}
a <- make_ego_graph(author.adj, order=1, nodes=V(author.adj)[name==au])[[1]]
plot(a, vertex.label.cex = 3, vertex.shape="none", vertex.size=0.5, edge.width = E(a)$weight*2)
```
But the second-order graph is way too busy
```{r, fig.width=10}
a <- make_ego_graph(author.adj, order=2, nodes=V(author.adj)[name==au])[[1]]
plot(author.adj, vertex.label.cex = 2, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
```
## Interactive visualisation
The output works best if opened in a browser window. You may need to zoom in using the controls at the bottom right in order to see the author labels. Hovering over an edge (link) should display the name of each paper. You can highlight edges for an aouthor by clickiing on their node. To display just the node for a particular (co-)author, choose them from the pull-down on the upper left.
```{r}
authors <- all.df %>% distinct(name, orcid) %>% mutate(id=orcid, title=name, label=name, value=10, font.size=15)
authors$group = "Other"
authors[authors$orcid == target_orcid,]$group <- "Origin"
authors[authors$orcid == target_orcid,]$font.size <- 20
all.papers <- orcid.record[[1]]$works
paper_details <- data.frame(title = paste0("<p>", all.papers$'work-title.title.value', "<br><b>", all.papers$'journal-title.value', "</b>, ", all.papers$'publication-date.year.value', "</p>"), doi=as.character(get_doi(all.papers)$value), stringsAsFactors = FALSE)
papers <- all.df %>% inner_join(all.df, by="doi") %>% filter(orcid.x != orcid.y) %>% mutate(author.left=ifelse(orcid.x < orcid.y, name.x, name.y), orcid.left=ifelse(orcid.x < orcid.y, orcid.x, orcid.y), author.right=ifelse(orcid.x > orcid.y, name.x, name.y), orcid.right=ifelse(orcid.x > orcid.y, orcid.x, orcid.y)) %>% select(doi, author.left, orcid.left, author.right, orcid.right) %>% distinct(doi, orcid.left, orcid.right, .keep_all=TRUE) %>% left_join(paper_details, by="doi") %>% mutate(from=orcid.left, to=orcid.right) %>% distinct(doi, orcid.left, orcid.right, .keep_all=TRUE)
title <- paste0("Co-authorship graph for ", au, " (ORCID ", target_orcid, ")")
visNetwork(authors, papers, height = "600px", width = "100%", main = title) %>% visPhysics(solver = "forceAtlas2Based", forceAtlas2Based = list(gravitationalConstant = 1500)) %>% visGroups(groupname = "Origin" , color = "pink", shape = "box", shadow = list(enabled = TRUE)) %>% visGroups(groupname = "Other", color = "lightblue", shape = "box") %>% visPhysics(stabilization = FALSE) %>% visOptions(highlightNearest = list(enabled = TRUE, degree=1, hover = TRUE)) %>% visInteraction(navigationButtons = TRUE) %>% visOptions(selectedBy = "name")
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment