timchurches/ORCID-second-order-coauthors.Rmd

## ORCID-second-order-coauthors.Rmd
---
title: "Visualising co-authorship networks using ORCID"
author: "Tim Churches"
date: "Mon 23rd October, 2017"
output: html_notebook
---

## Pre-amble

This R notebook re-uses code by Simon Goring described in [this blog entry](https://downwithtime.wordpress.com/2015/02/12/building-your-network-using-orcid-and-ropensci/) and published [here](https://github.com/SimonGoring/ORCiD_Network/blob/master/rorcid_network.R), and combines it with the [visNetwork package](http://datastorm-open.github.io/visNetwork/).

## Set-up

The following R packages are required:

```{r}
# library(devtools)
# install_github('ropensci/rorcid')
library(rorcid)
library(dplyr)
library(igraph)
library(visNetwork)
```

## Target an author

The idea is to fetch the ORCID record for an author and hence get all their papers, and all the papers of people they've published with - that is, a second-order authorship graph. This assumes completeness of ORCID records for the tragetted author and all of his or her co-authors, an assumption which is almost certainly false. Currently this script just ignores that problem.

The examole ORCID ID below belongs to # this Di Cook, one of the [rOpenSci ozunconf17](https://github.com/ropensci/ozunconf17) organisers

```{r}
#  Just replace the ORCiD below with your own
target_orcid <- "0000-0002-3813-7155"
```

## Query ORCID for data

### Fetch author details from ORCID

```{r}
orcid.record <- orcid_id(orcid = target_orcid)
au <- paste(orcid.record[[1]]$`orcid-bio`$`personal-details`$`given-names`$value, orcid.record[[1]]$`orcid-bio`$`personal-details`$`family-name`$value)
```

```{r, echo=FALSE}
get_doi <- function(x){
  #  This pulls the DOIs out of the ORCiD record:
  list.x <- x$'work-external-identifiers.work-external-identifier'

  #  We have to catch a few objects with NULL DOI information:
  do.call(rbind.data.frame,lapply(list.x, function(x){
      if(length(x) == 0 | (!'DOI' %in% x[,1])){
        data.frame(value=NA)
      } else{
        data.frame(value = x[which(x[,1] %in% 'DOI'),2])
      }
    }))
}

get_papers <- function(x){
  all.papers <- x[[1]]$works # this is where the papers are.
  papers <- data.frame(title = all.papers$'work-title.title.value',
                       doi   = get_doi(all.papers))

  paper.doi <- lapply(1:nrow(papers), function(x){
    if(!is.na(papers[x,2]))return(orcid_doi(dois = check_dois(papers[x,2])$good, fuzzy = FALSE))
    return(NA)
  })

  your.papers <- lapply(1:length(paper.doi), function(x){
      if(length(paper.doi[[x]]) == 0 || is.na(paper.doi[[x]])){
        data.frame(doi=NA, orcid=NA, name=NA)
      } else {
        data.frame(doi = papers[x,2],
                   orcid = paper.doi[[x]][[1]]$'orcid-identifier.path',
                   name = paste(paper.doi[[x]][[1]]$'personal-details.given-names.value',
                                paper.doi[[x]][[1]]$'personal-details.family-name.value', sep = ' '),
                   stringsAsFactors = FALSE)
      }})
  do.call(rbind.data.frame, your.papers)

}
```

### Fetch details of papers and co-authors for the target author

```{r}
paper.set <- get_papers(orcid.record)

unique.orcids <- unique(paper.set$orcid)
```

There are r`length(unique.orcids` unique ORCID IDs identified from the target author's papers.

### Fetch details of each paper for each co-author

This is remarkably slow, I think because ORCID deliberately rate-limits API queries to prevent it from being overwhelmed or DDOSed by too many simultaneous queries.

```{r}
all.coauthors <- list()

for(i in 1:length(unique.orcids)){
  cat(i, "of", length(unique.orcids), "orcid=", unique.orcids[i], "\n")
  if (!is.na(unique.orcids[i])) {
    all.coauthors[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
  }
}
```

### Convert to a data frame and to an adjacency matrix

```{r}
all.df <- do.call(rbind.data.frame,all.coauthors)
all.df <- na.omit(all.df[!duplicated(all.df),])

all.pairs <- matrix(ncol = length(unique(all.df$name)),
                    nrow = length(unique(all.df$name)),
                    dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

unique.dois <- unique(as.character(all.df$doi))

# remove duplicates
for(i in 1:length(unique.dois)){
  doi <- unique.dois[i]

  all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <-
    all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

}

all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]
all.pairs2 <- all.pairs
# all.pairs2[rownames(all.pairs2) == au,]

# remove self-adjacency
diag(all.pairs) <- 0
```

## Visualise the graph

### Visualise adjacency

```{r, fig.width=10}
author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
author.adj2 <- graph.adjacency(all.pairs2, mode = 'undirected', weighted = TRUE)
plot(author.adj, vertex.label.cex = 0.6, vertex.size=0.2, edge.width = E(author.adj2)$weight)
```

Hmmm, not exactly ideal...this might be better:

```{r, fig.width=10}
plot(author.adj, vertex.label.cex = 0.9,  vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
```

Or this:

```{r, fig.width=10}
plot(author.adj, vertex.label.cex = 0.5, vertex.shape="none", edge.width = 1)
```

Or this:

```{r, fig.width=10}
plot(author.adj, vertex.shape="none", vertex.label=V(author.adj)$name, vertex.label.font=2, vertex.label.color="gray40", vertex.label.cex=.7, edge.color="gray85")
```

The problem is that for many authors, there is just too much data in a second-order co-authorship graph to display effectively as a static image.

Let's try a different layout.

```{r, fig.width=10}
l <- layout.kamada.kawai(author.adj)
plot(author.adj, vertex.label.cex = 0.9,  vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight, layout=l)
```

Try pruning low weight edges from the graph:

```{r, fig.width=10}
# E(author.adj)$weight
author.adj.sp <- delete.edges(author.adj, E(author.adj)[weight<4])
plot(author.adj, vertex.label.cex = 0.9,  vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
```

OK, let's do some graph analysis! Find the largest cliques:

```{r}
largest_cliques(author.adj)
```

The first-order graph

```{r, fig.width=10}
a <- make_ego_graph(author.adj, order=1, nodes=V(author.adj)[name==au])[[1]]
plot(a, vertex.label.cex = 3,  vertex.shape="none", vertex.size=0.5, edge.width = E(a)$weight*2)
```

But the second-order graph is way too busy

```{r, fig.width=10}
a <- make_ego_graph(author.adj, order=2, nodes=V(author.adj)[name==au])[[1]]
plot(author.adj, vertex.label.cex = 2,  vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
```

## Interactive visualisation

The output works best if opened in a browser window. You may need to zoom in using the controls at the bottom right in order to see the author labels. Hovering over an edge (link) should display the name of each paper. You can highlight edges for an aouthor by clickiing on their node. To display just the node for a particular (co-)author, choose them from the pull-down on the upper left.

```{r}
authors <- all.df %>% distinct(name, orcid) %>% mutate(id=orcid, title=name, label=name,  value=10, font.size=15)
authors$group = "Other"
authors[authors$orcid == target_orcid,]$group <- "Origin"
authors[authors$orcid == target_orcid,]$font.size <- 20
all.papers <- orcid.record[[1]]$works
paper_details <- data.frame(title = paste0("<p>", all.papers$'work-title.title.value', "<br><b>", all.papers$'journal-title.value', "</b>, ", all.papers$'publication-date.year.value', "</p>"), doi=as.character(get_doi(all.papers)$value), stringsAsFactors = FALSE)

papers <- all.df %>% inner_join(all.df, by="doi") %>% filter(orcid.x != orcid.y) %>% mutate(author.left=ifelse(orcid.x < orcid.y, name.x, name.y), orcid.left=ifelse(orcid.x < orcid.y, orcid.x, orcid.y), author.right=ifelse(orcid.x > orcid.y, name.x, name.y), orcid.right=ifelse(orcid.x > orcid.y, orcid.x, orcid.y)) %>% select(doi, author.left, orcid.left, author.right, orcid.right) %>% distinct(doi, orcid.left, orcid.right, .keep_all=TRUE) %>% left_join(paper_details, by="doi") %>% mutate(from=orcid.left, to=orcid.right) %>% distinct(doi, orcid.left, orcid.right, .keep_all=TRUE)

title <- paste0("Co-authorship graph for ", au, " (ORCID ", target_orcid, ")")
visNetwork(authors, papers, height = "600px", width = "100%", main = title) %>% visPhysics(solver = "forceAtlas2Based", forceAtlas2Based = list(gravitationalConstant = 1500)) %>% visGroups(groupname = "Origin" , color = "pink", shape = "box", shadow = list(enabled = TRUE)) %>% visGroups(groupname = "Other", color = "lightblue", shape = "box")  %>% visPhysics(stabilization = FALSE) %>% visOptions(highlightNearest = list(enabled = TRUE, degree=1, hover = TRUE)) %>% visInteraction(navigationButtons = TRUE) %>% visOptions(selectedBy = "name")
```
	---
	title: "Visualising co-authorship networks using ORCID"
	author: "Tim Churches"
	date: "Mon 23rd October, 2017"
	output: html_notebook
	---

	## Pre-amble

	This R notebook re-uses code by Simon Goring described in [this blog entry](https://downwithtime.wordpress.com/2015/02/12/building-your-network-using-orcid-and-ropensci/) and published [here](https://github.com/SimonGoring/ORCiD_Network/blob/master/rorcid_network.R), and combines it with the [visNetwork package](http://datastorm-open.github.io/visNetwork/).

	## Set-up

	The following R packages are required:

	```{r}
	# library(devtools)
	# install_github('ropensci/rorcid')
	library(rorcid)
	library(dplyr)
	library(igraph)
	library(visNetwork)
	```

	## Target an author

	The idea is to fetch the ORCID record for an author and hence get all their papers, and all the papers of people they've published with - that is, a second-order authorship graph. This assumes completeness of ORCID records for the tragetted author and all of his or her co-authors, an assumption which is almost certainly false. Currently this script just ignores that problem.

	The examole ORCID ID below belongs to # this Di Cook, one of the [rOpenSci ozunconf17](https://github.com/ropensci/ozunconf17) organisers

	```{r}
	# Just replace the ORCiD below with your own
	target_orcid <- "0000-0002-3813-7155"
	```

	## Query ORCID for data

	### Fetch author details from ORCID

	```{r}
	orcid.record <- orcid_id(orcid = target_orcid)
	au <- paste(orcid.record[[1]]$`orcid-bio`$`personal-details`$`given-names`$value, orcid.record[[1]]$`orcid-bio`$`personal-details`$`family-name`$value)
	```

	```{r, echo=FALSE}
	get_doi <- function(x){
	# This pulls the DOIs out of the ORCiD record:
	list.x <- x$'work-external-identifiers.work-external-identifier'

	# We have to catch a few objects with NULL DOI information:
	do.call(rbind.data.frame,lapply(list.x, function(x){
	if(length(x) == 0 \| (!'DOI' %in% x[,1])){
	data.frame(value=NA)
	} else{
	data.frame(value = x[which(x[,1] %in% 'DOI'),2])
	}
	}))
	}

	get_papers <- function(x){
	all.papers <- x[[1]]$works # this is where the papers are.
	papers <- data.frame(title = all.papers$'work-title.title.value',
	doi = get_doi(all.papers))

	paper.doi <- lapply(1:nrow(papers), function(x){
	if(!is.na(papers[x,2]))return(orcid_doi(dois = check_dois(papers[x,2])$good, fuzzy = FALSE))
	return(NA)
	})

	your.papers <- lapply(1:length(paper.doi), function(x){
	if(length(paper.doi[[x]]) == 0 \|\| is.na(paper.doi[[x]])){
	data.frame(doi=NA, orcid=NA, name=NA)
	} else {
	data.frame(doi = papers[x,2],
	orcid = paper.doi[[x]][[1]]$'orcid-identifier.path',
	name = paste(paper.doi[[x]][[1]]$'personal-details.given-names.value',
	paper.doi[[x]][[1]]$'personal-details.family-name.value', sep = ' '),
	stringsAsFactors = FALSE)
	}})
	do.call(rbind.data.frame, your.papers)

	}
	```

	### Fetch details of papers and co-authors for the target author

	```{r}
	paper.set <- get_papers(orcid.record)

	unique.orcids <- unique(paper.set$orcid)
	```

	There are r`length(unique.orcids` unique ORCID IDs identified from the target author's papers.

	### Fetch details of each paper for each co-author

	This is remarkably slow, I think because ORCID deliberately rate-limits API queries to prevent it from being overwhelmed or DDOSed by too many simultaneous queries.

	```{r}
	all.coauthors <- list()

	for(i in 1:length(unique.orcids)){
	cat(i, "of", length(unique.orcids), "orcid=", unique.orcids[i], "\n")
	if (!is.na(unique.orcids[i])) {
	all.coauthors[[i]] <- get_papers(orcid_id(orcid = unique.orcids[i], profile="works"))
	}
	}
	```

	### Convert to a data frame and to an adjacency matrix

	```{r}
	all.df <- do.call(rbind.data.frame,all.coauthors)
	all.df <- na.omit(all.df[!duplicated(all.df),])

	all.pairs <- matrix(ncol = length(unique(all.df$name)),
	nrow = length(unique(all.df$name)),
	dimnames = list(unique(all.df$name),unique(all.df$name)), 0)

	unique.dois <- unique(as.character(all.df$doi))

	# remove duplicates
	for(i in 1:length(unique.dois)){
	doi <- unique.dois[i]

	all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] <-
	all.pairs[all.df$name[all.df$doi %in% doi],all.df$name[all.df$doi %in% doi]] + 1

	}

	all.pairs <- all.pairs[rowSums(all.pairs)>0, colSums(all.pairs)>0]
	all.pairs2 <- all.pairs
	# all.pairs2[rownames(all.pairs2) == au,]

	# remove self-adjacency
	diag(all.pairs) <- 0
	```

	## Visualise the graph

	### Visualise adjacency

	```{r, fig.width=10}
	author.adj <- graph.adjacency(all.pairs, mode = 'undirected', weighted = TRUE)
	author.adj2 <- graph.adjacency(all.pairs2, mode = 'undirected', weighted = TRUE)
	plot(author.adj, vertex.label.cex = 0.6, vertex.size=0.2, edge.width = E(author.adj2)$weight)
	```

	Hmmm, not exactly ideal...this might be better:

	```{r, fig.width=10}
	plot(author.adj, vertex.label.cex = 0.9, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
	```

	Or this:

	```{r, fig.width=10}
	plot(author.adj, vertex.label.cex = 0.5, vertex.shape="none", edge.width = 1)
	```

	Or this:

	```{r, fig.width=10}
	plot(author.adj, vertex.shape="none", vertex.label=V(author.adj)$name, vertex.label.font=2, vertex.label.color="gray40", vertex.label.cex=.7, edge.color="gray85")
	```

	The problem is that for many authors, there is just too much data in a second-order co-authorship graph to display effectively as a static image.

	Let's try a different layout.

	```{r, fig.width=10}
	l <- layout.kamada.kawai(author.adj)
	plot(author.adj, vertex.label.cex = 0.9, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight, layout=l)
	```

	Try pruning low weight edges from the graph:

	```{r, fig.width=10}
	# E(author.adj)$weight
	author.adj.sp <- delete.edges(author.adj, E(author.adj)[weight<4])
	plot(author.adj, vertex.label.cex = 0.9, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
	```

	OK, let's do some graph analysis! Find the largest cliques:

	```{r}
	largest_cliques(author.adj)
	```

	The first-order graph

	```{r, fig.width=10}
	a <- make_ego_graph(author.adj, order=1, nodes=V(author.adj)[name==au])[[1]]
	plot(a, vertex.label.cex = 3, vertex.shape="none", vertex.size=0.5, edge.width = E(a)$weight*2)
	```

	But the second-order graph is way too busy

	```{r, fig.width=10}
	a <- make_ego_graph(author.adj, order=2, nodes=V(author.adj)[name==au])[[1]]
	plot(author.adj, vertex.label.cex = 2, vertex.shape="none", vertex.size=0.5, edge.width = E(author.adj)$weight)
	```

	## Interactive visualisation

	The output works best if opened in a browser window. You may need to zoom in using the controls at the bottom right in order to see the author labels. Hovering over an edge (link) should display the name of each paper. You can highlight edges for an aouthor by clickiing on their node. To display just the node for a particular (co-)author, choose them from the pull-down on the upper left.

	```{r}
	authors <- all.df %>% distinct(name, orcid) %>% mutate(id=orcid, title=name, label=name, value=10, font.size=15)
	authors$group = "Other"
	authors[authors$orcid == target_orcid,]$group <- "Origin"
	authors[authors$orcid == target_orcid,]$font.size <- 20
	all.papers <- orcid.record[[1]]$works
	paper_details <- data.frame(title = paste0("<p>", all.papers$'work-title.title.value', "<br><b>", all.papers$'journal-title.value', "</b>, ", all.papers$'publication-date.year.value', "</p>"), doi=as.character(get_doi(all.papers)$value), stringsAsFactors = FALSE)

	papers <- all.df %>% inner_join(all.df, by="doi") %>% filter(orcid.x != orcid.y) %>% mutate(author.left=ifelse(orcid.x < orcid.y, name.x, name.y), orcid.left=ifelse(orcid.x < orcid.y, orcid.x, orcid.y), author.right=ifelse(orcid.x > orcid.y, name.x, name.y), orcid.right=ifelse(orcid.x > orcid.y, orcid.x, orcid.y)) %>% select(doi, author.left, orcid.left, author.right, orcid.right) %>% distinct(doi, orcid.left, orcid.right, .keep_all=TRUE) %>% left_join(paper_details, by="doi") %>% mutate(from=orcid.left, to=orcid.right) %>% distinct(doi, orcid.left, orcid.right, .keep_all=TRUE)

	title <- paste0("Co-authorship graph for ", au, " (ORCID ", target_orcid, ")")
	visNetwork(authors, papers, height = "600px", width = "100%", main = title) %>% visPhysics(solver = "forceAtlas2Based", forceAtlas2Based = list(gravitationalConstant = 1500)) %>% visGroups(groupname = "Origin" , color = "pink", shape = "box", shadow = list(enabled = TRUE)) %>% visGroups(groupname = "Other", color = "lightblue", shape = "box") %>% visPhysics(stabilization = FALSE) %>% visOptions(highlightNearest = list(enabled = TRUE, degree=1, hover = TRUE)) %>% visInteraction(navigationButtons = TRUE) %>% visOptions(selectedBy = "name")
	```