---
title: "Analysing the Twitter Mentions Network"
author: "Douglas Ashton"
date: "Thursday, February 05, 2015"
output: html_document
---
```{r functions, echo=FALSE}
library(knitr)

# Convert the @ mentions column to a character vector of screen names
# (strip the enclosing square brackets and split on whitespace)
splitBrak <- function(x, split=" ") {
  unlist(strsplit(sub("\\]$", "", sub("^\\[", "", x)), split=split))
}

# Return the top n entries of a named numeric vector as a ranked data frame
topN <- function(x, n=10, decreasing=TRUE) {
  df <- data.frame(Rank=1:n)
  y <- x[order(x, decreasing=decreasing)[1:n]]
  df$User <- names(y)
  df$Value <- y
  df
}
```
```{r setup, echo=FALSE, cache=TRUE}
library(igraph)
# In this case the tweets come from a preprocessed csv.
allTweets <- read.csv("all-tweets.csv", stringsAsFactors=FALSE)
nTweets <- nrow(allTweets)
# Split up all of the mentions into screen_names tweeted at
allTo <- lapply(allTweets$user_mentions, splitBrak)
# Get unique "To" users (they were mentioned)
allToVec <- unlist(allTo)
uniqueTo <- unique(allToVec)
# Unique from screen_names (the people who actually tweeted)
uniqueFrom <- unique(allTweets$screen_name)
# All unique screen_names (all from and to)
uniqueUsers <- unique(c(uniqueTo,allTweets$screen_name))
nTo <- length(uniqueTo)
nFrom <- length(uniqueFrom)
nTot <- length(uniqueUsers)
# How many users both send tweets and are mentioned by others?
nBoth <- sum(uniqueTo %in% uniqueFrom)
```
```{r makeNet, echo=FALSE, cache=TRUE}
# Make the network
lookUp <- 1:nTot
names(lookUp) <- uniqueUsers
# I want a list with items "from", a single ID, and "to" a vector of IDs.
allFromTo <- list()
for (i in 1:nTweets) {
  allFromTo[[i]] <- list(from=allTweets$screen_name[i], to=allTo[[i]])
}
# Now make an edge list
allEdges <- lapply(allFromTo, function(x) {
  from <- x$from
  t(sapply(x$to, function(to) c(from=from, to=to)))
})
allEdges2 <- allEdges[which(sapply(allEdges,ncol)==2)]
el <- do.call("rbind",allEdges2)
# Number of non-linking tweets
nNoMention <- length(allEdges) - length(allEdges2)
# Remove self loops
nSelfMentions <- sum(el[,1]==el[,2])
el <- el[el[,1]!=el[,2], ]
nEdges <- nrow(el)
# Redefine topN so that the user names in the tables link to their Twitter profiles
topN <- function(x, n=10, decreasing=TRUE) {
  df <- data.frame(Rank=1:n)
  y <- x[order(x, decreasing=decreasing)[1:n]]
  df$User <- paste0('<a href="https://twitter.com/', names(y), '">', names(y), '</a>')
  df$Value <- y
  df
}
g <- graph.edgelist(el, directed=TRUE)
cent <- list(OutDegree   = degree(g, mode="out"),
             InDegree    = degree(g, mode="in"),
             Closeness   = closeness(g),
             Betweenness = betweenness(g),
             Eigenvector = evcent(g)$vector,
             PageRank    = page.rank(g)$vector)
```
One of the big successes of data analytics is the cultural change in how business decisions are made. There is now widespread acceptance of the role that data science has to play in decision making. With the explosion in the quantity of data available, the task for the modern analyst is to filter it down to the information that is most relevant.
Twitter represents a classic case. The volume and velocity of Twitter data are staggering, and, as discussed in the first part of this series, it is within reach to obtain large, clean datasets. This puts the pressure on the analyst to ask the right questions of the data. In the remaining parts of this series we'll be looking at a variety of ways to view the data from Twitter. The view you wish to take will ultimately depend on the question being asked. In this post we're tackling the question: "Who are the big players in the data science Twitter community?".
The dataset that we will be using is a sample of all tweets tagged with the hashtags #datascience, #rstats and #python (snake-related tweets cleaned out) between the 7th and 17th of December 2014. Each tweet contains a number of pieces of useful information. The view that we're going to take in this post is the mentions network.
### Mentions Network
There are a number of ways users can interact on Twitter. Users can "follow" other users to receive regular updates of their tweets, and users may "mention" other users in their own tweets.
<img src="dougTweet.jpg" />
In the tweet above I mentioned <a href="https://twitter.com/MangoTheCat">@MangoTheCat</a> in the text of the tweet, and so we draw a directed link from me, <a href="https://twitter.com/dougashton">@dougashton</a>, to <a href="https://twitter.com/MangoTheCat">@MangoTheCat</a>. A retweet is another way for one user to mention another. We then went through each of the `r nTweets` tweets in our data set and formed links for every mention. The resulting network contained `r nTot` nodes (users) and `r nEdges` edges (mentions). In network language, an "edge" is the same as a link.
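Under the hood this is just a two-column edge list of (from, to) pairs. A minimal sketch of how it can be built, assuming `allTweets$screen_name` holds each tweet's author and `allTo` is a list giving the screen names mentioned in each tweet:
```{r eval=FALSE}
# One (from, to) row per mention; tweets with no mentions contribute nothing
el <- do.call(rbind, lapply(seq_len(nrow(allTweets)), function(i) {
  to <- allTo[[i]]
  if (length(to) == 0) return(NULL)
  cbind(from=allTweets$screen_name[i], to=to)
}))
el <- el[el[, "from"] != el[, "to"], ]  # drop self-mentions
```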
A useful tool for dealing with networks in R is the feature rich <a href="http://igraph.org/">igraph</a> package (also available for Python and C). Once you have created your network as an igraph object many of the standard network analysis tools become easily available.
```{r eval=FALSE}
g <- graph.edgelist(el, directed=TRUE)
```
While igraph has nice built-in plotting tools, for large graphs I also like the cross-platform, open-source <a href="http://gephi.github.io/">Gephi</a>. Gephi is an interactive network visualisation and analysis tool with many available plugins. To get our graph into Gephi we can export it from igraph to an open format, such as GraphML; the full mentions network, laid out with Gephi's Force Atlas 2 algorithm, is shown below.
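The export itself is a one-liner; a minimal sketch (the file name is just an example):
```{r eval=FALSE}
# Write the mentions graph out as GraphML so it can be opened in Gephi
write.graph(g, file="mentionsNetwork.graphml", format="graphml")
```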
<img src="fullNet.png" alt="Full data science mentions network"/>
Even with the limited sample that we used, this is a big network. This visualisation is useful as a high-level view of the network. For instance, we have a connected core of tweeters and a disconnected periphery, which is to be expected with this type of sampling technique. We can also see a common motif of a cluster of nodes gathered around a single central node; these look like parachutes in the visualisation. To gain further insight we must dig a little deeper into the data.
### Broadcasters and Receivers
As noted above, this network appears to contain some nodes that are surrounded by a large cluster of other nodes. We can quantify this by looking at the degree centrality. The in-degree of a node is the total number of tweets that mention that user. Similarly, the out-degree is the number of mentions made by that user. Both come straight from igraph's `degree()` function, and the top ten nodes for each are listed below.
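A minimal sketch of the calculation, using the mentions graph `g` built above:
```{r eval=FALSE}
outDegree <- degree(g, mode="out")      # mentions made by each user
inDegree  <- degree(g, mode="in")       # mentions of each user by others
sort(inDegree, decreasing=TRUE)[1:10]   # the ten most-mentioned users
```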
<br>
<div style="clear: both">
<div style="float: left; width: 49%;">
```{r echo=FALSE}
kable(topN(cent$InDegree), format="html", caption="Top 10 In Degree")
```
</div>
<div style="float: left; width: 49%;">
```{r echo=FALSE}
kable(topN(cent$OutDegree), format="html", caption="Top 10 Out Degree")
```
</div>
</div>
<div style="clear: both"><br /></div>
The two lists are completely different. We have a group of users with a large out-degree who retweet at a high rate. These are our broadcasters. In general they tend to pass on content. The users with a large in-degree have their tweets retweeted many times by different users. These are our receivers.
It might be that you can stop there. The nodes with a high in-degree are likely important nodes in this network. However, how do we know how far their influence really goes? It is possible to get a high in-degree score by being retweeted many times by a small group of broadcasters, which is no guarantee of influence. For a broader view we must go beyond this nearest-neighbour approach and look to more sophisticated network measures.
### Centrality Measures
There are many standard measures of network structure available, and the choice of which one to use really comes down to exactly what you are interested in. Here I'll go through two: Page Rank and Betweenness Centrality. In igraph it's as easy as running
```{r eval=FALSE}
betweenness(g)
page.rank(g)
```
Page Rank is the basis of Google's search engine. Roughly speaking, it tells you which nodes you are likely to land on if you spent some time surfing Twitter feeds, randomly following links. If mentions tend to flow towards you, you will get a high score. A high score also indicates that you are connected to other influential nodes.
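For intuition only, Page Rank can be approximated with a short power iteration over the adjacency matrix. The sketch below ignores the dangling-node corrections that igraph's `page.rank()` handles properly, and a dense adjacency matrix is only sensible for a small graph:
```{r eval=FALSE}
A <- get.adjacency(g, sparse=FALSE)  # dense adjacency matrix (small graphs only)
M <- A / pmax(rowSums(A), 1)         # row-stochastic transition matrix
d <- 0.85                            # damping: probability of following a link
pr <- rep(1/nrow(A), nrow(A))        # start the random surfer uniformly
for (i in 1:50) {
  pr <- (1 - d)/nrow(A) + d * as.vector(t(M) %*% pr)
}
names(pr) <- rownames(A)
sort(pr, decreasing=TRUE)[1:10]      # roughly matches the ranking from page.rank(g)
```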
Betweenness Centrality is a useful measure for networks with a strong community structure. If you work out all of the shortest paths between all pairs of nodes, betweenness tells you how many of those paths pass through each node. If you are the bridge between two communities then you will get a high score.
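The "bridge" idea is easy to see on a toy graph (purely illustrative, nothing to do with our Twitter data): two triangles joined through a single node, which picks up the highest betweenness score.
```{r eval=FALSE}
toy <- graph.formula(A - B - C - A, D - E - F - D, C - G, G - D)
betweenness(toy)  # G, the bridge between the two triangles, scores highest
```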
<div style="clear: both">
<div style="float: left; width: 49%;">
```{r echo=FALSE}
kable(topN(cent$PageRank), format="html", caption="Top 10 Page Rank Centrality")
```
</div>
<div style="float: left; width: 49%;">
```{r echo=FALSE}
kable(topN(cent$Betweenness), format="html", caption="Top 10 Betweenness Centrality")
```
</div>
</div>
<div style="clear: both"><br /></div>
This time we see many of the same users, but some familiar names also begin to appear in these lists. For instance, we know that <a href="https://twitter.com/hadleywickham">@hadleywickham</a> is an influential figure in the R community, and, while Hadley only tweeted 23 times in this period, he features high up in the Page Rank centrality list.
<a href="https://twitter.com/MangoTheCat">@MangoTheCat</a> only tweeted four times in this period yet the cat's betweenness score is relatively high. This implies that the connections formed in those tweets were bridging connections between different types of node. We can see this a little better if we look at a much smaller version of our network. The strongly connected component is the part of the network where you can travel from any node to any node along the links. We get this by finding the clusters and keeping the largest.
```{r eval=FALSE}
cl <- clusters(g, mode="strong")
vStrong <- which(cl$membership == which.max(cl$csize))
gStrong <- induced.subgraph(g, vids=vStrong)
```
<img src="gStrong2.jpg" alt="Strongly connected component" width="80%"/>
While a little way out from the core, we see that <a href="https://twitter.com/MangoTheCat">@MangoTheCat</a> does indeed sit between the core and a group on the edge of the cluster.
### Summing up
We've seen that the tools available in R make acquiring and analysing the Twitter network an easily accessible task. In this post we've dipped our toe into the vast array of methods available in network analysis, and we've found that digging a little deeper than simply counting connections can lead to deeper insight into how the network really functions. With so much data available it is important to ask the right questions. Whether your goal is to improve your social media presence or to identify influential users, a little of the right network analysis can go a long way.