wesslen/twitter-trolls.Rmd

## twitter-trolls.Rmd
---
title: "Analyzing Russian Trolls: Tidyverse & Text"
author: "Ryan Wesslen"
date: "2/21/2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```

## Russian Trolls

Given all of the news about Russian Trolls, I found a publicly avaiable dataset of known Russian Twitter trolls posted by NBC <https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731>.

Let's load the Troll tweets using tidyverse (readr).

```{r data}
library(tidyverse)

url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv"
tweets <- read_csv(url)

user.url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/users.csv"
users <- read_csv(user.url)
```

### Plotting tweets over time

A good start is to plot the number of tweets by date.

```{r}
tweets %>%
  count(Date = as.Date(created_str)) %>%
  ggplot(aes(x = Date, y = n)) +
  geom_line() +
  labs(title = "Tweets by Known Russian Twitter Trolls",
       x = "Day",
       y = "Number of Tweets")
```

What's clear is that their activity started back in 2014 but really started to pick up around mid 2016, averaging hundreds to thousands of tweets per day.

Let's see the trends by counting the number of tweets that contain politically related terms.

```{r}
tweets$trump <- grepl("trump", tweets$text, ignore.case = TRUE)
tweets$obama <- grepl("obama", tweets$text, ignore.case = TRUE)
tweets$clinton <- grepl("clinton", tweets$text, ignore.case = TRUE)

tweets %>%
  mutate(term =
    case_when(
      trump + clinton + obama > 1 ~ "Multiple",
      trump == TRUE ~ "Trump",
      clinton == TRUE ~ "Clinton",
      obama == TRUE ~ "Obama"
    )) %>%
  filter(!is.na(term)) %>%
  count(Date = as.Date(created_str), term) %>%
  ggplot(aes(x = Date, y = n, color = term)) +
  geom_line() +
  facet_wrap(~term, ncol = 2) +
  labs(title = "Russian Troll Tweets by Mentioned Terms",
       y = "Tweet Count") +
  theme(legend.position = "none")
```


### Text analysis

To analyze the text, I'm using [`quanteda`](http://docs.quanteda.io/), which is my go-to text analysis package (anywhere). However, another great package that's easier for beginners is [`tidytext`](https://www.tidytextmining.com/). I simply choose `quanteda` because of its speed and versalility, but `tidytext` has a lot to offer as well. I will however use `tidytext` near the end to analyze my topic model results.

Let's load `quanteda` and create a corpus as well as keeping the users' screen_name (handle) on as a corpus feature.

```{r text}
library(quanteda)

corpus <- corpus(tweets$text)
docvars(corpus, "screen_name") <- tweets$user_key
```

Next, we'll do basic pre-processing to create the dfm (document-feature matrix), that is the core datastructure to keep our document-term matrix and corpus feature in a very efficient sparse matrix.

I trimmed terms that are not in the dataset at least 10 times or in at least 5 tweets (somewhat arbitrary but important to remove very rare/sparse terms).

```{r}
stopWords <- c("t.co","http","https","amp")

dfm <- dfm(corpus,
           remove = c(stopwords("english"), stopWords),
           ngrams= 1L,
           stem = F,
           remove_numbers = TRUE,
           remove_punct = TRUE,
           remove_symbols = TRUE) %>%
  dfm_trim(min_count = 10, min_docfreq = 5)
```

Let's look at the top words used.

```{r}
topfeatures(dfm, n = 50)
```

First, most tweets are retweets, hence "rt". Second, it's clear that the top terms used are related to "trump", "hillary", "clinton", and "obama." This provides direct evidence that these trolls were discussing politically related topics.

### Hashtags

We can then use helpful functions to build a hashtag co-occurrence network.

For this, we'll run with the top 40 hashtags and create a network in which the nodes are hashatags and the edges are when the hashtags co-occurred in at least 10% of tweets.

```{r}
tag_dfm <- dfm_select(dfm, ('#*'))

# number of hashtags to show
top <- 100

topfeatures(tag_dfm, top) %>%
  head(n = top)

# create co-occurrence matrix and network
fcm(tag_dfm) %>%
  fcm_select(names(topfeatures(tag_dfm, top))) %>%
  textplot_network(min_freq = 0.1, edge_alpha = 0.8, edge_size = 4)
```

### Handles

Similarly, we can do the same thing but for handles (screen_name) mentions.

```{r}
sn_dfm <- dfm_select(dfm, ('@*'))

# number of hashtags to show
top <- 40

topfeatures(sn_dfm, top) %>%
  head(n = top)

# create co-occurrence matrix and network
fcm(sn_dfm) %>%
  fcm_select(names(topfeatures(sn_dfm, top))) %>%
  textplot_network(min_freq = 0.1, edge_alpha = 0.7, edge_size = 6)
```

### Topic Modeling

For this part, I'll run Correlated Topic Model using the `stm` package.

First, I'll need to convert my dfm to the stm data-structure.

```{r}
library(stm)

# use quanteda converter to convert our Dfm
stmdfm <- convert(dfm, to = "stm")

out <- prepDocuments(stmdfm$documents,
                     stmdfm$vocab,
                     stmdfm$meta,
                     lower.thresh = 10)
```

Next, we'll run topic modeling, choosing 10 topics to simplify our models (there are a ton of great diagnostics for selecting the number of topics in `stm`. Check them out in its help documents!).

```{r fig.height=8}
# k = 10 topic for simplicity
k <- 10

ctmFit <- stm(out$documents, out$vocab, K = k,
              max.em.its = 150, data = out$meta, init.type = "Spectral", seed = 300)

plot(ctmFit,
         type = "summary",
         xlim = c(0,.16),
         n = 5,
         labeltype = "prob",
         main = "Russian Troll Topics",
         text.cex = 0.8)
```

```{r}
# make sure its version 0.1.6 or higher of tidytext
library(tidytext)

# tidy the word-topic combinations
td_beta <- tidy(ctmFit)
td_beta

# helper functions (from David Robinson)
scale_x_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  stats::reorder(new_x, by, FUN = fun)
}

# Examine the topics
td_beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
           term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() +
  coord_flip()

# tidy the document-topic combinations, with optional document names
td_gamma <- tidy(ctmFit, matrix = "gamma",
                 document_names = rownames(stmdfm))
td_gamma
```

### Next Steps

There's a ton of more analysis that could be done on this dataset. Some ideas include:

* adding in time as a prevalance covariate (i.e., stm) to measure the effect of time on what topics were discussed.

* Explore the links the users posted. This is key to understand what information they're disseminating to different sites. For example, see [Starbird](http://faculty.washington.edu/kstarbi/Alt_Narratives_ICWSM17-CameraReady.pdf) (2017).
	---
	title: "Analyzing Russian Trolls: Tidyverse & Text"
	author: "Ryan Wesslen"
	date: "2/21/2018"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
	```

	## Russian Trolls

	Given all of the news about Russian Trolls, I found a publicly avaiable dataset of known Russian Twitter trolls posted by NBC <https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731>.

	Let's load the Troll tweets using tidyverse (readr).

	```{r data}
	library(tidyverse)

	url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv"
	tweets <- read_csv(url)

	user.url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/users.csv"
	users <- read_csv(user.url)
	```

	### Plotting tweets over time

	A good start is to plot the number of tweets by date.

	```{r}
	tweets %>%
	count(Date = as.Date(created_str)) %>%
	ggplot(aes(x = Date, y = n)) +
	geom_line() +
	labs(title = "Tweets by Known Russian Twitter Trolls",
	x = "Day",
	y = "Number of Tweets")
	```

	What's clear is that their activity started back in 2014 but really started to pick up around mid 2016, averaging hundreds to thousands of tweets per day.

	Let's see the trends by counting the number of tweets that contain politically related terms.

	```{r}
	tweets$trump <- grepl("trump", tweets$text, ignore.case = TRUE)
	tweets$obama <- grepl("obama", tweets$text, ignore.case = TRUE)
	tweets$clinton <- grepl("clinton", tweets$text, ignore.case = TRUE)

	tweets %>%
	mutate(term =
	case_when(
	trump + clinton + obama > 1 ~ "Multiple",
	trump == TRUE ~ "Trump",
	clinton == TRUE ~ "Clinton",
	obama == TRUE ~ "Obama"
	)) %>%
	filter(!is.na(term)) %>%
	count(Date = as.Date(created_str), term) %>%
	ggplot(aes(x = Date, y = n, color = term)) +
	geom_line() +
	facet_wrap(~term, ncol = 2) +
	labs(title = "Russian Troll Tweets by Mentioned Terms",
	y = "Tweet Count") +
	theme(legend.position = "none")
	```


	### Text analysis

	To analyze the text, I'm using [`quanteda`](http://docs.quanteda.io/), which is my go-to text analysis package (anywhere). However, another great package that's easier for beginners is [`tidytext`](https://www.tidytextmining.com/). I simply choose `quanteda` because of its speed and versalility, but `tidytext` has a lot to offer as well. I will however use `tidytext` near the end to analyze my topic model results.

	Let's load `quanteda` and create a corpus as well as keeping the users' screen_name (handle) on as a corpus feature.

	```{r text}
	library(quanteda)

	corpus <- corpus(tweets$text)
	docvars(corpus, "screen_name") <- tweets$user_key
	```

	Next, we'll do basic pre-processing to create the dfm (document-feature matrix), that is the core datastructure to keep our document-term matrix and corpus feature in a very efficient sparse matrix.

	I trimmed terms that are not in the dataset at least 10 times or in at least 5 tweets (somewhat arbitrary but important to remove very rare/sparse terms).

	```{r}
	stopWords <- c("t.co","http","https","amp")

	dfm <- dfm(corpus,
	remove = c(stopwords("english"), stopWords),
	ngrams= 1L,
	stem = F,
	remove_numbers = TRUE,
	remove_punct = TRUE,
	remove_symbols = TRUE) %>%
	dfm_trim(min_count = 10, min_docfreq = 5)
	```

	Let's look at the top words used.

	```{r}
	topfeatures(dfm, n = 50)
	```

	First, most tweets are retweets, hence "rt". Second, it's clear that the top terms used are related to "trump", "hillary", "clinton", and "obama." This provides direct evidence that these trolls were discussing politically related topics.

	### Hashtags

	We can then use helpful functions to build a hashtag co-occurrence network.

	For this, we'll run with the top 40 hashtags and create a network in which the nodes are hashatags and the edges are when the hashtags co-occurred in at least 10% of tweets.

	```{r}
	tag_dfm <- dfm_select(dfm, ('#*'))

	# number of hashtags to show
	top <- 100

	topfeatures(tag_dfm, top) %>%
	head(n = top)

	# create co-occurrence matrix and network
	fcm(tag_dfm) %>%
	fcm_select(names(topfeatures(tag_dfm, top))) %>%
	textplot_network(min_freq = 0.1, edge_alpha = 0.8, edge_size = 4)
	```

	### Handles

	Similarly, we can do the same thing but for handles (screen_name) mentions.

	```{r}
	sn_dfm <- dfm_select(dfm, ('@*'))

	# number of hashtags to show
	top <- 40

	topfeatures(sn_dfm, top) %>%
	head(n = top)

	# create co-occurrence matrix and network
	fcm(sn_dfm) %>%
	fcm_select(names(topfeatures(sn_dfm, top))) %>%
	textplot_network(min_freq = 0.1, edge_alpha = 0.7, edge_size = 6)
	```

	### Topic Modeling

	For this part, I'll run Correlated Topic Model using the `stm` package.

	First, I'll need to convert my dfm to the stm data-structure.

	```{r}
	library(stm)

	# use quanteda converter to convert our Dfm
	stmdfm <- convert(dfm, to = "stm")

	out <- prepDocuments(stmdfm$documents,
	stmdfm$vocab,
	stmdfm$meta,
	lower.thresh = 10)
	```

	Next, we'll run topic modeling, choosing 10 topics to simplify our models (there are a ton of great diagnostics for selecting the number of topics in `stm`. Check them out in its help documents!).

	```{r fig.height=8}
	# k = 10 topic for simplicity
	k <- 10

	ctmFit <- stm(out$documents, out$vocab, K = k,
	max.em.its = 150, data = out$meta, init.type = "Spectral", seed = 300)

	plot(ctmFit,
	type = "summary",
	xlim = c(0,.16),
	n = 5,
	labeltype = "prob",
	main = "Russian Troll Topics",
	text.cex = 0.8)
	```

	```{r}
	# make sure its version 0.1.6 or higher of tidytext
	library(tidytext)

	# tidy the word-topic combinations
	td_beta <- tidy(ctmFit)
	td_beta

	# helper functions (from David Robinson)
	scale_x_reordered <- function(..., sep = "___") {
	reg <- paste0(sep, ".+$")
	ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
	}

	reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
	new_x <- paste(x, within, sep = sep)
	stats::reorder(new_x, by, FUN = fun)
	}

	# Examine the topics
	td_beta %>%
	group_by(topic) %>%
	top_n(10, beta) %>%
	ungroup() %>%
	mutate(topic = paste0("Topic ", topic),
	term = reorder_within(term, beta, topic)) %>%
	ggplot(aes(term, beta, fill = topic)) +
	geom_col(alpha = 0.8, show.legend = FALSE) +
	facet_wrap(~ topic, scales = "free") +
	scale_x_reordered() +
	coord_flip()

	# tidy the document-topic combinations, with optional document names
	td_gamma <- tidy(ctmFit, matrix = "gamma",
	document_names = rownames(stmdfm))
	td_gamma
	```

	### Next Steps

	There's a ton of more analysis that could be done on this dataset. Some ideas include:

	* adding in time as a prevalance covariate (i.e., stm) to measure the effect of time on what topics were discussed.

	* Explore the links the users posted. This is key to understand what information they're disseminating to different sites. For example, see [Starbird](http://faculty.washington.edu/kstarbi/Alt_Narratives_ICWSM17-CameraReady.pdf) (2017).