---
title: "Analyzing Russian Trolls: Tidyverse & Text"
author: "Ryan Wesslen"
date: "2/21/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```
## Russian Trolls
Given all of the news about Russian trolls, I found a publicly available dataset of tweets from known Russian Twitter trolls posted by NBC News: <https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731>.
Let's load the troll tweets with the tidyverse (`readr`).
```{r data}
library(tidyverse)
url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv"
tweets <- read_csv(url)
user.url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/users.csv"
users <- read_csv(user.url)
```
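Before going further, here's a quick look at the columns we'll lean on later: `created_str` for the timestamp, `user_key` for the handle, and `text` for the tweet body (a minimal inspection step; `glimpse()` ships with the tidyverse).
```{r}
# peek at the structure of both files; created_str, user_key, and text
# are the columns used throughout the rest of the analysis
glimpse(tweets)
glimpse(users)
```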
### Plotting tweets over time
A good start is to plot the number of tweets by date.
```{r}
tweets %>%
  count(Date = as.Date(created_str)) %>%
  ggplot(aes(x = Date, y = n)) +
  geom_line() +
  labs(title = "Tweets by Known Russian Twitter Trolls",
       x = "Day",
       y = "Number of Tweets")
```
Their activity started back in 2014 but picked up sharply around mid-2016, averaging hundreds to thousands of tweets per day.
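To put rough numbers on that claim, we can summarize the daily counts by year (a minimal sketch; it reuses the same `created_str` field as the plot above, and the column names in `summarise()` are just illustrative).
```{r}
# summarise daily tweet volume by year to quantify the ramp-up
tweets %>%
  count(Date = as.Date(created_str)) %>%
  rename(daily_tweets = n) %>%
  mutate(Year = format(Date, "%Y")) %>%
  group_by(Year) %>%
  summarise(active_days = n(),
            median_daily = median(daily_tweets),
            max_daily = max(daily_tweets))
```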
Let's see the trends by counting the number of tweets that contain politically related terms.
```{r}
tweets$trump <- grepl("trump", tweets$text, ignore.case = TRUE)
tweets$obama <- grepl("obama", tweets$text, ignore.case = TRUE)
tweets$clinton <- grepl("clinton", tweets$text, ignore.case = TRUE)
tweets %>%
  mutate(term =
           case_when(
             trump + clinton + obama > 1 ~ "Multiple",
             trump == TRUE ~ "Trump",
             clinton == TRUE ~ "Clinton",
             obama == TRUE ~ "Obama"
           )) %>%
  filter(!is.na(term)) %>%
  count(Date = as.Date(created_str), term) %>%
  ggplot(aes(x = Date, y = n, color = term)) +
  geom_line() +
  facet_wrap(~term, ncol = 2) +
  labs(title = "Russian Troll Tweets by Mentioned Terms",
       y = "Tweet Count") +
  theme(legend.position = "none")
```
### Text analysis
To analyze the text, I'm using [`quanteda`](http://docs.quanteda.io/), which is my go-to text analysis package. Another great package that's easier for beginners is [`tidytext`](https://www.tidytextmining.com/). I chose `quanteda` for its speed and versatility, but `tidytext` has a lot to offer as well; I'll use it near the end to analyze the topic model results.
Let's load `quanteda` and create a corpus, keeping each user's screen_name (handle) as a document variable.
```{r text}
library(quanteda)
corpus <- corpus(tweets$text)
docvars(corpus, "screen_name") <- tweets$user_key
```
Next, we'll do basic pre-processing to create the dfm (document-feature matrix), the core data structure that holds the document-term counts and document variables in a very efficient sparse matrix.
I trimmed terms that do not appear at least 10 times in the dataset or in at least 5 tweets (somewhat arbitrary thresholds, but important for removing very rare/sparse terms).
```{r}
stopWords <- c("t.co", "http", "https", "amp")

dfm <- dfm(corpus,
           remove = c(stopwords("english"), stopWords),
           ngrams = 1L,
           stem = FALSE,
           remove_numbers = TRUE,
           remove_punct = TRUE,
           remove_symbols = TRUE) %>%
  dfm_trim(min_count = 10, min_docfreq = 5)
```
Let's look at the top words used.
```{r}
topfeatures(dfm, n = 50)
```
First, most tweets are retweets, hence "rt". Second, it's clear that the top terms used are related to "trump", "hillary", "clinton", and "obama." This provides direct evidence that these trolls were discussing politically related topics.
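Since "rt" tops the list, it's worth checking what share of the tweets look like retweets (a minimal sketch; it treats a leading "RT " in the text as the retweet marker, which is only an approximation).
```{r}
# approximate the retweet share by flagging tweets that start with "RT "
tweets %>%
  mutate(is_retweet = grepl("^rt ", text, ignore.case = TRUE)) %>%
  summarise(retweets = sum(is_retweet),
            total = n(),
            share = mean(is_retweet))
```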
### Hashtags
We can then use `quanteda`'s helper functions (`fcm()` and `textplot_network()`) to build a hashtag co-occurrence network.
For this, we'll take the top 100 hashtags and create a network in which the nodes are hashtags and the edges connect hashtags that co-occur in the same tweets, with low-frequency co-occurrences dropped via the `min_freq` threshold (0.1).
```{r}
tag_dfm <- dfm_select(dfm, ('#*'))
# number of hashtags to show
top <- 100
topfeatures(tag_dfm, top) %>%
  head(n = top)

# create co-occurrence matrix and network
fcm(tag_dfm) %>%
  fcm_select(names(topfeatures(tag_dfm, top))) %>%
  textplot_network(min_freq = 0.1, edge_alpha = 0.8, edge_size = 4)
```
### Handles
Similarly, we can do the same thing for handle (screen_name) mentions.
```{r}
sn_dfm <- dfm_select(dfm, ('@*'))
# number of handles to show
top <- 40

topfeatures(sn_dfm, top) %>%
  head(n = top)

# create co-occurrence matrix and network
fcm(sn_dfm) %>%
  fcm_select(names(topfeatures(sn_dfm, top))) %>%
  textplot_network(min_freq = 0.1, edge_alpha = 0.7, edge_size = 6)
```
### Topic Modeling
For this part, I'll run a correlated topic model using the `stm` package.
First, I'll need to convert the dfm to `stm`'s data structure.
```{r}
library(stm)
# use the quanteda converter to convert our dfm
stmdfm <- convert(dfm, to = "stm")

out <- prepDocuments(stmdfm$documents,
                     stmdfm$vocab,
                     stmdfm$meta,
                     lower.thresh = 10)
```
Next, we'll run the topic model, choosing 10 topics to keep things simple (there are a ton of great diagnostics for selecting the number of topics in `stm`; check them out in its help documentation!).
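One of those diagnostics is `searchK()`, which fits candidate models over a grid of K and reports held-out likelihood, residuals, and semantic coherence. Here's a minimal sketch of how it could be used (not run, since it's slow on the full corpus; the grid of K values is just illustrative).
```{r, eval=FALSE}
# sketch: compare a few candidate values of K with stm's built-in diagnostics
kResult <- searchK(out$documents, out$vocab,
                   K = c(5, 10, 15, 20),
                   data = out$meta,
                   init.type = "Spectral")
plot(kResult)  # held-out likelihood, residuals, coherence, lower bound
```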
```{r fig.height=8}
# k = 10 topics for simplicity
k <- 10

ctmFit <- stm(out$documents, out$vocab, K = k,
              max.em.its = 150, data = out$meta,
              init.type = "Spectral", seed = 300)

plot(ctmFit,
     type = "summary",
     xlim = c(0, .16),
     n = 5,
     labeltype = "prob",
     main = "Russian Troll Topics",
     text.cex = 0.8)
```
```{r}
# make sure you have tidytext version 0.1.6 or higher
library(tidytext)
# tidy the word-topic combinations
td_beta <- tidy(ctmFit)
td_beta
# helper functions (from David Robinson)
scale_x_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  stats::reorder(new_x, by, FUN = fun)
}
# Examine the topics
td_beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = topic)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() +
  coord_flip()
# tidy the document-topic combinations, with optional document names
td_gamma <- tidy(ctmFit, matrix = "gamma",
                 document_names = names(out$documents))
td_gamma
```
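As a quick usage example, `td_gamma` makes it easy to pull the documents most strongly associated with each topic (a minimal sketch using only the tidied output; the cutoff of five documents per topic is arbitrary).
```{r}
# for each topic, show the five documents with the highest topic proportion
td_gamma %>%
  group_by(topic) %>%
  top_n(5, gamma) %>%
  arrange(topic, desc(gamma))
```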
### Next Steps
There's a ton more analysis that could be done on this dataset. Some ideas include:
* Adding time as a prevalence covariate (i.e., the structural part of `stm`) to measure the effect of time on which topics were discussed (a sketch follows this list).
* Exploring the links the users posted. This is key to understanding what information they were disseminating and to which sites. For example, see [Starbird](http://faculty.washington.edu/kstarbi/Alt_Narratives_ICWSM17-CameraReady.pdf) (2017).
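For the first idea, here's a rough sketch of what that might look like (not run; the object names `stm2`, `out2`, `timeFit`, and `prep` are just illustrative, and the date is attached to the dfm as a docvar so it stays aligned with the documents that survive `convert()` and `prepDocuments()`).
```{r, eval=FALSE}
# sketch: attach the tweet date to the dfm so it flows through convert()
# and prepDocuments() aligned with the documents that are kept
docvars(dfm, "date") <- as.numeric(as.Date(tweets$created_str))

stm2 <- convert(dfm, to = "stm")
out2 <- prepDocuments(stm2$documents, stm2$vocab, stm2$meta, lower.thresh = 10)

# a spline of date lets topic prevalence vary smoothly over time
timeFit <- stm(out2$documents, out2$vocab, K = 10,
               prevalence = ~ s(date), data = out2$meta,
               max.em.its = 150, init.type = "Spectral", seed = 300)

# estimate and plot how a topic's expected prevalence changes over time
prep <- estimateEffect(1:10 ~ s(date), timeFit, metadata = out2$meta)
plot(prep, covariate = "date", method = "continuous", topics = 1)
```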