Last active
March 19, 2018 20:26
-
-
Save wesslen/01384bb7cd94a6bd1d393629b984d572 to your computer and use it in GitHub Desktop.
troll-tweets Rmd
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Analyzing Russian Trolls: Tidyverse & Text" | |
author: "Ryan Wesslen" | |
date: "2/21/2018" | |
output: html_document | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE, warning = FALSE) | |
``` | |
## Russian Trolls | |
Given all of the news about Russian Trolls, I found a publicly avaiable dataset of known Russian Twitter trolls posted by NBC <https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731>. | |
Let's load the Troll tweets using tidyverse (readr). | |
```{r data} | |
library(tidyverse) | |
url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv" | |
tweets <- read_csv(url) | |
user.url <- "http://nodeassets.nbcnews.com/russian-twitter-trolls/users.csv" | |
users <- read_csv(user.url) | |
``` | |
### Plotting tweets over time | |
A good start is to plot the number of tweets by date. | |
```{r} | |
tweets %>% | |
count(Date = as.Date(created_str)) %>% | |
ggplot(aes(x = Date, y = n)) + | |
geom_line() + | |
labs(title = "Tweets by Known Russian Twitter Trolls", | |
x = "Day", | |
y = "Number of Tweets") | |
``` | |
What's clear is that their activity started back in 2014 but really started to pick up around mid 2016, averaging hundreds to thousands of tweets per day. | |
Let's see the trends by counting the number of tweets that contain politically related terms. | |
```{r} | |
tweets$trump <- grepl("trump", tweets$text, ignore.case = TRUE) | |
tweets$obama <- grepl("obama", tweets$text, ignore.case = TRUE) | |
tweets$clinton <- grepl("clinton", tweets$text, ignore.case = TRUE) | |
tweets %>% | |
mutate(term = | |
case_when( | |
trump + clinton + obama > 1 ~ "Multiple", | |
trump == TRUE ~ "Trump", | |
clinton == TRUE ~ "Clinton", | |
obama == TRUE ~ "Obama" | |
)) %>% | |
filter(!is.na(term)) %>% | |
count(Date = as.Date(created_str), term) %>% | |
ggplot(aes(x = Date, y = n, color = term)) + | |
geom_line() + | |
facet_wrap(~term, ncol = 2) + | |
labs(title = "Russian Troll Tweets by Mentioned Terms", | |
y = "Tweet Count") + | |
theme(legend.position = "none") | |
``` | |
### Text analysis | |
To analyze the text, I'm using [`quanteda`](http://docs.quanteda.io/), which is my go-to text analysis package (anywhere). However, another great package that's easier for beginners is [`tidytext`](https://www.tidytextmining.com/). I simply choose `quanteda` because of its speed and versalility, but `tidytext` has a lot to offer as well. I will however use `tidytext` near the end to analyze my topic model results. | |
Let's load `quanteda` and create a corpus as well as keeping the users' screen_name (handle) on as a corpus feature. | |
```{r text} | |
library(quanteda) | |
corpus <- corpus(tweets$text) | |
docvars(corpus, "screen_name") <- tweets$user_key | |
``` | |
Next, we'll do basic pre-processing to create the dfm (document-feature matrix), that is the core datastructure to keep our document-term matrix and corpus feature in a very efficient sparse matrix. | |
I trimmed terms that are not in the dataset at least 10 times or in at least 5 tweets (somewhat arbitrary but important to remove very rare/sparse terms). | |
```{r} | |
stopWords <- c("t.co","http","https","amp") | |
dfm <- dfm(corpus, | |
remove = c(stopwords("english"), stopWords), | |
ngrams= 1L, | |
stem = F, | |
remove_numbers = TRUE, | |
remove_punct = TRUE, | |
remove_symbols = TRUE) %>% | |
dfm_trim(min_count = 10, min_docfreq = 5) | |
``` | |
Let's look at the top words used. | |
```{r} | |
topfeatures(dfm, n = 50) | |
``` | |
First, most tweets are retweets, hence "rt". Second, it's clear that the top terms used are related to "trump", "hillary", "clinton", and "obama." This provides direct evidence that these trolls were discussing politically related topics. | |
### Hashtags | |
We can then use helpful functions to build a hashtag co-occurrence network. | |
For this, we'll run with the top 40 hashtags and create a network in which the nodes are hashatags and the edges are when the hashtags co-occurred in at least 10% of tweets. | |
```{r} | |
tag_dfm <- dfm_select(dfm, ('#*')) | |
# number of hashtags to show | |
top <- 100 | |
topfeatures(tag_dfm, top) %>% | |
head(n = top) | |
# create co-occurrence matrix and network | |
fcm(tag_dfm) %>% | |
fcm_select(names(topfeatures(tag_dfm, top))) %>% | |
textplot_network(min_freq = 0.1, edge_alpha = 0.8, edge_size = 4) | |
``` | |
### Handles | |
Similarly, we can do the same thing but for handles (screen_name) mentions. | |
```{r} | |
sn_dfm <- dfm_select(dfm, ('@*')) | |
# number of hashtags to show | |
top <- 40 | |
topfeatures(sn_dfm, top) %>% | |
head(n = top) | |
# create co-occurrence matrix and network | |
fcm(sn_dfm) %>% | |
fcm_select(names(topfeatures(sn_dfm, top))) %>% | |
textplot_network(min_freq = 0.1, edge_alpha = 0.7, edge_size = 6) | |
``` | |
### Topic Modeling | |
For this part, I'll run Correlated Topic Model using the `stm` package. | |
First, I'll need to convert my dfm to the stm data-structure. | |
```{r} | |
library(stm) | |
# use quanteda converter to convert our Dfm | |
stmdfm <- convert(dfm, to = "stm") | |
out <- prepDocuments(stmdfm$documents, | |
stmdfm$vocab, | |
stmdfm$meta, | |
lower.thresh = 10) | |
``` | |
Next, we'll run topic modeling, choosing 10 topics to simplify our models (there are a ton of great diagnostics for selecting the number of topics in `stm`. Check them out in its help documents!). | |
```{r fig.height=8} | |
# k = 10 topic for simplicity | |
k <- 10 | |
ctmFit <- stm(out$documents, out$vocab, K = k, | |
max.em.its = 150, data = out$meta, init.type = "Spectral", seed = 300) | |
plot(ctmFit, | |
type = "summary", | |
xlim = c(0,.16), | |
n = 5, | |
labeltype = "prob", | |
main = "Russian Troll Topics", | |
text.cex = 0.8) | |
``` | |
```{r} | |
# make sure its version 0.1.6 or higher of tidytext | |
library(tidytext) | |
# tidy the word-topic combinations | |
td_beta <- tidy(ctmFit) | |
td_beta | |
# helper functions (from David Robinson) | |
scale_x_reordered <- function(..., sep = "___") { | |
reg <- paste0(sep, ".+$") | |
ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...) | |
} | |
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) { | |
new_x <- paste(x, within, sep = sep) | |
stats::reorder(new_x, by, FUN = fun) | |
} | |
# Examine the topics | |
td_beta %>% | |
group_by(topic) %>% | |
top_n(10, beta) %>% | |
ungroup() %>% | |
mutate(topic = paste0("Topic ", topic), | |
term = reorder_within(term, beta, topic)) %>% | |
ggplot(aes(term, beta, fill = topic)) + | |
geom_col(alpha = 0.8, show.legend = FALSE) + | |
facet_wrap(~ topic, scales = "free") + | |
scale_x_reordered() + | |
coord_flip() | |
# tidy the document-topic combinations, with optional document names | |
td_gamma <- tidy(ctmFit, matrix = "gamma", | |
document_names = rownames(stmdfm)) | |
td_gamma | |
``` | |
### Next Steps | |
There's a ton of more analysis that could be done on this dataset. Some ideas include: | |
* adding in time as a prevalance covariate (i.e., stm) to measure the effect of time on what topics were discussed. | |
* Explore the links the users posted. This is key to understand what information they're disseminating to different sites. For example, see [Starbird](http://faculty.washington.edu/kstarbi/Alt_Narratives_ICWSM17-CameraReady.pdf) (2017). |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment