Skip to content

Instantly share code, notes, and snippets.

What would you like to do?
Code used in my "What's Your Twitter Story?" article on
Unless otherwise specified, code has been adapted from the book 'Text Mining with R' by Julia Silge and David Robinson.
A big thank you to both of these #rstats experts for their work in tidy text mining and for making their work so easily accessible for newcomers like myself.
Also a big thank you to Hadley Wickham for all his work on dplyr and tidyr, two packages also used in this project. His co-authored book with Garrett Grolemund called 'R for Data Science' is another exceptional resource for beginners.'R for Data Science' book link:
And finally a big (or as One Direction would say: MASSIVE) THANK YOU to Terry, and Jake without whose encouragment and support I would not have finished this. MY VERY FIRST ARTICLE!!
You can follow along with my article, and the 'Text Mining with R' explanations (, and code to get a better understanding
To download your own Twitter archive, follow the instructions here:
Save your archive folder and unzip in an easy to access place. If you have no prior experience with R or programming and need help getting started on your analysis, please give me a shout on Twitter @nazneen2411 and I will be more than happy to get you started! :)
Do note that this code works when you are subsetting(splitting) you data into two groups. I will try to get code up for just a general analysis (WITHOUT subsetting). This may not include all the graphs and plots. I will update this gist and also put a link in the article when that happens.
Import tweets.csv to R and change timestamp from chr to POSIXlt format, assign to the same dataframe (df)
#uncomment the next line and run if you do not have these packages
tweets <- read_csv("C:/Users/Nazneen/path/to/tweets.csv")
#AND TOKENIZATION WILL BE UGLY (I really have no other word to describe this)
#Otherwise as I mentioned in the article, I was ending up with partial contractions such as "couldn","don","ve"
tweets$text <- stringi::stri_trans_general(tweets$text, "latin-ascii")
#change timestamp from string to time format
tweets <- bind_rows(tweets %>%
mutate(timestamp = ymd_hms(timestamp)))
#change timezones as the above code will have all times listed in UTC
#This is not really necessary unless you're analyzing a very small amount of tweets or want the correct counts of
#tweets in a day. Which would be wrong if you're not in the UTC zone, so you need this conversion.
#Code accounts for Daylight Savings Time so no extra steps required
#I changed it to EST bc I'm in the Toronto area, replace with appropriate timezone code if necessary
tweets$timestamp <- as.POSIXlt(tweets$timestamp, tz="EST")
Plot 'tweets' to see overview of timeline and amount of tweets
ggplot(tweets, aes(x = timestamp)) +
geom_histogram(position = "identity", bins = 20, show.legend = FALSE)
Seperate/group by year, 2015 and before, and 2016 to present as two categories to see if my tweeting has changed in the 1yr2months that I was away from twitter (among other things)
The next few bits is my own code specific to how I was subsetting the data, not necessary if you're doing a general analysis.
You can skip directly to go to CONTRACTION REMOVAL. I recommend doing this for more accurate word clouds and Top words.
#change back to character otherwise this code will not work
tweets$timestamp <- as.character(tweets$timestamp)
tweets <- tweets %>%
mutate(timeperiod = case_when(grepl("2010", timestamp) ~ "past",
grepl("2011", timestamp) ~ "past",
grepl("2012", timestamp) ~ "past",
grepl("2013", timestamp) ~ "past",
grepl("2014", timestamp) ~ "past",
grepl("2015", timestamp) ~ "past",
grepl("2016", timestamp) ~ "present",
grepl("2017", timestamp) ~ "present",
grepl("2018", timestamp) ~ "present"))
Alternate code to do the above. Compact but requires greater understanding of piping and mutating. Will create a column called year which would not be created in the above code. However it can be useful depending on your needs.
Note: this code only works with POSIXlt and not POSIXct. Understand the difference between the two here:
tweets <- tweets %>%
mutate(year = (tweets$timestamp)$year + 1900) %>%
mutate(timeperiod = ifelse(year %in% 2010:2015, "past",
ifelse(year %in% 2016:2018, "present", "NA")))
Before removing stop words and subsetting into past and present, we will change all contractions.
Code adapted from:
#install.packages(c("tm", "NLP"))
#create the function, trust me you don't want to be doing this manually. It would take DAYS
fixcontractions <- function(a) {
a <- gsub("'","'", a) #the devil that are smart apostrophes being changed into normal ones
a <- gsub("won't", "will not", a)
a <- gsub("can't", "cannot", a) #expansion of can't does not have a space so I fixed that
a <- gsub("n't", " not", a)
a <- gsub("'ll"," will", a)
a <- gsub("'re", " are", a)
a <- gsub("'ve"," have", a)
a <- gsub("'m", " am", a)
a <- gsub("'d","", a) #'d can be "had" or "would"
a <- gsub("'s", "", a) #possessive noun, doesn't need to be expanded
Run the above function on 'tweets' to clean the text up.
First way:
tweets$text <- sapply(tweets$text, fixcontractions)
Second way:
#will not work if you're still using the POSIXlt timestamp, so let's change it back to character
tweets$timestamp <- as.character(tweets$timestamp)
tweets <- bind_rows(tweets %>%
mutate(text= fixcontractions(tweets$text)))
Finally subset into two dataframes based on timeperiod
twts_past <- tweets[tweets$timeperiod == "past",]
twts_present <- tweets[tweets$timeperiod == "present",]
Now to unnest, tokenize and calculate frequencies
replace_reg <- "[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
#unnest and tokenize tweets to calculate frequencies and also so we can group by timeperiod
tidy_tweets <- tweets %>%
filter(!str_detect(text, "^RT")) %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
#Run the same code as above on twts_past and twts_present and save to tidy_past
#and tidy_present respectively
Calculate frequency of words by timeperiod (in my case)
#on full tidy_tweets, save result to another variable
freq_tweets <- tidy_tweets %>%
group_by(timeperiod) %>%
count(word, sort = TRUE) %>%
left_join(tidy_tweets %>%
group_by(timeperiod) %>%
summarise(total = n())) %>%
mutate(freq = n/total)
freq_tweets #shows the results of the code
freq_tweets <- freq_tweets %>%
select(timeperiod, word, freq) %>%
spread(timeperiod, freq) %>%
arrange(past, present)
freq_tweets #gives a nice tidy dataframe by timeperiod as a result
Plot the frequencies
ggplot(freq_tweets, aes(present, past)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "red")
Overall cloud
tw <- tidy_tweets %>% count(word, sort=TRUE)
tw <- tw %>% filter(n>1)
#word cloud of top 25 words
tw %>%
random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2")))
#Similarly, I ran the above code on the tidy_past and tidy_present dataframes to get the other two wordclouds
Positive and Negative words using BING lexicon for sentiment analysis
tidy_tweets %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>% = c("springgreen4", "orchid3"),
max.words = 25)
#Again, I ran the above code for both tidy_past and tidy_present to get the Top positive and negative words for those two subsets
Word Ratios - another way to see top words by subset (past/present)
word_ratios <- tidy_tweets %>%
filter(!str_detect(word, "^@")) %>% #filtering so we don't have any mentions or handles in this
count(word, timeperiod) %>%
filter(sum(n) >= 10) %>%
ungroup() %>%
spread(timeperiod, n, fill = 0) %>%
mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
mutate(logratio = log(present / past)) %>%
Shows distinct words and their logratios when compared to Past and Present
word_ratios %>%
Plot top 12 words based on logratios of the two timeperiods
word_ratios %>%
group_by(logratio < 0) %>%
top_n(12, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio (Present/Past)") +
scale_fill_discrete(name = "", labels = c("Past", "Present"))
The End!
As I mentioned earlier, I will try to get a code up in the next few days for a very general analysis without the subsetting.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment