Code used in my "What's Your Twitter Story?" article on datacritics.com
Unless otherwise specified, code has been adapted from the book 'Text Mining with R' by Julia Silge and David Robinson.
A big thank you to both of these #rstats experts for their work in tidy text mining and for making their work so easily accessible for newcomers like myself.
Also a big thank you to Hadley Wickham for all his work on dplyr and tidyr, two packages also used in this project. The book he co-authored with Garrett Grolemund, 'R for Data Science', is another exceptional resource for beginners: http://r4ds.had.co.nz/index.html
And finally a big (or as One Direction would say: MASSIVE) THANK YOU to Terry and Jake, without whose encouragement and support I would not have finished this. MY VERY FIRST ARTICLE!!
You can follow along with my article, the 'Text Mining with R' explanations (https://www.tidytextmining.com/twitter.html), and the code below to get a better understanding.
To download your own Twitter archive, follow the instructions here: https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive.
Save your archive folder and unzip it somewhere easy to access. If you have no prior experience with R or programming and need help getting started on your analysis, give me a shout on Twitter @nazneen2411 and I will be more than happy to get you started! :)
Do note that this code assumes you are subsetting (splitting) your data into two groups. I will try to get code up for a general analysis (WITHOUT subsetting), which may not include all the graphs and plots. I will update this gist and put a link in the article when that happens.
#IMPORTING, TIMESTAMPS, TWEET ENCODING
Import tweets.csv into R, change the timestamp column from character to POSIXlt format, and assign the result back to the same dataframe.
```{r}
#uncomment the next line and run if you do not have these packages
#install.packages(c("lubridate","ggplot2","dplyr","readr"))
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
tweets <- read_csv("C:/Users/Nazneen/path/to/tweets.csv")
#CHANGE ENCODING OTHERWISE RECENT TWEETS (LATE 2017 ONWARDS) WILL NOT TRANSFORM
#AND TOKENIZATION WILL BE UGLY (I really have no other word to describe this)
#Otherwise as I mentioned in the article, I was ending up with partial contractions such as "couldn","don","ve"
tweets$text <- stringi::stri_trans_general(tweets$text, "latin-ascii")
#change timestamp from string to time format
tweets <- tweets %>%
  mutate(timestamp = ymd_hms(timestamp))
#change timezones, as the above code lists all times in UTC
#This matters if you want correct per-tweet times or per-day tweet counts: days are cut off
#at UTC midnight, which will be wrong if you're not in the UTC zone, so you need this conversion.
#Use an Olson timezone name (see OlsonNames()) like "America/Toronto" so Daylight Savings Time
#is handled automatically; fixed-offset codes like "EST" ignore DST.
#I'm in the Toronto area, so replace with your own timezone name if necessary
tweets$timestamp <- as.POSIXlt(tweets$timestamp, tz = "America/Toronto")
```
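A quick sanity check (not part of the original gist) to confirm the conversion worked:
```{r}
class(tweets$timestamp) #should now be "POSIXlt" "POSIXt" rather than "character"
head(tweets$timestamp)  #first few tweet times, displayed in your chosen timezone
```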
Plot 'tweets' to see an overview of the timeline and volume of tweets.
```{r}
ggplot(tweets, aes(x = timestamp)) +
  geom_histogram(position = "identity", bins = 20, show.legend = FALSE)
```
#PREPARING FOR SUBSETTING
Separate/group by year: 2015 and before versus 2016 to present, as two categories, to see whether my tweeting has changed over the year and two months that I was away from Twitter (among other things).
The next few bits are my own code, specific to how I was subsetting the data; they're not necessary if you're doing a general analysis.
You can skip directly to CONTRACTION REMOVAL. I recommend doing that step regardless, for more accurate word clouds and top words.
```{r}
#change back to character otherwise this code will not work
tweets$timestamp <- as.character(tweets$timestamp)
tweets <- tweets %>%
  mutate(timeperiod = case_when(grepl("2010", timestamp) ~ "past",
                                grepl("2011", timestamp) ~ "past",
                                grepl("2012", timestamp) ~ "past",
                                grepl("2013", timestamp) ~ "past",
                                grepl("2014", timestamp) ~ "past",
                                grepl("2015", timestamp) ~ "past",
                                grepl("2016", timestamp) ~ "present",
                                grepl("2017", timestamp) ~ "present",
                                grepl("2018", timestamp) ~ "present"))
```
Alternate code to do the above. It's more compact but requires a better understanding of piping and mutating. It also creates a 'year' column that the code above does not, which can be useful depending on your needs.
Note: this code only works with POSIXlt and not POSIXct. Understand the difference between the two here: https://www.r-bloggers.com/whats-the-difference-between-posixct-and-posixlt/
```{r}
tweets <- tweets %>%
  mutate(year = (tweets$timestamp)$year + 1900) %>% #POSIXlt stores the year as years since 1900
  mutate(timeperiod = ifelse(year %in% 2010:2015, "past",
                             ifelse(year %in% 2016:2018, "present", "NA")))
```
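To see why that note matters, here is a minimal illustration (made-up date, not from the archive) of how POSIXlt exposes date components while POSIXct does not:
```{r}
x <- as.POSIXlt("2017-06-15 12:00:00", tz = "UTC")
x$year + 1900 #POSIXlt is a list of date parts; year counts from 1900, so this gives 2017
y <- as.POSIXct("2017-06-15 12:00:00", tz = "UTC")
#y$year would throw an error: POSIXct is a single number (seconds since 1970-01-01)
```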
#CLEANING UP TWEETS BEFORE SUBSETTING
First: CONTRACTION REMOVAL
Before removing stop words and subsetting into past and present, we will expand all contractions.
Code adapted from: https://www.datacamp.com/community/tutorials/R-nlp-machine-learning
```{r}
#install.packages(c("tm", "NLP"))
library(tm)
#create the function, trust me you don't want to be doing this manually. It would take DAYS
fixcontractions <- function(a) {
a <- gsub("'","'", a) #the devil that are smart apostrophes being changed into normal ones
a <- gsub("won't", "will not", a)
a <- gsub("can't", "cannot", a) #expansion of can't does not have a space so I fixed that
a <- gsub("n't", " not", a)
a <- gsub("'ll"," will", a)
a <- gsub("'re", " are", a)
a <- gsub("'ve"," have", a)
a <- gsub("'m", " am", a)
a <- gsub("'d","", a) #'d can be "had" or "would"
a <- gsub("'s", "", a) #possessive noun, doesn't need to be expanded
return(a)
}
```
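To see what the function does, try it on a made-up sentence (this example is mine, not from the gist):
```{r}
fixcontractions("I can't say we've met, but you'll see I'm friendly")
#[1] "I cannot say we have met, but you will see I am friendly"
```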
Run the above function on 'tweets' to clean the text up.
First way:
```{r}
#note: gsub() is vectorized, so tweets$text <- fixcontractions(tweets$text) works just as well
tweets$text <- sapply(tweets$text, fixcontractions)
```
OR
Second way:
```{r}
#will not work if you're still using the POSIXlt timestamp, so let's change it back to character
tweets$timestamp <- as.character(tweets$timestamp)
tweets <- tweets %>%
  mutate(text = fixcontractions(text))
```
Finally subset into two dataframes based on timeperiod
```{r}
twts_past <- tweets[tweets$timeperiod == "past",]
twts_present <- tweets[tweets$timeperiod == "present",]
```
#UNNEST, TOKENIZE, FREQUENCIES
Now to unnest, tokenize and calculate frequencies
```{r}
library(tidytext)
library(stringr)
library(dplyr)
#replace_reg strips link stubs, HTML entities (&amp; etc.), and the RT marker before tokenizing
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
#unnest_reg splits on anything that isn't a word character, #, @, or an in-word apostrophe,
#so hashtags, handles, and contractions survive tokenization
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
#unnest and tokenize tweets to calculate frequencies and also so we can group by timeperiod
tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>% #drop retweets
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
#Run the same code as above on twts_past and twts_present and save to tidy_past
#and tidy_present respectively (spelled out after this block)
```
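For clarity, here is that same pipeline written out for the two subsets; only the input dataframe changes:
```{r}
tidy_past <- twts_past %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
tidy_present <- twts_present %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
```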
Calculate frequency of words by timeperiod (in my case)
```{r}
#on full tidy_tweets, save result to another variable
freq_tweets <- tidy_tweets %>%
  group_by(timeperiod) %>%
  count(word, sort = TRUE) %>%
  left_join(tidy_tweets %>%
              group_by(timeperiod) %>%
              summarise(total = n()),
            by = "timeperiod") %>%
  mutate(freq = n/total)
freq_tweets #shows the results of the code
```
```{r}
library(tidyr)
freq_tweets <- freq_tweets %>%
  select(timeperiod, word, freq) %>%
  spread(timeperiod, freq) %>%
  arrange(past, present)
freq_tweets #gives a nice tidy dataframe by timeperiod as a result
```
Plot the frequencies. Words near the red line are used at similar rates in both time periods; words far from the line are much more frequent in one period than the other.
```{r}
library(scales)
ggplot(freq_tweets, aes(present, past)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
```
#VISUALIZATION (WORD CLOUDS, SENTIMENT ANALYSIS)
Overall cloud
```{r}
tw <- tidy_tweets %>% count(word, sort = TRUE)
tw <- tw %>% filter(n > 1) #drop words used only once
#install.packages("wordcloud")
library(wordcloud)
#word cloud of top 25 words
tw %>%
  with(wordcloud(word, n, max.words = 25,
                 random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2")))
#Similarly, I ran the above code on the tidy_past and tidy_present dataframes to get
#the other two wordclouds (spelled out after this block)
```
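Written out for the past subset (swap in tidy_present the same way for the present cloud):
```{r}
tw_past <- tidy_past %>% count(word, sort = TRUE) %>% filter(n > 1)
tw_past %>%
  with(wordcloud(word, n, max.words = 25,
                 random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2")))
```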
Positive and negative words, using the Bing lexicon for sentiment analysis
```{r}
#install.packages("reshape2")
library(reshape2)
tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("springgreen4", "orchid3"),
                   max.words = 25)
#Again, I ran the above code for both tidy_past and tidy_present to get the top positive
#and negative words for those two subsets
```
Word Ratios - another way to see top words by subset (past/present)
```{r}
word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>% #filtering so we don't have any mentions or handles in this
  count(word, timeperiod) %>%
  group_by(word) %>% #group by word so the next filter sees each word's total count
  filter(sum(n) >= 10) %>% #keep only words used at least 10 times across both periods
  ungroup() %>%
  spread(timeperiod, n, fill = 0) %>%
  #add 1 to every count so words absent from one period don't produce log(0) below
  mutate_if(is.numeric, ~ ((. + 1) / sum(. + 1))) %>%
  mutate(logratio = log(present / past)) %>%
  arrange(desc(logratio))
```
Sorting by the absolute value of logratio (smallest first) shows the words that were about equally likely in the past and present periods.
```{r}
word_ratios %>%
  arrange(abs(logratio))
```
Plot the top 12 words from each time period, based on the logratios of the two time periods.
```{r}
word_ratios %>%
  group_by(logratio < 0) %>%
  top_n(12, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Present/Past)") +
  scale_fill_discrete(name = "", labels = c("Past", "Present"))
```
The End!
As I mentioned earlier, I will try to get code up in the next few days for a very general analysis without the subsetting.