Doing emoji data science
---
title: "Download twitter data"
author: "csiu"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      eval = TRUE,
                      cache = TRUE)
```

- Primary reference: [Emoji data science in R: A tutorial (Hamdan Azhar, 2017)](https://prismoji.com/2017/02/06/emoji-data-science-in-r-tutorial/#part1)
- Secondary reference: [Twimoji: Identifying Emoji in Tweets (Chris Tufts, 2015)](http://miningthedetails.com/blog/r/IdentifyEmojiInTweets/)

## Connect to twitter

```{r load-lib, message=FALSE}
# install.packages("twitteR")
library(twitteR)
library(dplyr)
library(readr)
library(stringr)
library(lubridate)
```

```{r twitter-connect}
#' Create API keys from https://apps.twitter.com
api_key <- 'XXX'
api_secret <- 'XXX'
access_token <- 'XXX'
access_token_secret <- 'XXX'

source("twitter_api_key.R") # To load true values

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
```

## Pull tweets

```{r}
set.seed(20170202)

#' Pull tweets
search_string <- "celiassiu"
tweets.raw <-
  searchTwitter(search_string,
                n = 1000,     # Max number of tweets to return
                lang = 'en',  # Restrict tweets to given language
                since = '2017-05-07')

#' Remove retweets, then convert the twitteR list to a data frame
(df <-
   strip_retweets(tweets.raw, strip_manual = TRUE, strip_mt = TRUE) %>%
   twListToDF()
) %>%
  head()
```

## Tidy data frame

```{r}
df_tidy <-
  df %>%
  mutate(
    # Add new columns containing the search string & tweet url
    hashtag = search_string,
    url = paste0('https://twitter.com/', screenName, '/status/', id),

    # Re-encode the text as ASCII; bytes that are not ASCII (e.g. emoji)
    # are replaced by their hex codes in angle brackets
    text = iconv(text, from='latin1', to='ASCII', sub='byte'),

    # Parse the timestamp as a UTC date-time
    created = lubridate::ymd_hms(created, tz = "UTC")
  ) %>%
  rename(
    retweets = retweetCount
  ) %>%
  select(
    text, created, url, latitude, longitude, retweets, hashtag, screenName
  )

#' Print the first rows of the data frame (head() shows 6 by default)
head(df_tidy)
```

Number of tweets:

```{r}
nrow(df_tidy)
```

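The `iconv()` call in the tidy step is what makes emoji searchable later, so here is a minimal sketch of its effect on a single made-up string. The variable names `x` and `x_bytes` and the sample text are assumptions for illustration only; they are not part of the tweet data. The sketch also assumes the emoji is stored as UTF-8, which is how R stores `\U` escapes.

```r
# Toy input (hypothetical, not from the pulled tweets): a string with one
# emoji (U+1F600, grinning face), written as an escape so this file stays ASCII.
x <- "good morning \U0001F600"

# Same arguments as in the tidy step above. Each byte that has no ASCII
# equivalent is replaced by its hex code in angle brackets, so the emoji's
# four UTF-8 bytes come out as "<f0><9f><98><80>".
x_bytes <- iconv(x, from = "latin1", to = "ASCII", sub = "byte")
x_bytes
```

Strings of this form are what the `r_encoding` column of the emoji dictionary is matched against in the counting step.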
## Load emoji dictionary

The Emoji dictionary is obtained from [GitHub: today-is-a-good-day/emojis](https://github.com/today-is-a-good-day/emojis/blob/master/emDict.csv).

```{r warning=FALSE, message=FALSE}
(emoticons <-
   readr::read_delim(
     "emDict.csv",
     delim = ";",
     col_names = c("description", "native", "bytes", "r_encoding"),
     skip = 1
   ) %>%
   mutate(description = tolower(description))
) %>%
  head()

#' Number of emojis
nrow(emoticons)
```

## Count emojis

- Installing the `RWeka` package failed ([installation notes](http://justrocketscience.com/post/install-rweka-mac)); it contains the `WordTokenizer` used in [Twimoji: Identifying Emoji in Tweets (Chris Tufts, 2015)](http://miningthedetails.com/blog/r/IdentifyEmojiInTweets/)
- Alternative solution: use `stringr::str_count` to count the number of matches in a string

```{r}
# Helper function to count the number of times a pattern occurs in each tweet
count_emojis <- function(e){
  counts <- str_count(df_tidy$text, e)
  data.frame(
    counts,
    tweet_id = seq_along(counts)  # safe even when there are no tweets
  )
}

# Do the counting of emojis for each tweet
emoji_counts <-
  emoticons %>%
  select(description, r_encoding) %>%
  mutate(
    counts = purrr::map(r_encoding, ~count_emojis(.x))
  ) %>%
  tidyr::unnest(counts)

# Summarize the counts per emoji
emoji_counts %>%
  filter(counts != 0) %>%
  # Here I want to only consider the latest tweet
  filter(tweet_id == 1) %>%
  select(-tweet_id) %>%
  group_by(description) %>%
  summarise(count = sum(counts)) %>%
  arrange(desc(count))
```
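
To make the counting step easier to follow, here is a small self-contained sketch of the same idea on made-up data. The objects `toy_text`, `toy_dict` and `toy_counts` are hypothetical and only mimic the shape of `df_tidy$text` and `emoticons`. Wrapping the pattern in `stringr::fixed()` is my own choice here, so that the byte strings are matched literally rather than as regular expressions; the chunk above passes the pattern directly.

```r
library(stringr)

# Hypothetical tweets, already in the byte-escaped form produced by
# iconv(..., sub = "byte")
toy_text <- c(
  "so happy <f0><9f><98><80><f0><9f><98><80>",
  "no emoji here",
  "love it <e2><9d><a4>"
)

# Hypothetical two-row dictionary with the two columns the counting step uses
toy_dict <- data.frame(
  description = c("grinning face", "heavy black heart"),
  r_encoding  = c("<f0><9f><98><80>", "<e2><9d><a4>"),
  stringsAsFactors = FALSE
)

# For each emoji encoding, count literal matches in every tweet.
# Result: a matrix with one row per tweet and one column per emoji.
toy_counts <- sapply(
  toy_dict$r_encoding,
  function(e) str_count(toy_text, fixed(e))
)
colnames(toy_counts) <- toy_dict$description

toy_counts           # per-tweet counts
colSums(toy_counts)  # totals per emoji across all tweets
```

One caveat of plain substring matching: an emoji whose encoding is a prefix of a longer encoding is also counted wherever the longer one appears.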