csiu/emoji-data-science-twitter.Rmd

## emoji-data-science-twitter.Rmd
---
title: "Download twitter data"
author: "csiu"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      eval=TRUE,
                      cache = TRUE )
```

- Primary reference: [Emoji data science in R: A tutorial (Hamdan Azhar, 2017)](https://prismoji.com/2017/02/06/emoji-data-science-in-r-tutorial/#part1)
- Secondary reference: [Twimoji: Identifying Emoji in Tweets (Chris Tufts, 2015)](http://miningthedetails.com/blog/r/IdentifyEmojiInTweets/)

## Connect to twitter

```{r load-lib, message=FALSE}
# install.packages("twitteR")
library(twitteR)
library(dplyr)
library(readr)
library(stringr)
library(lubridate)
```

```{r twitter-connect}
#' Create API keys from https://apps.twitter.com
api_key <- 'XXX'
api_secret <- 'XXX'
access_token <- 'XXX'
access_token_secret <- 'XXX'
source("twitter_api_key.R") # To load true values

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
```

## Pull tweets

```{r}
set.seed(20170202)

#' Pull tweets
search_string <- "celiassiu"
tweets.raw <-
  searchTwitter(search_string,
                n = 1000,    # Max number of tweets to return
                lang = 'en', # Restrict tweets to given language
                since = '2017-05-07')

#' Remove retweets & convert twitteR lists to data.frames
#' We do the following:
#' 1. Remove retweets
#' 2. convert to data frame
(df <-
  strip_retweets(tweets.raw, strip_manual = TRUE, strip_mt = TRUE) %>%
  twListToDF()
) %>%
  head()
```

## Tidy data frame

```{r}
df_tidy <-
  df %>%
  mutate(
    # Add new columns containing the hashtag & tweet url
    hashtag = search_string,
    url = paste0('https://twitter.com/', screenName, '/status/', id),

    # Convert character vector between encodings
    text = iconv(text, from='latin1', to='ASCII', sub='byte'),

    # Update type
    created = lubridate::ymd_hms(created, tz = "UTC")
  ) %>%
  rename(
    retweets = retweetCount
  ) %>%
  select(
    text, created, url, latitude, longitude, retweets, hashtag, screenName
  )

#' Print head 10 lines of data frame
head(df_tidy)
```

Number of tweets:

```{r}
nrow(df_tidy)
```

## Load emoji dictionary

The Emoji dictionary is obtained from [GitHub: today-is-a-good-day/emojis](https://github.com/today-is-a-good-day/emojis/blob/master/emDict.csv).

```{r warning=FALSE, message=FALSE}
(emoticons <-
  readr::read_delim(
    "emDict.csv",
    delim = ";",
    col_names = c("description", "native", "bytes", "r_encoding"),
    skip = 1
  ) %>%
  mutate(description = tolower(description))
) %>%
  head()

#' Number of emojis
nrow(emoticons)
```

## Count emojis

- [Installation of `rWeka`](http://justrocketscience.com/post/install-rweka-mac) (containing the `WordTokenizer` mentioned in [Twimoji: Identifying Emoji in Tweets (CHRIS TUFTS, 2015)](http://miningthedetails.com/blog/r/IdentifyEmojiInTweets/)) package failed
- Alternative solution: use `stringr::str_count` to count the number of matches in a string

```{r}
# Helper function to count number of times pattern occur in string
count_emojis <- function(e){
  counts <- str_count(df_tidy$text, e)
  data.frame(
    counts,
    tweet_id = 1:length(counts)
  )
}
# Do the counting of emojis for each tweet
emoji_counts <-
  emoticons %>%
  select(description, r_encoding) %>%
  mutate(
    counts = purrr::map(r_encoding, ~count_emojis(.x))
  ) %>%
  tidyr::unnest(counts)

# Summarize the counts per emoji
emoji_counts %>%
  filter(counts != 0) %>%

  # Here I want to only consider the latest tweet
  filter(tweet_id == 1) %>%
  select(-tweet_id) %>%

  group_by(description) %>%
  summarise(count = sum(counts)) %>%
  arrange(desc(count))
```
	---
	title: "Download twitter data"
	author: "csiu"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE,
	eval=TRUE,
	cache = TRUE )
	```

	- Primary reference: [Emoji data science in R: A tutorial (Hamdan Azhar, 2017)](https://prismoji.com/2017/02/06/emoji-data-science-in-r-tutorial/#part1)
	- Secondary reference: [Twimoji: Identifying Emoji in Tweets (Chris Tufts, 2015)](http://miningthedetails.com/blog/r/IdentifyEmojiInTweets/)

	## Connect to twitter

	```{r load-lib, message=FALSE}
	# install.packages("twitteR")
	library(twitteR)
	library(dplyr)
	library(readr)
	library(stringr)
	library(lubridate)
	```

	```{r twitter-connect}
	#' Create API keys from https://apps.twitter.com
	api_key <- 'XXX'
	api_secret <- 'XXX'
	access_token <- 'XXX'
	access_token_secret <- 'XXX'
	source("twitter_api_key.R") # To load true values

	setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
	```

	## Pull tweets

	```{r}
	set.seed(20170202)

	#' Pull tweets
	search_string <- "celiassiu"
	tweets.raw <-
	searchTwitter(search_string,
	n = 1000, # Max number of tweets to return
	lang = 'en', # Restrict tweets to given language
	since = '2017-05-07')

	#' Remove retweets & convert twitteR lists to data.frames
	#' We do the following:
	#' 1. Remove retweets
	#' 2. convert to data frame
	(df <-
	strip_retweets(tweets.raw, strip_manual = TRUE, strip_mt = TRUE) %>%
	twListToDF()
	) %>%
	head()
	```

	## Tidy data frame

	```{r}
	df_tidy <-
	df %>%
	mutate(
	# Add new columns containing the hashtag & tweet url
	hashtag = search_string,
	url = paste0('https://twitter.com/', screenName, '/status/', id),

	# Convert character vector between encodings
	text = iconv(text, from='latin1', to='ASCII', sub='byte'),

	# Update type
	created = lubridate::ymd_hms(created, tz = "UTC")
	) %>%
	rename(
	retweets = retweetCount
	) %>%
	select(
	text, created, url, latitude, longitude, retweets, hashtag, screenName
	)

	#' Print head 10 lines of data frame
	head(df_tidy)
	```

	Number of tweets:

	```{r}
	nrow(df_tidy)
	```

	## Load emoji dictionary

	The Emoji dictionary is obtained from [GitHub: today-is-a-good-day/emojis](https://github.com/today-is-a-good-day/emojis/blob/master/emDict.csv).

	```{r warning=FALSE, message=FALSE}
	(emoticons <-
	readr::read_delim(
	"emDict.csv",
	delim = ";",
	col_names = c("description", "native", "bytes", "r_encoding"),
	skip = 1
	) %>%
	mutate(description = tolower(description))
	) %>%
	head()

	#' Number of emojis
	nrow(emoticons)
	```

	## Count emojis

	- [Installation of `rWeka`](http://justrocketscience.com/post/install-rweka-mac) (containing the `WordTokenizer` mentioned in [Twimoji: Identifying Emoji in Tweets (CHRIS TUFTS, 2015)](http://miningthedetails.com/blog/r/IdentifyEmojiInTweets/)) package failed
	- Alternative solution: use `stringr::str_count` to count the number of matches in a string

	```{r}
	# Helper function to count number of times pattern occur in string
	count_emojis <- function(e){
	counts <- str_count(df_tidy$text, e)
	data.frame(
	counts,
	tweet_id = 1:length(counts)
	)
	}
	# Do the counting of emojis for each tweet
	emoji_counts <-
	emoticons %>%
	select(description, r_encoding) %>%
	mutate(
	counts = purrr::map(r_encoding, ~count_emojis(.x))
	) %>%
	tidyr::unnest(counts)

	# Summarize the counts per emoji
	emoji_counts %>%
	filter(counts != 0) %>%

	# Here I want to only consider the latest tweet
	filter(tweet_id == 1) %>%
	select(-tweet_id) %>%

	group_by(description) %>%
	summarise(count = sum(counts)) %>%
	arrange(desc(count))
	```