michelleboisson/Lets-Get-Political.R

## Lets-Get-Political.R
# Let’s break down our tweets around a certain topic.  How about, oh, say, Iran?
# So how do we pull tweets out that have a certain word in them?
# grep() to the rescue!  If you’ve used the grep function on the command-line, this
# should look familiar.  grep() takes as arguments a phrase you’re searching for, a
# set of text to look through, and optional arguments about how to search.  It will
# then return the row numbers of any rows that match your search.  To pull out Iran
# tweets, we can use the code:

iran.tweets <- tweets[grep("iran", tweets$text, ignore.case=TRUE), ]

# This subset also matches words like "zerugiran" and "Miranda".
# How do we take these out?
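
# A hedged sketch of one way to drop matches like "Miranda": anchor the pattern
# with \\b (a word boundary). The demo.* names and sample strings are made up for
# illustration and are not part of the tweets data.
demo.text <- c("Iran election results", "Miranda rights", "zerugiran",
               "news from IRAN", "#iranelection")
# "\\biran" requires a word boundary before "iran", so it skips "Miranda" and
# "zerugiran" but still keeps hashtags such as "#iranelection"
demo.starts.word <- grepl("\\biran", demo.text, ignore.case=TRUE, perl=TRUE)
# "\\biran\\b" also requires a boundary after it, keeping only the standalone word
demo.whole.word <- grepl("\\biran\\b", demo.text, ignore.case=TRUE, perl=TRUE)
demo.text[demo.starts.word]
demo.text[demo.whole.word]
# On the real data, either pattern could replace "iran" in a grep() call on tweets$text.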

# Plot the time series for iran.tweets using a histogram with breaks=100. Add red
# vertical lines to the plot at the 3 largest peaks using abline().
# hist() both draws the histogram and returns the bin counts and breaks
iran.hist = hist(iran.tweets$seconds, breaks=100)
plot(iran.hist$counts, type='l')

# The x axis of this plot is the bin index, so the lines must be placed at the
# indices of the largest counts, not at the count values themselves
rev(sort(iran.hist$counts))[1:3]
top3 = order(iran.hist$counts, decreasing=TRUE)[1:3]
abline(v=top3, col="red")
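
# A small sketch of how order() finds peak positions, on made-up counts
# (demo.counts is hypothetical). order(..., decreasing=TRUE) returns bin indices
# sorted from the largest count down, which is what abline(v=...) needs when the
# x axis is the bin index.
demo.counts <- c(2, 9, 4, 8, 1, 7)
demo.top3 <- order(demo.counts, decreasing=TRUE)[1:3]
demo.top3               # bin indices of the three largest counts
demo.counts[demo.top3]  # the counts found at those indices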


#There’s not a lot of seasonality in this plot, so let’s go straight to analyzing
# the trend.  Use SMA() with the default settings to smooth the signal and plot it.

library(TTR)
iran.counts = iran.hist$counts
iran.counts.smooth = SMA(iran.counts)
plot.ts(iran.counts.smooth, type='l')
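
# For intuition, a rough base-R stand-in for a trailing simple moving average,
# using stats::filter() rather than TTR. The demo.* names and the window of 3 are
# assumptions for illustration. Each output is the mean of the current value and
# the n-1 values before it, so the first n-1 outputs are NA.
demo.x <- c(1, 2, 3, 4, 5, 6)
demo.n <- 3
demo.sma <- as.numeric(stats::filter(demo.x, rep(1/demo.n, demo.n), sides=1))
demo.sma
# With SMA()'s default n=10 the same logic leaves 9 leading NAs.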

# Let’s build a basic event detection algorithm, but let’s not use the total number of
# tweets, since that misses the "velocity" of the signal.  Use the diff() function with a
# lag of 5 to look at the differences in tweet volume over time on the smoothed
# signal (use ?diff if you need a refresher).  Create a figure with two graphs – one
# with the smoothed signal above and one with the diff() of the signal below it.
# What do you see?

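par(mfrow=c(2,1))   # two stacked panels: smoothed signal on top, its diff below
plot.ts(iran.counts.smooth)
plot.ts(diff(iran.counts.smooth, lag=5))
par(mfrow=c(1,1))   # back to a single panel for later plots
### There are two huge spikes in the difference where the number of tweets jumped.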
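
# A quick refresher on what diff() with a lag computes, on made-up numbers
# (demo.signal is hypothetical): element i of the result is x[i + lag] - x[i], so
# the output is lag elements shorter than the input and a large value marks a
# fast rise.
demo.signal <- c(1, 1, 2, 4, 9, 10, 10)
demo.velocity <- diff(demo.signal, lag=2)
demo.velocity
which.max(demo.velocity)  # start position of the steepest rise over 2 steps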


# I’d like to know why all these tweets started increasing.  Can we figure out what
# time the tweets started increasing using your results from diff, i.e. where is the
# biggest jump in tweets? (hint:  there are lots of ways to do this, many of which
# require you to remove the NAs created by SMA)  Pull 20 or so tweets from
# around that time and write down why you think they’re increasing based on what
# people are saying.

### SMA() leaves NAs at the start of the series, so I look for the first
### non-NA value and start from there

iran.counts.smooth
which(!is.na(iran.counts.smooth))[1]
#[1] 10
good.iran.counts = diff(iran.counts.smooth[10:length(iran.counts.smooth)],lag=5)
max(good.iran.counts)
# [1] 10.9
max.iran.tweets.smooth = which(good.iran.counts == max(good.iran.counts))
max.iran.tweets.smooth
# [1] 29
# 29th break
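
# A hedged sketch of an alternative that avoids hard-coding the start index: drop
# the NAs with is.na(), then map the position of the biggest jump back to the
# original bin numbers. All demo.* names and values are made up for illustration.
demo.smooth <- c(NA, NA, 1.0, 1.5, 2.0, 6.0, 6.2)
demo.lag <- 2
demo.offset <- sum(is.na(demo.smooth))         # NAs dropped (all leading here)
demo.valid <- demo.smooth[!is.na(demo.smooth)]
demo.jump <- which.max(diff(demo.valid, lag=demo.lag))
demo.start.bin <- demo.jump + demo.offset      # original bin where the jump starts
demo.end.bin <- demo.start.bin + demo.lag      # original bin where the jump ends
c(demo.start.bin, demo.end.bin)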

# remove the NAs from iran.counts.smooth so that the breaks match with good.iran.counts
iran.counts.smooth = iran.counts.smooth[10:length(iran.counts.smooth)]

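# now take the tweets inside the 29th break
# (caveat: 29 indexes the NA-trimmed, lag-5 diff series, while iran.hist$breaks
# still starts at the first bin, so this window may sit earlier than the actual jump)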
start.time.of.peak = iran.hist$breaks[max.iran.tweets.smooth]
start.time.of.peak
#[1] 1244885000
end.time.of.peak = iran.hist$breaks[max.iran.tweets.smooth + 1]
end.time.of.peak
#[1] 1244890000

# get the tweets in between (using the peak bounds computed above)
peak.tweets = iran.tweets[iran.tweets$seconds > start.time.of.peak & iran.tweets$seconds < end.time.of.peak,]

dim(peak.tweets)
#[1] 3 4
### looks like I only got 3 tweets in that time frame
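
# A sketch of time-window subsetting on a made-up data frame (demo.tweets and its
# values are hypothetical). By default hist() uses right-closed bins, (start, end],
# so using > on the left and <= on the right keeps tweets that fall exactly on the
# right edge of the bin.
demo.tweets <- data.frame(seconds=c(100, 150, 200, 250, 300),
                          text=c("a", "b", "c", "d", "e"),
                          stringsAsFactors=FALSE)
demo.start <- 150
demo.end <- 300
demo.window <- demo.tweets[demo.tweets$seconds > demo.start &
                           demo.tweets$seconds <= demo.end, ]
nrow(demo.window)
demo.window$text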

peak.tweets[,"text"]
#[1] "Iran - elections - Ahmadinejad winner after preliminary results http://bit.ly/gP1k"
#[2] "RT: @sadeqn: RT: @iranbaan: ابطحی گفت که کروبی نتیجه انتخابات رو پذیرفته و بیانیه میده در این زمینه"
#[3] "Ahmadinejad 'wins Iran presidential vote' http://bit.ly/1Q7iI"

### I didn't pull out 20 tweets but it looks like the spike happened when
### Ahmadinejad won the Iranian presidential election.

# Expand the window to more breaks (3 before the peak break and 4 after it)
start.time.of.break = iran.hist$breaks[max.iran.tweets.smooth - 3]
end.time.of.break = iran.hist$breaks[max.iran.tweets.smooth + 4]
peak.tweets = iran.tweets[iran.tweets$seconds > start.time.of.break & iran.tweets$seconds < end.time.of.break,]
dim(peak.tweets)
#[1] 14  4

peak.tweets[,"text"]
# [1] "RT @Adam_Ackerman @Thomas_Erdbrink RT United States unreachable by phone from Iran #Iranelections"
# [2] "RT @shahrzadmo: Green wave of Mousavi seemed so big, bcs it was something new and flashy.Not many people involved in it. #iranelection"
# [3] "RT @bbcbreaking: Mahmoud Ahmadinejad has won Iran's presidential election, officials say, but his nearest rival..,http://www.bbc.co.uk/news"
# [4] "Iran - elections - Ahmadinejad winner after preliminary results http://bit.ly/gP1k"
# [5] "RT: @sadeqn: RT: @iranbaan: ابطحی گفت که کروبی نتیجه انتخابات رو پذیرفته و بیانیه میده در این زمینه"
# [6] "Ahmadinejad 'wins Iran presidential vote' http://bit.ly/1Q7iI"
# [7] "RT @cnnbrk: Main challenger in Iran's presidential election calls for counting of ballots to halt due to \"blatant violations.\""
# [8] "http://cliqz.com/de.schlagzeilen/c/20333.html : Mitarbeiter: Mussawi gewinnt Wahl Iran"
# [9] "‘Verkiezingen Iran waren een show’: Buiten zijn eigen aanhang gelooft niemand dat Ahmadinejad zijn monsterzege e.. http://bit.ly/o2ikr"
# [10] "Hell RT @WSJ BREAKING NEWS: Iran says Mahmoud Ahmadinejad is the winner of the election with a landslide 62.63 percent of the vote."
# [11] "RT @erovira JSLeFanu BBC's Peter Simpson: scenes on streets not seen since 1979 Iranian revolution. Dramatic. #IranElection"
# [12] "I can't believe he won! http://tinyurl.com/ney29y . I thought Iran was looking to change."
# [13] "RT @bob_edwards: Axis of evil: Barack Obama, Nancy Pelosi, Harry Reid ... destroying America much faster than Iran or No Korea ever could"
# [14] "Official: Obama Administration Skeptical of Iran's Election Results: U.S. analysts find it \"not credible\" that M.. http://bit.ly/saTUJ"

## Rhythms-of-Twitter.R
# This dataset spans 6/11/09 to 6/15/09 and each row in this dataset is a tweet.
# I'd be curious to know what tweet volume over time looked like and if there were
# any significant trends.  Using what we learned in class, let’s create a histogram
# of tweets using breaks=500.

# Make a plot of the timeseries as a line (plot() with type='l') just so we can see
# our data (recall you can create a timeseries by pulling out the "counts" entry of
# the histogram object).  What do you see?

# hist() draws the histogram and returns the object holding the counts
h = hist(tweets$seconds, breaks=500)
plot(h$counts, type='l')

### I see spikes, up and down. Possibly a cycle
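
# A small sketch of what the histogram object holds, on made-up values with
# explicit breaks (demo.* names are hypothetical). counts has one entry per bin and
# breaks has one more entry than counts, since it stores every bin edge.
demo.values <- c(1, 2, 2, 3, 3, 3, 9)
demo.h <- hist(demo.values, breaks=c(0, 2, 4, 6, 8, 10), plot=FALSE)
demo.h$counts
demo.h$breaks
# Note that a single number such as breaks=500 is only a suggestion to hist(), so
# the actual number of bins can differ; length(h$counts) shows what was used.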

# Let’s figure out if there are any cycles / seasonal trends in this data.  Use the
# acf() function to identify cycles in the tweet frequency, just as we did with the
# NYPD data.  Recall that acf() only looks at a small time frame so you’ll want to
# pass it a lag.max argument that’s about 200 or more.  Where is it most likely that
# we have a cycle and how can you tell?

acf(h$counts, lag.max=200)
### looks like there could be a cycle around lag 175-180ish.
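
# A hedged sketch of reading the cycle length off acf() numerically instead of by
# eye, on a synthetic series with a known period of 20 (all demo.* names are made
# up). Small lags always correlate strongly, so the search starts at lag 10.
demo.series <- sin(2 * pi * (1:200) / 20)
demo.acf <- as.numeric(acf(demo.series, lag.max=40, plot=FALSE)$acf)  # lags 0..40
demo.lags <- 10:40
demo.cycle <- demo.lags[which.max(demo.acf[demo.lags + 1])]  # +1: entry 1 is lag 0
demo.cycle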

# OK, let’s remove the cycles and analyze this data. Create an official timeseries
# object with frequency equal to the cycle length.  Use decompose() to decompose the
# timeseries into its components and plot the results. What do you see in terms of
# an overall trend?

tweets.over.time = ts(h$counts, frequency = 180)
parts = decompose(tweets.over.time)
plot(parts)
### It seems there's an overall trend of the number of tweets dropping for the
### first 3 days and then coming back up on the fourth day.
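
# A sketch of the same ts() + decompose() steps on synthetic data with a known
# cycle of 20 points and a steady upward trend (demo.* names are made up).
# decompose() needs at least two full cycles of data, and its trend is a centered
# moving average, so the trend is NA for about half a cycle at each end.
demo.t <- 1:200
demo.ts <- ts(0.05 * demo.t + sin(2 * pi * demo.t / 20), frequency=20)
demo.parts <- decompose(demo.ts)
length(demo.parts$figure)            # one seasonal value per position in the cycle
range(demo.parts$trend, na.rm=TRUE)  # the trend rises once the cycle is removed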