@michelleboisson
Created October 23, 2012 03:05
Data Without Borders - Assignment 6 (really 4)
# Let’s break down our tweets around a certain topic. How about, oh, say, Iran?
# So how do we pull tweets out that have a certain word in them?
# grep() to the rescue! If you’ve used the grep function on the command-line, this
# should look familiar. grep() takes as arguments a phrase you’re searching for, a
# set of text to look through, and optional arguments about how to search. It will
# then return the row numbers of any rows that match your search. To pull out Iran
# tweets, we can use the code:
iran.tweets <- tweets[grep("iran", tweets$text, ignore.case=TRUE), ]
# this data set includes words like "zerugiran" and "Miranda,"
# how do we take these out?
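# One way to do that, sketched on made-up strings (demo.text and demo.hits are
# hypothetical names, not part of the assignment data): wrap the pattern in \\b word
# boundaries so "iran" only matches as a whole word. The trade-off is that hashtags
# such as "#iranelection" no longer match either.
demo.text <- c("Iran election results", "Miranda says hi", "zerugiran", "news from #iran")
demo.hits <- grep("\\biran\\b", demo.text, ignore.case=TRUE)
demo.text[demo.hits]   # keeps the 1st and 4th strings only
# On the real data this would presumably become:
# iran.tweets <- tweets[grep("\\biran\\b", tweets$text, ignore.case=TRUE), ]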
# Plot the time series for iran.tweets using a histogram with breaks=100. Add red
# vertical lines to the plot at the 3 largest peaks using abline().
# hist() draws the histogram and also returns the bin breaks and counts.
iran.hist = hist(iran.tweets$seconds, breaks=100)
plot(iran.hist$counts, type='l')
# The x-axis of this plot is the bin index, so the red lines go at the
# indices of the 3 largest counts, not at the count values themselves.
rev(sort(iran.hist$counts))[1:3]
top3 = order(iran.hist$counts, decreasing=TRUE)[1:3]
abline(v=top3, col="red")
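# The largest counts can sit next to each other on the same spike. A sketch of one way to
# pick separate peaks, on made-up counts (demo.* names are hypothetical): a bin is a local
# peak when the series rises into it and falls after it. Flat-topped peaks are ignored.
demo.counts <- c(1, 5, 2, 8, 3, 9, 4)
demo.peaks <- which(diff(sign(diff(demo.counts))) == -2) + 1   # bins 2, 4 and 6
demo.top3 <- demo.peaks[order(demo.counts[demo.peaks], decreasing=TRUE)][1:3]
demo.top3   # peak bins ordered from tallest to shortest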
# There’s not a lot of seasonality in this plot, so let’s go straight to analyzing
# the trend. Use SMA() with the default settings to smooth the signal and plot it.
library(TTR)
iran.counts = iran.hist$counts
iran.counts.smooth = SMA(iran.counts)
plot.ts(iran.counts.smooth, type='l')
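# For reference, SMA() defaults to a window of n=10, so each smoothed value is the mean
# of the current bin and the 9 before it, and the first 9 values are NA. A base-R sketch
# of the same idea on made-up data (demo.* names are hypothetical; stats::filter is a
# stand-in here, not what TTR uses internally):
demo.x <- 1:12
demo.sma <- as.numeric(stats::filter(demo.x, rep(1/10, 10), sides=1))
demo.sma                 # 9 NAs, then the trailing 10-bin means
sum(is.na(demo.sma))     # number of leading NAs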
# Let’s build a basic event detection algorithm, but let’s not use the total number of
# tweets, since that misses the "velocity" of the signal. Use the diff() function with a
# lag of 5 to look at the differences in tweet volume over time on the smoothed
# signal (use ?diff if you need a refresher). Create a figure with two graphs – one
# with the smoothed signal above and one with the diff() of the signal below it.
# What do you see?
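# Stack the smoothed signal (top) and its lag-5 difference (bottom) in one figure.
par(mfrow=c(2,1))
plot.ts(iran.counts.smooth, type='l')
plot.ts(diff(iran.counts.smooth, lag=5), type='l')
par(mfrow=c(1,1))
### There are two huge spikes in the difference where the number of tweets jumped.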
# I’d like to know why all these tweets started increasing. Can we figure out what
# time the tweets started increasing using your results from diff, i.e. where is the
# biggest jump in tweets? (hint: there are lots of ways to do this, many of which
# require you to remove the NAs created by SMA) Pull 20 or so tweets from
# around that time and write down why you think they’re increasing based on what
# people are saying.
### I'm not sure how to get rid of the NAs so I'm cheating
iran.counts.smooth
which(iran.counts.smooth == 1.4)
#[1] 10
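# A less manual option, sketched on a made-up vector (demo.* names are hypothetical):
# is.na() flags the missing values, so they can be dropped without hunting for the
# first real number by eye. Comparing decimals with == (as in == 1.4 above) can also
# fail silently because of floating-point rounding.
demo.sma.na <- c(NA, NA, NA, 1.4, 2.0, 3.1)
demo.clean <- demo.sma.na[!is.na(demo.sma.na)]   # same values, NAs removed
demo.dropped <- sum(is.na(demo.sma.na))          # how many positions were dropped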
good.iran.counts = diff(iran.counts.smooth[10:length(iran.counts.smooth)],lag=5)
max(good.iran.counts)
# [1] 10.9
max.iran.tweets.smooth = which(good.iran.counts == max(good.iran.counts))
max.iran.tweets.smooth
# [1] 29
# 29th break
# remove the NAs from iran.counts.smooth so that the breaks match with good.iran.counts
iran.counts.smooth = iran.counts.smooth[10:length(iran.counts.smooth)]
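# A caveat, sketched on made-up numbers (demo.* names are hypothetical): trimming the
# smoothed series does not trim iran.hist$breaks, so an index into the trimmed,
# differenced series may need shifting before it is used on the breaks. Assuming 9
# leading NAs and lag=5, position i in the diff compares original bins i+9 and i+14.
demo.smooth <- c(rep(NA, 9), 1, 1, 1, 1, 1, 1, 2, 9, 9, 9, 9, 9)
demo.first <- which(!is.na(demo.smooth))[1]        # first bin with a smoothed value
demo.diff <- diff(demo.smooth[demo.first:length(demo.smooth)], lag=5)
demo.i <- which.max(demo.diff)                     # position in the trimmed diff
demo.bin.start <- demo.i + demo.first - 1          # original bin where the jump starts
demo.bin.end <- demo.bin.start + 5                 # original bin where the jump ends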
# now take the tweets inside the 29th break of iran.counts.smooth
start.time.of.peak = iran.hist$breaks[max.iran.tweets.smooth]
start.time.of.peak
#[1] 1244885000
end.time.of.peak = iran.hist$breaks[max.iran.tweets.smooth + 1]
end.time.of.peak
#[1] 1244890000
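# To read these as dates, assuming (not confirmed here) that the seconds column holds
# Unix epoch seconds, as.POSIXct() can convert them. demo.start is a hypothetical name;
# the value is copied from the output above.
demo.start <- as.POSIXct(1244885000, origin="1970-01-01", tz="UTC")
format(demo.start, "%Y-%m-%d %H:%M:%S")
# Under that assumption the peak bin starts at 2009-06-13 09:23:20 UTC.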
#get the tweets in between
peak.tweets = iran.tweets[iran.tweets$seconds > start.time.of.peak & iran.tweets$seconds < end.time.of.peak,]
dim(peak.tweets)
#[1] 3 4
### looks like I only got 3 tweets in that time frame
peak.tweets[,"text"]
#[1] "Iran - elections - Ahmadinejad winner after preliminary results http://bit.ly/gP1k"
#[2] "RT: @sadeqn: RT: @iranbaan: ابطحی گفت که کروبی نتیجه انتخابات رو پذیرفته و بیانیه میده در این زمینه"
#[3] "Ahmadinejad 'wins Iran presidential vote' http://bit.ly/1Q7iI"
### I didn't pull out 20 tweets, but it looks like the spike happened when
### Ahmadinejad won the Iranian presidential election.
# I expand to more breaks...
start.time.of.break = iran.hist$breaks[max.iran.tweets.smooth -3 ]
end.time.of.break = iran.hist$breaks[max.iran.tweets.smooth + 4]
peak.tweets = iran.tweets[iran.tweets$seconds > start.time.of.break & iran.tweets$seconds < end.time.of.break,]
dim(peak.tweets)
#[1] 14 4
peak.tweets[,"text"]
# [1] "RT @Adam_Ackerman @Thomas_Erdbrink RT United States unreachable by phone from Iran #Iranelections"
# [2] "RT @shahrzadmo: Green wave of Mousavi seemed so big, bcs it was something new and flashy.Not many people involved in it. #iranelection"
# [3] "RT @bbcbreaking: Mahmoud Ahmadinejad has won Iran's presidential election, officials say, but his nearest rival..,http://www.bbc.co.uk/news"
# [4] "Iran - elections - Ahmadinejad winner after preliminary results http://bit.ly/gP1k"
# [5] "RT: @sadeqn: RT: @iranbaan: ابطحی گفت که کروبی نتیجه انتخابات رو پذیرفته و بیانیه میده در این زمینه"
# [6] "Ahmadinejad 'wins Iran presidential vote' http://bit.ly/1Q7iI"
# [7] "RT @cnnbrk: Main challenger in Iran's presidential election calls for counting of ballots to halt due to \"blatant violations.\""
# [8] "http://cliqz.com/de.schlagzeilen/c/20333.html : Mitarbeiter: Mussawi gewinnt Wahl Iran"
# [9] "‘Verkiezingen Iran waren een show’: Buiten zijn eigen aanhang gelooft niemand dat Ahmadinejad zijn monsterzege e.. http://bit.ly/o2ikr"
# [10] "Hell RT @WSJ BREAKING NEWS: Iran says Mahmoud Ahmadinejad is the winner of the election with a landslide 62.63 percent of the vote."
# [11] "RT @erovira JSLeFanu BBC's Peter Simpson: scenes on streets not seen since 1979 Iranian revolution. Dramatic. #IranElection"
# [12] "I can't believe he won! http://tinyurl.com/ney29y . I thought Iran was looking to change."
# [13] "RT @bob_edwards: Axis of evil: Barack Obama, Nancy Pelosi, Harry Reid ... destroying America much faster than Iran or No Korea ever could"
# [14] "Official: Obama Administration Skeptical of Iran's Election Results: U.S. analysts find it \"not credible\" that M.. http://bit.ly/saTUJ"
# This dataset spans 6/11/09 to 6/15/09 and each row in this dataset is a tweet.
# I'd be curious to know what tweet volume over time looked like and if there were
# any significant trends. Using what we learned in class, let’s create a histogram
# of tweets using breaks=500.
# Make a plot of the timeseries as a line (plot() with type='l') just so we can see
# our data (recall you can create a timeseries by pulling out the "counts" entry of
# the histogram object). What do you see?
hist(tweets$seconds, breaks=500)
h = hist(tweets$seconds, breaks=500)
plot(h$counts, type='l')
### I see spikes, up and down. Possibly a cycle
# Let’s figure out if there are any cycles / seasonal trends in this data. Use the
# acf() function to identify cycles in the tweet frequency, just as we did with the
# NYPD data. Recall that acf() only looks at a small time frame so you’ll want to
# pass it a lag.max argument that’s about 200 or more. Where is it most likely that
# we have a cycle and how can you tell?
acf(h$counts, lag.max=200)
### looks like there could be a cycle around lag 175-180ish.
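# The lag can also be read off numerically instead of by eye. A sketch on a made-up
# sine wave with a known period of 20 (demo.* names are hypothetical): take the
# autocorrelations without plotting and look for the largest one past the short lags.
demo.series <- sin(2 * pi * (1:200) / 20)
demo.acf <- as.numeric(acf(demo.series, lag.max=50, plot=FALSE)$acf)  # element 1 is lag 0
demo.lags <- 10:50                        # skip the short lags, which are always high
demo.cycle <- demo.lags[which.max(demo.acf[demo.lags + 1])]
demo.cycle                                # the lag with the strongest repeat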
# OK, let’s remove the cycles and analyze this data. Create an official timeseries
# object with frequency equal to the cycle length. Use decompose() to decompose the
# timeseries into its components and plot the results. What do you see in terms of
# an overall trend?
tweets.over.time = ts(h$counts, frequency = 180)
parts = decompose(tweets.over.time)
plot(parts)
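# decompose() returns a list, so the pieces can be inspected one at a time. A sketch on a
# made-up series (demo.* names are hypothetical): a straight-line trend plus a cycle of
# length 20. Note that decompose() needs at least two full cycles of data.
demo.t <- 1:100
demo.ts <- ts(0.1 * demo.t + sin(2 * pi * demo.t / 20), frequency=20)
demo.parts <- decompose(demo.ts)
names(demo.parts)            # includes "seasonal", "trend" and "random"
plot(demo.parts$trend)       # the trend alone; its ends are NA because of the moving average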
### it seems there's an overall trend of the number of tweets dropping for the
### first 3 days and then coming back up on the fourth day.