Created
October 23, 2012 03:05
Data Without Borders - Assignment 6 (really 4)
# Let’s break down our tweets around a certain topic. How about, oh, say, Iran?
# So how do we pull tweets out that have a certain word in them?
# grep() to the rescue! If you’ve used the grep function on the command-line, this
# should look familiar. grep() takes as arguments a phrase you’re searching for, a
# set of text to look through, and optional arguments about how to search. It will
# then return the row numbers of any rows that match your search. To pull out Iran
# tweets, we can use the code:
iran.tweets <- tweets[grep("iran", tweets$text, ignore.case=TRUE), ]
# this data set includes words like "zerugiran" and "Miranda,"
# how do we take these out?
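# One possible answer to the question above, as a sketch: anchor the pattern
# with the regex word boundary \b so that "iran" only matches at the start of a
# word. The strings are made-up stand-ins for tweets$text, and the choice between
# the two patterns is an assumption about what should count as an Iran tweet.

```r
# Toy stand-in for tweets$text (made-up strings, not taken from the data set).
text <- c("Iran election results",
          "Miranda is here",
          "zerugiran",
          "scenes in the streets #IranElection")

# Plain substring match: also hits "Miranda" and "zerugiran".
loose <- grep("iran", text, ignore.case = TRUE)

# Boundary on both sides: only the standalone word "Iran".
strict <- grep("\\biran\\b", text, ignore.case = TRUE, perl = TRUE)

# Boundary on the left only: keeps "Iranian" and "#IranElection" style words.
prefix <- grep("\\biran", text, ignore.case = TRUE, perl = TRUE)

loose    # 1 2 3 4
strict   # 1
prefix   # 1 4
```

# The left-only boundary is probably the better fit here, since hashtags such as
# #IranElection would be dropped by the stricter pattern.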
# Plot the time series for iran.tweets using a histogram with breaks=100. Add red
# vertical lines to the plot at the 3 largest peaks using abline().
hist(iran.tweets$seconds, breaks=100)
iran.hist = hist(iran.tweets$seconds, breaks=100)
plot(iran.hist$counts, type='l')
# abline(v=...) needs x positions (bin indices), not the counts themselves
abline(v=which.max(iran.hist$counts), col="red")
rev(sort(iran.hist$counts))[1:3]
# indices of the 3 bins with the highest counts
top3 = order(iran.hist$counts, decreasing=TRUE)[1:3]
abline(v=top3, col="red")
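# The three tallest bins can sit side by side on one spike. If "3 largest peaks"
# is read as three separate local maxima, a sketch like this finds them. The
# counts are made up, and the strict greater-than rule for a peak is an assumption.

```r
# Made-up bin counts standing in for iran.hist$counts.
counts <- c(1, 3, 2, 5, 9, 4, 6, 2, 8, 3)

# A bin is a local peak if it is higher than both of its neighbours.
n <- length(counts)
i <- 2:(n - 1)
peaks <- i[counts[i] > counts[i - 1] & counts[i] > counts[i + 1]]

# Keep the three highest local peaks, tallest first.
top3.peaks <- peaks[order(counts[peaks], decreasing = TRUE)][1:3]
top3.peaks   # 5 9 7

plot(counts, type = "l")
abline(v = top3.peaks, col = "red")
```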
# There’s not a lot of seasonality in this plot, so let’s go straight to analyzing
# the trend. Use SMA() with the default settings to smooth the signal and plot it.
library(TTR)
iran.counts = iran.hist$counts
iran.counts.smooth = SMA(iran.counts)
plot.ts(iran.counts.smooth, type='l')
# Let’s build a basic event detection algorithm, but let’s not use the total number of
# tweets, since that misses the "velocity" of the signal. Use the diff() function with a
# lag of 5 to look at the differences in tweet volume over time on the smoothed
# signal (use ?diff if you need a refresher). Create a figure with two graphs – one
# with the smoothed signal above and one with the diff() of the signal below it.
# What do you see?
plot.ts(diff(iran.counts.smooth, lag=5), type='l')
### There are two huge spikes in the difference where the number of tweets jumped.
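# The question asks for both graphs in one figure. A minimal sketch of that
# layout with par(mfrow=...), using a made-up series in place of
# iran.counts.smooth (the name smooth.signal is hypothetical):

```r
# Synthetic stand-in for the smoothed counts: flat, a sharp rise, a slow decay.
smooth.signal <- c(rep(2, 20),
                   seq(2, 12, length.out = 6),
                   seq(12, 4, length.out = 24))

# velocity[i] = smooth.signal[i + 5] - smooth.signal[i]
velocity <- diff(smooth.signal, lag = 5)

old.par <- par(mfrow = c(2, 1))   # 2 rows, 1 column: signal above, diff below
plot.ts(smooth.signal, main = "smoothed signal")
plot.ts(velocity, main = "diff(lag = 5)")
par(old.par)                      # restore the previous layout

which.max(velocity)   # 21: the largest 5-step rise starts at position 21
```

# The diff is 5 values shorter than the signal, so the two x axes are offset by
# up to 5 positions when the panels are compared by eye.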
# I’d like to know why all these tweets started increasing. Can we figure out what
# time the tweets started increasing using your results from diff, i.e. where is the
# biggest jump in tweets? (hint: there are lots of ways to do this, many of which
# require you to remove the NAs created by SMA) Pull 20 or so tweets from
# around that time and write down why you think they’re increasing based on what
# people are saying.
### I'm not sure how to get rid of the NAs so I'm cheating
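# A sketch of one way to drop the leading NAs without hard-coding an index.
# Assumptions: the counts are made up, and a base R trailing mean
# (stats::filter) stands in for SMA() so the example runs without TTR. Like
# SMA(x, n=10), it leaves NA in the first 9 positions.

```r
# Made-up bin counts standing in for iran.hist$counts.
counts <- c(0, 1, 0, 2, 1, 3, 2, 1, 2, 2, 5, 9, 14, 20, 18, 12, 8, 5, 3, 2)

# Trailing 10-point mean; positions 1..9 have no full window and come back NA.
smooth <- as.numeric(stats::filter(counts, rep(1 / 10, 10), sides = 1))

first.ok  <- which(!is.na(smooth))[1]   # index of the first non-NA value
smooth.ok <- smooth[!is.na(smooth)]     # same series without the leading NAs
velocity  <- diff(smooth.ok, lag = 5)

# Map the biggest jump back to positions in the original, untrimmed series:
# velocity[i] compares trimmed positions i and i + 5, and trimming removed
# first.ok - 1 leading values.
jump.start <- which.max(velocity) + first.ok - 1
jump.end   <- jump.start + 5
c(jump.start, jump.end)
```

# Keeping first.ok around makes it possible to index untrimmed objects, such as
# the histogram breaks, with positions found in the trimmed series.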
# SMA() defaults to n=10, so the first 9 values are NA; find the first real value by eye
iran.counts.smooth
which(iran.counts.smooth == 1.4)
#[1] 10
good.iran.counts = diff(iran.counts.smooth[10:length(iran.counts.smooth)],lag=5)
max(good.iran.counts)
# [1] 10.9
max.iran.tweets.smooth = which(good.iran.counts == max(good.iran.counts))
max.iran.tweets.smooth
# [1] 29
# 29th break
# remove the NAs from iran.counts.smooth so that the breaks match with good.iran.counts
iran.counts.smooth = iran.counts.smooth[10:length(iran.counts.smooth)]
# now take the tweets inside the 29th break of iran.counts.smooth
# Note: iran.hist$breaks was never trimmed, so breaks[29] is the left edge of
# histogram bin 29; element 29 of the trimmed series corresponds to bin 29 + 9 = 38.
start.time.of.peak = iran.hist$breaks[max.iran.tweets.smooth]
start.time.of.peak
#[1] 1244885000
end.time.of.peak = iran.hist$breaks[max.iran.tweets.smooth + 1]
end.time.of.peak
#[1] 1244890000
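# To read those two numbers as clock times, a short sketch. It assumes the
# seconds column holds Unix epoch seconds (seconds since 1970-01-01 UTC), which
# the script does not state; the variable names are hypothetical.

```r
# The two break values printed above, assumed to be Unix epoch seconds.
peak.start.secs <- 1244885000
peak.end.secs   <- 1244890000

peak.window <- as.POSIXct(c(peak.start.secs, peak.end.secs),
                          origin = "1970-01-01", tz = "UTC")
format(peak.window, "%Y-%m-%d %H:%M:%S")
# [1] "2009-06-13 09:23:20" "2009-06-13 10:46:40"

# Width of one histogram bin in seconds.
peak.end.secs - peak.start.secs
# [1] 5000
```

# Under that assumption the bin falls on the morning of June 13, 2009 (UTC).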
# get the tweets in between
peak.tweets = iran.tweets[iran.tweets$seconds > start.time.of.peak & iran.tweets$seconds < end.time.of.peak,]
dim(peak.tweets)
#[1] 3 4
### looks like I only got 3 tweets in that time frame
peak.tweets[,"text"]
#[1] "Iran - elections - Ahmadinejad winner after preliminary results http://bit.ly/gP1k"
#[2] "RT: @sadeqn: RT: @iranbaan: ابطحی گفت که کروبی نتیجه انتخابات رو پذیرفته و بیانیه میده در این زمینه"
#[3] "Ahmadinejad 'wins Iran presidential vote' http://bit.ly/1Q7iI"
### I didn't pull out 20 tweets but it looks like the spike happened when
### Ahmadinejad won the Iranian presidential election.
# I expand to more breaks...
start.time.of.break = iran.hist$breaks[max.iran.tweets.smooth - 3]
end.time.of.break = iran.hist$breaks[max.iran.tweets.smooth + 4]
peak.tweets = iran.tweets[iran.tweets$seconds > start.time.of.break & iran.tweets$seconds < end.time.of.break,]
dim(peak.tweets)
#[1] 14 4
peak.tweets[,"text"]
# [1] "RT @Adam_Ackerman @Thomas_Erdbrink RT United States unreachable by phone from Iran #Iranelections"
# [2] "RT @shahrzadmo: Green wave of Mousavi seemed so big, bcs it was something new and flashy.Not many people involved in it. #iranelection"
# [3] "RT @bbcbreaking: Mahmoud Ahmadinejad has won Iran's presidential election, officials say, but his nearest rival..,http://www.bbc.co.uk/news"
# [4] "Iran - elections - Ahmadinejad winner after preliminary results http://bit.ly/gP1k"
# [5] "RT: @sadeqn: RT: @iranbaan: ابطحی گفت که کروبی نتیجه انتخابات رو پذیرفته و بیانیه میده در این زمینه"
# [6] "Ahmadinejad 'wins Iran presidential vote' http://bit.ly/1Q7iI"
# [7] "RT @cnnbrk: Main challenger in Iran's presidential election calls for counting of ballots to halt due to \"blatant violations.\""
# [8] "http://cliqz.com/de.schlagzeilen/c/20333.html : Mitarbeiter: Mussawi gewinnt Wahl Iran"
# [9] "‘Verkiezingen Iran waren een show’: Buiten zijn eigen aanhang gelooft niemand dat Ahmadinejad zijn monsterzege e.. http://bit.ly/o2ikr"
# [10] "Hell RT @WSJ BREAKING NEWS: Iran says Mahmoud Ahmadinejad is the winner of the election with a landslide 62.63 percent of the vote."
# [11] "RT @erovira JSLeFanu BBC's Peter Simpson: scenes on streets not seen since 1979 Iranian revolution. Dramatic. #IranElection"
# [12] "I can't believe he won! http://tinyurl.com/ney29y . I thought Iran was looking to change."
# [13] "RT @bob_edwards: Axis of evil: Barack Obama, Nancy Pelosi, Harry Reid ... destroying America much faster than Iran or No Korea ever could"
# [14] "Official: Obama Administration Skeptical of Iran's Election Results: U.S. analysts find it \"not credible\" that M.. http://bit.ly/saTUJ"
# This dataset spans 6/11/09 to 6/15/09 and each row in this dataset is a tweet.
# I'd be curious to know what tweet volume over time looked like and if there were
# any significant trends. Using what we learned in class, let’s create a histogram
# of tweets using breaks=500.
# Make a plot of the timeseries as a line (plot() with type='l') just so we can see
# our data (recall you can create a timeseries by pulling out the "counts" entry of
# the histogram object). What do you see?
hist(tweets$seconds, breaks=500)
h = hist(tweets$seconds, breaks=500)
plot(h$counts, type='l')
### I see spikes, up and down. Possibly a cycle
# Let’s figure out if there are any cycles / seasonal trends in this data. Use the
# acf() function to identify cycles in the tweet frequency, just as we did with the
# NYPD data. Recall that acf() only looks at a small time frame so you’ll want to
# pass it a lag.max argument that’s about 200 or more. Where is it most likely that
# we have a cycle and how can you tell?
acf(h$counts, lag.max=200)
### looks like there could be a cycle around lag 175-180ish.
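# Reading the lag off the plot works, but the peak can also be pulled out of the
# acf object. A sketch on made-up counts with a known 50-step cycle; the min.lag
# cutoff is a hypothetical choice for this toy series, not a general rule.

```r
set.seed(1)
# Synthetic counts: a 50-step cycle plus a little noise (not the tweet data).
n <- 500
counts <- 100 + 20 * sin(2 * pi * (1:n) / 50) + rnorm(n, sd = 1)

a <- acf(counts, lag.max = 200, plot = FALSE)
lags <- as.vector(a$lag)   # 0, 1, ..., 200 for a plain numeric vector
vals <- as.vector(a$acf)

# Skip the short lags, where neighbouring bins are trivially similar, then take
# the lag with the highest autocorrelation as the candidate cycle length.
min.lag <- 25
keep <- lags >= min.lag
cycle.length <- lags[keep][which.max(vals[keep])]
cycle.length
```

# On the real data the same lines would use h$counts in place of counts, with a
# cutoff picked after looking at the acf plot.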
# OK, let’s remove the cycles and analyze this data. Create an official timeseries
# object with frequency equal to the cycle length. Use decompose() to decompose the
# timeseries into its components and plot the results. What do you see in terms of
# an overall trend?
tweets.over.time = ts(h$counts, frequency = 180)
parts = decompose(tweets.over.time)
plot(parts)
### it seems there's an overall trend of the number of tweets dropping for the
### first 3 days and then coming back up on the fourth day.
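# A small sketch of what decompose() returns, on a made-up series with a known
# 50-step cycle and a slow downward drift (not the tweet data). It shows why the
# trend panel is blank at both ends: the centered moving average needs half a
# cycle of data on each side.

```r
# Made-up series: downward drift plus a repeating 50-step cycle.
n <- 200
x <- 100 - 0.1 * (1:n) + 10 * sin(2 * pi * (1:n) / 50)

series <- ts(x, frequency = 50)   # frequency = cycle length in bins
parts <- decompose(series)

length(parts$figure)      # 50: one seasonal value per position in the cycle
sum(is.na(parts$trend))   # 50: 25 missing values at each end of the trend
plot(parts)
```

# With frequency = 180, the same rule leaves 90 bins without a trend estimate at
# each end of the series, which is worth keeping in mind when reading the first
# and last day off the trend panel.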