Denver debate analysis I

tm_example.R
rm(list = ls())
doInstall <- TRUE # Change to FALSE if you don't want packages installed.
toInstall <- c("zoo", "tm", "ggplot2", "SnowballC") # SnowballC supersedes the archived Snowball package
if(doInstall){install.packages(toInstall, repos = "http://cran.r-project.org")}
lapply(toInstall, library, character.only = TRUE)
 
# From: http://www.cnn.com/2012/10/03/politics/debate-transcript/index.html
Transcript <- readLines("https://raw.github.com/dsparks/Test_image/master/Denver_Debate_Transcript.txt")
head(Transcript, 20)
 
Transcript <- data.frame(Words = Transcript, Speaker = NA, stringsAsFactors = FALSE)
Transcript$Speaker[regexpr("LEHRER: ", Transcript$Words) != -1] <- 1
Transcript$Speaker[regexpr("OBAMA: ", Transcript$Words) != -1] <- 2
Transcript$Speaker[regexpr("ROMNEY: ", Transcript$Words) != -1] <- 3
table(Transcript$Speaker)
Transcript$Speaker <- na.locf(Transcript$Speaker)
 
# Remove moderator:
Transcript <- Transcript[Transcript$Speaker != 1, ]
 
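# One document per transcript line. Recent tm versions require doc_id/text columns for DataframeSource.
myCorpus <- Corpus(VectorSource(Transcript$Words))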
inspect(myCorpus)
 
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) # Make lowercase
myCorpus <- tm_map(myCorpus, removePunctuation, preserve_intra_word_dashes = FALSE)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) # Remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, c("lehrer", "obama", "romney"))
myCorpus <- tm_map(myCorpus, stemDocument) # Stem words
 
inspect(myCorpus)
docTermMatrix <- DocumentTermMatrix(myCorpus)
 
docTermMatrix <- as.matrix(docTermMatrix) # Convert to a plain numeric matrix
sort(colSums(docTermMatrix))
table(colSums(docTermMatrix))
 
termCountFrame <- data.frame(Term = colnames(docTermMatrix))
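# drop = FALSE keeps a matrix even if a speaker has only one row, so colSums() still works
termCountFrame$Obama <- colSums(docTermMatrix[Transcript$Speaker == 2, , drop = FALSE])
termCountFrame$Romney <- colSums(docTermMatrix[Transcript$Speaker == 3, , drop = FALSE])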
 
head(termCountFrame)
 
# Plot
zp1 <- ggplot(termCountFrame)
zp1 <- zp1 + geom_text(aes(x = Obama, y = Romney, label = Term))
print(zp1)
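
The speaker-labelling and per-speaker counting steps above can be tried offline on a few made-up lines. The following is a minimal base-R sketch, not part of the original script: the sample lines are invented, `carry_forward` is a hypothetical stand-in for `zoo::na.locf`, and `count_words` is a hypothetical, much cruder substitute for the tm pipeline (no stopword removal or stemming).

```r
# Made-up transcript lines (not taken from the real debate).
lines <- c("LEHRER: Welcome to the debate.",
           "OBAMA: Jobs and taxes matter.",
           "Taxes matter a lot.",
           "ROMNEY: Jobs matter more.",
           "LEHRER: Thank you both.")

# Tag lines that carry a speaker label; continuation lines stay NA.
speaker <- rep(NA_integer_, length(lines))
speaker[grepl("LEHRER: ", lines, fixed = TRUE)] <- 1L
speaker[grepl("OBAMA: ", lines, fixed = TRUE)] <- 2L
speaker[grepl("ROMNEY: ", lines, fixed = TRUE)] <- 3L

# Hypothetical stand-in for zoo::na.locf(): carry the last non-missing
# value forward. Assumes the first element is not NA.
carry_forward <- function(x) {
  idx <- cumsum(!is.na(x))
  x[!is.na(x)][idx]
}
speaker <- carry_forward(speaker)

# Hypothetical tokenizer: strip the speaker label, lowercase, keep letters only.
count_words <- function(txt) {
  txt <- sub("^[[:upper:]]+: ", "", txt)
  words <- unlist(strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +"))
  table(words[words != ""])
}

obama_counts <- count_words(lines[speaker == 2])
romney_counts <- count_words(lines[speaker == 3])
print(obama_counts)
print(romney_counts)
```

The unlabelled third line is attributed to the previous speaker, which is the same idea the script relies on when it calls `na.locf` on `Transcript$Speaker`.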

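Replace line 8 (the readLines call) with:
library(RCurl)
Transcript <- strsplit(getURL("https://raw.github.com/dsparks/Test_image/master/Denver_Debate_Transcript.txt"), "\r?\n")[[1]]
and it will work. Note that getURL returns the whole file as a single string, so it has to be split back into lines.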

I'm having trouble with this, and any help I can get is much appreciated. After line 45, I get "ERROR: 'x' must be an array of at least two dimensions. ERROR: object 'Romney' not found". Going back to >head(termCountFrame), the result is a table with 2 columns, "Term" and "Obama". There is no Romney. Going back to >Transcript$Speaker[regexpr("ROMNEY: ", Transcript$Words) != -1] <- 3, there was no error message, but the result from >table(Transcript$Speaker) is "3" and "1". At the end of the script, >Transcript$Speaker results in "3". This is odd because it's Romney that is missing from termCountFrame, rather than "2" for Obama, which is not missing. Thanks in advance.
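
One possible explanation, offered as a guess rather than a confirmed diagnosis: these symptoms are what you would see if the transcript was read in as a single string, which is what a bare `getURL()` call returns, instead of one element per line. With a one-row data frame, all three speaker assignments hit the same row, so the last one (3) wins and `table()` shows a single "3". Selecting that one row from the term matrix then drops it to a plain vector, which `colSums()` rejects, so the Romney column is never created. A minimal base-R sketch with made-up text (`raw_text` and the small matrix `m` are illustrative, not from the script):

```r
# Made-up stand-in for a file downloaded as one string.
raw_text <- "OBAMA: Jobs matter.\nROMNEY: Taxes matter."

# A single element: every speaker pattern matches the same "line".
n_before <- length(raw_text)

# Splitting on newlines restores one element per transcript line.
transcript_lines <- strsplit(raw_text, "\r?\n")[[1]]
n_after <- length(transcript_lines)

# Selecting one row of a matrix drops it to a vector by default.
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("job", "matter", "tax")))
one_row <- m[c(TRUE, FALSE), ]                # plain vector, no dim attribute
kept <- m[c(TRUE, FALSE), , drop = FALSE]     # still a 1 x 3 matrix

# colSums() fails on the vector but works on the 1-row matrix.
result <- tryCatch(colSums(one_row), error = function(e) conditionMessage(e))
print(result)
print(colSums(kept))
```

If this is the cause, splitting the downloaded text into lines before building the data frame, and adding `drop = FALSE` to the two `colSums()` subsets, should address both errors.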
