@jwinternheimer
Created April 30, 2015 15:03
library(tm);library(SnowballC); library(wordcloud); library(RColorBrewer);library(RWeka)
## Read Text File of Conversations
hs <- read.table("~/PycharmProjects/helpscout-api/hs/hs_text.txt", header=FALSE, stringsAsFactors=FALSE)
names(hs) <- c("text")
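## Clean Text and Move to Data Frame
## (clean.text() is defined at the end of this script; run that definition first)
hs_text <- as.data.frame(clean.text(hs$text), stringsAsFactors=FALSE)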
names(hs_text) <- c("text")
## Convert Text to Corpus and Remove Stop Words
hs_corpus <- Corpus(VectorSource(hs_text$text))
hs_corpus <- tm_map(hs_corpus,removeWords,stopwords("english"))
## Create Wordcloud
pal2 <- brewer.pal(8,"Dark2")
wordcloud(hs_corpus, scale=c(8,.2),min.freq=3,max.words=Inf,
random.order=FALSE, rot.per=.15, colors=pal2)
## Build Term-Document Matrix
hs.tdm <- TermDocumentMatrix(hs_corpus)
## Identify Terms Used at Least 10 Times
findFreqTerms(hs.tdm,lowfreq=10)
## Find Terms That Frequently Co-Occur
findAssocs(hs.tdm,'try',0.4)
## Remove Sparse Terms and Convert to Data Frame
hs2.tdm <- removeSparseTerms(hs.tdm,sparse=0.92)
# as.matrix() returns the full matrix; recent tm versions of inspect() only print a sample
hs2.df <- as.data.frame(as.matrix(hs2.tdm))
## Scale Data and Create Distance Matrix
hs2.df.scale <- scale(hs2.df)
hs2.dis <- dist(hs2.df.scale, method="euclidean")
## Cluster the Data
hs.fit <- hclust(hs2.dis, method="ward.D")
plot(hs.fit,main="Cluster - Helpscout")
## Five Clusters
groups <- cutree(hs.fit,k=5)
rect.hclust(hs.fit,k=5)
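## List the Terms in Each Cluster
## (Illustrative sketch, not part of the original script. terms_by_cluster is a
## hypothetical helper; it assumes the hclust labels are the term names, which
## holds when the distance matrix is built from a term-by-document data frame.)
terms_by_cluster <- function(fit, k) {
  g <- cutree(fit, k = k)   # named integer vector: term -> cluster id
  split(names(g), g)        # list with one character vector of terms per cluster
}
## Usage: terms_by_cluster(hs.fit, k = 5)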
## N-gram Identifier
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))
# VCorpus is used here because recent tm versions ignore custom tokenizers on a SimpleCorpus
helpscout.tdm <- TermDocumentMatrix(VCorpus(VectorSource(hs_text$text)), control=list(tokenize=ngramTokenizer))
inspect(helpscout.tdm[300:340, 1:10])
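## Clean Text Function
clean.text <- function(some_txt) {
  # strip retweet/via markers and @mentions
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
  some_txt = gsub("@\\w+", "", some_txt)
  # strip punctuation and digits
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)
  # strip links (punctuation is already gone, so a URL is a single "http..." token)
  some_txt = gsub("http\\w+", "", some_txt)
  # drop a stand-alone "amp" (left over from "&amp;") without touching words such as "example"
  some_txt = gsub("\\bamp\\b", "", some_txt)
  # collapse runs of spaces/tabs to a single space, then trim both ends
  some_txt = gsub("[ \t]{2,}", " ", some_txt)
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)
  # define "tolower error handling" function: returns NA when tolower() fails
  try.tolower = function(x) {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }
  some_txt = sapply(some_txt, try.tolower)
  # drop entries that are empty after cleaning
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}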
@michael-erasmus

Thanks for sharing! Learned a couple of things reading the code!
