@Xparx
Last active July 5, 2016 20:19
Word cloud creation for SBW 2015 abstract book.
# Extract plain text from the PDF with pdfminer's pdf2txt.py (pip install pdfminer)
pdf2txt.py SBW2015_book.pdf > SBW2015_book.txt
library(tm)
library(wordcloud)
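text_file <- 'SBW2015_book.txt'
# Read the extracted abstract book, one element per line
text <- readLines(text_file, encoding = "UTF-8")
# Called on raw lines, stemDocument() appears to have little or no effect, likely because
# each whole line is treated as a single token; tm_map(myCorpus, stemDocument) on the
# cleaned corpus would be the usual way to stem, if stemming is wanted
text <- stemDocument(text)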
myCorpus <- Corpus(VectorSource(text))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeNumbers)
# Remove stop words before punctuation so apostrophe forms in the SMART list (e.g. "don't") still match
myCorpus <- tm_map(myCorpus, removeWords,
                   c(stopwords("SMART"), "thy", "thou", "thee", "the", "and", "but",
                     "university", "karolinska", "institutet", "institute",
                     "presenter", "poster", "stockholm", "kth", "linköping",
                     "scilifelab", "gothenburg", "uppsala"))
myCorpus <- tm_map(myCorpus, removePunctuation)
# Strip whitespace last, since the removals above leave gaps behind
myCorpus <- tm_map(myCorpus, stripWhitespace)
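# The control option is wordLengths = c(min, max); minWordLength is not recognised by recent tm versions
myDTM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))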
m <- as.matrix(myDTM)
# Term frequencies across the whole book, most frequent first
word_freq <- sort(rowSums(m), decreasing = TRUE)
wordcloud(names(word_freq), word_freq, colors = brewer.pal(8, "Paired"), max.words = 200)