@leobarone
Created May 13, 2016 13:44
Statement by Dilma Rousseff, 12/05/2016
library(tm)
library(SnowballC)
library(wordcloud)
getwd()
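# Read a PDF with tm's readPDF() and return its text as a character vector.
# readPDF() returns a reader function, which is then applied to the file.
pdfToText <- function(arquivo) {
  leitor <- readPDF(control = list(text = "-layout"))
  texto <- leitor(elem = list(uri = arquivo), language = "pt", id = "id1")
  as.character(texto)
}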
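# download.file() needs a URL with a scheme; "http://" is assumed here.
# mode = "wb" keeps the binary PDF intact on Windows.
download.file("http://bibliotecadigital.fgv.br/dspace/bitstream/handle/10438/11683/teseLSB.pdf", "tese_cloud.pdf", mode = "wb")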
texto <- pdfToText("tese_cloud.pdf")
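# Use the same relative directory for creating, writing and reading;
# the original mixed "tese_cloud" with "~/tese_cloud".
dir.create("tese_cloud", showWarnings = FALSE)
writeLines(texto, file.path("tese_cloud", "tese_cloud.txt"))
file.remove("tese_cloud.pdf")
ponteCorpus <- VCorpus(DirSource("tese_cloud"), readerControl = list(language = "por"))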
inspect(ponteCorpus)
# Clean the text; whitespace is collapsed last, because the removal steps
# leave gaps behind.
ponteCorpus <- tm_map(ponteCorpus, content_transformer(tolower))
ponteCorpus <- tm_map(ponteCorpus, removeWords, stopwords("portuguese"))
ponteCorpus <- tm_map(ponteCorpus, removePunctuation)
ponteCorpus <- tm_map(ponteCorpus, removeNumbers)
ponteCorpus <- tm_map(ponteCorpus, stripWhitespace)
as.character(ponteCorpus[[1]])
wordcloud(ponteCorpus, max.words = 100, random.order = FALSE)
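# A minimal sketch, not part of the original gist: list the most frequent
# terms as numbers, to check what the cloud is drawing. The names
# frequenciaTermos and frequencias are hypothetical helpers introduced here.
# Counts may differ slightly from the cloud, since TermDocumentMatrix() by
# default ignores terms shorter than three characters.
library(tm)  # already loaded above; repeated so the snippet runs on its own
frequenciaTermos <- function(corpus, n = 20) {
  tdm <- TermDocumentMatrix(corpus)
  # Sum each term's count over all documents, most frequent first
  frequencias <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(frequencias, n)
}
# Only runs when the corpus built above exists in the session
if (exists("ponteCorpus")) print(frequenciaTermos(ponteCorpus))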