@benmarwick
Created April 12, 2013 07:57
How to extract ngrams from a corpus with R's tm and RWeka packages. From http://tm.r-forge.r-project.org/faq.html
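library("RWeka")  # provides NGramTokenizer and Weka_control
library("tm")     # provides TermDocumentMatrix, inspect and findFreqTerms
data("crude")     # sample corpus of 20 Reuters articles on crude oil, shipped with tm
# Tokenizer that emits only two-word n-grams (bigrams)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
# Show a small slice of the matrix: terms 340 to 345 across the first 10 documents
inspect(tdm[340:345, 1:10])
# Plot correlations among the first 50 bigrams that occur at least twice
# (the plot method for term-document matrices requires the Rgraphviz package)
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 2)[1:50], corThreshold = 0.5)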
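To make concrete what a bigram tokenizer like the one above hands to the term-document matrix, here is a minimal base-R sketch. The helper name `simple_bigrams` is hypothetical and is not part of tm or RWeka; it splits on whitespace only, so it does not reproduce RWeka's delimiter handling.

```r
# Hypothetical helper (not from tm or RWeka): pair each word with its successor.
simple_bigrams <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Fewer than two words means there is no bigram to return
  if (length(words) < 2) return(character(0))
  # Drop the last word for the left side and the first word for the right side
  paste(head(words, -1), tail(words, -1))
}

simple_bigrams("crude oil prices fell")
# "crude oil"   "oil prices"  "prices fell"
```

Each returned string becomes one candidate term (one row) in the term-document matrix.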
@tomkauffman

The Weka_control statement does not work for me.
With docs <- VCorpus(VectorSource(a)), I get the document-term matrices as follows:

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize = "words",
                                                       removePunctuation = TRUE,
                                                       stopwords = stopwords("english"),
                                                       stemming = TRUE))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
                                                      removePunctuation = TRUE,
                                                      stopwords = stopwords("english"),
                                                      stemming = TRUE))

inspect(dtm_unigram)
inspect(dtm_bigram)

I get the correct output from dtm_unigram but not from dtm_bigram.
I have installed RWeka, SnowballC, tm, etc.
When creating dtm_bigram, I get this message:
Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus), :
'i, j' invalid

If I replace min = 2, max = 2 with n = 2, I don't get the error message, but I don't get the right answer either.
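If RWeka or its Java dependency is the obstacle, one possible alternative is a tokenizer built on `ngrams()` and `words()` from the NLP package, which tm depends on. The sketch below uses the `crude` sample corpus rather than the `docs` corpus from the comment, and the names `NLPBigramTokenizer` and `tdm_nlp` are made up for illustration. Whether this avoids the `simple_triplet_matrix` error in this particular setup is an assumption, not something confirmed here.

```r
library("NLP")  # provides ngrams() and words()
library("tm")
data("crude")

# Bigram tokenizer without RWeka: build word pairs and join each with a space
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "), use.names = FALSE)
}

tdm_nlp <- TermDocumentMatrix(crude, control = list(tokenize = NLPBigramTokenizer))
inspect(tdm_nlp[1:5, 1:10])
```

The same function can be passed as the `tokenize` control option to `DocumentTermMatrix` for a corpus created with `VCorpus`.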
