@nonsleepr
Created February 12, 2015 21:58
N-gram tokenizer function, similar to the one in RWeka but without any Java dependencies
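ngrams.tokenizer <- function(x, n = 2) {
  # Strip leading and trailing whitespace
  trim <- function(x) gsub("(^\\s+|\\s+$)", "", x)
  # Split the string into terms on runs of whitespace
  terms <- strsplit(trim(x), split = "\\s+")[[1]]
  # Start from an empty character vector so the return type is consistent
  ngrams <- character(0)
  if (length(terms) >= n) {
    # Slide a window of n terms across the input
    for (i in n:length(terms)) {
      ngram <- paste(terms[(i - n + 1):i], collapse = " ")
      ngrams <- c(ngrams, ngram)
    }
  }
  ngrams
}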
ngrams.tokenizer(" this is a sentence to be ngrammized", 3)
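The loop above grows the result one element at a time. As a rough sketch of an alternative, the same sliding window can be written with vapply over the window start positions. The name ngrams_vectorized is hypothetical and not part of the gist; the sketch assumes n >= 1, a single input string, and R 3.2.0 or later for trimws.

```r
# Hypothetical vectorized variant (not from the original gist).
# Assumes n >= 1 and that x is a single character string.
ngrams_vectorized <- function(x, n = 2) {
  # trimws() needs R >= 3.2.0; it strips leading and trailing whitespace
  terms <- strsplit(trimws(x), "\\s+")[[1]]
  # Too few terms to form even one n-gram
  if (length(terms) < n) return(character(0))
  # Start index of every window of n consecutive terms
  starts <- seq_len(length(terms) - n + 1)
  vapply(
    starts,
    function(i) paste(terms[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

ngrams_vectorized(" this is a sentence to be ngrammized", 3)
```

With seven terms and n = 3, this should return five trigrams, from "this is a" through "to be ngrammized".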