@amir-rahnama
Last active February 21, 2019 18:54
Create n-grams for large text files (very fast)
source("fast-ngrams.R")
con <- file("path_to_file", "r")
data <- readLines(con, encoding = 'UTF-8')
close(con)
data <- clean(data)
onegram <- text_to_ngrams(decode(data), 1)
bigram <- text_to_ngrams(decode(data), 2)
trigram <- text_to_ngrams(decode(data), 3)
# How to count the occurrences of a term across all documents
sum(onegram[, colnames(onegram) == 'term'])
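# A small sketch of what else the matrix supports (not part of the original
# gist): colSums() on the sparse matrix gives corpus-wide term frequencies,
# so the ten most frequent unigrams are:
term_freqs <- colSums(onegram)
head(sort(term_freqs, decreasing = TRUE), 10)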
library(stringi)  # fast string splitting
library(Matrix)   # sparse document-term matrices
library(tm)       # cleaning helpers (removeNumbers, stemDocument, ...)
library(pbapply)  # lapply with a progress bar
# Note: iconv() is a base R function, so no library(iconv) call is needed.
find_ngrams <- function(dat, n = 1, verbose = FALSE) {
  # verbose is kept for compatibility with the original code but is unused
  stopifnot(is.list(dat))
  stopifnot(is.numeric(n))
  stopifnot(n > 0)
  if (n == 1) return(dat)
  # For each tokenized document, append all 2-grams .. n-grams to the unigrams
  pblapply(dat, function(y) {
    if (length(y) <= 1) return(y)
    c(y, unlist(lapply(2:n, function(n_i) {
      if (n_i > length(y)) return(NULL)
      # embed() slides a window of width n_i over the reversed tokens;
      # paste() joins each window back into a single n-gram string
      do.call(paste, unname(as.data.frame(embed(rev(y), n_i), stringsAsFactors = FALSE)), quote = FALSE)
    })))
  })
}
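# A minimal illustration (example input invented for this comment): with
# n = 2, each document keeps its unigrams and gains its bigrams. Because
# embed() walks the reversed tokens, the bigrams come out in reverse order.
# find_ngrams(list(c("the", "quick", "brown", "fox")), n = 2)
# -> "the" "quick" "brown" "fox" "brown fox" "quick brown" "the quick"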
text_to_ngrams <- function(sents, n = 2) {
  # Tokenize on single spaces, then expand each document's tokens with n-grams
  tokens <- stri_split_fixed(sents, ' ')
  tokens <- find_ngrams(tokens, n = n, verbose = TRUE)
  token_vector <- unlist(tokens)
  bagofwords <- unique(token_vector)
  n.ids <- sapply(tokens, length)
  # Row index: the document each token came from; column index: the term id
  i <- rep(seq_along(n.ids), n.ids)
  j <- match(token_vector, bagofwords)
  # sparseMatrix() sums duplicate (i, j) pairs, so cells hold term counts
  M <- sparseMatrix(i = i, j = j, x = 1L)
  colnames(M) <- bagofwords
  return(M)
}
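# A minimal sketch of the output (sentences invented for this comment):
# two documents yield a 2 x V sparse matrix over all unigrams and bigrams.
# m <- text_to_ngrams(c("the cat sat", "the dog sat"), n = 2)
# m[, "the cat"]  # per-document counts for one bigram: 1 0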
clean <- function(docs) {
  # Basic tm preprocessing: strip digits, punctuation, extra whitespace; stem
  docs <- removeNumbers(docs)
  docs <- removePunctuation(docs)
  docs <- stripWhitespace(docs)
  docs <- stemDocument(docs)
  return(docs)
}
decode <- function(text) {
  # Convert to plain ASCII so tokenization is not tripped up by odd bytes
  t1 <- iconv(text, from = "UTF-8", to = "ASCII")
  return(t1)
}
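# Caveat (behaviour of base iconv(), not specific to this gist): lines
# containing characters with no ASCII representation come back as NA, so
# you may want to drop them before building n-grams:
# data <- decode(data)
# data <- data[!is.na(data)]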
amir-rahnama commented Jun 26, 2016

I was searching for a way to create n-grams for large text files without long pauses, and honestly the solution Zach (https://github.com/zachmayer) gave in his answer at http://stackoverflow.com/questions/31570437/really-fast-word-ngram-vectorization-in-r really did the trick. I improved on it a bit and published it here for anyone who needs it.
