@voltek62
Created September 11, 2018 15:09
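# install once if needed (stringr is called below via stringr::)
# install.packages(c("ngram", "tm", "stringr"))
library(ngram)  # concatenate(), preprocess(), ngram(), get.phrasetable()
library(tm)     # stopwords()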
# read the text file line by line, then join the lines into one string
url <- "https://raw.githubusercontent.com/voltek62/RsparkleR-examples/master/examples/advs.txt"
txt <- readLines(url)
data.sentence <- concatenate(txt)
# lowercase, remove punctuation and numbers, fix spacing
data.sentence.staging <- preprocess(data.sentence,
                                    case = "lower",
                                    remove.punct = TRUE,
                                    remove.numbers = TRUE,
                                    fix.spacing = TRUE)
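# remove English stopwords plus the corpus-specific word "holmes";
# each word is wrapped in \b word boundaries so only whole words match
stopwords_regex <- paste(c(stopwords("en"), "holmes"), collapse = "\\b|\\b")
stopwords_regex <- paste0("\\b", stopwords_regex, "\\b")
data.sentence.prepared <- stringr::str_replace_all(data.sentence.staging, stopwords_regex, "")
# removing words leaves runs of spaces behind; collapse them before building n-grams
data.sentence.prepared <- stringr::str_squish(data.sentence.prepared)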
# build bigrams (n = 2) and print the first rows of the phrase table
ng <- ngram(data.sentence.prepared, n = 2)
print(head(get.phrasetable(ng)))
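
# --- illustrative sketch, not part of the original script ---
# A minimal base-R demonstration of how the stopword regex above behaves,
# assuming a made-up three-word stopword list and a made-up sentence
# (toy_stopwords, toy_regex, toy_text and toy_clean are hypothetical names;
# the script itself uses stopwords("en") from tm and stringr::str_replace_all).
toy_stopwords <- c("the", "of", "a")
# same construction as stopwords_regex: \bthe\b|\bof\b|\ba\b
toy_regex <- paste0("\\b", paste(toy_stopwords, collapse = "\\b|\\b"), "\\b")
toy_text <- "the adventure of a speckled band"
# whole-word matches are deleted; the "a" inside "adventure" and "band" is kept
toy_clean <- gsub(toy_regex, "", toy_text, perl = TRUE)
# deletion leaves extra spaces behind, so collapse and trim them
toy_clean <- trimws(gsub("\\s+", " ", toy_clean))
print(toy_clean)  # [1] "adventure speckled band"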