@remibacha
Forked from voltek62/sherlock-holmes.R
Last active September 11, 2018 15:14
Add install.packages with condition
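# install any missing packages, then load them
# (stringr is used below for the stopword removal step)
if(!require("ngram")){install.packages("ngram")}
if(!require("tm")){install.packages("tm")}
if(!require("stringr")){install.packages("stringr")}
library(ngram)
library(tm)
library(stringr)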
# read the text file line by line, then join the lines into one string
url <- "https://raw.githubusercontent.com/voltek62/RsparkleR-examples/master/examples/advs.txt"
txt <- readLines(url)
data.sentence <- concatenate(txt)
# lowercase, remove punctuation and numbers, fix spacing
data.sentence.staging <- preprocess(data.sentence,
                                    case = 'lower',
                                    remove.punct = TRUE,
                                    remove.numbers = TRUE,
                                    fix.spacing = TRUE)
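# remove stopwords (plus 'holmes'); punctuation was stripped above, so
# contractions such as "don't" now appear as "dont" and are not matched
stopwords_regex <- paste(c(stopwords('en'), 'holmes'), collapse = '\\b|\\b')
stopwords_regex <- paste0('\\b', stopwords_regex, '\\b')
data.sentence.prepared <- stringr::str_replace_all(data.sentence.staging, stopwords_regex, '')
# collapse the gaps left behind by the removed words
data.sentence.prepared <- stringr::str_squish(data.sentence.prepared)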
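# Illustrative check, not part of the original gist: the same word-boundary
# regex idea applied to a toy sentence, so the effect is easy to see.
# The three-word stop list and the toy_* names are hypothetical; base R's
# gsub() is used here so the snippet runs without extra packages.
toy_stop <- c("the", "of", "holmes")
toy_regex <- paste0("\\b(", paste(toy_stop, collapse = "|"), ")\\b")
toy_clean <- gsub(toy_regex, "", "the adventures of sherlock holmes", perl = TRUE)
# removing words leaves extra spaces, so collapse and trim them
toy_clean <- trimws(gsub("\\s+", " ", toy_clean))
print(toy_clean)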
# build bigrams (n = 2) and print the first rows of the phrase table
ng <- ngram(data.sentence.prepared, n = 2)
print(head(get.phrasetable(ng)))
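# Illustrative sketch, not part of the original gist: the same ngram() and
# get.phrasetable() calls on a tiny hypothetical string, to show the shape
# of the result. The phrase table is a data frame with the columns
# ngrams, freq and prop. The toy_* names are made up for this example.
library(ngram)
toy_ng <- ngram("a b a b a c", n = 2)
toy_pt <- get.phrasetable(toy_ng)
# six tokens give five bigrams in total, three of them distinct
print(toy_pt)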