@markziemann
Created April 22, 2019 11:36
An example of text similarity analysis using R
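library(stringr)
library(text2vec)

# read every .txt file in the working directory, one string per document
filelist = list.files(pattern = "\\.txt$")
x = data.frame(
  file = filelist,
  text = vapply(filelist, function(f) paste(readLines(f, warn = FALSE), collapse = " "), character(1)),
  stringsAsFactors = FALSE
)

prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
x$clean = prep_fun(x$text)

# tokenise on spaces, keeping the file names as document ids
it = itoken(x$clean, ids = x$file, progressbar = FALSE)
# drop terms seen fewer than 3 times or present in more than 10% of documents
# (with fewer than 10 documents this removes every term)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 3)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)

# pairwise cosine similarity between documents, shown as a heatmap
heatmap(as.matrix(sim2(dtm, method = "cosine", norm = "l2")), scale = "none")

# first 9-digit number found in the first document (NA if there is none)
str_extract(x$clean[1], "\\d{9}")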