@shawngraham
Created January 16, 2017 19:50
library(textreuse)
# Directory of plain-text documents to compare (assumed here to be a local "posts" folder).
dir <- "posts"
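# 200 minhash functions; the seed makes the signatures reproducible.
minhash <- minhash_generator(n = 200, seed = 235)
# Load the corpus, tokenizing into 5-word shingles and hashing each document.
ats <- TextReuseCorpus(dir = dir,
                       tokenizer = tokenize_ngrams, n = 5,
                       minhash_func = minhash)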
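# Locality-sensitive hashing: 50 bands of 4 hashes each (200 / 50).
buckets <- lsh(ats, bands = 50, progress = FALSE)
# Document pairs that share at least one bucket.
candidates <- lsh_candidates(buckets)
# Score only the candidate pairs with the Jaccard similarity.
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
write.csv(scores, file = "textreusescores.csv")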
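
# A possible follow-up (a sketch, not part of the original gist): keep only the
# pairs above a similarity cutoff and list the strongest matches first.
# Assumptions: the 0.2 cutoff is arbitrary, `matches` is a name introduced here,
# and `scores` has the `a`, `b`, and `score` columns returned by lsh_compare().
threshold <- 0.2
matches <- scores[scores$score >= threshold, ]
matches <- matches[order(matches$score, decreasing = TRUE), ]
matches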