@shawngraham shawngraham/textreuse.r
Last active Dec 7, 2018

walking through textreuse for andrew
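# use ctrl+enter to run each line in turn
# load the package first (run install.packages("textreuse") once if you don't have it yet)
library(textreuse)
# next line just displays the package's introductory vignette in the help window in RStudio
vignette("textreuse-introduction", package = "textreuse")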
# check what directory you're in
getwd()
## have the texts you're interested in sitting in a subdirectory called corpus
# put the subdirectory name into a variable
dir <- "corpus"
# start the text reuse; it compares sequences of words 7 words in length. You might want to change that for your
# purposes, so in the TextReuseCorpus() call below change n = 7 to whatever seems appropriate
# dir <- system.file("corpus", package = "textreuse") <- don't use this line, it seems to throw an error.
# if you get an error with that line, remove 'system.file' and ', package = "textreuse"' so it just reads dir <- "corpus"
corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Colossal Cave Adventure"),
                          tokenizer = tokenize_ngrams, n = 7)
# check that everything is there (lists the documents that were loaded):
names(corpus)
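# compute the Jaccard similarity of every pair of documents (returns a matrix)
comparisons <- pairwise_compare(corpus, jaccard_similarity)
# turn it into a data frame (one row per pair of documents) if you want
df <- pairwise_candidates(comparisons)
# write the results to file! (this writes the full matrix; pass df instead for the one-row-per-pair version)
write.csv(comparisons, file = "textreuse-comparisons.csv")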
#### if you've got a lot of data, that can be really slow, so you'd do this instead:
# dir <- system.file("corpus", package = "textreuse") # see the earlier comment about system.file() if you get an error here
dir <- "corpus"
# build a minhash function with 200 hashes; the seed makes the results reproducible
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = dir,
                       tokenizer = tokenize_ngrams, n = 5,
                       minhash_func = minhash)
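### then the scores:
# split the 200 minhashes into 50 bands (so 4 hashes per band) and bucket the documents
buckets <- lsh(ats, bands = 50, progress = FALSE)
# pull out the pairs of documents that landed in the same bucket at least once
candidates <- lsh_candidates(buckets)
# compute the actual Jaccard similarity, but only for those candidate pairs
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
write.csv(scores, file = "textreuse-scores.csv")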
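
For a rough sense of what the script above is measuring, here is a minimal base-R sketch of word n-grams and Jaccard similarity, plus the usual rule-of-thumb LSH threshold for 200 hashes in 50 bands. This is an illustration only, not textreuse's own implementation: `word_ngrams`, `jaccard` and `lsh_threshold_approx` are hypothetical helper names, and the tokenizing here is cruder than `tokenize_ngrams`.

```r
# Illustrative base-R sketch; these helpers are NOT part of textreuse.

# Split a text into lowercase words and return every run of n consecutive words.
word_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Jaccard similarity: shared n-grams divided by all distinct n-grams.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

x <- word_ngrams("you are standing at the end of a road", n = 3)
y <- word_ngrams("you are standing at the edge of a road", n = 3)
jaccard(x, y)  # 0.4: 4 shared trigrams out of 10 distinct ones

# Rule-of-thumb similarity at which LSH starts flagging pairs as candidates:
# (1 / bands)^(1 / rows per band), where rows per band = hashes / bands.
lsh_threshold_approx <- function(h, b) (1 / b)^(b / h)
lsh_threshold_approx(h = 200, b = 50)  # roughly 0.376
```

Changing one word out of nine knocks out three of the seven trigrams, which is why a larger n makes the comparison stricter. The threshold figure suggests that, with the settings used above, pairs much below about 0.38 similarity are unlikely to show up in `candidates` at all; textreuse ships its own `lsh_threshold()` helper for the same calculation.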

shawngraham commented Dec 7, 2018

While the vignette says to do
dir <- system.file("corpus", package = "textreuse")
that seems to throw an error every time. Pointing dir straight at the local folder (dir <- "corpus") works instead.
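
A possible explanation, offered as a guess rather than a confirmed diagnosis: `system.file()` only looks inside the installed package's own files, and it returns an empty string when the path it is given isn't there. If the package has no folder by that name, `dir` ends up as `""` and the corpus has nothing to read. The sketch below shows that behaviour with the `base` package and a made-up folder name, then a simple check before loading your own folder (`corpus_dir` is a hypothetical variable name).

```r
# system.file() returns "" when the requested path does not exist in the package.
missing_dir <- system.file("no-such-directory", package = "base")
nzchar(missing_dir)  # FALSE, i.e. nothing was found

# For your own texts, point at the folder directly and check it exists first.
corpus_dir <- "corpus"  # hypothetical: a folder in your working directory
if (dir.exists(corpus_dir)) {
  message("Found ", length(list.files(corpus_dir)), " files in ", corpus_dir)
} else {
  message("No folder called '", corpus_dir, "' in ", getwd())
}
```

If the second message appears, `setwd()` to the folder that contains `corpus` before building the corpus.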
