@pjazdzewski1990
Created February 4, 2016 07:27
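import org.apache.spark.rdd.RDD

// Scores each author's words against the target words; best-matching authors come first.
// scoreSimilarity is defined elsewhere; the 0.65 cutoff suggests it returns a value in [0, 1] (assumption).
private def countSimilarity(target: Array[String], rdd: RDD[(String, Seq[String])]): RDD[(String, Double, Seq[String])] = {
  rdd.map {
    case (author, words) =>
      val matching = (for {
        w <- words
        t <- target
        if t.length > 3 && w.length > 3 // skip short words
        score = scoreSimilarity(t, w)
        if score > 0.65d // drop accidental, weak matches
      } yield (score, w)).distinct // drop duplicated (score, word) pairs
      // Cap the summed score at the number of target words
      val score = Math.min(matching.map(_._1).sum, target.length)
      val wordsMatched = matching.map(_._2)
      (author, score, wordsMatched)
  }.sortBy(_._2, ascending = false) // highest score first
}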