@thomasjungblut
Last active August 29, 2015 14:16
MinHashing Example
package de.jungblut.nlp;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import com.google.common.base.Stopwatch;

import de.jungblut.datastructure.LimitedPriorityQueue;
import de.jungblut.math.DoubleVector;
import de.jungblut.nlp.MinHash.HashType;

public class MinHashExample {

  public static void main(String[] args) throws IOException {
    // read all reviews; Files.lines keeps the file open until the stream is
    // closed, hence the try-with-resources
    List<String> lines;
    try (Stream<String> stream = Files.lines(Paths
        .get("C:/Users/thomas.jungblut/Downloads/unlabeledTrainData.tsv"))) {
      lines = stream.skip(1).collect(Collectors.toList()); // skip the header
    }

    Stopwatch watch = Stopwatch.createStarted();
    // normalization and bigram tokenization
    List<String[]> documents = lines.stream().parallel()
        .map((line) -> TokenizerUtils.normalizeString(line))
        .map((normalized) -> TokenizerUtils.whiteSpaceTokenizeNGrams(normalized, 2))
        .collect(Collectors.toList());
    // every token that occurs in more than 50% of the corpus or fewer than
    // twice will be discarded
    String[] dict = VectorizerUtils.buildDictionary(
        documents.stream().parallel(), 0.5f, 2);
    List<DoubleVector> vectors = VectorizerUtils
        .wordFrequencyVectorize(documents.stream().parallel(), dict)
        .collect(Collectors.toList());
    System.out.println("Done vectorizing in "
        + watch.elapsed(TimeUnit.MILLISECONDS) + "ms!");

    // compute a signature of 20 min-hashes per document
    watch = Stopwatch.createStarted();
    MinHash hasher = MinHash.create(20, HashType.MURMUR128);
    List<int[]> minHashes = vectors.stream()
        .map((v) -> hasher.minHashVector(v))
        .collect(Collectors.toList());
    System.out.println("Done hashing in "
        + watch.elapsed(TimeUnit.MILLISECONDS) + "ms!");

    // index of the document we want to find near-duplicates for
    final int sourceDocIndex = 2;
    // queue of candidate document indices, limited to 5 entries
    LimitedPriorityQueue<Integer> queue = new LimitedPriorityQueue<>(5);
    int[] first = minHashes.get(sourceDocIndex);
    watch = Stopwatch.createStarted();
    for (int i = 0; i < minHashes.size(); i++) {
      if (i != sourceDocIndex) {
        int[] reference = minHashes.get(i);
        double similarity = hasher.measureSimilarity(first, reference);
        // only keep candidates above a small similarity threshold
        if (similarity > 0.1) {
          queue.add(i, similarity);
        }
      }
    }
    System.out.println("Done finding similar docs in "
        + watch.elapsed(TimeUnit.MILLISECONDS) + "ms!");

    System.out.println("document:");
    System.out.println(lines.get(sourceDocIndex));
    // print the candidates together with their similarity
    while (!queue.isEmpty()) {
      double similarity = queue.getMaximumPriority();
      int index = queue.poll();
      System.out.println(similarity + " -> " + lines.get(index));
    }
  }
}
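The example above leans on `MinHash` from the author's library, whose internals are not part of this gist. As a rough illustration of the underlying technique, here is a minimal JDK-only sketch. Everything in it is an assumption made for illustration: the class name `MinHashSketch`, the linear hash family `(a * x + b) mod p` (swapped in for the Murmur128 hashing used above), and the choice to hash sets of bigram strings rather than word-frequency vectors. It is not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical, simplified MinHash; not de.jungblut.nlp.MinHash.
final class MinHashSketch {

  private static final long PRIME = 2147483647L; // 2^31 - 1

  private final long[] a; // multipliers of the hash family
  private final long[] b; // offsets of the hash family

  MinHashSketch(int numHashes, long seed) {
    Random rnd = new Random(seed);
    a = new long[numHashes];
    b = new long[numHashes];
    for (int i = 0; i < numHashes; i++) {
      a[i] = 1 + rnd.nextInt((int) PRIME - 1); // in [1, p - 1]
      b[i] = rnd.nextInt((int) PRIME); // in [0, p - 1]
    }
  }

  // For every hash function, keep the smallest hash over all tokens.
  int[] signature(Set<String> tokens) {
    int[] sig = new int[a.length];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (String token : tokens) {
      long x = Math.floorMod((long) token.hashCode(), PRIME);
      for (int i = 0; i < a.length; i++) {
        int h = (int) ((a[i] * x + b[i]) % PRIME);
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }

  // Fraction of signature positions that agree; estimates the Jaccard index.
  static double similarity(int[] s1, int[] s2) {
    int matches = 0;
    for (int i = 0; i < s1.length; i++) {
      if (s1[i] == s2[i]) {
        matches++;
      }
    }
    return (double) matches / s1.length;
  }

  // Lower-cased whitespace tokenization into word bigrams.
  static Set<String> bigrams(String text) {
    String[] tokens = text.toLowerCase().trim().split("\\s+");
    Set<String> result = new HashSet<>();
    for (int i = 0; i + 1 < tokens.length; i++) {
      result.add(tokens[i] + " " + tokens[i + 1]);
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> d1 = bigrams("the acting is the worst possible and the direction is ridiculous");
    Set<String> d2 = bigrams("the acting is the worst possible and the direction is really ridiculous");

    // exact Jaccard index for comparison
    Set<String> intersection = new HashSet<>(d1);
    intersection.retainAll(d2);
    Set<String> union = new HashSet<>(d1);
    union.addAll(d2);

    MinHashSketch sketch = new MinHashSketch(200, 42L);
    System.out.println("exact:    " + (double) intersection.size() / union.size());
    System.out.println("estimate: " + similarity(sketch.signature(d1), sketch.signature(d2)));
  }
}
```

Each position of a signature matches with probability equal to the Jaccard index of the two sets, so more hash functions give a tighter estimate. With only 20 functions, as in the example above, the estimate is coarse but apparently good enough to surface near-duplicates.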
@thomasjungblut

// output:
Done vectorizing in 9454ms!
Done hashing in 16029ms!
Done finding similar docs in 144ms!
document:
"15561_0" "Minor Spoilers

In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. If you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. My vote is two.

Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')"
0.8181818181818182 -> "15556_0" "Minor Spoilers

In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archaeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible and screaming performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. My suggestion is that if you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. My vote is two.

Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')"
0.7391304347826086 -> "15555_0" "Minor Spoilers

In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible and screaming performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. If you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. AVOID IT! My vote is two.

Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')"
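A note on the two scores: with 20 hash functions, a position-by-position comparison could only yield multiples of 0.05, yet 0.8181... equals 18/22 and 0.7391... equals 17/23. That is what one would get from a set-based Jaccard index over the signature values, where two signatures of 20 values sharing 18 (or 17) of them have a union of 22 (or 23) distinct values. This is only an inference from the printed numbers, not something the gist confirms about `measureSimilarity`. The sketch below reproduces the arithmetic on made-up signatures; the class name `SignatureSetJaccard` is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the library's measureSimilarity may work differently.
final class SignatureSetJaccard {

  // Jaccard index of the sets of values contained in two signatures.
  static double jaccard(int[] s1, int[] s2) {
    Set<Integer> a = new HashSet<>();
    for (int v : s1) {
      a.add(v);
    }
    Set<Integer> b = new HashSet<>();
    for (int v : s2) {
      b.add(v);
    }
    Set<Integer> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    Set<Integer> union = new HashSet<>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    // made-up signatures of 20 values each
    int[] base = new int[20];
    int[] shares18 = new int[20];
    int[] shares17 = new int[20];
    for (int i = 0; i < 20; i++) {
      base[i] = i; // values 0..19
      shares18[i] = i + 2; // values 2..21, 18 in common with base
      shares17[i] = i + 3; // values 3..22, 17 in common with base
    }
    System.out.println(jaccard(base, shares18)); // 18 / 22
    System.out.println(jaccard(base, shares17)); // 17 / 23
  }
}
```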
