@ijt
Created October 3, 2019 00:36
Rust program to compute the trigram Jaccard similarity between two strings
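//! The stringsim program prints the trigram Jaccard similarity of two strings.
//! For simple single-word inputs this appears to be the same algorithm used by
//! the Postgres pg_trgm extension. Unlike pg_trgm, it does not split the input
//! into words or ignore non-alphanumeric characters.
//! https://www.postgresql.org/docs/9.1/pgtrgm.html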
use std::collections::HashSet;
use std::hash::Hash;

fn main() {
    let args: Vec<String> = std::env::args().collect();
    // Expect the program name plus exactly two string arguments.
    if args.len() != 3 {
        eprintln!("usage: stringsim s1 s2");
        std::process::exit(1);
    }
    let sim = similarity(&args[1], &args[2]);
    println!("{}", sim);
}
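
/// Returns the Jaccard similarity of the trigram sets of `a` and `b`.
fn similarity(a: &str, b: &str) -> f32 {
    jaccard(trigrams(a), trigrams(b))
}

/// Returns the set of trigrams of `s`, treating `s` as if it were padded
/// with two leading spaces and one trailing space.
fn trigrams(s: &str) -> HashSet<String> {
    let mut ts = HashSet::new();
    // Append the trailing pad; the leading pad comes from p1 and p2 below.
    let s = format!("{} ", s);
    // p1 and p2 hold the two characters preceding the current one.
    let mut p1 = ' ';
    let mut p2 = ' ';
    for c in s.chars() {
        let t: String = [p1, p2, c].iter().collect();
        ts.insert(t);
        p1 = p2;
        p2 = c;
    }
    ts
}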
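
/// Returns the size of the intersection of `s1` and `s2` divided by the size
/// of their union. The result is NaN if both sets are empty.
fn jaccard<T>(s1: HashSet<T>, s2: HashSet<T>) -> f32
where
    T: Hash + Eq,
{
    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    i / u
}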
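
As a worked example, the sketch below restates the trigram and Jaccard steps from the program above in self-contained form (taking `&str` parameters) and prints the trigram set of "cat" along with its similarity to "cart". The inputs, the sorted printing, and the merged `similarity` body are illustrative assumptions, not part of the original program.

```rust
use std::collections::HashSet;

/// Trigrams of `s`, padded with two leading spaces and one trailing space.
fn trigrams(s: &str) -> HashSet<String> {
    let mut ts = HashSet::new();
    let (mut p1, mut p2) = (' ', ' ');
    // Chain one trailing space onto the input instead of allocating a copy.
    for c in s.chars().chain(std::iter::once(' ')) {
        ts.insert([p1, p2, c].iter().collect::<String>());
        p1 = p2;
        p2 = c;
    }
    ts
}

/// Jaccard similarity of the trigram sets of `a` and `b`.
fn similarity(a: &str, b: &str) -> f32 {
    let (ta, tb) = (trigrams(a), trigrams(b));
    let i = ta.intersection(&tb).count() as f32;
    let u = ta.union(&tb).count() as f32;
    i / u
}

fn main() {
    // Sort for stable output; HashSet iteration order is unspecified.
    let mut ts: Vec<String> = trigrams("cat").into_iter().collect();
    ts.sort();
    println!("{:?}", ts); // ["  c", " ca", "at ", "cat"]

    // "cat" and "cart" share 2 of 7 distinct trigrams: "  c" and " ca".
    println!("{}", similarity("cat", "cart")); // 2/7, about 0.2857
}
```

Because every input contributes at least the trailing-pad trigram, the union is never empty here, so the division cannot produce NaN for string inputs.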