@icydee
Created December 18, 2013 14:51
Compare two files to see how similar they are
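use strict;
use warnings;
use Digest::MD4 qw(md4_hex);
use Text::JaroWinkler qw(strcmp95);
# Only needed for the commented-out Levenshtein alternative below:
# use Text::Levenshtein qw(distance);

# Build one fingerprint string per file: each line contributes the first
# four hex digits of its MD4 digest, so similar files yield similar strings.
my @checksum;
for my $file (qw(a2rm_file_1.txt a2rm_file_2.txt)) {
    my $cs = '';
    open(my $fh, '<', $file) or die "Can't open '$file': $!";
    while (my $line = <$fh>) {
        # Strip all whitespace so that spacing changes do not affect the digest
        $line =~ s/\s+//g;
        $cs .= substr(md4_hex($line), 0, 4);
    }
    close $fh;
    push @checksum, $cs;
}
print "0: $checksum[0]\n";
print "1: $checksum[1]\n";

#my $distance = distance($checksum[0], $checksum[1]);
#print "Distance is [$distance]\n";

# strcmp95 returns a similarity score (higher means more alike), not a distance.
# The third argument is the comparison length, taken here from the second string.
my $jw = strcmp95($checksum[0], $checksum[1], length($checksum[1]));
print "JW similarity is [$jw]\n";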

icydee commented Dec 18, 2013

Testing two files that differ only by minor edits gave a value of 0.91.
Testing two totally different files gave a value of about 0.75.
Testing two versions of the same file (with minor edits) gave a value of 0.85.
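
The per-line fingerprinting step in the script can be tried on its own with in-memory data, which makes it easier to see why small edits move the score only slightly. The following is a minimal sketch, not the script itself: the helpers `fingerprint` and `differing_chunks` are hypothetical names introduced here, and the core module Digest::MD5 is swapped in for Digest::MD4 so that it runs without extra installs.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);   # assumption: core MD5 stands in for Digest::MD4

# Hypothetical helper: turn a list of lines into a fingerprint string,
# four hex digits per line, with all whitespace ignored.
sub fingerprint {
    my @lines = @_;
    my $fp = '';
    for my $line (@lines) {
        (my $stripped = $line) =~ s/\s+//g;
        $fp .= substr(md5_hex($stripped), 0, 4);
    }
    return $fp;
}

# Hypothetical helper: count the 4-character chunks that differ by position.
# Chunks present in only one fingerprint count as differences.
sub differing_chunks {
    my ($fp_a, $fp_b) = @_;
    my @a = unpack('(A4)*', $fp_a);
    my @b = unpack('(A4)*', $fp_b);
    my $common = @a < @b ? scalar(@a) : scalar(@b);
    my $diff = abs(scalar(@a) - scalar(@b));
    for my $i (0 .. $common - 1) {
        $diff++ if $a[$i] ne $b[$i];
    }
    return $diff;
}

my @old      = ("sub add {\n", "    return 1;\n", "}\n");
my @new      = ("sub add {\n", "    return 2;\n", "}\n");
my @respaced = ("sub  add  {\n", "return 1;\n", "}\n");

print "old:      ", fingerprint(@old), "\n";
print "new:      ", fingerprint(@new), "\n";
print "respaced: ", fingerprint(@respaced), "\n";
print "chunks changed by the edit: ",
    differing_chunks(fingerprint(@old), fingerprint(@new)), "\n";
```

A one-line edit changes at most one 4-character chunk, and a whitespace-only change leaves the fingerprint untouched, so the string comparison sees edited files as near matches. Note that a positional chunk count like this is thrown off by inserted or deleted lines, which is one reason to compare the fingerprints with an edit-distance or Jaro-Winkler style measure instead.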
