@vi
Last active May 2, 2019 07:31
Calculate "distance" between files using compression as metric
#!/bin/bash
COMPRESSOR="xz --lzma2=dict="
OVERHEAD=60
if [ -z "$2" ]; then
echo "Usage: filedistance file1 file2"
echo " Outputs a similarity metric between two small files using compression"
echo " 0 - completely similar; 1 - completely dissimilar"
exit 1
fi
DICTSIZE=$(cat "$1" "$2" | wc -c)
if [ "$DICTSIZE" -lt 65536 ]; then DICTSIZE=65536; fi
CC="$COMPRESSOR""$DICTSIZE"
A=$(cat "$1" "$1" | $CC | wc -c)
B=$(cat "$2" "$2" | $CC | wc -c)
C1=$(cat "$1" "$2" | $CC | wc -c)
C2=$(cat "$2" "$1" | $CC | wc -c)
#echo "A=$A B=$B C=$C1 $C2"
echo "($C1 + $C2 - $A - $B) / ($A + $B - 2 * $OVERHEAD)" | bc -l
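The arithmetic in the final line of filedistance can be tried on its own, without running xz. The following is a minimal sketch: the function name `distance` and the sample sizes are made up for illustration, and awk is swapped in for bc only so that the output is rounded to a fixed width.

```shell
#!/bin/bash
# Hypothetical helper (not part of the gist): applies the same formula as
# filedistance to compressed sizes passed in directly.
# Usage: distance A B C1 C2 [OVERHEAD]
distance() {
    local a=$1 b=$2 c1=$3 c2=$4 overhead=${5:-60}
    # awk replaces bc here so the result is rounded to four decimals.
    awk -v a="$a" -v b="$b" -v c1="$c1" -v c2="$c2" -v o="$overhead" \
        'BEGIN { printf "%.4f\n", (c1 + c2 - a - b) / (a + b - 2 * o) }'
}

# Identical files: all four concatenations compress to the same size,
# so the numerator is zero.
distance 500 500 500 500      # prints 0.0000

# Made-up sizes for two files that share little content:
# (1080 + 1080 - 500 - 700) / (500 + 700 - 120) = 960 / 1080
distance 500 700 1080 1080    # prints 0.8889
```

The sizes above are invented; real values depend on the files and on the xz version, and the 60-byte OVERHEAD is the script's own estimate of the fixed container cost.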
#!/bin/bash
FILEDISTANCE=filedistance
if [ -z "$2" ]; then
echo "Usage: filedistance_multi file1 file2 ... fileN"
exit 1
fi
FF="$1"
shift;
for i in "$@"; do
printf "%s\t%s\t%s\n" "$($FILEDISTANCE "$FF" "$i")" "$FF" "$i"
done
if [ ! -z "$2" ]; then
exec "$0" "$@"
fi
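filedistance_multi compares the first file against every later one, then re-executes itself on the remaining files, so each unordered pair is visited exactly once. A rough sketch of that traversal, with the call to filedistance left out so it runs anywhere (the function name `pairs` is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper (not part of the gist): prints every unordered pair of
# its arguments in the order filedistance_multi would compare them, using a
# loop instead of re-executing the script.
pairs() {
    local first other
    while [ "$#" -ge 2 ]; do
        first=$1
        shift
        for other in "$@"; do
            printf '%s\t%s\n' "$first" "$other"
        done
    done
}

pairs a b c
# Prints (tab-separated):
# a	b
# a	c
# b	c
```

For N files this yields N*(N-1)/2 pairs, and each pair costs four xz runs in filedistance, so the total work grows quadratically with the number of files.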