@involer
Last active January 11, 2023 11:34
<?php
/**
* Return the coefficient of two items based on Jaccard index
* http://en.wikipedia.org/wiki/Jaccard_index
*
* Example:
*
* $tags1 = "code, php, jaccard, test, items";
* $tags2 = "test, code";
* echo getSimilarityCoefficient( $tags1, $tags2 ); // 0.28
*
* $str1 = "similarity coefficient of two items";
* $str2 = "two items are cool";
* echo getSimilarityCoefficient( $str1, $str2, " " ); // 0.44
*
* @param string $item1
* @param string $item2
* @param string $separator
* @return float
* @author Henrique Hohmann
* @version 0.1
*/
function getSimilarityCoefficient( $item1, $item2, $separator = "," ) {
$item1 = explode( $separator, $item1 );
$item2 = explode( $separator, $item2 );
$arr_intersection = array_intersect( $item2, $item2 );
$arr_union = array_merge( $item1, $item2 );
$coefficient = count( $arr_intersection ) / count( $arr_union );
return $coefficient;
}
?>
@scjonatas

Hello.
Thank you for sharing.

I think you should apply array_unique to the result of array_merge, because array_merge keeps duplicates. As written, your function returns 0.5 for two identical sets instead of 1.

So the $arr_union line would be:
$arr_union = array_unique(array_merge( $item1, $item2 ));
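
To make the duplication concrete, here is a minimal sketch of the difference between the two union computations. The variable names and sample tags are illustrative and not taken from the gist.

```php
<?php
// Two identical tag sets, already split into arrays (sample data).
$a = ['code', 'php'];
$b = ['code', 'php'];

// array_merge concatenates, so every shared element appears twice.
$merged = array_merge($a, $b);   // 4 elements
// array_unique collapses the duplicates into a true set union.
$union = array_unique($merged);  // 2 elements

$intersection = array_intersect($a, $b);

echo count($intersection) / count($merged), "\n"; // 0.5
echo count($intersection) / count($union), "\n";  // 1
```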

@SaurabhDhar

SaurabhDhar commented May 18, 2018

After fixing the above two errors, the final function should look like this. Thanks, everyone.

function getSimilarityCoefficient( $item1, $item2, $separator = "," ) {
	$item1 = explode( $separator, $item1 );
	$item2 = explode( $separator, $item2 );
	$arr_intersection = array_intersect( $item1, $item2 );
	$arr_union = array_unique(array_merge( $item1, $item2 ));
	$coefficient = count( $arr_intersection ) / count( $arr_union );
	
	return $coefficient;
}

@Pierstoval

I made some changes so that items are compared case-insensitively and surrounding whitespace is ignored. You can try it here: https://3v4l.org/48UZL

Here's the new code:

function getSimilarityCoefficient( $item1, $item2, $separator = "," ) {
	
	$item1 = array_unique(array_map('trim', explode( $separator, strtolower($item1) )));
	$item2 = array_unique(array_map('trim', explode( $separator, strtolower($item2) )));
	$arr_intersection = array_intersect( $item2, $item1 );
	$arr_union = array_unique(array_merge( $item1, $item2 ));
	$coefficient = count( $arr_intersection ) / count( $arr_union );
	
	return $coefficient;
}
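
As a rough illustration of what the normalisation step in the function above does to each input, the sketch below isolates it in a helper. The name normalizeItems and the sample string are hypothetical and only serve the example.

```php
<?php
// Hypothetical helper: the per-input normalisation from the function above, isolated.
function normalizeItems(string $items, string $separator = ','): array {
    // Lowercase the whole string, split it, trim each piece, drop duplicates.
    return array_unique(array_map('trim', explode($separator, strtolower($items))));
}

// "Code" and "code" collapse into one item; stray spaces around "PHP" are removed.
print_r(normalizeItems('Code, PHP ,code'));

// For comparison, a plain explode() keeps three distinct strings.
print_r(explode(',', 'Code, PHP ,code'));
```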

@mylogin

mylogin commented Jan 11, 2023

array_intersect will return duplicates if they are present in the first array, so the results may be wrong:

echo getSimilarityCoefficient("red red red", "green blue red", " "); // 1

Change the $arr_intersection line to:

$arr_intersection = array_unique(array_intersect( $item1, $item2 ));
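
Putting the suggestions from this thread together, one possible version looks like the sketch below. It assumes a non-empty separator; jaccardSimilarity is a hypothetical name chosen to avoid clashing with the gist's function, and deduplicating both inputs up front is one way (not the only way) to keep the intersection free of repeats.

```php
<?php
/**
 * Sketch: Jaccard index of two delimited strings, each treated as a set.
 * jaccardSimilarity() is a hypothetical name, not part of the original gist.
 */
function jaccardSimilarity(string $item1, string $item2, string $separator = ','): float {
    // Normalise each side to a set: lowercase, trim, drop duplicates.
    $set1 = array_unique(array_map('trim', explode($separator, strtolower($item1))));
    $set2 = array_unique(array_map('trim', explode($separator, strtolower($item2))));

    // Both inputs are already duplicate-free, so the intersection is too.
    $intersection = array_intersect($set1, $set2);
    $union = array_unique(array_merge($set1, $set2));

    // explode() with a non-empty separator returns at least one element,
    // so the union is never empty and the division is safe.
    return count($intersection) / count($union);
}

// The counter-example from the comment above now gives one third instead of 1.
echo round(jaccardSimilarity('red red red', 'green blue red', ' '), 2), "\n"; // 0.33
// Same tags in a different case and order are treated as identical.
echo jaccardSimilarity('code, php', 'PHP, Code'), "\n"; // 1
```

With whitespace trimmed, the first docblock example from the gist (five tags against two shared tags) comes out as 2/5 = 0.4 under this sketch.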
