Skip to content

Instantly share code, notes, and snippets.

@larsmans
Created July 15, 2012 13:25
Show Gist options
  • Star 31 You must be signed in to star a gist
  • Fork 9 You must be signed in to fork a gist
  • Save larsmans/3116927 to your computer and use it in GitHub Desktop.
Save larsmans/3116927 to your computer and use it in GitHub Desktop.
Hellinger distance for discrete probability distributions in Python
"""
Three ways of computing the Hellinger distance between two discrete
probability distributions using NumPy and SciPy.
"""
import numpy as np
from scipy.linalg import norm
from scipy.spatial.distance import euclidean
_SQRT2 = np.sqrt(2) # sqrt(2) with default precision np.float64
def hellinger1(p, q):
return norm(np.sqrt(p) - np.sqrt(q)) / _SQRT2
def hellinger2(p, q):
return euclidean(np.sqrt(p), np.sqrt(q)) / _SQRT2
def hellinger3(p, q):
return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / _SQRT2
@EvgeniDubov
Copy link

In case anyone is interested, I've implemented Hellinger Distance in Cython as a split criterion for sklearn DecisionTreeClassifier and RandomForestClassifier.
It performs great in my use cases of imbalanced data classification, beats RandomForestClassifier with gini and XGBClassifier.
You are welcome to check it out on https://github.com/EvgeniDubov/hellinger-distance-criterion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment