@larsmans
Created July 15, 2012 13:25
Hellinger distance for discrete probability distributions in Python
"""
Three ways of computing the Hellinger distance between two discrete
probability distributions using NumPy and SciPy.
"""
import numpy as np
from scipy.linalg import norm
from scipy.spatial.distance import euclidean
_SQRT2 = np.sqrt(2) # sqrt(2) with default precision np.float64
def hellinger1(p, q):
return norm(np.sqrt(p) - np.sqrt(q)) / _SQRT2
def hellinger2(p, q):
return euclidean(np.sqrt(p), np.sqrt(q)) / _SQRT2
def hellinger3(p, q):
return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / _SQRT2
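As a quick sanity check (not part of the original gist), the result can be cross-checked against the equivalent Bhattacharyya-coefficient form, H(p, q) = sqrt(1 − BC(p, q)) with BC(p, q) = Σᵢ sqrt(pᵢ qᵢ):

```python
import numpy as np
from scipy.spatial.distance import euclidean

_SQRT2 = np.sqrt(2)

def hellinger2(p, q):
    return euclidean(np.sqrt(p), np.sqrt(q)) / _SQRT2

p = np.array([0.5, 0.5])
q = np.array([1.0, 0.0])

# Bhattacharyya coefficient; H(p, q) = sqrt(1 - BC(p, q))
bc = np.sum(np.sqrt(p * q))
assert np.isclose(hellinger2(p, q), np.sqrt(1 - bc))
```

The same check passes for hellinger1 and hellinger3, since all three compute the identical quantity.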
@larsmans
Author

@cscorley Probability distributions aren't supposed to ever contain negative numbers. These functions don't do input validation for speed reasons.
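For callers who do want the checks, a thin validating wrapper (a sketch, not part of the gist; the name hellinger_checked is hypothetical) could reject bad inputs before computing the distance:

```python
import numpy as np

_SQRT2 = np.sqrt(2)

def hellinger_checked(p, q):
    """Hellinger distance with input validation (hypothetical wrapper)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    if p.shape != q.shape:
        raise ValueError("p and q must have the same shape")
    if (p < 0).any() or (q < 0).any():
        raise ValueError("probabilities must be non-negative")
    if not (np.isclose(p.sum(), 1.0) and np.isclose(q.sum(), 1.0)):
        raise ValueError("p and q must each sum to 1")
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / _SQRT2
```

The validation adds array scans, so it defeats the point of the unchecked versions in hot loops; it is only worthwhile at API boundaries.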

@drussellmrichie

drussellmrichie commented Mar 17, 2017

In case anyone is wondering, I believe hellinger2 and hellinger3 are faster than hellinger1. (I had been using hellinger1 in one of my projects until profiling showed it was a rate-limiting step.) Here is some timing code:

"""
Three ways of computing the Hellinger distance between two discrete
probability distributions using NumPy and SciPy.
"""
import time
import numpy as np
from scipy.linalg import norm
from scipy.spatial.distance import euclidean

_SQRT2 = np.sqrt(2)     # sqrt(2) with default precision np.float64

def hellinger1(p, q):
    return norm(np.sqrt(p) - np.sqrt(q)) / _SQRT2

def hellinger2(p, q):
    return euclidean(np.sqrt(p), np.sqrt(q)) / _SQRT2

def hellinger3(p, q):
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / _SQRT2

repeat = 1000000
p = np.array([.05, .05, .1, .1, .2, .2, .3] * repeat)
q = np.array([  0,   0,  0, .1, .3, .3, .3] * repeat)

p /= p.sum()
q /= q.sum()

for hellingerFunction in [hellinger1, hellinger2, hellinger3]:
    start = time.time()
    hellingerFunction(p=p, q=q)
    duration = time.time() - start
    print("{} took {} seconds".format(hellingerFunction.__name__, duration))

Should get something like:

hellinger1 took 0.15966796875 seconds
hellinger2 took 0.10175800323486328 seconds
hellinger3 took 0.09865593910217285 seconds

The difference shrinks for shorter arrays p and q, but even if repeat=1 so that p and q are of length 7, hellinger3 is still faster.

Hope this is right.

@EvgeniDubov

In case anyone is interested, I've implemented the Hellinger distance in Cython as a split criterion for sklearn's DecisionTreeClassifier and RandomForestClassifier.
It performs well in my imbalanced-data classification use cases, beating both RandomForestClassifier with the Gini criterion and XGBClassifier.
You are welcome to check it out on https://github.com/EvgeniDubov/hellinger-distance-criterion
