Skip to content

Instantly share code, notes, and snippets.

@wladston
Last active December 12, 2023 06:50
Show Gist options
  • Star 19 You must be signed in to star a gist
  • Fork 8 You must be signed in to fork a gist
  • Save wladston/c931b1495184fbb99bec to your computer and use it in GitHub Desktop.
Save wladston/c931b1495184fbb99bec to your computer and use it in GitHub Desktop.
Distance correlation with p-value
from scipy.spatial.distance import pdist, squareform
import numpy as np
import copy
def distcorr(Xval, Yval, pval=True, nruns=500):
""" Compute the distance correlation function, returning the p-value.
Based on Satra/distcorr.py (gist aa3d19a12b74e9ab7941)
>>> a = [1,2,3,4,5]
>>> b = np.array([1,2,9,4,4])
>>> distcorr(a, b)
(0.76267624241686671, 0.266)
"""
X = np.atleast_1d(Xval)
Y = np.atleast_1d(Yval)
if np.prod(X.shape) == len(X):
X = X[:, None]
if np.prod(Y.shape) == len(Y):
Y = Y[:, None]
X = np.atleast_2d(X)
Y = np.atleast_2d(Y)
n = X.shape[0]
if Y.shape[0] != X.shape[0]:
raise ValueError('Number of samples must match')
a = squareform(pdist(X))
b = squareform(pdist(Y))
A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()
dcov2_xy = (A * B).sum() / float(n * n)
dcov2_xx = (A * A).sum() / float(n * n)
dcov2_yy = (B * B).sum() / float(n * n)
dcor = np.sqrt(dcov2_xy) / np.sqrt(np.sqrt(dcov2_xx) * np.sqrt(dcov2_yy))
if pval:
greater = 0
for i in range(nruns):
Y_r = copy.copy(Yval)
np.random.shuffle(Y_r)
if distcorr(Xval, Y_r, pval=False) > dcor:
greater += 1
return (dcor, greater / float(nruns))
else:
return dcor
@kirk86
Copy link

kirk86 commented May 27, 2017

This also gives different results from spicy.spatial.distance.correlation. Is that correct?

@rmill040
Copy link

rmill040 commented Sep 8, 2017

I believe the p-value should be calculated using nruns and not n, since the p-value can be greater than 1.0 using n.

So greater/float(n)) -> greater/float(nruns))

@wladston
Copy link
Author

Hi @B1azingB1ade, thanks for your comments! I have changed the code to use np.random.shuffle. According to the docs, it only shuffles along the first axis: https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.shuffle.html… they seem to be cooking a way to do this: numpy/numpy#5173.

@kirk86 yes, it seems like "correlation distance" is different from "distance correlation".

@rmill040, correct! Thanks for spotting this bug, I just fixed it.

@WeiHao97
Copy link

Hi,
I think Xval should be shuffled with Yval. or your distcorr([1,2,3],[2,4,6]) will not give the correct p-value, 1 in this case.

@wladston
Copy link
Author

wladston commented Jun 26, 2019

I tried running distcorr([1,2,3],[2,4,6], nruns=100000) both shuffling Xval and not shuffling it. Both cases returns a distance correlation of 1 with p-value of 0.33. This means that in both cases, 33% of the random shuffles also produce a distance correlation of 1. If we change if distcorr(Xval, Y_r, pval=False) >= dcor to > dcor, then no random shuffling makes that line be true, and the p-value is zero. I'm unsure whether we should use > instead of >=, but I've made the change anyways, because in the example you described it makes sense that the the p-value is zero.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment