Skip to content

Instantly share code, notes, and snippets.

@fabianp
Last active March 21, 2024 09:00
Show Gist options
  • Star 37 You must be signed in to star a gist
  • Fork 14 You must be signed in to fork a gist
  • Save fabianp/9396204419c7b638d38f to your computer and use it in GitHub Desktop.
Save fabianp/9396204419c7b638d38f to your computer and use it in GitHub Desktop.
Partial Correlation in Python (clone of Matlab's partialcorr)
"""
Partial Correlation in Python (clone of Matlab's partialcorr)
This uses the linear regression approach to compute the partial
correlation (might be slow for a huge number of variables). The
algorithm is detailed here:
http://en.wikipedia.org/wiki/Partial_correlation#Using_linear_regression
Taking X and Y two variables of interest and Z the matrix with all the variable minus {X, Y},
the algorithm can be summarized as
1) perform a normal linear least-squares regression with X as the target and Z as the predictor
2) calculate the residuals in Step #1
3) perform a normal linear least-squares regression with Y as the target and Z as the predictor
4) calculate the residuals in Step #3
5) calculate the correlation coefficient between the residuals from Steps #2 and #4;
The result is the partial correlation between X and Y while controlling for the effect of Z
Date: Nov 2014
Author: Fabian Pedregosa-Izquierdo, f@bianp.net
Testing: Valentina Borghesani, valentinaborghesani@gmail.com
"""
import numpy as np
from scipy import stats, linalg
def partial_corr(C):
"""
Returns the sample linear partial correlation coefficients between pairs of variables in C, controlling
for the remaining variables in C.
Parameters
----------
C : array-like, shape (n, p)
Array with the different variables. Each column of C is taken as a variable
Returns
-------
P : array-like, shape (p, p)
P[i, j] contains the partial correlation of C[:, i] and C[:, j] controlling
for the remaining variables in C.
"""
C = np.asarray(C)
p = C.shape[1]
P_corr = np.zeros((p, p), dtype=np.float)
for i in range(p):
P_corr[i, i] = 1
for j in range(i+1, p):
idx = np.ones(p, dtype=np.bool)
idx[i] = False
idx[j] = False
beta_i = linalg.lstsq(C[:, idx], C[:, j])[0]
beta_j = linalg.lstsq(C[:, idx], C[:, i])[0]
res_j = C[:, j] - C[:, idx].dot( beta_i)
res_i = C[:, i] - C[:, idx].dot(beta_j)
corr = stats.pearsonr(res_i, res_j)[0]
P_corr[i, j] = corr
P_corr[j, i] = corr
return P_corr
@jwcarr
Copy link

jwcarr commented Aug 22, 2015

Thanks for posting this! I couldn't find any other Python implementations of the partial correlation. However, I'm a little confused about the results I'm getting. Should this function be equivalent to ppcor in R? (see here: https://cran.r-project.org/web/packages/ppcor/index.html)

Given this data:

>>> data = np.array([[  0.81214374,   0.75793651,   0.        ],
       [  1.8964884 ,   0.75880456,   1.        ],
       [  4.44113951,   0.73611111,   2.        ],
       [  6.07757503,   0.68043155,   3.        ],
       [  3.4540138 ,   0.57291667,   4.        ],
       [  9.06033523,   0.4609375 ,   5.        ],
       [  4.52278321,   0.546875  ,   6.        ],
       [  3.96472532,   0.47842262,   7.        ],
       [  3.23761437,   0.40190972,   8.        ],
       [  2.94132016,   0.46961806,   9.        ],
       [  0.81405421,   0.77777778,   0.        ],
       [  2.59633893,   0.73177083,   1.        ],
       [  8.74133555,   0.640625  ,   2.        ],
       [  4.91243565,   0.60243056,   3.        ],
       [  6.52566722,   0.61631944,   4.        ],
       [  9.14605222,   0.53888889,   5.        ],
       [  8.35901455,   0.49878472,   6.        ],
       [ 10.51921626,   0.5       ,   7.        ],
       [ 11.56800923,   0.45486111,   8.        ],
       [  7.5933584 ,   0.55902778,   9.        ],
       [  0.81296971,   0.85329861,   0.        ],
       [ -1.26243957,   0.88020833,   1.        ],
       [  6.27671297,   0.68402778,   2.        ],
       [ 13.72624297,   0.63541667,   3.        ],
       [  3.90786197,   0.63020833,   4.        ],
       [  1.79030896,   0.65674603,   5.        ],
       [  3.51589642,   0.52876984,   6.        ],
       [  5.1629781 ,   0.60069444,   7.        ],
       [  9.10774469,   0.52256944,   8.        ],
       [  4.66882455,   0.56339286,   9.        ],
       [  0.81752733,   0.80381944,   0.        ],
       [  1.20513141,   0.69201389,   1.        ],
       [  3.18574928,   0.59002149,   2.        ],
       [  7.47468745,   0.57589211,   3.        ],
       [ 10.93493755,   0.44583333,   4.        ],
       [  9.39221578,   0.4549979 ,   5.        ],
       [ 11.51554589,   0.5292298 ,   6.        ],
       [  9.0335466 ,   0.3671875 ,   7.        ],
       [ 13.11844822,   0.30989583,   8.        ],
       [  9.00946829,   0.42534722,   9.        ]])

I get these results:

>>> partial_corr(data)
array([[ 1.        ,  0.25363885,  0.67638027],
       [ 0.25363885,  1.        ,  0.17226437],
       [ 0.67638027,  0.17226437,  1.        ]])

However, using ppcor in R, I get these results for the same data:

> pcor(data)
$estimate
           [,1]       [,2]       [,3]
[1,]  1.0000000 -0.5434100 -0.1407695
[2,] -0.5434100  1.0000000 -0.7620759
[3,] -0.1407695 -0.7620759  1.0000000

Any ideas? (It's quite possible I'm doing something pretty stupid!)

Thanks!

@snarkmaster
Copy link

@jwcarr -- I think you might need to add a column of 1s to include an intercept in the model, via np.column_stack([data, np.ones(data.shape[0])]) or similar. I'm not familiar with Matlab or R, but given your example, I would guess that R includes an intercept by default, and Matlab does not.

Copy link

ghost commented May 4, 2016

Sometimes R and Matlab libraries standardize your data by default. For this problem, adding the intercept column and standardizing the data have the same effect (due to removing the mean out of the data).

Regular:
>partial_corr(data) array([[ 1. , 0.25363885, 0.67638027], [ 0.25363885, 1. , 0.17226437], [ 0.67638027, 0.17226437, 1. ]])

Intercept:
data_int = np.hstack((np.ones((data.shape[0],1)), data)) partial_corr(data_int)[1:, 1:] array([[ 1. , -0.54341003, -0.14076948], [-0.54341003, 1. , -0.76207595], [-0.14076948, -0.76207595, 1. ]])

Standardized:
data_std = data.copy() data_std -= data.mean(axis=0)[np.newaxis, :] data_std /= data.std(axis=0)[np.newaxis, :] partial_corr(data_std) array([[ 1. , -0.54341003, -0.14076948], [-0.54341003, 1. , -0.76207595], [-0.14076948, -0.76207595, 1. ]])

@mw3i
Copy link

mw3i commented Mar 29, 2018

maybe they should add this to the scipy library

@rubenSaro
Copy link

there is a mistake in line 61 and 62, the subscripts of the betas are incorrect

@xiecong
Copy link

xiecong commented Dec 13, 2019

@rubenSaro I guess it is correct since it is consistent with line 58 and 59. Although the subscripts are actually confusion.

@seralouk
Copy link

seralouk commented Mar 9, 2020

Why not just -np.linalg.inv(np.corrcoef(C.T)) ? except if you really need the linear regression method.

In [24]: C = np.random.normal(0,1,(1000,5))

In [25]: partial_corr(C)
Out[25]:
array([[ 1.        ,  0.04377477,  0.05926928, -0.0048639 , -0.00949965],
       [ 0.04377477,  1.        , -0.02458582, -0.00286263,  0.00101031],
       [ 0.05926928, -0.02458582,  1.        ,  0.00670762, -0.04408118],
       [-0.0048639 , -0.00286263,  0.00670762,  1.        ,  0.02981604],
       [-0.00949965,  0.00101031, -0.04408118,  0.02981604,  1.        ]])

In [26]: -np.linalg.inv(np.corrcoef(C.T))
Out[26]:
array([[-1.00550451,  0.04393255,  0.05962884, -0.00486145, -0.009527  ],
       [ 0.04393255, -1.0024175 , -0.02466733, -0.00284463,  0.00102981],
       [ 0.05962884, -0.02466733, -1.0060574 ,  0.0067103 , -0.04429945],
       [-0.00486145, -0.00284463,  0.0067103 , -1.00095072,  0.0298542 ],
       [-0.009527  ,  0.00102981, -0.04429945,  0.0298542 , -1.00297644]])

@jfsantos-ds
Copy link

Why not just -np.linalg.inv(np.corrcoef(C.T)) ? except if you really need the linear regression method.

In [24]: C = np.random.normal(0,1,(1000,5))

In [25]: partial_corr(C)
Out[25]:
array([[ 1.        ,  0.04377477,  0.05926928, -0.0048639 , -0.00949965],
       [ 0.04377477,  1.        , -0.02458582, -0.00286263,  0.00101031],
       [ 0.05926928, -0.02458582,  1.        ,  0.00670762, -0.04408118],
       [-0.0048639 , -0.00286263,  0.00670762,  1.        ,  0.02981604],
       [-0.00949965,  0.00101031, -0.04408118,  0.02981604,  1.        ]])

In [26]: -np.linalg.inv(np.corrcoef(C.T))
Out[26]:
array([[-1.00550451,  0.04393255,  0.05962884, -0.00486145, -0.009527  ],
       [ 0.04393255, -1.0024175 , -0.02466733, -0.00284463,  0.00102981],
       [ 0.05962884, -0.02466733, -1.0060574 ,  0.0067103 , -0.04429945],
       [-0.00486145, -0.00284463,  0.0067103 , -1.00095072,  0.0298542 ],
       [-0.009527  ,  0.00102981, -0.04429945,  0.0298542 , -1.00297644]])

In your opinion what do partial coefficients bigger than 1 in absolute signify?

@seirios
Copy link

seirios commented Dec 7, 2021

If you do this, it works:

def pcor(data):
    X = -np.linalg.inv(np.cov(data.T))
    stdev = np.sqrt(np.abs(np.diag(X)))
    X /= stdev[:, None]
    X /= stdev[None, :]
    np.fill_diagonal(X, 1.0)
    return X

For the original data this gives:

[[ 1.         -0.54341003 -0.14076948]
 [-0.54341003  1.         -0.76207595]
 [-0.14076948 -0.76207595  1.        ]]

@yhpan
Copy link

yhpan commented Dec 20, 2021

C = np.asarray(C)
p = C.shape[1]
P_corr = np.zeros((p, p), dtype=np.float) # sample linear partial correlation coefficients

corr = np.corrcoef(C,rowvar=False) # Pearson product-moment correlation coefficients.
corr_inv = inv(corr) # the (multiplicative) inverse of a matrix.

for i in range(p):
    P_corr[i, i] = 1
    for j in range(i+1, p):
        pcorr_ij = -corr_inv[i,j]/(np.sqrt(corr_inv[i,i]*corr_inv[j,j]))
        # idx = np.ones(p, dtype=np.bool)
        # idx[i] = False
        # idx[j] = False
        # beta_i = linalg.lstsq(C[:, idx], C[:, j])[0]
        # beta_j = linalg.lstsq(C[:, idx], C[:, i])[0]

        # res_j = C[:, j] - C[:, idx].dot( beta_i)
        # res_i = C[:, i] - C[:, idx].dot(beta_j)
        
        # corr = stats.pearsonr(res_i, res_j)[0]
        # P_corr[i, j] = corr
        # P_corr[j, i] = corr
        P_corr[i,j]=pcorr_ij
        P_corr[j,i]=pcorr_ij
    
return P_corr

by this you can get the result same as in R 'pcor'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment