Last active
January 11, 2018 17:37
-
-
Save jacoblevine/2e91ff5ccad671d4ba716c3f941ec296 to your computer and use it in GitHub Desktop.
Correspondence Analysis
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import numpy as np | |
def correspondence_analysis(x): | |
"""Correspondence Analysis on frequency table x | |
as described in Härdle & Simar (2012) "Applied Multivariate Statistical Analysis" | |
Parameters | |
---------- | |
x : np.ndarray | |
n-by-m table of counts or frequencies of m categories in n samples | |
Returns | |
------- | |
r : np.ndarray | |
n-by-k array of row (sample) projections. k = min(n, m) | |
s : np.ndarray | |
m-by-k array of column (category) projections | |
percent_explained : np.ndarray | |
length k array of cumulative percent explained by each CA component | |
""" | |
# Prepare chi-squared matrix | |
a, b = x.sum(axis=1), x.sum(axis=0) # row sums, column sums | |
e = np.outer(a, b) / x.sum() # expected value of each entry in x is the normalized outer product of row and column sums | |
cc = (x - e) / e ** .5 # the chi-squared matrix | |
# SVD of the chi-squared matrix | |
# columns of g are eigenvectors of cc*cc.T | |
# rows of d are eigenvectors of cc.T*cc | |
g, l, d = np.linalg.svd(cc) | |
k = len(l) | |
r = l * g[:, :k] / np.sqrt(a).reshape(-1, 1) # row projections | |
s = l * d[:k].T / np.sqrt(b).reshape(-1, 1) # column projections | |
percent_explained = (l ** 2).cumsum() / (l ** 2).sum() | |
return r, s, percent_explained |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment