Skip to content

Instantly share code, notes, and snippets.

@jacoblevine
Last active January 11, 2018 17:37
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save jacoblevine/2e91ff5ccad671d4ba716c3f941ec296 to your computer and use it in GitHub Desktop.
Save jacoblevine/2e91ff5ccad671d4ba716c3f941ec296 to your computer and use it in GitHub Desktop.
Correspondence Analysis
import numpy as np
def correspondence_analysis(x):
"""Correspondence Analysis on frequency table x
as described in Härdle & Simar (2012) "Applied Multivariate Statistical Analysis"
Parameters
----------
x : np.ndarray
n-by-m table of counts or frequencies of m categories in n samples
Returns
-------
r : np.ndarray
n-by-k array of row (sample) projections. k = min(n, m)
s : np.ndarray
m-by-k array of column (category) projections
percent_explained : np.ndarray
length k array of cumulative percent explained by each CA component
"""
# Prepare chi-squared matrix
a, b = x.sum(axis=1), x.sum(axis=0) # row sums, column sums
e = np.outer(a, b) / x.sum() # expected value of each entry in x is the normalized outer product of row and column sums
cc = (x - e) / e ** .5 # the chi-squared matrix
# SVD of the chi-squared matrix
# columns of g are eigenvectors of cc*cc.T
# rows of d are eigenvectors of cc.T*cc
g, l, d = np.linalg.svd(cc)
k = len(l)
r = l * g[:, :k] / np.sqrt(a).reshape(-1, 1) # row projections
s = l * d[:k].T / np.sqrt(b).reshape(-1, 1) # column projections
percent_explained = (l ** 2).cumsum() / (l ** 2).sum()
return r, s, percent_explained
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment