Compute Pandas Correlation Matrix of a Spark Data Frame
from pyspark.mllib.stat import Statistics
import pandas as pd


# The result can be passed to seaborn's heatmap.
def compute_correlation_matrix(df, method='pearson'):
    # Wrapper around the approach described at
    # https://forums.databricks.com/questions/3092/how-to-calculate-correlation-matrix-with-all-colum.html
    # Assumes every column of df is numeric.
    # Convert each Row to a plain tuple of its column values.
    df_rdd = df.rdd.map(lambda row: row[0:])
    # method may be 'pearson' or 'spearman'.
    corr_mat = Statistics.corr(df_rdd, method=method)
    # Label rows and columns with the Spark column names.
    corr_mat_df = pd.DataFrame(corr_mat,
                               columns=df.columns,
                               index=df.columns)
    return corr_mat_df
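
For a quick sanity check of the values, the Pearson coefficients can be reproduced on a small in-memory sample with plain Python. This is a minimal sketch, assuming every column is numeric and non-constant; `pearson_matrix` and the `sample` data are hypothetical names for illustration and are not part of the gist.

```python
import math


def pearson_matrix(columns):
    """Pearson correlation matrix for a dict of equal-length numeric columns.

    Hypothetical helper for illustration; returns (names, matrix) where
    matrix[i][j] is the correlation between names[i] and names[j].
    """
    names = list(columns)
    n = len(columns[names[0]])
    # Center each column on its mean.
    centered = {}
    for name in names:
        mean = sum(columns[name]) / n
        centered[name] = [value - mean for value in columns[name]]
    # Euclidean norm of each centered column (zero for a constant column).
    norms = {name: math.sqrt(sum(d * d for d in centered[name])) for name in names}
    matrix = []
    for a in names:
        row = []
        for b in names:
            dot = sum(x * y for x, y in zip(centered[a], centered[b]))
            row.append(dot / (norms[a] * norms[b]))
        matrix.append(row)
    return names, matrix


# Made-up sample: y rises with x, z falls as x rises.
sample = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
names, matrix = pearson_matrix(sample)
```

Running `compute_correlation_matrix` with `method='pearson'` on a Spark DataFrame holding the same columns should give matching values up to floating-point error, with the column names as both index and columns.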