@cameres
Last active November 22, 2022 14:19
Compute Pandas Correlation Matrix of a Spark Data Frame
from pyspark.mllib.stat import Statistics
import pandas as pd


# result can be used w/ seaborn's heatmap
def compute_correlation_matrix(df, method='pearson'):
    # wrapper around
    # https://forums.databricks.com/questions/3092/how-to-calculate-correlation-matrix-with-all-colum.html
    # every column of df must be numeric; method is 'pearson' or 'spearman'
    df_rdd = df.rdd.map(lambda row: row[0:])
    corr_mat = Statistics.corr(df_rdd, method=method)
    corr_mat_df = pd.DataFrame(corr_mat,
                               columns=df.columns,
                               index=df.columns)
    return corr_mat_df
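
To make the shape of the result concrete, here is a minimal pure-Python sketch of a Pearson correlation matrix keyed by column name, the same layout the pandas frame above uses for its index and columns. It does not touch Spark or pandas. `pearson_matrix` and the `toy` data are hypothetical names made up for illustration, and the sketch assumes equal-length columns with non-zero variance.

```python
import math


def pearson_matrix(columns):
    """Return {name: {name: r}} for a dict of equal-length numeric columns.

    Hypothetical helper for illustration only; not part of Spark or pandas.
    """
    names = list(columns)

    def pearson(xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # Sums of centered products; the 1/n factors cancel in the ratio.
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        # Assumes neither column is constant (var_x and var_y are non-zero).
        return cov / math.sqrt(var_x * var_y)

    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up data: y rises with x, z falls as x rises.
toy = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
matrix = pearson_matrix(toy)
```

Each row and column label is a column name, the diagonal is 1, and the matrix is symmetric, which is why it can be passed straight to a heatmap.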
@juhotuho10

Thank you! Works perfectly. I can't believe they don't have a built-in method to handle DataFrame -> DataFrame correlation tables.
