
@LittleWat
Created April 11, 2017 08:57
Script that computes the cosine similarity between two probability distributions stored as pandas DataFrames. ref: http://qiita.com/LittleWat/items/259354f1364b72f27043
def calc_cos_mat(mat_df1, mat_df2):
    """Compute the cosine similarity between the rows of two DataFrames.

    Args:
        mat_df1 (pd.DataFrame): one distribution per row.
        mat_df2 (pd.DataFrame): one distribution per row.
            mat_df1 and mat_df2 must have the same number of columns!

    Returns:
        pd.DataFrame: cosine similarity matrix, with mat_df1.index as rows
            and mat_df2.index as columns.
    """
    import pandas as pd
    from sklearn.preprocessing import normalize

    assert isinstance(mat_df1, pd.DataFrame)
    assert isinstance(mat_df2, pd.DataFrame)
    assert mat_df1.shape[1] == mat_df2.shape[1]

    # L2-normalize each row so that the dot product of two rows is their cosine similarity.
    normalized_mat_df1 = pd.DataFrame(normalize(mat_df1), index=mat_df1.index)
    normalized_mat_df2 = pd.DataFrame(normalize(mat_df2), index=mat_df2.index)
    return normalized_mat_df1.dot(normalized_mat_df2.T)
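
As a sanity check on what calc_cos_mat computes, the sketch below repeats the same two steps (L2-normalize each row, then take dot products) using NumPy in place of sklearn's normalize. The function name cos_mat_numpy and the toy data are illustrative assumptions, not part of the original gist.

```python
import numpy as np
import pandas as pd


def cos_mat_numpy(mat_df1, mat_df2):
    """Cosine similarity between every row of mat_df1 and every row of mat_df2."""
    a = mat_df1.to_numpy(dtype=float)
    b = mat_df2.to_numpy(dtype=float)
    # L2 norm of each row; all-zero rows are divided by 1 so they stay zero.
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    a_unit = a / np.where(a_norm == 0, 1.0, a_norm)
    b_unit = b / np.where(b_norm == 0, 1.0, b_norm)
    # Dot products of unit vectors are cosine similarities.
    return pd.DataFrame(a_unit @ b_unit.T, index=mat_df1.index, columns=mat_df2.index)


# Toy distributions, made up for illustration.
df1 = pd.DataFrame([[1.0, 0.0], [0.5, 0.5]], index=["a", "b"])
df2 = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]], index=["x", "y"])
cos_df = cos_mat_numpy(df1, df2)
print(cos_df)
```

Row "a" is identical to "x" and orthogonal to "y", while row "b" sits at 45 degrees to both, so its two similarities should be equal.
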
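def get_sorted_mats(mat_df):
    """Reorder the columns of each row in descending order of value.

    Args:
        mat_df (pd.DataFrame): matrix to sort row by row.

    Returns:
        tuple of pd.DataFrame: (sorted_column_mat, probability_mat), where
            sorted_column_mat holds the column labels of mat_df in sorted
            order for each row and probability_mat holds the matching values.
    """
    import pandas as pd

    jan_rows = []
    prob_rows = []
    for idx, row in mat_df.iterrows():
        # Sort once per row and reuse the result for both labels and values.
        sorted_row = row.sort_values(ascending=False)
        jan_rows.append(list(sorted_row.index))
        prob_rows.append(list(sorted_row.values))
    # Build each result in one step instead of concatenating inside the loop.
    jan_mat = pd.DataFrame(jan_rows, index=mat_df.index)
    prob_mat = pd.DataFrame(prob_rows, index=mat_df.index)
    return jan_mat, prob_mat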
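
The per-row sort in get_sorted_mats can also be written without a Python loop. The sketch below is one possible vectorized variant that swaps in np.argsort for DataFrame.sort_values; the name sorted_mats_argsort and the sample data are assumptions for illustration, and the ordering of tied values may differ from the loop version.

```python
import numpy as np
import pandas as pd


def sorted_mats_argsort(mat_df):
    """Return (labels, values) with each row ordered from largest to smallest value."""
    values = mat_df.to_numpy(dtype=float)
    # Column positions for each row, ordered by descending value.
    order = np.argsort(-values, axis=1, kind="stable")
    # Look up the column labels and the values in that order.
    labels = mat_df.columns.to_numpy()[order]
    probs = np.take_along_axis(values, order, axis=1)
    return (
        pd.DataFrame(labels, index=mat_df.index),
        pd.DataFrame(probs, index=mat_df.index),
    )


# Sample matrix, made up for illustration: two rows over three labelled columns.
topic_df = pd.DataFrame(
    [[0.2, 0.7, 0.1], [0.5, 0.1, 0.4]],
    index=["t0", "t1"],
    columns=["p", "q", "r"],
)
label_mat, prob_mat = sorted_mats_argsort(topic_df)
print(label_mat)
print(prob_mat)
```

For row "t0" the labels come out as q, p, r with values 0.7, 0.2, 0.1, which is the same layout get_sorted_mats returns: positional column numbers, with the original row labels kept as the index.
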