@makispl
Created February 22, 2021 19:49
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Work on a copy of the labeled dataframe
df_no_nuls_2 = df_no_nuls.copy()
# Randomise the df
shuffled_rows = np.random.permutation(df_no_nuls_2.index)
df_no_nuls_2 = df_no_nuls_2.loc[shuffled_rows]
# Split into train (first 80% of the shuffled rows) and test (remaining 20%) datasets
train = df_no_nuls_2.iloc[:int(df_no_nuls_2.shape[0]*0.8)].copy()
test = df_no_nuls_2.iloc[int(df_no_nuls_2.shape[0]*0.8):].copy().reset_index()
# Keep only the numerical feature columns used by the ML algorithm
train_data = train[['rating', 'alcohol', 'age']].copy()
test_data = test[['rating', 'alcohol', 'age']].copy()
# List the unique clusters
unique_clusters = train['cluster'].unique()
unique_clusters.sort()
models = {}
# Train each binary classification model
for cluster in unique_clusters:
    X = train[['rating', 'alcohol', 'age']].copy()
    y = train['cluster'] == cluster
    model = LogisticRegression()
    model.fit(X, y)
    models[cluster] = model
testing_probs = pd.DataFrame(columns=unique_clusters)
# Test the models
for cluster in unique_clusters:
    X_test = test[['rating', 'alcohol', 'age']].copy()
    testing_probs[cluster] = models[cluster].predict_proba(X_test)[:,1]
# Label the new data
test['pred_cluster'] = testing_probs.idxmax(axis=1)
# Evaluate the model: fraction of test rows whose predicted cluster matches the true one
accuracy = (test['cluster'] == test['pred_cluster']).sum() / test.shape[0]
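
The shuffle-and-split step above calls `np.random.permutation` without a seed, so the train/test split (and therefore the accuracy) changes on every run. A minimal sketch of a reproducible variant follows. The helper name `shuffle_split`, its `seed` parameter, and the toy dataframe are assumptions made for illustration and are not part of the original gist.

```python
import numpy as np
import pandas as pd


def shuffle_split(df, train_frac=0.8, seed=0):
    """Shuffle rows reproducibly and split them into train and test frames.

    Hypothetical helper: mirrors the gist's cut-off rule but uses a seeded generator.
    """
    rng = np.random.default_rng(seed)             # seeded generator: same shuffle every run
    shuffled = df.iloc[rng.permutation(len(df))]  # reorder rows by shuffled positions
    cut = int(len(shuffled) * train_frac)         # same int(n * 0.8) cut-off as the gist
    train = shuffled.iloc[:cut].copy()
    test = shuffled.iloc[cut:].copy()
    return train, test


# Toy stand-in for df_no_nuls (values are made up for illustration only)
toy = pd.DataFrame({
    "rating": np.linspace(3.0, 4.8, 10),
    "alcohol": np.linspace(4.0, 9.0, 10),
    "age": np.arange(10),
    "cluster": [0, 1] * 5,
})
train_toy, test_toy = shuffle_split(toy, train_frac=0.8, seed=42)
```

Calling `shuffle_split` again with the same `seed` returns the same rows in the same order, which makes the reported accuracy comparable between runs.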