@astoeckl
Last active November 28, 2022 04:17
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from tqdm.notebook import tqdm

# matrix: 2-D array of embedding vectors, one row per document
X = matrix
# Silhouette score for each candidate number of clusters, K = 6..24
cluster_results_km = pd.DataFrame({'K': range(6, 25), 'SIL': np.nan})
cluster_results_km.set_index('K', inplace=True)
for k in tqdm(cluster_results_km.index):
    km_model = KMeans(n_clusters=k, init='k-means++', random_state=42)
    y = km_model.fit_predict(X)
    cluster_results_km.loc[k, 'SIL'] = silhouette_score(X, y)
# Best K and its silhouette score
cluster_results_km.idxmax(), cluster_results_km.max()

gsw101 commented Nov 25, 2022

Where does the `matrix` value in `X = matrix` come from? I am confused after reading your article: https://towardsdatascience.com/clustering-the-20-newsgroups-dataset-with-gpt3-embeddings-10411a9ad150

astoeckl (Author) commented

> Where does the `matrix` value in `X = matrix` come from? I am confused after reading your article: https://towardsdatascience.com/clustering-the-20-newsgroups-dataset-with-gpt3-embeddings-10411a9ad150

It is the matrix built from the embedding vectors downloaded from GPT-3.


gsw101 commented Nov 28, 2022

@astoeckl I am still learning. Is the matrix built from the embedding vectors the new column generated in the dataframe, df_news['babbage_similarity']? Do I pass only this column in place of matrix? If so, I tried the code below and got the error "ValueError: setting an array element with a sequence" from the line y = km_model.fit_predict(X) in the for loop.

Code replacing the matrix variable:
X = df_news[['babbage_similarity']]
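
The error above is consistent with each cell of the column holding a whole embedding vector (a list), so `df_news[['babbage_similarity']]` is a one-column table of objects rather than a numeric matrix. A minimal sketch of one way to build `matrix` follows. It assumes the column stores one list per row, or its string form after a CSV round trip. The helper `embeddings_to_matrix` and the toy data are hypothetical and do not come from the gist or the article.

```python
import ast

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def embeddings_to_matrix(column: pd.Series) -> np.ndarray:
    """Stack one embedding vector per row into a 2-D float array.

    Hypothetical helper, not part of the gist. Cells may be lists/arrays,
    or their string form (for example after saving to and loading from CSV).
    """
    vectors = [
        ast.literal_eval(cell) if isinstance(cell, str) else cell
        for cell in column
    ]
    return np.vstack(vectors).astype(float)


# Toy stand-in for df_news: 3-dimensional vectors instead of real embeddings.
df_news = pd.DataFrame({
    'babbage_similarity': [
        [0.0, 0.1, 0.0],
        [0.1, 0.0, 0.1],
        [5.0, 5.1, 5.0],
        '[5.1, 5.0, 5.1]',  # string form, as read back from a CSV file
    ]
})

# One row per document, one column per embedding dimension.
matrix = embeddings_to_matrix(df_news['babbage_similarity'])
print(matrix.shape)

# The numeric 2-D array can now be passed to KMeans as X.
X = matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
```

With real embeddings the same call would replace `matrix` in the gist, leaving the rest of the loop unchanged.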
