Last active
November 28, 2022 04:17
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from tqdm.notebook import tqdm

# matrix: the embedding vectors (built beforehand), one row per document
X = matrix
# One silhouette score per candidate number of clusters, K = 6..24
cluster_results_km = pd.DataFrame({'K': range(6, 25), 'SIL': np.nan})
cluster_results_km.set_index('K', inplace=True)
for k in tqdm(cluster_results_km.index):
    km_model = KMeans(n_clusters=k, init='k-means++', random_state=42)
    y = km_model.fit_predict(X)
    cluster_results_km.loc[k, 'SIL'] = silhouette_score(X, y)
# Best K and its silhouette score
cluster_results_km.idxmax(), cluster_results_km.max()
Where are you getting the matrix value in X = matrix from? I am confused after reading your article: https://towardsdatascience.com/clustering-the-20-newsgroups-dataset-with-gpt3-embeddings-10411a9ad150
It is the matrix built from the embedding vectors downloaded from GPT-3.
@astoeckl I am still learning. Is the matrix built from the embedding vectors, i.e. the new column generated in the dataframe, df_news['babbage_similarity']? Do I pass only this column in place of matrix? If so, I tried the code below and got the error "ValueError: setting an array element with a sequence" from this line in the for loop: y = km_model.fit_predict(X).
Code replacing the matrix variable:
X = df_news[['babbage_similarity']]
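A likely cause of that error, sketched under assumptions: if each cell of df_news['babbage_similarity'] holds a whole embedding vector (a list of floats), then df_news[['babbage_similarity']] is a one-column frame of list objects, and scikit-learn cannot turn it into a numeric array. Stacking the lists into a 2-D array should give something usable as matrix. The small df_news below is a made-up stand-in, not the real data, and the 3-dimensional vectors are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_news: each cell holds one embedding vector
# (a list of floats). Real embeddings are much longer than 3 values.
df_news = pd.DataFrame({
    "babbage_similarity": [
        [0.1, 0.2, 0.3],
        [0.0, 0.5, 0.1],
        [0.9, 0.4, 0.7],
        [0.3, 0.3, 0.3],
    ]
})

# Turn the column of lists into a 2-D float array of shape
# (n_documents, embedding_dim), which is what KMeans.fit_predict expects.
matrix = np.array(df_news["babbage_similarity"].tolist(), dtype=float)

print(matrix.shape)
```

If the dataframe was saved to CSV and reloaded, the cells may be strings rather than lists; in that case they would need to be parsed back into lists first (for example with ast.literal_eval) before stacking.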