I ran the following code to reduce dimensionality with unsupervised UMAP and then cluster with DBSCAN. The goal is to group the differing names used for the same academic journal into clusters, one per journal.
The UMAP code is:
# Step 2: Dimension Reduction with UMAP
import umap  # provided by the umap-learn package

reducer = umap.UMAP()  # default settings reduce to 2 dimensions
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
The DBSCAN clustering is:
# Step 3: Clustering with DBSCAN - you can search for the best hyperparameters
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=100)
clusters = dbscan.fit_predict(reduced_embeddings)  # label -1 marks noise points
To describe the clusters, I run:
import numpy as np

np.unique(clusters, return_counts=True)
This returns (in a Jupyter notebook):
(array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]),
array([2332, 7005, 2474, 379, 188, 2381, 3074, 210, 261, 1032, 264,
468, 149, 1955, 1136, 497, 575, 242, 360, 275, 336, 287,
512, 269, 112, 132, 190, 444, 105, 126]))
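When reading this output, note that DBSCAN assigns the label -1 to noise points, so the first count (2332) is unclustered points, not a cluster. A minimal sketch of summarizing the result, with the labels and counts copied from the output above (the variable names are illustrative, not part of the original notebook):

```python
import numpy as np

# Labels and counts copied from the np.unique output above.
labels = np.arange(-1, 29)
counts = np.array([2332, 7005, 2474, 379, 188, 2381, 3074, 210, 261, 1032, 264,
                   468, 149, 1955, 1136, 497, 575, 242, 360, 275, 336, 287,
                   512, 269, 112, 132, 190, 444, 105, 126])

# DBSCAN uses -1 for noise, so exclude it when counting real clusters.
is_noise = labels == -1
n_clusters = int((~is_noise).sum())
n_noise = int(counts[is_noise].sum())
noise_fraction = n_noise / counts.sum()

# ID of the largest real cluster (noise excluded).
largest = int(labels[~is_noise][np.argmax(counts[~is_noise])])

print(f"{n_clusters} clusters, {n_noise} noise points "
      f"({noise_fraction:.1%} of all points); largest cluster ID: {largest}")
```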
These clusters look good. Now I want to plot the data with the seaborn library as a two-dimensional scatter plot, with each point colored by its cluster ID. Please take a deep breath and write code to do this. Include comments so that students can understand what you are doing.
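One possible sketch of such a plot follows. It assumes `reduced_embeddings` has two columns (the UMAP default) and that `clusters` holds one DBSCAN label per row. The helper name `plot_clusters`, the column names, and the synthetic stand-in data are illustrative assumptions, not part of the original notebook. `sns.move_legend` requires seaborn 0.11.2 or later.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def plot_clusters(reduced_embeddings, clusters):
    """Scatter-plot 2-D coordinates, colored by cluster ID (hypothetical helper)."""
    # Put coordinates and labels in one DataFrame so seaborn can
    # refer to them by column name.
    df = pd.DataFrame({
        "UMAP 1": reduced_embeddings[:, 0],
        "UMAP 2": reduced_embeddings[:, 1],
        "cluster": clusters,
    })

    # Build one distinct color per cluster ID. Passing a dict palette makes
    # seaborn treat the numeric IDs as categories, not a continuous scale.
    cluster_ids = sorted(df["cluster"].unique().tolist())
    colors = sns.color_palette("husl", n_colors=len(cluster_ids))
    palette = dict(zip(cluster_ids, colors))

    # DBSCAN labels noise points -1; draw them in gray so they recede.
    if -1 in palette:
        palette[-1] = "lightgray"

    fig, ax = plt.subplots(figsize=(12, 9))
    sns.scatterplot(
        data=df, x="UMAP 1", y="UMAP 2",
        hue="cluster", palette=palette,
        legend="full",   # show every cluster ID in the legend
        s=5,             # small markers, since there are many points
        linewidth=0,     # no marker outline
        ax=ax,
    )
    ax.set_title("UMAP projection colored by DBSCAN cluster ID")

    # Move the legend outside the axes so it does not cover the points.
    sns.move_legend(ax, "upper left", bbox_to_anchor=(1.02, 1),
                    ncol=2, title="Cluster ID (-1 = noise)")
    return ax


# Synthetic stand-in data so this block runs on its own (an assumption);
# in the notebook, pass the real reduced_embeddings and clusters instead.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
demo_embeddings = np.vstack(
    [rng.normal(c, 0.5, size=(200, 2)) for c in centers]
    + [rng.uniform(-8, 8, size=(50, 2))]
)
demo_clusters = np.concatenate(
    [np.full(200, k) for k in range(3)] + [np.full(50, -1)]
)

ax = plot_clusters(demo_embeddings, demo_clusters)
```

In a Jupyter notebook the figure renders when the cell finishes; in a plain script, call `plt.show()` afterwards. With the real data, the call would be `plot_clusters(reduced_embeddings, clusters)`. With 29 clusters, neighboring hues can be hard to tell apart, so the legend is mainly useful for spotting the largest groups.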