@luis261
Last active March 23, 2024 21:16
{
"metadata": {
"kernelspec": {
"name": "python",
"display_name": "Python (Pyodide)",
"language": "python"
},
"language_info": {
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8"
}
},
"nbformat_minor": 5,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"source": "# Discrepancies between clustering metrics and \"ideal\" fits for non-spherical clusters\nThis notebook contains the setup for producing an illustration showing how inadequate the conventional clustering validity metrics (as in: popular metrics, such as SWC, Calinski-Harabasz, Davies-Bouldin, which all implicitly work with definitions/assumptions of spherical/centroid-based clusters) can be when applied to datasets not consisting of spherical clusters.",
"metadata": {},
"id": "b25fdcc0-b6be-4c33-8b1b-60d4846ada8e"
},
{
"cell_type": "markdown",
"source": "start off by importing libs and defining basic consts",
"metadata": {},
"id": "643b93cc-a26c-4759-a2ed-b82db739e0c5"
},
{
"cell_type": "code",
"source": "import matplotlib.pyplot as plt\nimport numpy as np\nimport random\nfrom sklearn.datasets import make_blobs\nfrom sklearn.metrics import silhouette_samples\nfrom sklearn.preprocessing import MinMaxScaler\n\n\nN_CLUSTERS = 2\nNOISE_LABEL = -1\nNOISE_PER_CLUSTER_FACTOR = 10\nPLOTTING_MARKER_SIZE = 1\n\n\nis_noise = lambda label : label == NOISE_LABEL",
"metadata": {
"trusted": true
},
"execution_count": 6,
"outputs": [],
"id": "c534aef5-7d07-4f05-8436-1fd52084570d"
},
{
"cell_type": "markdown",
"source": "simplified 2-cluster generation, adapted from: https://github.com/luis261/clustering-validity-comparison/blob/main/generate_benchmarking_suite.py",
"metadata": {},
"id": "373da2ed-0eb6-49ee-8395-e811704f1e63"
},
{
"cell_type": "code",
"source": "def generate_dataset(cluster_size=400, spherical=False):\n data, labels = generate_clusters(N_CLUSTERS, cluster_size, spherical=spherical)\n\n bounds = ((-25, 25), (-25, 25))\n noise_count = NOISE_PER_CLUSTER_FACTOR * N_CLUSTERS\n data = np.concatenate((data, generate_noise(noise_count, bounds, data)))\n noise_labels = np.array([NOISE_LABEL for _ in range(noise_count)])\n labels = np.concatenate((labels, noise_labels))\n\n return (MinMaxScaler().fit_transform(data), labels)\n\ndef generate_noise(count, bounds, data, distance=2.0):\n noise = []\n while len(noise) < count:\n gen = [random.uniform(bounds[0][0], bounds[0][1]), random.uniform(bounds[1][0], bounds[1][1])]\n if point_far_enough_from_points(gen, data, distance):\n noise.append(gen)\n\n return np.array([np.array(x) for x in noise])\n\ndef generate_clusters(n_clusters, cluster_size, spherical=False):\n clusters = []\n while len(clusters) < n_clusters:\n overlap = True\n data, _ = make_blobs(n_samples=cluster_size,\n centers=1, center_box=(0, 0))\n\n if not spherical:\n matrix = generate_transformation_matrix()\n for i in range(0, data.shape[0]):\n data[i] = np.matmul(matrix, data[i])\n\n x_offset = random.uniform(-5.0, 5.0)\n for i in range(0, data.shape[0]):\n data[i][0] += x_offset\n\n y_offset = 0.0\n while overlap:\n if y_offset > 0.0:\n for i in range(0, data.shape[0]):\n data[i][1] += y_offset\n\n overlap = False\n for cluster in clusters:\n if clusters_are_too_close(cluster, data, 2.0):\n overlap = True\n break\n\n y_offset = 0.8\n\n clusters.append(data)\n\n data = clusters[0]\n labels = np.array([0 for _ in range(clusters[0].shape[0])])\n for i in range(1, len(clusters)):\n data = np.concatenate((data, clusters[i]))\n labels = np.concatenate((labels, [i for _ in range(clusters[i].shape[0])]))\n\n return (data, labels)\n\ndef clusters_are_too_close(points0, points1, distance):\n for i in range(points0.shape[0]):\n if not point_far_enough_from_points(points0[i], points1, distance):\n return 
True\n\n return False\n\n# uses euclidean distance\ndef point_far_enough_from_points(point, points, distance):\n for i in range(points.shape[0]):\n if np.linalg.norm(points[i] - point) < distance:\n return False\n\n return True\n\ndef generate_transformation_matrix(shape_modifier=5.0):\n return [[shape_modifier, 0],\n [0, 1/shape_modifier]]",
"metadata": {
"trusted": true
},
"execution_count": 4,
"outputs": [],
"id": "96fb1900-43b6-4577-bc0a-7ec58715cede"
},
{
"cell_type": "markdown",
"source": "run the experiment",
"metadata": {},
"id": "b49117d0-b3bd-408d-9b80-128bd80b69fa"
},
{
"cell_type": "code",
"source": "data, labels = generate_dataset()\n\nnon_noise_indices = [i for i in range(len(labels)) if not is_noise(labels[i])]\nscores = silhouette_samples(data[non_noise_indices], labels[non_noise_indices])\n\nplt.figure()\nnon_noise_pts = []\nfor idx, point in enumerate(data):\n if idx in non_noise_indices:\n non_noise_pts.append(point)\n continue\n\n plt.scatter(*point, s=PLOTTING_MARKER_SIZE, c=\"#000000\")\n\nkwargs = {\"s\": PLOTTING_MARKER_SIZE, \"c\": scores[non_noise_indices]}\nplt.scatter(*np.swapaxes(non_noise_pts, 0, 1), **kwargs)\nplt.colorbar()\nplt.show()",
"metadata": {
"trusted": true
},
"execution_count": 5,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 640x480 with 2 Axes>",
"image/png": ""
},
"metadata": {}
}
],
"id": "ff9d1051-2978-4050-87a4-548dc4e0af8d"
},
{
"cell_type": "markdown",
"source": "add `, \"cmap\": \"Spectral\"` to the dict of kwargs and zoom into the resulting plot if you want to make the effect of the implicit spherical cluster definition of the Silhouette Coefficient even more apparent",
"metadata": {},
"id": "afe30b4e-6f0d-4b24-a744-f7ca44ae5cf2"
}
]
}