{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n",
"\n",
"# Hierarchical Clustering\n",
"\n",
"Estimated time needed: **25** minutes\n",
"\n",
"## Objectives\n",
"\n",
"After completing this lab you will be able to:\n",
"\n",
"- Use scikit-learn to perform hierarchical clustering\n",
"- Create dendrograms to visualize the clustering\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Table of contents</h1>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ol>\n",
" <li><a href=\"#hierarchical_agglomerative\">Hierarchical Clustering - Agglomerative</a></li>\n",
" <ol>\n",
" <li><a href=\"#generating_data\">Generating Random Data</a></li>\n",
" <li><a href=\"#agglomerative_clustering\">Agglomerative Clustering</a></li>\n",
" <li><a href=\"#dendrogram\">Dendrogram Associated with the Agglomerative Hierarchical Clustering</a></li>\n",
" </ol> \n",
" <li><a href=\"#clustering_vehicle_dataset\">Clustering on the Vehicle Dataset</a></li>\n",
" <ol>\n",
" <li><a href=\"#data_cleaning\">Data Cleaning</a></li>\n",
" <li><a href=\"#clustering_using_scipy\">Clustering Using Scipy</a></li>\n",
" <li><a href=\"#clustering_using_skl\">Clustering using scikit-learn</a></li>\n",
" </ol>\n",
" </ol>\n",
"</div>\n",
"<br>\n",
"<hr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1 id=\"hierarchical_agglomerative\">Hierarchical Clustering - Agglomerative</h1>\n",
"\n",
"This lab covers <b>Agglomerative Hierarchical Clustering</b>. Remember that agglomerative is the bottom-up approach: each point starts as its own cluster, and clusters are merged step by step. <br> <br>\n",
"We focus on agglomerative clustering because it is more popular than divisive clustering. <br> <br>\n",
"We will also be using Complete Linkage as the linkage criterion. <br>\n",
"<b> <i> NOTE: You can also try using Average Linkage wherever Complete Linkage would be used to see the difference! </i> </b>\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np \n",
"import pandas as pd\n",
"from scipy import ndimage \n",
"from scipy.cluster import hierarchy \n",
"from scipy.spatial import distance_matrix \n",
"from matplotlib import pyplot as plt \n",
"from sklearn import manifold, datasets \n",
"from sklearn.cluster import AgglomerativeClustering \n",
"# make_blobs is exposed directly by sklearn.datasets; the old samples_generator module has been removed\n",
"from sklearn.datasets import make_blobs \n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"<h3 id=\"generating_data\">Generating Random Data</h3>\n",
"We will be generating a set of data using the <b>make_blobs</b> function. <br> <br>\n",
"Input these parameters into make_blobs:\n",
"<ul>\n",
" <li> <b>n_samples</b>: The total number of points equally divided among clusters. </li>\n",
" <ul> <li> Choose a number from 10-1500 </li> </ul>\n",
" <li> <b>centers</b>: The number of centers to generate, or the fixed center locations. </li>\n",
" <ul> <li> Choose arrays of x,y coordinates for generating the centers. Have 1-10 centers (ex. centers=[[1,1], [2,5]]) </li> </ul>\n",
" <li> <b>cluster_std</b>: The standard deviation of the clusters. The larger the number, the more spread out the points within each cluster, and the more the clusters overlap.</li>\n",
" <ul> <li> Choose a number between 0.5-1.5 </li> </ul>\n",
"</ul> <br>\n",
"Save the result to <b>X1</b> and <b>y1</b>.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"X1, y1 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the scatter plot of the randomly generated data\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7f09eec42b38>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXIAAAD4CAYAAADxeG0DAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVB0lEQVR4nO3de4ycZ3XH8d8Px8ASiLZVlkvWSR1UZEAxxDCKQi3R4oQ6hShxA6hBBaUFyUICCi01OI1UpIo2llJxkUC0K8JNuElREpuIAInBoIioiVjjkJsTGoVLvAn1IORysxoSTv+YWbMez+zcnvfyzHw/0ire2fH7nux6zzxz3vOc1xEhAEC+nlZ1AACA8ZDIASBzJHIAyByJHAAyRyIHgMydUsVJTz/99Fi/fn0VpwaAbB04cOCnETHX+XgliXz9+vVaXFys4tQAkC3bP+r2eJLSiu1Z2zfYftD2IduvSnFcAEB/qVbkH5P0tYh4o+2nS3pWouMCAPoYO5HbPk3SqyX9lSRFxBOSnhj3uACAwaQorbxQUlPSZ2wftP0p26d2Psn2dtuLthebzWaC0wIApDSJ/BRJr5D0yYjYJOlXknZ2PikiFiKiERGNubmTLroCAEaUokZ+WNLhiLir/fkN6pLIgWmx9+CSrrn1IT129JjOmJ3Rjq0btG3TfNVhYYKNvSKPiJ9IetT2hvZDF0h6YNzjAjnae3BJV950r5aOHlNIWjp6TFfedK/2HlyqOjRMsFQ7O98tabfteySdK+lfEh0XyMo1tz6kY7956oTHjv3mKV1z60MVRYRpkKT9MCLultRIcSwgZ48dPTbU40AKzFoBEjpjdmaox4EUSORAQju2btDM2jUnPDazdo12bN3Q428A46tk1gowqZa7U+haQZlI5EBi2zbNk7hRKkorAJA5EjkAZI5EDgCZI5EDQOZI5ACQORI5AGSORA4AmaOPHOiDsbSoOxI5sIrlsbTLEw2Xx9JKIpmjNiitAKtgLC1ywIocWAVjaYtBuSotVuTAKhhLmx53UUovSSK3/UPb99q+2/ZiimMCdcBY2vQoV6WXsrTymoj4acLjAZVjLG16lKvSo0YO9MFY2rTOmJ3RUpekTblqdKlq5CHpNtsHbG/v9gTb220v2l5sNpuJTgsgN5Sr0ku1It8cEY/Zfq6kfbYfjIjbVz4hIhYkLUhSo9GIROcFkBnKVeklSeQR8Vj7v0ds75F0nqTbV/9bAKYV5aq0xk7ktk+V9LSI+EX7z38q6Z/GjgwAKpJbn3uKFfnzJO2xvXy8/4iIryU4LgCULsexDGMn8oh4RNLLE8QCAJVbrc+9romcnZ0AsEKOfe4kcgBYIcexDCRyAFghxz53dnYCwAo59rmTyIES5NbONu1y63MnkQMFy7GdDXkhkQMFy7GdDekV+a6MRA4ULMd2NqRV9LsyulaAguXYzoa0ir6ZBokcKFiO7WzTZO/BJW3etV9n77xFm3ftL+SWc0W/K6O0AhQsx3a2SdVZp37Ni+d044Glwi9EF30zDUeUPxq80WjE4iK39gRQns46tSRZrbvidJqfndEdO7cUeu6ZtWt09WUbh3rBsH0gIhqdj1NaATAVutWpey1jl44eS1pi2bZpXldftlHzszOyWi8Uwybx1VBaATAVhq1Hpy6xFLnJiBU5gKnQqx7tHs8ft6ukjIuoy0jkAKZCr+6hvzz/rJ5/Z9SukuWa+NLRYwr97iJqUck8WSK3vcb2QdtfTnVMAEilV536Q9taj3UzaldJ0X3jnVLWyN8j6ZCk0xIeEwCS6VWn3rF1Q9euklF7/cvezZskkdteJ+n1kv5Z0t+lOCbQiQmCKErqXv+i+8Y7pVqRf1TS+yU9p9cTbG+XtF2Szjqrd00K6KasCYK8WEyvlF0lqVf4/YxdI7d9saQjEXFgtedFxEJENCKiMTc3N+5pMWXKqDmWfYEKk2vbpnm94ZXzWuNWT8waW294Zb3bDzdLus
T2DyVdL2mL7S8kOC5wXBk1x7IvUGFy7T24pBsPLOmp9s75pyJ044Gl+natRMSVEbEuItZLulzS/oh4y9iRASv0qi2GlKxHN+WLRZk9xCjXID/bshcF9JEjC916gJeNWwJZ/sXstV172AtUlGgm16A/27K7VpIm8oj4VkRcnPKYgHRiD3A3vVY7/VZPK38xuxnlAhUlmsk16M+27Bn0rMiRjW2b5nXHzi09t1R3rnYGWT11+8VcNjuzdqTBRtwRaHIN+rMtewY9iRzZGXS1M8jqabXkeuozThmpy4A7AqVRx+sMg/5si5522IlEjuwMutoZZPW0WnIddQXdLT6r9Y6gLgmp7up6nWGYlfbyO8gf7Hq97ti5pdD9CCRyZGfQ1c4gq6cdWzf0LNWMuoLurOevvHlBXRJS3dX1OkPZK+1BMY8cWRpkF94gu+u2bZrX4o9+pt13/viErpVx65nL8W3etf+kC6nLCanqX/46q/N1hiLnio+KFTkm1qCrpw9t26iP/MW5hayy6pyQ6ozrDMNhRY6JNujqqahVVtnDkyZF2bNKcseKHChQ2W1ok6Kutei6YkUOFCj1eNRpUsdadF2RyDEx6jqCloSEopHIMRHKmlcO1BE1ckyEuvYdA2UgkWMi0OaHaUZpBROBNr/JVNfrHnXDihwTgTa/yVPXeSt1RCLHRKDvePJw3WNwY5dWbD9T0u2SntE+3g0R8cFxjwsMiza/ycJ1j8GlqJH/n6QtEfFL22slfdv2VyPizgTHBoZCTXVycN1jcCluvhwR8cv2p2vbH71ufwgUhprqZOG6x+CS1Mhtr7F9t6QjkvZFxF1dnrPd9qLtxWazmeK0wAmoqU4WrnsMLkn7YUQ8Jelc27OS9tg+JyLu63jOgqQFSWo0GqzYkRw11cnDdY/BJO1aiYijkr4l6aKUxwUGwQxrTKuxE7ntufZKXLZnJF0o6cFxjwsMi5oqplWK0soLJH3O9hq1Xhi+GBFfTnBcYCiMjMW0GjuRR8Q9kjYliAUYGzVVTCN2dgJA5kjkAJA5EjkAZI4xtgDQluuIBxI5ACjv2wVSWgEA5T3igUQOAMp7xAOJHACU94gHEjkAKO8RD1zsBADlPeKBRA4AbbmOeKC0AgCZI5EDQOZI5ACQORI5AGSOi50YWxXzKXKdiQEUYexEbvtMSZ+X9HxJv5W0EBEfG/e4yEMV8ylynokBFCFFaeVJSe+LiJdIOl/SO22/NMFxkYEq5lPkPBMDKEKKW709Lunx9p9/YfuQpHlJD4x7bJRrlHJFFfMpcp6JARQh6cVO2+vVun/nXV2+tt32ou3FZrOZ8rRIYLlcsXT0mEK/K1fsPbi06t+rYj5FzjMxgCIkS+S2ny3pRknvjYifd349IhYiohERjbm5uVSnRSKjliuqmE+R80wMoAhJulZsr1Urie+OiJtSHBPlGrVcUcV8ipxnYgBFSNG1YknXSjoUER8ePyRU4YzZGS11Sdp1LVfkOhMDKEKK0spmSW+VtMX23e2P1yU4Lko0arli1No6gHRSdK18W5ITxIIKjVquWK22zooZKAc7O3HcKOUKWgGB6jFrBWOhFRCoHokcY6EVEKgepRWMhVZAoHokcoyNVkCgWpRWACBzJHIAyByJHAAyRyIHgMyRyAEgcyRyAMgciRwAMkciB4DMkcgBIHPs7JxCo9xkGUB9kcinzPKNIJZniC/fCEISyRzIVJLSiu1P2z5i+74Ux0NxRr3JMoD6SrUi/6ykj0v6fKLjoSDD3giCMgxQf0lW5BFxu6SfpTgWijXMjSC4HyeQh9K6Vmxvt71oe7HZbJZ1WnQY5kYQlGGAPJSWyCNiISIaEdGYm5sr67TosG3TvK6+bKPmZ2dkSfOzM7r6so1dyyXcjxPIA10rU2jQG0GcMTujpS5Jm/txAvXChiD0xP04gTykaj+8TtJ/Sdpg+7Dtt6c4Lqo1TBkGQHWSlFYi4s0pjoP6Kfp+nLQ3AuOjRo7KsMsUSI
NEjkKttuJerb2RRA4MjkSOwvRbcdPeCKRB1woGsvfgkjbv2q+zd96izbv2D7S7s9+GomF2mQLojUSOvkbdqt9vxU17I5AGiRx9jbpVv9+Km/ZGIA1q5Ohr1Fr2jq0bTqiRSyevuItubwSmAYm8Arn1To+6VX9ld0ou/69AjkjkJcuxd3qQlXUvrLiB4lEjL1mOo2GpZQP1xoq8ZLn2TrOyBupr6hJ51fVpRsMCSG2qSit1uHUZvdMAUpuqRF6H+jT1ZgCpTVVppS71aerNAFKaqhU5sz0ATKJUdwi6yPZDth+2vTPFMYtAfRrAJBq7tGJ7jaRPSHqtpMOSvmP75oh4YNxjp8ZOQwCTKEWN/DxJD0fEI5Jk+3pJl0qqXSKXqE8DmDwpSivzkh5d8fnh9mMnsL3d9qLtxWazmeC0AAApzYrcXR6Lkx6IWJC0IEmNRuOkrwPDqHpjF1AnKRL5YUlnrvh8naTHEhwXmSo6yeY4eAwoUorSynckvcj22bafLulySTcnOC4yVMbu2Tps7ALqZOxEHhFPSnqXpFslHZL0xYi4f9zjIk9lJNm6bOwC6iLJzs6I+Iqkr6Q4FlpyrQGXkWQZPAacaKp2duaiDsO9+tl7cEmbd+3X2Ttv0eZd+4/HVsbuWTZ2ASea+ETeK+HUWd1rwKu90JSRZBk8Bpxooodm5drdUPca8GovNHfs3HL8OUWWhdjYBfzORCfy1RJOnZNA3WvA/V5oSLJAuSa6tFL3lW0vda8BM0USqJeJTuS5Jpy614Dr/kIDTJvsSivDtOXt2LrhhBq5lE/CqXN5gimSQL1klciHvXhJwilOnV9ogGmTVSIf5eIlCQfApMuqRp7rxUsAKFJWK/K6t+VJ+W6tB5CvrFbkde+WyGFrPYDJk1Uir3tbXt231gOYTFmVVqR6X7ykhg+gClmtyOsu1w1IAPI2lYm8qImIda/hA5hMYyVy22+yfb/t39pupAqqSEVekKx7DR/AZBq3Rn6fpMsk/XuCWEpR9ETEOtfwAUymsRJ5RBySJNtpoilBURck6R8HUJXSauS2t9tetL3YbDbLOu1JirggSf84gCr1TeS2v277vi4flw5zoohYiIhGRDTm5uZGj3hMRVyQpH8cQJX6llYi4sIyAilLERMR6R8HUKXsNgSlkPqCZA4zYABMrnHbD//c9mFJr5J0i+1b04SVF/rHAVRp3K6VPZL2JIplVXXuCinrBhZ1/h4AqE4WpZVh7wxUhaL7x3P4HgCoRhZb9OkK4XsAoLcsEjldIXwPAPSWRSJnqiDfAwC9ZZHI6QrhewCgtywudpbRFVL3jpCyOmMA5McRUfpJG41GLC4uln7eXjo7QqTWapcRtADqxPaBiDhpZHgWpZWi0RECIGckctERAiBvJHLREQIgbyRy0RECIG9ZdK0UobNL5Q2vnNc3H2zSEQIgO1OZyLvNLbnxwBJdKgCyNJWlFbpUAEySqUzkdKkAmCRTmcjpUgEwSca9Q9A1th+0fY/tPbZnE8VVKLpUAEyScVfk+ySdExEvk/R9SVeOH1Lxtm2a19WXbdT87IwsaX52hgudALI17q3eblvx6Z2S3jheOOUp+o4+AFCWlO2Hb5P0n72+aHu7pO2SdNZZZyU8bbnqPiURwPTpm8htf13S87t86aqI+FL7OVdJelLS7l7HiYgFSQtSa/rhSNFWjPtmAqijvok8Ii5c7eu2r5B0saQLooqZuCVarf+cRA6gKmOVVmxfJOkDkv44In6dJqT6ov8cQB2N27XycUnPkbTP9t22/y1BTLVF/zmAOhorkUfEH0bEmRFxbvvjHakCqyP6zwHU0VQOzRoV980EUEck8iHRfw6gbqZy1goATBISOQBkjkQOAJkjkQNA5kjkAJA5V7Gr3nZT0o8knS7pp6UHMDziTCuXOKV8YiXOtOoa5x9ExFzng5Uk8uMntxcjolFZAAMizrRyiVPKJ1biTCuXOJdRWgGAzJHIASBzVSfyhYrPPy
jiTCuXOKV8YiXOtHKJU1LFNXIAwPiqXpEDAMZEIgeAzNUmkdv+e9th+/SqY+nG9jW2H7R9j+09tmerjmkl2xfZfsj2w7Z3Vh1PN7bPtP1N24ds32/7PVXHtBrba2wftP3lqmPpxfas7Rva/zYP2X5V1TF1Y/tv2z/z+2xfZ/uZVce0zPanbR+xfd+Kx37f9j7b/93+7+9VGWM/tUjkts+U9FpJP646llXsk3RORLxM0vclXVlxPMfZXiPpE5L+TNJLJb3Z9kurjaqrJyW9LyJeIul8Se+saZzL3iPpUNVB9PExSV+LiBdLerlqGK/teUl/I6kREedIWiPp8mqjOsFnJV3U8dhOSd+IiBdJ+kb789qqRSKX9BFJ75dU2yuvEXFbRDzZ/vROSeuqjKfDeZIejohHIuIJSddLurTimE4SEY9HxHfbf/6FWkmnlsPdba+T9HpJn6o6ll5snybp1ZKulaSIeCIijlYaVG+nSJqxfYqkZ0l6rOJ4jouI2yX9rOPhSyV9rv3nz0naVmZMw6o8kdu+RNJSRHyv6liG8DZJX606iBXmJT264vPDqmmCXGZ7vaRNku6qOJRePqrW4uK3FcexmhdKakr6TLsE9Cnbp1YdVKeIWJL0r2q9435c0v9GxG3VRtXX8yLicam1AJH03IrjWVUpidz219u1sc6PSyVdJekfy4ijnz5xLj/nKrVKBLuri/Qk7vJYbd/d2H62pBslvTcifl51PJ1sXyzpSEQcqDqWPk6R9ApJn4yITZJ+pRqWANr15UslnS3pDEmn2n5LtVFNllJu9RYRF3Z73PZGtX6437MttcoV37V9XkT8pIzYVuoV5zLbV0i6WNIFUa8G/MOSzlzx+TrV6K3rSrbXqpXEd0fETVXH08NmSZfYfp2kZ0o6zfYXIqJuyeewpMMRsfyu5gbVMJFLulDSDyKiKUm2b5L0R5K+UGlUq/sf2y+IiMdtv0DSkaoDWk2lpZWIuDcinhsR6yNivVr/MF9RRRLvx/ZFkj4g6ZKI+HXV8XT4jqQX2T7b9tPVupB0c8UxncStV+trJR2KiA9XHU8vEXFlRKxr/5u8XNL+GiZxtX9PHrW9of3QBZIeqDCkXn4s6Xzbz2r/G7hANbwo2+FmSVe0/3yFpC9VGEtf3Hx5cB+X9AxJ+9rvHu6MiHdUG1JLRDxp+12SblWrI+DTEXF/xWF1s1nSWyXda/vu9mP/EBFfqS6k7L1b0u72C/gjkv664nhOEhF32b5B0nfVKkseVI22wNu+TtKfSDrd9mFJH5S0S9IXbb9drReiN1UXYX9s0QeAzFXetQIAGA+JHAAyRyIHgMyRyAEgcyRyAMgciRwAMkciB4DM/T/9rVjO25dzigAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(X1[:, 0], X1[:, 1], marker='o') "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"<h3 id=\"agglomerative_clustering\">Agglomerative Clustering</h3>\n",
"\n",
"We will start by clustering the random data points we just created.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The <b> AgglomerativeClustering </b> class will require two inputs:\n",
"\n",
"<ul>\n",
" <li> <b>n_clusters</b>: The number of clusters to form. </li>\n",
" <ul> <li> Value will be: 4 </li> </ul>\n",
" <li> <b>linkage</b>: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pair of clusters that minimizes this criterion. </li>\n",
" <ul> \n",
" <li> Value will be: 'complete' </li> \n",
" <li> <b>Note</b>: It is recommended you try everything with 'average' as well; the next code cell was run with 'average' </li>\n",
" </ul>\n",
"</ul> <br>\n",
"Save the result to a variable called <b> agglom </b>\n"
]
},
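{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the linkage criterion concrete, here is a minimal sketch that computes the complete, average, and single linkage distances between two tiny clusters. The arrays `cluster_a` and `cluster_b` are made-up values for illustration only and are not part of the lab data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.spatial.distance import cdist\n",
"\n",
"# Two hypothetical clusters of 2-D points (illustrative values only)\n",
"cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])\n",
"cluster_b = np.array([[3.0, 0.0], [4.0, 0.0]])\n",
"\n",
"# All pairwise Euclidean distances between points of the two clusters\n",
"pairwise = cdist(cluster_a, cluster_b)\n",
"\n",
"complete_dist = pairwise.max()   # complete linkage: farthest pair\n",
"average_dist = pairwise.mean()   # average linkage: mean over all pairs\n",
"single_dist = pairwise.min()     # single linkage: closest pair\n",
"\n",
"print(complete_dist, average_dist, single_dist)\n",
"```\n",
"\n",
"Complete linkage uses the farthest pair of points, so it tends to favor compact clusters, while average linkage is less sensitive to a single distant point.\n"
]
},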
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'average')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit the model with <b> X1 </b> and <b> y1 </b> from the generated data above. (<b>y1</b> is accepted for API consistency but ignored, because clustering is unsupervised.)\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',\n",
" connectivity=None, linkage='average', memory=None,\n",
" n_clusters=4, pooling_func='deprecated')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"agglom.fit(X1,y1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the following code to show the clustering! <br>\n",
"Remember to read the code and comments to gain more understanding on how the plotting works.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAADrCAYAAABXYUzjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVlElEQVR4nO3de3DdZZ3H8c9Jcs5Jk9CShJZeaFJierEsLbRRUhHbcgkIXrbg6q6grMruONbpON3F26KItQhjdXTaKsx2i6MbXcfqjGAV60iVgtNKD9BSK6WlNJTeSU8baHPP2T/SE3895HLOye/yPL/zfs0wzPSS86NDPn1+3+f7fJ9IKpUSAMB/RUE/AAAUKgIYAAJCAANAQAhgAAgIAQwAASGAASAgJbn84osuuig1bdo0jx4FAMIpkUi8nkqlxmf+eE4BPG3aNG3fvt29pwKAAhCJRFoG+3FKEAAQkJxWwEChSKzdpW0P7lBvV6/m/vvbdc19DYpEIkE/FkKGAAYyHE2c0O8/+7QWPXiVKiaX6dcf26yLr6jWzFvrgn40hAwlCCDD3kcPSJLmfHKmLrt9uqLlJdr7q0FLeMCoEMBAhjPH2iVJsQtiikQiilVEdebY2YCfCmFEAAMZyi8eI0nqautSKpVS1xvdKr+4LOCnQhhRAwYcEi1JPTexWJK085E9qphcpu6zPap/f23AT4YwIoCBcxItSd2+bqu6evo058ZKxb+7U8W90oIvX6mZt10a9OMhhAhg4Jyt+1vV1dOnvpT0wpXluuEL87R0cX3Qj4UQowYMnNNYV61YSZGKI1K0pEiNddVBPxJCjhUwcM782ko139Worftb1VhXrfm1lUE/EkKOAAYc5tdWErzwDSUIAAgIAQwAASGAASAgBDAABIRNOBQExkvCRAQwQo/xkjAVJQiEHuMlYSoCGKHHeEn3JZNJzZo1S6WlpaqsrNQdd9yhjo6OoB/LOgQwQo/xku6LRqNauXKldu3apbvuukvNzc3auHFj0I9lHWrACDXGS3qjoqJCt912mySppqZG8XhcM2bMCPip7EMAI7QYL+mtLVu2qKmpSR0dHWpqatKll/JnmitKEAitzPGSseZ3admxj2vhynfSguaChoYGPffcc1qxYoU2bdqk9evXB/1I1mEFjNBKj5fs7uljvKSLEi1Jbdi0RVPLenXTgrkqLy+XJJWVUVfPFQGM0GK8pPvSZZ1Te3eo9berpfZTqq6q0tKlS3XnnXcG/XjWIYARaoyXdFe6rFM6bZ5qPvOIljfN5NaQUaAGDCBrNt0aYkOvMgEMIGvpss7ypplqvqvR6LcLG3qVKUEAyIktZR0bepUJYAChZXqvMiUIAKFleq8yK2AAoZJoSWrr/lZVth/WhFiX6urqjO1VJoABhIbz+Hl3y/PqefIhtR4/pipDe5UJYCBL3KphPufx81jtFfrijzcb3adMAANZ4FYNO9h2/JxNOCAL3KphBy/7lL042EEAA1ngVg17zK+t1NLF9a73KntxsIMABrLArRpm8+PYcfpgR319vWsHOwhgYBiJlqTWbt6n7vlVkvpv1fhr815u1TCMX8eOt2zZojFjxmjZsmVauHDhqA92sAkHDMHZ0hQrKdI3v3alEqt3qberj1s1ApRMJrVgwQIdOHBAY8aM0S233KJ169Z5duw43VfcWFc9cLBjw4YN+spXvqL169dr2bJleX9tAhgYgrOlqbunT4evrdLSe+8I+rEKXnq1O3fuXD388MNatWqVlixZogkTJrh+7Nj5l3Df66/o84umuDqEngAGhmBbS1OhGGrITnV1tSZOnKiDBw9q06ZNWrRokZ566imVlpbm/VnOv4Q730jqnuXf1OeSJ1w72EEAA0PgRg1zOYfszGp4t5556ZDGxw7o7rvv1qFDh3T//fcrkUho48aNA2GdD+dfwuOmN6j5wV2u/n9AAAPDsGX0YiHIrMX+76//qE+vWKsX//RjLb9/tc
Yc2zlw7Piaa67Rtm3bRlUHTiaTuv3GBXrllQMqjsW16PobddnFi138LyKAAVhgsFrsCY1TX3FckhS9+G26Z9X3NKfkiJqamgZWyKOpAw9Wa9648SOjWlFnIoABGG+wWuwbyRNKxSo0dt4tqrriBjXWVWv2hCmudSn4MdCdAAZgvKFqsQOjJ8ti2rBpi6aW9brapeD1QHcCGL5jqhhyNdSGaPrft6/bqlN7d6j1t6ul9lOqdqlLwe2+30wEMHzl11QxQj58htoQTZcnSqfNU81nHtHyppmjGkGZXlXHTuzR1//j0zp+/LiKivoPDUej0by/7mAIYPjKOVVsTHWpfvfpLdr7qxZXA5jRkYXFzX5t52Zf58vbdPZsu1Kp1EAAV1VVufXYkpgFAZ/5MVWM0ZGFJdcRlMMN7nFu9pW+7Sp9Y8NWdXV16YEHHlA8Htfs2bNdfXYCGL5yThVLrNmlM8fbdfBPR/TkV59RKpVy5TPcCvnE2l36fk2zVk/8kavPB/flMoJyuME96dV0cUSKlhQp/vpLrg7fyUQJAr5I19Xefm6q2NMrnlVi9S5J0tx/e7v+vOLZUZUJnDXfyhnjJPWHfPFFpXmNjqSMEV7DtZdlbvbNnlCqd7MJB5sNNlXshW/3h2/D5y7Xtd9u1M71Lw5ZCx5pQy0dluNvmKwT247rzJajkqQd61/UBVPK8xod6UetGsEZrr0svYr2oq0tEwEMzw02VWz2kXo9//DftOiBq1RUVDRkmSCblWg6LE/8/rC2LRqr+VvaVNIrbfvWDhUVF+U1OpIbMMJtuPay9ILBi7a2TNSA4TlnXa24KKLDp9rVFu9fwY50w0Q2G2rpsJSkv11erq54RCqS6m+p1bJjH9fCle/MuQWNGzDc4cdNFblItCT1pf9+VI/88nHFYrFBV7aZbW3f/d1uHT16VGvWrKENDfZJ19V+8exr2pB4TT/9y6v645vder/6b5iomFw2ZJlgpJVooiWp3e1//4bujUUU7ZaKokV5r1gTLUk9N7FYyni+lze+qtUTf0RfcQ6Gmt3r5jyFbGW7svVzDCkBDF/Mr63U1v2t6untL0UcnRBV/LMzhr1hwhmug22opb+hxkU6tOTc7/lMT4U6u1OKxCN5rVid9eo5N1Yq/t2dinT2SZLeefccXTClnA25HPgxTyFb2R7Y8HMMKQEM32SuLN7zn1do/urBx/tlhutvv/O8Zl5Wfd5KOf0NdWxiTM+/o0JXPPOm+tbtV/0HarXv0Za87mxz1qtfuLJcN3xhnub8Mamnv/6s5n5qFhtyefB6nkK2clnZ+jWGlACGb3JZWTjD9c/XX6jI/+zRocj5G2qNddUqKYqoqzelrddeqPbKqBa90KHDW4/nfWfbYN+kJ352WBIbcvnyep5CtkwcsE8Aw1fZriycQfhS41jdu67pLb9vfm2l/qlhqn6y7VWlJO2aV66mL84b1RyAzG9SScOWQTC4gSll7Yc1Idaluro6z1q5cmHagH0CGEbKdrVy67xL9ItnX3N1wyT9TZpNGQRv5ayjd7c8r54nHxq4qcKLVi6bEcAwVjarFS9fK7Mpg+CtnHX0WO0V+uKPN4/qrSTMCGAYJZ8xkl69VmZTBsFbcZt09ghgGMO0+QsmbtrYgD+37BHAMIaJ8xdM27SxBX9u2eEoMozB/AUUGgIYgUu0JLV2876s50PALqbNgzAJAYxApVuWvr1pj773Zquk/vkLf23eS7tXSAw3AL3QUQNGoJwtS9nMh4B9TJoHYRoCGIHKZT4E7GXKPAjTUIJAoAa7UJG72MInPQ9ixYoV2rRpk9avXx/0IxmBFTAC52xZMq0XGPkzdR6ESQhgGMXEXmDkjnkQ2SGAYRR6gcOBeRDZIYBhhPTr6nhHLzCjH+3FPIjsEMAInPN1dVKWd8XBbMyDyA4BjMDRCxxOzIMYGQGMwNELjEJFACNwvK6iUBHAMAKvqyhEnIQDgIAQwABCw7bRlwQwgNCwbfQlNW
AAoWHb6EsCGECo2DT6khIEAOulr7VKtCStGn3JChiA1ZxH2ftef0WfXzRFNy2Ya8XoSwIYgNWcR9k730jqnuXf1OeSJ6wYfUkAA7Ca8yj7uOkNan5wlzWHeghgAFaz+Sg7AQzAerYeZacLAgACQgADQEAoQUCStDvZpY/+4bheOt2tspKIPjHjAn1rgXfXyLy+O6lHP/oHnXzptKJlJbr8EzN07bcWePZ5gIlYAUOS1NGT0semV2j7rVP04boKrdp5Wk8cavfs83o6enTZx6brX7ffqlkfrtNfVu3UgScOefZ5gIlYAUOSNG98XPPGxyVJ100p1Q92t+lkZ69nnzdx3nhNnDdeklR73RQ994Pd6jjZ6dnnASYigENmtKWE0519ui9xSvVjS3TzVO9PEHWc7tTT9yVUWT9Wb7t5quefB5iEEkTIjKaUcLqzT02/OaLWjl49fvMklUW9/d+j43Snftb0G7W3dujDj9+saFnU088DTMMKOGRyLSUkWpLaur9V/1BTpbt3tGtfW7d+2XSx4sURtXX1aWzM3RBOf17D+LF68ZNPKbmvTUt+2aTieLE627oUHxtz9fMAkxHAIZVNKcE5xOTs2At1rGqSJGnxY0ckSffOv1Bfa6hy7Zmcn1fzWpdueuaEJOmnix+TJF1973xd87UG1z4PMB0BHELOUsKTH5g8ZCnBOcSkpKNDSy7q0zeurdGaXW36we42vWfSGFefy/l5By+J6YInrtPSxfWufgZgE2rAIZGeh/qnl0/qho1HtPd0t5qvmzBQShhMeohJcUSq6OvSfzVUa3ZlTNdNKZUk17sgnJ8XLSlSY513fcaADSKpVCrrX9zQ0JDavn27h4+DfDhf7XvKy3VwfM15Pz9cKSFdk00PMTnd2adrHj2s9p4+7fjQJa5vxGV+HlAIIpFIIpVKvaW+RgkiBJyv9tEzZ7Tm6r6sX+2dQ0yyLV2Mhq1DUwAvEMAh4JyHmuurvd9dEAD+jgAOgXznoQ5VuvCqCwLA+QjgkMjn1X40pQsAo8f7ZQGjKwEIFivgAmbzVS5AGBDABY6uBCA4lCAAICAEMAAEhAAGgIAQwAAQEAIYAAJCAANAQAhgAAgIfcAWGu3FmwDMwArYQqO5eBOAOVgBWyjXizcBmIkAtlg2F29KlCwAUxHAlhkYoD61Sp9/oT2r2yvSJYv31pRpza42rdp5Wu+tKdO1U9y9dBNAbghgi6QHqHf0Sscm1ShWXqZf3TRxxNsrKFkAZmITziLpAeodsVJ1xMaorTulxY8d0dTmV/WdnadG/P3ZliwA+IMVsEXSA9QjnWc187U9ar6rcdhRks4biOsnjvP8wk0AuSGALZLLAHXnfW/F0RLFZs/SkY4UF24CBiGALZPtAHXnfW8dxTHtO91f8+XCTcAcBHBIOa+qH9vTocdu8ubmC1rcgPwRwCHl131vtLgB+SOAQ8yN+95GWuHS4gbkj10YDCvbuRO0uAG5YwVcQPKp12azwj3d2UeLG5AHAriAjKZeO9gKN9GS1Oa9rXrk9SgtbkAeCOACkmu9dri5E+k+49MlpTo6sVYSLW5ArgjgApRNvXakuRPpPuN491nVHfibljfN1NLF9T7/lwB2I4Dz1N2d0sKFh5VIdKqrS3rllamaNi0a9GMNKp8jyQNzJ+Jl6oiNUce5uRNS/wr3/Y4+42hJkRrr6P0FckUA5ykSkd73vjJdckmJfv7zM0E/zpDyPZKczdwJP/qMgTAjgPNUUhLRl79cqXvuORn0owwr3yPJ2RzkcKPPGChkBHDIjeZIMgELeMu6ALap9moCv44kA8iddQEcdO3VuaFV3l6u1tb+V/qXX+5RPB7RpEnm/ZGykgXMZF5ajCDI2qtzQytWUqQ935g58HPXX39Ed95ZoR/+cILvzwXATtYFcJCcG1rdPX1a80Qfva8A8mbNedFES1JrN+9ToiUZ2DOkN7SKI6L3FcCoWbECznz1X7noHWptTUnyt/bKhhYAN1
kRwJmv/h+6vm3g5/yuvbKhBcAtVgRwY8ax1+0HCEEA9rMigHn1BxBGVgSwxKu/37y+bJPLPAGLuiDgr2yvIjL16wM2sGYFDH95fdkml3kCrIAxAq8v2+QyTxQyAthD3d0pvetdhxSP71cksl8HDnQH/Ug5cV62+fjNk1y/bNPrrw+YjhKEh4IeHDSSwTbC/nlyUf89cDVVuntHu/a1dbt62ebAPXMefX3AJsYHsM3jJ00f2j7YLcn/94eDip55Uz3l5To4vkaSe5dtOk80evH1AdsYH8CmryJtNthGWFcqouKUFD1zRmuudnfYkPNEoxdfH7CN8e976VXkjBl2rHptlN4Im1pWpHHdZz0bNsQwI+B8xq+AbWT60Pahb0meolOnxnl24pATjcD5AkkCm+u6IzF9aPtItyRPnzTO02DkRCPwd4EEcDZ1XdNXkUMxfWh7vrckA3BfICk2UneA6avI4WRObjOtzjmaW5IBuMvIZaTpq8jhmF7nNP35gEJiZACbvoociel1TtOfDygUvgZwtnXdoFdpYd4kBGAO3wI417pukKs0Dn8A8INvBzEGq+umUnUD/5i0qcbhDwB+8C2AOQUFAOfzrQQRdF03G84atYnPByBcfN2Ec7Ou6/ZGWWaNeuWid6i1NSXJ/MMfAOxk/DCeoaQ3yj74wXJXvl5mjfpD17fpoYfekNS/SfilL5k5UhKAvaxd0rk9azez93j7AXplAXjL2gB225zJFyr+m9l6+YUe9XRHVP2piqAfCUDIWVuCcFskIn3ktrFa8o8ELwB/WBfAiZak1m7ep0RLUi++2HXeabojR3ry/rr0/gLwm1UlCJunpAFAJqsC2OYpaQCQyaoA9mJKmq2D3wHYz9d0Ge3hCbdP0wVR0mDSGoA0XwPYjSljbp6mC6KkwaQ1AGm+dkGY1mkQxIAg0/4MAASnoAucNgwIAhBeBR3Akn+D35m0BiCTLwFc6J0GTFoDMBjPv+v96jQwubtgsElraRwgAQqX5wHsV6eByd0FTFoDMBjPA9ivK+bdHk/pJjb7AAzG8wAmfPoFecszADP5svND+ADAW1m/9V7oHRYA7GV1Om3bd1KLFh9Rx5G41JuUlBz4OboLAJjO7gA+cFJj6t5QUXm3zu4Zq/t+0qev/gvjKQHYwbobMZyunl6tCe9JKlbdJUm6cuqFwT4QAOTA6gBOd1hcdWl/a9vll4wL+IkAIHtWliAy5yo0TEvpdzoV9GMBQE6sC2DmKgAIC+uSirkKAMLCugBmrgKAsLAugN042mzy5DQAhcO6AJZGf7TZ5MlpAAqH1W1o+eJeNgAmKMgABgATFFQAJ1qSWrt5nxItyZF/MQB4zMoacD7oHwZgmoJJHPqHAZimYAKY/mEApimYAOZqJACmKZgAlrgaCYBZCqoLAgBMQgADQEAIYAAICAEMAAEhgAEgIAQwAAQkkkqlsv/FkcgJSS3ePQ4AhFJtKpUan/mDOQUwAMA9lCAAICAEMAAEhAAGgIAQwAAQEAIYAAJCAANAQAhgAAgIAQwAASGAASAg/w9jN1OEaGyz4wAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create a figure of size 6 inches by 4 inches.\n",
"plt.figure(figsize=(6,4))\n",
"\n",
"# These two lines of code are used to scale the data points down,\n",
"# Or else the data points will be scattered very far apart.\n",
"\n",
"# Create a minimum and maximum range of X1.\n",
"x_min, x_max = np.min(X1, axis=0), np.max(X1, axis=0)\n",
"\n",
"# Get the average distance for X1.\n",
"X1 = (X1 - x_min) / (x_max - x_min)\n",
"\n",
"# This loop displays all of the datapoints.\n",
"for i in range(X1.shape[0]):\n",
" # Replace the data points with their respective cluster value \n",
" # (ex. 0) and is color coded with a colormap (plt.cm.spectral)\n",
" plt.text(X1[i, 0], X1[i, 1], str(y1[i]),\n",
" color=plt.cm.nipy_spectral(agglom.labels_[i] / 10.),\n",
" fontdict={'weight': 'bold', 'size': 9})\n",
" \n",
"# Remove the x ticks, y ticks, x and y axis\n",
"plt.xticks([])\n",
"plt.yticks([])\n",
"#plt.axis('off')\n",
"\n",
"\n",
"\n",
"# Display the plot of the original data before clustering\n",
"plt.scatter(X1[:, 0], X1[:, 1], marker='.')\n",
"# Display the plot\n",
"plt.show()"
]
},
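{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plotting cell above rescales `X1` with min-max scaling. As a minimal sketch of what that step does, the example below applies the same formula to a small made-up array (`points` is a hypothetical name with illustrative values only).\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical 2-D points (illustrative values only)\n",
"points = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]])\n",
"\n",
"# Column-wise minimum and maximum\n",
"col_min, col_max = points.min(axis=0), points.max(axis=0)\n",
"\n",
"# Min-max scaling maps every column onto the interval [0, 1]\n",
"scaled = (points - col_min) / (col_max - col_min)\n",
"print(scaled)\n",
"```\n",
"\n",
"Note that in this lab the model was fitted before the rescaling, so the scaling only affects how the points are drawn, not the cluster assignments.\n"
]
},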
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 id=\"dendrogram\">Dendrogram Associated with the Agglomerative Hierarchical Clustering</h3>\n",
"\n",
"Remember that a <b>distance matrix</b> contains the <b> distance from each point to every other point of a dataset </b>. \n",
"\n",
"Use the function <b> distance_matrix, </b> which requires <b>two inputs</b>. Use the feature matrix, <b> X1 </b>, as both inputs and save the distance matrix to a variable called <b> dist_matrix </b> <br> <br>\n",
"Remember that the distance values are symmetric, with a diagonal of 0's. This is one way of making sure your matrix is correct. <br> (print out dist_matrix to make sure it's correct)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 0.13893713 0.41065552 ... 0.59120664 0.87545224 0.16843559]\n",
" [0.13893713 0. 0.4967219 ... 0.68402067 0.99617376 0.2949719 ]\n",
" [0.41065552 0.4967219 0. ... 0.18794082 0.54332806 0.26991334]\n",
" ...\n",
" [0.59120664 0.68402067 0.18794082 ... 0. 0.38729848 0.4368824 ]\n",
" [0.87545224 0.99617376 0.54332806 ... 0.38729848 0. 0.70704646]\n",
" [0.16843559 0.2949719 0.26991334 ... 0.4368824 0.70704646 0. ]]\n"
]
}
],
"source": [
"dist_matrix = distance_matrix(X1,X1) \n",
"print(dist_matrix)"
]
},
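{
"cell_type": "markdown",
"metadata": {},
"source": [
"The symmetry and zero-diagonal checks mentioned above can also be done programmatically. The sketch below uses a small made-up array (`pts` is a hypothetical name, not the lab data) and also shows the condensed form of a distance matrix, which stores each pair of points only once.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.spatial import distance_matrix\n",
"from scipy.spatial.distance import pdist, squareform\n",
"\n",
"# Hypothetical points (illustrative values only)\n",
"pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])\n",
"\n",
"square = distance_matrix(pts, pts)\n",
"\n",
"# A valid distance matrix is symmetric with zeros on the diagonal\n",
"is_symmetric = np.allclose(square, square.T)\n",
"has_zero_diagonal = np.allclose(np.diag(square), 0.0)\n",
"\n",
"# Condensed form keeps only the upper triangle: n * (n - 1) / 2 entries\n",
"condensed = squareform(square, checks=False)\n",
"\n",
"# pdist computes the same condensed vector directly from the points\n",
"same_as_pdist = np.allclose(condensed, pdist(pts))\n",
"print(is_symmetric, has_zero_diagonal, condensed, same_as_pdist)\n",
"```\n",
"\n",
"The condensed vector is the input format that `scipy.cluster.hierarchy.linkage` expects when it is given distances rather than raw observations.\n"
]
},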
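{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the <b> linkage </b> function from hierarchy, pass in the parameters:\n",
"\n",
"<ul>\n",
" <li> The distance matrix in condensed form (<b>linkage</b> treats a square matrix as raw observations and emits a ClusterWarning; <b>scipy.spatial.distance.squareform</b> converts the square matrix to the condensed vector) </li>\n",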
" <li> 'complete' for complete linkage </li>\n",
"</ul> <br>\n",
"Save the result to a variable called <b> Z </b>\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/ipykernel_launcher.py:1: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"Z = hierarchy.linkage(dist_matrix, 'complete')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hierarchical clustering is typically visualized as a dendrogram, as shown in the following cell. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the distance between the two clusters that were merged, where individual data points are viewed as singleton clusters. \n",
"By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. \n",
"\n",
"Next, we will save the dendrogram to a variable called <b>dendro</b>. In doing this, the dendrogram will also be displayed.\n",
"Using the <b> dendrogram </b> function from hierarchy, pass in the parameter:\n",
"\n",
"<ul> <li> Z </li> </ul>\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"dendro = hierarchy.dendrogram(Z)"
]
},
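{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dendrogram shows the full merge history, but it is often useful to cut the tree into a fixed number of flat clusters. The sketch below does this with `hierarchy.fcluster` on a small made-up array (`pts` and `Z_demo` are hypothetical names used only for illustration, not the lab variables).\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.cluster import hierarchy\n",
"\n",
"# Hypothetical data with two obvious groups (illustrative values only)\n",
"pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])\n",
"\n",
"# Passing raw observations lets linkage compute Euclidean distances itself\n",
"Z_demo = hierarchy.linkage(pts, 'complete')\n",
"\n",
"# Cut the tree so that at most two flat clusters remain\n",
"labels = hierarchy.fcluster(Z_demo, t=2, criterion='maxclust')\n",
"print(labels)\n",
"```\n",
"\n",
"With `criterion='maxclust'`, `t` is the maximum number of flat clusters to form; the third column of the linkage matrix holds the distance at which each merge happened, which is the height drawn in the dendrogram.\n"
]
},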
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice\n",
"\n",
"We used **complete** linkage for our case. Change it to **average** linkage to see how the dendrogram changes.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# write your code here\n",
"from scipy.spatial.distance import squareform  # linkage needs the condensed form, not the square matrix\n",
"Z = hierarchy.linkage(squareform(dist_matrix, checks=False), 'average')\n",
"dendro = hierarchy.dendrogram(Z)\n"
]
},
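{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"from scipy.spatial.distance import squareform  # linkage needs the condensed form, not the square matrix\n",
"Z = hierarchy.linkage(squareform(dist_matrix, checks=False), 'average')\n",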
"dendro = hierarchy.dendrogram(Z)\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"<h1 id=\"clustering_vehicle_dataset\">Clustering on the Vehicle Dataset</h1>\n",
"\n",
"Imagine that an automobile manufacturer has developed prototypes for a new vehicle. Before introducing the new model into its range, the manufacturer wants to determine which existing vehicles on the market are most like the prototypes, that is, how vehicles can be grouped, which group is most similar to the new model, and therefore which models it will be competing against.\n",
"\n",
"Our objective here is to use clustering methods to find the most distinctive clusters of vehicles. The clustering will summarize the existing vehicles and help the manufacturer make decisions about the supply of new models.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download data\n",
"\n",
"To get the data, we will use **`!wget`** to download it from IBM Object Storage. \n",
"**Did you know?** When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2020-12-11 01:10:34-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/cars_clus.csv\n",
"Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n",
"Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 17774 (17K) [text/csv]\n",
"Saving to: ‘cars_clus.csv’\n",
"\n",
"cars_clus.csv 100%[===================>] 17.36K --.-KB/s in 0.001s \n",
"\n",
"2020-12-11 01:10:34 (17.9 MB/s) - ‘cars_clus.csv’ saved [17774/17774]\n",
"\n"
]
}
],
"source": [
"!wget -O cars_clus.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/cars_clus.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read data\n",
"\n",
"Let's read the dataset to see what features the manufacturer has collected about the existing models.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of dataset: (159, 16)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>manufact</th>\n",
" <th>model</th>\n",
" <th>sales</th>\n",
" <th>resale</th>\n",
" <th>type</th>\n",
" <th>price</th>\n",
" <th>engine_s</th>\n",
" <th>horsepow</th>\n",
" <th>wheelbas</th>\n",
" <th>width</th>\n",
" <th>length</th>\n",
" <th>curb_wgt</th>\n",
" <th>fuel_cap</th>\n",
" <th>mpg</th>\n",
" <th>lnsales</th>\n",
" <th>partition</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Acura</td>\n",
" <td>Integra</td>\n",
" <td>16.919</td>\n",
" <td>16.360</td>\n",
" <td>0.000</td>\n",
" <td>21.500</td>\n",
" <td>1.800</td>\n",
" <td>140.000</td>\n",
" <td>101.200</td>\n",
" <td>67.300</td>\n",
" <td>172.400</td>\n",
" <td>2.639</td>\n",
" <td>13.200</td>\n",
" <td>28.000</td>\n",
" <td>2.828</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Acura</td>\n",
" <td>TL</td>\n",
" <td>39.384</td>\n",
" <td>19.875</td>\n",
" <td>0.000</td>\n",
" <td>28.400</td>\n",
" <td>3.200</td>\n",
" <td>225.000</td>\n",
" <td>108.100</td>\n",
" <td>70.300</td>\n",
" <td>192.900</td>\n",
" <td>3.517</td>\n",
" <td>17.200</td>\n",
" <td>25.000</td>\n",
" <td>3.673</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Acura</td>\n",
" <td>CL</td>\n",
" <td>14.114</td>\n",
" <td>18.225</td>\n",
" <td>0.000</td>\n",
" <td>$null$</td>\n",
" <td>3.200</td>\n",
" <td>225.000</td>\n",
" <td>106.900</td>\n",
" <td>70.600</td>\n",
" <td>192.000</td>\n",
" <td>3.470</td>\n",
" <td>17.200</td>\n",
" <td>26.000</td>\n",
" <td>2.647</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Acura</td>\n",
" <td>RL</td>\n",
" <td>8.588</td>\n",
" <td>29.725</td>\n",
" <td>0.000</td>\n",
" <td>42.000</td>\n",
" <td>3.500</td>\n",
" <td>210.000</td>\n",
" <td>114.600</td>\n",
" <td>71.400</td>\n",
" <td>196.600</td>\n",
" <td>3.850</td>\n",
" <td>18.000</td>\n",
" <td>22.000</td>\n",
" <td>2.150</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Audi</td>\n",
" <td>A4</td>\n",
" <td>20.397</td>\n",
" <td>22.255</td>\n",
" <td>0.000</td>\n",
" <td>23.990</td>\n",
" <td>1.800</td>\n",
" <td>150.000</td>\n",
" <td>102.600</td>\n",
" <td>68.200</td>\n",
" <td>178.000</td>\n",
" <td>2.998</td>\n",
" <td>16.400</td>\n",
" <td>27.000</td>\n",
" <td>3.015</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" manufact model sales resale type price engine_s horsepow wheelbas \\\n",
"0 Acura Integra 16.919 16.360 0.000 21.500 1.800 140.000 101.200 \n",
"1 Acura TL 39.384 19.875 0.000 28.400 3.200 225.000 108.100 \n",
"2 Acura CL 14.114 18.225 0.000 $null$ 3.200 225.000 106.900 \n",
"3 Acura RL 8.588 29.725 0.000 42.000 3.500 210.000 114.600 \n",
"4 Audi A4 20.397 22.255 0.000 23.990 1.800 150.000 102.600 \n",
"\n",
" width length curb_wgt fuel_cap mpg lnsales partition \n",
"0 67.300 172.400 2.639 13.200 28.000 2.828 0.0 \n",
"1 70.300 192.900 3.517 17.200 25.000 3.673 0.0 \n",
"2 70.600 192.000 3.470 17.200 26.000 2.647 0.0 \n",
"3 71.400 196.600 3.850 18.000 22.000 2.150 0.0 \n",
"4 68.200 178.000 2.998 16.400 27.000 3.015 0.0 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filename = 'cars_clus.csv'\n",
"\n",
"#Read csv\n",
"pdf = pd.read_csv(filename)\n",
"print (\"Shape of dataset: \", pdf.shape)\n",
"\n",
"pdf.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature sets include price in thousands (price), engine size (engine_s), horsepower (horsepow), wheelbase (wheelbas), width (width), length (length), curb weight (curb_wgt), fuel capacity (fuel_cap) and fuel efficiency (mpg).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"data_cleaning\">Data Cleaning</h2>\n",
"\n",
"Lets simply clear the dataset by dropping the rows that have null value:\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of dataset before cleaning: 1872\n",
"Shape of dataset after cleaning: 1872\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>manufact</th>\n",
" <th>model</th>\n",
" <th>sales</th>\n",
" <th>resale</th>\n",
" <th>type</th>\n",
" <th>price</th>\n",
" <th>engine_s</th>\n",
" <th>horsepow</th>\n",
" <th>wheelbas</th>\n",
" <th>width</th>\n",
" <th>length</th>\n",
" <th>curb_wgt</th>\n",
" <th>fuel_cap</th>\n",
" <th>mpg</th>\n",
" <th>lnsales</th>\n",
" <th>partition</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Acura</td>\n",
" <td>Integra</td>\n",
" <td>16.919</td>\n",
" <td>16.360</td>\n",
" <td>0.0</td>\n",
" <td>21.50</td>\n",
" <td>1.8</td>\n",
" <td>140.0</td>\n",
" <td>101.2</td>\n",
" <td>67.3</td>\n",
" <td>172.4</td>\n",
" <td>2.639</td>\n",
" <td>13.2</td>\n",
" <td>28.0</td>\n",
" <td>2.828</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Acura</td>\n",
" <td>TL</td>\n",
" <td>39.384</td>\n",
" <td>19.875</td>\n",
" <td>0.0</td>\n",
" <td>28.40</td>\n",
" <td>3.2</td>\n",
" <td>225.0</td>\n",
" <td>108.1</td>\n",
" <td>70.3</td>\n",
" <td>192.9</td>\n",
" <td>3.517</td>\n",
" <td>17.2</td>\n",
" <td>25.0</td>\n",
" <td>3.673</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Acura</td>\n",
" <td>RL</td>\n",
" <td>8.588</td>\n",
" <td>29.725</td>\n",
" <td>0.0</td>\n",
" <td>42.00</td>\n",
" <td>3.5</td>\n",
" <td>210.0</td>\n",
" <td>114.6</td>\n",
" <td>71.4</td>\n",
" <td>196.6</td>\n",
" <td>3.850</td>\n",
" <td>18.0</td>\n",
" <td>22.0</td>\n",
" <td>2.150</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Audi</td>\n",
" <td>A4</td>\n",
" <td>20.397</td>\n",
" <td>22.255</td>\n",
" <td>0.0</td>\n",
" <td>23.99</td>\n",
" <td>1.8</td>\n",
" <td>150.0</td>\n",
" <td>102.6</td>\n",
" <td>68.2</td>\n",
" <td>178.0</td>\n",
" <td>2.998</td>\n",
" <td>16.4</td>\n",
" <td>27.0</td>\n",
" <td>3.015</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Audi</td>\n",
" <td>A6</td>\n",
" <td>18.780</td>\n",
" <td>23.555</td>\n",
" <td>0.0</td>\n",
" <td>33.95</td>\n",
" <td>2.8</td>\n",
" <td>200.0</td>\n",
" <td>108.7</td>\n",
" <td>76.1</td>\n",
" <td>192.0</td>\n",
" <td>3.561</td>\n",
" <td>18.5</td>\n",
" <td>22.0</td>\n",
" <td>2.933</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" manufact model sales resale type price engine_s horsepow \\\n",
"0 Acura Integra 16.919 16.360 0.0 21.50 1.8 140.0 \n",
"1 Acura TL 39.384 19.875 0.0 28.40 3.2 225.0 \n",
"2 Acura RL 8.588 29.725 0.0 42.00 3.5 210.0 \n",
"3 Audi A4 20.397 22.255 0.0 23.99 1.8 150.0 \n",
"4 Audi A6 18.780 23.555 0.0 33.95 2.8 200.0 \n",
"\n",
" wheelbas width length curb_wgt fuel_cap mpg lnsales partition \n",
"0 101.2 67.3 172.4 2.639 13.2 28.0 2.828 0.0 \n",
"1 108.1 70.3 192.9 3.517 17.2 25.0 3.673 0.0 \n",
"2 114.6 71.4 196.6 3.850 18.0 22.0 2.150 0.0 \n",
"3 102.6 68.2 178.0 2.998 16.4 27.0 3.015 0.0 \n",
"4 108.7 76.1 192.0 3.561 18.5 22.0 2.933 0.0 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print (\"Shape of dataset before cleaning: \", pdf.size)\n",
"pdf[[ 'sales', 'resale', 'type', 'price', 'engine_s',\n",
" 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',\n",
" 'mpg', 'lnsales']] = pdf[['sales', 'resale', 'type', 'price', 'engine_s',\n",
" 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',\n",
" 'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')\n",
"pdf = pdf.dropna()\n",
"pdf = pdf.reset_index(drop=True)\n",
"print (\"Shape of dataset after cleaning: \", pdf.size)\n",
"pdf.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature selection\n",
"\n",
"Lets select our feature set:\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"featureset = pdf[['engine_s', 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalization\n",
"\n",
"Now we can normalize the feature set. **MinMaxScaler** transforms features by scaling each feature to a given range. It is by default (0, 1). That is, this estimator scales and translates each feature individually such that it is between zero and one.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.11428571, 0.21518987, 0.18655098, 0.28143713, 0.30625832,\n",
" 0.2310559 , 0.13364055, 0.43333333],\n",
" [0.31428571, 0.43037975, 0.3362256 , 0.46107784, 0.5792277 ,\n",
" 0.50372671, 0.31797235, 0.33333333],\n",
" [0.35714286, 0.39240506, 0.47722343, 0.52694611, 0.62849534,\n",
" 0.60714286, 0.35483871, 0.23333333],\n",
" [0.11428571, 0.24050633, 0.21691974, 0.33532934, 0.38082557,\n",
" 0.34254658, 0.28110599, 0.4 ],\n",
" [0.25714286, 0.36708861, 0.34924078, 0.80838323, 0.56724368,\n",
" 0.5173913 , 0.37788018, 0.23333333]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"x = featureset.values #returns a numpy array\n",
"min_max_scaler = MinMaxScaler()\n",
"feature_mtx = min_max_scaler.fit_transform(x)\n",
"feature_mtx [0:5]"
]
},
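{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked illustration of the scaling above, the default `(0, 1)` min-max transformation of a feature value $x$ can be written as follows, where $x_{\\min}$ and $x_{\\max}$ denote the smallest and largest values of that feature in the fitted data (these symbols are notation introduced here, not names from the lab):\n",
"\n",
"$$\n",
"x' = \\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}}\n",
"$$\n",
"\n",
"For example, the first column of the output above is consistent with `engine_s` ranging from 1.0 to 8.0 in the cleaned data, a range inferred from the printed values rather than read from the file: $(1.8 - 1.0) / (8.0 - 1.0) \\approx 0.114$ and $(3.2 - 1.0) / (8.0 - 1.0) \\approx 0.314$.\n"
]
},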
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"clustering_using_scipy\">Clustering using Scipy</h2>\n",
"\n",
"In this part we use Scipy package to cluster the dataset.\n",
"\n",
"First, we calculate the distance matrix. \n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: scipy.zeros is deprecated and will be removed in SciPy 2.0.0, use numpy.zeros instead\n",
" This is separate from the ipykernel package so we can avoid doing imports until\n"
]
}
],
"source": [
"import scipy\n",
"leng = feature_mtx.shape[0]\n",
"D = scipy.zeros([leng,leng])\n",
"for i in range(leng):\n",
" for j in range(leng):\n",
" D[i,j] = scipy.spatial.distance.euclidean(feature_mtx[i], feature_mtx[j])"
]
},
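{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nested loop above fills `D` with pairwise Euclidean distances. Writing $x_{ik}$ for feature $k$ of vehicle $i$ in `feature_mtx` (notation introduced here for illustration), each entry is:\n",
"\n",
"$$\n",
"D_{ij} = \\sqrt{\\sum_{k=1}^{8} \\left(x_{ik} - x_{jk}\\right)^2}\n",
"$$\n",
"\n",
"The matrix is symmetric with zeros on its diagonal. Since each of the 8 scaled features lies in $[0, 1]$, no entry can exceed $\\sqrt{8} \\approx 2.83$.\n"
]
},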
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In agglomerative clustering, at each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster with the remaining clusters in the forest. \n",
"The following methods are supported in Scipy for calculating the distance between the newly formed cluster and each:\n",
"\n",
"```\n",
"- single\n",
"- complete\n",
"- average\n",
"- weighted\n",
"- centroid\n",
"```\n",
"\n",
"We use **complete** for our case, but feel free to change it to see how the results change.\n"
]
},
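{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what three of the listed methods compute, let $A$ and $B$ be two clusters and $d(a, b)$ the distance between two of their members (symbols introduced here for illustration, not taken from the lab):\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"d_{\\text{single}}(A, B) &= \\min_{a \\in A,\\, b \\in B} d(a, b) \\\\\n",
"d_{\\text{complete}}(A, B) &= \\max_{a \\in A,\\, b \\in B} d(a, b) \\\\\n",
"d_{\\text{average}}(A, B) &= \\frac{1}{|A|\\,|B|} \\sum_{a \\in A} \\sum_{b \\in B} d(a, b)\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"At each step, the pair of clusters with the smallest such distance is merged, so the choice of method changes which clusters are considered closest.\n"
]
},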
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/ipykernel_launcher.py:3: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix\n",
" This is separate from the ipykernel package so we can avoid doing imports until\n"
]
}
],
"source": [
"import pylab\n",
"import scipy.cluster.hierarchy\n",
"Z = hierarchy.linkage(D, 'complete')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Essentially, Hierarchical clustering does not require a pre-specified number of clusters. However, in some applications we want a partition of disjoint clusters just as in flat clustering.\n",
"So you can use a cutting line:\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1, 5, 5, 6, 5, 4, 6, 5, 5, 5, 5, 5, 4, 4, 5, 1, 6,\n",
" 5, 5, 5, 4, 2, 11, 6, 6, 5, 6, 5, 1, 6, 6, 10, 9, 8,\n",
" 9, 3, 5, 1, 7, 6, 5, 3, 5, 3, 8, 7, 9, 2, 6, 6, 5,\n",
" 4, 2, 1, 6, 5, 2, 7, 5, 5, 5, 4, 4, 3, 2, 6, 6, 5,\n",
" 7, 4, 7, 6, 6, 5, 3, 5, 5, 6, 5, 4, 4, 1, 6, 5, 5,\n",
" 5, 6, 4, 5, 4, 1, 6, 5, 6, 6, 5, 5, 5, 7, 7, 7, 2,\n",
" 2, 1, 2, 6, 5, 1, 1, 1, 7, 8, 1, 1, 6, 1, 1],\n",
" dtype=int32)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.cluster.hierarchy import fcluster\n",
"max_d = 3\n",
"clusters = fcluster(Z, max_d, criterion='distance')\n",
"clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also, you can determine the number of clusters directly:\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 2, 1,\n",
" 5, 3, 3, 3, 3, 3, 1, 3, 3, 4, 4, 4, 4, 2, 3, 1, 3, 3, 3, 2, 3, 2,\n",
" 4, 3, 4, 1, 3, 3, 3, 2, 1, 1, 3, 3, 1, 3, 3, 3, 3, 2, 2, 2, 1, 3,\n",
" 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 1, 3, 3, 3, 3, 3, 2,\n",
" 3, 2, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 3, 3, 1, 1, 1,\n",
" 3, 4, 1, 1, 3, 1, 1], dtype=int32)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from scipy.cluster.hierarchy import fcluster\n",
"k = 5\n",
"clusters = fcluster(Z, k, criterion='maxclust')\n",
"clusters\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, plot the dendrogram:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = pylab.figure(figsize=(18,50))\n",
"def llf(id):\n",
" return '[%s %s %s]' % (pdf['manufact'][id], pdf['model'][id], int(float(pdf['type'][id])) )\n",
" \n",
"dendro = hierarchy.dendrogram(Z, leaf_label_func=llf, leaf_rotation=0, leaf_font_size =12, orientation = 'right')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"clustering_using_skl\">Clustering using scikit-learn</h2>\n",
"\n",
"Lets redo it again, but this time using scikit-learn package:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dist_matrix = distance_matrix(feature_mtx,feature_mtx) \n",
"print(dist_matrix)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can use the 'AgglomerativeClustering' function from scikit-learn library to cluster the dataset. The AgglomerativeClustering performs a hierarchical clustering using a bottom up approach. The linkage criteria determines the metric used for the merge strategy:\n",
"\n",
"- Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.\n",
"- Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.\n",
"- Average linkage minimizes the average of the distances between all observations of pairs of clusters.\n"
]
},
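{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the Ward criterion more concrete: a standard way to express it is as the increase in total within-cluster sum of squares caused by merging clusters $A$ and $B$, with centroids $\\mu_A$ and $\\mu_B$ (notation introduced here; this is the textbook formulation, not taken from the lab):\n",
"\n",
"$$\n",
"\\Delta(A, B) = \\frac{|A|\\,|B|}{|A| + |B|} \\, \\lVert \\mu_A - \\mu_B \\rVert^2\n",
"$$\n",
"\n",
"At each step Ward linkage merges the pair with the smallest $\\Delta(A, B)$, whereas complete and average linkage merge the pair with the smallest maximum or mean pairwise distance, respectively.\n"
]
},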
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"agglom = AgglomerativeClustering(n_clusters = 6, linkage = 'complete')\n",
"agglom.fit(feature_mtx)\n",
"agglom.labels_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And, we can add a new field to our dataframe to show the cluster of each row:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf['cluster_'] = agglom.labels_\n",
"pdf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.cm as cm\n",
"n_clusters = max(agglom.labels_)+1\n",
"colors = cm.rainbow(np.linspace(0, 1, n_clusters))\n",
"cluster_labels = list(range(0, n_clusters))\n",
"\n",
"# Create a figure of size 6 inches by 4 inches.\n",
"plt.figure(figsize=(16,14))\n",
"\n",
"for color, label in zip(colors, cluster_labels):\n",
" subset = pdf[pdf.cluster_ == label]\n",
" for i in subset.index:\n",
" plt.text(subset.horsepow[i], subset.mpg[i],str(subset['model'][i]), rotation=25) \n",
" plt.scatter(subset.horsepow, subset.mpg, s= subset.price*10, c=color, label='cluster'+str(label),alpha=0.5)\n",
"# plt.scatter(subset.horsepow, subset.mpg)\n",
"plt.legend()\n",
"plt.title('Clusters')\n",
"plt.xlabel('horsepow')\n",
"plt.ylabel('mpg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, we are seeing the distribution of each cluster using the scatter plot, but it is not very clear where is the centroid of each cluster. Moreover, there are 2 types of vehicles in our dataset, \"truck\" (value of 1 in the type column) and \"car\" (value of 1 in the type column). So, we use them to distinguish the classes, and summarize the cluster. First we count the number of cases in each group:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf.groupby(['cluster_','type'])['cluster_'].count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can look at the characteristics of each cluster:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"agg_cars = pdf.groupby(['cluster_','type'])['horsepow','engine_s','mpg','price'].mean()\n",
"agg_cars"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is obvious that we have 3 main clusters with the majority of vehicles in those.\n",
"\n",
"**Cars**:\n",
"\n",
"- Cluster 1: with almost high mpg, and low in horsepower.\n",
"- Cluster 2: with good mpg and horsepower, but higher price than average.\n",
"- Cluster 3: with low mpg, high horsepower, highest price.\n",
"\n",
"**Trucks**:\n",
"\n",
"- Cluster 1: with almost highest mpg among trucks, and lowest in horsepower and price.\n",
"- Cluster 2: with almost low mpg and medium horsepower, but higher price than average.\n",
"- Cluster 3: with good mpg and horsepower, low price.\n",
"\n",
"Please notice that we did not use **type** , and **price** of cars in the clustering process, but Hierarchical clustering could forge the clusters and discriminate them with quite high accuracy.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(16,10))\n",
"for color, label in zip(colors, cluster_labels):\n",
" subset = agg_cars.loc[(label,),]\n",
" for i in subset.index:\n",
" plt.text(subset.loc[i][0]+5, subset.loc[i][2], 'type='+str(int(i)) + ', price='+str(int(subset.loc[i][3]))+'k')\n",
" plt.scatter(subset.horsepow, subset.mpg, s=subset.price*20, c=color, label='cluster'+str(label))\n",
"plt.legend()\n",
"plt.title('Clusters')\n",
"plt.xlabel('horsepow')\n",
"plt.ylabel('mpg')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Want to learn more?</h2>\n",
"\n",
"IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: <a href=\"https://www.ibm.com/analytics/spss-statistics-software\">SPSS Modeler</a>\n",
"\n",
"Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at <a href=\"https://www.ibm.com/cloud/watson-studio\">Watson Studio</a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"## Author\n",
"\n",
"Saeed Aghabozorgi\n",
"\n",
"### Other Contributors\n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a>\n",
"\n",
"## Change Log\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ---------- | ---------------------------------- |\n",
"| 2020-11-03 | 2.1 | Lakshmi | Updated URL |\n",
"| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n",
"| | | | |\n",
"| | | | |\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}