{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://www.bigdatauniversity.com\"><img src=\"https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png\" width=\"400\" align=\"center\"></a>\n",
"\n",
"<h1><center>Density-Based Clustering</center></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the traditional clustering techniques, such as k-means, hierarchical and fuzzy clustering, can be used to group data without supervision. \n",
"\n",
"However, when applied to tasks with arbitrary shape clusters, or clusters within cluster, the traditional techniques might be unable to achieve good results. That is, elements in the same cluster might not share enough similarity or the performance may be poor.\n",
"Additionally, Density-based Clustering locates regions of high density that are separated from one another by regions of low density. Density, in this context, is defined as the number of points within a specified radius.\n",
"\n",
"\n",
"\n",
"In this section, the main focus will be manipulating the data and properties of DBSCAN and observing the resulting clustering."
]
},
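{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the density idea concrete, here is a minimal sketch (with illustrative variable names and an arbitrary radius) that counts, for each point, how many points lie within a specified radius of it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Minimal sketch of the density definition: for each point, count the\n",
"# number of points (itself included) within an arbitrary radius.\n",
"rng = np.random.RandomState(0)\n",
"points = rng.rand(10, 2)  # 10 random 2-D points\n",
"radius = 0.3              # the \"specified radius\" (illustrative value)\n",
"\n",
"# Pairwise Euclidean distances via broadcasting, then neighbor counts.\n",
"dists = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))\n",
"density = (dists <= radius).sum(axis=1)\n",
"print(density)"
]
},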
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Table of contents</h1>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ol>\n",
" <li>Clustering with Randomly Generated Data</li>\n",
" <ol>\n",
" <li><a href=\"#data_generation\">Data generation</a></li>\n",
" <li><a href=\"#modeling\">Modeling</a></li>\n",
" <li><a href=\"#distinguishing_outliers\">Distinguishing Outliers</a></li>\n",
" <li><a href=\"#data_visualization\">Data Visualization</a></li>\n",
" </ol>\n",
" <li><a href=\"#weather_station_clustering\">Weather Station Clustering with DBSCAN & scikit-learn</a></li> \n",
" <ol>\n",
" <li><a href=\"#download_data\">Loading data</a></li>\n",
" <li><a href=\"#load_dataset\">Overview data</a></li>\n",
" <li><a href=\"#cleaning\">Data cleaning</a></li>\n",
" <li><a href=\"#visualization\">Data selection</a></li>\n",
" <li><a href=\"#clustering\">Clustering</a></li>\n",
" <li><a href=\"#visualize_cluster\">Visualization of clusters based on location</a></li>\n",
" <li><a href=\"#clustering_location_mean_max_min_temperature\">Clustering of stations based on their location, mean, max, and min Temperature</a></li>\n",
" <li><a href=\"#visualization_location_temperature\">Visualization of clusters based on location and Temperature</a></li>\n",
" </ol>\n",
" </ol>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the following libraries:\n",
"<ul>\n",
" <li> <b>numpy as np</b> </li>\n",
" <li> <b>DBSCAN</b> from <b>sklearn.cluster</b> </li>\n",
" <li> <b>make_blobs</b> from <b>sklearn.datasets.samples_generator</b> </li>\n",
" <li> <b>StandardScaler</b> from <b>sklearn.preprocessing</b> </li>\n",
" <li> <b>matplotlib.pyplot as plt</b> </li>\n",
"</ul> <br>\n",
"Remember <b> %matplotlib inline </b> to display plots"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Notice: For visualization of map, you need basemap package.\n",
"# if you dont have basemap install on your machine, you can use the following line to install it\n",
"# !conda install -c conda-forge basemap==1.1.0 matplotlib==2.2.2 -y\n",
"# Notice: you maight have to refresh your page and re-run the notebook after installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np \n",
"from sklearn.cluster import DBSCAN \n",
"from sklearn.datasets.samples_generator import make_blobs \n",
"from sklearn.preprocessing import StandardScaler \n",
"import matplotlib.pyplot as plt \n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"data_generation\">Data generation</h2>\n",
"The function below will generate the data points and requires these inputs:\n",
"<ul>\n",
" <li> <b>centroidLocation</b>: Coordinates of the centroids that will generate the random data. </li>\n",
" <ul> <li> Example: input: [[4,3], [2,-1], [-1,4]] </li> </ul>\n",
" <li> <b>numSamples</b>: The number of data points we want generated, split over the number of centroids (# of centroids defined in centroidLocation) </li>\n",
" <ul> <li> Example: 1500 </li> </ul>\n",
" <li> <b>clusterDeviation</b>: The standard deviation between the clusters. The larger the number, the further the spacing. </li>\n",
" <ul> <li> Example: 0.5 </li> </ul>\n",
"</ul>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def createDataPoints(centroidLocation, numSamples, clusterDeviation):\n",
" # Create random data and store in feature matrix X and response vector y.\n",
" X, y = make_blobs(n_samples=numSamples, centers=centroidLocation, \n",
" cluster_std=clusterDeviation)\n",
" \n",
" # Standardize features by removing the mean and scaling to unit variance\n",
" X = StandardScaler().fit_transform(X)\n",
" return X, y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use <b>createDataPoints</b> with the <b>3 inputs</b> and store the output into variables <b>X</b> and <b>y</b>."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X, y = createDataPoints([[4,3], [2,-1], [-1,4]] , 1500, 0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"modeling\">Modeling</h2>\n",
"DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This technique is one of the most common clustering algorithms which works based on density of object.\n",
"The whole idea is that if a particular point belongs to a cluster, it should be near to lots of other points in that cluster.\n",
"\n",
"It works based on two parameters: Epsilon and Minimum Points \n",
"__Epsilon__ determine a specified radius that if includes enough number of points within, we call it dense area \n",
"__minimumSamples__ determine the minimum number of data points we want in a neighborhood to define a cluster.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"epsilon = 0.3\n",
"minimumSamples = 7\n",
"db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(X)\n",
"labels = db.labels_\n",
"labels"
]
},
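{
"cell_type": "markdown",
"metadata": {},
"source": [
"The choice of epsilon strongly affects the result. As a rough illustration (the radii below are arbitrary), we can refit DBSCAN with a few values of <b>eps</b> and compare the number of clusters found, ignoring noise:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sweep over a few arbitrary eps values: smaller radii tend\n",
"# to fragment the data, larger radii tend to merge clusters.\n",
"for eps in [0.1, 0.3, 0.5]:\n",
"    trial_labels = DBSCAN(eps=eps, min_samples=minimumSamples).fit(X).labels_\n",
"    n = len(set(trial_labels)) - (1 if -1 in trial_labels else 0)\n",
"    print('eps={}: {} clusters'.format(eps, n))"
]
},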
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"distinguishing_outliers\">Distinguishing Outliers</h2>\n",
"Lets Replace all elements with 'True' in core_samples_mask that are in the cluster, 'False' if the points are outliers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First, create an array of booleans using the labels from db.\n",
"core_samples_mask = np.zeros_like(db.labels_, dtype=bool)\n",
"core_samples_mask[db.core_sample_indices_] = True\n",
"core_samples_mask"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of clusters in labels, ignoring noise if present.\n",
"n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)\n",
"n_clusters_"
]
},
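{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also count how many points DBSCAN labeled as noise (label -1); <b>n_noise_</b> below is just an illustrative variable name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of points labeled as noise (-1); the variable name is illustrative.\n",
"n_noise_ = list(labels).count(-1)\n",
"n_noise_"
]
},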
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove repetition in labels by turning it into a set.\n",
"unique_labels = set(labels)\n",
"unique_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"data_visualization\">Data visualization</h2>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create colors for the clusters.\n",
"colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))\n",
"colors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Plot the points with colors\n",
"for k, col in zip(unique_labels, colors):\n",
" if k == -1:\n",
" # Black used for noise.\n",
" col = 'k'\n",
"\n",
" class_member_mask = (labels == k)\n",
"\n",
" # Plot the datapoints that are clustered\n",
" xy = X[class_member_mask & core_samples_mask]\n",
" plt.scatter(xy[:, 0], xy[:, 1],s=50, c=col, marker=u'o', alpha=0.5)\n",
"\n",
" # Plot the outliers\n",
" xy = X[class_member_mask & ~core_samples_mask]\n",
" plt.scatter(xy[:, 0], xy[:, 1],s=50, c=col, marker=u'o', alpha=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice\n",
"To better underestand differences between partitional and density-based clusteitng, try to cluster the above dataset into 3 clusters using k-Means. \n",
"Notice: do not generate data again, use the same dataset as above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# write your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click __here__ for the solution.\n",
"\n",
"<!-- Your answer is below:\n",
"\n",
"\n",
"from sklearn.cluster import KMeans \n",
"k = 3\n",
"k_means3 = KMeans(init = \"k-means++\", n_clusters = k, n_init = 12)\n",
"k_means3.fit(X)\n",
"fig = plt.figure(figsize=(6, 4))\n",
"ax = fig.add_subplot(1, 1, 1)\n",
"for k, col in zip(range(k), colors):\n",
" my_members = (k_means3.labels_ == k)\n",
" plt.scatter(X[my_members, 0], X[my_members, 1], c=col, marker=u'o', alpha=0.5)\n",
"plt.show()\n",
"\n",
"\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"<h1 id=\"weather_station_clustering\" align=\"center\"> Weather Station Clustering using DBSCAN & scikit-learn </h1>\n",
"<hr>\n",
"\n",
"DBSCAN is specially very good for tasks like class identification on a spatial context. The wonderful attribute of DBSCAN algorithm is that it can find out any arbitrary shape cluster without getting affected by noise. For example, this following example cluster the location of weather stations in Canada.\n",
"<br>\n",
"DBSCAN can be used here, for instance, to find the group of stations which show the same weather condition. As you can see, it not only finds different arbitrary shaped clusters, can find the denser part of data-centered samples by ignoring less-dense areas or noises.\n",
"\n",
"let's start playing with the data. We will be working according to the following workflow: </font>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### About the dataset\n",
"\n",
"\t\t\n",
"<h4 align = \"center\">\n",
"Environment Canada \n",
"Monthly Values for July - 2015\t\n",
"</h4>\n",
"<html>\n",
"<head>\n",
"<style>\n",
"table {\n",
" font-family: arial, sans-serif;\n",
" border-collapse: collapse;\n",
" width: 100%;\n",
"}\n",
"\n",
"td, th {\n",
" border: 1px solid #dddddd;\n",
" text-align: left;\n",
" padding: 8px;\n",
"}\n",
"\n",
"tr:nth-child(even) {\n",
" background-color: #dddddd;\n",
"}\n",
"</style>\n",
"</head>\n",
"<body>\n",
"\n",
"<table>\n",
" <tr>\n",
" <th>Name in the table</th>\n",
" <th>Meaning</th>\n",
" </tr>\n",
" <tr>\n",
" <td><font color = \"green\"><strong>Stn_Name</font></td>\n",
" <td><font color = \"green\"><strong>Station Name</font</td>\n",
" </tr>\n",
" <tr>\n",
" <td><font color = \"green\"><strong>Lat</font></td>\n",
" <td><font color = \"green\"><strong>Latitude (North+, degrees)</font></td>\n",
" </tr>\n",
" <tr>\n",
" <td><font color = \"green\"><strong>Long</font></td>\n",
" <td><font color = \"green\"><strong>Longitude (West - , degrees)</font></td>\n",
" </tr>\n",
" <tr>\n",
" <td>Prov</td>\n",
" <td>Province</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Tm</td>\n",
" <td>Mean Temperature (°C)</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DwTm</td>\n",
" <td>Days without Valid Mean Temperature</td>\n",
" </tr>\n",
" <tr>\n",
" <td>D</td>\n",
" <td>Mean Temperature difference from Normal (1981-2010) (°C)</td>\n",
" </tr>\n",
" <tr>\n",
" <td><font color = \"black\">Tx</font></td>\n",
" <td><font color = \"black\">Highest Monthly Maximum Temperature (°C)</font></td>\n",
" </tr>\n",
" <tr>\n",
" <td>DwTx</td>\n",
" <td>Days without Valid Maximum Temperature</td>\n",
" </tr>\n",
" <tr>\n",
" <td><font color = \"black\">Tn</font></td>\n",
" <td><font color = \"black\">Lowest Monthly Minimum Temperature (°C)</font></td>\n",
" </tr>\n",
" <tr>\n",
" <td>DwTn</td>\n",
" <td>Days without Valid Minimum Temperature</td>\n",
" </tr>\n",
" <tr>\n",
" <td>S</td>\n",
" <td>Snowfall (cm)</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DwS</td>\n",
" <td>Days without Valid Snowfall</td>\n",
" </tr>\n",
" <tr>\n",
" <td>S%N</td>\n",
" <td>Percent of Normal (1981-2010) Snowfall</td>\n",
" </tr>\n",
" <tr>\n",
" <td><font color = \"green\"><strong>P</font></td>\n",
" <td><font color = \"green\"><strong>Total Precipitation (mm)</font></td>\n",
" </tr>\n",
" <tr>\n",
" <td>DwP</td>\n",
" <td>Days without Valid Precipitation</td>\n",
" </tr>\n",
" <tr>\n",
" <td>P%N</td>\n",
" <td>Percent of Normal (1981-2010) Precipitation</td>\n",
" </tr>\n",
" <tr>\n",
" <td>S_G</td>\n",
" <td>Snow on the ground at the end of the month (cm)</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Pd</td>\n",
" <td>Number of days with Precipitation 1.0 mm or more</td>\n",
" </tr>\n",
" <tr>\n",
" <td>BS</td>\n",
" <td>Bright Sunshine (hours)</td>\n",
" </tr>\n",
" <tr>\n",
" <td>DwBS</td>\n",
" <td>Days without Valid Bright Sunshine</td>\n",
" </tr>\n",
" <tr>\n",
" <td>BS%</td>\n",
" <td>Percent of Normal (1981-2010) Bright Sunshine</td>\n",
" </tr>\n",
" <tr>\n",
" <td>HDD</td>\n",
" <td>Degree Days below 18 °C</td>\n",
" </tr>\n",
" <tr>\n",
" <td>CDD</td>\n",
" <td>Degree Days above 18 °C</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Stn_No</td>\n",
" <td>Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).</td>\n",
" </tr>\n",
" <tr>\n",
" <td>NA</td>\n",
" <td>Not Available</td>\n",
" </tr>\n",
"\n",
"\n",
"</table>\n",
"\n",
"</body>\n",
"</html>\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1-Download data\n",
"<div id=\"download_data\">\n",
" To download the data, we will use <b>!wget</b> to download it from IBM Object Storage.<br> \n",
" <b>Did you know?</b> When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: <a href=\"http://cocl.us/ML0101EN-IBM-Offer-CC\">Sign up now for free</a>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget -O weather-stations20140101-20141231.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/weather-stations20140101-20141231.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2- Load the dataset\n",
"<div id=\"load_dataset\">\n",
"We will import the .csv then we creates the columns for year, month and day.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"filename='weather-stations20140101-20141231.csv'\n",
"\n",
"#Read csv\n",
"pdf = pd.read_csv(filename)\n",
"pdf.head(5)"
]
},
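{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before cleaning, it helps to get a quick overview of the dataset, for example its dimensions and how many values are missing in each column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick overview: number of rows/columns and missing values per column.\n",
"print(pdf.shape)\n",
"pdf.isnull().sum()"
]
},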
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3-Cleaning\n",
"<div id=\"cleaning\">\n",
"Lets remove rows that don't have any value in the <b>Tm</b> field.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pdf = pdf[pd.notnull(pdf[\"Tm\"])]\n",
"pdf = pdf.reset_index(drop=True)\n",
"pdf.head(5)"
]
},
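{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, we can confirm that no missing <b>Tm</b> values remain and see how many stations are left:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the cleaning step: there should be no missing Tm values left.\n",
"print('Remaining rows:', len(pdf))\n",
"print('Missing Tm values:', pdf['Tm'].isnull().sum())"
]
},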
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4-Visualization\n",
"<div id=\"visualization\">\n",
"Visualization of stations on map using basemap package. The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on it’s own, but provides the facilities to transform coordinates to a map projections. <br>\n",
"\n",
"Please notice that the size of each data points represents the average of maximum temperature for each station in a year.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mpl_toolkits.basemap import Basemap\n",
"import matplotlib.pyplot as plt\n",
"from pylab import rcParams\n",
"%matplotlib inline\n",
"rcParams['figure.figsize'] = (14,10)\n",
"\n",
"llon=-140\n",
"ulon=-50\n",
"llat=40\n",
"ulat=65\n",
"\n",
"pdf = pdf[(pdf['Long'] > llon) & (pdf['Long'] < ulon) & (pdf['Lat'] > llat) &(pdf['Lat'] < ulat)]\n",
"\n",
"my_map = Basemap(projection='merc',\n",
" resolution = 'l', area_thresh = 1000.0,\n",
" llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)\n",
" urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)\n",
"\n",
"my_map.drawcoastlines()\n",
"my_map.drawcountries()\n",
"# my_map.drawmapboundary()\n",
"my_map.fillcontinents(color = 'white', alpha = 0.3)\n",
"my_map.shadedrelief()\n",
"\n",
"# To collect data based on stations \n",
"\n",
"xs,ys = my_map(np.asarray(pdf.Long), np.asarray(pdf.Lat))\n",
"pdf['xm']= xs.tolist()\n",
"pdf['ym'] =ys.tolist()\n",
"\n",
"#Visualization1\n",
"for index,row in pdf.iterrows():\n",
"# x,y = my_map(row.Long, row.Lat)\n",
" my_map.plot(row.xm, row.ym,markerfacecolor =([1,0,0]), marker='o', markersize= 5, alpha = 0.75)\n",
"#plt.text(x,y,stn)\n",
"plt.show()\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5- Clustering of stations based on their location i.e. Lat & Lon\n",
"<div id=\"clustering\">\n",
" <b>DBSCAN</b> form sklearn library can runs DBSCAN clustering from vector array or distance matrix.<br>\n",
" In our case, we pass it the Numpy array Clus_dataSet to find core samples of high density and expands clusters from them. \n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"import sklearn.utils\n",
"from sklearn.preprocessing import StandardScaler\n",
"sklearn.utils.check_random_state(1000)\n",
"Clus_dataSet = pdf[['xm','ym']]\n",
"Clus_dataSet = np.nan_to_num(Clus_dataSet)\n",
"Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)\n",
"\n",
"# Compute DBSCAN\n",
"db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)\n",
"core_samples_mask = np.zeros_like(db.labels_, dtype=bool)\n",
"core_samples_mask[db.core_sample_indices_] = True\n",
"labels = db.labels_\n",
"pdf[\"Clus_Db\"]=labels\n",
"\n",
"realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)\n",
"clusterNum = len(set(labels)) \n",
"\n",
"\n",
"# A sample of clusters\n",
"pdf[[\"Stn_Name\",\"Tx\",\"Tm\",\"Clus_Db\"]].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see for outliers, the cluster label is -1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set(labels)"
]
},
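{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also look at how many stations fall into each cluster (the -1 row counts the outliers):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of stations per cluster; -1 is the noise/outlier label.\n",
"pdf['Clus_Db'].value_counts()"
]
},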
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6- Visualization of clusters based on location\n",
"<div id=\"visualize_cluster\">\n",
"Now, we can visualize the clusters using basemap:\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mpl_toolkits.basemap import Basemap\n",
"import matplotlib.pyplot as plt\n",
"from pylab import rcParams\n",
"%matplotlib inline\n",
"rcParams['figure.figsize'] = (14,10)\n",
"\n",
"my_map = Basemap(projection='merc',\n",
" resolution = 'l', area_thresh = 1000.0,\n",
" llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)\n",
" urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)\n",
"\n",
"my_map.drawcoastlines()\n",
"my_map.drawcountries()\n",
"#my_map.drawmapboundary()\n",
"my_map.fillcontinents(color = 'white', alpha = 0.3)\n",
"my_map.shadedrelief()\n",
"\n",
"# To create a color map\n",
"colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))\n",
"\n",
"\n",
"\n",
"#Visualization1\n",
"for clust_number in set(labels):\n",
" c=(([0.4,0.4,0.4]) if clust_number == -1 else colors[np.int(clust_number)])\n",
" clust_set = pdf[pdf.Clus_Db == clust_number] \n",
" my_map.scatter(clust_set.xm, clust_set.ym, color =c, marker='o', s= 20, alpha = 0.85)\n",
" if clust_number != -1:\n",
" cenx=np.mean(clust_set.xm) \n",
" ceny=np.mean(clust_set.ym) \n",
" plt.text(cenx,ceny,str(clust_number), fontsize=25, color='red',)\n",
" print (\"Cluster \"+str(clust_number)+', Avg Temp: '+ str(np.mean(clust_set.Tm)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7- Clustering of stations based on their location, mean, max, and min Temperature\n",
"<div id=\"clustering_location_mean_max_min_temperature\">\n",
"In this section we re-run DBSCAN, but this time on a 5-dimensional dataset:\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from sklearn.cluster import DBSCAN\n",
"import sklearn.utils\n",
"from sklearn.preprocessing import StandardScaler\n",
"sklearn.utils.check_random_state(1000)\n",
"Clus_dataSet = pdf[['xm','ym','Tx','Tm','Tn']]\n",
"Clus_dataSet = np.nan_to_num(Clus_dataSet)\n",
"Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)\n",
"\n",
"# Compute DBSCAN\n",
"db = DBSCAN(eps=0.3, min_samples=10).fit(Clus_dataSet)\n",
"core_samples_mask = np.zeros_like(db.labels_, dtype=bool)\n",
"core_samples_mask[db.core_sample_indices_] = True\n",
"labels = db.labels_\n",
"pdf[\"Clus_Db\"]=labels\n",
"\n",
"realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)\n",
"clusterNum = len(set(labels)) \n",
"\n",
"\n",
"# A sample of clusters\n",
"pdf[[\"Stn_Name\",\"Tx\",\"Tm\",\"Clus_Db\"]].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 8- Visualization of clusters based on location and Temperature\n",
"<div id=\"visualization_location_temperature\">\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from mpl_toolkits.basemap import Basemap\n",
"import matplotlib.pyplot as plt\n",
"from pylab import rcParams\n",
"%matplotlib inline\n",
"rcParams['figure.figsize'] = (14,10)\n",
"\n",
"my_map = Basemap(projection='merc',\n",
" resolution = 'l', area_thresh = 1000.0,\n",
" llcrnrlon=llon, llcrnrlat=llat, #min longitude (llcrnrlon) and latitude (llcrnrlat)\n",
" urcrnrlon=ulon, urcrnrlat=ulat) #max longitude (urcrnrlon) and latitude (urcrnrlat)\n",
"\n",
"my_map.drawcoastlines()\n",
"my_map.drawcountries()\n",
"#my_map.drawmapboundary()\n",
"my_map.fillcontinents(color = 'white', alpha = 0.3)\n",
"my_map.shadedrelief()\n",
"\n",
"# To create a color map\n",
"colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))\n",
"\n",
"\n",
"\n",
"#Visualization1\n",
"for clust_number in set(labels):\n",
" c=(([0.4,0.4,0.4]) if clust_number == -1 else colors[np.int(clust_number)])\n",
" clust_set = pdf[pdf.Clus_Db == clust_number] \n",
" my_map.scatter(clust_set.xm, clust_set.ym, color =c, marker='o', s= 20, alpha = 0.85)\n",
" if clust_number != -1:\n",
" cenx=np.mean(clust_set.xm) \n",
" ceny=np.mean(clust_set.ym) \n",
" plt.text(cenx,ceny,str(clust_number), fontsize=25, color='red',)\n",
" print (\"Cluster \"+str(clust_number)+', Avg Temp: '+ str(np.mean(clust_set.Tm)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Want to learn more?</h2>\n",
"\n",
"IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: <a href=\"http://cocl.us/ML0101EN-SPSSModeler\">SPSS Modeler</a>\n",
"\n",
"Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at <a href=\"https://cocl.us/ML0101EN_DSX\">Watson Studio</a>\n",
"\n",
"<h3>Thanks for completing this lesson!</h3>\n",
"\n",
"<h4>Author: <a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a></h4>\n",
"<p><a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a>, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</p>\n",
"\n",
"<hr>\n",
"\n",
"<p>Copyright &copy; 2018 <a href=\"https://cocl.us/DX0108EN_CC\">Cognitive Class</a>. This notebook and its source code are released under the terms of the <a href=\"https://bigdatauniversity.com/mit-license/\">MIT License</a>.</p>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}