pb111/K-Means Clustering with Python and Scikit-Learn.ipynb

## K-Means Clustering with Python and Scikit-Learn.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K-Means Clustering with Python and Scikit-Learn\n",
    "\n",
    "\n",
    "K-Means clustering is one of the most popular unsupervised machine learning algorithms. It is used to find intrinsic groups within an unlabelled dataset and to draw inferences from them. In this project, I use the `Facebook Live Sellers in Thailand` dataset and implement K-Means clustering to find intrinsic groups within it that display the same `status_type` behaviour. The `status_type` variable records the nature of each post (video, photo, status or link)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "\n",
    "\n",
    "1.\tIntroduction to K-Means Clustering\n",
    "2.\tK-Means Clustering intuition\n",
    "3.\tChoosing the value of K\n",
    "4.\tThe elbow method\n",
    "5.\tThe problem statement\n",
    "6.\tDataset description\n",
    "7.\tImport libraries\n",
    "8.\tImport dataset\n",
    "9.\tExploratory data analysis\n",
    "10.\tDeclare feature vector and target variable\n",
    "11.\tConvert categorical variable into integers\n",
    "12.\tFeature scaling\n",
    "13.\tK-Means model with two clusters\n",
    "14.\tK-Means model parameters study\n",
    "15.\tCheck quality of weak classification by the model\n",
    "16.\tUse elbow method to find optimal number of clusters\n",
    "17.\tK-Means model with different clusters\n",
    "18.\tResults and conclusion\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction to K-Means Clustering\n",
    "\n",
    "\n",
    "Machine learning algorithms can be broadly classified into two categories: supervised and unsupervised learning. There are other categories as well, such as semi-supervised learning and reinforcement learning, but most algorithms fall into one of the first two. The difference between them lies in the presence of a target variable. In supervised learning, the dataset contains a target variable that the model learns to predict. In unsupervised learning, there is no target variable; the dataset has only input variables that describe the data.\n",
    "\n",
    "**K-Means clustering** is one of the most popular unsupervised learning algorithms. It is used when we have unlabelled data, that is, data without defined categories or groups. The algorithm partitions a given dataset into a certain number of clusters, K, fixed a priori. It works iteratively to assign each data point to one of the K groups based on the features that are provided, so data points are clustered based on feature similarity.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. K-Means Clustering intuition\n",
    "\n",
    "\n",
    "K-Means clustering is used to find intrinsic groups within an unlabelled dataset and draw inferences from them. It is a centroid-based clustering algorithm.\n",
    "\n",
    "\n",
    "**Centroid** - A centroid is the point at the centre of a cluster; in K-Means it is the mean of the cluster's data points and need not be one of them. In centroid-based clustering, each cluster is represented by its centroid, and the notion of similarity is derived from how close a data point is to the centroid of the cluster.\n",
    "\n",
    "\n",
    "K-Means clustering works as follows. The algorithm requires the number of clusters K and the dataset as input, where the dataset is a collection of features for each data point. It starts with initial estimates for the K centroids and then iterates between two steps:\n",
    "\n",
    "\n",
    "**1. Data assignment step**\n",
    "\n",
    "\n",
    "Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, where nearness is measured by the squared Euclidean distance. So, if $c_i$ is a centroid in the set of centroids $C$, each data point is assigned to the cluster whose $c_i$ gives the minimum squared Euclidean distance.\n",
    "\n",
    "\n",
    "\n",
    "**2. Centroid update step**\n",
    "\n",
    "\n",
    "In this step, the centroids are recomputed and updated. Each centroid is set to the mean of all data points assigned to its cluster.\n",
    "\n",
    "\n",
    "The algorithm then iterates between step 1 and step 2 until a stopping criterion is met: no data point changes cluster, the sum of the distances stops decreasing, or some maximum number of iterations is reached.\n",
    "\n",
    "This algorithm is guaranteed to converge to a result, but the result may be only a local optimum. Running the algorithm more than once with randomized starting centroids and comparing the outcomes may therefore give a better result.\n"
   ]
  },
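  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the scikit-learn implementation: the function name `kmeans_sketch`, its parameters and the synthetic `X_toy` data are all made up for the example, and the initial centroids are simply K data points picked at random."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def kmeans_sketch(X, k, n_iter=100, seed=0):\n",
    "    # illustrative K-Means loop; all names here are hypothetical\n",
    "    rng = np.random.default_rng(seed)\n",
    "    # initial estimates: k distinct data points chosen at random\n",
    "    centroids = X[rng.choice(len(X), size=k, replace=False)]\n",
    "    for _ in range(n_iter):\n",
    "        # 1. data assignment step: squared Euclidean distance to each centroid\n",
    "        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)\n",
    "        labels = dists.argmin(axis=1)\n",
    "        # 2. centroid update step: mean of the points assigned to each cluster\n",
    "        # (an empty cluster keeps its previous centroid)\n",
    "        new_centroids = np.array([\n",
    "            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]\n",
    "            for j in range(k)\n",
    "        ])\n",
    "        # stopping criterion: the centroids no longer move\n",
    "        converged = np.allclose(new_centroids, centroids)\n",
    "        centroids = new_centroids\n",
    "        if converged:\n",
    "            break\n",
    "    return centroids, labels\n",
    "\n",
    "\n",
    "# two well-separated synthetic blobs (made-up data, not the Facebook dataset)\n",
    "toy_rng = np.random.default_rng(42)\n",
    "X_toy = np.vstack([\n",
    "    toy_rng.normal(0, 0.5, (50, 2)),\n",
    "    toy_rng.normal(5, 0.5, (50, 2)),\n",
    "])\n",
    "toy_centroids, toy_labels = kmeans_sketch(X_toy, k=2)\n",
    "print(toy_centroids)"
   ]
  },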
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Choosing the value of K\n",
    "\n",
    "\n",
    "The K-Means algorithm finds the clusters and data labels for a pre-defined value of K; it does not determine K itself. To find the number of clusters in the data, we need to run the algorithm for different values of K and compare the results. The performance of K-Means therefore depends on the value of K, and we should choose the value that gives the best performance. There are different techniques available for this. The most common is the **elbow method**, which is described below.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. The elbow method\n",
    "\n",
    "\n",
    "The elbow method is used to determine the optimal number of clusters in K-Means clustering. It plots the value of the cost function produced by different values of K.\n",
    "\n",
    "As K increases, the average distortion decreases: each cluster has fewer constituent instances, and the instances are closer to their respective centroids. However, the improvement in average distortion diminishes as K increases. The value of K beyond which the improvement drops off most sharply is called the elbow, and this is where we should stop dividing the data into further clusters.\n"
   ]
  },
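  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cost function plotted by the elbow method is not defined above. A common choice, assumed here, is the within-cluster sum of squared distances, which scikit-learn's `KMeans` exposes as the `inertia_` attribute. Writing $S_i$ for the set of points assigned to centroid $c_i$ (notation introduced only for this formula):\n",
    "\n",
    "$$\n",
    "J(K) = \\sum_{i=1}^{K} \\sum_{x \\in S_i} \\lVert x - c_i \\rVert^2\n",
    "$$\n",
    "\n",
    "Dividing $J(K)$ by the number of data points gives an average distortion per point. Because the best achievable $J(K)$ never increases as K grows, the elbow is read from where the curve flattens rather than from its minimum."
   ]
  },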
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. The problem statement\n",
    "\n",
    "\n",
    "In this project, I implement K-Means clustering with Python and Scikit-Learn. As mentioned earlier, K-Means clustering is used to find intrinsic groups within an unlabelled dataset and draw inferences from them. I use the `Facebook Live Sellers in Thailand` dataset and look for intrinsic groups within it that display the same `status_type` behaviour. The `status_type` variable records the nature of each post (video, photo, status or link)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Dataset description\n",
    "\n",
    "\n",
    "In this project, I have used the `Facebook Live Sellers in Thailand` dataset, downloaded from the UCI Machine Learning Repository. The dataset can be found at the following URL:\n",
    "\n",
    "\n",
    "https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand\n",
    "\n",
    "\n",
    "The dataset covers the Facebook pages of 10 Thai fashion and cosmetics retail sellers. The `status_type` variable records the nature of each post (video, photo, status or link). The dataset also contains engagement metrics: comments, shares and reactions.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ignore warnings\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Import dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = 'C:/datasets/Live.csv'\n",
    "\n",
    "df = pd.read_csv(data)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Exploratory data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check shape of the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(7050, 16)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 7050 instances and 16 attributes in the dataset. The dataset description, however, states that there are 7051 instances and 12 attributes.\n",
    "\n",
    "So, we can infer that the description counts the header row as an instance and that this file contains 4 extra attributes. Next, we should take a look at the dataset to gain more insight into it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>status_id</th>\n",
       "      <th>status_type</th>\n",
       "      <th>status_published</th>\n",
       "      <th>num_reactions</th>\n",
       "      <th>num_comments</th>\n",
       "      <th>num_shares</th>\n",
       "      <th>num_likes</th>\n",
       "      <th>num_loves</th>\n",
       "      <th>num_wows</th>\n",
       "      <th>num_hahas</th>\n",
       "      <th>num_sads</th>\n",
       "      <th>num_angrys</th>\n",
       "      <th>Column1</th>\n",
       "      <th>Column2</th>\n",
       "      <th>Column3</th>\n",
       "      <th>Column4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>246675545449582_1649696485147474</td>\n",
       "      <td>video</td>\n",
       "      <td>4/22/2018 6:00</td>\n",
       "      <td>529</td>\n",
       "      <td>512</td>\n",
       "      <td>262</td>\n",
       "      <td>432</td>\n",
       "      <td>92</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>246675545449582_1649426988507757</td>\n",
       "      <td>photo</td>\n",
       "      <td>4/21/2018 22:45</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>246675545449582_1648730588577397</td>\n",
       "      <td>video</td>\n",
       "      <td>4/21/2018 6:17</td>\n",
       "      <td>227</td>\n",
       "      <td>236</td>\n",
       "      <td>57</td>\n",
       "      <td>204</td>\n",
       "      <td>21</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>246675545449582_1648576705259452</td>\n",
       "      <td>photo</td>\n",
       "      <td>4/21/2018 2:29</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>246675545449582_1645700502213739</td>\n",
       "      <td>photo</td>\n",
       "      <td>4/18/2018 3:22</td>\n",
       "      <td>213</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>204</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          status_id status_type status_published  \\\n",
       "0  246675545449582_1649696485147474       video   4/22/2018 6:00   \n",
       "1  246675545449582_1649426988507757       photo  4/21/2018 22:45   \n",
       "2  246675545449582_1648730588577397       video   4/21/2018 6:17   \n",
       "3  246675545449582_1648576705259452       photo   4/21/2018 2:29   \n",
       "4  246675545449582_1645700502213739       photo   4/18/2018 3:22   \n",
       "\n",
       "   num_reactions  num_comments  num_shares  num_likes  num_loves  num_wows  \\\n",
       "0            529           512         262        432         92         3   \n",
       "1            150             0           0        150          0         0   \n",
       "2            227           236          57        204         21         1   \n",
       "3            111             0           0        111          0         0   \n",
       "4            213             0           0        204          9         0   \n",
       "\n",
       "   num_hahas  num_sads  num_angrys  Column1  Column2  Column3  Column4  \n",
       "0          1         1           0      NaN      NaN      NaN      NaN  \n",
       "1          0         0           0      NaN      NaN      NaN      NaN  \n",
       "2          1         0           0      NaN      NaN      NaN      NaN  \n",
       "3          0         0           0      NaN      NaN      NaN      NaN  \n",
       "4          0         0           0      NaN      NaN      NaN      NaN  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View summary of dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 7050 entries, 0 to 7049\n",
      "Data columns (total 16 columns):\n",
      "status_id           7050 non-null object\n",
      "status_type         7050 non-null object\n",
      "status_published    7050 non-null object\n",
      "num_reactions       7050 non-null int64\n",
      "num_comments        7050 non-null int64\n",
      "num_shares          7050 non-null int64\n",
      "num_likes           7050 non-null int64\n",
      "num_loves           7050 non-null int64\n",
      "num_wows            7050 non-null int64\n",
      "num_hahas           7050 non-null int64\n",
      "num_sads            7050 non-null int64\n",
      "num_angrys          7050 non-null int64\n",
      "Column1             0 non-null float64\n",
      "Column2             0 non-null float64\n",
      "Column3             0 non-null float64\n",
      "Column4             0 non-null float64\n",
      "dtypes: float64(4), int64(9), object(3)\n",
      "memory usage: 881.3+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check for missing values in dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "status_id              0\n",
       "status_type            0\n",
       "status_published       0\n",
       "num_reactions          0\n",
       "num_comments           0\n",
       "num_shares             0\n",
       "num_likes              0\n",
       "num_loves              0\n",
       "num_wows               0\n",
       "num_hahas              0\n",
       "num_sads               0\n",
       "num_angrys             0\n",
       "Column1             7050\n",
       "Column2             7050\n",
       "Column3             7050\n",
       "Column4             7050\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the 4 extra columns (`Column1` to `Column4`) contain only missing values: all 7050 entries in each are null. They are redundant, so we should drop them before proceeding further."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Drop redundant columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop(['Column1', 'Column2', 'Column3', 'Column4'], axis=1, inplace=True)"
   ]
  },
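  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The columns above are dropped by name. As a sketch of an alternative, pandas' `DataFrame.dropna(axis=1, how='all')` removes every column whose values are all missing, without naming them. The small `toy_na` frame below is made-up data standing in for `df`, so that the real dataset is left untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# made-up frame: two populated columns and one entirely empty column\n",
    "toy_na = pd.DataFrame({\n",
    "    'num_likes': [432, 150, 204],\n",
    "    'num_loves': [92, 0, 21],\n",
    "    'Column1': [np.nan, np.nan, np.nan],\n",
    "})\n",
    "\n",
    "# drop only the columns in which every value is missing\n",
    "toy_na_clean = toy_na.dropna(axis=1, how='all')\n",
    "print(toy_na_clean.columns.tolist())"
   ]
  },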
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Again view summary of dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 7050 entries, 0 to 7049\n",
      "Data columns (total 12 columns):\n",
      "status_id           7050 non-null object\n",
      "status_type         7050 non-null object\n",
      "status_published    7050 non-null object\n",
      "num_reactions       7050 non-null int64\n",
      "num_comments        7050 non-null int64\n",
      "num_shares          7050 non-null int64\n",
      "num_likes           7050 non-null int64\n",
      "num_loves           7050 non-null int64\n",
      "num_wows            7050 non-null int64\n",
      "num_hahas           7050 non-null int64\n",
      "num_sads            7050 non-null int64\n",
      "num_angrys          7050 non-null int64\n",
      "dtypes: int64(9), object(3)\n",
      "memory usage: 661.0+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can see that the redundant columns have been removed from the dataset.\n",
    "\n",
    "There are 3 character variables (data type = object) and the remaining 9 are numerical variables (data type = int64).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View the statistical summary of numerical variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>num_reactions</th>\n",
       "      <th>num_comments</th>\n",
       "      <th>num_shares</th>\n",
       "      <th>num_likes</th>\n",
       "      <th>num_loves</th>\n",
       "      <th>num_wows</th>\n",
       "      <th>num_hahas</th>\n",
       "      <th>num_sads</th>\n",
       "      <th>num_angrys</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "      <td>7050.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>230.117163</td>\n",
       "      <td>224.356028</td>\n",
       "      <td>40.022553</td>\n",
       "      <td>215.043121</td>\n",
       "      <td>12.728652</td>\n",
       "      <td>1.289362</td>\n",
       "      <td>0.696454</td>\n",
       "      <td>0.243688</td>\n",
       "      <td>0.113191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>462.625309</td>\n",
       "      <td>889.636820</td>\n",
       "      <td>131.599965</td>\n",
       "      <td>449.472357</td>\n",
       "      <td>39.972930</td>\n",
       "      <td>8.719650</td>\n",
       "      <td>3.957183</td>\n",
       "      <td>1.597156</td>\n",
       "      <td>0.726812</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>17.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>17.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>59.500000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>58.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>219.000000</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>184.750000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>4710.000000</td>\n",
       "      <td>20990.000000</td>\n",
       "      <td>3424.000000</td>\n",
       "      <td>4710.000000</td>\n",
       "      <td>657.000000</td>\n",
       "      <td>278.000000</td>\n",
       "      <td>157.000000</td>\n",
       "      <td>51.000000</td>\n",
       "      <td>31.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       num_reactions  num_comments   num_shares    num_likes    num_loves  \\\n",
       "count    7050.000000   7050.000000  7050.000000  7050.000000  7050.000000   \n",
       "mean      230.117163    224.356028    40.022553   215.043121    12.728652   \n",
       "std       462.625309    889.636820   131.599965   449.472357    39.972930   \n",
       "min         0.000000      0.000000     0.000000     0.000000     0.000000   \n",
       "25%        17.000000      0.000000     0.000000    17.000000     0.000000   \n",
       "50%        59.500000      4.000000     0.000000    58.000000     0.000000   \n",
       "75%       219.000000     23.000000     4.000000   184.750000     3.000000   \n",
       "max      4710.000000  20990.000000  3424.000000  4710.000000   657.000000   \n",
       "\n",
       "          num_wows    num_hahas     num_sads   num_angrys  \n",
       "count  7050.000000  7050.000000  7050.000000  7050.000000  \n",
       "mean      1.289362     0.696454     0.243688     0.113191  \n",
       "std       8.719650     3.957183     1.597156     0.726812  \n",
       "min       0.000000     0.000000     0.000000     0.000000  \n",
       "25%       0.000000     0.000000     0.000000     0.000000  \n",
       "50%       0.000000     0.000000     0.000000     0.000000  \n",
       "75%       0.000000     0.000000     0.000000     0.000000  \n",
       "max     278.000000   157.000000    51.000000    31.000000  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 3 categorical variables in the dataset. I will explore them one by one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore `status_id` variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['246675545449582_1649696485147474',\n",
       "       '246675545449582_1649426988507757',\n",
       "       '246675545449582_1648730588577397', ...,\n",
       "       '1050855161656896_1060126464063099',\n",
       "       '1050855161656896_1058663487542730',\n",
       "       '1050855161656896_1050858841656528'], dtype=object)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view the labels in the variable\n",
    "\n",
    "df['status_id'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6997"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view how many distinct values the variable has\n",
    "\n",
    "len(df['status_id'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 6997 unique labels in the `status_id` variable, against 7050 instances in the dataset. So, it is approximately a unique identifier for each instance and carries no information that clustering can use. Hence, I will drop it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore `status_published` variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['4/22/2018 6:00', '4/21/2018 22:45', '4/21/2018 6:17', ...,\n",
       "       '9/21/2016 23:03', '9/20/2016 0:43', '9/10/2016 10:30'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view the labels in the variable\n",
    "\n",
    "df['status_published'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6913"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view how many distinct values the variable has\n",
    "\n",
    "len(df['status_published'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we can see that there are 6913 unique labels in the `status_published` variable, against 7050 instances in the dataset. So, it is also approximately a unique identifier for each instance and not a variable that we can use. Hence, I will drop it as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explore `status_type` variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['video', 'photo', 'link', 'status'], dtype=object)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view the labels in the variable\n",
    "\n",
    "df['status_type'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view how many distinct values the variable has\n",
    "\n",
    "len(df['status_type'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there are 4 categories of labels in the `status_type` variable."
   ]
  },
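  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three checks above compare the number of distinct values with the number of rows, one column at a time. As a sketch, the same comparison can be made for every column at once with `DataFrame.nunique()`: a ratio close to 1 flags an identifier-like column, while a small ratio flags a genuine category. The `toy_posts` frame and the name `unique_ratio` are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# made-up frame: an id-like column and a low-cardinality column\n",
    "toy_posts = pd.DataFrame({\n",
    "    'status_id': ['a_1', 'a_2', 'a_3', 'a_3'],\n",
    "    'status_type': ['video', 'photo', 'photo', 'photo'],\n",
    "})\n",
    "\n",
    "# share of distinct values per column (1.0 would mean every row is unique)\n",
    "unique_ratio = toy_posts.nunique() / len(toy_posts)\n",
    "print(unique_ratio)"
   ]
  },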
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Drop `status_id` and `status_published` variables from the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop(['status_id', 'status_published'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View the summary of dataset again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 7050 entries, 0 to 7049\n",
      "Data columns (total 10 columns):\n",
      "status_type      7050 non-null object\n",
      "num_reactions    7050 non-null int64\n",
      "num_comments     7050 non-null int64\n",
      "num_shares       7050 non-null int64\n",
      "num_likes        7050 non-null int64\n",
      "num_loves        7050 non-null int64\n",
      "num_wows         7050 non-null int64\n",
      "num_hahas        7050 non-null int64\n",
      "num_sads         7050 non-null int64\n",
      "num_angrys       7050 non-null int64\n",
      "dtypes: int64(9), object(1)\n",
      "memory usage: 550.9+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview the dataset again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>status_type</th>\n",
       "      <th>num_reactions</th>\n",
       "      <th>num_comments</th>\n",
       "      <th>num_shares</th>\n",
       "      <th>num_likes</th>\n",
       "      <th>num_loves</th>\n",
       "      <th>num_wows</th>\n",
       "      <th>num_hahas</th>\n",
       "      <th>num_sads</th>\n",
       "      <th>num_angrys</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>video</td>\n",
       "      <td>529</td>\n",
       "      <td>512</td>\n",
       "      <td>262</td>\n",
       "      <td>432</td>\n",
       "      <td>92</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>photo</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>video</td>\n",
       "      <td>227</td>\n",
       "      <td>236</td>\n",
       "      <td>57</td>\n",
       "      <td>204</td>\n",
       "      <td>21</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>photo</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>photo</td>\n",
       "      <td>213</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>204</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  status_type  num_reactions  num_comments  num_shares  num_likes  num_loves  \\\n",
       "0       video            529           512         262        432         92   \n",
       "1       photo            150             0           0        150          0   \n",
       "2       video            227           236          57        204         21   \n",
       "3       photo            111             0           0        111          0   \n",
       "4       photo            213             0           0        204          9   \n",
       "\n",
       "   num_wows  num_hahas  num_sads  num_angrys  \n",
       "0         3          1         1           0  \n",
       "1         0          0         0           0  \n",
       "2         1          1         0           0  \n",
       "3         0          0         0           0  \n",
       "4         0          0         0           0  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that there is 1 non-numeric column `status_type` in the dataset. I will convert it into integer equivalents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Declare feature vector and target variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = df\n",
    "\n",
    "y = df['status_type']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. Convert categorical variable into integers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "le = LabelEncoder()\n",
    "\n",
    "X['status_type'] = le.fit_transform(X['status_type'])\n",
    "\n",
    "y = le.transform(y)"
   ]
  },
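  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The integer codes produced by `LabelEncoder` are easier to interpret when the mapping is printed. The sketch below is a minimal, self-contained illustration: the list `status` is a toy stand-in for `df['status_type']`, and the names `mapping` and `restored` are introduced here only for the example.\n",
    "\n",
    "```python\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# Toy stand-in for df['status_type']; it contains the same four categories.\n",
    "status = ['video', 'photo', 'link', 'status', 'photo']\n",
    "\n",
    "le = LabelEncoder()\n",
    "codes = le.fit_transform(status)\n",
    "\n",
    "# classes_ holds the sorted categories; the position of each one is its integer code.\n",
    "mapping = {label: int(code) for code, label in enumerate(le.classes_)}\n",
    "print(mapping)  # {'link': 0, 'photo': 1, 'status': 2, 'video': 3}\n",
    "\n",
    "# inverse_transform recovers the original strings from the codes.\n",
    "restored = list(le.inverse_transform(codes))\n",
    "```\n",
    "\n",
    "This ordering is consistent with the preview of `X`, where `video` appears as 3 and `photo` as 1. Note that the codes impose an artificial order on a nominal variable, which a distance-based method such as K-Means will treat as meaningful."
   ]
  },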
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View the summary of X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 7050 entries, 0 to 7049\n",
      "Data columns (total 10 columns):\n",
      "status_type      7050 non-null int32\n",
      "num_reactions    7050 non-null int64\n",
      "num_comments     7050 non-null int64\n",
      "num_shares       7050 non-null int64\n",
      "num_likes        7050 non-null int64\n",
      "num_loves        7050 non-null int64\n",
      "num_wows         7050 non-null int64\n",
      "num_hahas        7050 non-null int64\n",
      "num_sads         7050 non-null int64\n",
      "num_angrys       7050 non-null int64\n",
      "dtypes: int32(1), int64(9)\n",
      "memory usage: 523.3 KB\n"
     ]
    }
   ],
   "source": [
    "X.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview the dataset X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>status_type</th>\n",
       "      <th>num_reactions</th>\n",
       "      <th>num_comments</th>\n",
       "      <th>num_shares</th>\n",
       "      <th>num_likes</th>\n",
       "      <th>num_loves</th>\n",
       "      <th>num_wows</th>\n",
       "      <th>num_hahas</th>\n",
       "      <th>num_sads</th>\n",
       "      <th>num_angrys</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>529</td>\n",
       "      <td>512</td>\n",
       "      <td>262</td>\n",
       "      <td>432</td>\n",
       "      <td>92</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>227</td>\n",
       "      <td>236</td>\n",
       "      <td>57</td>\n",
       "      <td>204</td>\n",
       "      <td>21</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>111</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>213</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>204</td>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   status_type  num_reactions  num_comments  num_shares  num_likes  num_loves  \\\n",
       "0            3            529           512         262        432         92   \n",
       "1            1            150             0           0        150          0   \n",
       "2            3            227           236          57        204         21   \n",
       "3            1            111             0           0        111          0   \n",
       "4            1            213             0           0        204          9   \n",
       "\n",
       "   num_wows  num_hahas  num_sads  num_angrys  \n",
       "0         3          1         1           0  \n",
       "1         0          0         0           0  \n",
       "2         1          1         0           0  \n",
       "3         0          0         0           0  \n",
       "4         0          0         0           0  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12. Feature Scaling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = X.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "ms = MinMaxScaler()\n",
    "\n",
    "X = ms.fit_transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = pd.DataFrame(X, columns=[cols])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>status_type</th>\n",
       "      <th>num_reactions</th>\n",
       "      <th>num_comments</th>\n",
       "      <th>num_shares</th>\n",
       "      <th>num_likes</th>\n",
       "      <th>num_loves</th>\n",
       "      <th>num_wows</th>\n",
       "      <th>num_hahas</th>\n",
       "      <th>num_sads</th>\n",
       "      <th>num_angrys</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.112314</td>\n",
       "      <td>0.024393</td>\n",
       "      <td>0.076519</td>\n",
       "      <td>0.091720</td>\n",
       "      <td>0.140030</td>\n",
       "      <td>0.010791</td>\n",
       "      <td>0.006369</td>\n",
       "      <td>0.019608</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.333333</td>\n",
       "      <td>0.031847</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.031847</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.048195</td>\n",
       "      <td>0.011243</td>\n",
       "      <td>0.016647</td>\n",
       "      <td>0.043312</td>\n",
       "      <td>0.031963</td>\n",
       "      <td>0.003597</td>\n",
       "      <td>0.006369</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.333333</td>\n",
       "      <td>0.023567</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.023567</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.333333</td>\n",
       "      <td>0.045223</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.043312</td>\n",
       "      <td>0.013699</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  status_type num_reactions num_comments num_shares num_likes num_loves  \\\n",
       "0    1.000000      0.112314     0.024393   0.076519  0.091720  0.140030   \n",
       "1    0.333333      0.031847     0.000000   0.000000  0.031847  0.000000   \n",
       "2    1.000000      0.048195     0.011243   0.016647  0.043312  0.031963   \n",
       "3    0.333333      0.023567     0.000000   0.000000  0.023567  0.000000   \n",
       "4    0.333333      0.045223     0.000000   0.000000  0.043312  0.013699   \n",
       "\n",
       "   num_wows num_hahas  num_sads num_angrys  \n",
       "0  0.010791  0.006369  0.019608        0.0  \n",
       "1  0.000000  0.000000  0.000000        0.0  \n",
       "2  0.003597  0.006369  0.000000        0.0  \n",
       "3  0.000000  0.000000  0.000000        0.0  \n",
       "4  0.000000  0.000000  0.000000        0.0  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X.head()"
   ]
  },
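  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks that formula against `MinMaxScaler` on a toy column; the array `counts` and its values are illustrative assumptions, not rows from the dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# Toy column of counts (illustrative values only).\n",
    "counts = np.array([[0.0], [150.0], [529.0], [1000.0]])\n",
    "\n",
    "# Min-max scaling computed by hand, column by column.\n",
    "col_min = counts.min(axis=0)\n",
    "col_max = counts.max(axis=0)\n",
    "manual = (counts - col_min) / (col_max - col_min)\n",
    "\n",
    "# The same transformation with scikit-learn.\n",
    "scaled = MinMaxScaler().fit_transform(counts)\n",
    "\n",
    "print(np.allclose(manual, scaled))\n",
    "```\n",
    "\n",
    "The same formula explains why the encoded `status_type` value 1 appears as 0.333333 in the scaled preview: the codes run from 0 to 3, so 1 becomes 1/3."
   ]
  },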
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 13. K-Means model with two clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n",
       "    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',\n",
       "    random_state=0, tol=0.0001, verbose=0)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "kmeans = KMeans(n_clusters=2, random_state=0) \n",
    "\n",
    "kmeans.fit(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 14. K-Means model parameters study"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[3.28506857e-01, 3.90710874e-02, 7.54854864e-04, 7.53667113e-04,\n",
       "        3.85438884e-02, 2.17448568e-03, 2.43721364e-03, 1.20039760e-03,\n",
       "        2.75348016e-03, 1.45313276e-03],\n",
       "       [9.54921576e-01, 6.46330441e-02, 2.67028654e-02, 2.93171709e-02,\n",
       "        5.71231462e-02, 4.71007076e-02, 8.18581889e-03, 9.65207685e-03,\n",
       "        8.04219428e-03, 7.19501847e-03]])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans.cluster_centers_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The KMeans algorithm clusters data by trying to separate samples in n groups of equal variances, minimizing a criterion known as **inertia**, or within-cluster sum-of-squares Inertia, or the within-cluster sum of squares criterion, can be recognized as a measure of how internally coherent clusters are.\n",
    "\n",
    "\n",
    "- The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean j of the samples in the cluster. The means are commonly called the cluster **centroids**.\n",
    "\n",
    "\n",
    "- The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum of squared criterion."
   ]
  },
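  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The definition of inertia given above can be checked by hand: sum the squared distance from every sample to the centroid of its assigned cluster. The sketch below does this on a small synthetic data set; `points`, `model` and `manual_inertia` are names assumed for this example only.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "# Small synthetic data set: two well-separated blobs (illustrative only).\n",
    "rng = np.random.RandomState(0)\n",
    "points = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),\n",
    "                    rng.normal(1.0, 0.1, size=(20, 2))])\n",
    "\n",
    "model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)\n",
    "\n",
    "# Look up the centroid assigned to each sample, then sum the squared distances.\n",
    "assigned = model.cluster_centers_[model.labels_]\n",
    "manual_inertia = float(((points - assigned) ** 2).sum())\n",
    "\n",
    "print(manual_inertia, model.inertia_)\n",
    "```\n",
    "\n",
    "The two printed numbers should agree up to floating-point error, since `inertia_` is this same within-cluster sum of squares."
   ]
  },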
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inertia\n",
    "\n",
    "\n",
    "- **Inertia** is not a normalized metric. \n",
    "\n",
    "- The lower values of inertia are better and zero is optimal. \n",
    "\n",
    "- But in very high-dimensional spaces, euclidean distances tend to become inflated (this is an instance of `curse of dimensionality`). \n",
    "\n",
    "- Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.\n",
    "\n",
    "- We can calculate model inertia as follows:-"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "237.75726404419564"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans.inertia_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The lesser the model inertia, the better the model fit.\n",
    "\n",
    "- We can see that the model has very high inertia. So, this is not a good model fit to the data."
   ]
  },
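  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inertia notes above mention running PCA before K-Means. A minimal sketch of that idea follows, assuming a synthetic feature matrix `features` in place of the scaled dataset; the choice of 3 components is arbitrary and would need tuning on real data.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.decomposition import PCA\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "# Synthetic stand-in for a scaled feature matrix: 200 samples, 10 features.\n",
    "rng = np.random.RandomState(0)\n",
    "features = rng.rand(200, 10)\n",
    "\n",
    "# Project onto 3 principal components, then cluster in the reduced space.\n",
    "pipeline = make_pipeline(PCA(n_components=3, random_state=0),\n",
    "                         KMeans(n_clusters=2, n_init=10, random_state=0))\n",
    "cluster_ids = pipeline.fit_predict(features)\n",
    "\n",
    "print(cluster_ids.shape)\n",
    "```\n",
    "\n",
    "Wrapping both steps in one pipeline keeps the projection and the clustering fitted on the same data, which makes the reduced model easy to reuse."
   ]
  },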
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## 15. Check quality of weak classification by the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result: 63 out of 7050 samples were correctly labeled.\n"
     ]
    }
   ],
   "source": [
    "labels = kmeans.labels_\n",
    "\n",
    "# check how many of the samples were correctly labeled\n",
    "correct_labels = sum(y == labels)\n",
    "\n",
    "print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy score: 0.01\n"
     ]
    }
   ],
   "source": [
    "print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have achieved a weak classification accuracy of 1% by our unsupervised model."
   ]
  },
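  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat about the comparison above: K-Means numbers its clusters arbitrarily, so a cluster id equal to 0 need not correspond to the label encoded as 0. The toy sketch below shows how a direct comparison can report zero accuracy for a perfect grouping, and how a numbering-independent score such as `adjusted_rand_score` or a contingency table avoids this. The arrays `true_labels` and `cluster_ids` are made up for illustration and are not results from this dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "# Toy example: the grouping is perfect, but the cluster ids are swapped.\n",
    "true_labels = np.array([0, 0, 0, 1, 1, 1])\n",
    "cluster_ids = np.array([1, 1, 1, 0, 0, 0])\n",
    "\n",
    "# Direct comparison treats cluster ids as class labels and reports 0.0 here.\n",
    "naive_accuracy = float((true_labels == cluster_ids).mean())\n",
    "\n",
    "# The adjusted Rand index ignores how clusters are numbered and reports 1.0.\n",
    "ari = adjusted_rand_score(true_labels, cluster_ids)\n",
    "\n",
    "# A contingency table shows which labels fall into which cluster.\n",
    "table = pd.crosstab(true_labels, cluster_ids,\n",
    "                    rownames=['label'], colnames=['cluster'])\n",
    "print(naive_accuracy, ari)\n",
    "print(table)\n",
    "```\n",
    "\n",
    "With two clusters and four `status_type` categories, a direct comparison can also only ever match the labels encoded as 0 and 1, so a low score there does not by itself show that the clusters are meaningless."
   ]
  },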
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16. Use elbow method to find optimal number of clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEWCAYAAAB8LwAVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJzt3XmUXOV55/HvU1Vdvaq7WlJLSK0uWoDYly4sFLyyemyPjSGxiZ3YA8EkzCSOMcYOsX0mieOZkzF4wfZ4QoLBLDaxTbAzEJtge9hMHAMSSEiAWLV1SwK1ll7U+/LMH/eWVGqVultSV9+urt/nnD51t6r73ELUr+59632vuTsiIiJjxaIuQEREZiYFhIiI5KWAEBGRvBQQIiKSlwJCRETyUkCIiEheCgiZsczsS2b2g2nYT7OZuZklwvnHzOyPC73f6TCVx2Jmd5rZ/5yK15LioICQyJjZ3py/UTPry5n/2BTv604zGxyzz+emch9HKiegnh2zfH5Y86ZJvs60BKqUDgWERMbda7J/wBbgkpxl9xRglzfl7tPdzyrAPo5GtZmdnjP/h8DGqIoRUUDITJc0s7vNrNvMXjCz5dkVZrbYzH5iZu1mttHMrp3C/R5vZk+bWaeZ3W9mc3P2+8Gwlo7wEs4p4fKrzOxfc7Z7zczuzZlvNbOWcfb5feDKnPkrgLtzNzjUMZvZe4EvAh/Jc3Z0rJn9JnwPf2lm8yc6lnBdxsyeDZ/3Y6Bicm+dzBYKCJnpPgj8CEgBDwDfATCzGPCvwHNAI3ARcJ2ZvWeK9nsF8AlgMTAMfDvc74nAD4HrgAbgQeBfzSwJPA6808xiZrYIKAPeHj7vOKAGWDvOPn8AfNTM4uEH9RzgqezK8Y7Z3R8C/g74cZ6zoz8ErgIWAEngcxMdS3g8/5cgtOYC/wx86LDeQSl6CgiZ6f7d3R909xGCD6vsB985QIO7f9ndB919A/Bd4KPjvNbnwm/K2b+7xtn2++7+vLv3AH8F/L6ZxYGPAD9391+5+xDwNaASeFtYQzfQApwH/ALYamYnh/NPuPvoOPtsA14GLiY4k7h7zPojOWaAO9z9FXfvA+4N62O8YwHOJQi4b7r7kLvfB6ycYD8yyySiLkBkAm/kTPcCFeGvjY4FFptZR876OPDEOK/1NXf/75Pcb2vO9GaCD8v5BGcUm7Mr3H3UzFoJvtFDcBZxPnBCON1BEA5vDecncjfwRwQf0u8CluWsO5JjhoPfw5pwerxjGQG2+oGjeW5GSooCQopVK7DR3ZdNuOWRacqZTgNDwE5gG3BGdoWZWbjt1nDR48AlwFKCSz4dwMcIAuI7k9jvT8LtnnH3zWaWe3wTHfPhDs083rE40GhmlhMSaeD1w9yHFDFdYpJi9TTQZWZ/aWaV4XX7083snCl6/Y+b2almVgV8GbgvvMx1L/B+M7vIzMqAzwIDwH+Ez3scuACodPc2gm/37wXmAasn2ml4SetCIF/fhYmO+U2gOWyrmIzxjuW3BG0v15pZwsx+D1gxydeVWUIBIUUp/LC+hOB6+kaCb/e3AXXjPO2GMf0gdo6z7feBOwkuz1QA14b7fRn4OPC/w31eQvDz3MFw/SvAXsLLPu7eBWwAfhPWPJljW+XuB31Tn8Qx/3P4uGtsn4pD7OeQxxIez+8RXO7aQ9Be8dPJ1C+zh+mGQSIiko/OIEREJC8FhIiI5KWAEBGRvBQQIiKSV1H3g5g/f743NzdHXYaISFF55plndrp7w0TbFXVANDc3s2rVqqjLEBEpKmY2qV7xusQkIiJ5KSBERCQvBYSIiOSlgBARkbwUECIikpcCQkRE8lJAiIhIXiUZEM9s3s2ND72ERrIVETm0kgyIF7Z1cctjr7Otsz/qUkREZqySDIiWphQAa7Z0TLCliEjpKsmAOPmYWsoTMVZv2RN1KSIiM1ZJBkQyEeOMxjpWt+oMQkTkUEoy
IAAy6RTrtnYyODwadSkiIjNSyQZES1M9g8OjvPRGV9SliIjMSCUbEJl00FC9Wg3VIiJ5lWxALKqrYGFtuRqqRUQOoWQDwsxoaUqpoVpE5BBKNiAAMul6Nu/qZXfPYNSliIjMOKUdENkOc626zCQiMlZJB8QZS+qIx0wN1SIieZR0QFQlE5y0cI4CQkQkj5IOCAh+7vpcawejoxrZVUQklwIiXU/3wDCvt++NuhQRkRlFAaEOcyIieZV8QCydV01tRYLV+iWTiMgBSj4gYjGjJV2vMwgRkTFKPiAg6A/xypvd7B0YjroUEZEZQwFB0A4x6rC2TWcRIiJZBQ0IM/uMmb1gZs+b2Q/NrMLMlprZU2b2qpn92MyS4bbl4fxr4frmQtaWK3sLUl1mEhHZr2ABYWaNwLXAcnc/HYgDHwVuBG5292XAHuDq8ClXA3vc/QTg5nC7aZGqSnLc/GrWaOA+EZF9Cn2JKQFUmlkCqAK2AxcC94Xr7wIuC6cvDecJ119kZlbg+vZpSadYvaUDd3WYExGBAgaEu28FvgZsIQiGTuAZoMPds63BbUBjON0ItIbPHQ63n1eo+sbKpOvZuXeAtj1907VLEZEZrZCXmOoJzgqWAouBauB9eTbNfmXPd7Zw0Nd5M7vGzFaZ2ar29vapKjdnZFddZhIRgcJeYroY2Oju7e4+BPwUeBuQCi85ASwBtoXTbUATQLi+Dtg99kXd/VZ3X+7uyxsaGqas2JOOmUNFWUwN1SIioUIGxBbgXDOrCtsSLgJeBB4FPhxucyVwfzj9QDhPuP4Rn8YGgbJ4jDMbU+pRLSISKmQbxFMEjc3PAuvCfd0K/CVwvZm9RtDGcHv4lNuBeeHy64HPF6q2Q8mkU7ywtYuB4ZHp3rWIyIyTmHiTI+fufwP8zZjFG4AVebbtBy4vZD0TaWlKMTgyyvrt3fv6RoiIlCr1pM6RSdcDsHqLLjOJiCggchxTV8Giugo1VIuIoIA4SCathmoREVBAHKSlKUXr7j527h2IuhQRkUgpIMbItkOs0WUmESlxCogxTl9cRyJmuswkIiVPATFGZTLOKYtq1VAtIiVPAZFHS1OKtW2djIxqZFcRKV0KiDwy6RR7B4Z5bcfeqEsREYmMAiIPdZgTEVFA5NU8r4pUVZnaIUSkpCkg8jAzWppSujeEiJQ0BcQhZJrqeWVHN939Q1GXIiISCQXEIWTSKdxhbVtn1KWIiERCAXEIZ4XDfauhWkRKlQLiEOoqyzi+oVrtECJSshQQ48ik61m9pYNpvPOpiMiMoYAYRyadYlfPIK27+6IuRURk2ikgxpG97agG7hORUqSAGMdJC+dQWRZXhzkRKUkKiHEk4jHOXFLHajVUi0gJUkBMIJOu58VtnfQPjURdiojItFJATKClKcXQiPPCtq6oSxERmVYKiAlk0kFDtfpDiEipUUBMYGFtBY2pSvWoFpGSo4CYhJZ0Sr9kEpGSo4CYhExTiq0dfezo7o+6FBGRaaOAmIR97RA6ixCREqKAmITTFtdRFjf1hxCRkqKAmISKsjinLqpVQ7WIlBQFxCS1NKVY29bJyKhGdhWR0qCAmKRMup7ewRFeebM76lJERKaFAmKSsg3V+rmriJQKBcQkpedWMbc6qXYIESkZCohJMjNamlIackNESoYC4jBkmlK8umMvnX1DUZciIlJwCojDkEnXA7C2TWcRIjL7KSAOw5lNdZipoVpESkNBA8LMUmZ2n5m9ZGbrzeytZjbXzH5lZq+Gj/XhtmZm3zaz18xsrZmdXcjajkRtRRknNNSoHUJESkKhzyC+BTzk7icDZwHrgc8DD7v7MuDhcB7gfcCy8O8a4JYC13ZEMukUq7fswV0d5kRkditYQJhZLfAu4HYAdx909w7gUuCucLO7gMvC6UuBuz3wJJAys0WFqu9IZdL17OkdYvOu3qhLEREpqEKeQRwHtAN3mNlqM7vNzKqBhe6+HSB8XBBu3wi05jy/LVx2ADO7xsxW
mdmq9vb2Apaf374Oc63qDyEis1shAyIBnA3c4u4ZoIf9l5PysTzLDrqO4+63uvtyd1/e0NAwNZUehmUL5lCdjGvobxGZ9QoZEG1Am7s/Fc7fRxAYb2YvHYWPO3K2b8p5/hJgWwHrOyLxmHHmkpSG/haRWa9gAeHubwCtZnZSuOgi4EXgAeDKcNmVwP3h9APAFeGvmc4FOrOXomaaTDrFi9u66B8aiboUEZGCSRT49T8F3GNmSWADcBVBKN1rZlcDW4DLw20fBP4z8BrQG247I2XS9QyPOs9v7WR589yoyxERKYiCBoS7rwGW51l1UZ5tHfhkIeuZKi1N4S1IWzsUECIya6kn9RFomFPOkvpK9agWkVlNAXGEMul6Df0tIrOaAuIIZZpSbOvs582u/qhLEREpCAXEEWrRHeZEZJZTQByh0xbXkozH1KNaRGYtBcQRKk/EOXVxrc4gRGTWUkAchUw6xbq2ToZHRqMuRURkyikgjkJLU4q+oRFefrM76lJERKacAuIonB3eglSXmURkNlJAHIUl9ZXMr0kqIERkVlJAHAUzo6WpXr9kEpFZSQFxlDLpFBvae+jsHYq6FBGRKaWAOEqZ7MB9bbrMJCKziwLiKJ3ZlMIMjcskIrOOAuIo1ZQnOHHBHDVUi8iso4CYApl0ijWtHQS3tBARmR0UEFMgk07R2TfExp09UZciIjJlFBBTIKMOcyIyCykgpsDxDTXUlCfUH0JEZpVxA8LMjjWzupz5C8zsW2Z2vZklC19ecYjHjLOa6ljTqjMIEZk9JjqDuBeoBjCzFuCfgS3AWcDfF7a04pJpqmf99m76BkeiLkVEZEpMFBCV7r4tnP448D13/zpwFbCioJUVmUw6xcios25rZ9SliIhMiYkCwnKmLwQeBnB33QBhjJam7C1I1Q4hIrNDYoL1j5jZvcB2oB54BMDMFgGDBa6tqMyrKSc9t0rtECIya0x0BvEdYC2wCXiHu2dHpFsG3FHAuopSJp3ST11FZNaYKCBuBh5w95vdfWvO8l7gvYUrqzhlmlK80dXP9s6+qEsRETlqEwVEs7uvHbvQ3VcBzQWpqIi1hB3m1ugsQkRmgYkComKcdZVTWchscOqiWpKJGKvVDiEis8BEAbHSzP5k7EIzuxp4pjAlFa9kIsbpi2v1SyYRmRUm+hXTdcC/mNnH2B8Iy4Ek8LuFLKxYZdL1/ODJzQyNjFIW10gmIlK8xv0Ec/c33f1twN8S/JJpE/C37v5Wd3+j8OUVn5amFAPDo7z8RnfUpYiIHJWJziAAcPdHgUcLXMuskEnv7zB3emPdBFuLiMxcugYyxRpTlTTMKVd/CBEpegqIKWZmZJpS+iWTiBQ9BUQBtKRTbNzZw54ejUYiIsVLAVEAmaaww1ybziJEpHgpIArgzCV1xEy3IBWR4qaAKIDq8gQnHaMOcyJS3AoeEGYWN7PVZvazcH6pmT1lZq+a2Y+zty41s/Jw/rVwfXOhayuklqYUz7V2MDrqUZciInJEpuMM4tPA+pz5G4Gb3X0ZsAe4Olx+NbDH3U8gGEX2xmmorWAy6RRd/cNs2NkTdSkiIkekoAFhZkuA9wO3hfNGcGe6+8JN7gIuC6cvDecJ118Ubl+Uzk7rDnMiUtwKfQbxTeAGIHuL0nlAh7sPh/NtQGM43Qi0AoTrO8PtD2Bm15jZKjNb1d7eXsjaj8px82uYU5FQfwgRKVoFCwgz+wCww91zR33Nd0bgk1i3f4H7re6+3N2XNzQ0TEGlhRGLGS1NKd0bQkSKViHPIN4OfNDMNgE/Iri09E0gZWbZMaCWANvC6TagCSBcXwfsLmB9BZdpSvHSG130Dg5PvLGIyAxTsIBw9y+4+xJ3bwY+Cjzi7h8jGPTvw+FmVwL3h9MPhPOE6x9x96L+CVAmXc+ow9q2zqhLERE5bFH0g/hL4Hoze42gjeH2cPntwLxw+fXA5yOobUq1NGUbqnWZSUSKz6SG+z5a7v4Y8Fg4vQFYkWeb
fuDy6ahnutRXJ2meV8WaVv2SSUSKj3pSF1gmXc+zWzoo8qtlIlKCFBAFlkmnaO8eYFtnf9SliIgcFgVEgWVHdlWHOREpNgqIAjt50RzKEzH1hxCRoqOAKLCyeIwzGuvUo1pEio4CYhpk0inWbe1kcHh04o1FRGYIBcQ0aGmqZ3B4lPXbu6IuRURk0hQQ0yATjuy6RpeZRKSIKCCmwaK6ChbWluuXTCJSVBQQ08DMyDTVq6FaRIqKAmKatKRTbN7Vy669A1GXIiIyKQqIaZIJB+57rk1nESJSHBQQ0+SMJXXEY6aRXUWkaCggpklVMsHJx8xRQIhI0VBATKOWphTPtXYwOqqRXUVk5lNATKNMup7ugWFeb98bdSkiIhNSQEyjbIc5XWYSkWKggJhGS+dVU1dZxmrdYU5EioACYhrFYsZZTSmdQYhIUVBATLNMU4pX3uxm78Bw1KWIiIxLATHNMukUow5r1WFORGY4BcQ0a2lSQ7WIFAcFxDRLVSU5bn61hv4WkRlPARGBlnTQUO2uDnMiMnMpICKQSdezc+8AbXv6oi5FROSQFBARyI7sqvtDiMhMpoCIwMnHzKGiLMYaNVSLyAymgIhAIh7jzMaUelSLyIymgIhIJp3iha1dDAyPRF2KiEheCoiIZNIpBkdGeXFbV9SliIjkpYCISEtTPYD6Q4jIjKWAiMgxdRUsqqtQj2oRmbEUEBFasXQu//b8dr70wAu0dw9EXY6IyAESURdQyv7qA6dSlYzz/Sc3c++qVq5+x1L+5F3HUVtRFnVpIiJYMQ/3sHz5cl+1alXUZRy1De17+fqvXuHna7eTqirjz84/nive2kxFWTzq0kRkFjKzZ9x9+YTbKSBmjue3dvLVX7zM46+0c0xtBZ++eBmXv2UJibiuBIrI1JlsQOiTZwY5vbGOuz6xgh9dcy6LUxV84afrePfNv+Zna7cxOlq8QS4ixUkBMQOde9w8fvKnb+O7VywnGY/x5/+0mku+8+889vIOjQArItOmYAFhZk1m9qiZrTezF8zs0+HyuWb2KzN7NXysD5ebmX3bzF4zs7VmdnahaisGZsa7T13Ig59+Jzd/5Cw6+4b4oztW8pFbn+SZzbujLk9ESkAhzyCGgc+6+ynAucAnzexU4PPAw+6+DHg4nAd4H7As/LsGuKWAtRWNeMz43cwSHvns+Xz50tPY0N7Dh275LX9810peekO9sEWkcAoWEO6+3d2fDae7gfVAI3ApcFe42V3AZeH0pcDdHngSSJnZokLVV2ySiRhXvLWZX99wPn/xnpN4auNu3vetJ/jMj9ewZVdv1OWJyCw0LW0QZtYMZICngIXuvh2CEAEWhJs1Aq05T2sLl419rWvMbJWZrWpvby9k2TNSVTLBJy84gSduuIBr3nUcD67bzkXfeIy/vv95dnT3R12eiMwiBQ8IM6sBfgJc5+7jXROxPMsOapF191vdfbm7L29oaJiqMotOqirJF953Co//xQVcvryJe57awnk3PcZND71EZ99Q1OWJyCxQ0IAwszKCcLjH3X8aLn4ze+kofNwRLm8DmnKevgTYVsj6ZoNj6ir4u989g4evP493n7qQv3/sdd5106Pc8tjr9A1qKHEROXKF/BWTAbcD6939GzmrHgCuDKevBO7PWX5F+Gumc4HO7KUomVjz/Gq+/QcZfn7tOzg7neLGh17ivK8+yg+e3MzQyGjU5YlIESpYT2ozewfwBLAOyH5CfZGgHeJeIA1sAS53991hoHwHeC/QC1zl7uN2k55tPamn0tMbd3PTQy+xavMejp1XxfXvPpFLzlxMLJbvSp6IlBINtSG4O4++vIObHnqZl97o5pRFtdzwnpM4/6QGgjwWkVKkoTYEM+PCkxfy4LXv5FsfbaFnYJir7lzJ7//jb1m5SZ3tRGR8CogSEIsZl7Y08v+uP4//cdnpbNrVy+X/8FuuuuNp3fJURA5Jl5hKUN/gCHf+xyZueew1uvqH+eBZi/nEO5Zy
+uJajRwrUgLUBiET6uwd4h9//Trf+81G+odGqUrGOTtdz/LmelY0zyWTrqcyqXtSiMw2CgiZtN09g/zH6ztZuXE3T2/aw0tvdOEOiZhxemMdK5bO5ZzmuZzTXE+qKhl1uSJylBQQcsQ6+4Z4dvMent60m5Ubd7O2rZPBsC/FiQtrOKd57r7QWJyqjLhaETlcCgiZMv1DIzzX2sHKTcEZxrOb97B3YBiAxlQlK5bO3XdZ6oQFNfoJrcgMN9mASExHMVLcKsri/M5x8/id4+YBMDLqrN/excpNu1m5aTdPvLqTf1m9FYD6qjKWN89lRfNczlk6l9MW11Kmhm+RoqQzCDlq7s6mXb1hG0YQGpvDIcgry+KcfWwquCzVPJeWdIqqpL6XiERJl5gkUm929QdnGHkavk9rrGNFc33Y8D2X+mo1fItMJwWEzCi5Dd+rNu3mudb9Dd/LFtRwztLgDOMtx9azpL5S7RgiBaSAkBmtf2iEtW2dQcP3xt08k9PwPa86SUtTipamFJl0PWc21VFbURZxxSKzhwJCikq24Xt1awdrtnSwpnUPr7f3AGAGxzfU5IRGipMWzlGvb5EjpICQotfZN8RzrR2syfnb3TMIBI3fZzTW0ZJOkWlK0ZJOsahOfTJEJkMBIbOOu9O6u4/VrXtYvSUIjBe3de1ry1hYW77vslRLU4ozGuuoLtcvpkTGUj8ImXXMjPS8KtLzqri0pRGAgeERXtzWdcBZxi9eeBOAmMGJC+eQSdfvO8s4oaFGN00SmSQFhBS18kQ8CIB0/b5lu/YO8Fxb0JaxurWDn6/dxg+f3gJATXmCM5fUkUmnaGkKzjQa5pRHVb7IjKaAkFlnXk05F568kAtPXgjA6KizYWdPeIaxhzWtHfzD4xsYGQ0urzamKsPACBrAT1tcR0WZRrEVUUDIrBeLGScsqOGEBTV8+C1LgOCeGM9v6wx/MdXB6i0d/GztdgDiMWN+TZIFcypomFPOgjnlBzw2zKnYN60gkdlMASElqTIZ39eTO2tHVz+rWzt4YWsnb3T1s6N7gDe7+lm3tZNdewcYzfN7jjkViZzwqDgoTLIhU19Vps5/UnQUECKhBbUVvOe0Y3jPaccctG5k1NnVM0B79wA7uoPH7N+O7n7auwdY29bBjq4B+oZGDnp+WdyYX5N7FhIEytgzlPk1OiuRmUMBITIJ8ZixYE4FC+ZUcNoE2+4dGD4oPHZ07w+XrR39rGntYFfPIPl+ZV5XWRYESE05dZVl1FYmqK0oC6fLDrmsPBHTWYpMKQWEyBSrKU9QU55g6fzqcbcbHhllV8/ggUHSNUD73uBx594BNuzcS1ffMJ19Q3nPTHIl47EgOCrLqK3ICZOKxL4g2R8q4bKK7PKEeqbLQRQQIhFJxGMsrK1gYW0FUDfh9oPDo3T1D9HVN0RXfxAaXX1DwWP/0L4gyW7T2TtI6+7efdsN52tEyVGdjB8QJLU5ZypVyTjV5YngMZmgMhmnujxOVTJBdTJBVXmcqmR2Pq6wmSUUECJFIpmIMb8maKc4XO5O7+DIgUFyiHDJrtva0cf67cGy3sGRfT8Lnmyt1WFgVCXjVJUnDpjfHy7ButxwyW5bGYZRVXkYSmVxdXKcZgoIkRJgZlSXJ6guT7Bo4pOVg7g7gyOj9A6M0DM4TN/gCD2DI/QODAePg8P0Do7QMxA+ZrcZCNb1DI7QNzjMto6+nPlgu8MZ7SeZiFFZFg/+knHKEzEqk8F8Rdn+x4qy2L5tKnLWVSZjVCTiVIx5TmVZnIpkbN+87oIYUECIyITMjPJEnPJEfEpv8OTu9A+N0jM4TO/ACL1Dw/tDZWCEvnC+Z2CYvqER+odG6R8KwiWY3//Y3R9s0zcYzGfXHcaJzz6JmO0PljBUsmFTnoiRjMdIJsK/fNO5yw61fJLTUf7wQAEhIpExs+AMIBmHmql//eyZT//gKP3D+YOlb3B0zPxIuO0ofUMjDITrsuHT3T/M0Mgog8OjDGYf
w7+BcH4qlcUtb+Bcd/GJXHLW4ind11gKCBGZtXLPfOqYnptOuTvDo74/OMLQGBgezRssA+H80NjAGQnX5T4vZ5tUVeGPRwEhIjKFzIyyuFEWj1Fd5ONAqiVGRETyUkCIiEheCggREclLASEiInkpIEREJC8FhIiI5KWAEBGRvBQQIiKSl/nhjJQ1w5hZO7A56jqO0nxgZ9RFzCB6P/bTe3EgvR8HOpr341h3b5hoo6IOiNnAzFa5+/Ko65gp9H7sp/fiQHo/DjQd74cuMYmISF4KCBERyUsBEb1boy5ghtH7sZ/eiwPp/ThQwd8PtUGIiEheOoMQEZG8FBAiIpKXAiIiZtZkZo+a2Xoze8HMPh11TVEzs7iZrTazn0VdS9TMLGVm95nZS+G/kbdGXVOUzOwz4f8nz5vZD82sIuqapouZfc/MdpjZ8znL5prZr8zs1fCxvhD7VkBEZxj4rLufApwLfNLMTo24pqh9GlgfdREzxLeAh9z9ZOAsSvh9MbNG4FpgubufDsSBj0Zb1bS6E3jvmGWfBx5292XAw+H8lFNARMTdt7v7s+F0N8EHQGO0VUXHzJYA7wdui7qWqJlZLfAu4HYAdx90945oq4pcAqg0swRQBWyLuJ5p4+6/BnaPWXwpcFc4fRdwWSH2rYCYAcysGcgAT0VbSaS+CdwAjEZdyAxwHNAO3BFecrvNzKqjLioq7r4V+BqwBdgOdLr7L6OtKnIL3X07BF82gQWF2IkCImJmVgP8BLjO3buiricKZvYBYIe7PxN1LTNEAjgbuMXdM0APBbqEUAzC6+uXAkuBxUC1mX082qpKgwIiQmZWRhAO97j7T6OuJ0JvBz5oZpuAHwEXmtkPoi0pUm1Am7tnzyjvIwiMUnUxsNHd2919CPgp8LaIa4ram2a2CCB83FGInSggImJmRnCNeb27fyPqeqLk7l9w9yXu3kzQ+PiIu5fsN0R3fwNoNbOTwkUXAS9GWFLUtgDnmllV+P/NRZRwo33oAeDKcPpK4P5C7CRRiBeVSXk78F+AdWa2Jlz2RXd/MMKaZOb4FHCPmSWBDcBVEdcTGXd/yszuA54l+PXfakpo2A0z+yFwPjDfzNqAvwG+AtxrZlcTBOjlBdm3htoQEZF8dIlDO4ukAAADpElEQVRJRETyUkCIiEheCggREclLASEiInkpIEREJC8FhMxoZuZm9vWc+c+Z2Zem6LXvNLMPT8VrTbCfy8MRWR8tZF1m1mxmf3j4FYrkp4CQmW4A+D0zmx91IbnMLH4Ym18N/Jm7X1CoekLNwGEFxGEeh5QYBYTMdMMEnaI+M3bF2G/aZrY3fDzfzB43s3vN7BUz+4qZfczMnjazdWZ2fM7LXGxmT4TbfSB8ftzMvmpmK81srZn915zXfdTM/glYl6eePwhf/3kzuzFc9tfAO4B/MLOv5nnODeFznjOzr+RZvykbjma23MweC6fPM7M14d9qM5tD0HnqneGyz0z2OMys2sx+HtbwvJl9ZDL/YWT2U09qKQb/B1hrZjcdxnPOAk4hGCZ5A3Cbu68Ib8z0KeC6cLtm4DzgeOBRMzsBuIJgxNBzzKwc+I2ZZUcPXQGc7u4bc3dmZouBG4G3AHuAX5rZZe7+ZTO7EPicu68a85z3EQzT/Dvu3mtmcw/j+D4HfNLdfxMO+NhPMKDf59w9G3TXTOY4zOxDwDZ3f3/4vLrDqENmMZ1ByIwXjnJ7N8FNYyZrZXjPjQHgdSD7wbiOIBSy7nX3UXd/lSBITgb+E3BFOATKU8A8YFm4/dNjwyF0DvBYOKDcMHAPwT0dxnMxcIe794bHOXbM//H8BviGmV0LpMJ9jjXZ41hHcCZ1o5m90907D6MOmcUUEFIsvklwLT/3vgjDhP+Gw0HckjnrBnKmR3PmRznwzHnsWDMOGPApd28J/5bm3H+g5xD12WQPZMxzJhrrZt8xAvtus+nuXwH+GKgEnjSzkw/x+hMeh7u/QnDmsw74X+FlMREFhBSH
8Nv1vQQhkbWJ4IMNgvsFlB3BS19uZrGwXeI44GXgF8CfhsOxY2YnTuKGPU8B55nZ/LDh9w+Axyd4zi+BT5hZVbiffJeYNrH/GD+UXWhmx7v7One/EVhFcObTDczJee6kjiO8PNbr7j8guDFPKQ8tLjnUBiHF5OvAn+fMfxe438yeJrgv76G+3Y/nZYIP8oXAf3P3fjO7jeAy1LPhmUk7E9zS0d23m9kXgEcJvrk/6O7jDsHs7g+ZWQuwyswGgQeBL47Z7G+B283sixx4x8HrzOwCYIRgKPB/Izg7Gjaz5wjuY/ytSR7HGcBXzWwUGAL+dLy6pXRoNFcREclLl5hERCQvBYSIiOSlgBARkbwUECIikpcCQkRE8lJAiIhIXgoIERHJ6/8DlzgZ81SRpKkAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "# Record the inertia (within-cluster sum of squared distances) for k = 1..10\n",
    "cs = []\n",
    "for i in range(1, 11):\n",
    "    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)\n",
    "    kmeans.fit(X)\n",
    "    cs.append(kmeans.inertia_)\n",
    "\n",
    "# Plot inertia against k; the elbow is where the curve stops dropping sharply\n",
    "plt.plot(range(1, 11), cs)\n",
    "plt.title('The Elbow Method')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('CS')\n",
    "plt.show()\n"
   ]
  },
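  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The elbow is usually read off the plot by eye. As a rough cross-check, the sketch below picks the k at which the inertia curve bends most sharply, measured by the largest second difference. The helper `elbow_k` and the inertia values are hypothetical and only for illustration; they are not the results of this notebook. In practice, the list `cs` built above could be passed in instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical inertia values for k = 1..6 (illustration only)\n",
    "inertias = [1000.0, 400.0, 300.0, 250.0, 220.0, 200.0]\n",
    "\n",
    "\n",
    "def elbow_k(values, k_start=1):\n",
    "    # Return the k with the largest second difference\n",
    "    # values[i-1] - 2*values[i] + values[i+1], i.e. the sharpest bend.\n",
    "    if len(values) < 3:\n",
    "        raise ValueError('need at least three inertia values')\n",
    "    best_i, best_bend = None, float('-inf')\n",
    "    for i in range(1, len(values) - 1):\n",
    "        bend = values[i - 1] - 2 * values[i] + values[i + 1]\n",
    "        if bend > best_bend:\n",
    "            best_i, best_bend = i, bend\n",
    "    # values[0] corresponds to k_start, so shift the index accordingly\n",
    "    return k_start + best_i\n",
    "\n",
    "\n",
    "best_k = elbow_k(inertias)\n",
    "print('Elbow at k =', best_k)"
   ]
  },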
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The plot above shows a kink (elbow) at k=2.\n",
    "\n",
    "- Hence, k=2 can be considered a good number of clusters for this data.\n",
    "\n",
    "- However, as seen earlier, I achieved a weak classification accuracy of only 1% with k=2.\n",
    "\n",
    "- I will repeat the code for k=2 below for convenience."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result: 63 out of 7050 samples were correctly labeled.\n",
      "Accuracy score: 0.01\n"
     ]
    }
   ],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "kmeans = KMeans(n_clusters=2, random_state=0)\n",
    "\n",
    "kmeans.fit(X)\n",
    "\n",
    "labels = kmeans.labels_\n",
    "\n",
    "# check how many of the samples were correctly labeled\n",
    "\n",
    "correct_labels = sum(y == labels)\n",
    "\n",
    "print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n",
    "\n",
    "print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, with k=2, our unsupervised model achieved a very weak classification accuracy of 1%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, I will check the model accuracy with different numbers of clusters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 17. K-Means model with different clusters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Means model with 3 clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result: 138 out of 7050 samples were correctly labeled.\n",
      "Accuracy score: 0.02\n"
     ]
    }
   ],
   "source": [
    "kmeans = KMeans(n_clusters=3, random_state=0)\n",
    "\n",
    "kmeans.fit(X)\n",
    "\n",
    "# check how many of the samples were correctly labeled\n",
    "labels = kmeans.labels_\n",
    "\n",
    "correct_labels = sum(y == labels)\n",
    "print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n",
    "print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Means model with 4 clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result: 4340 out of 7050 samples were correctly labeled.\n",
      "Accuracy score: 0.62\n"
     ]
    }
   ],
   "source": [
    "kmeans = KMeans(n_clusters=4, random_state=0)\n",
    "\n",
    "kmeans.fit(X)\n",
    "\n",
    "# check how many of the samples were correctly labeled\n",
    "labels = kmeans.labels_\n",
    "\n",
    "correct_labels = sum(y == labels)\n",
    "print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n",
    "print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have achieved a relatively high accuracy of 62% with k=4."
   ]
  },
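  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "K-Means numbers its clusters arbitrarily, so comparing `labels` directly with `y`, as done above, can understate how well the clusters line up with the classes. The sketch below shows one label-invariant check: it tries every one-to-one mapping from cluster ids to class labels and reports the best accuracy. It assumes there are no more clusters than classes and that k is small. The helper `best_match_accuracy` and the toy data are hypothetical and only for illustration; they are not part of scikit-learn or of this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import permutations\n",
    "\n",
    "\n",
    "def best_match_accuracy(y_true, cluster_ids):\n",
    "    # Accuracy after relabelling clusters with the best one-to-one mapping to classes\n",
    "    classes = sorted(set(y_true))\n",
    "    clusters = sorted(set(cluster_ids))\n",
    "    if len(clusters) > len(classes):\n",
    "        raise ValueError('this sketch assumes no more clusters than classes')\n",
    "    best_hits = 0\n",
    "    # Try every assignment of distinct classes to the clusters (fine for small k)\n",
    "    for perm in permutations(classes, len(clusters)):\n",
    "        mapping = dict(zip(clusters, perm))\n",
    "        hits = sum(1 for t, c in zip(y_true, cluster_ids) if mapping[c] == t)\n",
    "        best_hits = max(best_hits, hits)\n",
    "    return best_hits / len(y_true)\n",
    "\n",
    "\n",
    "# Toy data (hypothetical): the grouping is perfect, but the cluster ids are permuted\n",
    "y_toy = [0, 0, 1, 1, 2, 2]\n",
    "labels_toy = [2, 2, 0, 0, 1, 1]\n",
    "\n",
    "raw_accuracy = sum(t == c for t, c in zip(y_toy, labels_toy)) / len(y_toy)\n",
    "print('Raw accuracy:', raw_accuracy)\n",
    "print('Best-match accuracy:', best_match_accuracy(y_toy, labels_toy))"
   ]
  },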
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 18. Results and conclusion\n",
    "\n",
    "\n",
    "1.\tIn this project, I have implemented the most popular unsupervised clustering technique, **K-Means Clustering**.\n",
    "\n",
    "2.\tI applied the elbow method and found that k=2 (k is the number of clusters) can be considered a good number of clusters for this data.\n",
    "\n",
    "3.\tI found that the model has a very high inertia of 237.7572, so it is not a good fit to the data.\n",
    "\n",
    "4.\tI achieved a weak classification accuracy of 1% with k=2 using our unsupervised model.\n",
    "\n",
    "5.\tSo, I changed the value of k and found a relatively higher classification accuracy of 62% with k=4.\n",
    "\n",
    "6.\tHence, of the values tried (k=2, 3 and 4), k=4 gives the highest classification accuracy and can be taken as the optimal number of clusters.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}