abhishekomi/data-wrangling.ipynb

## data-wrangling.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Module%202/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\"  />\n",
    "</center>\n",
    "\n",
    "# Data Wrangling\n",
    "\n",
    "Estimated time needed: **30** minutes\n",
    "\n",
    "## Objectives\n",
    "\n",
    "After completing this lab you will be able to:\n",
    "\n",
    "*   Handle missing values\n",
    "*   Correct data format\n",
    "*   Standardize and normalize data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Table of Contents</h2>\n",
    "\n",
    "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
    "<ul>\n",
    "    <li><a href=\"#identify_handle_missing_values\">Identify and handle missing values</a>\n",
    "        <ul>\n",
    "            <li><a href=\"#identify_missing_values\">Identify missing values</a></li>\n",
    "            <li><a href=\"#deal_missing_values\">Deal with missing values</a></li>\n",
    "            <li><a href=\"#correct_data_format\">Correct data format</a></li>\n",
    "        </ul>\n",
    "    </li>\n",
    "    <li><a href=\"#data_standardization\">Data standardization</a></li>\n",
    "    <li><a href=\"#data_normalization\">Data normalization (centering/scaling)</a></li>\n",
    "    <li><a href=\"#binning\">Binning</a></li>\n",
    "    <li><a href=\"#indicator\">Indicator variable</a></li>\n",
    "</ul>\n",
    "\n",
    "</div>\n",
    "\n",
    "<hr>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>What is the purpose of data wrangling?</h2>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>What is the fuel consumption (L/100k) rate for the diesel car?</h3>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Import data</h3>\n",
    "<p>\n",
    "You can find the \"Automobile Dataset\" from the following link: <a href=\"https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01\">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a>. \n",
    "We will be using this dataset throughout this course.\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Import pandas</h4> \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pylab as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Reading the dataset from the URL and adding the related headers</h2>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we assign the URL of the dataset to \"filename\".\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset was hosted on IBM Cloud object. Click <a href=\"https://cocl.us/corsera_da0101en_notebook_bottom?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01\">HERE</a> for free storage.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we create a Python list <b>headers</b> containing name of headers.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "headers = [\"symboling\",\"normalized-losses\",\"make\",\"fuel-type\",\"aspiration\", \"num-of-doors\",\"body-style\",\n",
    "         \"drive-wheels\",\"engine-location\",\"wheel-base\", \"length\",\"width\",\"height\",\"curb-weight\",\"engine-type\",\n",
    "         \"num-of-cylinders\", \"engine-size\",\"fuel-system\",\"bore\",\"stroke\",\"compression-ratio\",\"horsepower\",\n",
    "         \"peak-rpm\",\"city-mpg\",\"highway-mpg\",\"price\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the Pandas method <b>read_csv()</b> to load the data from the web address. Set the parameter  \"names\" equal to the Python list \"headers\".\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(filename, names = headers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the method <b>head()</b> to display the first five rows of the dataframe.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>engine-size</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>?</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>13495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>?</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>16500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>?</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>...</td>\n",
       "      <td>152</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "      <td>16500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>...</td>\n",
       "      <td>109</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500</td>\n",
       "      <td>24</td>\n",
       "      <td>30</td>\n",
       "      <td>13950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>...</td>\n",
       "      <td>136</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>17450</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling normalized-losses         make fuel-type aspiration num-of-doors  \\\n",
       "0          3                 ?  alfa-romero       gas        std          two   \n",
       "1          3                 ?  alfa-romero       gas        std          two   \n",
       "2          1                 ?  alfa-romero       gas        std          two   \n",
       "3          2               164         audi       gas        std         four   \n",
       "4          2               164         audi       gas        std         four   \n",
       "\n",
       "    body-style drive-wheels engine-location  wheel-base  ...  engine-size  \\\n",
       "0  convertible          rwd           front        88.6  ...          130   \n",
       "1  convertible          rwd           front        88.6  ...          130   \n",
       "2    hatchback          rwd           front        94.5  ...          152   \n",
       "3        sedan          fwd           front        99.8  ...          109   \n",
       "4        sedan          4wd           front        99.4  ...          136   \n",
       "\n",
       "   fuel-system  bore  stroke compression-ratio horsepower  peak-rpm city-mpg  \\\n",
       "0         mpfi  3.47    2.68               9.0        111      5000       21   \n",
       "1         mpfi  3.47    2.68               9.0        111      5000       21   \n",
       "2         mpfi  2.68    3.47               9.0        154      5000       19   \n",
       "3         mpfi  3.19    3.40              10.0        102      5500       24   \n",
       "4         mpfi  3.19    3.40               8.0        115      5500       18   \n",
       "\n",
       "  highway-mpg  price  \n",
       "0          27  13495  \n",
       "1          27  16500  \n",
       "2          26  16500  \n",
       "3          30  13950  \n",
       "4          22  17450  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# To see what the data set looks like, we'll use the head() method.\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis.\n",
    "\n",
    "<div>So, how do we identify all those missing values and deal with them?</div> \n",
    "\n",
    "<b>How to work with missing data?</b>\n",
    "\n",
    "Steps for working with missing data:\n",
    "\n",
    "<ol>\n",
    "    <li>Identify missing data</li>\n",
    "    <li>Deal with missing data</li>\n",
    "    <li>Correct data format</li>\n",
    "</ol>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id=\"identify_handle_missing_values\">Identify and handle missing values</h2>\n",
    "\n",
    "<h3 id=\"identify_missing_values\">Identify missing values</h3>\n",
    "<h4>Convert \"?\" to NaN</h4>\n",
    "In the car dataset, missing data comes with the question mark \"?\".\n",
    "We replace \"?\" with NaN (Not a Number), Python's default missing value marker for reasons of computational speed and convenience. Here we use the function: \n",
    " <pre>.replace(A, B, inplace = True) </pre>\n",
    "to replace A by B.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>engine-size</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>13495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>16500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>...</td>\n",
       "      <td>152</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "      <td>16500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>...</td>\n",
       "      <td>109</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500</td>\n",
       "      <td>24</td>\n",
       "      <td>30</td>\n",
       "      <td>13950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>...</td>\n",
       "      <td>136</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>17450</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling normalized-losses         make fuel-type aspiration num-of-doors  \\\n",
       "0          3               NaN  alfa-romero       gas        std          two   \n",
       "1          3               NaN  alfa-romero       gas        std          two   \n",
       "2          1               NaN  alfa-romero       gas        std          two   \n",
       "3          2               164         audi       gas        std         four   \n",
       "4          2               164         audi       gas        std         four   \n",
       "\n",
       "    body-style drive-wheels engine-location  wheel-base  ...  engine-size  \\\n",
       "0  convertible          rwd           front        88.6  ...          130   \n",
       "1  convertible          rwd           front        88.6  ...          130   \n",
       "2    hatchback          rwd           front        94.5  ...          152   \n",
       "3        sedan          fwd           front        99.8  ...          109   \n",
       "4        sedan          4wd           front        99.4  ...          136   \n",
       "\n",
       "   fuel-system  bore  stroke compression-ratio horsepower  peak-rpm city-mpg  \\\n",
       "0         mpfi  3.47    2.68               9.0        111      5000       21   \n",
       "1         mpfi  3.47    2.68               9.0        111      5000       21   \n",
       "2         mpfi  2.68    3.47               9.0        154      5000       19   \n",
       "3         mpfi  3.19    3.40              10.0        102      5500       24   \n",
       "4         mpfi  3.19    3.40               8.0        115      5500       18   \n",
       "\n",
       "  highway-mpg  price  \n",
       "0          27  13495  \n",
       "1          27  16500  \n",
       "2          26  16500  \n",
       "3          30  13950  \n",
       "4          22  17450  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# replace \"?\" to NaN\n",
    "df.replace(\"?\", np.nan, inplace = True)\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Evaluating for Missing Data</h4>\n",
    "\n",
    "The missing values are converted by default. We use the following functions to identify these missing values. There are two methods to detect missing data:\n",
    "\n",
    "<ol>\n",
    "    <li><b>.isnull()</b></li>\n",
    "    <li><b>.notnull()</b></li>\n",
    "</ol>\n",
    "The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>engine-size</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized-losses   make  fuel-type  aspiration  num-of-doors  \\\n",
       "0      False               True  False      False       False         False   \n",
       "1      False               True  False      False       False         False   \n",
       "2      False               True  False      False       False         False   \n",
       "3      False              False  False      False       False         False   \n",
       "4      False              False  False      False       False         False   \n",
       "\n",
       "   body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  \\\n",
       "0       False         False            False       False  ...        False   \n",
       "1       False         False            False       False  ...        False   \n",
       "2       False         False            False       False  ...        False   \n",
       "3       False         False            False       False  ...        False   \n",
       "4       False         False            False       False  ...        False   \n",
       "\n",
       "   fuel-system   bore  stroke  compression-ratio  horsepower  peak-rpm  \\\n",
       "0        False  False   False              False       False     False   \n",
       "1        False  False   False              False       False     False   \n",
       "2        False  False   False              False       False     False   \n",
       "3        False  False   False              False       False     False   \n",
       "4        False  False   False              False       False     False   \n",
       "\n",
       "   city-mpg  highway-mpg  price  \n",
       "0     False        False  False  \n",
       "1     False        False  False  \n",
       "2     False        False  False  \n",
       "3     False        False  False  \n",
       "4     False        False  False  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_data = df.isnull()\n",
    "missing_data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"True\" means the value is a missing value while \"False\" means the value is not a missing value.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Count missing values in each column</h4>\n",
    "<p>\n",
    "Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, \"True\" represents a missing value and \"False\" means the value is present in the dataset.  In the body of the for loop the method \".value_counts()\" counts the number of \"True\" values. \n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "symboling\n",
      "False    205\n",
      "Name: symboling, dtype: int64\n",
      "\n",
      "normalized-losses\n",
      "False    164\n",
      "True      41\n",
      "Name: normalized-losses, dtype: int64\n",
      "\n",
      "make\n",
      "False    205\n",
      "Name: make, dtype: int64\n",
      "\n",
      "fuel-type\n",
      "False    205\n",
      "Name: fuel-type, dtype: int64\n",
      "\n",
      "aspiration\n",
      "False    205\n",
      "Name: aspiration, dtype: int64\n",
      "\n",
      "num-of-doors\n",
      "False    203\n",
      "True       2\n",
      "Name: num-of-doors, dtype: int64\n",
      "\n",
      "body-style\n",
      "False    205\n",
      "Name: body-style, dtype: int64\n",
      "\n",
      "drive-wheels\n",
      "False    205\n",
      "Name: drive-wheels, dtype: int64\n",
      "\n",
      "engine-location\n",
      "False    205\n",
      "Name: engine-location, dtype: int64\n",
      "\n",
      "wheel-base\n",
      "False    205\n",
      "Name: wheel-base, dtype: int64\n",
      "\n",
      "length\n",
      "False    205\n",
      "Name: length, dtype: int64\n",
      "\n",
      "width\n",
      "False    205\n",
      "Name: width, dtype: int64\n",
      "\n",
      "height\n",
      "False    205\n",
      "Name: height, dtype: int64\n",
      "\n",
      "curb-weight\n",
      "False    205\n",
      "Name: curb-weight, dtype: int64\n",
      "\n",
      "engine-type\n",
      "False    205\n",
      "Name: engine-type, dtype: int64\n",
      "\n",
      "num-of-cylinders\n",
      "False    205\n",
      "Name: num-of-cylinders, dtype: int64\n",
      "\n",
      "engine-size\n",
      "False    205\n",
      "Name: engine-size, dtype: int64\n",
      "\n",
      "fuel-system\n",
      "False    205\n",
      "Name: fuel-system, dtype: int64\n",
      "\n",
      "bore\n",
      "False    201\n",
      "True       4\n",
      "Name: bore, dtype: int64\n",
      "\n",
      "stroke\n",
      "False    201\n",
      "True       4\n",
      "Name: stroke, dtype: int64\n",
      "\n",
      "compression-ratio\n",
      "False    205\n",
      "Name: compression-ratio, dtype: int64\n",
      "\n",
      "horsepower\n",
      "False    203\n",
      "True       2\n",
      "Name: horsepower, dtype: int64\n",
      "\n",
      "peak-rpm\n",
      "False    203\n",
      "True       2\n",
      "Name: peak-rpm, dtype: int64\n",
      "\n",
      "city-mpg\n",
      "False    205\n",
      "Name: city-mpg, dtype: int64\n",
      "\n",
      "highway-mpg\n",
      "False    205\n",
      "Name: highway-mpg, dtype: int64\n",
      "\n",
      "price\n",
      "False    201\n",
      "True       4\n",
      "Name: price, dtype: int64\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for column in missing_data.columns.values.tolist():\n",
    "    print(column)\n",
    "    print (missing_data[column].value_counts())\n",
    "    print(\"\")    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the summary above, each column has 205 rows of data and seven of the columns containing missing data:\n",
    "\n",
    "<ol>\n",
    "    <li>\"normalized-losses\": 41 missing data</li>\n",
    "    <li>\"num-of-doors\": 2 missing data</li>\n",
    "    <li>\"bore\": 4 missing data</li>\n",
    "    <li>\"stroke\" : 4 missing data</li>\n",
    "    <li>\"horsepower\": 2 missing data</li>\n",
    "    <li>\"peak-rpm\": 2 missing data</li>\n",
    "    <li>\"price\": 4 missing data</li>\n",
    "</ol>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 id=\"deal_missing_values\">Deal with missing data</h3>\n",
    "<b>How to deal with missing data?</b>\n",
    "\n",
    "<ol>\n",
    "    <li>Drop data<br>\n",
    "        a. Drop the whole row<br>\n",
    "        b. Drop the whole column\n",
    "    </li>\n",
    "    <li>Replace data<br>\n",
    "        a. Replace it by mean<br>\n",
    "        b. Replace it by frequency<br>\n",
    "        c. Replace it based on other functions\n",
    "    </li>\n",
    "</ol>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.\n",
    "We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:\n",
    "\n",
    "<b>Replace by mean:</b>\n",
    "\n",
    "<ul>\n",
    "    <li>\"normalized-losses\": 41 missing data, replace them with mean</li>\n",
    "    <li>\"stroke\": 4 missing data, replace them with mean</li>\n",
    "    <li>\"bore\": 4 missing data, replace them with mean</li>\n",
    "    <li>\"horsepower\": 2 missing data, replace them with mean</li>\n",
    "    <li>\"peak-rpm\": 2 missing data, replace them with mean</li>\n",
    "</ul>\n",
    "\n",
    "<b>Replace by frequency:</b>\n",
    "\n",
    "<ul>\n",
    "    <li>\"num-of-doors\": 2 missing data, replace them with \"four\". \n",
    "        <ul>\n",
    "            <li>Reason: 84% sedans is four doors. Since four doors is most frequent, it is most likely to occur</li>\n",
    "        </ul>\n",
    "    </li>\n",
    "</ul>\n",
    "\n",
    "<b>Drop the whole row:</b>\n",
    "\n",
    "<ul>\n",
    "    <li>\"price\": 4 missing data, simply delete the whole row\n",
    "        <ul>\n",
    "            <li>Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row now without price data is not useful to us</li>\n",
    "        </ul>\n",
    "    </li>\n",
    "</ul>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Calculate the mean value for the \"normalized-losses\" column </h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average of normalized-losses: 122.0\n"
     ]
    }
   ],
   "source": [
    "avg_norm_loss = df[\"normalized-losses\"].astype(\"float\").mean(axis=0)\n",
    "print(\"Average of normalized-losses:\", avg_norm_loss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Replace \"NaN\" with mean value in \"normalized-losses\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"normalized-losses\"].replace(np.nan, avg_norm_loss, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Calculate the mean value for the \"bore\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average of bore: 3.3297512437810943\n"
     ]
    }
   ],
   "source": [
    "avg_bore=df['bore'].astype('float').mean(axis=0)\n",
    "print(\"Average of bore:\", avg_bore)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Replace \"NaN\" with the mean value in the \"bore\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"bore\"].replace(np.nan, avg_bore, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
    "<h1> Question  #1: </h1>\n",
    "\n",
    "<b>Based on the example above, replace NaN in \"stroke\" column with the mean value.</b>\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average of stroke: 3.255422885572139\n"
     ]
    }
   ],
   "source": [
    "# Write your code below and press Shift+Enter to execute \n",
    "avg_stroke=df['stroke'].astype('float').mean(axis=0)\n",
    "print(\"Average of stroke:\", avg_stroke)\n",
    "df[\"stroke\"].replace(np.nan, avg_stroke, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>Click here for the solution</summary>\n",
    "\n",
    "```python\n",
    "#Calculate the mean vaule for \"stroke\" column\n",
    "avg_stroke = df[\"stroke\"].astype(\"float\").mean(axis = 0)\n",
    "print(\"Average of stroke:\", avg_stroke)\n",
    "\n",
    "# replace NaN by mean value in \"stroke\" column\n",
    "df[\"stroke\"].replace(np.nan, avg_stroke, inplace = True)\n",
    "```\n",
    "\n",
    "</details>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Calculate the mean value for the \"horsepower\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average horsepower: 104.25615763546799\n"
     ]
    }
   ],
   "source": [
    "avg_horsepower = df['horsepower'].astype('float').mean(axis=0)\n",
    "print(\"Average horsepower:\", avg_horsepower)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Replace \"NaN\" with the mean value in the \"horsepower\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Calculate the mean value for \"peak-rpm\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average peak rpm: 5125.369458128079\n"
     ]
    }
   ],
   "source": [
    "avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)\n",
    "print(\"Average peak rpm:\", avg_peakrpm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Replace \"NaN\" with the mean value in the \"peak-rpm\" column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see which values are present in a particular column, we can use the \".value_counts()\" method:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "four    114\n",
       "two      89\n",
       "Name: num-of-doors, dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['num-of-doors'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that four doors are the most common type. We can also use the \".idxmax()\" method to calculate the most common type automatically:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'four'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['num-of-doors'].value_counts().idxmax()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The replacement procedure is very similar to what we have seen previously:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "#replace the missing 'num-of-doors' values by the most frequent \n",
    "df[\"num-of-doors\"].replace(np.nan, \"four\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's drop all rows that do not have price data:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# simply drop whole row with NaN in \"price\" column\n",
    "df.dropna(subset=[\"price\"], axis=0, inplace=True)\n",
    "\n",
    "# reset index, because we droped two rows\n",
    "df.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>engine-size</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>13495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>16500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>...</td>\n",
       "      <td>152</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "      <td>16500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>...</td>\n",
       "      <td>109</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500</td>\n",
       "      <td>24</td>\n",
       "      <td>30</td>\n",
       "      <td>13950</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>...</td>\n",
       "      <td>136</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>17450</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling normalized-losses         make fuel-type aspiration num-of-doors  \\\n",
       "0          3               122  alfa-romero       gas        std          two   \n",
       "1          3               122  alfa-romero       gas        std          two   \n",
       "2          1               122  alfa-romero       gas        std          two   \n",
       "3          2               164         audi       gas        std         four   \n",
       "4          2               164         audi       gas        std         four   \n",
       "\n",
       "    body-style drive-wheels engine-location  wheel-base  ...  engine-size  \\\n",
       "0  convertible          rwd           front        88.6  ...          130   \n",
       "1  convertible          rwd           front        88.6  ...          130   \n",
       "2    hatchback          rwd           front        94.5  ...          152   \n",
       "3        sedan          fwd           front        99.8  ...          109   \n",
       "4        sedan          4wd           front        99.4  ...          136   \n",
       "\n",
       "   fuel-system  bore  stroke compression-ratio horsepower  peak-rpm city-mpg  \\\n",
       "0         mpfi  3.47    2.68               9.0        111      5000       21   \n",
       "1         mpfi  3.47    2.68               9.0        111      5000       21   \n",
       "2         mpfi  2.68    3.47               9.0        154      5000       19   \n",
       "3         mpfi  3.19    3.40              10.0        102      5500       24   \n",
       "4         mpfi  3.19    3.40               8.0        115      5500       18   \n",
       "\n",
       "  highway-mpg  price  \n",
       "0          27  13495  \n",
       "1          27  16500  \n",
       "2          26  16500  \n",
       "3          30  13950  \n",
       "4          22  17450  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Good!</b> Now, we have a dataset with no missing values.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 id=\"correct_data_format\">Correct data format</h3>\n",
    "<b>We are almost there!</b>\n",
    "<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>\n",
    "\n",
    "In Pandas, we use:\n",
    "\n",
    "<p><b>.dtype()</b> to check the data type</p>\n",
    "<p><b>.astype()</b> to change the data type</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Let's list the data types for each column</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "symboling              int64\n",
       "normalized-losses     object\n",
       "make                  object\n",
       "fuel-type             object\n",
       "aspiration            object\n",
       "num-of-doors          object\n",
       "body-style            object\n",
       "drive-wheels          object\n",
       "engine-location       object\n",
       "wheel-base           float64\n",
       "length               float64\n",
       "width                float64\n",
       "height               float64\n",
       "curb-weight            int64\n",
       "engine-type           object\n",
       "num-of-cylinders      object\n",
       "engine-size            int64\n",
       "fuel-system           object\n",
       "bore                  object\n",
       "stroke                object\n",
       "compression-ratio    float64\n",
       "horsepower            object\n",
       "peak-rpm              object\n",
       "city-mpg               int64\n",
       "highway-mpg            int64\n",
       "price                 object\n",
       "dtype: object"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the \"astype()\" method.</p> \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Convert data types to proper format</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[\"bore\", \"stroke\"]] = df[[\"bore\", \"stroke\"]].astype(\"float\")\n",
    "df[[\"normalized-losses\"]] = df[[\"normalized-losses\"]].astype(\"int\")\n",
    "df[[\"price\"]] = df[[\"price\"]].astype(\"float\")\n",
    "df[[\"peak-rpm\"]] = df[[\"peak-rpm\"]].astype(\"float\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Let us list the columns after the conversion</h4>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "symboling              int64\n",
       "normalized-losses      int64\n",
       "make                  object\n",
       "fuel-type             object\n",
       "aspiration            object\n",
       "num-of-doors          object\n",
       "body-style            object\n",
       "drive-wheels          object\n",
       "engine-location       object\n",
       "wheel-base           float64\n",
       "length               float64\n",
       "width                float64\n",
       "height               float64\n",
       "curb-weight            int64\n",
       "engine-type           object\n",
       "num-of-cylinders      object\n",
       "engine-size            int64\n",
       "fuel-system           object\n",
       "bore                 float64\n",
       "stroke               float64\n",
       "compression-ratio    float64\n",
       "horsepower            object\n",
       "peak-rpm             float64\n",
       "city-mpg               int64\n",
       "highway-mpg            int64\n",
       "price                float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<b>Wonderful!</b>\n",
    "\n",
    "Now we have finally obtained the cleaned dataset with no missing values with all data in its proper format.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id=\"data_standardization\">Data Standardization</h2>\n",
    "<p>\n",
    "Data is usually collected from different agencies in different formats.\n",
    "(Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)\n",
    "</p>\n",
    "\n",
    "<b>What is standardization?</b>\n",
    "\n",
    "<p>Standardization is the process of transforming data into a common format, allowing the researcher to make the meaningful comparison.\n",
    "</p>\n",
    "\n",
    "<b>Example</b>\n",
    "\n",
    "<p>Transform mpg to L/100km:</p>\n",
    "<p>In our dataset, the fuel consumption columns \"city-mpg\" and \"highway-mpg\" are represented by mpg (miles per gallon) unit. Assume we are developing an application in a country that accepts the fuel consumption with L/100km standard.</p>\n",
    "<p>We will need to apply <b>data transformation</b> to transform mpg into L/100km.</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>The formula for unit conversion is:<p>\n",
    "L/100km = 235 / mpg\n",
    "<p>We can do many mathematical operations directly in Pandas.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>engine-size</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>13495.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>130</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>16500.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>...</td>\n",
       "      <td>152</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "      <td>16500.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>...</td>\n",
       "      <td>109</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>24</td>\n",
       "      <td>30</td>\n",
       "      <td>13950.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>...</td>\n",
       "      <td>136</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>17450.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized-losses         make fuel-type aspiration  \\\n",
       "0          3                122  alfa-romero       gas        std   \n",
       "1          3                122  alfa-romero       gas        std   \n",
       "2          1                122  alfa-romero       gas        std   \n",
       "3          2                164         audi       gas        std   \n",
       "4          2                164         audi       gas        std   \n",
       "\n",
       "  num-of-doors   body-style drive-wheels engine-location  wheel-base  ...  \\\n",
       "0          two  convertible          rwd           front        88.6  ...   \n",
       "1          two  convertible          rwd           front        88.6  ...   \n",
       "2          two    hatchback          rwd           front        94.5  ...   \n",
       "3         four        sedan          fwd           front        99.8  ...   \n",
       "4         four        sedan          4wd           front        99.4  ...   \n",
       "\n",
       "   engine-size  fuel-system  bore  stroke compression-ratio horsepower  \\\n",
       "0          130         mpfi  3.47    2.68               9.0        111   \n",
       "1          130         mpfi  3.47    2.68               9.0        111   \n",
       "2          152         mpfi  2.68    3.47               9.0        154   \n",
       "3          109         mpfi  3.19    3.40              10.0        102   \n",
       "4          136         mpfi  3.19    3.40               8.0        115   \n",
       "\n",
       "   peak-rpm city-mpg  highway-mpg    price  \n",
       "0    5000.0       21           27  13495.0  \n",
       "1    5000.0       21           27  16500.0  \n",
       "2    5000.0       19           26  16500.0  \n",
       "3    5500.0       24           30  13950.0  \n",
       "4    5500.0       18           22  17450.0  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "      <th>city-L/100km</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>13495.0</td>\n",
       "      <td>11.190476</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>27</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>11.190476</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>19</td>\n",
       "      <td>26</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>12.368421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>24</td>\n",
       "      <td>30</td>\n",
       "      <td>13950.0</td>\n",
       "      <td>9.791667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>18</td>\n",
       "      <td>22</td>\n",
       "      <td>17450.0</td>\n",
       "      <td>13.055556</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized-losses         make fuel-type aspiration  \\\n",
       "0          3                122  alfa-romero       gas        std   \n",
       "1          3                122  alfa-romero       gas        std   \n",
       "2          1                122  alfa-romero       gas        std   \n",
       "3          2                164         audi       gas        std   \n",
       "4          2                164         audi       gas        std   \n",
       "\n",
       "  num-of-doors   body-style drive-wheels engine-location  wheel-base  ...  \\\n",
       "0          two  convertible          rwd           front        88.6  ...   \n",
       "1          two  convertible          rwd           front        88.6  ...   \n",
       "2          two    hatchback          rwd           front        94.5  ...   \n",
       "3         four        sedan          fwd           front        99.8  ...   \n",
       "4         four        sedan          4wd           front        99.4  ...   \n",
       "\n",
       "   fuel-system  bore  stroke  compression-ratio horsepower peak-rpm  city-mpg  \\\n",
       "0         mpfi  3.47    2.68                9.0        111   5000.0        21   \n",
       "1         mpfi  3.47    2.68                9.0        111   5000.0        21   \n",
       "2         mpfi  2.68    3.47                9.0        154   5000.0        19   \n",
       "3         mpfi  3.19    3.40               10.0        102   5500.0        24   \n",
       "4         mpfi  3.19    3.40                8.0        115   5500.0        18   \n",
       "\n",
       "  highway-mpg    price  city-L/100km  \n",
       "0          27  13495.0     11.190476  \n",
       "1          27  16500.0     11.190476  \n",
       "2          26  16500.0     12.368421  \n",
       "3          30  13950.0      9.791667  \n",
       "4          22  17450.0     13.055556  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Convert mpg to L/100km by mathematical operation (235 divided by mpg)\n",
    "df['city-L/100km'] = 235/df[\"city-mpg\"]\n",
    "\n",
    "# check your transformed data \n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
    "<h1> Question  #2: </h1>\n",
    "\n",
    "<b>According to the example above, transform mpg to L/100km in the column of \"highway-mpg\" and change the name of column to \"highway-L/100km\".</b>\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel-type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>...</th>\n",
       "      <th>fuel-system</th>\n",
       "      <th>bore</th>\n",
       "      <th>stroke</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "      <th>city-L/100km</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>8.703704</td>\n",
       "      <td>13495.0</td>\n",
       "      <td>11.190476</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.47</td>\n",
       "      <td>2.68</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>8.703704</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>11.190476</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>2.68</td>\n",
       "      <td>3.47</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>19</td>\n",
       "      <td>9.038462</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>12.368421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>24</td>\n",
       "      <td>7.833333</td>\n",
       "      <td>13950.0</td>\n",
       "      <td>9.791667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>...</td>\n",
       "      <td>mpfi</td>\n",
       "      <td>3.19</td>\n",
       "      <td>3.40</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>18</td>\n",
       "      <td>10.681818</td>\n",
       "      <td>17450.0</td>\n",
       "      <td>13.055556</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized-losses         make fuel-type aspiration  \\\n",
       "0          3                122  alfa-romero       gas        std   \n",
       "1          3                122  alfa-romero       gas        std   \n",
       "2          1                122  alfa-romero       gas        std   \n",
       "3          2                164         audi       gas        std   \n",
       "4          2                164         audi       gas        std   \n",
       "\n",
       "  num-of-doors   body-style drive-wheels engine-location  wheel-base  ...  \\\n",
       "0          two  convertible          rwd           front        88.6  ...   \n",
       "1          two  convertible          rwd           front        88.6  ...   \n",
       "2          two    hatchback          rwd           front        94.5  ...   \n",
       "3         four        sedan          fwd           front        99.8  ...   \n",
       "4         four        sedan          4wd           front        99.4  ...   \n",
       "\n",
       "   fuel-system  bore  stroke  compression-ratio horsepower peak-rpm  city-mpg  \\\n",
       "0         mpfi  3.47    2.68                9.0        111   5000.0        21   \n",
       "1         mpfi  3.47    2.68                9.0        111   5000.0        21   \n",
       "2         mpfi  2.68    3.47                9.0        154   5000.0        19   \n",
       "3         mpfi  3.19    3.40               10.0        102   5500.0        24   \n",
       "4         mpfi  3.19    3.40                8.0        115   5500.0        18   \n",
       "\n",
       "  highway-mpg    price  city-L/100km  \n",
       "0    8.703704  13495.0     11.190476  \n",
       "1    8.703704  16500.0     11.190476  \n",
       "2    9.038462  16500.0     12.368421  \n",
       "3    7.833333  13950.0      9.791667  \n",
       "4   10.681818  17450.0     13.055556  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Write your code below and press Shift+Enter to execute \n",
    "#df[\"highway-mpg\"] = 235/df[\"highway-mpg\"]\n",
    "df.rename(columns={'\"highway-mpg\"':'\"highway-L/100km\"'}, inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',\n",
       "       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',\n",
       "       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',\n",
       "       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',\n",
       "       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',\n",
       "       'highway-mpg', 'price', 'city-L/100km'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>Click here for the solution</summary>\n",
    "\n",
    "```python\n",
    "# transform mpg to L/100km by mathematical operation (235 divided by mpg)\n",
    "df[\"highway-mpg\"] = 235/df[\"highway-mpg\"]\n",
    "\n",
    "# rename column name from \"highway-mpg\" to \"highway-L/100km\"\n",
    "df.rename(columns={'\"highway-mpg\"':'highway-L/100km'}, inplace=True)\n",
    "\n",
    "# check your transformed data \n",
    "df.head()\n",
    "\n",
    "```\n",
    "\n",
    "</details>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id=\"data_normalization\">Data Normalization</h2>\n",
    "\n",
    "<b>Why normalization?</b>\n",
    "\n",
    "<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.\n",
    "</p>\n",
    "\n",
    "<b>Example</b>\n",
    "\n",
    "<p>To demonstrate normalization, let's say we want to scale the columns \"length\", \"width\" and \"height\".</p>\n",
    "<p><b>Target:</b> would like to normalize those variables so their value ranges from 0 to 1</p>\n",
    "<p><b>Approach:</b> replace original value by (original value)/(maximum value)</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace (original value) by (original value)/(maximum value)\n",
    "df['length'] = df['length']/df['length'].max()\n",
    "df['width'] = df['width']/df['width'].max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
    "<h1> Question #3: </h1>\n",
    "\n",
    "<b>According to the example above, normalize the column \"height\".</b>\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    201.000000\n",
       "mean       0.899108\n",
       "std        0.040933\n",
       "min        0.799331\n",
       "25%        0.869565\n",
       "50%        0.904682\n",
       "75%        0.928094\n",
       "max        1.000000\n",
       "Name: height, dtype: float64"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Write your code below and press Shift+Enter to execute \n",
    "df['height'] = df['height']/df['height'].max()\n",
    "df['height'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>Click here for the solution</summary>\n",
    "\n",
    "```python\n",
    "df['height'] = df['height']/df['height'].max() \n",
    "\n",
    "# show the scaled columns\n",
    "df[[\"length\",\"width\",\"height\"]].head()\n",
    "\n",
    "\n",
    "```\n",
    "\n",
    "</details>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see we've normalized \"length\", \"width\" and \"height\" in the range of \\[0,1].\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id=\"binning\">Binning</h2>\n",
    "<b>Why binning?</b>\n",
    "<p>\n",
    "    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.\n",
    "</p>\n",
    "\n",
    "<b>Example: </b>\n",
    "\n",
    "<p>In our dataset, \"horsepower\" is a real valued variable ranging from 48 to 288 and it has 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis? </p>\n",
    "\n",
    "<p>We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Example of Binning Data In Pandas</h3>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert data to correct format:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    201.000000\n",
       "mean     103.402985\n",
       "std       37.365650\n",
       "min       48.000000\n",
       "25%       70.000000\n",
       "50%       95.000000\n",
       "75%      116.000000\n",
       "max      262.000000\n",
       "Name: horsepower, dtype: float64"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"horsepower\"]=df[\"horsepower\"].astype(int, copy=True)\n",
    "df[\"horsepower\"].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the histogram of horsepower to see what the distribution of horsepower looks like.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'horsepower bins')"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAEWCAYAAABhffzLAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAVCUlEQVR4nO3de7SldX3f8feHAUXlHgYcwThIiStoDeoEtRSkEBXBZqiJF1awU0OCdqnRVU0yJk1q2piCMSbGusrCahjjldQg09il0lEGSRQdDCKIBoKDF0ZmuChoWxT89o/nN2Uzc87hAPOcfeb83q+19trPdT/f/Zs9n/07z372b6eqkCT1Y49pFyBJWlgGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+TVWSzUl+Ydp1LHZJTkzy7TnWn5fk9xayJu2+9px2AZIevqp61bRr0O7DHr+WhCRLphOzlJ6LFieDX4vBMUmuTvL9JB9Jsvf2FUl+PckNSW5Psj7J4ybWVZJXJ7keuD6DP02ytT3W1Ume0rZ9ZJK3JflmklvaqZFHtXUnJvl2kt9Jcms7/fQrE8fZP8n7kmxLclOSf59kj7bupiTPaNNntpqObvO/luRjbXqPJGuT/GOS25JcmOSgtm5l2++sJN8EPj1bQ81R4wVJ/nCH5/OG1hZbkrxiYttTk3w1yV1JvpPkjQ/j3067IYNfi8FLgFOAI4CnAv8GIMlJwH9u61cANwEf3mHf04FnAkcDzwNOAH4GOAB4KXBb2+7ctvwY4J8AhwG/P/E4jwUObsvXAOcneVJb905gf+CJwHOAfw1sD9KNwIlt+gTgxrbN9vmNbfo3Wq3PAR4H3AG8a4fn8hzgZ4HnM7O5apxp2/3btmcB70pyYFv3HuCVVbUv8BTmeKPRElVV3rxN7QZsBs6cmH8rcF6bfg/w1ol1+wA/Bla2+QJOmlh/EvAPwLOAPSaWB/ghcOTEsmcD32jTJwL3AI+ZWH8h8HvAMuBu4OiJda8ELm3TZwHr2/R1wK8BH27zNwFPn1h38sRjrGjPZU9gZXsuT5yjnWatsU1fAPzhxLb/B9hzYtutwLPa9Dfbc9hv2v/+3qZzs8evxeC7E9P/myHgYegZ37R9RVX9gKEHf9jE9t+aWP9p4L8w9KRvSXJ+kv2A5cCjgSuTfC/J94BPtOXb3VFVP5yYv6kd/2DgEZN1tOntNWwEjk/yWIY3iY8AxyVZydDjvqpt9wTgoonjXwfcCxw603OZxWw1zuS2qrpnYn6yXX8JOBW4KcnGJM9+gONqiTH4tZjdzBCYACR5DPBTwHcmtrnf8LJV9edV9QzgyQyndn4TuJWhB/zkqjqg3favqn0mdj2wPf52P92OfytDz/wJO6z7TjveDQyh+hvAZVV1F8Mb2dnA5VX1k7bPt4AXTBz/gKrau6pmfS4zmK3GB6WqvlhVq4FDgI8x/OWgjhj8Wsw+CLwiyTFJHgn8EXBFVW2eaeMkP5/kmUn2Yji183+Be1v4vhv40ySHtG0PS7LjufQ/SPKIJMcDLwT+qqruZQjGtyTZN8kTgH8HvH9iv43Aa7jvfP6lO8wDnNce4wnt+MuTrH4IbbJTjQ9m57bvryTZv6p+DNzJ8JeHOmLwa9Gqqg0M59k/CmwBjgReNscu+zEE/B0Mp0FuA97W1v02cAPw+SR3Av8LmPxg9Lttv5uBDwCvqqqvtXWvZXgjuRG4nOEN6b0T+24E9gUum2Ue4B3AeuBTSe4CPs/wofSDMVeND8bLgc2tHV4FnPkQHkO7sVT5QyzqW5ITgfdX1eFTLkVaEPb4JakzBr8kdcZTPZLUmVHHBEmyGbiL4aqBe6pqVfua+kcYvrSyGXhJVd0xZh2SpPuM2uNvwb+qqm6dWPZW4PaqOifJWuDAqvrtuR7n4IMPrpUrV45WpyQtRVdeeeWtVbV8x+XTGAVwNfeNbbKO4ZrnOYN/5cqVbNq0adyqJGmJSXLTTMvH/nC3GK5bvjLJ2W3ZoVW1BaDdHzLTjknOTrIpyaZt27aNXKYk9WPsHv9xVXVz+7bkJUnm/WWTqjofOB9g1apVfgItSbvIqD3+qrq53W8FLgKOZRg8awVAu986Zg2SpPsbLfiTPCbJvtunGcZKv4bha+tr2mZrgIvHqkGStLMxT/UcyjAM7fbjfLCqPpHki8CFSc5iGBf8xSPWIEnawWjBX1U3Aj83w/LbgJPHOq4kaW4O2SBJnTH4JakzBr8kdWYa39zVyFau/fhUjrv5nNOmclxJD449fknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xkHaRjStwdIkaS72+CWpMwa/JHXG4Jekziz5c/yeZ5ek+7PHL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+SeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1JnRgz/JsiR/n+Rv2vxBSS5Jcn27P3DsGiRJ91mIHv/rgOsm5tcCG6rqKGBDm5ckLZBRgz/J4cBpwH+bWLwaWNem1wGnj1mDJOn+xu7x/xnwW8BPJpYdWlVbANr9ITPtmOTsJJuSbNq2bdvIZUpSP0YL/iQvBLZW1ZUPZf+qOr+qVlXVquXLl+/i6iSpX2P+9OJxwC8mORXYG9gvyfuBW5KsqKotSVYAW0esQZK0g9F6/FX1pqo6vKpWAi8DPl1VZwLrgTVtszXAxWPVIEna2TSu4z8HeG6S64HntnlJ0gIZ81TP/1dVlwKXtunbgJMX4riSpJ35zV1J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+SeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+SeqMwS9JnRkt+JPsneQLSb6c5Nokf9CWH5TkkiTXt/sDx6pBkrSzMXv8dwMnVdXPAccApyR5FrAW2FBVRwEb2rwkaYGMFvw1+EGb3avdClgNrGvL1wGnj1WDJGlno57jT7IsyVXAVuCSqroCOLSqtgC0+0Nm2ffsJJuSbNq2bduYZUpSV0YN/qq6t6qOAQ4Hjk3ylAex7/lVtaqqVi1fvny0GiWpNwtyVU9VfQ+4FDgFuCXJCoB2v3UhapAkDca8qmd5kgPa9KOAXwC+BqwH1rTN1gAXj1WDJGlne4742CuAdUmWMbzBXFhVf5Pkc8CFSc4Cvgm8eMQaJEk7GC34q+pq4GkzLL8NOHms40qS5uY3dyWpMwa/JHXG4Jekzhj8ktSZeQV/kg3zWSZJWvzmvKonyd7Ao4GD2yiaaav2Ax43cm2SpBE80OWcrwRezxDyV3Jf8N8JvGu8siRJY5kz+KvqHcA7kry2qt65QDVJkkY0ry9wVdU7k/wzYOXkPlX1vpHqkiSNZF7Bn+QvgSOBq4B72+ICDH5J2s3Md8iGVcDRVVVjFiNJGt98r+O/BnjsmIVIkhbGfHv8BwNfTfIFht/SBaCqfnGUqiRJo5lv8L95zCIkSQtnvlf1bBy7EEnSwpjvVT13MVzFA/AIYC/gh1W131iFSZLGMd8e/76T80lOB44doyBJ0rge0uicVfUx4KRdW4okaSHM91TPiyZm92C4rt9r+iVpNzTfq3r+5cT0PcBmYPUur0aSNLr5nuN/xdiFaPe3cu3Hp3bszeecNrVjS7ub+f4Qy+FJLkqyNcktST6a5PCxi5Mk7Xrz/XD3L4D1DOPyHwb8j7ZMkrSbmW/wL6+qv6iqe9rtAmD5iHVJkkYy3+C/NcmZSZa125nAbWMWJkkax3yD/1eBlwDfBbYAvwz4ga8k7YbmeznnfwLWVNUdAEkOAt7G8IYgSdqNzLfH/9TtoQ9QVbcDTxunJEnSmOYb/HskOXD7TOvxz/evBUnSIjLf8P4T4O+S/HeGoRpeArxltKokSaOZ7zd335dkE8PAbAFeVFVfHbUySdIo5n26pgW9YS9Ju7mHNCyzJGn3ZfBLUmdGC/4kj0/ymSTXJbk2yeva8oOSXJLk+nZ/4AM9liRp1xmzx38P8Iaq+lngWcCrkxwNrAU2VNVRwIY2L0laIKMFf1Vtqaovtem7gOsYRvZcDaxrm60DTh+rBknSzhbkHH+SlQzf9L0COLSqtsDw5gAcshA1SJIGowd/kn2AjwKvr6o7H8R+ZyfZlGTTtm3bxitQkjozavAn2Ysh9D9QVX/dFt+SZEVbvwLYOtO+VXV+Va2qqlXLlzv0vyTtKmNe1RPgPcB1VfX2iVXrgTVteg1w8Vg1SJJ2NuZAa8cBLwe+kuSqtux3gHOAC5OcBXwTePGINUiSdjBa8FfV5Qzj+szk5LGOK0mam9/claTOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnxhydU1ryVq79+NSOvfmc06Z2bO3e7PFLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZr+PXkjDN6+ml3Y09fknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM6MFvxJ3ptka5JrJpYdlOSSJNe3+wPHOr4kaWZj9vgvAE7ZYdlaYENVHQVsaPOSpAU0WvBX1WXA7TssXg2sa9PrgNPHOr4kaWYLfY7/0KraAtDuD1ng40tS9xbth7tJzk6yKcmmbdu2TbscSVoyFjr4b0myAqDdb51tw6o6v6pWVdWq5cuXL1iBkrTULXTwrwfWtOk1wMULfHxJ6t6Yl3N+CPgc8KQk305yFnAO8Nwk1wPPbfOSpAW051gPXFVnzLLq5LGOKUl6YIv2w11J0jgMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjoz2g+xSBrXyrUfn8pxN59z2lSOq13HHr8kdcbgl6TOGPyS1BmDX5I644e7kh6UaX2oDH6wvKvY45ekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjrjWD2S9ACW2vhE9vglqTNTCf4kpyT5epIbkqydRg2S1KsFD/4ky4B3AS8AjgbOSHL0QtchSb2aRo//WOCGqrqxqn4EfBhYPYU6JKlL0/hw9zDgWxPz3waeueNGSc4Gzm6zP0jy9V1w7IOBW3fB4yxFts3sbJvZLWjb5NyFOtIusUva5mE+5yfMtHAawZ8ZltVOC6rOB87fpQdONlXVql35mEuFbTM722Z2ts3sFnPbTONUz7eBx0/MHw7cPIU6JKlL0wj+LwJHJTkiySOAlwHrp1CHJHVpwU/1VNU9SV4DfBJYBry3qq5doMPv0lNHS4xtMzvbZna2zewWbdukaqfT65KkJcxv7kpSZwx+SerMkg3+JJuTfCXJVUk2tWUHJbkkyfXt/sBp17lQkrw3ydYk10wsm7U9krypDanx9STPn07VC2OWtnlzku+0189VSU6dWNdT2zw+yWeSXJfk2iSva8u7f+3M0TaL/7VTVUvyBmwGDt5h2VuBtW16LXDutOtcwPY4AXg6cM0DtQfDUBpfBh4JHAH8I7Bs2s9hgdvmzcAbZ9i2t7ZZATy9Te8L/ENrg+5fO3O0zaJ/7SzZHv8sVgPr2vQ64PTplbKwquoy4PYdFs/WHquBD1fV3VX1DeAGhqE2lqRZ2mY2vbXNlqr6Upu+C7iO4dv33b925mib2SyatlnKwV/Ap5Jc2YZ/ADi0qrbA8I8GHDK16haH2dpjpmE15npBL1WvSXJ1OxW0/VRGt22TZCXwNOAKfO3czw5tA4v8tbOUg/+4qno6wyigr05ywrQL2o3Ma1iNJe6/AkcCxwBbgD9py7tsmyT7AB8FXl9Vd8616QzLlnT7zNA2i/61s2SDv6pubvdbgYsY/qS6JckKgHa/dXoVLgqztUf3w2pU1S1VdW9V/QR4N/f9Sd5d2yTZiyHYPlBVf90W+9ph5rbZHV47SzL4kzwmyb7bp4HnAdcwDA2xpm22Brh4OhUuGrO1x3rgZUkemeQI4CjgC1Oob2q2h1rzrxheP9BZ2yQJ8B7guqp6+8Sq7l87s7XNbvHamfYn4yN92v5Ehk/PvwxcC/xuW/5TwAbg+nZ/0LRrXcA2+RDDn50/Zuh5nDVXewC/y3DVwdeBF0y7/im0zV8CXwGuZvgPu6LTtvnnDKcjrgauardTfe3M2TaL/rXjkA2S1JkleapHkjQ7g1+SOmPwS1JnDH5J6ozBL0mdMfi1pCRZOTnKpqSdGfxSk2TBf4r0odhd6tTiZfBrKVqW5N1tjPRPJXlUkmOSfL4NnHXR9oGzklya5I+SbARel+TFSa5J8uUkl7VtliX54yRfbPu/si0/Mcll7fG+muS8JHu0dWdk+D2Ia5Kc25a9JMnb2/TrktzYpo9McnmbfkaSjW1wwU9ODItwvzoXtjm11Nhz0FJ0FHBGVf16kguBXwJ+C3htVW1M8h+B/wC8vm1/QFU9ByDJV4DnV9V3khzQ1p8FfL+qfj7JI4G/TfKptu5YhnHWbwI+Abwoyd8B5wLPAO5gGCX2dOAy4DfbfscDtyU5jOEboJ9t4768E1hdVduSvBR4C/CrO9YpPRwGv5aib1TVVW36SoaREg+oqo1t2Trgrya2/8jE9N8CF7Q3jO0Dkj0PeGqSX27z+zO8ufwI+EJVbe+5f4ghxH8MXFpV29ryDwAnVNXHkuzTxpF6PPBBhh+BOb4d60nAU4BLhmFgWMYwlMRMdUoPmcGvpejuiel7gQMeYPsfbp+oqlcleSZwGnBVkmMYhtN9bVV9cnKnJCey87C6xczD7273OeAVDGO1fJahN/9s4A3ATwPXVtWzH6hO6eHwHL968H3gjiTHt/mXAxtn2jDJkVV1RVX9PnArQ8/8k8C/badiSPIzbdRXgGOTHNHO7b8UuJzhxziek+TgJMuAMyaOdxnwxnb/98C/AO6uqu8zvBksT/Lsdpy9kjx51zWDNLDHr16sAc5L8mjgRoZe90z+OMlRDL32DQwjvF4NrAS+1Ibi3cZ9PzX4OeAc4J8yhPlFVfWTJG8CPtMe539W1fZhiz/L8GZyWVXdm+RbwNcAqupH7XTSnyfZn+H/558xjDAr7TKOzik9RO1Uzxur6oVTLkV6UDzVI0mdsccvSZ2xxy9JnTH4JakzBr8kdcbgl6TOGPyS1Jn/B2EXAS5SRAWHAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib as plt\n",
    "from matplotlib import pyplot\n",
    "plt.pyplot.hist(df[\"horsepower\"])\n",
    "\n",
    "# set x/y labels and plot title\n",
    "plt.pyplot.xlabel(\"horsepower\")\n",
    "plt.pyplot.ylabel(\"count\")\n",
    "plt.pyplot.title(\"horsepower bins\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>We would like 3 bins of equal size bandwidth so we use numpy's <code>linspace(start_value, end_value, numbers_generated</code> function.</p>\n",
    "<p>Since we want to include the minimum value of horsepower, we want to set start_value = min(df[\"horsepower\"]).</p>\n",
    "<p>Since we want to include the maximum value of horsepower, we want to set end_value = max(df[\"horsepower\"]).</p>\n",
    "<p>Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 48.        , 119.33333333, 190.66666667, 262.        ])"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bins = np.linspace(min(df[\"horsepower\"]), max(df[\"horsepower\"]), 4)\n",
    "bins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We set group  names:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "group_names = ['Low', 'Medium', 'High']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We apply the function \"cut\" to determine what each value of `df['horsepower']` belongs to.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>horsepower</th>\n",
       "      <th>horsepower-binned</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>111</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>111</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>154</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>102</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>115</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>110</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>110</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>110</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>140</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>101</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>101</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>121</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>121</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>121</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>182</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>182</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>182</td>\n",
       "      <td>Medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>48</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>70</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>70</td>\n",
       "      <td>Low</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    horsepower horsepower-binned\n",
       "0          111               Low\n",
       "1          111               Low\n",
       "2          154            Medium\n",
       "3          102               Low\n",
       "4          115               Low\n",
       "5          110               Low\n",
       "6          110               Low\n",
       "7          110               Low\n",
       "8          140            Medium\n",
       "9          101               Low\n",
       "10         101               Low\n",
       "11         121            Medium\n",
       "12         121            Medium\n",
       "13         121            Medium\n",
       "14         182            Medium\n",
       "15         182            Medium\n",
       "16         182            Medium\n",
       "17          48               Low\n",
       "18          70               Low\n",
       "19          70               Low"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )\n",
    "df[['horsepower','horsepower-binned']].head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the number of vehicles in each bin:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Low       153\n",
       "Medium     43\n",
       "High        5\n",
       "Name: horsepower-binned, dtype: int64"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"horsepower-binned\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the distribution of each bin:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'horsepower bins')"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAZA0lEQVR4nO3dfbRddX3n8feHoOATD5qLIgGCDDoNDEWNqHVQKlWpOoalVcMSGxUbcVHUGVFDnUrHioMPo1WrZcWKYFUgPiCZ0frQWIiPYFAEASmpBAgguSAKouXxO3/sne3hepNcLjnn3Nzzfq1119n799v77O+9Jzmfs/c++7dTVUiSBLDdsAuQJM0choIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoaEZKsi7Jnwy7jpkuyaFJ1m+m/5Qkfz3ImrRt237YBUjqn6o6Ztg1aNvinoJmtSSz5oPPbPpdNHMZCprJDkpycZJfJTkryY4bO5L8RZK1SX6RZGWSx/b0VZJjk1wJXJnGB5NsaJ/r4iQHtMvukOT9Sa5JcmN7uOUhbd+hSdYn+askN7WHtF7Rs52dk3wqyXiSq5P8zyTbtX1XJ3lyO31UW9OCdv61Sb7UTm+XZFmSf09yc5IVSR7Z9s1v1zs6yTXANzf1h9pMjacledeE3+fN7d/ihiSv7ln2+UkuS3JbkuuSHP8AXjttowwFzWQvAw4H9gEOBF4FkOTZwP9u+3cHrgbOnLDuEcBTgQXAc4FnAo8HdgFeDtzcLveetv0g4D8BewDv6HmexwBz2/YlwPIkT2j7PgLsDDwOeBbw58DGN9nzgEPb6WcCP2uX2Th/Xjv9hrbWZwGPBW4BPjrhd3kW8AfA85jc5mqcbNmd22WPBj6aZNe27xPA66rqEcABbCaENItVlT/+zLgfYB1wVM/8e4FT2ulPAO/t6Xs4cBcwv50v4Nk9/c8G/g14GrBdT3uA24F9e9qeDlzVTh8K3A08rKd/BfDXwBzgDmBBT9/rgHPb6aOBle305cBrgTPb+auBJ/X0HdbzHLu3v8v2wPz2d3ncZv5Om6yxnT4NeFfPsr8Ftu9ZdgPwtHb6mvZ32GnYr78/w/txT0Ez2c97pn9D8+YPzSfqqzd2VNWvaT7579Gz/LU9/d8E/p7mE/iNSZYn2QkYAx4KXJjkl0l+CXy1bd/olqq6vWf+6nb7c4EH99bRTm+s4TzgkCSPoQmQs4BnJJlP80n9ona5vYGze7Z/OXAP8OjJfpdN2FSNk7m5qu7ume/9u74EeD5wdZLzkjx9C9vVLGQoaFt0Pc2bKQBJHgY8CriuZ5n7DP9bVR+uqicD+9McLnoLcBPNJ+f9q2qX9mfnqnp4z6q7ts+/0V7t9m+i+US/94S+69rtraV5w30DsLqqbqMJuaXAt6vq3nada4E/7dn+LlW1Y1Vt8neZxKZqvF+q6gdVtQjYDfgSzR6HRoyhoG3RZ4FXJzkoyQ7Au4Hzq2rdZAsneUqSpyZ5EM3hov8A7mnfmD8OfDDJbu2yeySZeOz+fyV5cJJDgBcCn6uqe2jeNE9K8ogkewP/A/h0z3rnAX/J784fnDthHuCU9jn2brc/lmTRNP4mv1fj/Vm5XfcVSXauqruAW2n2WDRiDAVtc6pqFc1x/S8ANwD7Aos3s8pONG/+t9AcWrkZeH/b9zZgLfD9JLcC/wL0nqT9ebve9cBngGOq6qdt33E0IfMz4Ns0YXVqz7rnAY8AVm9iHuBDwErg60luA75Pc4L8/thcjffHK4F17d/hGOCoaTyHtnGp8iY70mSSHAp8uqrmDbkUaWDcU5AkdQwFSVKnb6GQ5NT2qsmfTGg/LskVSS5N8t6e9hPaK1SvmOREnzRwVXWuh440avo5lsppNN8N/9TGhiR/DCwCDqyqO3q+8bGA5kTh/jTfr/6XJI9vv+EhSRqQvoVCVa1uL9Tp9Xrg5Kq6o11mQ9u+iOZqzzuAq5KsBQ4Gvre5bcydO7fmz5+4CUnS5lx44YU3VdXYZH2DHnXx8TRXeZ5E813x46vqBzRXgX6/Z7n13Pfq1E6SpTQXALHXXnuxZs2a/lYsSbNMkqs31TfoE83bA7vSjEHzFmBFktCMQTPRpN+VrarlVbWwqhaOjU0adJKkaRp0KKwHvliNC4B7acaQWQ/s2bPcPKZxmb4k6YEZdCh8iWbESpI8nmZAsZtoruhc3I5tvw+wH3DBgGuTpJHXt3MKSc6gGap3bpp7yJ5IMwTAqe3XVO8EllRzSfWlSVYAl9EMA3ys3zySpMHbpoe5WLhwYXmiWZLunyQXVtXCyfq8olmS1DEUJEkdQ0GS1DEUJEmdQV/RPKPMX/blYZcwa607+QXDLkHSNLinIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnq9C0UkpyaZEN7P+aJfccnqSRze9pOSLI2yRVJntevuiRJm9bPPYXTgMMnNibZE3gOcE1P2wJgMbB/u87HkszpY22SpEn0LRSqajXwi0m6Pgi8FaietkXAmVV1R1VdBawFDu5XbZKkyQ30nEKSFwHXVdWPJ3TtAVzbM7++bZvsOZYmWZNkzfj4eJ8qlaTRNLBQSPJQ4O3AOybrnqStJmmjqpZX1cKqWjg2NrY1S5SkkTfI23HuC+wD/DgJwDzgh0kOptkz2LNn2XnA9QOsTZLEAPcUquqSqtqtquZX1XyaIHhSVf0cWAksTrJDkn2A/YALBlWbJKnRz6+kngF8D3hCkvVJjt7UslV1KbACuAz4KnBsVd3Tr9okSZPr2+GjqjpyC/3zJ8yfBJzUr3okSVvmFc2SpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnq9PMezacm2ZDkJz1t70vy0yQXJzk7yS49fSckWZvkiiTP61ddkqRN6+eewmnA4RPavgEcUFUHAv8GnACQZAGwGNi/XedjSeb0sTZJ0iT6FgpVtRr4xYS2r1fV3e3s94F57fQi4MyquqOqrgLWAgf3qzZJ0uSGeU7hNcA/t9N7ANf29K1v235PkqVJ1iRZMz4+3ucSJWm0DCUUkrwduBv4zMamSRarydatquVVtbCqFo6NjfWrREkaSdsPeoNJlgAvBA6rqo1v/OuBPXsWmwdcP+jaJGnUDXRPIcnhwNuAF1XVb3q6VgKLk+yQZB9gP+CCQdYmSerjnkKSM4BDgblJ1gMn0nzbaAfgG0kAvl9Vx1TVpUlWAJfRHFY6tqru6VdtkqTJ9S0UqurISZo/sZnlTwJO6lc9kqQt84pmSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdfoWCklOTbIhyU962h6Z5BtJrmwfd+3pOyHJ2iRXJHlev+qSJG1aP/cUTgMOn9C2DFhVVfsBq9p5kiwAFgP7t+t8LMmcPtYmSZpE30KhqlYDv5jQvAg4vZ0+HTiip/3Mqrqjqq4C1gIH96s2SdLkBn1O4dFVdQNA+7hb274HcG3Pcuvbtt+TZGmSNUnWjI+P97VYSRo1M+VEcyZpq8kWrKrlVbWwqhaOjY31uSxJGi2DDoUbk+wO0D5uaNvXA3v2LDcPuH7AtUnSyBt0KKwElrTTS4BzetoXJ9khyT7AfsAFA65Nkkbe9v164iRnAIcCc5OsB04ETgZWJDkauAZ4KUBVXZpkBXAZcDdwbFXd06/aJEmT61soVNWRm+g6bBPLnwSc1K96JElbNlNONEuSZgBDQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUmVIoJFk1lTZJ0rZts2MfJdkReCjNoHa78rv7HuwEPLbPtUmSBmxLA+K9DngTTQBcyO9C4Vbgo/0rS5I0DJsNhar6EPChJMdV1UcGVJMkaUimNHR2VX0kyR8B83vXqapP9akuSdIQTCkUkvwTsC9wEbDx5jcFGAqSNItM9SY7C4EFVVX9LEaSNFxTvU7hJ8Bj+lmIJGn4prqnMBe4LMkFwB0bG6vqRdPZaJL/DryW5hDUJcCrab76ehbNeYt1wMuq6pbpPL8kaXqmGgp/s7U2mGQP4A00h6N+m2QFsBhYAKyqqpOTLAOWAW/bWtuVJG3ZVL99dF4ftvuQJHfR7CFcD5wAHNr2nw6ci6EgSQM11WEubktya/vzH0nuSXLrdDZYVdcB7weuAW4AflVVXwceXVU3tMvcAOw2neeXJE3fVPcUHtE7n+QI4ODpbLAdLmMRsA/wS+BzSY66H+svBZYC7LXXXtMpQZK0CdMaJbWqvgQ8e5rb/BPgqqoar6q7gC8CfwTcmGR3gPZxwya2vbyqFlbVwrGxsWmWIEmazFQvXntxz+x2NNctTPeahWuApyV5KPBb4DBgDXA7sAQ4uX08Z5rPL0mapql+++i/9UzfTfOV0UXT2WBVnZ/k88AP2+f6EbAceDiwIsnRNMHx0uk8vyRp+qZ6TuHVW3OjVXUicOKE5jto9hokSUMy1W8fzUtydpINSW5M8oUk8/pdnCRpsKZ6ovmTwEqa+yrsAfzftk2SNItMNRTGquqTVXV3+3Ma4Fd/JGmWmWoo3JTkqCRz2p+jgJv7WZgkafCmGgqvAV4G/JzmKuQ/oxnETpI0i0z1K6l/CyzZOGppkkfSDFXxmn4VJkkavKnuKRzYO4x1Vf0CeGJ/SpIkDctUQ2G7dswioNtTmOpehiRpGzHVN/b/A3y3vRK5aM4vnNS3qiRJQzHVK5o/lWQNzSB4AV5cVZf1tTJJ0sBN+RBQGwIGgSTNYtMaOluSNDsZCpKkjqEgSeoYCpKkjqEgSeoYCpKkjqEgSeoMJRSS7JLk80l+muTyJE9P8sgk30hyZfu465afSZK0NQ1rT+FDwFer6j8DfwhcDiwDVlXVfsCqdl6SNEADD4UkOwHPBD4BUFV3VtUvgUXA6e1ipwNHDLo2SRp1w9hTeBwwDnwyyY+S/GOShwGPrqobANrH3YZQmySNtGGEwvbAk4B/qKonArdzPw4VJVmaZE2SNePj4/2qUZJG0jBCYT2wvqrOb+c/TxMSNybZHaB93DDZylW1vKoWVtXCsbGxgRQsSaNi4KFQVT8Hrk3yhLbpMJrRV1cCS9q2JcA5g65NkkbdsO6edhzwmSQPBn4GvJomoFYkORq4BnjpkGqTpJE1lFCoqouAhZN0HTbgUiRJPbyiWZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSZ1h3aOZJHOANcB1VfXCJI8EzgLmA+uAl1XVLcOqTzPP/GVfHnYJs9a6k18w7BI0QwxzT+GNwOU988uAVVW1H7CqnZckDdBQQiHJPOAFwD/2NC8CTm+nTweOGHBZkjTyhrWn8HfAW4F7e9oeXVU3ALSPuw2hLkkaaQMPhSQvBDZU1YXTXH9pkjVJ1oyPj2/l6iRptA1jT+EZwIuSrAPOBJ6d5NPAjUl2B2gfN0y2clUtr6qFVbVwbGxsUDVL0kgYeChU1QlVNa+q5gOLgW9W1VHASmBJu9gS4JxB1yZJo24mXadwMvCcJFcCz2nnJUkDNLTrFACq6lzg3Hb6ZuCwYdYjSaNuJu0pSJKGzFCQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSZ+ChkGTPJP+a5PIklyZ5Y9v+yCTfSHJl+7jroGuTpFE3jD2Fu4E3V9UfAE8Djk2yAFgGrKqq/YBV7bwkaYAGHgpVdUNV/bCdvg24HNgDWASc3i52OnDEoGuTpFE31HMKSeYDTwTOBx5dVTdAExzAbptYZ2mSNUnWjI+PD6xWSRoFQwuFJA8HvgC8qapunep6VbW8qhZW1cKxsbH+FShJI2gooZDkQTSB8Jmq+mLbfGOS3dv+3YENw6hNkkbZML59FOATwOVV9YGerpXAknZ6CXDOoGuTpFG3/RC2+QzglcAlSS5q2/4KOBlYkeRo4BrgpUOoTZJG2sBDoaq+DWQT3YcNshZJ0n15RbMkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6w7ifgqQRMX/Zl4ddwqy17uQX9OV53VOQJHUMBUlSx1CQJHVmXCgkOTzJFUnWJlk27HokaZTMqFBIMgf4KPCnwALgyCQLhluVJI2OGRUKwMHA2qr6WVXdCZwJLBpyTZI0MmbaV1L3AK7tmV8PPLV3gSRLgaXt7K+TXDGg2oZtLnDTsIuYqrxn2BXMCNvMa+brBWxDrxc84Nds7011zLRQyCRtdZ+ZquXA8sGUM3MkWVNVC4ddh6bO12zb4uvVmGmHj9YDe/bMzwOuH1ItkjRyZloo/ADYL8k+SR4MLAZWDrkmSRoZM+rwUVXdneQvga8Bc4BTq+rSIZc1U4zcIbNZwNds2+LrBaSqtryUJGkkzLTDR5KkITIUJEkdQ2EGSPLrYdegRpJK8k8989snGU/y/+7n85ybZGE7/ZUku2zlUjVFE/9/JXlVkr9vp49J8udbWL9bfhTMqBPN0gxwO3BAkodU1W+B5wDXPZAnrKrnb5XKtNVV1SnDrmGmcU9hhkpyUJLvJ7k4ydlJdk2yW5IL2/4/bD/V7tXO/3uShw636lnjn4GNdzA5EjhjY0eShyU5NckPkvwoyaK2/SFJzmxfr7OAh/Sssy7J3CTzk/ykp/34JH/TTp+b5INJVie5PMlTknwxyZVJ3jWA33kkJfmbJMe3009pX7/vJXlf72sFPDbJV9vX471DKncgDIWZ61PA26rqQOAS4MSq2gDsmGQn4BBgDXBIkr2BDVX1m+GVO6ucCSxOsiNwIHB+T9/bgW9W1VOAPwbel+RhwOuB37Sv10nAk6ex3Tur6pnAKcA5wLHAAcCrkjxq2r+NHpLkoo0/wDs3sdwngWOq6unAPRP6DgJeDvwX4OVJ9mSW8vDRDJRkZ2CXqjqvbTod+Fw7/V3gGcAzgXcDh9MMD/KtQdc5W1XVxUnm0+wlfGVC93OBF238dAnsCOxF83p8uGf9i6ex6Y0Xal4CXFpVNwAk+RnNlf43T+M5Bb+tqoM2ziR5FXCf4Szacz6PqKrvtk2fBV7Ys8iqqvpVu+xlNGMH9Y7TNmsYCtueb9HsJexN82nybTTjQ92vE6HaopXA+4FDgd5P6QFeUlX3GYgxCUwYp2sSd3PfvfMdJ/Tf0T7e2zO9cd7/q/012bhrvXpfj3uYxa+Hh49moPYTyS1JDmmbXgls3GtYDRwFXFlV9wK/AJ4PfGfghc5upwLvrKpLJrR/DTgubQokeWLbvhp4Rdt2AM1hp4luBHZL8qgkO3DfT6Iaoqq6BbgtydPapsXDrGeYZm3abWMemmR9z/wHgCXAKe3J458BrwaoqnXt+9HqdtlvA/Paf9TaSqpqPfChSbr+Fvg74OI2GNbRvLn/A/DJ9rDRRcAFkzznXUneSXOO4irgp/2oXdN2NPDxJLcD5wK/Gm45w+EwF5IEJHl4Vf26nV4G7F5VbxxyWQPnnoIkNV6Q5ASa98WrgVcNt5zhcE9BktTxRLMkqWMoSJI6hoIkqWMoaCRMHHdI0uQMBWkLkmwT39LbVurUzGYoaJTMSfLxJJcm+Xo7sunvjUYL3ail705yHvDGJC9N8pMkP06yul1mTjua5g/a9V/Xth/ajnZ6dpLLkpySZLu278gkl7TP9Z627WVJPtBOv7Ed64gk+yb5djv95CTnJbkwydeS7D5ZnYP9c2o28pOFRsl+wJFV9RdJVgAvAd4KHFdV57VXG58IvKldfpeqehZAkkuA51XVdfndDXOOBn5VVU9ph634TpKvt30HAwtovu/+VeDFSb4LvIdmBNVbgK8nOYLm6vS3tOsdAtycZA/gvwLfSvIg4CPAoqoaT/JympFYXzOxTumBMhQ0Sq6qqova6QuBfdn0aLQAZ/VMfwc4rQ2TL7ZtzwUOTPJn7fzONMFzJ3BBVW38xH8GzRv8XcC5VTXetn8GeGZVfSnJw5M8gmY01M/SjLp6SLutJ9AMof2NdoiTOcANm6hTekAMBY2SiSNd7rKF5W/fOFFVxyR5Ks3Ndy5KchDNyJrHVdXXeldKcii/P2JqsfmROL9HM77VFTQj4b4GeDrwZpqhuS9tx/nfbJ3SA+U5BY2yzY1Gex9J9q2q86vqHcBNNJ/ovwa8vj28Q5LHtzfcATg4yT7tuYSX0wxceD7wrDR3YZtDc7+G3tFvj28ff0RzA5872hFzrwDGkjy93c6Dkuy/9f4M0u+4p6BRN+lotJN4X5L9aD7trwJ+DFwMzAd+2I6YOg4c0S7/PeBkmjt1rQbOrqp727F1/rV9nq9U1Tnt8t+iCZrVVXVPkmtpR1GtqjvbQ1Qfbm/AtD3NSK2XbpW/gNTDsY+kraw9fHR8VXm/BG1zPHwkSeq4pyBJ6rinIEnqGAqSpI6hIEnqGAqSpI6hIEnq/H92ln9qkR9/MQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib as plt\n",
    "from matplotlib import pyplot\n",
    "pyplot.bar(group_names, df[\"horsepower-binned\"].value_counts())\n",
    "\n",
    "# set x/y labels and plot title\n",
    "plt.pyplot.xlabel(\"horsepower\")\n",
    "plt.pyplot.ylabel(\"count\")\n",
    "plt.pyplot.title(\"horsepower bins\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p>\n",
    "    Look at the dataframe above carefully. You will find that the last column provides the bins for \"horsepower\" based on 3 categories (\"Low\", \"Medium\" and \"High\"). \n",
    "</p>\n",
    "<p>\n",
    "    We successfully narrowed down the intervals from 57 to 3!\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Bins Visualization</h3>\n",
    "Normally, a histogram is used to visualize the distribution of bins we created above. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'horsepower bins')"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAYdElEQVR4nO3dfbRddX3n8feHoPgA8tBcEAkYZKLT4DiIKUodlUoVfBjD2KphFSeD2GgXos6IGupU7Ezp4EO11tG6YkFipWB8QDLLGYVGSbQVMCjyKJJCgEAk4UGh1kbA7/yxdzaH673JJeScc5Pzfq111tn7t/c++3t/OTmfs/c5+3dSVUiSBLDLsAuQJE0fhoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoaFpKsjbJ7w67jukuyVFJ1m1h+aeT/Mkga9KObddhFyCpf6rqrcOuQTsWjxS0U0uy07zx2Zn+Fk1fhoKms8OSXJXkZ0m+kOQJmxck+cMka5Lck2R5kqf1LKskJye5EbgxjY8l2dA+1lVJnt2uu1uSjyS5Ncmd7emWJ7bLjkqyLskfJ7mrPaX1Bz372TPJ55JsTHJLkv+eZJd22S1JntdOn9DWNLedf3OSr7bTuyRZnOSfktydZFmSfdpls9vtTkpyK/DNyTpqCzWek+TPxv0972r7Yn2SE3vWfWWS65Lcn+T2JKc+hn877aAMBU1nrweOBQ4GngP8F4AkLwX+V7t8f+AW4Pxx2x4HPB+YC7wceDHwTGAv4A3A3e16H2zbDwP+DXAA8P6ex3kqMLNtXwgsSfKsdtkngD2BZwAvAf4zsPlFdiVwVDv9YuCmdp3N8yvb6be3tb4EeBpwL/DJcX/LS4DfBI5hYluqcaJ192zXPQn4ZJK922VnAW+pqj2AZ7OFENJOrKq8eZt2N2AtcELP/IeAT7fTZwEf6lm2O/AAMLudL+ClPctfCvwYeAGwS097gJ8Dh/S0HQnc3E4fBTwIPLln+TLgT4AZwCZgbs+ytwCXtNMnAcvb6euBNwPnt/O3AIf3LDu65zH2b/+WXYHZ7d/yjC3006Q1ttPnAH/Ws+4vgF171t0AvKCdvrX9G54y7H9/b8O7eaSg6ewnPdP/QvPiD8076ls2L6iqf6Z5539Az/q39Sz/JvC/ad6B35lkSZKnAGPAk4Arkvw0yU+Br7ftm91bVT/vmb+l3f9M4PG9dbTTm2tYCbwoyVNpAuQLwAuTzKZ5p35lu97TgQt69n898BCw30R/yyQmq3Eid1fVgz3zvf36e8ArgVuSrExy5Fb2q52QoaAd0R00L6YAJHky8BvA7T3rPGL436r6q6p6HnAozemidwN30bxzPrSq9mpve1bV7j2b7t0+/mYHtfu/i+Yd/dPHLbu93d8amhfctwOrqup+mpBbBHynqn7VbnMb8Iqe/e9VVU+oqkn/lglMVuOjUlXfq6r5wL7AV2mOODRiDAXtiP4OODHJYUl2A/4cuKyq1k60cpLfSvL8JI+jOV30r8BD7QvzZ4CPJdm3XfeAJOPP3f9pkscneRHwauCLVfUQzYvmGUn2SPJ04L8Bn+/ZbiXwNh7+/OCScfMAn24f4+nt/seSzN+GPvm1Gh/Nxu22f5Bkz6p6ALiP5ohFI8ZQ0A6nqlbQnNf/MrAeOARYsIVNnkLz4n8vzamVu4GPtMveC6wBLk1yH/D3QO+HtD9pt7sDOBd4a1X9qF12Ck3I3AR8hyaszu7ZdiWwB7BqknmAjwPLgYuS3A9cSvMB+aOxpRofjTcCa9t+eCtwwjY8hnZwqfJHdqSJJDkK+HxVzRpyKdLAeKQgSeoYCpKkTt9CIcnZ7VWT14xrPyXJDUmuTfKhnvbT2itUb5jggz5p4KrqEk8dadT0cyyVc2i+G/65zQ1JfgeYDzynqjb1fONjLs0HhYfSfL/675M8s/2GhyRpQPoWClW1qr1Qp9cfAWdW1aZ2nQ1t+3yaqz03ATcnWQMcAXx3S/uYOXNmzZ49fheSpC254oor7qqqsYmWDXrUxWfSXOV5Bs13xU+tqu/RXAV6ac9663jk1amdJItoLgDioIMOYvXq1f2tWJJ2MklumWzZoD9o3hXYm2YMmncDy5KEZgya8Sb8rmxVLamqeVU1b2xswqCTJG2jQYfCOuAr1bgc+BXNGDLrgAN71pvFNlymL0l6bAYdCl+lGbGSJM+kGVDsLporOhe0Y9sfDMwBLh9wbZI08vr2mUKS82iG6p2Z5jdkT6cZAuDs9muqvwQWVnNJ9bVJlgHX0QwDfLLfPJKkwduhh7mYN29e+UGzJD06Sa6oqnkTLfOKZklSx1CQJHUMBUlSx1CQJHUGfUXztDJ78deGXYLGWXvmq4ZdgjTSPFKQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHX6FgpJzk6yof095vHLTk1SSWb2tJ2WZE2SG5Ic06+6JEmT6+eRwjnAseMbkxwIvAy4tadtLrAAOLTd5lNJZvSxNknSBPoWClW1CrhngkUfA94DVE/bfOD8qtpUVTcDa4Aj+lWbJGliA/1MIclrgNur6ofjFh0A3NYzv65tm+gxFiVZnWT1xo0b+1SpJI2mgYVCkicB7wPeP9HiCdpqgjaqaklVzauqeWNjY9uzREkaeYP8Oc5DgIOBHyYBmAV8P8kRNEcGB/asOwu4Y4C1SZIY4JFCVV1dVftW1eyqmk0TBIdX1U+A5cCCJLslORiYA1w+qNokSY1+fiX1POC7wLOSrEty0mTrVtW1wDLgOuDrwMlV9VC/apMkTaxvp4+q6vitLJ89bv4M4Ix+1SNJ2jqvaJYkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVKnn7/RfHaSDUmu6Wn7cJIfJbkqyQVJ9upZdlqSNUluSHJMv+qSJE2un0cK5wDHjmu7GHh2VT0H+DFwGkCSucAC4NB2m08lmdHH2iRJE+hbKFTVKuCecW0XVdWD7eylwKx2ej5wflVtqqqbgTXAEf2qTZI0sWF+pvAm4P+10wcAt/UsW9e2/Zoki5KsTrJ648aNfS5RkkbLUEIhyfuAB4FzNzdNsFpNtG1VLamqeVU1b2xsrF8lStJI2nXQO0yyEHg1cHRVbX7hXwcc2LPaLOCOQdcmSaNuoEcKSY4F3gu8pqr+pWfRcmBBkt2SHAzMAS4fZG2SpD4eKSQ5DzgKmJlkHXA6zbeNdgMuTgJwaVW9taquTbIMuI7mtNLJVfVQv2qTJE2sb6FQVcdP0HzWFtY/AzijX/VIkrbOK5olSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLU6VsoJDk7yYYk1/S07ZPk4iQ3tvd79yw7LcmaJDckOaZfdUmSJtfPI4VzgGPHtS0GVlTVHGBFO0+SucAC4NB2m08lmdHH2iRJE+hbKFTVKuCecc3zgaXt9FLguJ7286tqU1XdDKwBjuhXbZKkiQ36M4X9qmo9QHu/b9t+AHBbz3rr2rZfk2RRktVJVm/cuLGvxUrSqJkuHzRngraaaMWqWlJV86pq3tjYWJ/LkqTRMuhQuDPJ/gDt/Ya2fR1wYM96s4A7BlybJI28QYfCcmBhO70QuLCnfUGS3ZIcDMwBLh9wbZI08nbt1wMnOQ84CpiZZB1wOnAmsCzJScCtwOsAquraJMuA64AHgZOr6qF+1SZJmljfQqGqjp9k0dGTrH8GcEa/6pEkbd10+aBZkjQNGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpM6UQiHJiqm0SZJ2bFsc+yjJE4An0QxqtzcP/+7BU4Cn9bk2SdKAbW1AvLcA76QJgCt4OBTuAz7Zv7IkScOwxVCoqo8DH09ySlV9YkA1SZKGZEpDZ1fVJ5L8NjC7d5uq+lyf6pIkDcGUQiHJ3wKHAFcCm3/8pgBDQZJ2IlP9kZ15wNyqqn4WI0karqlep3AN8NR+FiJJGr6pHinMBK5LcjmwaXNjVb1mW3aa5L8Cb6Y5BXU1cCLNV1+/QPO5xVrg9VV177Y8viRp20w1FD6wvXaY5ADg7TSno36RZBmwAJgLrKiqM5MsBhYD791e+5Ukbd1Uv320sg/7fWKSB2iOEO4ATgOOapcvBS7BUJCkgZrqMBf3J7mvvf1rkoeS3LctO6yq24GPALcC64GfVdVFwH5Vtb5dZz2w77Y8viRp2031SGGP3vkkxwFHbMsO2+Ey5gMHAz8FvpjkhEex/SJgEcBBBx20LSVIkiaxTaOkVtVXgZdu4z5/F7i5qjZW1QPAV4DfBu5Msj9Ae79hkn0vqap5VTVvbGxsG0uQJE1kqhevvbZndhea6xa29ZqFW4EXJHkS8AvgaGA18HNgIXBme3/hNj6+JGkbTfXbR/+xZ/pBmq+Mzt+WHVbVZUm+BHy/fawfAEuA3YFlSU6iCY7XbcvjS5K23VQ/Uzhxe+60qk4HTh/XvInmqEGSNCRT/fbRrCQXJNmQ5M4kX04yq9/FSZIGa6ofNH8WWE7zuwoHAP+nbZMk7USmGgpjVfXZqnqwvZ0D+NUfSdrJTDUU7kpyQpIZ7e0E4O5+FiZJGryphsKbgNcDP6G5Cvn3aQaxkyTtRKb6ldT/CSzcPGppkn1ohqp4U78KkyQN3lSPFJ7TO4x1Vd0DPLc/JUmShmWqobBLO2YR0B0pTPUoQ5K0g5jqC/tfAP/YXolcNJ8vnNG3qiRJQzHVK5o/l2Q1zSB4AV5bVdf1tTJJ0sBN+RRQGwIGgSTtxLZp6GxJ0s7JUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVJnKKGQZK8kX0ryoyTXJzkyyT5JLk5yY3u/99YfSZK0PQ3rSOHjwNer6t8C/x64HlgMrKiqOcCKdl6SNEADD4UkTwFeDJwFUFW/rKqfAvOBpe1qS4HjBl2bJI26YRwpPAPYCHw2yQ+S/E2SJwP7VdV6gPZ+3yHUJkkjbRihsCtwOPDXVfVc4Oc8ilNFSRYlWZ1k9caNG/tVoySNpGGEwjpgXVVd1s5/iSYk7kyyP0B7v2GijatqSVXNq6p5Y2NjAylYkkbFwEOhqn4C3JbkWW3T0TSjry4HFrZtC4ELB12bJI26Yf162inAuUkeD9wEnEgTUMuSnATcCrxuSLVJ0sgaSihU1ZXAvAkWHT3gUiRJPbyiWZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSZ1h/UYzSWYAq4Hbq+rVSfYBvgDMBtYCr6+qe4dVn4Zj9uKvDbsEjbP2zFcNuwQN0DCPFN4BXN8zvxhYUVVzgBXtvCRpgIYSCklmAa8C/qaneT6wtJ1eChw34LIkaeQN60jhL4H3AL/qaduvqtYDtPf7DqEuSRppAw+FJK8GNlTVFdu4/aIkq5Os3rhx43auTpJG2zCOFF4IvCbJWuB84KVJPg/cmWR/gPZ+w0QbV9WSqppXVfPGxsYGVbMkjYSBh0JVnVZVs6pqNrAA+GZVnQAsBxa2qy0ELhx0bZI06qbTdQpnAi9LciPwsnZekjRAQ7tOAaCqLgEuaafvBo4eZj2SNOqm05GCJGnIDAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1Bh4KSQ5M8q0k1ye5Nsk72vZ9klyc5Mb2fu9B1yZJo24YRwoPAu+qqt8EXgCcnGQusBhYUVVzgBXtvCRpgAYeClW1vqq+307fD1wPHADMB5a2qy0Fjht0bZI06ob6mUKS2cBzgcuA/apqPTTBAew7yTaLkqxOsnrjxo0Dq1WSRsHQQiHJ7sCXgXdW1X1T3a6qllTVvKqaNzY21r8CJWkEDSUUkjyOJhDOraqvtM13Jtm/Xb4/sGEYtUnSKBvGt48CnAVcX1Uf7Vm0HFjYTi8ELhx0bZI06nYdwj5fCLwRuDrJlW3bHwNnAsuSnATcCrxuCLVJ0kgbeChU1XeATLL46EHWIkl6JK9oliR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUmcYv6cgaQcye/HXhl2CJrD2zFf15XE9UpAkdQwFSVLHUJAkdaZdKCQ5NskNSdYkWTzseiRplEyrUEgyA/gk8ApgLnB8krnDrUqSRse0CgXgCGBNVd1UVb8EzgfmD7kmSRoZ0+0rqQcAt/XMrwOe37tCkkXAonb2n5PcsB32OxO4azs8zs7K/pmcfTM5+2Zyj7lv8sHHtP+nT7ZguoVCJmirR8xULQGWbNedJqurat72fMydif0zOftmcvbN5KZz30y300frgAN75mcBdwypFkkaOdMtFL4HzElycJLHAwuA5UOuSZJGxrQ6fVRVDyZ5G/ANYAZwdlVdO4Bdb9fTUTsh+2dy9s3k7JvJTdu+SVVtfS1J0kiYbqePJElDZChIkjojGQpJ1ia5OsmVSVa3bfskuTjJje393sOucxCSnJ1kQ5Jretom7Yskp7VDkNyQ5JjhVD0Yk/TNB5Lc3j53rkzyyp5lo9Q3Byb5VpLrk1yb5B1t+8g/d7bQNzvGc6eqRu4GrAVmjmv7ELC4nV4MfHDYdQ6oL14MHA5cs7W+oBl65IfAbsDBwD8BM4b9Nwy4bz4AnDrBuqPWN/sDh7fTewA/bvtg5J87W+ibHeK5M5JHCpOYDyxtp5cCxw2vlMGpqlXAPeOaJ+uL+cD5VbWpqm4G1tAMTbJTmqRvJjNqfbO+qr7fTt8PXE8zIsHIP3e20DeTmVZ9M6qhUMBFSa5oh80A2K+q1kPzjwrsO7Tqhm+yvphoGJItPdl3Vm9LclV7emnz6ZGR7Zsks4HnApfhc+cRxvUN7ADPnVENhRdW1eE0o7GenOTFwy5oB7HVYUhGwF8DhwCHAeuBv2jbR7JvkuwOfBl4Z1Xdt6VVJ2jbqftngr7ZIZ47IxkKVXVHe78BuIDmUO3OJPsDtPcbhlfh0E3WFyM/DElV3VlVD1XVr4DP8PBh/sj1TZLH0bzonVtVX2mbfe4wcd/sKM+dkQuFJE9OssfmaeDlwDU0w2ksbFdbCFw4nAqnhcn6YjmwIMluSQ4G5gCXD6G+odn8gtf6TzTPHRixvkkS4Czg+qr6aM+ikX/uTNY3O8xzZ9if1A/6BjyD5pP+HwLXAu9r238DWAHc2N7vM+xaB9Qf59Ecyj5A847lpC31BfA+mm9H3AC8Ytj1D6Fv/ha4GriK5j/z/iPaN/+B5hTHVcCV7e2VPne22Dc7xHPHYS4kSZ2RO30kSZqcoSBJ6hgKkqSOoSBJ6hgKkqSOoaCRkGR272inkiZmKEhbkWRa/WztZHaUOjW9GQoaJTOSfKYd4/6iJE9McliSS9tByi7YPEhZkkuS/HmSlcA7krwuyTVJfphkVbvOjCQfTvK9dvu3tO1HJVnVPt51ST6dZJd22fFpfsvjmiQfbNten+Sj7fQ7ktzUTh+S5Dvt9POSrGwHcfxGz1ASj6hzsN2pnZHvLDRK5gDHV9UfJlkG/B7wHuCUqlqZ5H8ApwPvbNffq6peApDkauCYqro9yV7t8pOAn1XVbyXZDfiHJBe1y46gGSf/FuDrwGuT/CPwQeB5wL00I/UeB6wC3t1u9yLg7iQH0FwZ++12HJ1PAPOramOSNwBnAG8aX6f0WBkKGiU3V9WV7fQVNCNW7lVVK9u2pcAXe9b/Qs/0PwDntGGyefC3lwPPSfL77fyeNMHzS+Dyqtr8jv88mhf4B4BLqmpj234u8OKq+mqS3dsxuQ4E/o7mB35e1O7rWcCzgYubYXWYQTP8xkR1So+JoaBRsqln+iFgr62s//PNE1X11iTPB14FXJnkMJohj0+pqm/0bpTkKH596ONi4iGSN/sucCLN2DffpjkKOBJ4F3AQcG1VHbm1OqXHys8UNMp+Btyb5EXt/BuBlROtmOSQqrqsqt4P3EXzjv4bwB+1p3dI8sx25F2AI5Ic3H6W8AbgOzQ/tPKSJDOTzACO79nfKuDU9v4HwO8Am6rqZzRBMZbkyHY/j0ty6PbrBulhHilo1C0EPp3kScBNNO/WJ/LhJHNo3u2voBll9ypgNvD9drjkjTz885PfBc4E/h3NC/0FVfWrJKcB32of5/9W1eahpb9NEzSrquqhJLcBPwKoql+2p6j+KsmeNP9v/5JmlF9pu3KUVGk7a08fnVpVrx5yKdKj5ukjSVLHIwVJUscjBUlSx1CQJHUMBUlSx1CQJHUMBUlS5/8D72GK8enPX1MAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib as plt\n",
    "from matplotlib import pyplot\n",
    "\n",
    "\n",
    "# draw historgram of attribute \"horsepower\" with bins = 3\n",
    "plt.pyplot.hist(df[\"horsepower\"], bins = 3)\n",
    "\n",
    "# set x/y labels and plot title\n",
    "plt.pyplot.xlabel(\"horsepower\")\n",
    "plt.pyplot.ylabel(\"count\")\n",
    "plt.pyplot.title(\"horsepower bins\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot above shows the binning result for the attribute \"horsepower\".\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 id=\"indicator\">Indicator Variable (or Dummy Variable)</h2>\n",
    "<b>What is an indicator variable?</b>\n",
    "<p>\n",
    "    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. \n",
    "</p>\n",
    "\n",
    "<b>Why we use indicator variables?</b>\n",
    "\n",
    "<p>\n",
    "    We use indicator variables so we can use categorical variables for regression analysis in the later modules.\n",
    "</p>\n",
    "<b>Example</b>\n",
    "<p>\n",
    "    We see the column \"fuel-type\" has two unique values: \"gas\" or \"diesel\". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert \"fuel-type\" to indicator variables.\n",
    "</p>\n",
    "\n",
    "<p>\n",
    "    We will use pandas' method 'get_dummies' to assign numerical values to different categories of fuel type. \n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',\n",
       "       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',\n",
       "       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',\n",
       "       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',\n",
       "       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',\n",
       "       'highway-mpg', 'price', 'city-L/100km', 'horsepower-binned'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the indicator variables and assign it to data frame \"dummy_variable\\_1\":\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>diesel</th>\n",
       "      <th>gas</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   diesel  gas\n",
       "0       0    1\n",
       "1       0    1\n",
       "2       0    1\n",
       "3       0    1\n",
       "4       0    1"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy_variable_1 = pd.get_dummies(df[\"fuel-type\"])\n",
    "dummy_variable_1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the column names for clarity:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fuel-type-diesel</th>\n",
       "      <th>fuel-type-gas</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   fuel-type-diesel  fuel-type-gas\n",
       "0                 0              1\n",
       "1                 0              1\n",
       "2                 0              1\n",
       "3                 0              1\n",
       "4                 0              1"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)\n",
    "dummy_variable_1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the dataframe, column 'fuel-type' has values for 'gas' and 'diesel' as 0s and 1s now.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "# merge data frame \"df\" and \"dummy_variable_1\" \n",
    "df = pd.concat([df, dummy_variable_1], axis=1)\n",
    "\n",
    "# drop original column \"fuel-type\" from \"df\"\n",
    "df.drop(\"fuel-type\", axis = 1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>length</th>\n",
       "      <th>...</th>\n",
       "      <th>compression-ratio</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "      <th>city-L/100km</th>\n",
       "      <th>horsepower-binned</th>\n",
       "      <th>fuel-type-diesel</th>\n",
       "      <th>fuel-type-gas</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>0.811148</td>\n",
       "      <td>...</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>8.703704</td>\n",
       "      <td>13495.0</td>\n",
       "      <td>11.190476</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>0.811148</td>\n",
       "      <td>...</td>\n",
       "      <td>9.0</td>\n",
       "      <td>111</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>8.703704</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>11.190476</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>0.822681</td>\n",
       "      <td>...</td>\n",
       "      <td>9.0</td>\n",
       "      <td>154</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>19</td>\n",
       "      <td>9.038462</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>12.368421</td>\n",
       "      <td>Medium</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>0.848630</td>\n",
       "      <td>...</td>\n",
       "      <td>10.0</td>\n",
       "      <td>102</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>24</td>\n",
       "      <td>7.833333</td>\n",
       "      <td>13950.0</td>\n",
       "      <td>9.791667</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>0.848630</td>\n",
       "      <td>...</td>\n",
       "      <td>8.0</td>\n",
       "      <td>115</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>18</td>\n",
       "      <td>10.681818</td>\n",
       "      <td>17450.0</td>\n",
       "      <td>13.055556</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized-losses         make aspiration num-of-doors  \\\n",
       "0          3                122  alfa-romero        std          two   \n",
       "1          3                122  alfa-romero        std          two   \n",
       "2          1                122  alfa-romero        std          two   \n",
       "3          2                164         audi        std         four   \n",
       "4          2                164         audi        std         four   \n",
       "\n",
       "    body-style drive-wheels engine-location  wheel-base    length  ...  \\\n",
       "0  convertible          rwd           front        88.6  0.811148  ...   \n",
       "1  convertible          rwd           front        88.6  0.811148  ...   \n",
       "2    hatchback          rwd           front        94.5  0.822681  ...   \n",
       "3        sedan          fwd           front        99.8  0.848630  ...   \n",
       "4        sedan          4wd           front        99.4  0.848630  ...   \n",
       "\n",
       "   compression-ratio  horsepower  peak-rpm city-mpg highway-mpg    price  \\\n",
       "0                9.0         111    5000.0       21    8.703704  13495.0   \n",
       "1                9.0         111    5000.0       21    8.703704  16500.0   \n",
       "2                9.0         154    5000.0       19    9.038462  16500.0   \n",
       "3               10.0         102    5500.0       24    7.833333  13950.0   \n",
       "4                8.0         115    5500.0       18   10.681818  17450.0   \n",
       "\n",
       "  city-L/100km  horsepower-binned  fuel-type-diesel  fuel-type-gas  \n",
       "0    11.190476                Low                 0              1  \n",
       "1    11.190476                Low                 0              1  \n",
       "2    12.368421             Medium                 0              1  \n",
       "3     9.791667                Low                 0              1  \n",
       "4    13.055556                Low                 0              1  \n",
       "\n",
       "[5 rows x 29 columns]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last two columns are now the indicator variable representation of the fuel-type variable. They're all 0s and 1s now.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
    "<h1> Question  #4: </h1>\n",
    "\n",
    "<b>Similar to before, create an indicator variable for the column \"aspiration\"</b>\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>std</th>\n",
       "      <th>turbo</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   std  turbo\n",
       "0    1      0\n",
       "1    1      0\n",
       "2    1      0\n",
       "3    1      0\n",
       "4    1      0"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Write your code below and press Shift+Enter to execute \n",
    "#df['aspiration'].describe()\n",
    "dummy_variable_2 = pd.get_dummies(df[\"aspiration\"])\n",
    "dummy_variable_2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>std-aspiration</th>\n",
       "      <th>turbo-aspiration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   std-aspiration  turbo-aspiration\n",
       "0               1                 0\n",
       "1               1                 0\n",
       "2               1                 0\n",
       "3               1                 0\n",
       "4               1                 0"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummy_variable_2.rename(columns={'std':'std-aspiration', 'turbo':'turbo-aspiration'}, inplace=True)\n",
    "dummy_variable_2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>Click here for the solution</summary>\n",
    "\n",
    "```python\n",
    "# get indicator variables of aspiration and assign it to data frame \"dummy_variable_2\"\n",
    "dummy_variable_2 = pd.get_dummies(df['aspiration'])\n",
    "\n",
    "# change column names for clarity\n",
    "dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)\n",
    "\n",
    "# show first 5 instances of data frame \"dummy_variable_1\"\n",
    "dummy_variable_2.head()\n",
    "\n",
    "\n",
    "```\n",
    "\n",
    "</details>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " <div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
    "<h1> Question  #5: </h1>\n",
    "\n",
    "<b>Merge the new dataframe to the original dataframe, then drop the column 'aspiration'.</b>\n",
    "\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code below and press Shift+Enter to execute \n",
    "df = pd.concat([df, dummy_variable_2], axis=1)\n",
    "\n",
    "# drop original column \"fuel-type\" from \"df\"\n",
    "df.drop(\"aspiration\", axis = 1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized-losses</th>\n",
       "      <th>make</th>\n",
       "      <th>num-of-doors</th>\n",
       "      <th>body-style</th>\n",
       "      <th>drive-wheels</th>\n",
       "      <th>engine-location</th>\n",
       "      <th>wheel-base</th>\n",
       "      <th>length</th>\n",
       "      <th>width</th>\n",
       "      <th>...</th>\n",
       "      <th>peak-rpm</th>\n",
       "      <th>city-mpg</th>\n",
       "      <th>highway-mpg</th>\n",
       "      <th>price</th>\n",
       "      <th>city-L/100km</th>\n",
       "      <th>horsepower-binned</th>\n",
       "      <th>fuel-type-diesel</th>\n",
       "      <th>fuel-type-gas</th>\n",
       "      <th>std-aspiration</th>\n",
       "      <th>turbo-aspiration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>0.811148</td>\n",
       "      <td>0.890278</td>\n",
       "      <td>...</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>8.703704</td>\n",
       "      <td>13495.0</td>\n",
       "      <td>11.190476</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "      <td>0.811148</td>\n",
       "      <td>0.890278</td>\n",
       "      <td>...</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>21</td>\n",
       "      <td>8.703704</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>11.190476</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>122</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "      <td>0.822681</td>\n",
       "      <td>0.909722</td>\n",
       "      <td>...</td>\n",
       "      <td>5000.0</td>\n",
       "      <td>19</td>\n",
       "      <td>9.038462</td>\n",
       "      <td>16500.0</td>\n",
       "      <td>12.368421</td>\n",
       "      <td>Medium</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "      <td>0.848630</td>\n",
       "      <td>0.919444</td>\n",
       "      <td>...</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>24</td>\n",
       "      <td>7.833333</td>\n",
       "      <td>13950.0</td>\n",
       "      <td>9.791667</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164</td>\n",
       "      <td>audi</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "      <td>0.848630</td>\n",
       "      <td>0.922222</td>\n",
       "      <td>...</td>\n",
       "      <td>5500.0</td>\n",
       "      <td>18</td>\n",
       "      <td>10.681818</td>\n",
       "      <td>17450.0</td>\n",
       "      <td>13.055556</td>\n",
       "      <td>Low</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 30 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized-losses         make num-of-doors   body-style  \\\n",
       "0          3                122  alfa-romero          two  convertible   \n",
       "1          3                122  alfa-romero          two  convertible   \n",
       "2          1                122  alfa-romero          two    hatchback   \n",
       "3          2                164         audi         four        sedan   \n",
       "4          2                164         audi         four        sedan   \n",
       "\n",
       "  drive-wheels engine-location  wheel-base    length     width  ...  peak-rpm  \\\n",
       "0          rwd           front        88.6  0.811148  0.890278  ...    5000.0   \n",
       "1          rwd           front        88.6  0.811148  0.890278  ...    5000.0   \n",
       "2          rwd           front        94.5  0.822681  0.909722  ...    5000.0   \n",
       "3          fwd           front        99.8  0.848630  0.919444  ...    5500.0   \n",
       "4          4wd           front        99.4  0.848630  0.922222  ...    5500.0   \n",
       "\n",
       "   city-mpg highway-mpg    price  city-L/100km horsepower-binned  \\\n",
       "0        21    8.703704  13495.0     11.190476               Low   \n",
       "1        21    8.703704  16500.0     11.190476               Low   \n",
       "2        19    9.038462  16500.0     12.368421            Medium   \n",
       "3        24    7.833333  13950.0      9.791667               Low   \n",
       "4        18   10.681818  17450.0     13.055556               Low   \n",
       "\n",
       "   fuel-type-diesel  fuel-type-gas  std-aspiration  turbo-aspiration  \n",
       "0                 0              1               1                 0  \n",
       "1                 0              1               1                 0  \n",
       "2                 0              1               1                 0  \n",
       "3                 0              1               1                 0  \n",
       "4                 0              1               1                 0  \n",
       "\n",
       "[5 rows x 30 columns]"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>Click here for the solution</summary>\n",
    "\n",
    "```python\n",
    "# merge the new dataframe to the original datafram\n",
    "df = pd.concat([df, dummy_variable_2], axis=1)\n",
    "\n",
    "# drop original column \"aspiration\" from \"df\"\n",
    "df.drop('aspiration', axis = 1, inplace=True)\n",
    "\n",
    "\n",
    "```\n",
    "\n",
    "</details>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the new csv:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv('clean_df.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Thank you for completing this lab!\n",
    "\n",
    "## Author\n",
    "\n",
    "<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01\" target=\"_blank\">Joseph Santarcangelo</a>\n",
    "\n",
    "### Other Contributors\n",
    "\n",
    "<a href=\"https://www.linkedin.com/in/mahdi-noorian-58219234/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01\" target=\"_blank\">Mahdi Noorian PhD</a>\n",
    "\n",
    "Bahare Talayian\n",
    "\n",
    "Eric Xiao\n",
    "\n",
    "Steven Dong\n",
    "\n",
    "Parizad\n",
    "\n",
    "Hima Vasudevan\n",
    "\n",
    "<a href=\"https://www.linkedin.com/in/fiorellawever/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01\" target=\"_blank\">Fiorella Wenver</a>\n",
    "\n",
    "<a href=\"https://www.linkedin.com/in/yi-leng-yao-84451275/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01\" target=\"_blank\" >Yi Yao</a>.\n",
    "\n",
    "## Change Log\n",
    "\n",
    "|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |\n",
    "|---|---|---|---|\n",
    "| 2020-10-30  | 2.2  | Lakshmi  |  Changed URL of csv  |\n",
    "| 2020-09-09  | 2.1  | Lakshmi  |  Updated Indicator Variables section  |\n",
    "| 2020-08-27  | 2.0  | Lavanya  |  Moved lab to course repo in GitLab  |\n",
    "\n",
    "<hr>\n",
    "\n",
    "## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python",
   "language": "python",
   "name": "conda-env-python-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}