Skip to content

Instantly share code, notes, and snippets.

@573
Forked from Alexon-Abreu/data-wrangling.ipynb
Created March 30, 2021 15:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save 573/a91d48c8de716a51eb6ceda5d8a0ce84 to your computer and use it in GitHub Desktop.
Save 573/a91d48c8de716a51eb6ceda5d8a0ce84 to your computer and use it in GitHub Desktop.
Created on Skills Network Labs
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/Logos/organization_logo/organization_logo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n",
"\n",
"# Data Wrangling\n",
"\n",
"Estimated time needed: **30** minutes\n",
"\n",
"## Objectives\n",
"\n",
"After completing this lab you will be able to:\n",
"\n",
"- Handle missing values\n",
"- Correct data format\n",
"- Standardize and Normalize Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Table of content</h2>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"<ul>\n",
" <li><a href=\"#identify_handle_missing_values\">Identify and handle missing values</a>\n",
" <ul>\n",
" <li><a href=\"#identify_missing_values\">Identify missing values</a></li>\n",
" <li><a href=\"#deal_missing_values\">Deal with missing values</a></li>\n",
" <li><a href=\"#correct_data_format\">Correct data format</a></li>\n",
" </ul>\n",
" </li>\n",
" <li><a href=\"#data_standardization\">Data standardization</a></li>\n",
" <li><a href=\"#data_normalization\">Data Normalization (centering/scaling)</a></li>\n",
" <li><a href=\"#binning\">Binning</a></li>\n",
" <li><a href=\"#indicator\">Indicator variable</a></li>\n",
"</ul>\n",
" \n",
"</div>\n",
" \n",
"<hr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>What is the purpose of Data Wrangling?</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>What is the fuel consumption (L/100k) rate for the diesel car?</h3>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Import data</h3>\n",
"<p>\n",
"You can find the \"Automobile Data Set\" from the following link: <a href=\"https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data\">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a>. \n",
"We will be using this data set throughout this course.\n",
"</p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Import pandas</h4> \n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pylab as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Reading the data set from the URL and adding the related headers.</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"URL of the dataset\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset was hosted on IBM Cloud object click <a href=\"https://cocl.us/corsera_da0101en_notebook_bottom\">HERE</a> for free storage \n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"filename = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Python list <b>headers</b> containing name of headers \n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"headers = [\"symboling\",\"normalized-losses\",\"make\",\"fuel-type\",\"aspiration\", \"num-of-doors\",\"body-style\",\n",
" \"drive-wheels\",\"engine-location\",\"wheel-base\", \"length\",\"width\",\"height\",\"curb-weight\",\"engine-type\",\n",
" \"num-of-cylinders\", \"engine-size\",\"fuel-system\",\"bore\",\"stroke\",\"compression-ratio\",\"horsepower\",\n",
" \"peak-rpm\",\"city-mpg\",\"highway-mpg\",\"price\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the Pandas method <b>read_csv()</b> to load the data from the web address. Set the parameter \"names\" equal to the Python list \"headers\".\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(filename, names = headers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Use the method <b>head()</b> to display the first five rows of the dataframe. \n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>engine-size</th>\n",
" <th>fuel-system</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>?</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>13495</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>?</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>16500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>?</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>152</td>\n",
" <td>mpfi</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000</td>\n",
" <td>19</td>\n",
" <td>26</td>\n",
" <td>16500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>109</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500</td>\n",
" <td>24</td>\n",
" <td>30</td>\n",
" <td>13950</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>136</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500</td>\n",
" <td>18</td>\n",
" <td>22</td>\n",
" <td>17450</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration num-of-doors \\\n",
"0 3 ? alfa-romero gas std two \n",
"1 3 ? alfa-romero gas std two \n",
"2 1 ? alfa-romero gas std two \n",
"3 2 164 audi gas std four \n",
"4 2 164 audi gas std four \n",
"\n",
" body-style drive-wheels engine-location wheel-base ... engine-size \\\n",
"0 convertible rwd front 88.6 ... 130 \n",
"1 convertible rwd front 88.6 ... 130 \n",
"2 hatchback rwd front 94.5 ... 152 \n",
"3 sedan fwd front 99.8 ... 109 \n",
"4 sedan 4wd front 99.4 ... 136 \n",
"\n",
" fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg \\\n",
"0 mpfi 3.47 2.68 9.0 111 5000 21 \n",
"1 mpfi 3.47 2.68 9.0 111 5000 21 \n",
"2 mpfi 2.68 3.47 9.0 154 5000 19 \n",
"3 mpfi 3.19 3.40 10.0 102 5500 24 \n",
"4 mpfi 3.19 3.40 8.0 115 5500 18 \n",
"\n",
" highway-mpg price \n",
"0 27 13495 \n",
"1 27 16500 \n",
"2 26 16500 \n",
"3 30 13950 \n",
"4 22 17450 \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# To see what the data set looks like, we'll use the head() method.\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis. \n",
"\n",
"<div>So, how do we identify all those missing values and deal with them?</div> \n",
"\n",
"<b>How to work with missing data?</b>\n",
"\n",
"Steps for working with missing data:\n",
"\n",
"<ol>\n",
" <li>dentify missing data</li>\n",
" <li>deal with missing data</li>\n",
" <li>correct data format</li>\n",
"</ol>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"identify_handle_missing_values\">Identify and handle missing values</h2>\n",
"\n",
"<h3 id=\"identify_missing_values\">Identify missing values</h3>\n",
"<h4>Convert \"?\" to NaN</h4>\n",
"In the car dataset, missing data comes with the question mark \"?\".\n",
"We replace \"?\" with NaN (Not a Number), which is Python's default missing value marker, for reasons of computational speed and convenience. Here we use the function: \n",
" <pre>.replace(A, B, inplace = True) </pre>\n",
"to replace A by B\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>engine-size</th>\n",
" <th>fuel-system</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>13495</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>16500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>152</td>\n",
" <td>mpfi</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000</td>\n",
" <td>19</td>\n",
" <td>26</td>\n",
" <td>16500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>109</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500</td>\n",
" <td>24</td>\n",
" <td>30</td>\n",
" <td>13950</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>136</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500</td>\n",
" <td>18</td>\n",
" <td>22</td>\n",
" <td>17450</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration num-of-doors \\\n",
"0 3 NaN alfa-romero gas std two \n",
"1 3 NaN alfa-romero gas std two \n",
"2 1 NaN alfa-romero gas std two \n",
"3 2 164 audi gas std four \n",
"4 2 164 audi gas std four \n",
"\n",
" body-style drive-wheels engine-location wheel-base ... engine-size \\\n",
"0 convertible rwd front 88.6 ... 130 \n",
"1 convertible rwd front 88.6 ... 130 \n",
"2 hatchback rwd front 94.5 ... 152 \n",
"3 sedan fwd front 99.8 ... 109 \n",
"4 sedan 4wd front 99.4 ... 136 \n",
"\n",
" fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg \\\n",
"0 mpfi 3.47 2.68 9.0 111 5000 21 \n",
"1 mpfi 3.47 2.68 9.0 111 5000 21 \n",
"2 mpfi 2.68 3.47 9.0 154 5000 19 \n",
"3 mpfi 3.19 3.40 10.0 102 5500 24 \n",
"4 mpfi 3.19 3.40 8.0 115 5500 18 \n",
"\n",
" highway-mpg price \n",
"0 27 13495 \n",
"1 27 16500 \n",
"2 26 16500 \n",
"3 30 13950 \n",
"4 22 17450 \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"# replace \"?\" to NaN\n",
"df.replace(\"?\", np.nan, inplace = True)\n",
"df.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Identify_missing_values\n",
"\n",
"<h4>Evaluating for Missing Data</h4>\n",
"\n",
"The missing values are converted to default. We use the following functions to identify these missing values. There are two methods to detect missing data:\n",
"\n",
"<ol>\n",
" <li><b>.isnull()</b></li>\n",
" <li><b>.notnull()</b></li>\n",
"</ol>\n",
"The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>engine-size</th>\n",
" <th>fuel-system</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>...</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration num-of-doors \\\n",
"0 False True False False False False \n",
"1 False True False False False False \n",
"2 False True False False False False \n",
"3 False False False False False False \n",
"4 False False False False False False \n",
"\n",
" body-style drive-wheels engine-location wheel-base ... engine-size \\\n",
"0 False False False False ... False \n",
"1 False False False False ... False \n",
"2 False False False False ... False \n",
"3 False False False False ... False \n",
"4 False False False False ... False \n",
"\n",
" fuel-system bore stroke compression-ratio horsepower peak-rpm \\\n",
"0 False False False False False False \n",
"1 False False False False False False \n",
"2 False False False False False False \n",
"3 False False False False False False \n",
"4 False False False False False False \n",
"\n",
" city-mpg highway-mpg price \n",
"0 False False False \n",
"1 False False False \n",
"2 False False False \n",
"3 False False False \n",
"4 False False False \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"missing_data = df.isnull()\n",
"missing_data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"True\" stands for missing value, while \"False\" stands for not missing value.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Count missing values in each column</h4>\n",
"<p>\n",
"Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, \"True\" represents a missing value, \"False\" means the value is present in the dataset. In the body of the for loop the method \".value_counts()\" counts the number of \"True\" values. \n",
"</p>\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"symboling\n",
"False 205\n",
"Name: symboling, dtype: int64\n",
"\n",
"normalized-losses\n",
"False 164\n",
"True 41\n",
"Name: normalized-losses, dtype: int64\n",
"\n",
"make\n",
"False 205\n",
"Name: make, dtype: int64\n",
"\n",
"fuel-type\n",
"False 205\n",
"Name: fuel-type, dtype: int64\n",
"\n",
"aspiration\n",
"False 205\n",
"Name: aspiration, dtype: int64\n",
"\n",
"num-of-doors\n",
"False 203\n",
"True 2\n",
"Name: num-of-doors, dtype: int64\n",
"\n",
"body-style\n",
"False 205\n",
"Name: body-style, dtype: int64\n",
"\n",
"drive-wheels\n",
"False 205\n",
"Name: drive-wheels, dtype: int64\n",
"\n",
"engine-location\n",
"False 205\n",
"Name: engine-location, dtype: int64\n",
"\n",
"wheel-base\n",
"False 205\n",
"Name: wheel-base, dtype: int64\n",
"\n",
"length\n",
"False 205\n",
"Name: length, dtype: int64\n",
"\n",
"width\n",
"False 205\n",
"Name: width, dtype: int64\n",
"\n",
"height\n",
"False 205\n",
"Name: height, dtype: int64\n",
"\n",
"curb-weight\n",
"False 205\n",
"Name: curb-weight, dtype: int64\n",
"\n",
"engine-type\n",
"False 205\n",
"Name: engine-type, dtype: int64\n",
"\n",
"num-of-cylinders\n",
"False 205\n",
"Name: num-of-cylinders, dtype: int64\n",
"\n",
"engine-size\n",
"False 205\n",
"Name: engine-size, dtype: int64\n",
"\n",
"fuel-system\n",
"False 205\n",
"Name: fuel-system, dtype: int64\n",
"\n",
"bore\n",
"False 201\n",
"True 4\n",
"Name: bore, dtype: int64\n",
"\n",
"stroke\n",
"False 201\n",
"True 4\n",
"Name: stroke, dtype: int64\n",
"\n",
"compression-ratio\n",
"False 205\n",
"Name: compression-ratio, dtype: int64\n",
"\n",
"horsepower\n",
"False 203\n",
"True 2\n",
"Name: horsepower, dtype: int64\n",
"\n",
"peak-rpm\n",
"False 203\n",
"True 2\n",
"Name: peak-rpm, dtype: int64\n",
"\n",
"city-mpg\n",
"False 205\n",
"Name: city-mpg, dtype: int64\n",
"\n",
"highway-mpg\n",
"False 205\n",
"Name: highway-mpg, dtype: int64\n",
"\n",
"price\n",
"False 201\n",
"True 4\n",
"Name: price, dtype: int64\n",
"\n"
]
}
],
"source": [
"for column in missing_data.columns.values.tolist():\n",
" print(column)\n",
" print (missing_data[column].value_counts())\n",
" print(\"\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the summary above, each column has 205 rows of data, seven columns containing missing data:\n",
"\n",
"<ol>\n",
" <li>\"normalized-losses\": 41 missing data</li>\n",
" <li>\"num-of-doors\": 2 missing data</li>\n",
" <li>\"bore\": 4 missing data</li>\n",
" <li>\"stroke\" : 4 missing data</li>\n",
" <li>\"horsepower\": 2 missing data</li>\n",
" <li>\"peak-rpm\": 2 missing data</li>\n",
" <li>\"price\": 4 missing data</li>\n",
"</ol>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 id=\"deal_missing_values\">Deal with missing data</h3>\n",
"<b>How to deal with missing data?</b>\n",
"\n",
"<ol>\n",
" <li>drop data<br>\n",
" a. drop the whole row<br>\n",
" b. drop the whole column\n",
" </li>\n",
" <li>replace data<br>\n",
" a. replace it by mean<br>\n",
" b. replace it by frequency<br>\n",
" c. replace it based on other functions\n",
" </li>\n",
"</ol>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.\n",
"We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:\n",
"\n",
"<b>Replace by mean:</b>\n",
"\n",
"<ul>\n",
" <li>\"normalized-losses\": 41 missing data, replace them with mean</li>\n",
" <li>\"stroke\": 4 missing data, replace them with mean</li>\n",
" <li>\"bore\": 4 missing data, replace them with mean</li>\n",
" <li>\"horsepower\": 2 missing data, replace them with mean</li>\n",
" <li>\"peak-rpm\": 2 missing data, replace them with mean</li>\n",
"</ul>\n",
"\n",
"<b>Replace by frequency:</b>\n",
"\n",
"<ul>\n",
" <li>\"num-of-doors\": 2 missing data, replace them with \"four\". \n",
" <ul>\n",
" <li>Reason: 84% sedans is four doors. Since four doors is most frequent, it is most likely to occur</li>\n",
" </ul>\n",
" </li>\n",
"</ul>\n",
"\n",
"<b>Drop the whole row:</b>\n",
"\n",
"<ul>\n",
" <li>\"price\": 4 missing data, simply delete the whole row\n",
" <ul>\n",
" <li>Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row now without price data is not useful to us</li>\n",
" </ul>\n",
" </li>\n",
"</ul>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Calculate the average of the column </h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average of normalized-losses: 122.0\n"
]
}
],
"source": [
"avg_norm_loss = df[\"normalized-losses\"].astype(\"float\").mean(axis=0)\n",
"print(\"Average of normalized-losses:\", avg_norm_loss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Replace \"NaN\" by mean value in \"normalized-losses\" column</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"df[\"normalized-losses\"].replace(np.nan, avg_norm_loss, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Calculate the mean value for 'bore' column</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average of bore: 3.3297512437810943\n"
]
}
],
"source": [
"avg_bore=df['bore'].astype('float').mean(axis=0)\n",
"print(\"Average of bore:\", avg_bore)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Replace NaN by mean value</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"df[\"bore\"].replace(np.nan, avg_bore, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Question #1: </h1>\n",
"\n",
"<b>According to the example above, replace NaN in \"stroke\" column by mean.</b>\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average of stroke: 3.255422885572139\n"
]
}
],
"source": [
"avg_stroke = df[\"stroke\"].astype(\"float\").mean(axis = 0)\n",
"print(\"Average of stroke:\", avg_stroke)\n",
"\n",
"df[\"stroke\"].replace(np.nan,avg_stroke, inplace = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"#Calculate the mean vaule for \"stroke\" column\n",
"avg_stroke = df[\"stroke\"].astype(\"float\").mean(axis = 0)\n",
"print(\"Average of stroke:\", avg_stroke)\n",
"\n",
"# replace NaN by mean value in \"stroke\" column\n",
"df[\"stroke\"].replace(np.nan, avg_stroke, inplace = True)\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Calculate the mean value for the 'horsepower' column:</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average horsepower: 104.25615763546799\n"
]
}
],
"source": [
"avg_horsepower = df['horsepower'].astype('float').mean(axis=0)\n",
"print(\"Average horsepower:\", avg_horsepower)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Replace \"NaN\" by mean value:</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Calculate the mean value for 'peak-rpm' column:</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average peak rpm: 5125.369458128079\n"
]
}
],
"source": [
"avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)\n",
"print(\"Average peak rpm:\", avg_peakrpm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Replace NaN by mean value:</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which values are present in a particular column, we can use the \".value_counts()\" method:\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"four 114\n",
"two 89\n",
"Name: num-of-doors, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['num-of-doors'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that four doors are the most common type. We can also use the \".idxmax()\" method to calculate for us the most common type automatically:\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'four'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['num-of-doors'].value_counts().idxmax()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The replacement procedure is very similar to what we have seen previously\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"#replace the missing 'num-of-doors' values by the most frequent \n",
"df[\"num-of-doors\"].replace(np.nan, \"four\", inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's drop all rows that do not have price data:\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# simply drop whole row with NaN in \"price\" column\n",
"df.dropna(subset=[\"price\"], axis=0, inplace=True)\n",
"\n",
"# reset index, because we droped two rows\n",
"df.reset_index(drop=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>engine-size</th>\n",
" <th>fuel-system</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>13495</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>16500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>152</td>\n",
" <td>mpfi</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000</td>\n",
" <td>19</td>\n",
" <td>26</td>\n",
" <td>16500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>109</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500</td>\n",
" <td>24</td>\n",
" <td>30</td>\n",
" <td>13950</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>136</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500</td>\n",
" <td>18</td>\n",
" <td>22</td>\n",
" <td>17450</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration num-of-doors \\\n",
"0 3 122 alfa-romero gas std two \n",
"1 3 122 alfa-romero gas std two \n",
"2 1 122 alfa-romero gas std two \n",
"3 2 164 audi gas std four \n",
"4 2 164 audi gas std four \n",
"\n",
" body-style drive-wheels engine-location wheel-base ... engine-size \\\n",
"0 convertible rwd front 88.6 ... 130 \n",
"1 convertible rwd front 88.6 ... 130 \n",
"2 hatchback rwd front 94.5 ... 152 \n",
"3 sedan fwd front 99.8 ... 109 \n",
"4 sedan 4wd front 99.4 ... 136 \n",
"\n",
" fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg \\\n",
"0 mpfi 3.47 2.68 9.0 111 5000 21 \n",
"1 mpfi 3.47 2.68 9.0 111 5000 21 \n",
"2 mpfi 2.68 3.47 9.0 154 5000 19 \n",
"3 mpfi 3.19 3.40 10.0 102 5500 24 \n",
"4 mpfi 3.19 3.40 8.0 115 5500 18 \n",
"\n",
" highway-mpg price \n",
"0 27 13495 \n",
"1 27 16500 \n",
"2 26 16500 \n",
"3 30 13950 \n",
"4 22 17450 \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Good!</b> Now, we obtain the dataset with no missing values.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 id=\"correct_data_format\">Correct data format</h3>\n",
"<b>We are almost there!</b>\n",
"<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>\n",
"\n",
"In Pandas, we use \n",
"\n",
"<p><b>.dtype()</b> to check the data type</p>\n",
"<p><b>.astype()</b> to change the data type</p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Lets list the data types for each column</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"symboling int64\n",
"normalized-losses object\n",
"make object\n",
"fuel-type object\n",
"aspiration object\n",
"num-of-doors object\n",
"body-style object\n",
"drive-wheels object\n",
"engine-location object\n",
"wheel-base float64\n",
"length float64\n",
"width float64\n",
"height float64\n",
"curb-weight int64\n",
"engine-type object\n",
"num-of-cylinders object\n",
"engine-size int64\n",
"fuel-system object\n",
"bore object\n",
"stroke object\n",
"compression-ratio float64\n",
"horsepower object\n",
"peak-rpm object\n",
"city-mpg int64\n",
"highway-mpg int64\n",
"price object\n",
"dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the \"astype()\" method.</p> \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Convert data types to proper format</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"df[[\"bore\", \"stroke\"]] = df[[\"bore\", \"stroke\"]].astype(\"float\")\n",
"df[[\"normalized-losses\"]] = df[[\"normalized-losses\"]].astype(\"int\")\n",
"df[[\"price\"]] = df[[\"price\"]].astype(\"float\")\n",
"df[[\"peak-rpm\"]] = df[[\"peak-rpm\"]].astype(\"float\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Let us list the columns after the conversion</h4>\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"symboling int64\n",
"normalized-losses int64\n",
"make object\n",
"fuel-type object\n",
"aspiration object\n",
"num-of-doors object\n",
"body-style object\n",
"drive-wheels object\n",
"engine-location object\n",
"wheel-base float64\n",
"length float64\n",
"width float64\n",
"height float64\n",
"curb-weight int64\n",
"engine-type object\n",
"num-of-cylinders object\n",
"engine-size int64\n",
"fuel-system object\n",
"bore float64\n",
"stroke float64\n",
"compression-ratio float64\n",
"horsepower object\n",
"peak-rpm float64\n",
"city-mpg int64\n",
"highway-mpg int64\n",
"price float64\n",
"dtype: object"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Wonderful!</b>\n",
"\n",
"Now, we finally obtain the cleaned dataset with no missing values and all data in its proper format.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"data_standardization\">Data Standardization</h2>\n",
"<p>\n",
"Data is usually collected from different agencies with different formats.\n",
"(Data Standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation)\n",
"</p>\n",
" \n",
"<b>What is Standardization?</b>\n",
"<p>Standardization is the process of transforming data into a common format which allows the researcher to make the meaningful comparison.\n",
"</p>\n",
"\n",
"<b>Example</b>\n",
"\n",
"<p>Transform mpg to L/100km:</p>\n",
"<p>In our dataset, the fuel consumption columns \"city-mpg\" and \"highway-mpg\" are represented by mpg (miles per gallon) unit. Assume we are developing an application in a country that accept the fuel consumption with L/100km standard</p>\n",
"<p>We will need to apply <b>data transformation</b> to transform mpg into L/100km?</p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>The formula for unit conversion is<p>\n",
"L/100km = 235 / mpg\n",
"<p>We can do many mathematical operations directly in Pandas.</p>\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>engine-size</th>\n",
" <th>fuel-system</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>13495.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>130</td>\n",
" <td>mpfi</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27</td>\n",
" <td>16500.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>152</td>\n",
" <td>mpfi</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000.0</td>\n",
" <td>19</td>\n",
" <td>26</td>\n",
" <td>16500.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>109</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500.0</td>\n",
" <td>24</td>\n",
" <td>30</td>\n",
" <td>13950.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>136</td>\n",
" <td>mpfi</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500.0</td>\n",
" <td>18</td>\n",
" <td>22</td>\n",
" <td>17450.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration \\\n",
"0 3 122 alfa-romero gas std \n",
"1 3 122 alfa-romero gas std \n",
"2 1 122 alfa-romero gas std \n",
"3 2 164 audi gas std \n",
"4 2 164 audi gas std \n",
"\n",
" num-of-doors body-style drive-wheels engine-location wheel-base ... \\\n",
"0 two convertible rwd front 88.6 ... \n",
"1 two convertible rwd front 88.6 ... \n",
"2 two hatchback rwd front 94.5 ... \n",
"3 four sedan fwd front 99.8 ... \n",
"4 four sedan 4wd front 99.4 ... \n",
"\n",
" engine-size fuel-system bore stroke compression-ratio horsepower \\\n",
"0 130 mpfi 3.47 2.68 9.0 111 \n",
"1 130 mpfi 3.47 2.68 9.0 111 \n",
"2 152 mpfi 2.68 3.47 9.0 154 \n",
"3 109 mpfi 3.19 3.40 10.0 102 \n",
"4 136 mpfi 3.19 3.40 8.0 115 \n",
"\n",
" peak-rpm city-mpg highway-mpg price \n",
"0 5000.0 21 27 13495.0 \n",
"1 5000.0 21 27 16500.0 \n",
"2 5000.0 19 26 16500.0 \n",
"3 5500.0 24 30 13950.0 \n",
"4 5500.0 18 22 17450.0 \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" <th>city-L/100km</th>\n",
" <th>highway-L/100km</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>13495.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>16500.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000.0</td>\n",
" <td>19</td>\n",
" <td>26.0</td>\n",
" <td>16500.0</td>\n",
" <td>12.368421</td>\n",
" <td>9.038462</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500.0</td>\n",
" <td>24</td>\n",
" <td>30.0</td>\n",
" <td>13950.0</td>\n",
" <td>9.791667</td>\n",
" <td>7.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500.0</td>\n",
" <td>18</td>\n",
" <td>22.0</td>\n",
" <td>17450.0</td>\n",
" <td>13.055556</td>\n",
" <td>10.681818</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration \\\n",
"0 3 122 alfa-romero gas std \n",
"1 3 122 alfa-romero gas std \n",
"2 1 122 alfa-romero gas std \n",
"3 2 164 audi gas std \n",
"4 2 164 audi gas std \n",
"\n",
" num-of-doors body-style drive-wheels engine-location wheel-base ... \\\n",
"0 two convertible rwd front 88.6 ... \n",
"1 two convertible rwd front 88.6 ... \n",
"2 two hatchback rwd front 94.5 ... \n",
"3 four sedan fwd front 99.8 ... \n",
"4 four sedan 4wd front 99.4 ... \n",
"\n",
" bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg \\\n",
"0 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"1 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"2 2.68 3.47 9.0 154 5000.0 19 26.0 \n",
"3 3.19 3.40 10.0 102 5500.0 24 30.0 \n",
"4 3.19 3.40 8.0 115 5500.0 18 22.0 \n",
"\n",
" price city-L/100km highway-L/100km \n",
"0 13495.0 11.190476 8.703704 \n",
"1 16500.0 11.190476 8.703704 \n",
"2 16500.0 12.368421 9.038462 \n",
"3 13950.0 9.791667 7.833333 \n",
"4 17450.0 13.055556 10.681818 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Convert mpg to L/100km by mathematical operation (235 divided by mpg)\n",
"df['city-L/100km'] = 235/df[\"city-mpg\"]\n",
"\n",
"# check your transformed data \n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Question #2: </h1>\n",
"\n",
"<b>According to the example above, transform mpg to L/100km in the column of \"highway-mpg\", and change the name of column to \"highway-L/100km\".</b>\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" <th>city-L/100km</th>\n",
" <th>highway-L/100km</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>13495.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>16500.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000.0</td>\n",
" <td>19</td>\n",
" <td>26.0</td>\n",
" <td>16500.0</td>\n",
" <td>12.368421</td>\n",
" <td>9.038462</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500.0</td>\n",
" <td>24</td>\n",
" <td>30.0</td>\n",
" <td>13950.0</td>\n",
" <td>9.791667</td>\n",
" <td>7.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500.0</td>\n",
" <td>18</td>\n",
" <td>22.0</td>\n",
" <td>17450.0</td>\n",
" <td>13.055556</td>\n",
" <td>10.681818</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration \\\n",
"0 3 122 alfa-romero gas std \n",
"1 3 122 alfa-romero gas std \n",
"2 1 122 alfa-romero gas std \n",
"3 2 164 audi gas std \n",
"4 2 164 audi gas std \n",
"\n",
" num-of-doors body-style drive-wheels engine-location wheel-base ... \\\n",
"0 two convertible rwd front 88.6 ... \n",
"1 two convertible rwd front 88.6 ... \n",
"2 two hatchback rwd front 94.5 ... \n",
"3 four sedan fwd front 99.8 ... \n",
"4 four sedan 4wd front 99.4 ... \n",
"\n",
" bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg \\\n",
"0 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"1 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"2 2.68 3.47 9.0 154 5000.0 19 26.0 \n",
"3 3.19 3.40 10.0 102 5500.0 24 30.0 \n",
"4 3.19 3.40 8.0 115 5500.0 18 22.0 \n",
"\n",
" price city-L/100km highway-L/100km \n",
"0 13495.0 11.190476 8.703704 \n",
"1 16500.0 11.190476 8.703704 \n",
"2 16500.0 12.368421 9.038462 \n",
"3 13950.0 9.791667 7.833333 \n",
"4 17450.0 13.055556 10.681818 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"highway-L/100km\"] = 235/df[\"highway-mpg\"]\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"# transform mpg to L/100km by mathematical operation (235 divided by mpg)\n",
"df[\"highway-mpg\"] = 235/df[\"highway-mpg\"]\n",
"\n",
"# rename column name from \"highway-mpg\" to \"highway-L/100km\"\n",
"df.rename(columns={'\"highway-mpg\"':'highway-L/100km'}, inplace=True)\n",
"\n",
"# check your transformed data \n",
"df.head()\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"data_normalization\">Data Normalization</h2>\n",
"\n",
"<b>Why normalization?</b>\n",
"\n",
"<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling variable so the variable values range from 0 to 1\n",
"</p>\n",
"\n",
"<b>Example</b>\n",
"\n",
"<p>To demonstrate normalization, let's say we want to scale the columns \"length\", \"width\" and \"height\" </p>\n",
"<p><b>Target:</b>would like to Normalize those variables so their value ranges from 0 to 1.</p>\n",
"<p><b>Approach:</b> replace original value by (original value)/(maximum value)</p>\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" <th>city-L/100km</th>\n",
" <th>highway-L/100km</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>13495.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>16500.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000.0</td>\n",
" <td>19</td>\n",
" <td>26.0</td>\n",
" <td>16500.0</td>\n",
" <td>12.368421</td>\n",
" <td>9.038462</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500.0</td>\n",
" <td>24</td>\n",
" <td>30.0</td>\n",
" <td>13950.0</td>\n",
" <td>9.791667</td>\n",
" <td>7.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500.0</td>\n",
" <td>18</td>\n",
" <td>22.0</td>\n",
" <td>17450.0</td>\n",
" <td>13.055556</td>\n",
" <td>10.681818</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration \\\n",
"0 3 122 alfa-romero gas std \n",
"1 3 122 alfa-romero gas std \n",
"2 1 122 alfa-romero gas std \n",
"3 2 164 audi gas std \n",
"4 2 164 audi gas std \n",
"\n",
" num-of-doors body-style drive-wheels engine-location wheel-base ... \\\n",
"0 two convertible rwd front 88.6 ... \n",
"1 two convertible rwd front 88.6 ... \n",
"2 two hatchback rwd front 94.5 ... \n",
"3 four sedan fwd front 99.8 ... \n",
"4 four sedan 4wd front 99.4 ... \n",
"\n",
" bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg \\\n",
"0 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"1 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"2 2.68 3.47 9.0 154 5000.0 19 26.0 \n",
"3 3.19 3.40 10.0 102 5500.0 24 30.0 \n",
"4 3.19 3.40 8.0 115 5500.0 18 22.0 \n",
"\n",
" price city-L/100km highway-L/100km \n",
"0 13495.0 11.190476 8.703704 \n",
"1 16500.0 11.190476 8.703704 \n",
"2 16500.0 12.368421 9.038462 \n",
"3 13950.0 9.791667 7.833333 \n",
"4 17450.0 13.055556 10.681818 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# replace (original value) by (original value)/(maximum value)\n",
"df['length'] = df['length']/df['length'].max()\n",
"df['width'] = df['width']/df['width'].max()\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Questiont #3: </h1>\n",
"\n",
"<b>According to the example above, normalize the column \"height\".</b>\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>length</th>\n",
" <th>width</th>\n",
" <th>height</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.811148</td>\n",
" <td>0.890278</td>\n",
" <td>0.816054</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.811148</td>\n",
" <td>0.890278</td>\n",
" <td>0.816054</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.822681</td>\n",
" <td>0.909722</td>\n",
" <td>0.876254</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.848630</td>\n",
" <td>0.919444</td>\n",
" <td>0.908027</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.848630</td>\n",
" <td>0.922222</td>\n",
" <td>0.908027</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" length width height\n",
"0 0.811148 0.890278 0.816054\n",
"1 0.811148 0.890278 0.816054\n",
"2 0.822681 0.909722 0.876254\n",
"3 0.848630 0.919444 0.908027\n",
"4 0.848630 0.922222 0.908027"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Write your code below and press Shift+Enter to execute \n",
"df['height'] = df['height'] / df['height'].max()\n",
"df[[\"length\", \"width\", \"height\"]].head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"df['height'] = df['height']/df['height'].max() \n",
"\n",
"# show the scaled columns\n",
"df[[\"length\",\"width\",\"height\"]].head()\n",
"\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we can see, we've normalized \"length\", \"width\" and \"height\" in the range of [0,1].\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"binning\">Binning</h2>\n",
"<b>Why binning?</b>\n",
"<p>\n",
" Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.\n",
"</p>\n",
"\n",
"<b>Example: </b>\n",
"\n",
"<p>In our dataset, \"horsepower\" is a real valued variable ranging from 48 to 288, it has 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis? </p>\n",
"\n",
"<p>We will use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins </p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Example of Binning Data In Pandas</h3>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Convert data to correct format \n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>fuel-type</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>...</th>\n",
" <th>bore</th>\n",
" <th>stroke</th>\n",
" <th>compression-ratio</th>\n",
" <th>horsepower</th>\n",
" <th>peak-rpm</th>\n",
" <th>city-mpg</th>\n",
" <th>highway-mpg</th>\n",
" <th>price</th>\n",
" <th>city-L/100km</th>\n",
" <th>highway-L/100km</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>13495.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>...</td>\n",
" <td>3.47</td>\n",
" <td>2.68</td>\n",
" <td>9.0</td>\n",
" <td>111</td>\n",
" <td>5000.0</td>\n",
" <td>21</td>\n",
" <td>27.0</td>\n",
" <td>16500.0</td>\n",
" <td>11.190476</td>\n",
" <td>8.703704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>...</td>\n",
" <td>2.68</td>\n",
" <td>3.47</td>\n",
" <td>9.0</td>\n",
" <td>154</td>\n",
" <td>5000.0</td>\n",
" <td>19</td>\n",
" <td>26.0</td>\n",
" <td>16500.0</td>\n",
" <td>12.368421</td>\n",
" <td>9.038462</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>10.0</td>\n",
" <td>102</td>\n",
" <td>5500.0</td>\n",
" <td>24</td>\n",
" <td>30.0</td>\n",
" <td>13950.0</td>\n",
" <td>9.791667</td>\n",
" <td>7.833333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>gas</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>...</td>\n",
" <td>3.19</td>\n",
" <td>3.40</td>\n",
" <td>8.0</td>\n",
" <td>115</td>\n",
" <td>5500.0</td>\n",
" <td>18</td>\n",
" <td>22.0</td>\n",
" <td>17450.0</td>\n",
" <td>13.055556</td>\n",
" <td>10.681818</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make fuel-type aspiration \\\n",
"0 3 122 alfa-romero gas std \n",
"1 3 122 alfa-romero gas std \n",
"2 1 122 alfa-romero gas std \n",
"3 2 164 audi gas std \n",
"4 2 164 audi gas std \n",
"\n",
" num-of-doors body-style drive-wheels engine-location wheel-base ... \\\n",
"0 two convertible rwd front 88.6 ... \n",
"1 two convertible rwd front 88.6 ... \n",
"2 two hatchback rwd front 94.5 ... \n",
"3 four sedan fwd front 99.8 ... \n",
"4 four sedan 4wd front 99.4 ... \n",
"\n",
" bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg \\\n",
"0 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"1 3.47 2.68 9.0 111 5000.0 21 27.0 \n",
"2 2.68 3.47 9.0 154 5000.0 19 26.0 \n",
"3 3.19 3.40 10.0 102 5500.0 24 30.0 \n",
"4 3.19 3.40 8.0 115 5500.0 18 22.0 \n",
"\n",
" price city-L/100km highway-L/100km \n",
"0 13495.0 11.190476 8.703704 \n",
"1 16500.0 11.190476 8.703704 \n",
"2 16500.0 12.368421 9.038462 \n",
"3 13950.0 9.791667 7.833333 \n",
"4 17450.0 13.055556 10.681818 \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"horsepower\"]=df[\"horsepower\"].astype(int, copy=True)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets plot the histogram of horspower, to see what the distribution of horsepower looks like.\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'horsepower bins')"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAEWCAYAAABhffzLAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVCUlEQVR4nO3de7SldX3f8feHAUXlHgYcwThIiStoDeoEtRSkEBXBZqiJF1awU0OCdqnRVU0yJk1q2piCMSbGusrCahjjldQg09il0lEGSRQdDCKIBoKDF0ZmuChoWxT89o/nN2Uzc87hAPOcfeb83q+19trPdT/f/Zs9n/07z372b6eqkCT1Y49pFyBJWlgGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+TVWSzUl+Ydp1LHZJTkzy7TnWn5fk9xayJu2+9px2AZIevqp61bRr0O7DHr+WhCRLphOzlJ6LFieDX4vBMUmuTvL9JB9Jsvf2FUl+PckNSW5Psj7J4ybWVZJXJ7keuD6DP02ytT3W1Ume0rZ9ZJK3JflmklvaqZFHtXUnJvl2kt9Jcms7/fQrE8fZP8n7kmxLclOSf59kj7bupiTPaNNntpqObvO/luRjbXqPJGuT/GOS25JcmOSgtm5l2++sJN8EPj1bQ81R4wVJ/nCH5/OG1hZbkrxiYttTk3w1yV1JvpPkjQ/j3067IYNfi8FLgFOAI4CnAv8GIMlJwH9u61cANwEf3mHf04FnAkcDzwNOAH4GOAB4KXBb2+7ctvwY4J8AhwG/P/E4jwUObsvXAOcneVJb905gf+CJwHOAfw1sD9KNwIlt+gTgxrbN9vmNbfo3Wq3PAR4H3AG8a4fn8hzgZ4HnM7O5apxp2/3btmcB70pyYFv3HuCVVbUv8BTmeKPRElVV3rxN7QZsBs6cmH8rcF6bfg/w1ol1+wA/Bla2+QJOmlh/EvAPwLOAPSaWB/ghcOTEsmcD32jTJwL3AI+ZWH8h8HvAMuBu4OiJda8ELm3TZwHr2/R1wK8BH27zNwFPn1h38sRjrGjPZU9gZXsuT5yjnWatsU1fAPzhxLb/B9hzYtutwLPa9Dfbc9hv2v/+3qZzs8evxeC7E9P/myHgYegZ37R9RVX9gKEHf9jE9t+aWP9p4L8w9KRvSXJ+kv2A5cCjgSuTfC/J94BPtOXb3VFVP5yYv6kd/2DgEZN1tOntNWwEjk/yWIY3iY8AxyVZydDjvqpt9wTgoonjXwfcCxw603OZxWw1zuS2qrpnYn6yXX8JOBW4KcnGJM9+gONqiTH4tZjdzBCYACR5DPBTwHcmtrnf8LJV9edV9QzgyQyndn4TuJWhB/zkqjqg3favqn0mdj2wPf52P92OfytDz/wJO6z7TjveDQyh+hvAZVV1F8Mb2dnA5VX1k7bPt4AXTBz/gKrau6pmfS4zmK3GB6WqvlhVq4FDgI8x/OWgjhj8Wsw+CLwiyTFJHgn8EXBFVW2eaeMkP5/kmUn2Yji183+Be1v4vhv40ySHtG0PS7LjufQ/SPKIJMcDLwT+qqruZQjGtyTZN8kTgH8HvH9iv43Aa7jvfP6lO8wDnNce4wnt+MuTrH4IbbJTjQ9m57bvryTZv6p+DNzJ8JeHOmLwa9Gqqg0M59k/CmwBjgReNscu+zEE/B0Mp0FuA97W1v02cAPw+SR3Av8LmPxg9Lttv5uBDwCvqqqvtXWvZXgjuRG4nOEN6b0T+24E9gUum2Ue4B3AeuBTSe4CPs/wofSDMVeND8bLgc2tHV4FnPkQHkO7sVT5QyzqW5ITgfdX1eFTLkVaEPb4JakzBr8kdcZTPZLUmVHHBEmyGbiL4aqBe6pqVfua+kcYvrSyGXhJVd0xZh2SpPuM2uNvwb+qqm6dWPZW4PaqOifJWuDAqvrtuR7n4IMPrpUrV45WpyQtRVdeeeWtVbV8x+XTGAVwNfeNbbKO4ZrnOYN/5cqVbNq0adyqJGmJSXLTTMvH/nC3GK5bvjLJ2W3ZoVW1BaDdHzLTjknOTrIpyaZt27aNXKYk9WPsHv9xVXVz+7bkJUnm/WWTqjofOB9g1apVfgItSbvIqD3+qrq53W8FLgKOZRg8awVAu986Zg2SpPsbLfiTPCbJvtunGcZKv4bha+tr2mZrgIvHqkGStLMxT/UcyjAM7fbjfLCqPpHki8CFSc5iGBf8xSPWIEnawWjBX1U3Aj83w/LbgJPHOq4kaW4O2SBJnTH4JakzBr8kdWYa39zVyFau/fhUjrv5nNOmclxJD449fknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xkHaRjStwdIkaS72+CWpMwa/JHXG4Jekziz5c/yeZ5ek+7PHL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+SeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1JnRgz/JsiR/n+Rv2vxBSS5Jcn27P3DsGiRJ91mIHv/rgOsm5tcCG6rqKGBDm5ckLZBRgz/J4cBpwH+bWLwaWNem1wGnj1mDJOn+xu7x/xnwW8BPJpYdWlVbANr9ITPtmOTsJJuSbNq2bdvIZUpSP0YL/iQvBLZW1ZUPZf+qOr+qVlXVquXLl+/i6iSpX2P+9OJxwC8mORXYG9gvyfuBW5KsqKotSVYAW0esQZK0g9F6/FX1pqo6vKpWAi8DPl1VZwLrgTVtszXAxWPVIEna2TSu4z8HeG6S64HntnlJ0gIZ81TP/1dVlwKXtunbgJMX4riSpJ35zV1J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+SeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjpj8EtSZwx+SeqMwS9JnRkt+JPsneQLSb6c5Nokf9CWH5TkkiTXt/sDx6pBkrSzMXv8dwMnVdXPAccApyR5FrAW2FBVRwEb2rwkaYGMFvw1+EGb3avdClgNrGvL1wGnj1WDJGlno57jT7IsyVXAVuCSqroCOLSqtgC0+0Nm2ffsJJuSbNq2bduYZUpSV0YN/qq6t6qOAQ4Hjk3ylAex7/lVtaqqVi1fvny0GiWpNwtyVU9VfQ+4FDgFuCXJCoB2v3UhapAkDca8qmd5kgPa9KOAXwC+BqwH1rTN1gAXj1WDJGlne4742CuAdUmWMbzBXFhVf5Pkc8CFSc4Cvgm8eMQaJEk7GC34q+pq4GkzLL8NOHms40qS5uY3dyWpMwa/JHXG4Jekzhj8ktSZeQV/kg3zWSZJWvzmvKonyd7Ao4GD2yiaaav2Ax43cm2SpBE80OWcrwRezxDyV3Jf8N8JvGu8siRJY5kz+KvqHcA7kry2qt65QDVJkkY0ry9wVdU7k/wzYOXkPlX1vpHqkiSNZF7Bn+QvgSOBq4B72+ICDH5J2s3Md8iGVcDRVVVjFiNJGt98r+O/BnjsmIVIkhbGfHv8BwNfTfIFht/SBaCqfnGUqiRJo5lv8L95zCIkSQtnvlf1bBy7EEnSwpjvVT13MVzFA/AIYC/gh1W131iFSZLGMd8e/76T80lOB44doyBJ0rge0uicVfUx4KRdW4okaSHM91TPiyZm92C4rt9r+iVpNzTfq3r+5cT0PcBmYPUur0aSNLr5nuN/xdiFaPe3cu3Hp3bszeecNrVjS7ub+f4Qy+FJLkqyNcktST6a5PCxi5Mk7Xrz/XD3L4D1DOPyHwb8j7ZMkrSbmW/wL6+qv6iqe9rtAmD5iHVJkkYy3+C/NcmZSZa125nAbWMWJkkax3yD/1eBlwDfBbYAvwz4ga8k7YbmeznnfwLWVNUdAEkOAt7G8IYgSdqNzLfH/9TtoQ9QVbcDTxunJEnSmOYb/HskOXD7TOvxz/evBUnSIjLf8P4T4O+S/HeGoRpeArxltKokSaOZ7zd335dkE8PAbAFeVFVfHbUySdIo5n26pgW9YS9Ju7mHNCyzJGn3ZfBLUmdGC/4kj0/ymSTXJbk2yeva8oOSXJLk+nZ/4AM9liRp1xmzx38P8Iaq+lngWcCrkxwNrAU2VNVRwIY2L0laIKMFf1Vtqaovtem7gOsYRvZcDaxrm60DTh+rBknSzhbkHH+SlQzf9L0COLSqtsDw5gAcshA1SJIGowd/kn2AjwKvr6o7H8R+ZyfZlGTTtm3bxitQkjozavAn2Ysh9D9QVX/dFt+SZEVbvwLYOtO+VXV+Va2qqlXLlzv0vyTtKmNe1RPgPcB1VfX2iVXrgTVteg1w8Vg1SJJ2NuZAa8cBLwe+kuSqtux3gHOAC5OcBXwTePGINUiSdjBa8FfV5Qzj+szk5LGOK0mam9/claTOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnxhydU1ryVq79+NSOvfmc06Z2bO3e7PFLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZr+PXkjDN6+ml3Y09fknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM6MFvxJ3ptka5JrJpYdlOSSJNe3+wPHOr4kaWZj9vgvAE7ZYdlaYENVHQVsaPOSpAU0WvBX1WXA7TssXg2sa9PrgNPHOr4kaWYLfY7/0KraAtDuD1ng40tS9xbth7tJzk6yKcmmbdu2TbscSVoyFjr4b0myAqDdb51tw6o6v6pWVdWq5cuXL1iBkrTULXTwrwfWtOk1wMULfHxJ6t6Yl3N+CPgc8KQk305yFnAO8Nwk1wPPbfOSpAW051gPXFVnzLLq5LGOKUl6YIv2w11J0jgMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjoz2g+xSBrXyrUfn8pxN59z2lSOq13HHr8kdcbgl6TOGPyS1BmDX5I644e7kh6UaX2oDH6wvKvY45ekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUGYNfkjrjWD2S9ACW2vhE9vglqTNTCf4kpyT5epIbkqydRg2S1KsFD/4ky4B3AS8AjgbOSHL0QtchSb2aRo//WOCGqrqxqn4EfBhYPYU6JKlL0/hw9zDgWxPz3waeueNGSc4Gzm6zP0jy9V1w7IOBW3fB4yxFts3sbJvZLWjb5NyFOtIusUva5mE+5yfMtHAawZ8ZltVOC6rOB87fpQdONlXVql35mEuFbTM722Z2ts3sFnPbTONUz7eBx0/MHw7cPIU6JKlL0wj+LwJHJTkiySOAlwHrp1CHJHVpwU/1VNU9SV4DfBJYBry3qq5doMPv0lNHS4xtMzvbZna2zewWbdukaqfT65KkJcxv7kpSZwx+SerMkg3+JJuTfCXJVUk2tWUHJbkkyfXt/sBp17lQkrw3ydYk10wsm7U9krypDanx9STPn07VC2OWtnlzku+0189VSU6dWNdT2zw+yWeSXJfk2iSva8u7f+3M0TaL/7VTVUvyBmwGDt5h2VuBtW16LXDutOtcwPY4AXg6cM0DtQfDUBpfBh4JHAH8I7Bs2s9hgdvmzcAbZ9i2t7ZZATy9Te8L/ENrg+5fO3O0zaJ/7SzZHv8sVgPr2vQ64PTplbKwquoy4PYdFs/WHquBD1fV3VX1DeAGhqE2lqRZ2mY2vbXNlqr6Upu+C7iO4dv33b925mib2SyatlnKwV/Ap5Jc2YZ/ADi0qrbA8I8GHDK16haH2dpjpmE15npBL1WvSXJ1OxW0/VRGt22TZCXwNOAKfO3czw5tA4v8tbOUg/+4qno6wyigr05ywrQL2o3Ma1iNJe6/AkcCxwBbgD9py7tsmyT7AB8FXl9Vd8616QzLlnT7zNA2i/61s2SDv6pubvdbgYsY/qS6JckKgHa/dXoVLgqztUf3w2pU1S1VdW9V/QR4N/f9Sd5d2yTZiyHYPlBVf90W+9ph5rbZHV47SzL4kzwmyb7bp4HnAdcwDA2xpm22Brh4OhUuGrO1x3rgZUkemeQI4CjgC1Oob2q2h1rzrxheP9BZ2yQJ8B7guqp6+8Sq7l87s7XNbvHamfYn4yN92v5Ehk/PvwxcC/xuW/5TwAbg+nZ/0LRrXcA2+RDDn50/Zuh5nDVXewC/y3DVwdeBF0y7/im0zV8CXwGuZvgPu6LTtvnnDKcjrgauardTfe3M2TaL/rXjkA2S1JkleapHkjQ7g1+SOmPwS1JnDH5J6ozBL0mdMfi1pCRZOTnKpqSdGfxSk2TBf4r0odhd6tTiZfBrKVqW5N1tjPRPJXlUkmOSfL4NnHXR9oGzklya5I+SbARel+TFSa5J8uUkl7VtliX54yRfbPu/si0/Mcll7fG+muS8JHu0dWdk+D2Ia5Kc25a9JMnb2/TrktzYpo9McnmbfkaSjW1wwU9ODItwvzoXtjm11Nhz0FJ0FHBGVf16kguBXwJ+C3htVW1M8h+B/wC8vm1/QFU9ByDJV4DnV9V3khzQ1p8FfL+qfj7JI4G/TfKptu5YhnHWbwI+Abwoyd8B5wLPAO5gGCX2dOAy4DfbfscDtyU5jOEboJ9t4768E1hdVduSvBR4C/CrO9YpPRwGv5aib1TVVW36SoaREg+oqo1t2Trgrya2/8jE9N8CF7Q3jO0Dkj0PeGqSX27z+zO8ufwI+EJVbe+5f4ghxH8MXFpV29ryDwAnVNXHkuzTxpF6PPBBhh+BOb4d60nAU4BLhmFgWMYwlMRMdUoPmcGvpejuiel7gQMeYPsfbp+oqlcleSZwGnBVkmMYhtN9bVV9cnKnJCey87C6xczD7273OeAVDGO1fJahN/9s4A3ATwPXVtWzH6hO6eHwHL968H3gjiTHt/mXAxtn2jDJkVV1RVX9PnArQ8/8k8C/badiSPIzbdRXgGOTHNHO7b8UuJzhxziek+TgJMuAMyaOdxnwxnb/98C/AO6uqu8zvBksT/Lsdpy9kjx51zWDNLDHr16sAc5L8mjgRoZe90z+OMlRDL32DQwjvF4NrAS+1Ibi3cZ9PzX4OeAc4J8yhPlFVfWTJG8CPtMe539W1fZhiz/L8GZyWVXdm+RbwNcAqupH7XTSnyfZn+H/558xjDAr7TKOzik9RO1Uzxur6oVTLkV6UDzVI0mdsccvSZ2xxy9JnTH4JakzBr8kdcbgl6TOGPyS1Jn/B2EXAS5SRAWHAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib as plt\n",
"from matplotlib import pyplot\n",
"plt.pyplot.hist(df[\"horsepower\"])\n",
"\n",
"# set x/y labels and plot title\n",
"plt.pyplot.xlabel(\"horsepower\")\n",
"plt.pyplot.ylabel(\"count\")\n",
"plt.pyplot.title(\"horsepower bins\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>We would like 3 bins of equal size bandwidth so we use numpy's <code>linspace(start_value, end_value, numbers_generated</code> function.</p>\n",
"<p>Since we want to include the minimum value of horsepower we want to set start_value=min(df[\"horsepower\"]).</p>\n",
"<p>Since we want to include the maximum value of horsepower we want to set end_value=max(df[\"horsepower\"]).</p>\n",
"<p>Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated=4.</p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We build a bin array, with a minimum value to a maximum value, with bandwidth calculated above. The bins will be values used to determine when one bin ends and another begins.\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 48. , 119.33333333, 190.66666667, 262. ])"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bins = np.linspace(min(df[\"horsepower\"]), max(df[\"horsepower\"]), 4)\n",
"bins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We set group names:\n"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"group_names = ['Low', 'Medium', 'High']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We apply the function \"cut\" the determine what each value of \"df['horsepower']\" belongs to. \n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>horsepower</th>\n",
" <th>horsepower-binned</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>111</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>111</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>154</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>102</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>115</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>110</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>110</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>110</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>140</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>101</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>101</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>121</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>121</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>121</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>182</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>182</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>182</td>\n",
" <td>Medium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>48</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>70</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>70</td>\n",
" <td>Low</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" horsepower horsepower-binned\n",
"0 111 Low\n",
"1 111 Low\n",
"2 154 Medium\n",
"3 102 Low\n",
"4 115 Low\n",
"5 110 Low\n",
"6 110 Low\n",
"7 110 Low\n",
"8 140 Medium\n",
"9 101 Low\n",
"10 101 Low\n",
"11 121 Medium\n",
"12 121 Medium\n",
"13 121 Medium\n",
"14 182 Medium\n",
"15 182 Medium\n",
"16 182 Medium\n",
"17 48 Low\n",
"18 70 Low\n",
"19 70 Low"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )\n",
"df[['horsepower','horsepower-binned']].head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets see the number of vehicles in each bin.\n"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Low 153\n",
"Medium 43\n",
"High 5\n",
"Name: horsepower-binned, dtype: int64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"horsepower-binned\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets plot the distribution of each bin.\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'horsepower bins')"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAZA0lEQVR4nO3dfbRddX3n8feHoOATD5qLIgGCDDoNDEWNqHVQKlWpOoalVcMSGxUbcVHUGVFDnUrHioMPo1WrZcWKYFUgPiCZ0frQWIiPYFAEASmpBAgguSAKouXxO3/sne3hepNcLjnn3Nzzfq1119n799v77O+9Jzmfs/c++7dTVUiSBLDdsAuQJM0choIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoaEZKsi7Jnwy7jpkuyaFJ1m+m/5Qkfz3ImrRt237YBUjqn6o6Ztg1aNvinoJmtSSz5oPPbPpdNHMZCprJDkpycZJfJTkryY4bO5L8RZK1SX6RZGWSx/b0VZJjk1wJXJnGB5NsaJ/r4iQHtMvukOT9Sa5JcmN7uOUhbd+hSdYn+askN7WHtF7Rs52dk3wqyXiSq5P8zyTbtX1XJ3lyO31UW9OCdv61Sb7UTm+XZFmSf09yc5IVSR7Z9s1v1zs6yTXANzf1h9pMjacledeE3+fN7d/ihiSv7ln2+UkuS3JbkuuSHP8AXjttowwFzWQvAw4H9gEOBF4FkOTZwP9u+3cHrgbOnLDuEcBTgQXAc4FnAo8HdgFeDtzcLveetv0g4D8BewDv6HmexwBz2/YlwPIkT2j7PgLsDDwOeBbw58DGN9nzgEPb6WcCP2uX2Th/Xjv9hrbWZwGPBW4BPjrhd3kW8AfA85jc5mqcbNmd22WPBj6aZNe27xPA66rqEcABbCaENItVlT/+zLgfYB1wVM/8e4FT2ulPAO/t6Xs4cBcwv50v4Nk9/c8G/g14GrBdT3uA24F9e9qeDlzVTh8K3A08rKd/BfDXwBzgDmBBT9/rgHPb6aOBle305cBrgTPb+auBJ/X0HdbzHLu3v8v2wPz2d3ncZv5Om6yxnT4NeFfPsr8Ftu9ZdgPwtHb6mvZ32GnYr78/w/txT0Ez2c97pn9D8+YPzSfqqzd2VNWvaT7579Gz/LU9/d8E/p7mE/iNSZYn2QkYAx4KXJjkl0l+CXy1bd/olqq6vWf+6nb7c4EH99bRTm+s4TzgkCSPoQmQs4BnJJlP80n9ona5vYGze7Z/OXAP8OjJfpdN2FSNk7m5qu7ume/9u74EeD5wdZLzkjx9C9vVLGQoaFt0Pc2bKQBJHgY8CriuZ5n7DP9bVR+uqicD+9McLnoLcBPNJ+f9q2qX9mfnqnp4z6q7ts+/0V7t9m+i+US/94S+69rtraV5w30DsLqqbqMJuaXAt6vq3nada4E/7dn+LlW1Y1Vt8neZxKZqvF+q6gdVtQjYDfgSzR6HRoyhoG3RZ4FXJzkoyQ7Au4Hzq2rdZAsneUqSpyZ5EM3hov8A7mnfmD8OfDDJbu2yeySZeOz+fyV5cJJDgBcCn6uqe2jeNE9K8ogkewP/A/h0z3rnAX/J784fnDthHuCU9jn2brc/lmTRNP4mv1fj/Vm5XfcVSXauqruAW2n2WDRiDAVtc6pqFc1x/S8ANwD7Aos3s8pONG/+t9AcWrkZeH/b9zZgLfD9JLcC/wL0nqT9ebve9cBngGOq6qdt33E0IfMz4Ns0YXVqz7rnAY8AVm9iHuBDwErg60luA75Pc4L8/thcjffHK4F17d/hGOCoaTyHtnGp8iY70mSSHAp8uqrmDbkUaWDcU5AkdQwFSVKnb6GQ5NT2qsmfTGg/LskVSS5N8t6e9hPaK1SvmOREnzRwVXWuh440avo5lsppNN8N/9TGhiR/DCwCDqyqO3q+8bGA5kTh/jTfr/6XJI9vv+EhSRqQvoVCVa1uL9Tp9Xrg5Kq6o11mQ9u+iOZqzzuAq5KsBQ4Gvre5bcydO7fmz5+4CUnS5lx44YU3VdXYZH2DHnXx8TRXeZ5E813x46vqBzRXgX6/Z7n13Pfq1E6SpTQXALHXXnuxZs2a/lYsSbNMkqs31TfoE83bA7vSjEHzFmBFktCMQTPRpN+VrarlVbWwqhaOjU0adJKkaRp0KKwHvliNC4B7acaQWQ/s2bPcPKZxmb4k6YEZdCh8iWbESpI8nmZAsZtoruhc3I5tvw+wH3DBgGuTpJHXt3MKSc6gGap3bpp7yJ5IMwTAqe3XVO8EllRzSfWlSVYAl9EMA3ys3zySpMHbpoe5WLhwYXmiWZLunyQXVtXCyfq8olmS1DEUJEkdQ0GS1DEUJEmdQV/RPKPMX/blYZcwa607+QXDLkHSNLinIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnq9C0UkpyaZEN7P+aJfccnqSRze9pOSLI2yRVJntevuiRJm9bPPYXTgMMnNibZE3gOcE1P2wJgMbB/u87HkszpY22SpEn0LRSqajXwi0m6Pgi8FaietkXAmVV1R1VdBawFDu5XbZKkyQ30nEKSFwHXVdWPJ3TtAVzbM7++bZvsOZYmWZNkzfj4eJ8qlaTRNLBQSPJQ4O3AOybrnqStJmmjqpZX1cKqWjg2NrY1S5SkkTfI23HuC+wD/DgJwDzgh0kOptkz2LNn2XnA9QOsTZLEAPcUquqSqtqtquZX1XyaIHhSVf0cWAksTrJDkn2A/YALBlWbJKnRz6+kngF8D3hCkvVJjt7UslV1KbACuAz4KnBsVd3Tr9okSZPr2+GjqjpyC/3zJ8yfBJzUr3okSVvmFc2SpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnq9PMezacm2ZDkJz1t70vy0yQXJzk7yS49fSckWZvkiiTP61ddkqRN6+eewmnA4RPavgEcUFUHAv8GnACQZAGwGNi/XedjSeb0sTZJ0iT6FgpVtRr4xYS2r1fV3e3s94F57fQi4MyquqOqrgLWAgf3qzZJ0uSGeU7hNcA/t9N7ANf29K1v235PkqVJ1iRZMz4+3ucSJWm0DCUUkrwduBv4zMamSRarydatquVVtbCqFo6NjfWrREkaSdsPeoNJlgAvBA6rqo1v/OuBPXsWmwdcP+jaJGnUDXRPIcnhwNuAF1XVb3q6VgKLk+yQZB9gP+CCQdYmSerjnkKSM4BDgblJ1gMn0nzbaAfgG0kAvl9Vx1TVpUlWAJfRHFY6tqru6VdtkqTJ9S0UqurISZo/sZnlTwJO6lc9kqQt84pmSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdfoWCklOTbIhyU962h6Z5BtJrmwfd+3pOyHJ2iRXJHlev+qSJG1aP/cUTgMOn9C2DFhVVfsBq9p5kiwAFgP7t+t8LMmcPtYmSZpE30KhqlYDv5jQvAg4vZ0+HTiip/3Mqrqjqq4C1gIH96s2SdLkBn1O4dFVdQNA+7hb274HcG3Pcuvbtt+TZGmSNUnWjI+P97VYSRo1M+VEcyZpq8kWrKrlVbWwqhaOjY31uSxJGi2DDoUbk+wO0D5uaNvXA3v2LDcPuH7AtUnSyBt0KKwElrTTS4BzetoXJ9khyT7AfsAFA65Nkkbe9v164iRnAIcCc5OsB04ETgZWJDkauAZ4KUBVXZpkBXAZcDdwbFXd06/aJEmT61soVNWRm+g6bBPLnwSc1K96JElbNlNONEuSZgBDQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUmVIoJFk1lTZJ0rZts2MfJdkReCjNoHa78rv7HuwEPLbPtUmSBmxLA+K9DngTTQBcyO9C4Vbgo/0rS5I0DJsNhar6EPChJMdV1UcGVJMkaUimNHR2VX0kyR8B83vXqapP9akuSdIQTCkUkvwTsC9wEbDx5jcFGAqSNItM9SY7C4EFVVX9LEaSNFxTvU7hJ8Bj+lmIJGn4prqnMBe4LMkFwB0bG6vqRdPZaJL/DryW5hDUJcCrab76ehbNeYt1wMuq6pbpPL8kaXqmGgp/s7U2mGQP4A00h6N+m2QFsBhYAKyqqpOTLAOWAW/bWtuVJG3ZVL99dF4ftvuQJHfR7CFcD5wAHNr2nw6ci6EgSQM11WEubktya/vzH0nuSXLrdDZYVdcB7weuAW4AflVVXwceXVU3tMvcAOw2neeXJE3fVPcUHtE7n+QI4ODpbLAdLmMRsA/wS+BzSY66H+svBZYC7LXXXtMpQZK0CdMaJbWqvgQ8e5rb/BPgqqoar6q7gC8CfwTcmGR3gPZxwya2vbyqFlbVwrGxsWmWIEmazFQvXntxz+x2NNctTPeahWuApyV5KPBb4DBgDXA7sAQ4uX08Z5rPL0mapql+++i/9UzfTfOV0UXT2WBVnZ/k88AP2+f6EbAceDiwIsnRNMHx0uk8vyRp+qZ6TuHVW3OjVXUicOKE5jto9hokSUMy1W8fzUtydpINSW5M8oUk8/pdnCRpsKZ6ovmTwEqa+yrsAfzftk2SNItMNRTGquqTVXV3+3Ma4Fd/JGmWmWoo3JTkqCRz2p+jgJv7WZgkafCmGgqvAV4G/JzmKuQ/oxnETpI0i0z1K6l/CyzZOGppkkfSDFXxmn4VJkkavKnuKRzYO4x1Vf0CeGJ/SpIkDctUQ2G7dswioNtTmOpehiRpGzHVN/b/A3y3vRK5aM4vnNS3qiRJQzHVK5o/lWQNzSB4AV5cVZf1tTJJ0sBN+RBQGwIGgSTNYtMaOluSNDsZCpKkjqEgSeoYCpKkjqEgSeoYCpKkjqEgSeoMJRSS7JLk80l+muTyJE9P8sgk30hyZfu465afSZK0NQ1rT+FDwFer6j8DfwhcDiwDVlXVfsCqdl6SNEADD4UkOwHPBD4BUFV3VtUvgUXA6e1ipwNHDLo2SRp1w9hTeBwwDnwyyY+S/GOShwGPrqobANrH3YZQmySNtGGEwvbAk4B/qKonArdzPw4VJVmaZE2SNePj4/2qUZJG0jBCYT2wvqrOb+c/TxMSNybZHaB93DDZylW1vKoWVtXCsbGxgRQsSaNi4KFQVT8Hrk3yhLbpMJrRV1cCS9q2JcA5g65NkkbdsO6edhzwmSQPBn4GvJomoFYkORq4BnjpkGqTpJE1lFCoqouAhZN0HTbgUiRJPbyiWZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSZ1h3aOZJHOANcB1VfXCJI8EzgLmA+uAl1XVLcOqTzPP/GVfHnYJs9a6k18w7BI0QwxzT+GNwOU988uAVVW1H7CqnZckDdBQQiHJPOAFwD/2NC8CTm+nTweOGHBZkjTyhrWn8HfAW4F7e9oeXVU3ALSPuw2hLkkaaQMPhSQvBDZU1YXTXH9pkjVJ1oyPj2/l6iRptA1jT+EZwIuSrAPOBJ6d5NPAjUl2B2gfN0y2clUtr6qFVbVwbGxsUDVL0kgYeChU1QlVNa+q5gOLgW9W1VHASmBJu9gS4JxB1yZJo24mXadwMvCcJFcCz2nnJUkDNLTrFACq6lzg3Hb6ZuCwYdYjSaNuJu0pSJKGzFCQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSZ+ChkGTPJP+a5PIklyZ5Y9v+yCTfSHJl+7jroGuTpFE3jD2Fu4E3V9UfAE8Djk2yAFgGrKqq/YBV7bwkaYAGHgpVdUNV/bCdvg24HNgDWASc3i52OnDEoGuTpFE31HMKSeYDTwTOBx5dVTdAExzAbptYZ2mSNUnWjI+PD6xWSRoFQwuFJA8HvgC8qapunep6VbW8qhZW1cKxsbH+FShJI2gooZDkQTSB8Jmq+mLbfGOS3dv+3YENw6hNkkbZML59FOATwOVV9YGerpXAknZ6CXDOoGuTpFG3/RC2+QzglcAlSS5q2/4KOBlYkeRo4BrgpUOoTZJG2sBDoaq+DWQT3YcNshZJ0n15RbMkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoSJI6w7ifgqQRMX/Zl4ddwqy17uQX9OV53VOQJHUMBUlSx1CQJHVmXCgkOTzJFUnWJlk27HokaZTMqFBIMgf4KPCnwALgyCQLhluVJI2OGRUKwMHA2qr6WVXdCZwJLBpyTZI0MmbaV1L3AK7tmV8PPLV3gSRLgaXt7K+TXDGg2oZtLnDTsIuYqrxn2BXMCNvMa+brBWxDrxc84Nds7011zLRQyCRtdZ+ZquXA8sGUM3MkWVNVC4ddh6bO12zb4uvVmGmHj9YDe/bMzwOuH1ItkjRyZloo/ADYL8k+SR4MLAZWDrkmSRoZM+rwUVXdneQvga8Bc4BTq+rSIZc1U4zcIbNZwNds2+LrBaSqtryUJGkkzLTDR5KkITIUJEkdQ2EGSPLrYdegRpJK8k8989snGU/y/+7n85ybZGE7/ZUku2zlUjVFE/9/JXlVkr9vp49J8udbWL9bfhTMqBPN0gxwO3BAkodU1W+B5wDXPZAnrKrnb5XKtNVV1SnDrmGmcU9hhkpyUJLvJ7k4ydlJdk2yW5IL2/4/bD/V7tXO/3uShw636lnjn4GNdzA5EjhjY0eShyU5NckPkvwoyaK2/SFJzmxfr7OAh/Sssy7J3CTzk/ykp/34JH/TTp+b5INJVie5PMlTknwxyZVJ3jWA33kkJfmbJMe3009pX7/vJXlf72sFPDbJV9vX471DKncgDIWZ61PA26rqQOAS4MSq2gDsmGQn4BBgDXBIkr2BDVX1m+GVO6ucCSxOsiNwIHB+T9/bgW9W1VOAPwbel+RhwOuB37Sv10nAk6ex3Tur6pnAKcA5wLHAAcCrkjxq2r+NHpLkoo0/wDs3sdwngWOq6unAPRP6DgJeDvwX4OVJ9mSW8vDRDJRkZ2CXqjqvbTod+Fw7/V3gGcAzgXcDh9MMD/KtQdc5W1XVxUnm0+wlfGVC93OBF238dAnsCOxF83p8uGf9i6ex6Y0Xal4CXFpVNwAk+RnNlf43T+M5Bb+tqoM2ziR5FXCf4Szacz6PqKrvtk2fBV7Ys8iqqvpVu+xlNGMH9Y7TNmsYCtueb9HsJexN82nybTTjQ92vE6HaopXA+4FDgd5P6QFeUlX3GYgxCUwYp2sSd3PfvfMdJ/Tf0T7e2zO9cd7/q/012bhrvXpfj3uYxa+Hh49moPYTyS1JDmmbXgls3GtYDRwFXFlV9wK/AJ4PfGfghc5upwLvrKpLJrR/DTgubQokeWLbvhp4Rdt2AM1hp4luBHZL8qgkO3DfT6Iaoqq6BbgtydPapsXDrGeYZm3abWMemmR9z/wHgCXAKe3J458BrwaoqnXt+9HqdtlvA/Paf9TaSqpqPfChSbr+Fvg74OI2GNbRvLn/A/DJ9rDRRcAFkzznXUneSXOO4irgp/2oXdN2NPDxJLcD5wK/Gm45w+EwF5IEJHl4Vf26nV4G7F5VbxxyWQPnnoIkNV6Q5ASa98WrgVcNt5zhcE9BktTxRLMkqWMoSJI6hoIkqWMoaCRMHHdI0uQMBWkLkmwT39LbVurUzGYoaJTMSfLxJJcm+Xo7sunvjUYL3ail705yHvDGJC9N8pMkP06yul1mTjua5g/a9V/Xth/ajnZ6dpLLkpySZLu278gkl7TP9Z627WVJPtBOv7Ed64gk+yb5djv95CTnJbkwydeS7D5ZnYP9c2o28pOFRsl+wJFV9RdJVgAvAd4KHFdV57VXG58IvKldfpeqehZAkkuA51XVdfndDXOOBn5VVU9ph634TpKvt30HAwtovu/+VeDFSb4LvIdmBNVbgK8nOYLm6vS3tOsdAtycZA/gvwLfSvIg4CPAoqoaT/JympFYXzOxTumBMhQ0Sq6qqova6QuBfdn0aLQAZ/VMfwc4rQ2TL7ZtzwUOTPJn7fzONMFzJ3BBVW38xH8GzRv8XcC5VTXetn8GeGZVfSnJw5M8gmY01M/SjLp6SLutJ9AMof2NdoiTOcANm6hTekAMBY2SiSNd7rKF5W/fOFFVxyR5Ks3Ndy5KchDNyJrHVdXXeldKcii/P2JqsfmROL9HM77VFTQj4b4GeDrwZpqhuS9tx/nfbJ3SA+U5BY2yzY1Gex9J9q2q86vqHcBNNJ/ovwa8vj28Q5LHtzfcATg4yT7tuYSX0wxceD7wrDR3YZtDc7+G3tFvj28ff0RzA5872hFzrwDGkjy93c6Dkuy/9f4M0u+4p6BRN+lotJN4X5L9aD7trwJ+DFwMzAd+2I6YOg4c0S7/PeBkmjt1rQbOrqp727F1/rV9nq9U1Tnt8t+iCZrVVXVPkmtpR1GtqjvbQ1Qfbm/AtD3NSK2XbpW/gNTDsY+kraw9fHR8VXm/BG1zPHwkSeq4pyBJ6rinIEnqGAqSpI6hIEnqGAqSpI6hIEnq/H92ln9qkR9/MQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib as plt\n",
"from matplotlib import pyplot\n",
"pyplot.bar(group_names, df[\"horsepower-binned\"].value_counts())\n",
"\n",
"# set x/y labels and plot title\n",
"plt.pyplot.xlabel(\"horsepower\")\n",
"plt.pyplot.ylabel(\"count\")\n",
"plt.pyplot.title(\"horsepower bins\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>\n",
" Check the dataframe above carefully, you will find the last column provides the bins for \"horsepower\" with 3 categories (\"Low\",\"Medium\" and \"High\"). \n",
"</p>\n",
"<p>\n",
" We successfully narrow the intervals from 57 to 3!\n",
"</p>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Bins visualization</h3>\n",
"Normally, a histogram is used to visualize the distribution of bins we created above. \n"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'horsepower bins')"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAYdElEQVR4nO3dfbRddX3n8feHoPgA8tBcEAkYZKLT4DiIKUodlUoVfBjD2KphFSeD2GgXos6IGupU7Ezp4EO11tG6YkFipWB8QDLLGYVGSbQVMCjyKJJCgEAk4UGh1kbA7/yxdzaH673JJeScc5Pzfq111tn7t/c++3t/OTmfs/c5+3dSVUiSBLDLsAuQJE0fhoIkqWMoSJI6hoIkqWMoSJI6hoIkqWMoaFpKsjbJ7w67jukuyVFJ1m1h+aeT/Mkga9KObddhFyCpf6rqrcOuQTsWjxS0U0uy07zx2Zn+Fk1fhoKms8OSXJXkZ0m+kOQJmxck+cMka5Lck2R5kqf1LKskJye5EbgxjY8l2dA+1lVJnt2uu1uSjyS5Ncmd7emWJ7bLjkqyLskfJ7mrPaX1Bz372TPJ55JsTHJLkv+eZJd22S1JntdOn9DWNLedf3OSr7bTuyRZnOSfktydZFmSfdpls9vtTkpyK/DNyTpqCzWek+TPxv0972r7Yn2SE3vWfWWS65Lcn+T2JKc+hn877aAMBU1nrweOBQ4GngP8F4AkLwX+V7t8f+AW4Pxx2x4HPB+YC7wceDHwTGAv4A3A3e16H2zbDwP+DXAA8P6ex3kqMLNtXwgsSfKsdtkngD2BZwAvAf4zsPlFdiVwVDv9YuCmdp3N8yvb6be3tb4EeBpwL/DJcX/LS4DfBI5hYluqcaJ192zXPQn4ZJK922VnAW+pqj2AZ7OFENJOrKq8eZt2N2AtcELP/IeAT7fTZwEf6lm2O/AAMLudL+ClPctfCvwYeAGwS097gJ8Dh/S0HQnc3E4fBTwIPLln+TLgT4AZwCZgbs+ytwCXtNMnAcvb6euBNwPnt/O3AIf3LDu65zH2b/+WXYHZ7d/yjC3006Q1ttPnAH/Ws+4vgF171t0AvKCdvrX9G54y7H9/b8O7eaSg6ewnPdP/QvPiD8076ls2L6iqf6Z5539Az/q39Sz/JvC/ad6B35lkSZKnAGPAk4Arkvw0yU+Br7ftm91bVT/vmb+l3f9M4PG9dbTTm2tYCbwoyVNpAuQLwAuTzKZ5p35lu97TgQt69n898BCw30R/yyQmq3Eid1fVgz3zvf36e8ArgVuSrExy5Fb2q52QoaAd0R00L6YAJHky8BvA7T3rPGL436r6q6p6HnAozemidwN30bxzPrSq9mpve1bV7j2b7t0+/mYHtfu/i+Yd/dPHLbu93d8amhfctwOrqup+mpBbBHynqn7VbnMb8Iqe/e9VVU+oqkn/lglMVuOjUlXfq6r5wL7AV2mOODRiDAXtiP4OODHJYUl2A/4cuKyq1k60cpLfSvL8JI+jOV30r8BD7QvzZ4CPJdm3XfeAJOPP3f9pkscneRHwauCLVfUQzYvmGUn2SPJ04L8Bn+/ZbiXwNh7+/OCScfMAn24f4+nt/seSzN+GPvm1Gh/Nxu22f5Bkz6p6ALiP5ohFI8ZQ0A6nqlbQnNf/MrAeOARYsIVNnkLz4n8vzamVu4GPtMveC6wBLk1yH/D3QO+HtD9pt7sDOBd4a1X9qF12Ck3I3AR8hyaszu7ZdiWwB7BqknmAjwPLgYuS3A9cSvMB+aOxpRofjTcCa9t+eCtwwjY8hnZwqfJHdqSJJDkK+HxVzRpyKdLAeKQgSeoYCpKkTt9CIcnZ7VWT14xrPyXJDUmuTfKhnvbT2itUb5jggz5p4KrqEk8dadT0cyyVc2i+G/65zQ1JfgeYDzynqjb1fONjLs0HhYfSfL/675M8s/2GhyRpQPoWClW1qr1Qp9cfAWdW1aZ2nQ1t+3yaqz03ATcnWQMcAXx3S/uYOXNmzZ49fheSpC254oor7qqqsYmWDXrUxWfSXOV5Bs13xU+tqu/RXAV6ac9663jk1amdJItoLgDioIMOYvXq1f2tWJJ2MklumWzZoD9o3hXYm2YMmncDy5KEZgya8Sb8rmxVLamqeVU1b2xswqCTJG2jQYfCOuAr1bgc+BXNGDLrgAN71pvFNlymL0l6bAYdCl+lGbGSJM+kGVDsLporOhe0Y9sfDMwBLh9wbZI08vr2mUKS82iG6p2Z5jdkT6cZAuDs9muqvwQWVnNJ9bVJlgHX0QwDfLLfPJKkwduhh7mYN29e+UGzJD06Sa6oqnkTLfOKZklSx1CQJHUMBUlSx1CQJHUGfUXztDJ78deGXYLGWXvmq4ZdgjTSPFKQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHUMBUlSx1CQJHX6FgpJzk6yof095vHLTk1SSWb2tJ2WZE2SG5Ic06+6JEmT6+eRwjnAseMbkxwIvAy4tadtLrAAOLTd5lNJZvSxNknSBPoWClW1CrhngkUfA94DVE/bfOD8qtpUVTcDa4Aj+lWbJGliA/1MIclrgNur6ofjFh0A3NYzv65tm+gxFiVZnWT1xo0b+1SpJI2mgYVCkicB7wPeP9HiCdpqgjaqaklVzauqeWNjY9uzREkaeYP8Oc5DgIOBHyYBmAV8P8kRNEcGB/asOwu4Y4C1SZIY4JFCVV1dVftW1eyqmk0TBIdX1U+A5cCCJLslORiYA1w+qNokSY1+fiX1POC7wLOSrEty0mTrVtW1wDLgOuDrwMlV9VC/apMkTaxvp4+q6vitLJ89bv4M4Ix+1SNJ2jqvaJYkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVKnn7/RfHaSDUmu6Wn7cJIfJbkqyQVJ9upZdlqSNUluSHJMv+qSJE2un0cK5wDHjmu7GHh2VT0H+DFwGkCSucAC4NB2m08lmdHH2iRJE+hbKFTVKuCecW0XVdWD7eylwKx2ej5wflVtqqqbgTXAEf2qTZI0sWF+pvAm4P+10wcAt/UsW9e2/Zoki5KsTrJ648aNfS5RkkbLUEIhyfuAB4FzNzdNsFpNtG1VLamqeVU1b2xsrF8lStJI2nXQO0yyEHg1cHRVbX7hXwcc2LPaLOCOQdcmSaNuoEcKSY4F3gu8pqr+pWfRcmBBkt2SHAzMAS4fZG2SpD4eKSQ5DzgKmJlkHXA6zbeNdgMuTgJwaVW9taquTbIMuI7mtNLJVfVQv2qTJE2sb6FQVcdP0HzWFtY/AzijX/VIkrbOK5olSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLU6VsoJDk7yYYk1/S07ZPk4iQ3tvd79yw7LcmaJDckOaZfdUmSJtfPI4VzgGPHtS0GVlTVHGBFO0+SucAC4NB2m08lmdHH2iRJE+hbKFTVKuCecc3zgaXt9FLguJ7286tqU1XdDKwBjuhXbZKkiQ36M4X9qmo9QHu/b9t+AHBbz3rr2rZfk2RRktVJVm/cuLGvxUrSqJkuHzRngraaaMWqWlJV86pq3tjYWJ/LkqTRMuhQuDPJ/gDt/Ya2fR1wYM96s4A7BlybJI28QYfCcmBhO70QuLCnfUGS3ZIcDMwBLh9wbZI08nbt1wMnOQ84CpiZZB1wOnAmsCzJScCtwOsAquraJMuA64AHgZOr6qF+1SZJmljfQqGqjp9k0dGTrH8GcEa/6pEkbd10+aBZkjQNGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpI6hIEnqGAqSpM6UQiHJiqm0SZJ2bFsc+yjJE4An0QxqtzcP/+7BU4Cn9bk2SdKAbW1AvLcA76QJgCt4OBTuAz7Zv7IkScOwxVCoqo8DH09ySlV9YkA1SZKGZEpDZ1fVJ5L8NjC7d5uq+lyf6pIkDcGUQiHJ3wKHAFcCm3/8pgBDQZJ2IlP9kZ15wNyqqn4WI0karqlep3AN8NR+FiJJGr6pHinMBK5LcjmwaXNjVb1mW3aa5L8Cb6Y5BXU1cCLNV1+/QPO5xVrg9VV177Y8viRp20w1FD6wvXaY5ADg7TSno36RZBmwAJgLrKiqM5MsBhYD791e+5Ukbd1Uv320sg/7fWKSB2iOEO4ATgOOapcvBS7BUJCkgZrqMBf3J7mvvf1rkoeS3LctO6yq24GPALcC64GfVdVFwH5Vtb5dZz2w77Y8viRp2031SGGP3vkkxwFHbMsO2+Ey5gMHAz8FvpjkhEex/SJgEcBBBx20LSVIkiaxTaOkVtVXgZdu4z5/F7i5qjZW1QPAV4DfBu5Msj9Ae79hkn0vqap5VTVvbGxsG0uQJE1kqhevvbZndhea6xa29ZqFW4EXJHkS8AvgaGA18HNgIXBme3/hNj6+JGkbTfXbR/+xZ/pBmq+Mzt+WHVbVZUm+BHy/fawfAEuA3YFlSU6iCY7XbcvjS5K23VQ/Uzhxe+60qk4HTh/XvInmqEGSNCRT/fbRrCQXJNmQ5M4kX04yq9/FSZIGa6ofNH8WWE7zuwoHAP+nbZMk7USmGgpjVfXZqnqwvZ0D+NUfSdrJTDUU7kpyQpIZ7e0E4O5+FiZJGryphsKbgNcDP6G5Cvn3aQaxkyTtRKb6ldT/CSzcPGppkn1ohqp4U78KkyQN3lSPFJ7TO4x1Vd0DPLc/JUmShmWqobBLO2YR0B0pTPUoQ5K0g5jqC/tfAP/YXolcNJ8vnNG3qiRJQzHVK5o/l2Q1zSB4AV5bVdf1tTJJ0sBN+RRQGwIGgSTtxLZp6GxJ0s7JUJAkdQwFSVLHUJAkdQwFSVLHUJAkdQwFSVJnKKGQZK8kX0ryoyTXJzkyyT5JLk5yY3u/99YfSZK0PQ3rSOHjwNer6t8C/x64HlgMrKiqOcCKdl6SNEADD4UkTwFeDJwFUFW/rKqfAvOBpe1qS4HjBl2bJI26YRwpPAPYCHw2yQ+S/E2SJwP7VdV6gPZ+3yHUJkkjbRihsCtwOPDXVfVc4Oc8ilNFSRYlWZ1k9caNG/tVoySNpGGEwjpgXVVd1s5/iSYk7kyyP0B7v2GijatqSVXNq6p5Y2NjAylYkkbFwEOhqn4C3JbkWW3T0TSjry4HFrZtC4ELB12bJI26Yf162inAuUkeD9wEnEgTUMuSnATcCrxuSLVJ0sgaSihU1ZXAvAkWHT3gUiRJPbyiWZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSR1DQZLUMRQkSZ1h/UYzSWYAq4Hbq+rVSfYBvgDMBtYCr6+qe4dVn4Zj9uKvDbsEjbP2zFcNuwQN0DCPFN4BXN8zvxhYUVVzgBXtvCRpgIYSCklmAa8C/qaneT6wtJ1eChw34LIkaeQN60jhL4H3AL/qaduvqtYDtPf7DqEuSRppAw+FJK8GNlTVFdu4/aIkq5Os3rhx43auTpJG2zCOFF4IvCbJWuB84KVJPg/cmWR/gPZ+w0QbV9WSqppXVfPGxsYGVbMkjYSBh0JVnVZVs6pqNrAA+GZVnQAsBxa2qy0ELhx0bZI06qbTdQpnAi9LciPwsnZekjRAQ7tOAaCqLgEuaafvBo4eZj2SNOqm05GCJGnIDAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1Bh4KSQ5M8q0k1ye5Nsk72vZ9klyc5Mb2fu9B1yZJo24YRwoPAu+qqt8EXgCcnGQusBhYUVVzgBXtvCRpgAYeClW1vqq+307fD1wPHADMB5a2qy0Fjht0bZI06ob6mUKS2cBzgcuA/apqPTTBAew7yTaLkqxOsnrjxo0Dq1WSRsHQQiHJ7sCXgXdW1X1T3a6qllTVvKqaNzY21r8CJWkEDSUUkjyOJhDOraqvtM13Jtm/Xb4/sGEYtUnSKBvGt48CnAVcX1Uf7Vm0HFjYTi8ELhx0bZI06nYdwj5fCLwRuDrJlW3bHwNnAsuSnATcCrxuCLVJ0kgbeChU1XeATLL46EHWIkl6JK9oliR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUsdQkCR1DAVJUmcYv6cgaQcye/HXhl2CJrD2zFf15XE9UpAkdQwFSVLHUJAkdaZdKCQ5NskNSdYkWTzseiRplEyrUEgyA/gk8ApgLnB8krnDrUqSRse0CgXgCGBNVd1UVb8EzgfmD7kmSRoZ0+0rqQcAt/XMrwOe37tCkkXAonb2n5PcsB32OxO4azs8zs7K/pmcfTM5+2Zyj7lv8sHHtP+nT7ZguoVCJmirR8xULQGWbNedJqurat72fMydif0zOftmcvbN5KZz30y300frgAN75mcBdwypFkkaOdMtFL4HzElycJLHAwuA5UOuSZJGxrQ6fVRVDyZ5G/ANYAZwdlVdO4Bdb9fTUTsh+2dy9s3k7JvJTdu+SVVtfS1J0kiYbqePJElDZChIkjojGQpJ1ia5OsmVSVa3bfskuTjJje393sOucxCSnJ1kQ5Jretom7Yskp7VDkNyQ5JjhVD0Yk/TNB5Lc3j53rkzyyp5lo9Q3Byb5VpLrk1yb5B1t+8g/d7bQNzvGc6eqRu4GrAVmjmv7ELC4nV4MfHDYdQ6oL14MHA5cs7W+oBl65IfAbsDBwD8BM4b9Nwy4bz4AnDrBuqPWN/sDh7fTewA/bvtg5J87W+ibHeK5M5JHCpOYDyxtp5cCxw2vlMGpqlXAPeOaJ+uL+cD5VbWpqm4G1tAMTbJTmqRvJjNqfbO+qr7fTt8PXE8zIsHIP3e20DeTmVZ9M6qhUMBFSa5oh80A2K+q1kPzjwrsO7Tqhm+yvphoGJItPdl3Vm9LclV7emnz6ZGR7Zsks4HnApfhc+cRxvUN7ADPnVENhRdW1eE0o7GenOTFwy5oB7HVYUhGwF8DhwCHAeuBv2jbR7JvkuwOfBl4Z1Xdt6VVJ2jbqftngr7ZIZ47IxkKVXVHe78BuIDmUO3OJPsDtPcbhlfh0E3WFyM/DElV3VlVD1XVr4DP8PBh/sj1TZLH0bzonVtVX2mbfe4wcd/sKM+dkQuFJE9OssfmaeDlwDU0w2ksbFdbCFw4nAqnhcn6YjmwIMluSQ4G5gCXD6G+odn8gtf6TzTPHRixvkkS4Czg+qr6aM+ikX/uTNY3O8xzZ9if1A/6BjyD5pP+HwLXAu9r238DWAHc2N7vM+xaB9Qf59Ecyj5A847lpC31BfA+mm9H3AC8Ytj1D6Fv/ha4GriK5j/z/iPaN/+B5hTHVcCV7e2VPne22Dc7xHPHYS4kSZ2RO30kSZqcoSBJ6hgKkqSOoSBJ6hgKkqSOoaCRkGR272inkiZmKEhbkWRa/WztZHaUOjW9GQoaJTOSfKYd4/6iJE9McliSS9tByi7YPEhZkkuS/HmSlcA7krwuyTVJfphkVbvOjCQfTvK9dvu3tO1HJVnVPt51ST6dZJd22fFpfsvjmiQfbNten+Sj7fQ7ktzUTh+S5Dvt9POSrGwHcfxGz1ASj6hzsN2pnZHvLDRK5gDHV9UfJlkG/B7wHuCUqlqZ5H8ApwPvbNffq6peApDkauCYqro9yV7t8pOAn1XVbyXZDfiHJBe1y46gGSf/FuDrwGuT/CPwQeB5wL00I/UeB6wC3t1u9yLg7iQH0FwZ++12HJ1PAPOramOSNwBnAG8aX6f0WBkKGiU3V9WV7fQVNCNW7lVVK9u2pcAXe9b/Qs/0PwDntGGyefC3lwPPSfL77fyeNMHzS+Dyqtr8jv88mhf4B4BLqmpj234u8OKq+mqS3dsxuQ4E/o7mB35e1O7rWcCzgYubYXWYQTP8xkR1So+JoaBRsqln+iFgr62s//PNE1X11iTPB14FXJnkMJohj0+pqm/0bpTkKH596ONi4iGSN/sucCLN2DffpjkKOBJ4F3AQcG1VHbm1OqXHys8UNMp+Btyb5EXt/BuBlROtmOSQqrqsqt4P3EXzjv4bwB+1p3dI8sx25F2AI5Ic3H6W8AbgOzQ/tPKSJDOTzACO79nfKuDU9v4HwO8Am6rqZzRBMZbkyHY/j0ty6PbrBulhHilo1C0EPp3kScBNNO/WJ/LhJHNo3u2voBll9ypgNvD9drjkjTz885PfBc4E/h3NC/0FVfWrJKcB32of5/9W1eahpb9NEzSrquqhJLcBPwKoql+2p6j+KsmeNP9v/5JmlF9pu3KUVGk7a08fnVpVrx5yKdKj5ukjSVLHIwVJUscjBUlSx1CQJHUMBUlSx1CQJHUMBUlS5/8D72GK8enPX1MAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib as plt\n",
"from matplotlib import pyplot\n",
"\n",
"\n",
"# draw historgram of attribute \"horsepower\" with bins = 3\n",
"plt.pyplot.hist(df[\"horsepower\"], bins = 3)\n",
"\n",
"# set x/y labels and plot title\n",
"plt.pyplot.xlabel(\"horsepower\")\n",
"plt.pyplot.ylabel(\"count\")\n",
"plt.pyplot.title(\"horsepower bins\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above shows the binning result for attribute \"horsepower\". \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"indicator\">Indicator variable (or dummy variable)</h2>\n",
"<b>What is an indicator variable?</b>\n",
"<p>\n",
" An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. \n",
"</p>\n",
"\n",
"<b>Why we use indicator variables?</b>\n",
"\n",
"<p>\n",
" So we can use categorical variables for regression analysis in the later modules.\n",
"</p>\n",
"<b>Example</b>\n",
"<p>\n",
" We see the column \"fuel-type\" has two unique values, \"gas\" or \"diesel\". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert \"fuel-type\" into indicator variables.\n",
"</p>\n",
"\n",
"<p>\n",
" We will use the panda's method 'get_dummies' to assign numerical values to different categories of fuel type. \n",
"</p>\n"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['symboling', 'normalized-losses', 'make', 'aspiration', 'num-of-doors',\n",
" 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length',\n",
" 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',\n",
" 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio',\n",
" 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price',\n",
" 'city-L/100km', 'highway-L/100km', 'horsepower-binned',\n",
" 'fuel-type-diesel', 'fuel-type-gas', 'fuel-type-diesel',\n",
" 'fuel-type-gas', 'fuel-type-diesel', 'fuel-type-gas',\n",
" 'fuel-type-diesel', 'fuel-type-gas'],\n",
" dtype='object')"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"get indicator variables and assign it to data frame \"dummy_variable_1\" \n"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "'fuel-type'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/conda/envs/python/lib/python3.6/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2897\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2898\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2899\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine._get_loc_duplicates\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine._maybe_get_bool_indexer\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine._unpack_bool_indexer\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'fuel-type'",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-73-5c0c9c9ea224>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdummy_variable_1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_dummies\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"fuel-type\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdummy_variable_1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/conda/envs/python/lib/python3.6/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 2904\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2905\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2906\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2907\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2908\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/conda/envs/python/lib/python3.6/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 2898\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2899\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2900\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2901\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2902\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtolerance\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'fuel-type'"
]
}
],
"source": [
"dummy_variable_1 = pd.get_dummies(df[\"fuel-type\"])\n",
"dummy_variable_1.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"change column names for clarity \n"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fuel-type-diesel</th>\n",
" <th>fuel-type-gas</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fuel-type-diesel fuel-type-gas\n",
"0 0 1\n",
"1 0 1\n",
"2 0 1\n",
"3 0 1\n",
"4 0 1"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)\n",
"dummy_variable_1.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the dataframe, column fuel-type has a value for 'gas' and 'diesel'as 0s and 1s now.\n"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"ename": "KeyError",
"evalue": "\"['fuel-type'] not found in axis\"",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-75-a92dbd6eade8>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;31m# drop original column \"fuel-type\" from \"df\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"fuel-type\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minplace\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/conda/envs/python/lib/python3.6/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36mdrop\u001b[0;34m(self, labels, axis, index, columns, level, inplace, errors)\u001b[0m\n\u001b[1;32m 4172\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4173\u001b[0m \u001b[0minplace\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0minplace\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4174\u001b[0;31m \u001b[0merrors\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4175\u001b[0m )\n\u001b[1;32m 4176\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/conda/envs/python/lib/python3.6/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mdrop\u001b[0;34m(self, labels, axis, index, columns, level, inplace, errors)\u001b[0m\n\u001b[1;32m 3887\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;32min\u001b[0m \u001b[0maxes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3888\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlabels\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3889\u001b[0;31m \u001b[0mobj\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_drop_axis\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3890\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3891\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0minplace\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/conda/envs/python/lib/python3.6/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m_drop_axis\u001b[0;34m(self, labels, axis, level, errors)\u001b[0m\n\u001b[1;32m 3940\u001b[0m \u001b[0mlabels_missing\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_indexer_for\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlabels\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3941\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"raise\"\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mlabels_missing\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3942\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf\"{labels} not found in axis\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3943\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3944\u001b[0m \u001b[0mslicer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mslice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndim\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: \"['fuel-type'] not found in axis\""
]
}
],
"source": [
"# merge data frame \"df\" and \"dummy_variable_1\" \n",
"df = pd.concat([df, dummy_variable_1], axis=1)\n",
"\n",
"# drop original column \"fuel-type\" from \"df\"\n",
"df.drop(\"fuel-type\", axis = 1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>symboling</th>\n",
" <th>normalized-losses</th>\n",
" <th>make</th>\n",
" <th>aspiration</th>\n",
" <th>num-of-doors</th>\n",
" <th>body-style</th>\n",
" <th>drive-wheels</th>\n",
" <th>engine-location</th>\n",
" <th>wheel-base</th>\n",
" <th>length</th>\n",
" <th>...</th>\n",
" <th>fuel-type-diesel</th>\n",
" <th>fuel-type-gas</th>\n",
" <th>fuel-type-diesel</th>\n",
" <th>fuel-type-gas</th>\n",
" <th>fuel-type-diesel</th>\n",
" <th>fuel-type-gas</th>\n",
" <th>fuel-type-diesel</th>\n",
" <th>fuel-type-gas</th>\n",
" <th>fuel-type-diesel</th>\n",
" <th>fuel-type-gas</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>0.811148</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>convertible</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>88.6</td>\n",
" <td>0.811148</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>122</td>\n",
" <td>alfa-romero</td>\n",
" <td>std</td>\n",
" <td>two</td>\n",
" <td>hatchback</td>\n",
" <td>rwd</td>\n",
" <td>front</td>\n",
" <td>94.5</td>\n",
" <td>0.822681</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>fwd</td>\n",
" <td>front</td>\n",
" <td>99.8</td>\n",
" <td>0.848630</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>164</td>\n",
" <td>audi</td>\n",
" <td>std</td>\n",
" <td>four</td>\n",
" <td>sedan</td>\n",
" <td>4wd</td>\n",
" <td>front</td>\n",
" <td>99.4</td>\n",
" <td>0.848630</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 38 columns</p>\n",
"</div>"
],
"text/plain": [
" symboling normalized-losses make aspiration num-of-doors \\\n",
"0 3 122 alfa-romero std two \n",
"1 3 122 alfa-romero std two \n",
"2 1 122 alfa-romero std two \n",
"3 2 164 audi std four \n",
"4 2 164 audi std four \n",
"\n",
" body-style drive-wheels engine-location wheel-base length ... \\\n",
"0 convertible rwd front 88.6 0.811148 ... \n",
"1 convertible rwd front 88.6 0.811148 ... \n",
"2 hatchback rwd front 94.5 0.822681 ... \n",
"3 sedan fwd front 99.8 0.848630 ... \n",
"4 sedan 4wd front 99.4 0.848630 ... \n",
"\n",
" fuel-type-diesel fuel-type-gas fuel-type-diesel fuel-type-gas \\\n",
"0 0 1 0 1 \n",
"1 0 1 0 1 \n",
"2 0 1 0 1 \n",
"3 0 1 0 1 \n",
"4 0 1 0 1 \n",
"\n",
" fuel-type-diesel fuel-type-gas fuel-type-diesel fuel-type-gas \\\n",
"0 0 1 0 1 \n",
"1 0 1 0 1 \n",
"2 0 1 0 1 \n",
"3 0 1 0 1 \n",
"4 0 1 0 1 \n",
"\n",
" fuel-type-diesel fuel-type-gas \n",
"0 0 1 \n",
"1 0 1 \n",
"2 0 1 \n",
"3 0 1 \n",
"4 0 1 \n",
"\n",
"[5 rows x 38 columns]"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last two columns are now the indicator variable representation of the fuel-type variable. It's all 0s and 1s now.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Question #4: </h1>\n",
"\n",
"<b>As above, create indicator variable to the column of \"aspiration\"</b>\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>aspiration-std</th>\n",
" <th>aspiration-turbo</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" aspiration-std aspiration-turbo\n",
"0 1 0\n",
"1 1 0\n",
"2 1 0\n",
"3 1 0\n",
"4 1 0"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Write your code below and press Shift+Enter to execute \n",
"\n",
"dummy_variable_2 = pd.get_dummies(df['aspiration'])\n",
"\n",
"dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)\n",
"\n",
"dummy_variable_2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"# get indicator variables of aspiration and assign it to data frame \"dummy_variable_2\"\n",
"dummy_variable_2 = pd.get_dummies(df['aspiration'])\n",
"\n",
"# change column names for clarity\n",
"dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)\n",
"\n",
"# show first 5 instances of data frame \"dummy_variable_1\"\n",
"dummy_variable_2.head()\n",
"\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" <div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Question #5: </h1>\n",
"\n",
"<b>Merge the new dataframe to the original dataframe then drop the column 'aspiration'</b>\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code below and press Shift+Enter to execute \n",
"# merge the new dataframe to the original datafram\n",
"df = pd.concat([df, dummy_variable_2], axis=1)\n",
"\n",
"# drop original column \"aspiration\" from \"df\"\n",
"df.drop('aspiration', axis = 1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"# merge the new dataframe to the original datafram\n",
"df = pd.concat([df, dummy_variable_2], axis=1)\n",
"\n",
"# drop original column \"aspiration\" from \"df\"\n",
"df.drop('aspiration', axis = 1, inplace=True)\n",
"\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the new csv \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_csv('clean_df.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"## Author\n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a>\n",
"\n",
"### Other Contributors\n",
"\n",
"<a href=\"https://www.linkedin.com/in/mahdi-noorian-58219234/\" target=\"_blank\">Mahdi Noorian PhD</a>\n",
"\n",
"Bahare Talayian\n",
"\n",
"Eric Xiao\n",
"\n",
"Steven Dong\n",
"\n",
"Parizad\n",
"\n",
"Hima Vasudevan\n",
"\n",
"<a href=\"https://www.linkedin.com/in/fiorellawever/\" target=\"_blank\">Fiorella Wenver</a>\n",
"\n",
"<a href=\" https://www.linkedin.com/in/yi-leng-yao-84451275/ \" target=\"_blank\" >Yi Yao</a>.\n",
"\n",
"## Change Log\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ---------- | ----------------------------------- |\n",
"| 2020-10-30 | 2.2 | Lakshmi | Changed URL of csv |\n",
"| 2020-09-09 | 2.1 | Lakshmi | Updated Indicator Variables section |\n",
"| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n",
"\n",
"<hr>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment