Decision-Tree Classification with Python and Scikit-Learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision Tree Classification with Python and Scikit-Learn\n",
"\n",
"\n",
"In this project, I build a Decision Tree Classifier to predict the safety of the car. I build two models, one with criterion `gini index` and another one with criterion `entropy`. I implement Decision Tree Classification with Python and Scikit-Learn. I have used the **Car Evaluation Data Set** for this project, downloaded from the UCI Machine Learning Repository website."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction to Decision Tree algorithm\n",
"2.\tClassification and Regression Trees\n",
"3.\tDecision Tree algorithm intuition\n",
"4.\tAttribute selection measures\n",
" - Information gain\n",
" - Gini index\n",
"5.\tThe problem statement\n",
"6.\tDataset description\n",
"7.\tImport libraries\n",
"8.\tImport dataset\n",
"9.\tExploratory data analysis\n",
"10.\tDeclare feature vector and target variable\n",
"11.\tSplit data into separate training and test set\n",
"12.\tFeature engineering\n",
"13.\tDecision Tree classifier with criterion gini-index\n",
"14.\tDecision Tree classifier with criterion entropy\n",
"15.\tConfusion matrix\n",
"16.\tClassification report\n",
"17.\tResults and conclusion\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to Decision Tree algorithm\n",
"\n",
"\n",
"A Decision Tree algorithm is one of the most popular machine learning algorithms. It uses a tree like structure and their possible combinations to solve a particular problem. It belongs to the class of supervised learning algorithms where it can be used for both classification and regression purposes. \n",
"\n",
"\n",
"A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Classification and Regression Trees (CART)\n",
"\n",
"\n",
"Nowadays, Decision Tree algorithm is known by its modern name **CART** which stands for **Classification and Regression Trees**.\n",
"Classification and Regression Trees or **CART** is a term introduced by Leo Breiman to refer to Decision Tree algorithms that can be used for classification and regression modeling problems.The CART algorithm provides a foundation for other important algorithms like bagged decision trees, random forest and boosted decision trees.\n",
"\n",
"\n",
"In this project, I will solve a classification problem. So, I will refer the algorithm also as Decision Tree Classification problem. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Decision Tree algorithm intuition\n",
"\n",
"The Decision-Tree algorithm is one of the most frequently and widely used supervised machine learning algorithms that can be used for both classification and regression tasks. The intuition behind the Decision-Tree algorithm is very simple to understand.\n",
"\n",
"\n",
"The Decision Tree algorithm intuition is as follows:-\n",
"\n",
"\n",
"1.\tFor each attribute in the dataset, the Decision-Tree algorithm forms a node. The most important attribute is placed at the root node. \n",
"\n",
"2.\tFor evaluating the task in hand, we start at the root node and we work our way down the tree by following the corresponding node that meets our condition or decision.\n",
"\n",
"3.\tThis process continues until a leaf node is reached. It contains the prediction or the outcome of the Decision Tree.\n"
]
},
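{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the three steps above concrete, here is a minimal sketch of how a prediction walks from the root node to a leaf. The tree is hand-written for illustration only: the attribute names `safety` and `persons` come from the dataset used later, but the splits, the leaf labels and the names `toy_tree` and `predict` are assumptions, not a tree learned from the data.\n",
"\n",
"```python\n",
"# A hand-built toy tree (hypothetical, not learned from data).\n",
"# Internal nodes test one attribute; leaves hold a class label.\n",
"toy_tree = {\n",
"    'attribute': 'safety',\n",
"    'branches': {\n",
"        'low': {'label': 'unacc'},\n",
"        'med': {'label': 'acc'},\n",
"        'high': {\n",
"            'attribute': 'persons',\n",
"            'branches': {\n",
"                '2': {'label': 'unacc'},\n",
"                '4': {'label': 'acc'},\n",
"                'more': {'label': 'acc'},\n",
"            },\n",
"        },\n",
"    },\n",
"}\n",
"\n",
"\n",
"def predict(node, sample):\n",
"    # Follow the branch matching the sample until a leaf is reached.\n",
"    while 'label' not in node:\n",
"        value = sample[node['attribute']]\n",
"        node = node['branches'][value]\n",
"    return node['label']\n",
"\n",
"\n",
"print(predict(toy_tree, {'safety': 'high', 'persons': '2'}))\n",
"```\n",
"\n",
"A nested dictionary keeps the sketch short. A fitted scikit-learn tree stores the same information in arrays rather than nested objects."
]
},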
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Attribute selection measures\n",
"\n",
"\n",
"The primary challenge in the Decision Tree implementation is to identify the attributes which we consider as the root node and each level. This process is known as the **attributes selection**. There are different attributes selection measure to identify the attribute which can be considered as the root node at each level.\n",
"\n",
"\n",
"There are 2 popular attribute selection measures. They are as follows:-\n",
"\n",
"\n",
"- **Information gain**\n",
"\n",
"- **Gini index**\n",
"\n",
"\n",
"While using **Information gain** as a criterion, we assume attributes to be categorical and for **Gini index** attributes are assumed to be continuous. These attribute selection measures are described below.\n",
"\n",
"\n",
"### Information gain\n",
"\n",
"\n",
"By using information gain as a criterion, we try to estimate the information contained by each attribute. To understand the concept of Information Gain, we need to know another concept called **Entropy**. \n",
"\n",
"\n",
"Entropy measures the impurity in the given dataset. In Physics and Mathematics, entropy is referred to as the randomness or uncertainty of a random variable X. In information theory, it refers to the impurity in a group of examples. **Information gain** is the decrease in entropy. Information gain computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values. \n",
"\n",
"\n",
"The ID3 (Iterative Dichotomiser) Decision Tree algorithm uses entropy to calculate information gain. So, by calculating decrease in **entropy measure** of each attribute we can calculate their information gain. The attribute with the highest information gain is chosen as the splitting attribute at the node.\n",
"\n",
"\n",
"### Gini index\n",
"\n",
"\n",
"Another attribute selection measure that **CART (Categorical and Regression Trees)** uses is the **Gini index**. It uses the Gini method to create split points. \n",
"\n",
"Gini index says, if we randomly select two items from a population, they must be of the same class and probability for this is 1 if the population is pure.\n",
"\n",
"It works with the categorical target variable “Success” or “Failure”. It performs only binary splits. The higher the value of Gini, higher the homogeneity. CART (Classification and Regression Tree) uses the Gini method to create binary splits.\n",
"\n",
"Steps to Calculate Gini for a split\n",
"\n",
"1.\tCalculate Gini for sub-nodes, using formula sum of the square of probability for success and failure (p^2+q^2).\n",
"\n",
"2.\tCalculate Gini for split using weighted Gini score of each node of that split.\n",
"\n",
"\n",
"In case of a discrete-valued attribute, the subset that gives the minimum gini index for that chosen is selected as a splitting attribute. In the case of continuous-valued attributes, the strategy is to select each pair of adjacent values as a possible split-point and point with smaller gini index chosen as the splitting point. The attribute with minimum Gini index is chosen as the splitting attribute.\n"
]
},
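{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two steps for calculating Gini for a split can be sketched as follows. This sketch uses the impurity form, 1 - sum(p_i^2), which is the quantity that is minimised when choosing a split; it equals one minus the sum-of-squares score from step 1. The function names and the toy child nodes are assumptions made for illustration.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"\n",
"def gini_impurity(labels):\n",
"    # 1 minus the sum of squared class proportions; 0 means a pure node.\n",
"    total = len(labels)\n",
"    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())\n",
"\n",
"\n",
"def gini_for_split(groups):\n",
"    # Weighted average impurity of the child nodes produced by a split.\n",
"    total = sum(len(g) for g in groups)\n",
"    return sum(len(g) / total * gini_impurity(g) for g in groups)\n",
"\n",
"\n",
"# Toy child nodes (assumed): one pure, one evenly mixed.\n",
"left = ['unacc', 'unacc', 'unacc', 'unacc']\n",
"right = ['acc', 'unacc', 'acc', 'unacc']\n",
"print(gini_impurity(left))\n",
"print(gini_impurity(right))\n",
"print(gini_for_split([left, right]))\n",
"```\n",
"\n",
"Among candidate splits, the one with the lowest weighted impurity would be preferred."
]
},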
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. The problem statement\n",
"\n",
"\n",
"The problem is to predict the safety of the car. In this project, I build a Decision Tree Classifier to predict the safety of the car. I implement Decision Tree Classification with Python and Scikit-Learn. I have used the **Car Evaluation Data Set** for this project, downloaded from the UCI Machine Learning Repository website.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Dataset description\n",
"\n",
"\n",
"I have used the **Car Evaluation Data Set** downloaded from the Kaggle website. I have downloaded this data set from the Kaggle website. The data set can be found at the following url:-\n",
"\n",
"\n",
"http://archive.ics.uci.edu/ml/datasets/Car+Evaluation\n",
"\n",
"\n",
"Car Evaluation Database was derived from a simple hierarchical decision model originally developed for expert system for decision making. The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety. \n",
"\n",
"It was donated by Marko Bohanec."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/car.data'\n",
"\n",
"df = pd.read_csv(data, header=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1728, 7)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 1728 instances and 7 variables in the data set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View top 5 rows of dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6\n",
"0 vhigh vhigh 2 2 small low unacc\n",
"1 vhigh vhigh 2 2 small med unacc\n",
"2 vhigh vhigh 2 2 small high unacc\n",
"3 vhigh vhigh 2 2 med low unacc\n",
"4 vhigh vhigh 2 2 med med unacc"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rename column names\n",
"\n",
"We can see that the dataset does not have proper column names. The columns are merely labelled as 0,1,2.... and so on. We should give proper names to the columns. I will do it as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']\n",
"\n",
"\n",
"df.columns = col_names\n",
"\n",
"col_names"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>med</td>\n",
" <td>med</td>\n",
" <td>unacc</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety class\n",
"0 vhigh vhigh 2 2 small low unacc\n",
"1 vhigh vhigh 2 2 small med unacc\n",
"2 vhigh vhigh 2 2 small high unacc\n",
"3 vhigh vhigh 2 2 med low unacc\n",
"4 vhigh vhigh 2 2 med med unacc"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's again preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the column names are renamed. Now, the columns have meaningful names."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary of dataset"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1728 entries, 0 to 1727\n",
"Data columns (total 7 columns):\n",
"buying 1728 non-null object\n",
"maint 1728 non-null object\n",
"doors 1728 non-null object\n",
"persons 1728 non-null object\n",
"lug_boot 1728 non-null object\n",
"safety 1728 non-null object\n",
"class 1728 non-null object\n",
"dtypes: object(7)\n",
"memory usage: 94.6+ KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Frequency distribution of values in variables\n",
"\n",
"Now, I will check the frequency counts of categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"med 432\n",
"low 432\n",
"vhigh 432\n",
"high 432\n",
"Name: buying, dtype: int64\n",
"med 432\n",
"low 432\n",
"vhigh 432\n",
"high 432\n",
"Name: maint, dtype: int64\n",
"5more 432\n",
"4 432\n",
"2 432\n",
"3 432\n",
"Name: doors, dtype: int64\n",
"4 576\n",
"2 576\n",
"more 576\n",
"Name: persons, dtype: int64\n",
"med 576\n",
"big 576\n",
"small 576\n",
"Name: lug_boot, dtype: int64\n",
"med 576\n",
"low 576\n",
"high 576\n",
"Name: safety, dtype: int64\n",
"unacc 1210\n",
"acc 384\n",
"good 69\n",
"vgood 65\n",
"Name: class, dtype: int64\n"
]
}
],
"source": [
"col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']\n",
"\n",
"\n",
"for col in col_names:\n",
" \n",
" print(df[col].value_counts()) \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `doors` and `persons` are categorical in nature. So, I will treat them as categorical variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of variables\n",
"\n",
"\n",
"- There are 7 variables in the dataset. All the variables are of categorical data type.\n",
"\n",
"\n",
"- These are given by `buying`, `maint`, `doors`, `persons`, `lug_boot`, `safety` and `class`.\n",
"\n",
"\n",
"- `class` is the target variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `class` variable"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"unacc 1210\n",
"acc 384\n",
"good 69\n",
"vgood 65\n",
"Name: class, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `class` target variable is ordinal in nature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Missing values in variables"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"buying 0\n",
"maint 0\n",
"doors 0\n",
"persons 0\n",
"lug_boot 0\n",
"safety 0\n",
"class 0\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in variables\n",
"\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in the dataset. I have checked the frequency distribution of values previously. It also confirms that there are no missing values in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(['class'], axis=1)\n",
"\n",
"y = df['class']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# split X and y into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1157, 6), (571, 6))"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of X_train and X_test\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Feature Engineering\n",
"\n",
"\n",
"**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.\n",
"\n",
"\n",
"First, I will check the data types of variables again."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"buying object\n",
"maint object\n",
"doors object\n",
"persons object\n",
"lug_boot object\n",
"safety object\n",
"dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check data types in X_train\n",
"\n",
"X_train.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode categorical variables\n",
"\n",
"\n",
"Now, I will encode the categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>vhigh</td>\n",
" <td>vhigh</td>\n",
" <td>3</td>\n",
" <td>more</td>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>468</th>\n",
" <td>high</td>\n",
" <td>vhigh</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>small</td>\n",
" <td>low</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>vhigh</td>\n",
" <td>high</td>\n",
" <td>3</td>\n",
" <td>more</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1721</th>\n",
" <td>low</td>\n",
" <td>low</td>\n",
" <td>5more</td>\n",
" <td>more</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1208</th>\n",
" <td>med</td>\n",
" <td>low</td>\n",
" <td>2</td>\n",
" <td>more</td>\n",
" <td>small</td>\n",
" <td>high</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety\n",
"48 vhigh vhigh 3 more med low\n",
"468 high vhigh 3 4 small low\n",
"155 vhigh high 3 more small high\n",
"1721 low low 5more more small high\n",
"1208 med low 2 more small high"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that all the variables are ordinal categorical data type."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# import category encoders\n",
"\n",
"import category_encoders as ce"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# encode variables with ordinal encoding\n",
"\n",
"encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])\n",
"\n",
"\n",
"X_train = encoder.fit_transform(X_train)\n",
"\n",
"X_test = encoder.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>468</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1721</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1208</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety\n",
"48 1 1 1 1 1 1\n",
"468 2 1 1 2 2 1\n",
"155 1 2 1 1 2 2\n",
"1721 3 3 2 1 2 2\n",
"1208 4 3 3 1 2 2"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>buying</th>\n",
" <th>maint</th>\n",
" <th>doors</th>\n",
" <th>persons</th>\n",
" <th>lug_boot</th>\n",
" <th>safety</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>599</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1201</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>628</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1498</th>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1263</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" buying maint doors persons lug_boot safety\n",
"599 2 2 4 3 1 2\n",
"1201 4 3 3 2 1 3\n",
"628 2 2 2 3 3 3\n",
"1498 3 2 2 2 1 3\n",
"1263 4 3 4 1 1 1"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.head()"
]
},
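{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that in the encoded preview of `X_train` above, `buying` appears as vhigh = 1, high = 2, low = 3 and med = 4, so the integer codes do not follow the natural order low < med < high < vhigh. A tree can still split on such codes, but thresholds are easier to interpret when the codes respect the order. The following is a minimal sketch of an order-preserving encoding, assuming a hand-written ordering for each feature. The names `ORDER`, `CODES` and `encode_row` are hypothetical, and the code is plain Python rather than the `category_encoders` call used above.\n",
"\n",
"```python\n",
"# Hypothetical explicit orderings for each feature, lowest to highest.\n",
"ORDER = {\n",
"    'buying': ['low', 'med', 'high', 'vhigh'],\n",
"    'maint': ['low', 'med', 'high', 'vhigh'],\n",
"    'doors': ['2', '3', '4', '5more'],\n",
"    'persons': ['2', '4', 'more'],\n",
"    'lug_boot': ['small', 'med', 'big'],\n",
"    'safety': ['low', 'med', 'high'],\n",
"}\n",
"\n",
"# Turn each ordered list into a value -> integer code lookup (codes start at 1).\n",
"CODES = {\n",
"    col: {value: i for i, value in enumerate(values, start=1)}\n",
"    for col, values in ORDER.items()\n",
"}\n",
"\n",
"\n",
"def encode_row(row):\n",
"    # Replace every categorical value with its order-preserving integer code.\n",
"    return {col: CODES[col][value] for col, value in row.items()}\n",
"\n",
"\n",
"# Row 48 from the X_train preview shown earlier.\n",
"row = {'buying': 'vhigh', 'maint': 'vhigh', 'doors': '3',\n",
"       'persons': 'more', 'lug_boot': 'med', 'safety': 'low'}\n",
"print(encode_row(row))\n",
"```\n",
"\n",
"With a pandas DataFrame, the same lookup could be applied column by column with `Series.map`, for example `X_train[col].map(CODES[col])`."
]
},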
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have training and test set ready for model building. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Decision Tree Classifier with criterion gini index"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# import DecisionTreeClassifier\n",
"\n",
"from sklearn.tree import DecisionTreeClassifier\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,\n",
" max_features=None, max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort=False, random_state=0,\n",
" splitter='best')"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# instantiate the DecisionTreeClassifier model with criterion gini index\n",
"\n",
"clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)\n",
"\n",
"\n",
"# fit the model\n",
"clf_gini.fit(X_train, y_train)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict the Test set results with criterion gini index"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"y_pred_gini = clf_gini.predict(X_test)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check accuracy score with criterion gini index"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with criterion gini index: 0.8021\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"print('Model accuracy score with criterion gini index: {0:0.4f}'. format(accuracy_score(y_test, y_pred_gini)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, **y_test** are the true class labels and **y_pred_gini** are the predicted class labels in the test-set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare the train-set and test-set accuracy\n",
"\n",
"\n",
"Now, I will compare the train-set and test-set accuracy to check for overfitting."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['unacc', 'unacc', 'unacc', ..., 'unacc', 'unacc', 'acc'],\n",
" dtype=object)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_train_gini = clf_gini.predict(X_train)\n",
"\n",
"y_pred_train_gini"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training-set accuracy score: 0.7865\n"
]
}
],
"source": [
"print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_gini)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for overfitting and underfitting"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.7865\n",
"Test set score: 0.8021\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(clf_gini.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(clf_gini.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the training-set accuracy score is 0.7865 while the test-set accuracy to be 0.8021. These two values are quite comparable. So, there is no sign of overfitting. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Decision Tree Classifier with criterion entropy"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,\n",
" max_features=None, max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort=False, random_state=0,\n",
" splitter='best')"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# instantiate the DecisionTreeClassifier model with criterion entropy\n",
"\n",
"clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)\n",
"\n",
"\n",
"# fit the model\n",
"clf_en.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict the Test set results with criterion entropy"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"y_pred_en = clf_en.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check accuracy score with criterion entropy"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with criterion entropy: 0.8021\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"print('Model accuracy score with criterion entropy: {0:0.4f}'. format(accuracy_score(y_test, y_pred_en)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare the train-set and test-set accuracy\n",
"\n",
"\n",
"Now, I will compare the train-set and test-set accuracy to check for overfitting."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['unacc', 'unacc', 'unacc', ..., 'unacc', 'unacc', 'acc'],\n",
" dtype=object)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_train_en = clf_en.predict(X_train)\n",
"\n",
"y_pred_train_en"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training-set accuracy score: 0.7865\n"
]
}
],
"source": [
"print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train_en)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for overfitting and underfitting"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.7865\n",
"Test set score: 0.8021\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(clf_en.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(clf_en.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training-set and test-set scores are the same as above: the training-set accuracy is 0.7865 and the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the above analysis, the model reaches an accuracy of about 0.80 on the test set, so it predicts the correct class label for roughly four out of five cars.\n",
"\n",
"\n",
"However, accuracy alone does not show how the predictions are distributed across the classes. It also tells us nothing about the types of errors our classifier is making.\n",
"\n",
"\n",
"We have another tool called the `Confusion matrix` that comes to our rescue."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Confusion matrix\n",
"\n",
"\n",
"A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
"\n",
"\n",
"Four types of outcomes are possible when evaluating a classification model. With more than two classes, as in this dataset, they are counted for one class at a time, treating every other class as the negative case. These four outcomes are described below:\n",
"\n",
"\n",
"**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
"\n",
"\n",
"**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
"\n",
"\n",
"**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
"\n",
"\n",
"\n",
"**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This type of error is called **Type II error.**\n",
"\n",
"\n",
"\n",
"These four outcomes are summarized in a confusion matrix given below.\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[ 73 0 56 0]\n",
" [ 20 0 0 0]\n",
" [ 12 0 385 0]\n",
" [ 25 0 0 0]]\n"
]
}
],
"source": [
"# Print the confusion matrix (rows are true classes, columns are predicted classes)\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm = confusion_matrix(y_test, y_pred_en)\n",
"\n",
"print('Confusion matrix\\n\\n', cm)\n",
"\n"
]
},
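{
"cell_type": "markdown",
"metadata": {},
"source": [
"With four classes, the four outcomes can be counted for one class at a time, treating all other classes as the negative case. The cell below is a minimal sketch of that counting, not part of the original analysis. The matrix is copied from the printed output, the label order assumes the sorted ordering that `confusion_matrix` uses by default, and the names `cm_printed`, `tp`, `fp`, `fn` and `tn` are only illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Confusion matrix copied from the printed output above.\n",
"# Rows are true classes, columns are predicted classes.\n",
"cm_printed = np.array([[ 73, 0,  56, 0],\n",
"                       [ 20, 0,   0, 0],\n",
"                       [ 12, 0, 385, 0],\n",
"                       [ 25, 0,   0, 0]])\n",
"\n",
"# Assumption: labels appear in sorted order, the default used by confusion_matrix.\n",
"labels = ['acc', 'good', 'unacc', 'vgood']\n",
"\n",
"tp = np.diag(cm_printed)               # predicted the class, and it was that class\n",
"fp = cm_printed.sum(axis=0) - tp       # predicted the class, but it was another class\n",
"fn = cm_printed.sum(axis=1) - tp       # it was the class, but another class was predicted\n",
"tn = cm_printed.sum() - tp - fp - fn   # neither predicted nor actual\n",
"\n",
"for i, label in enumerate(labels):\n",
"    print('{:>5}: TP={:3d}  FP={:3d}  FN={:3d}  TN={:3d}'.format(\n",
"        label, int(tp[i]), int(fp[i]), int(fn[i]), int(tn[i])))"
]
},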
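{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Classification Report\n",
"\n",
"\n",
"**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall** and **f1** scores for each class, together with its **support**. Precision is the fraction of predictions of a class that are correct. Recall is the fraction of observations actually belonging to a class that the model identifies. The f1 score is the harmonic mean of precision and recall, and support is the number of test observations that actually belong to the class.\n",
"\n",
"We can print a classification report as follows:"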
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" acc 0.56 0.57 0.56 129\n",
" good 0.00 0.00 0.00 20\n",
" unacc 0.87 0.97 0.92 397\n",
" vgood 0.00 0.00 0.00 25\n",
"\n",
" micro avg 0.80 0.80 0.80 571\n",
" macro avg 0.36 0.38 0.37 571\n",
"weighted avg 0.73 0.80 0.77 571\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, y_pred_en))"
]
},
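{
"cell_type": "markdown",
"metadata": {},
"source": [
"The figures in the report can be reproduced from the confusion matrix. The cell below is a minimal sketch, assuming the matrix values and sorted label order printed earlier; the variable names are illustrative. A class that is never predicted is given a precision of 0 here, which is how the report above displays `good` and `vgood`. The last lines compare the overall accuracy with a hypothetical baseline that always predicts the most frequent class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Confusion matrix copied from the printed output in section 15.\n",
"# Rows are true classes, columns are predicted classes.\n",
"cm_printed = np.array([[ 73, 0,  56, 0],\n",
"                       [ 20, 0,   0, 0],\n",
"                       [ 12, 0, 385, 0],\n",
"                       [ 25, 0,   0, 0]])\n",
"\n",
"# Assumption: labels appear in sorted order, the default used by confusion_matrix.\n",
"labels = ['acc', 'good', 'unacc', 'vgood']\n",
"\n",
"tp = np.diag(cm_printed).astype(float)\n",
"predicted = cm_printed.sum(axis=0)   # column sums: how often each class was predicted\n",
"support = cm_printed.sum(axis=1)     # row sums: how often each class actually occurs\n",
"\n",
"# Use 0 wherever the denominator is 0 (for example, a class that is never predicted).\n",
"precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)\n",
"recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)\n",
"denom = precision + recall\n",
"f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)\n",
"\n",
"for label, p, r, f, s in zip(labels, precision, recall, f1, support):\n",
"    print('{:>5}  precision={:.2f}  recall={:.2f}  f1={:.2f}  support={}'.format(label, p, r, f, s))\n",
"\n",
"# Overall accuracy versus always predicting the most frequent class.\n",
"accuracy = tp.sum() / cm_printed.sum()\n",
"baseline = support.max() / cm_printed.sum()\n",
"\n",
"print('accuracy                = {:.4f}'.format(accuracy))\n",
"print('majority-class baseline = {:.4f}'.format(baseline))"
]
},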
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 17. Results and conclusion\n",
"\n",
"\n",
"1.\tIn this project, I build a Decision-Tree Classifier model to predict the safety of the car. I build two models, one with criterion `gini index` and another one with criterion `entropy`. Both models reach a test-set accuracy of 0.8021.\n",
"2.\tIn the model with criterion `gini index`, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. These two values are quite comparable, so there is no sign of overfitting.\n",
"3.\tSimilarly, in the model with criterion `entropy`, the training-set accuracy score is 0.7865 while the test-set accuracy is 0.8021. We get the same values as in the case with criterion `gini`, so there is no sign of overfitting.\n",
"4.\tThe two models yield identical training-set and test-set accuracy scores. This may be because the dataset is small.\n",
"5.\tThe confusion matrix and classification report show that the model identifies `unacc` well (recall 0.97) and `acc` moderately (recall 0.57), but it never predicts `good` or `vgood`, so precision and recall are 0.00 for those two classes. The overall accuracy therefore reflects the two most frequent classes only."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}