k Nearest Neighbours with Python and Scikit-Learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# k Nearest Neighbours with Python and Scikit-Learn\n",
"\n",
"\n",
"k Nearest Neighbours (kNN) is one of the simplest and most widely used machine learning algorithms. In this project, I build a kNN classifier to predict whether a breast tumour sample is benign or malignant. I use the `Breast Cancer Wisconsin (Original) Data Set` downloaded from the UCI Machine Learning Repository."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction to k Nearest Neighbours Algorithm\n",
"2.\tk Nearest Neighbours intuition\n",
"3.\tThe problem statement\n",
"4.\tDataset description\n",
"5.\tImport libraries\n",
"6.\tImport dataset\n",
"7.\tExploratory data analysis\n",
"8.\tData visualization\n",
"9.\tDeclare feature vector and target variable\n",
"10.\tSplit data into separate training and test set\n",
"11.\tFeature engineering\n",
"12.\tFeature scaling\n",
"13.\tFit Neighbours classifier to the training set\n",
"14.\tPredict the test-set results\n",
"15.\tCheck the accuracy score\n",
"16.\tRebuild kNN classification model using different values of k\n",
"17.\tConfusion matrix\n",
"18.\tClassification metrics\n",
"19.\tROC - AUC\n",
"20.\tk-Fold Cross Validation\n",
"21.\tResults and conclusion\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to k Nearest Neighbours algorithm\n",
"\n",
"\n",
"\n",
"In machine learning, k Nearest Neighbours (kNN) is one of the simplest algorithms. It is a non-parametric algorithm used for classification and regression tasks. Non-parametric means that it makes no assumption about the underlying distribution of the data. In both classification and regression, the prediction for a data point is based on the k closest training examples in the feature space. The output depends on whether kNN is used for classification or regression.\n",
"\n",
"-\tIn kNN classification, the output is a class membership. A data point is classified by a majority vote of its neighbours: it is assigned to the most frequent class among its k nearest neighbours. Usually k is a small positive integer. If k=1, the data point is simply assigned to the class of its single nearest neighbour.\n",
"\n",
"-\tIn kNN regression, the output is a property value for the object. This value is the average of the values of its k nearest neighbours.\n",
"\n",
"\n",
"kNN is a type of instance-based learning, also called lazy learning. Lazy learning means that no explicit model is built during training: the algorithm simply stores the training data and defers all computation until a prediction is requested. This makes training fast but prediction slower and costlier, because every prediction needs the stored training data and therefore more time and memory.\n",
"\n",
"In kNN, the neighbours are taken from a set of objects for which the class or the object property value is known. This can be thought of as the training set for the kNN algorithm, though no explicit training step is required. In both kNN classification and regression, we can assign weights to the contributions of the neighbours, so that nearer neighbours contribute more to the vote or average than more distant ones.\n",
"\n"
]
},
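{
"cell_type": "markdown",
"metadata": {},
"source": [
"The distance-weighted averaging described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not part of the notebook's pipeline: the function name `knn_regress`, the toy arrays and the inverse-distance weighting scheme are all assumptions made for this sketch.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def knn_regress(X_train, y_train, x_query, k=3, weighted=True):\n",
"    # Euclidean distance from the query point to every stored training point\n",
"    dists = np.linalg.norm(X_train - x_query, axis=1)\n",
"    # indices of the k closest training points\n",
"    nearest = np.argsort(dists)[:k]\n",
"    if not weighted:\n",
"        # plain kNN regression: unweighted mean of the neighbours' values\n",
"        return y_train[nearest].mean()\n",
"    # inverse-distance weights (an assumed scheme); the small constant avoids division by zero\n",
"    weights = 1.0 / (dists[nearest] + 1e-9)\n",
"    return np.average(y_train[nearest], weights=weights)\n",
"\n",
"# toy one-dimensional data, made up for illustration\n",
"X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])\n",
"y_toy = np.array([1.0, 2.0, 3.0, 10.0])\n",
"x_new = np.array([2.4])\n",
"\n",
"print(knn_regress(X_toy, y_toy, x_new, k=3, weighted=False))\n",
"print(knn_regress(X_toy, y_toy, x_new, k=3, weighted=True))\n",
"```\n",
"\n",
"With `weighted=False` the prediction is the plain mean of the three nearest target values. With `weighted=True` the closer neighbours count for more, so the estimate moves towards their values."
]
},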
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. k Nearest Neighbours intuition\n",
"\n",
"\n",
"The intuition behind kNN is simple. The algorithm calculates the distance between a new data point and every training data point. The distance can be, for example, the Euclidean or the Manhattan distance. It then selects the k nearest training points, where k is a positive integer. Finally, it assigns the new data point to the class to which the majority of those k points belong.\n",
"\n",
"\n",
"Now, we will see the kNN algorithm in action. Suppose we have a dataset with two variables whose points are labelled either `Red` or `Blue`.\n",
"\n",
"\n",
"In the kNN algorithm, k is the number of nearest neighbours. For a two-class problem, k is usually chosen to be odd so that the majority vote cannot end in a tie. When k=1, the algorithm is known as the nearest neighbour algorithm.\n",
"\n",
"Now, we want to classify a new data point `X` as either `Blue` or `Red`. Suppose the value of k is 3. The kNN algorithm starts by calculating the distance between `X` and all the other data points. It then finds the 3 points closest to `X`.\n",
"\n",
"\n",
"In the final step of the kNN algorithm, we assign the new data point `X` to the majority class of the 3 nearest points. If 2 of the 3 nearest points belong to the class `Red` while 1 belongs to the class `Blue`, then we classify the new data point as `Red`.\n"
]
},
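{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three steps above (compute distances, pick the k nearest points, take a majority vote) can be sketched from scratch as follows. This is a hedged illustration rather than the Scikit-Learn implementation: the function name `knn_classify` and the four toy points are hypothetical and chosen only to reproduce the `Red`/`Blue` example with k=3.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from collections import Counter\n",
"\n",
"def knn_classify(X_train, y_train, x_query, k=3):\n",
"    # Euclidean distance from the query point to every training point\n",
"    dists = np.linalg.norm(X_train - x_query, axis=1)\n",
"    # labels of the k nearest training points\n",
"    nearest_labels = y_train[np.argsort(dists)[:k]]\n",
"    # majority vote among those k labels\n",
"    return Counter(nearest_labels.tolist()).most_common(1)[0][0]\n",
"\n",
"# toy two-variable dataset (made up): two Red points and two Blue points\n",
"points = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])\n",
"labels = np.array(['Red', 'Red', 'Blue', 'Blue'])\n",
"\n",
"x_new = np.array([2.0, 2.0])\n",
"print(knn_classify(points, labels, x_new, k=3))\n",
"```\n",
"\n",
"For this query point, the three nearest neighbours are the two `Red` points and one `Blue` point, so the vote goes to `Red`."
]
},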
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. The problem statement\n",
"\n",
"\n",
"In this project, I try to predict whether a patient's breast tumour is benign or malignant. I implement the kNN algorithm with Python and Scikit-Learn.\n",
"\n",
"\n",
"To do so, I build a kNN classifier on the **Breast Cancer Wisconsin (Original) Data Set** downloaded from the UCI Machine Learning Repository."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dataset description\n",
"\n",
"\n",
"I have used the **Breast Cancer Wisconsin (Original) Data Set** downloaded from the UCI Machine Learning Repository for this project.\n",
"\n",
"\n",
"The data set can be found at the following url:-\n",
"\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)\n",
"\n",
"\n",
"This dataset contains the patient samples that arrive periodically as Dr. Wolberg reports his clinical cases. \n",
"\n",
"\n",
"The attribute information of this dataset is as follows:-\n",
"\n",
"1. Sample code number: id number \n",
"2. Clump Thickness: 1 - 10 \n",
"3. Uniformity of Cell Size: 1 - 10 \n",
"4. Uniformity of Cell Shape: 1 - 10 \n",
"5. Marginal Adhesion: 1 - 10 \n",
"6. Single Epithelial Cell Size: 1 - 10 \n",
"7. Bare Nuclei: 1 - 10 \n",
"8. Bland Chromatin: 1 - 10 \n",
"9. Normal Nucleoli: 1 - 10 \n",
"10. Mitoses: 1 - 10 \n",
"11. Class: (2 for benign, 4 for malignant)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/breast-cancer-wisconsin.data'\n",
"\n",
"df = pd.read_csv(data, header=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(699, 11)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 699 instances and 11 attributes in the data set. \n",
"\n",
"\n",
"In the dataset description, it is given that there are 10 attributes and 1 `Class` which is the target variable. So, we have 10 attributes and 1 target variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View top 5 rows of dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1000025</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1002945</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1015425</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1016277</td>\n",
" <td>6</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1017023</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10\n",
"0 1000025 5 1 1 1 2 1 3 1 1 2\n",
"1 1002945 5 4 4 5 7 10 3 2 1 2\n",
"2 1015425 3 1 1 1 2 2 3 1 1 2\n",
"3 1016277 6 8 8 1 3 4 3 7 1 2\n",
"4 1017023 4 1 1 3 2 1 3 1 1 2"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rename column names\n",
"\n",
"We can see that the dataset does not have proper column names. The columns are merely labelled 0, 1, 2 and so on. We should give the columns meaningful names. I will do it as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Id', 'Clump_thickness', 'Uniformity_Cell_Size',\n",
" 'Uniformity_Cell_Shape', 'Marginal_Adhesion',\n",
" 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',\n",
" 'Normal_Nucleoli', 'Mitoses', 'Class'],\n",
" dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col_names = ['Id', 'Clump_thickness', 'Uniformity_Cell_Size', 'Uniformity_Cell_Shape', 'Marginal_Adhesion', \n",
" 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class']\n",
"\n",
"df.columns = col_names\n",
"\n",
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the column names are renamed. Now, the columns have meaningful names."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>Clump_thickness</th>\n",
" <th>Uniformity_Cell_Size</th>\n",
" <th>Uniformity_Cell_Shape</th>\n",
" <th>Marginal_Adhesion</th>\n",
" <th>Single_Epithelial_Cell_Size</th>\n",
" <th>Bare_Nuclei</th>\n",
" <th>Bland_Chromatin</th>\n",
" <th>Normal_Nucleoli</th>\n",
" <th>Mitoses</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1000025</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1002945</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1015425</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1016277</td>\n",
" <td>6</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1017023</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id Clump_thickness Uniformity_Cell_Size Uniformity_Cell_Shape \\\n",
"0 1000025 5 1 1 \n",
"1 1002945 5 4 4 \n",
"2 1015425 3 1 1 \n",
"3 1016277 6 8 8 \n",
"4 1017023 4 1 1 \n",
"\n",
" Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei \\\n",
"0 1 2 1 \n",
"1 5 7 10 \n",
"2 1 2 2 \n",
"3 1 3 4 \n",
"4 3 2 1 \n",
"\n",
" Bland_Chromatin Normal_Nucleoli Mitoses Class \n",
"0 3 1 1 2 \n",
"1 3 2 1 2 \n",
"2 3 1 1 2 \n",
"3 3 7 1 2 \n",
"4 3 1 1 2 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's preview the dataset again\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drop redundant columns\n",
"\n",
"\n",
"We should drop any redundant columns that have no predictive power. Here, `Id` is such a column, so I will drop it first."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# drop Id column from dataset\n",
"\n",
"df.drop('Id', axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View summary of dataset\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 699 entries, 0 to 698\n",
"Data columns (total 10 columns):\n",
"Clump_thickness 699 non-null int64\n",
"Uniformity_Cell_Size 699 non-null int64\n",
"Uniformity_Cell_Shape 699 non-null int64\n",
"Marginal_Adhesion 699 non-null int64\n",
"Single_Epithelial_Cell_Size 699 non-null int64\n",
"Bare_Nuclei 699 non-null object\n",
"Bland_Chromatin 699 non-null int64\n",
"Normal_Nucleoli 699 non-null int64\n",
"Mitoses 699 non-null int64\n",
"Class 699 non-null int64\n",
"dtypes: int64(9), object(1)\n",
"memory usage: 54.7+ KB\n"
]
}
],
"source": [
"# view summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `Id` column has been removed from the dataset. \n",
"\n",
"The summary reports 9 columns of type `int64` and 1 column, `Bare_Nuclei`, of type `object`. I will check the frequency distribution of values in each variable to find out why."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Frequency distribution of values in variables"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 145\n",
"5 130\n",
"3 108\n",
"4 80\n",
"10 69\n",
"2 50\n",
"8 46\n",
"6 34\n",
"7 23\n",
"9 14\n",
"Name: Clump_thickness, dtype: int64\n",
"1 384\n",
"10 67\n",
"3 52\n",
"2 45\n",
"4 40\n",
"5 30\n",
"8 29\n",
"6 27\n",
"7 19\n",
"9 6\n",
"Name: Uniformity_Cell_Size, dtype: int64\n",
"1 353\n",
"2 59\n",
"10 58\n",
"3 56\n",
"4 44\n",
"5 34\n",
"7 30\n",
"6 30\n",
"8 28\n",
"9 7\n",
"Name: Uniformity_Cell_Shape, dtype: int64\n",
"1 407\n",
"3 58\n",
"2 58\n",
"10 55\n",
"4 33\n",
"8 25\n",
"5 23\n",
"6 22\n",
"7 13\n",
"9 5\n",
"Name: Marginal_Adhesion, dtype: int64\n",
"2 386\n",
"3 72\n",
"4 48\n",
"1 47\n",
"6 41\n",
"5 39\n",
"10 31\n",
"8 21\n",
"7 12\n",
"9 2\n",
"Name: Single_Epithelial_Cell_Size, dtype: int64\n",
"1 402\n",
"10 132\n",
"5 30\n",
"2 30\n",
"3 28\n",
"8 21\n",
"4 19\n",
"? 16\n",
"9 9\n",
"7 8\n",
"6 4\n",
"Name: Bare_Nuclei, dtype: int64\n",
"2 166\n",
"3 165\n",
"1 152\n",
"7 73\n",
"4 40\n",
"5 34\n",
"8 28\n",
"10 20\n",
"9 11\n",
"6 10\n",
"Name: Bland_Chromatin, dtype: int64\n",
"1 443\n",
"10 61\n",
"3 44\n",
"2 36\n",
"8 24\n",
"6 22\n",
"5 19\n",
"4 18\n",
"9 16\n",
"7 16\n",
"Name: Normal_Nucleoli, dtype: int64\n",
"1 579\n",
"2 35\n",
"3 33\n",
"10 14\n",
"4 12\n",
"7 9\n",
"8 8\n",
"5 6\n",
"6 3\n",
"Name: Mitoses, dtype: int64\n",
"2 458\n",
"4 241\n",
"Name: Class, dtype: int64\n"
]
}
],
"source": [
"for var in df.columns:\n",
" \n",
" print(df[var].value_counts())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The frequency distribution shows that the values of `Bare_Nuclei` are integers, apart from 16 entries recorded as `?`. These `?` entries are why the summary of the dataframe reports the column as type object. So, I will explicitly convert it to a numeric type, turning the `?` entries into `NaN`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert data type of Bare_Nuclei to numeric"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"df['Bare_Nuclei'] = pd.to_numeric(df['Bare_Nuclei'], errors='coerce')"
]
},
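{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `errors='coerce'` does in the conversion above, here is a small sketch on a made-up Series (the values below are hypothetical, not taken from the dataset). Entries that cannot be parsed as numbers, such as `?`, become `NaN`. Because `NaN` is a floating-point value, the resulting column is `float64` rather than integer.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical column mixing digit strings with a '?' placeholder\n",
"raw = pd.Series(['1', '10', '?', '4'])\n",
"\n",
"# unparseable entries are replaced by NaN instead of raising an error\n",
"converted = pd.to_numeric(raw, errors='coerce')\n",
"\n",
"print(converted.dtype)         # float64\n",
"print(converted.isna().sum())  # 1\n",
"```\n",
"\n",
"Without `errors='coerce'`, `pd.to_numeric` would raise an error on the `?` entry."
]
},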
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check data types of columns of dataframe"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness int64\n",
"Uniformity_Cell_Size int64\n",
"Uniformity_Cell_Shape int64\n",
"Marginal_Adhesion int64\n",
"Single_Epithelial_Cell_Size int64\n",
"Bare_Nuclei float64\n",
"Bland_Chromatin int64\n",
"Normal_Nucleoli int64\n",
"Mitoses int64\n",
"Class int64\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that all the columns of the dataframe are of type numeric."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of variables\n",
"\n",
"\n",
"- There are 10 numerical variables in the dataset.\n",
"\n",
"\n",
"- All of the variables are of discrete type.\n",
"\n",
"\n",
"- Out of all the 10 variables, the first 9 variables are feature variables and last variable `Class` is the target variable.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore problems within variables\n",
"\n",
"\n",
"Now, I will explore problems within variables.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Missing values in variables"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness 0\n",
"Uniformity_Cell_Size 0\n",
"Uniformity_Cell_Shape 0\n",
"Marginal_Adhesion 0\n",
"Single_Epithelial_Cell_Size 0\n",
"Bare_Nuclei 16\n",
"Bland_Chromatin 0\n",
"Normal_Nucleoli 0\n",
"Mitoses 0\n",
"Class 0\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in variables\n",
"\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `Bare_Nuclei` column contains missing values. We need to dig deeper to find the frequency distribution of \n",
"values of `Bare_Nuclei`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness 0\n",
"Uniformity_Cell_Size 0\n",
"Uniformity_Cell_Shape 0\n",
"Marginal_Adhesion 0\n",
"Single_Epithelial_Cell_Size 0\n",
"Bare_Nuclei 16\n",
"Bland_Chromatin 0\n",
"Normal_Nucleoli 0\n",
"Mitoses 0\n",
"Class 0\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check `na` values in the dataframe\n",
"\n",
"df.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `Bare_Nuclei` column contains 16 `nan` values."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0 402\n",
"10.0 132\n",
"5.0 30\n",
"2.0 30\n",
"3.0 28\n",
"8.0 21\n",
"4.0 19\n",
"9.0 9\n",
"7.0 8\n",
"6.0 4\n",
"Name: Bare_Nuclei, dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of `Bare_Nuclei` column\n",
"\n",
"df['Bare_Nuclei'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1., 10., 2., 4., 3., 9., 7., nan, 5., 8., 6.])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check unique values in `Bare_Nuclei` column\n",
"\n",
"df['Bare_Nuclei'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are `nan` values in the `Bare_Nuclei` column."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check for nan values in `Bare_Nuclei` column\n",
"\n",
"df['Bare_Nuclei'].isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 16 `nan` values in the dataset. I will impute missing values after dividing the dataset into training and test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check frequency distribution of target variable `Class`"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2 458\n",
"4 241\n",
"Name: Class, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view frequency distribution of values in `Class` variable\n",
"\n",
"df['Class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check percentage of frequency distribution of `Class`"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2 0.655222\n",
"4 0.344778\n",
"Name: Class, dtype: float64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view percentage of frequency distribution of values in `Class` variable\n",
"\n",
"df['Class'].value_counts(normalize=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `Class` variable contains 2 class labels - `2` and `4`. `2` stands for benign and `4` stands for malignant cancer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outliers in numerical variables"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Clump_thickness Uniformity_Cell_Size Uniformity_Cell_Shape \\\n",
"count 699.00 699.00 699.00 \n",
"mean 4.42 3.13 3.21 \n",
"std 2.82 3.05 2.97 \n",
"min 1.00 1.00 1.00 \n",
"25% 2.00 1.00 1.00 \n",
"50% 4.00 1.00 1.00 \n",
"75% 6.00 5.00 5.00 \n",
"max 10.00 10.00 10.00 \n",
"\n",
" Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei \\\n",
"count 699.00 699.00 683.00 \n",
"mean 2.81 3.22 3.54 \n",
"std 2.86 2.21 3.64 \n",
"min 1.00 1.00 1.00 \n",
"25% 1.00 2.00 1.00 \n",
"50% 1.00 2.00 1.00 \n",
"75% 4.00 4.00 6.00 \n",
"max 10.00 10.00 10.00 \n",
"\n",
" Bland_Chromatin Normal_Nucleoli Mitoses Class \n",
"count 699.00 699.00 699.00 699.00 \n",
"mean 3.44 2.87 1.59 2.69 \n",
"std 2.44 3.05 1.72 0.95 \n",
"min 1.00 1.00 1.00 2.00 \n",
"25% 2.00 1.00 1.00 2.00 \n",
"50% 3.00 1.00 1.00 2.00 \n",
"75% 5.00 4.00 1.00 4.00 \n",
"max 10.00 10.00 10.00 4.00 \n"
]
}
],
"source": [
"# view summary statistics in numerical variables\n",
"\n",
"print(round(df.describe(),2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
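"kNN can be sensitive to outliers and noisy points, particularly for small values of k. Here, however, every feature is bounded between 1 and 10, so there are no extreme values to handle."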
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Data Visualization\n",
"\n",
"\n",
"Now, we have a basic understanding of our data. I will supplement it with some data visualization to get a better understanding\n",
"of our data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Univariate plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the distribution of variables\n",
"\n",
"\n",
"Now, I will plot the histograms to check variable distributions to find out if they are normal or skewed. "
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 2160x1800 with 10 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot histograms of the variables\n",
"\n",
"\n",
"plt.rcParams['figure.figsize']=(30,25)\n",
"\n",
"df.plot(kind='hist', bins=10, subplots=True, layout=(5,2), sharex=False, sharey=False)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that all the variables in the dataset are positively skewed. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multivariate plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimating correlation coefficients\n",
"\n",
"Our dataset is very small. So, we can compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes. We can compute it using the `df.corr()` method as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"correlation = df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our target variable is `Class`. So, we should check how each attribute correlates with the `Class` variable. We can do it as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Class 1.000000\n",
"Bare_Nuclei 0.822696\n",
"Uniformity_Cell_Shape 0.818934\n",
"Uniformity_Cell_Size 0.817904\n",
"Bland_Chromatin 0.756616\n",
"Clump_thickness 0.716001\n",
"Normal_Nucleoli 0.712244\n",
"Marginal_Adhesion 0.696800\n",
"Single_Epithelial_Cell_Size 0.682785\n",
"Mitoses 0.423170\n",
"Name: Class, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"correlation['Class'].sort_values(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation \n",
"\n",
"- The correlation coefficient ranges from -1 to +1. \n",
"\n",
"- When it is close to +1, this signifies that there is a strong positive correlation. So, we can see that there is a strong positive correlation between `Class` and `Bare_Nuclei`, `Class` and `Uniformity_Cell_Shape`, `Class` and `Uniformity_Cell_Size`.\n",
"\n",
"- When it is close to -1, it means that there is a strong negative correlation. When it is close to 0, it means that there is no linear correlation.\n",
"\n",
"- We can see that all the variables are positively correlated with the `Class` variable. Some are strongly correlated, while others, such as `Mitoses`, are only weakly correlated."
]
},
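{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, Pearson's r between two variables is their covariance divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers and compares the result with `np.corrcoef`. The function name `pearson_r` and the sample arrays are assumptions for illustration only.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def pearson_r(x, y):\n",
"    # centre both variables on their means\n",
"    xc = x - x.mean()\n",
"    yc = y - y.mean()\n",
"    # co-variation of x and y divided by the product of their spreads\n",
"    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())\n",
"\n",
"# made-up sample values\n",
"x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])\n",
"\n",
"print(round(pearson_r(x, y), 4))\n",
"print(round(np.corrcoef(x, y)[0, 1], 4))\n",
"```\n",
"\n",
"Both lines print the same value. Note that `df.corr()` excludes missing values when it computes each pairwise coefficient, so the `NaN` entries in `Bare_Nuclei` do not prevent the calculation."
]
},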
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Discover patterns and relationships \n",
"\n",
"\n",
"An important step in EDA is to discover patterns and relationships between variables in the dataset. I will use the seaborn heatmap to explore the patterns and relationships in the dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Correlation Heat Map"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x576 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10,8))\n",
"plt.title('Correlation of Attributes with Class variable')\n",
"a = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')\n",
"a.set_xticklabels(a.get_xticklabels(), rotation=90)\n",
"a.set_yticklabels(a.get_yticklabels(), rotation=30) \n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"\n",
"From the above correlation heat map, we can conclude that :-\n",
"\n",
"1. `Class` is highly positively correlated with `Uniformity_Cell_Size`, `Uniformity_Cell_Shape` and `Bare_Nuclei` (correlation coefficient = 0.82).\n",
"\n",
"2. `Class` is positively correlated with `Clump_thickness` (correlation coefficient = 0.72), `Marginal_Adhesion` (correlation coefficient = 0.70), `Single_Epithelial_Cell_Size` (correlation coefficient = 0.68) and `Normal_Nucleoli` (correlation coefficient = 0.71).\n",
"\n",
"3. `Class` is weakly positively correlated with `Mitoses` (correlation coefficient = 0.42).\n",
"\n",
"4. The `Mitoses` variable is weakly positively correlated with all the other variables (correlation coefficient < 0.50)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(['Class'], axis=1)\n",
"\n",
"y = df['Class']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# split X and y into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((559, 9), (140, 9))"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of X_train and X_test\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Feature Engineering\n",
"\n",
"\n",
"**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness int64\n",
"Uniformity_Cell_Size int64\n",
"Uniformity_Cell_Shape int64\n",
"Marginal_Adhesion int64\n",
"Single_Epithelial_Cell_Size int64\n",
"Bare_Nuclei float64\n",
"Bland_Chromatin int64\n",
"Normal_Nucleoli int64\n",
"Mitoses int64\n",
"dtype: object"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check data types in X_train\n",
"\n",
"X_train.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Engineering missing values in variables\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness 0\n",
"Uniformity_Cell_Size 0\n",
"Uniformity_Cell_Shape 0\n",
"Marginal_Adhesion 0\n",
"Single_Epithelial_Cell_Size 0\n",
"Bare_Nuclei 13\n",
"Bland_Chromatin 0\n",
"Normal_Nucleoli 0\n",
"Mitoses 0\n",
"dtype: int64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables in X_train\n",
"\n",
"X_train.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness 0\n",
"Uniformity_Cell_Size 0\n",
"Uniformity_Cell_Shape 0\n",
"Marginal_Adhesion 0\n",
"Single_Epithelial_Cell_Size 0\n",
"Bare_Nuclei 3\n",
"Bland_Chromatin 0\n",
"Normal_Nucleoli 0\n",
"Mitoses 0\n",
"dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables in X_test\n",
"\n",
"X_test.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bare_Nuclei 0.0233\n"
]
}
],
"source": [
"# print the fraction of missing values in each numerical variable in the training set\n",
"\n",
"for col in X_train.columns:\n",
"    if X_train[col].isnull().mean() > 0:\n",
"        print(col, round(X_train[col].isnull().mean(), 4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assumption\n",
"\n",
"\n",
"I assume that the data are missing completely at random (MCAR). Two common ways to impute missing values are mean or median imputation and random sample imputation. When there are outliers in the dataset, median imputation is preferable because the median is robust to outliers. So, I will use median imputation.\n",
"\n",
"\n",
"I will impute missing values with the median of each column. Imputation should be learned on the training set and then propagated to the test set. This means that the statistic used to fill missing values in both the train and test sets should be extracted from the train set only. This avoids leaking information from the test set into the model."
]
},
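{
"cell_type": "markdown",
"metadata": {},
"source": [
"The train-only rule described above can also be sketched with scikit-learn's `SimpleImputer`. This is a minimal illustration on made-up values, not this notebook's data. The `demo_` names and the toy frames are assumptions introduced only for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# toy frames, invented for illustration only\n",
"demo_train = pd.DataFrame({'Bare_Nuclei': [1.0, 10.0, np.nan, 2.0, 1.0]})\n",
"demo_test = pd.DataFrame({'Bare_Nuclei': [np.nan, 5.0]})\n",
"\n",
"# learn the median from the training data only\n",
"demo_imputer = SimpleImputer(strategy='median')\n",
"demo_imputer.fit(demo_train)\n",
"\n",
"# apply the same training median to both sets\n",
"demo_train_filled = pd.DataFrame(demo_imputer.transform(demo_train), columns=demo_train.columns)\n",
"demo_test_filled = pd.DataFrame(demo_imputer.transform(demo_test), columns=demo_test.columns)\n",
"\n",
"# the learned statistic and the filled test frame\n",
"print(demo_imputer.statistics_)\n",
"print(demo_test_filled)"
]
},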
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
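"# impute missing values in X_train and X_test with the respective column medians of X_train\n",
"\n",
"# compute the medians once, on the training set only\n",
"train_medians = X_train.median()\n",
"\n",
"# assign the result back instead of calling fillna(inplace=True) on a column selection,\n",
"# which may not modify the original DataFrame in recent pandas versions\n",
"X_train = X_train.fillna(train_medians)\n",
"X_test = X_test.fillna(train_medians)"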
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness 0\n",
"Uniformity_Cell_Size 0\n",
"Uniformity_Cell_Shape 0\n",
"Marginal_Adhesion 0\n",
"Single_Epithelial_Cell_Size 0\n",
"Bare_Nuclei 0\n",
"Bland_Chromatin 0\n",
"Normal_Nucleoli 0\n",
"Mitoses 0\n",
"dtype: int64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check again missing values in numerical variables in X_train\n",
"\n",
"X_train.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Clump_thickness 0\n",
"Uniformity_Cell_Size 0\n",
"Uniformity_Cell_Shape 0\n",
"Marginal_Adhesion 0\n",
"Single_Epithelial_Cell_Size 0\n",
"Bare_Nuclei 0\n",
"Bland_Chromatin 0\n",
"Normal_Nucleoli 0\n",
"Mitoses 0\n",
"dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables in X_test\n",
"\n",
"X_test.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in X_train and X_test."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Clump_thickness</th>\n",
" <th>Uniformity_Cell_Size</th>\n",
" <th>Uniformity_Cell_Shape</th>\n",
" <th>Marginal_Adhesion</th>\n",
" <th>Single_Epithelial_Cell_Size</th>\n",
" <th>Bare_Nuclei</th>\n",
" <th>Bland_Chromatin</th>\n",
" <th>Normal_Nucleoli</th>\n",
" <th>Mitoses</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>293</th>\n",
" <td>10</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" <td>10.0</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>10</td>\n",
" <td>8.0</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>485</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>422</th>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>332</th>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Clump_thickness Uniformity_Cell_Size Uniformity_Cell_Shape \\\n",
"293 10 4 4 \n",
"62 9 10 10 \n",
"485 1 1 1 \n",
"422 4 3 3 \n",
"332 5 2 2 \n",
"\n",
" Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei \\\n",
"293 6 2 10.0 \n",
"62 1 10 8.0 \n",
"485 3 1 3.0 \n",
"422 1 2 1.0 \n",
"332 2 2 1.0 \n",
"\n",
" Bland_Chromatin Normal_Nucleoli Mitoses \n",
"293 2 3 1 \n",
"62 3 3 1 \n",
"485 1 1 1 \n",
"422 3 3 1 \n",
"332 2 2 1 "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Clump_thickness</th>\n",
" <th>Uniformity_Cell_Size</th>\n",
" <th>Uniformity_Cell_Shape</th>\n",
" <th>Marginal_Adhesion</th>\n",
" <th>Single_Epithelial_Cell_Size</th>\n",
" <th>Bare_Nuclei</th>\n",
" <th>Bland_Chromatin</th>\n",
" <th>Normal_Nucleoli</th>\n",
" <th>Mitoses</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>476</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>531</th>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>9</td>\n",
" <td>6</td>\n",
" <td>1.0</td>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>432</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>8</td>\n",
" <td>7</td>\n",
" <td>5</td>\n",
" <td>10</td>\n",
" <td>7</td>\n",
" <td>9.0</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Clump_thickness Uniformity_Cell_Size Uniformity_Cell_Shape \\\n",
"476 4 1 2 \n",
"531 4 2 2 \n",
"40 6 6 6 \n",
"432 5 1 1 \n",
"14 8 7 5 \n",
"\n",
" Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei \\\n",
"476 1 2 1.0 \n",
"531 1 2 1.0 \n",
"40 9 6 1.0 \n",
"432 1 2 1.0 \n",
"14 10 7 9.0 \n",
"\n",
" Bland_Chromatin Normal_Nucleoli Mitoses \n",
"476 1 1 1 \n",
"531 2 1 1 \n",
"40 7 8 1 \n",
"432 2 2 1 \n",
"14 5 5 4 "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
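"We now have the training and test sets ready for model building. Before that, we should map all the feature variables onto the same scale. This matters for kNN because it relies on distances between observations, so a feature with a larger range would otherwise dominate. This is called `feature scaling`. I will do it as follows."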
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Feature Scaling"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"cols = X_train.columns"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"scaler = StandardScaler()\n",
"\n",
"X_train = scaler.fit_transform(X_train)\n",
"\n",
"X_test = scaler.transform(X_test)\n"
]
},
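{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the scaling step concrete, here is a minimal sketch on made-up numbers showing that `StandardScaler` computes `z = (x - mean) / std` with the mean and standard deviation taken from the training data only. The `demo_` arrays are assumptions for illustration and are not part of this notebook's data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# toy single-feature arrays, invented for illustration only\n",
"demo_train = np.array([[1.0], [3.0], [5.0], [7.0]])\n",
"demo_test = np.array([[4.0], [10.0]])\n",
"\n",
"# fit on the training data only\n",
"demo_scaler = StandardScaler().fit(demo_train)\n",
"\n",
"# manual z-score using training statistics (population standard deviation, ddof=0)\n",
"demo_manual = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)\n",
"\n",
"print(demo_scaler.transform(demo_test))\n",
"\n",
"# check that the manual computation agrees with the scaler\n",
"print(np.allclose(demo_manual, demo_scaler.transform(demo_test)))"
]
},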
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# pass cols directly; wrapping it in a list would create MultiIndex columns\n",
"X_train = pd.DataFrame(X_train, columns=cols)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# pass cols directly; wrapping it in a list would create MultiIndex columns\n",
"X_test = pd.DataFrame(X_test, columns=cols)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>Clump_thickness</th>\n",
" <th>Uniformity_Cell_Size</th>\n",
" <th>Uniformity_Cell_Shape</th>\n",
" <th>Marginal_Adhesion</th>\n",
" <th>Single_Epithelial_Cell_Size</th>\n",
" <th>Bare_Nuclei</th>\n",
" <th>Bland_Chromatin</th>\n",
" <th>Normal_Nucleoli</th>\n",
" <th>Mitoses</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.028383</td>\n",
" <td>0.299506</td>\n",
" <td>0.289573</td>\n",
" <td>1.119077</td>\n",
" <td>-0.546543</td>\n",
" <td>1.858357</td>\n",
" <td>-0.577774</td>\n",
" <td>0.041241</td>\n",
" <td>-0.324258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.669451</td>\n",
" <td>2.257680</td>\n",
" <td>2.304569</td>\n",
" <td>-0.622471</td>\n",
" <td>3.106879</td>\n",
" <td>1.297589</td>\n",
" <td>-0.159953</td>\n",
" <td>0.041241</td>\n",
" <td>-0.324258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-1.202005</td>\n",
" <td>-0.679581</td>\n",
" <td>-0.717925</td>\n",
" <td>0.074148</td>\n",
" <td>-1.003220</td>\n",
" <td>-0.104329</td>\n",
" <td>-0.995595</td>\n",
" <td>-0.608165</td>\n",
" <td>-0.324258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.125209</td>\n",
" <td>-0.026856</td>\n",
" <td>-0.046260</td>\n",
" <td>-0.622471</td>\n",
" <td>-0.546543</td>\n",
" <td>-0.665096</td>\n",
" <td>-0.159953</td>\n",
" <td>0.041241</td>\n",
" <td>-0.324258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.233723</td>\n",
" <td>-0.353219</td>\n",
" <td>-0.382092</td>\n",
" <td>-0.274161</td>\n",
" <td>-0.546543</td>\n",
" <td>-0.665096</td>\n",
" <td>-0.577774</td>\n",
" <td>-0.283462</td>\n",
" <td>-0.324258</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Clump_thickness Uniformity_Cell_Size Uniformity_Cell_Shape \\\n",
"0 2.028383 0.299506 0.289573 \n",
"1 1.669451 2.257680 2.304569 \n",
"2 -1.202005 -0.679581 -0.717925 \n",
"3 -0.125209 -0.026856 -0.046260 \n",
"4 0.233723 -0.353219 -0.382092 \n",
"\n",
" Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin \\\n",
"0 1.119077 -0.546543 1.858357 -0.577774 \n",
"1 -0.622471 3.106879 1.297589 -0.159953 \n",
"2 0.074148 -1.003220 -0.104329 -0.995595 \n",
"3 -0.622471 -0.546543 -0.665096 -0.159953 \n",
"4 -0.274161 -0.546543 -0.665096 -0.577774 \n",
"\n",
" Normal_Nucleoli Mitoses \n",
"0 0.041241 -0.324258 \n",
"1 0.041241 -0.324258 \n",
"2 -0.608165 -0.324258 \n",
"3 0.041241 -0.324258 \n",
"4 -0.283462 -0.324258 "
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have the `X_train` dataset ready to be fed into the k Nearest Neighbours classifier. I will do it as follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Fit K Neighbours Classifier to the Training Set"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=None, n_neighbors=3, p=2,\n",
" weights='uniform')"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import KNeighborsClassifier from sklearn\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"\n",
"# instantiate the model\n",
"knn = KNeighborsClassifier(n_neighbors=3)\n",
"\n",
"\n",
"# fit the model to the training set\n",
"knn.fit(X_train, y_train)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Predict the test-set results"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 2, 4, 2, 4, 2, 4, 2, 4, 2, 2, 2, 4, 4, 4, 2, 2, 4, 4, 2, 4, 4,\n",
" 2, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2,\n",
" 4, 4, 2, 4, 2, 4, 4, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 4,\n",
" 4, 2, 2, 4, 2, 2, 4, 4, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2,\n",
" 4, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2,\n",
" 4, 4, 4, 2, 2, 2, 2, 2, 4, 4, 4, 4, 2, 4, 2, 2, 4, 4, 4, 4, 4, 2,\n",
" 2, 4, 4, 2, 2, 4, 2, 2], dtype=int64)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred = knn.predict(X_test)\n",
"\n",
"y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### predict_proba method\n",
"\n",
"\n",
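"The **predict_proba** method gives the predicted probability of each target class (2 and 4 in this case) in array form. For kNN with uniform weights, each probability is the fraction of the k nearest neighbours that belong to that class, so with k=3 the only possible values are 0, 1/3, 2/3 and 1.\n",
"\n",
"`2 is for probability of a benign tumour` and `4 is for probability of a malignant tumour.`"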
]
},
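{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch on made-up one-dimensional data showing how the columns of `predict_proba` line up with `classes_`. The `demo_` names and the toy points are assumptions for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# toy data, invented for illustration only: class 2 on the left, class 4 on the right\n",
"demo_X = np.array([[0.0], [1.0], [2.0], [6.0], [9.0], [10.0]])\n",
"demo_y = np.array([2, 2, 2, 4, 4, 4])\n",
"\n",
"demo_knn = KNeighborsClassifier(n_neighbors=3).fit(demo_X, demo_y)\n",
"\n",
"# column order of predict_proba follows classes_ (sorted labels)\n",
"print(demo_knn.classes_)\n",
"\n",
"# the 3 nearest neighbours of 4.0 are 2.0, 6.0 and 1.0, so two of three are class 2\n",
"# the 3 nearest neighbours of 0.5 are 0.0, 1.0 and 2.0, so all three are class 2\n",
"print(demo_knn.predict_proba(np.array([[4.0], [0.5]])))"
]
},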
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1. , 1. , 0.33333333, 1. , 0. ,\n",
" 1. , 0. , 1. , 0. , 0.66666667,\n",
" 1. , 1. , 0. , 0.33333333, 0. ,\n",
" 1. , 1. , 0. , 0. , 1. ,\n",
" 0. , 0. , 1. , 1. , 1. ,\n",
" 0. , 1. , 1. , 0. , 0. ,\n",
" 1. , 1. , 1. , 1. , 1. ,\n",
" 0.66666667, 1. , 0. , 1. , 1. ,\n",
" 1. , 1. , 1. , 1. , 0. ,\n",
" 0. , 1. , 0. , 1. , 0. ,\n",
" 0. , 1. , 1. , 0. , 1. ,\n",
" 1. , 1. , 1. , 0.66666667, 1. ,\n",
" 0. , 1. , 1. , 0. , 0. ,\n",
" 0.33333333, 0. , 1. , 1. , 0. ,\n",
" 1. , 1. , 0. , 0. , 1. ,\n",
" 1. , 1. , 1. , 0. , 1. ,\n",
" 1. , 1. , 0. , 1. , 1. ,\n",
" 1. , 0. , 1. , 0. , 0. ,\n",
" 1. , 1. , 0.66666667, 0. , 1. ,\n",
" 1. , 1. , 0. , 1. , 0. ,\n",
" 0. , 1. , 1. , 1. , 0. ,\n",
" 1. , 1. , 1. , 1. , 1. ,\n",
" 0. , 0.33333333, 0. , 1. , 1. ,\n",
" 1. , 1. , 1. , 0. , 0. ,\n",
" 0. , 0.33333333, 1. , 0. , 1. ,\n",
" 1. , 0.33333333, 0.33333333, 0. , 0. ,\n",
" 0. , 1. , 1. , 0.33333333, 0. ,\n",
" 1. , 1. , 0. , 1. , 1. ])"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# probability of class 2 (benign): first column, since knn.classes_ is sorted as [2, 4]\n",
"\n",
"knn.predict_proba(X_test)[:,0]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0. , 0. , 0.66666667, 0. , 1. ,\n",
" 0. , 1. , 0. , 1. , 0.33333333,\n",
" 0. , 0. , 1. , 0.66666667, 1. ,\n",
" 0. , 0. , 1. , 1. , 0. ,\n",
" 1. , 1. , 0. , 0. , 0. ,\n",
" 1. , 0. , 0. , 1. , 1. ,\n",
" 0. , 0. , 0. , 0. , 0. ,\n",
" 0.33333333, 0. , 1. , 0. , 0. ,\n",
" 0. , 0. , 0. , 0. , 1. ,\n",
" 1. , 0. , 1. , 0. , 1. ,\n",
" 1. , 0. , 0. , 1. , 0. ,\n",
" 0. , 0. , 0. , 0.33333333, 0. ,\n",
" 1. , 0. , 0. , 1. , 1. ,\n",
" 0.66666667, 1. , 0. , 0. , 1. ,\n",
" 0. , 0. , 1. , 1. , 0. ,\n",
" 0. , 0. , 0. , 1. , 0. ,\n",
" 0. , 0. , 1. , 0. , 0. ,\n",
" 0. , 1. , 0. , 1. , 1. ,\n",
" 0. , 0. , 0.33333333, 1. , 0. ,\n",
" 0. , 0. , 1. , 0. , 1. ,\n",
" 1. , 0. , 0. , 0. , 1. ,\n",
" 0. , 0. , 0. , 0. , 0. ,\n",
" 1. , 0.66666667, 1. , 0. , 0. ,\n",
" 0. , 0. , 0. , 1. , 1. ,\n",
" 1. , 0.66666667, 0. , 1. , 0. ,\n",
" 0. , 0.66666667, 0.66666667, 1. , 1. ,\n",
" 1. , 0. , 0. , 0.66666667, 1. ,\n",
" 0. , 0. , 1. , 0. , 0. ])"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# probability of class 4 (malignant): second column of predict_proba\n",
"\n",
"knn.predict_proba(X_test)[:,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Check the accuracy score"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score: 0.9714\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, **y_test** are the true class labels and **y_pred** are the predicted class labels in the test-set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare the train-set and test-set accuracy\n",
"\n",
"\n",
"Now, I will compare the train-set and test-set accuracy to check for overfitting."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"y_pred_train = knn.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training-set accuracy score: 0.9821\n"
]
}
],
"source": [
"print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for overfitting and underfitting"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.9821\n",
"Test set score: 0.9714\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(knn.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(knn.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training-set accuracy score is 0.9821 while the test-set accuracy is 0.9714. These two values are quite comparable, so there is no sign of overfitting. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare model accuracy with null accuracy\n",
"\n",
"\n",
"So, the model accuracy is 0.9714. But, we cannot say that our model is very good based on the above accuracy. We must compare it with the **null accuracy**. Null accuracy is the accuracy that could be achieved by always predicting the most frequent class.\n",
"\n",
"So, we should first check the class distribution in the test set. "
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2 85\n",
"4 55\n",
"Name: Class, dtype: int64"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check class distribution in test set\n",
"\n",
"y_test.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the most frequent class occurs 85 times. So, we can calculate the null accuracy by dividing 85 by the total number of occurrences."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null accuracy score: 0.6071\n"
]
}
],
"source": [
"# check null accuracy score\n",
"\n",
"null_accuracy = (85/(85+55))\n",
"\n",
"print('Null accuracy score: {0:0.4f}'. format(null_accuracy))"
]
},
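{
"cell_type": "markdown",
"metadata": {},
"source": [
"The null accuracy can also be computed from the labels instead of typing the counts by hand. The sketch below is one possible way to do it. It builds a toy label series with the same class counts as the test set above (85 and 55), and the `demo_` names are assumptions for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# toy label series with the same class counts as the test set above\n",
"demo_y_test = pd.Series([2] * 85 + [4] * 55)\n",
"\n",
"# share of the most frequent class = accuracy of always predicting that class\n",
"demo_null_accuracy = demo_y_test.value_counts(normalize=True).max()\n",
"\n",
"print('Null accuracy score: {0:0.4f}'.format(demo_null_accuracy))"
]
},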
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that our model accuracy score is 0.9714 but null accuracy score is 0.6071. So, we can conclude that our K Nearest Neighbors model is doing a very good job in predicting the class labels."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Rebuild kNN Classification model using different values of k\n",
"\n",
"\n",
"I have built the kNN classification model using k=3. Now, I will increase the value of k and see its effect on accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rebuild kNN Classification model using k=5"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with k=5 : 0.9714\n"
]
}
],
"source": [
"# instantiate the model with k=5\n",
"knn_5 = KNeighborsClassifier(n_neighbors=5)\n",
"\n",
"\n",
"# fit the model to the training set\n",
"knn_5.fit(X_train, y_train)\n",
"\n",
"\n",
"# predict on the test-set\n",
"y_pred_5 = knn_5.predict(X_test)\n",
"\n",
"\n",
"print('Model accuracy score with k=5 : {0:0.4f}'. format(accuracy_score(y_test, y_pred_5)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rebuild kNN Classification model using k=6"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with k=6 : 0.9786\n"
]
}
],
"source": [
"# instantiate the model with k=6\n",
"knn_6 = KNeighborsClassifier(n_neighbors=6)\n",
"\n",
"\n",
"# fit the model to the training set\n",
"knn_6.fit(X_train, y_train)\n",
"\n",
"\n",
"# predict on the test-set\n",
"y_pred_6 = knn_6.predict(X_test)\n",
"\n",
"\n",
"print('Model accuracy score with k=6 : {0:0.4f}'. format(accuracy_score(y_test, y_pred_6)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rebuild kNN Classification model using k=7"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with k=7 : 0.9786\n"
]
}
],
"source": [
"# instantiate the model with k=7\n",
"knn_7 = KNeighborsClassifier(n_neighbors=7)\n",
"\n",
"\n",
"# fit the model to the training set\n",
"knn_7.fit(X_train, y_train)\n",
"\n",
"\n",
"# predict on the test-set\n",
"y_pred_7 = knn_7.predict(X_test)\n",
"\n",
"\n",
"print('Model accuracy score with k=7 : {0:0.4f}'. format(accuracy_score(y_test, y_pred_7)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rebuild kNN Classification model using k=8"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with k=8 : 0.9786\n"
]
}
],
"source": [
"# instantiate the model with k=8\n",
"knn_8 = KNeighborsClassifier(n_neighbors=8)\n",
"\n",
"\n",
"# fit the model to the training set\n",
"knn_8.fit(X_train, y_train)\n",
"\n",
"\n",
"# predict on the test-set\n",
"y_pred_8 = knn_8.predict(X_test)\n",
"\n",
"\n",
"print('Model accuracy score with k=8 : {0:0.4f}'. format(accuracy_score(y_test, y_pred_8)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rebuild kNN Classification model using k=9"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with k=9 : 0.9714\n"
]
}
],
"source": [
"# instantiate the model with k=9\n",
"knn_9 = KNeighborsClassifier(n_neighbors=9)\n",
"\n",
"\n",
"# fit the model to the training set\n",
"knn_9.fit(X_train, y_train)\n",
"\n",
"\n",
"# predict on the test-set\n",
"y_pred_9 = knn_9.predict(X_test)\n",
"\n",
"\n",
"print('Model accuracy score with k=9 : {0:0.4f}'. format(accuracy_score(y_test, y_pred_9)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"\n",
"Our original model accuracy score with k=3 is 0.9714. We get the same accuracy score of 0.9714 with k=5.\n",
"\n",
"\n",
"With k=6, 7 and 8 we get an accuracy score of 0.9786. This is a small improvement that corresponds to one more correctly classified observation out of the 140 in the test set.\n",
"\n",
"\n",
"If we increase k to 9, the accuracy decreases again to 0.9714."
]
},
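{
"cell_type": "markdown",
"metadata": {},
"source": [
"The repeated cells above can be condensed into a loop over k. The sketch below is a hedged illustration on synthetic data from `make_classification`, not on the breast cancer data, so its scores will differ from the ones reported above. All `demo_` names are assumptions introduced for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# synthetic data with 9 features, invented for illustration only\n",
"demo_X, demo_y = make_classification(n_samples=300, n_features=9, random_state=0)\n",
"\n",
"demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(\n",
"    demo_X, demo_y, test_size=0.2, random_state=0)\n",
"\n",
"# fit one model per value of k and report its test-set accuracy\n",
"for k in range(3, 10):\n",
"    demo_model = KNeighborsClassifier(n_neighbors=k).fit(demo_X_train, demo_y_train)\n",
"    print('k={}: test accuracy {:.4f}'.format(k, demo_model.score(demo_X_test, demo_y_test)))"
]
},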
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, based on the above analysis we can conclude that our classification model accuracy is very good. Our model is doing a very good job in terms of predicting the class labels.\n",
"\n",
"\n",
"But, accuracy does not give the underlying distribution of values. Also, it does not tell us anything about the type of errors our classifier is making. \n",
"\n",
"\n",
"We have another tool called `Confusion matrix` that comes to our rescue."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 17. Confusion matrix\n",
"\n",
"\n",
"A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
"\n",
"\n",
"Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-\n",
"\n",
"\n",
"**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
"\n",
"\n",
"**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
"\n",
"\n",
"**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
"\n",
"\n",
"\n",
"**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called **Type II error.**\n",
"\n",
"\n",
"\n",
"These four outcomes are summarized in a confusion matrix given below.\n"
]
},
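{
"cell_type": "markdown",
"metadata": {},
"source": [
"In scikit-learn, `confusion_matrix(y_true, y_pred)` puts the actual classes on the rows and the predicted classes on the columns, in sorted label order. Which cell counts as a false positive therefore depends on which class is treated as positive. The sketch below uses made-up toy labels and treats class 4 as the positive class. Both the toy labels and that choice are assumptions for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"# toy labels, invented for illustration only\n",
"demo_true = [2, 2, 2, 2, 4, 4, 4, 4]\n",
"demo_pred = [2, 2, 2, 4, 4, 4, 2, 2]\n",
"\n",
"# rows = actual classes, columns = predicted classes, both ordered as labels=[2, 4]\n",
"demo_cm = confusion_matrix(demo_true, demo_pred, labels=[2, 4])\n",
"print(demo_cm)\n",
"\n",
"# with the negative class listed first and the positive class second,\n",
"# ravel() returns the cells in the order TN, FP, FN, TP\n",
"demo_tn, demo_fp, demo_fn, demo_tp = demo_cm.ravel()\n",
"print('TN =', demo_tn, 'FP =', demo_fp, 'FN =', demo_fn, 'TP =', demo_tp)"
]
},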
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[83 2]\n",
" [ 2 53]]\n",
"\n",
"True Positives(TP) = 83\n",
"\n",
"True Negatives(TN) = 53\n",
"\n",
"False Positives(FP) = 2\n",
"\n",
"False Negatives(FN) = 2\n"
]
}
],
"source": [
"# Print the Confusion Matrix with k =3 and slice it into four pieces\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm = confusion_matrix(y_test, y_pred)\n",
"\n",
"print('Confusion matrix\\n\\n', cm)\n",
"\n",
"print('\\nTrue Positives(TP) = ', cm[0,0])\n",
"\n",
"print('\\nTrue Negatives(TN) = ', cm[1,1])\n",
"\n",
"print('\\nFalse Positives(FP) = ', cm[0,1])\n",
"\n",
"print('\\nFalse Negatives(FN) = ', cm[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The confusion matrix shows `83 + 53 = 136 correct predictions` and `2 + 2 = 4 incorrect predictions`.\n",
"\n",
"\n",
"In this case, we have\n",
"\n",
"\n",
"- `True Positives` (Actual Positive:1 and Predict Positive:1) - 83\n",
"\n",
"\n",
"- `True Negatives` (Actual Negative:0 and Predict Negative:0) - 53\n",
"\n",
"\n",
"- `False Positives` (Actual Negative:0 but Predict Positive:1) - 2 `(Type I error)`\n",
"\n",
"\n",
"- `False Negatives` (Actual Positive:1 but Predict Negative:0) - 2 `(Type II error)`"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[83 2]\n",
" [ 1 54]]\n",
"\n",
"True Positives(TP) = 83\n",
"\n",
"True Negatives(TN) = 54\n",
"\n",
"False Positives(FP) = 2\n",
"\n",
"False Negatives(FN) = 1\n"
]
}
],
"source": [
"# Print the Confusion Matrix with k =7 and slice it into four pieces\n",
"\n",
"cm_7 = confusion_matrix(y_test, y_pred_7)\n",
"\n",
"print('Confusion matrix\\n\\n', cm_7)\n",
"\n",
"print('\\nTrue Positives(TP) = ', cm_7[0,0])\n",
"\n",
"print('\\nTrue Negatives(TN) = ', cm_7[1,1])\n",
"\n",
"print('\\nFalse Positives(FP) = ', cm_7[0,1])\n",
"\n",
"print('\\nFalse Negatives(FN) = ', cm_7[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above confusion matrix shows `83 + 54 = 137 correct predictions` and `2 + 1 = 3 incorrect predictions`.\n",
"\n",
"\n",
"In this case, we have\n",
"\n",
"\n",
"- `True Positives` (Actual Positive:1 and Predict Positive:1) - 83\n",
"\n",
"\n",
"- `True Negatives` (Actual Negative:0 and Predict Negative:0) - 54\n",
"\n",
"\n",
"- `False Positives` (Actual Negative:0 but Predict Positive:1) - 2 `(Type I error)`\n",
"\n",
"\n",
"- `False Negatives` (Actual Positive:1 but Predict Negative:0) - 1 `(Type II error)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comment\n",
"\n",
"\n",
"So, the kNN classification model with k=7 makes more correct predictions and fewer errors than the k=3 model (3 errors instead of 4). Hence, we got a performance improvement with k=7."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x800e752f60>"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# visualize confusion matrix with seaborn heatmap\n",
"\n",
"plt.figure(figsize=(6,4))\n",
"\n",
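"# rows of a scikit-learn confusion matrix are actual classes, columns are predicted classes\n",
"cm_matrix = pd.DataFrame(data=cm_7, columns=['Predicted 2 (benign)', 'Predicted 4 (malignant)'], \n",
"                                 index=['Actual 2 (benign)', 'Actual 4 (malignant)'])\n",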
"\n",
"sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 18. Classification metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification Report\n",
"\n",
"\n",
"**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1** and **support** scores for the model. These terms are described later.\n",
"\n",
"We can print a classification report as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 2 0.99 0.98 0.98 85\n",
" 4 0.96 0.98 0.97 55\n",
"\n",
" micro avg 0.98 0.98 0.98 140\n",
" macro avg 0.98 0.98 0.98 140\n",
"weighted avg 0.98 0.98 0.98 140\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, y_pred_7))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification accuracy"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"TP = cm_7[0,0]\n",
"TN = cm_7[1,1]\n",
"FP = cm_7[0,1]\n",
"FN = cm_7[1,0]"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification accuracy : 0.9786\n"
]
}
],
"source": [
"# print classification accuracy\n",
"\n",
"classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)\n",
"\n",
"print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification error"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification error : 0.0214\n"
]
}
],
"source": [
"# print classification error\n",
"\n",
"classification_error = (FP + FN) / float(TP + TN + FP + FN)\n",
"\n",
"print('Classification error : {0:0.4f}'.format(classification_error))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision\n",
"\n",
"\n",
"**Precision** can be defined as the percentage of correctly predicted positive outcomes out of all the predicted positive outcomes. It can be given as the ratio of true positives (TP) to the sum of true and false positives (TP + FP). \n",
"\n",
"\n",
"So, **Precision** identifies the proportion of correctly predicted positive outcome. It is more concerned with the positive class than the negative class.\n",
"\n",
"\n",
"\n",
"Mathematically, `precision` can be defined as the ratio of `TP to (TP + FP)`.\n"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision : 0.9765\n"
]
}
],
"source": [
"# print precision score\n",
"\n",
"precision = TP / float(TP + FP)\n",
"\n",
"\n",
"print('Precision : {0:0.4f}'.format(precision))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recall\n",
"\n",
"\n",
"Recall can be defined as the percentage of correctly predicted positive outcomes out of all the actual positive outcomes.\n",
"It can be given as the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN). **Recall** is also called **Sensitivity**.\n",
"\n",
"\n",
"**Recall** identifies the proportion of correctly predicted actual positives.\n",
"\n",
"\n",
"Mathematically, `recall` can be given as the ratio of `TP to (TP + FN)`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recall or Sensitivity : 0.9881\n"
]
}
],
"source": [
"recall = TP / float(TP + FN)\n",
"\n",
"print('Recall or Sensitivity : {0:0.4f}'.format(recall))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### True Positive Rate\n",
"\n",
"\n",
"**True Positive Rate** is synonymous with **Recall**.\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True Positive Rate : 0.9881\n"
]
}
],
"source": [
"true_positive_rate = TP / float(TP + FN)\n",
"\n",
"\n",
"print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### False Positive Rate"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False Positive Rate : 0.0357\n"
]
}
],
"source": [
"false_positive_rate = FP / float(FP + TN)\n",
"\n",
"\n",
"print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Specificity"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Specificity : 0.9643\n"
]
}
],
"source": [
"specificity = TN / (TN + FP)\n",
"\n",
"print('Specificity : {0:0.4f}'.format(specificity))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### f1-score\n",
"\n",
"\n",
"**f1-score** is the weighted harmonic mean of precision and recall. The best possible **f1-score** would be 1.0 and the worst \n",
"would be 0.0. **f1-score** is the harmonic mean of precision and recall. So, **f1-score** is always lower than accuracy measures as they embed precision and recall into their computation. The weighted average of `f1-score` should be used to \n",
"compare classifier models, not global accuracy.\n"
]
},
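{
"cell_type": "markdown",
"metadata": {},
"source": [
"The harmonic-mean definition above can be checked directly. The cell below is a minimal sketch and not part of the original analysis: `f1_from` is a hypothetical helper name, and its two inputs are the rounded class 2 precision and recall shown in the classification report."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical helper: f1-score as the harmonic mean of precision and recall\n",
"def f1_from(precision, recall):\n",
"    # guard against division by zero when both inputs are 0\n",
"    if precision + recall == 0:\n",
"        return 0.0\n",
"    return 2 * precision * recall / (precision + recall)\n",
"\n",
"\n",
"# rounded class 2 values taken from the classification report above\n",
"print('f1-score : {0:0.4f}'.format(f1_from(0.99, 0.98)))"
]
},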
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Support\n",
"\n",
"\n",
"**Support** is the actual number of occurrences of the class in our dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adjusting the classification threshold level"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0. ],\n",
" [1. , 0. ],\n",
" [0.33333333, 0.66666667],\n",
" [1. , 0. ],\n",
" [0. , 1. ],\n",
" [1. , 0. ],\n",
" [0. , 1. ],\n",
" [1. , 0. ],\n",
" [0. , 1. ],\n",
" [0.66666667, 0.33333333]])"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the first 10 predicted probabilities of two classes- 2 and 4\n",
"\n",
"y_pred_prob = knn.predict_proba(X_test)[0:10]\n",
"\n",
"y_pred_prob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observations\n",
"\n",
"\n",
"- In each row, the numbers sum to 1.\n",
"\n",
"\n",
"- There are 2 columns which correspond to 2 classes - 2 and 4. \n",
"\n",
"\n",
" - Class 2 - predicted probability that there is benign cancer. \n",
" \n",
" - Class 4 - predicted probability that there is malignant cancer.\n",
" \n",
" \n",
"- Importance of predicted probabilities\n",
"\n",
" - We can rank the observations by probability of benign or malignant cancer.\n",
"\n",
"\n",
"- predict_proba process\n",
"\n",
" - Predicts the probabilities \n",
" \n",
" - Choose the class with the highest probability \n",
" \n",
" \n",
"- Classification threshold level\n",
"\n",
" - There is a classification threshold level of 0.5. \n",
" \n",
" - Class 4 - probability of malignant cancer is predicted if probability > 0.5. \n",
" \n",
" - Class 2 - probability of benign cancer is predicted if probability < 0.5. \n",
" \n"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Prob of - benign cancer (2)</th>\n",
" <th>Prob of - malignant cancer (4)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.333333</td>\n",
" <td>0.666667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0.666667</td>\n",
" <td>0.333333</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Prob of - benign cancer (2) Prob of - malignant cancer (4)\n",
"0 1.000000 0.000000\n",
"1 1.000000 0.000000\n",
"2 0.333333 0.666667\n",
"3 1.000000 0.000000\n",
"4 0.000000 1.000000\n",
"5 1.000000 0.000000\n",
"6 0.000000 1.000000\n",
"7 1.000000 0.000000\n",
"8 0.000000 1.000000\n",
"9 0.666667 0.333333"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# store the probabilities in dataframe\n",
"\n",
"y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - benign cancer (2)', 'Prob of - malignant cancer (4)'])\n",
"\n",
"y_pred_prob_df"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0. , 0. , 0.66666667, 0. , 1. ,\n",
" 0. , 1. , 0. , 1. , 0.33333333])"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the first 10 predicted probabilities for class 4 - Probability of malignant cancer\n",
"\n",
"knn.predict_proba(X_test)[0:10, 1]"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"# store the predicted probabilities for class 4 - Probability of malignant cancer\n",
"\n",
"y_pred_1 = knn.predict_proba(X_test)[:, 1]"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0,0.5,'Frequency')"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot histogram of predicted probabilities\n",
"\n",
"\n",
"# adjust figure size\n",
"plt.figure(figsize=(6,4))\n",
"\n",
"\n",
"# adjust the font size \n",
"plt.rcParams['font.size'] = 12\n",
"\n",
"\n",
"# plot histogram with 10 bins\n",
"plt.hist(y_pred_1, bins = 10)\n",
"\n",
"\n",
"# set the title of predicted probabilities\n",
"plt.title('Histogram of predicted probabilities of malignant cancer')\n",
"\n",
"\n",
"# set the x-axis limit\n",
"plt.xlim(0,1)\n",
"\n",
"\n",
"# set the title\n",
"plt.xlabel('Predicted probabilities of malignant cancer')\n",
"plt.ylabel('Frequency')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observations\n",
"\n",
"\n",
"- We can see that the above histogram is positively skewed.\n",
"\n",
"\n",
"- The first column tell us that there are approximately 80 observations with 0 probability of malignant cancer.\n",
"\n",
"\n",
"- There are few observations with probability > 0.5.\n",
"\n",
"\n",
"- So, these few observations predict that there will be malignant cancer.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"- In binary problems, the threshold of 0.5 is used by default to convert predicted probabilities into class predictions.\n",
"\n",
"\n",
"- Threshold can be adjusted to increase sensitivity or specificity. \n",
"\n",
"\n",
"- Sensitivity and specificity have an inverse relationship. Increasing one would always decrease the other and vice versa.\n",
"\n",
"\n",
"- Adjusting the threshold level should be one of the last step you do in the model-building process."
]
},
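{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the threshold adjustment concrete, the cell below is a minimal sketch under stated assumptions. It reuses `y_pred_1` and `y_test` from the cells above and treats class 4 (malignant) as the positive class. The threshold value 0.3 and the names `threshold`, `y_pred_thresh` and `cm_thresh` are illustrative choices, not values from the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"# illustrative threshold (an assumption); lower values favour sensitivity for class 4\n",
"threshold = 0.3\n",
"\n",
"# predict class 4 (malignant) when its probability reaches the threshold, else class 2\n",
"y_pred_thresh = np.where(y_pred_1 >= threshold, 4, 2)\n",
"\n",
"# rows are actual classes, columns are predicted classes, in the order [2, 4]\n",
"cm_thresh = confusion_matrix(y_test, y_pred_thresh, labels=[2, 4])\n",
"\n",
"# sensitivity and specificity with class 4 as the positive class\n",
"sensitivity_4 = cm_thresh[1, 1] / float(cm_thresh[1, 0] + cm_thresh[1, 1])\n",
"specificity_4 = cm_thresh[0, 0] / float(cm_thresh[0, 0] + cm_thresh[0, 1])\n",
"\n",
"print('Sensitivity (class 4) : {0:0.4f}'.format(sensitivity_4))\n",
"print('Specificity (class 4) : {0:0.4f}'.format(specificity_4))"
]
},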
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 19. ROC - AUC\n",
"\n",
"\n",
"\n",
"### ROC Curve\n",
"\n",
"\n",
"Another tool to measure the classification model performance visually is **ROC Curve**. ROC Curve stands for **Receiver Operating Characteristic Curve**. An **ROC Curve** is a plot which shows the performance of a classification model at various \n",
"classification threshold levels. \n",
"\n",
"\n",
"\n",
"The **ROC Curve** plots the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold levels.\n",
"\n",
"\n",
"\n",
"\n",
"**True Positive Rate (TPR)** is also called **Recall**. It is defined as the ratio of **TP to (TP + FN)**.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"**False Positive Rate (FPR)** is defined as the ratio of **FP to (FP + TN)**.\n",
"\n",
"\n",
"\n",
"\n",
"In the ROC Curve, we will focus on the TPR (True Positive Rate) and FPR (False Positive Rate) of a single point. This will give us the general performance of the ROC curve which consists of the TPR and FPR at various threshold levels. So, an ROC Curve plots TPR vs FPR at different classification threshold levels. If we lower the threshold levels, it may result in more items being classified as positve. It will increase both True Positives (TP) and False Positives (FP).\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot ROC Curve\n",
"\n",
"from sklearn.metrics import roc_curve\n",
"\n",
"fpr, tpr, thresholds = roc_curve(y_test, y_pred_1, pos_label=4)\n",
"\n",
"plt.figure(figsize=(6,4))\n",
"\n",
"plt.plot(fpr, tpr, linewidth=2)\n",
"\n",
"plt.plot([0,1], [0,1], 'k--' )\n",
"\n",
"plt.rcParams['font.size'] = 12\n",
"\n",
"plt.title('ROC curve for Breast Cancer kNN classifier')\n",
"\n",
"plt.xlabel('False Positive Rate (1 - Specificity)')\n",
"\n",
"plt.ylabel('True Positive Rate (Sensitivity)')\n",
"\n",
"plt.show()\n"
]
},
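{
"cell_type": "markdown",
"metadata": {},
"source": [
"The individual operating points behind the curve can also be inspected as a table. This is a minimal sketch that reuses the `fpr`, `tpr` and `thresholds` arrays returned by `roc_curve` in the cell above; the name `roc_points` is an illustrative choice. Each row is one threshold level with the FPR and TPR it produces."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# tabulate the operating points returned by roc_curve above\n",
"roc_points = pd.DataFrame({'threshold': thresholds, 'FPR': fpr, 'TPR': tpr})\n",
"\n",
"roc_points"
]
},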
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curve help us to choose a threshold level that balances sensitivity and specificity for a particular context."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC AUC\n",
"\n",
"\n",
"**ROC AUC** stands for **Receiver Operating Characteristic - Area Under Curve**. It is a technique to compare classifier performance. In this technique, we measure the `area under the curve (AUC)`. A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. \n",
"\n",
"\n",
"So, **ROC AUC** is the percentage of the ROC plot that is underneath the curve."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ROC AUC : 0.9825\n"
]
}
],
"source": [
"# compute ROC AUC\n",
"\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"ROC_AUC = roc_auc_score(y_test, y_pred_1)\n",
"\n",
"print('ROC AUC : {:.4f}'.format(ROC_AUC))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"\n",
"- ROC AUC is a single number summary of classifier performance. The higher the value, the better the classifier.\n",
"\n",
"- ROC AUC of our model approaches towards 1. So, we can conclude that our classifier does a good job in predicting whether it is benign or malignant cancer."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross validated ROC AUC : 0.9910\n"
]
}
],
"source": [
"# calculate cross-validated ROC AUC \n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"Cross_validated_ROC_AUC = cross_val_score(knn_7, X_train, y_train, cv=5, scoring='roc_auc').mean()\n",
"\n",
"print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"Our Cross Validated ROC AUC is very close to 1. So, we can conclude that, the KNN classifier is indeed a very good model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 20. k-fold Cross Validation\n",
"\n",
"\n",
"In this section, I will apply k-fold Cross Validation technique to improve the model performance. Cross-validation is a statistical method of evaluating generalization performance It is more stable and thorough than using a train-test split to evaluate model performance. "
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation scores:[0.87719298 0.96491228 0.94736842 0.98214286 0.96428571 0.96428571\n",
" 0.98181818 0.98181818 1. 0.98181818]\n"
]
}
],
"source": [
"# Applying 10-Fold Cross Validation\n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"scores = cross_val_score(knn_7, X_train, y_train, cv = 10, scoring='accuracy')\n",
"\n",
"print('Cross-validation scores:{}'.format(scores))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can summarize the cross-validation accuracy by calculating its mean."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average cross-validation score: 0.9646\n"
]
}
],
"source": [
"# compute Average cross-validation score\n",
"\n",
"print('Average cross-validation score: {:.4f}'.format(scores.mean()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpretation\n",
"\n",
"\n",
"- Using the mean cross-validation, we can conclude that we expect the model to be around 96.46 % accurate on average.\n",
"\n",
"- If we look at all the 10 scores produced by the 10-fold cross-validation, we can also conclude that there is a relatively high variance in the accuracy between folds, ranging from 100% accuracy to 87.72% accuracy. So, we can conclude that the model is very dependent on the particular folds used for training, but it also be the consequence of the small size of the dataset.\n",
"\n",
"- We can see that 10-fold cross-validation accuracy does not result in performance improvement for this model."
]
},
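{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fold-to-fold variation noted above can be summarised numerically. This is a minimal sketch that reuses the `scores` array from the 10-fold run above and relies only on NumPy array methods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# summarise the spread of the 10 fold accuracies stored in scores\n",
"print('Minimum fold accuracy : {:.4f}'.format(scores.min()))\n",
"print('Maximum fold accuracy : {:.4f}'.format(scores.max()))\n",
"print('Standard deviation of fold accuracies : {:.4f}'.format(scores.std()))"
]
},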
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 21. Results and conclusion\n",
"\n",
"\n",
"\n",
"1. In this project, I build a kNN classifier model to classify the patients suffering from breast cancer. The model yields very good performance as indicated by the model accuracy which was found to be 0.9786 with k=7.\n",
"\n",
"2. With k=3, the training-set accuracy score is 0.9821 while the test-set accuracy to be 0.9714. These two values are quite comparable. So, there is no question of overfitting. \n",
"\n",
"3. I have compared the model accuracy score which is 0.9714 with null accuracy score which is 0.6071. So, we can conclude that our K Nearest Neighbors model is doing a very good job in predicting the class labels.\n",
"\n",
"4. Our original model accuracy score with k=3 is 0.9714. Now, we can see that we get same accuracy score of 0.9714 with k=5. But, if we increase the value of k further, this would result in enhanced accuracy. With k=6,7,8 we get accuracy score of 0.9786. So, it results in performance improvement. If we increase k to 9, then accuracy decreases again to 0.9714. So, we can conclude that our optimal value of k is 7.\n",
"\n",
"5. kNN Classification model with k=7 shows more accurate predictions and less number of errors than k=3 model. Hence, we got performance improvement with k=7.\n",
"\n",
"6. ROC AUC of our model approaches towards 1. So, we can conclude that our classifier does a good job in predicting whether it is benign or malignant cancer.\n",
"\n",
"7. Using the mean cross-validation, we can conclude that we expect the model to be around 96.46 % accurate on average.\n",
"\n",
"8. If we look at all the 10 scores produced by the 10-fold cross-validation, we can also conclude that there is a relatively high variance in the accuracy between folds, ranging from 100% accuracy to 87.72% accuracy. So, we can conclude that the model is very dependent on the particular folds used for training, but it also be the consequence of the small size of the dataset.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}