{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Support Vector Machines with Python and Scikit-Learn\n",
"\n",
"\n",
"\n",
"In this project, I build a Support Vector Machines classifier to classify a Pulsar star. I have used the **Predicting a Pulsar Star** dataset for this project. I have downloaded this dataset from the Kaggle website."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction to Support Vector Machines\n",
"2.\tSupport Vector Machines intuition\n",
"3.\tKernel trick\n",
"4.\tThe problem statement\n",
"5.\tDataset description\n",
"6.\tImport libraries\n",
"7.\tImport dataset\n",
"8.\tExploratory data analysis\n",
"9.\tDeclare feature vector and target variable\n",
"10.\tSplit data into separate training and test set\n",
"11.\tFeature scaling\n",
"12.\tRun SVM with default hyperparameters\n",
"13.\tRun SVM with linear kernel\n",
"14.\tRun SVM with polynomial kernel\n",
"15.\tRun SVM with sigmoid kernel\n",
"16.\tConfusion matrix\n",
"17.\tClassification metrices\n",
"18.\tROC - AUC\n",
"19.\tStratified k-fold Cross Validation with shuffle split\n",
"20.\tHyperparameter optimization using GridSearch CV\n",
"21.\tResults and conclusion\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to Support Vector Machines\n",
"\n",
"\n",
"**Support Vector Machines** (SVMs in short) are machine learning algorithms that are used for classification and regression purposes. SVMs are one of the powerful machine learning algorithms for classification, regression and outlier detection purposes. An SVM classifier builds a model that assigns new data points to one of the given categories. Thus, it can be viewed as a non-probabilistic binary linear classifier.\n",
"\n",
"The original SVM algorithm was developed by Vladimir N Vapnik and Alexey Ya. Chervonenkis in 1963. At that time, the algorithm was in early stages. The only possibility is to draw hyperplanes for linear classifier. In 1992, Bernhard E. Boser, Isabelle M Guyon and Vladimir N Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The current standard was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.\n",
"\n",
"SVMs can be used for linear classification purposes. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using the **kernel trick**. It enable us to implicitly map the inputs into high dimensional feature spaces.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Support Vector Machines intuition\n",
"\n",
"\n",
"Now, we should be familiar with some SVM terminology. \n",
"\n",
"\n",
"### Hyperplane\n",
"\n",
"A hyperplane is a decision boundary which separates between given set of data points having different class labels. The SVM classifier separates data points using a hyperplane with the maximum amount of margin. This hyperplane is known as the `maximum margin hyperplane` and the linear classifier it defines is known as the `maximum margin classifier`.\n",
"\n",
"\n",
"### Support Vectors\n",
"\n",
"Support vectors are the sample data points, which are closest to the hyperplane. These data points will define the separating line or hyperplane better by calculating margins.\n",
"\n",
"\n",
"### Margin\n",
"\n",
"A margin is a separation gap between the two lines on the closest data points. It is calculated as the perpendicular distance from the line to support vectors or closest data points. In SVMs, we try to maximize this separation gap so that we get maximum margin.\n",
"\n",
"\n",
"### SVM Under the hood\n",
"\n",
"In SVMs, our main objective is to select a hyperplane with the maximum possible margin between support vectors in the given dataset. SVM searches for the maximum margin hyperplane in the following 2 step process –\n",
"\n",
"\n",
"1.\tGenerate hyperplanes which segregates the classes in the best possible way. There are many hyperplanes that might classify the data. We should look for the best hyperplane that represents the largest separation, or margin, between the two classes.\n",
"\n",
"2.\tSo, we choose the hyperplane so that distance from it to the support vectors on each side is maximized. If such a hyperplane exists, it is known as the **maximum margin hyperplane** and the linear classifier it defines is known as a **maximum margin classifier**. \n",
"\n",
"\n",
"### Problem with dispersed datasets\n",
"\n",
"\n",
"Sometimes, the sample data points are so dispersed that it is not possible to separate them using a linear hyperplane. \n",
"In such a situation, SVMs uses a `kernel trick` to transform the input space to a higher dimensional space as shown in the diagram below. It uses a mapping function to transform the 2-D input space into the 3-D input space. Now, we can easily segregate the data points using linear separation.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Kernel trick\n",
"\n",
"\n",
"In practice, SVM algorithm is implemented using a `kernel`. It uses a technique called the `kernel trick`. In simple words, a `kernel` is just a function that maps the data to a higher dimension where data is separable. A kernel transforms a low-dimensional input data space into a higher dimensional space. So, it converts non-linear separable problems to linear separable problems by adding more dimensions to it. Thus, the kernel trick helps us to build a more accurate classifier. Hence, it is useful in non-linear separation problems.\n",
"\n",
"In the context of SVMs, there are 4 popular kernels – `Linear kernel`, `Polynomial kernel` and `Radial Basis Function (RBF) kernel` (also called Gaussian kernel) and `Sigmoid kernel`. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. The problem statement\n",
"\n",
"\n",
"In this project, I try to classify a pulsar star as `legitimate` or `spurious` pulsar star. The legitimate pulsar stars form a minority positive class and spurious pulsar stars form the majority negative class. I implement Support Vector Machines (SVMs) classification algorithm with Python and Scikit-Learn to solve this problem. \n",
"\n",
"\n",
"To answer the question, I build a SVM classifier to classify the pulsar star as legitimate or spurious. I have used the **Predicting a Pulsar Star** dataset downloaded from the Kaggle website for this project."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Dataset description\n",
"\n",
"\n",
"I have used the **Predicting a Pulsar Star** dataset downloaded from the Kaggle website for this project. I have downloaded this data set from the Kaggle website. The data set can be found at the following url:-\n",
"\n",
"\n",
"https://www.kaggle.com/pavanraj159/predicting-a-pulsar-star\n",
"\n",
"\n",
"Pulsars are a rare type of Neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter. Classification algorithms in particular are being adopted, which treat the data sets as binary classification problems. Here the legitimate pulsar examples form minority positive class and spurious examples form the majority negative class.\n",
"\n",
"The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).\n",
"\n",
"\n",
"### Attribute Information:\n",
"\n",
"\n",
"Each candidate is described by 8 continuous variables, and a single class variable. The first four are simple statistics obtained from the integrated pulse profile. The remaining four variables are similarly obtained from the DM-SNR curve . These are summarised below:\n",
"\n",
"1. Mean of the integrated profile.\n",
"\n",
"2. Standard deviation of the integrated profile.\n",
"\n",
"3. Excess kurtosis of the integrated profile.\n",
"\n",
"4. Skewness of the integrated profile.\n",
"\n",
"5. Mean of the DM-SNR curve.\n",
"\n",
"6. Standard deviation of the DM-SNR curve.\n",
"\n",
"7. Excess kurtosis of the DM-SNR curve.\n",
"\n",
"8. Skewness of the DM-SNR curve.\n",
"\n",
"9. Class"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import libraries\n",
"\n",
"\n",
"I will start off by importing the required Python libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/pulsar_stars.csv'\n",
"\n",
"df = pd.read_csv(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(17898, 9)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 17898 instances and 9 feature variables in the data set."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Mean of the integrated profile</th>\n",
" <th>Standard deviation of the integrated profile</th>\n",
" <th>Excess kurtosis of the integrated profile</th>\n",
" <th>Skewness of the integrated profile</th>\n",
" <th>Mean of the DM-SNR curve</th>\n",
" <th>Standard deviation of the DM-SNR curve</th>\n",
" <th>Excess kurtosis of the DM-SNR curve</th>\n",
" <th>Skewness of the DM-SNR curve</th>\n",
" <th>target_class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>140.562500</td>\n",
" <td>55.683782</td>\n",
" <td>-0.234571</td>\n",
" <td>-0.699648</td>\n",
" <td>3.199833</td>\n",
" <td>19.110426</td>\n",
" <td>7.975532</td>\n",
" <td>74.242225</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>102.507812</td>\n",
" <td>58.882430</td>\n",
" <td>0.465318</td>\n",
" <td>-0.515088</td>\n",
" <td>1.677258</td>\n",
" <td>14.860146</td>\n",
" <td>10.576487</td>\n",
" <td>127.393580</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>103.015625</td>\n",
" <td>39.341649</td>\n",
" <td>0.323328</td>\n",
" <td>1.051164</td>\n",
" <td>3.121237</td>\n",
" <td>21.744669</td>\n",
" <td>7.735822</td>\n",
" <td>63.171909</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>136.750000</td>\n",
" <td>57.178449</td>\n",
" <td>-0.068415</td>\n",
" <td>-0.636238</td>\n",
" <td>3.642977</td>\n",
" <td>20.959280</td>\n",
" <td>6.896499</td>\n",
" <td>53.593661</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>88.726562</td>\n",
" <td>40.672225</td>\n",
" <td>0.600866</td>\n",
" <td>1.123492</td>\n",
" <td>1.178930</td>\n",
" <td>11.468720</td>\n",
" <td>14.269573</td>\n",
" <td>252.567306</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Mean of the integrated profile \\\n",
"0 140.562500 \n",
"1 102.507812 \n",
"2 103.015625 \n",
"3 136.750000 \n",
"4 88.726562 \n",
"\n",
" Standard deviation of the integrated profile \\\n",
"0 55.683782 \n",
"1 58.882430 \n",
"2 39.341649 \n",
"3 57.178449 \n",
"4 40.672225 \n",
"\n",
" Excess kurtosis of the integrated profile \\\n",
"0 -0.234571 \n",
"1 0.465318 \n",
"2 0.323328 \n",
"3 -0.068415 \n",
"4 0.600866 \n",
"\n",
" Skewness of the integrated profile Mean of the DM-SNR curve \\\n",
"0 -0.699648 3.199833 \n",
"1 -0.515088 1.677258 \n",
"2 1.051164 3.121237 \n",
"3 -0.636238 3.642977 \n",
"4 1.123492 1.178930 \n",
"\n",
" Standard deviation of the DM-SNR curve \\\n",
"0 19.110426 \n",
"1 14.860146 \n",
"2 21.744669 \n",
"3 20.959280 \n",
"4 11.468720 \n",
"\n",
" Excess kurtosis of the DM-SNR curve Skewness of the DM-SNR curve \\\n",
"0 7.975532 74.242225 \n",
"1 10.576487 127.393580 \n",
"2 7.735822 63.171909 \n",
"3 6.896499 53.593661 \n",
"4 14.269573 252.567306 \n",
"\n",
" target_class \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 9 variables in the dataset. 8 are continuous variables and 1 is discrete variable. The discrete variable is `target_class` variable. It is also the target variable.\n",
"\n",
"\n",
"Now, I will view the column names to check for leading and trailing spaces."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index([' Mean of the integrated profile',\n",
" ' Standard deviation of the integrated profile',\n",
" ' Excess kurtosis of the integrated profile',\n",
" ' Skewness of the integrated profile', ' Mean of the DM-SNR curve',\n",
" ' Standard deviation of the DM-SNR curve',\n",
" ' Excess kurtosis of the DM-SNR curve', ' Skewness of the DM-SNR curve',\n",
" 'target_class'],\n",
" dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the column names of the dataframe\n",
"\n",
"col_names = df.columns\n",
"\n",
"col_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are leading spaces (spaces at the start of the string name) in the dataframe. So, I will remove these leading spaces."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# remove leading spaces from column names\n",
"\n",
"df.columns = df.columns.str.strip()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I have removed the leading spaces from the column names. Let's again view the column names to confirm the same."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Mean of the integrated profile',\n",
" 'Standard deviation of the integrated profile',\n",
" 'Excess kurtosis of the integrated profile',\n",
" 'Skewness of the integrated profile', 'Mean of the DM-SNR curve',\n",
" 'Standard deviation of the DM-SNR curve',\n",
" 'Excess kurtosis of the DM-SNR curve', 'Skewness of the DM-SNR curve',\n",
" 'target_class'],\n",
" dtype='object')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view column names again\n",
"\n",
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the leading spaces are removed from the column name. But the column names are very long. So, I will make them short by renaming them."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# rename column names\n",
"\n",
"df.columns = ['IP Mean', 'IP Sd', 'IP Kurtosis', 'IP Skewness', \n",
" 'DM-SNR Mean', 'DM-SNR Sd', 'DM-SNR Kurtosis', 'DM-SNR Skewness', 'target_class']"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['IP Mean', 'IP Sd', 'IP Kurtosis', 'IP Skewness', 'DM-SNR Mean',\n",
" 'DM-SNR Sd', 'DM-SNR Kurtosis', 'DM-SNR Skewness', 'target_class'],\n",
" dtype='object')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the renamed column names\n",
"\n",
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the column names are shortened. IP stands for `integrated profile` and DM-SNR stands for `delta modulation and signal to noise ratio`. Now, it is much more easy to work with the columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our target variable is the `target_class` column. So, I will check its distribution."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 16259\n",
"1 1639\n",
"Name: target_class, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check distribution of target_class column\n",
"\n",
"df['target_class'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.908426\n",
"1 0.091574\n",
"Name: target_class, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the percentage distribution of target_class column\n",
"\n",
"df['target_class'].value_counts()/np.float(len(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that percentage of observations of the class label `0` and `1` is 90.84% and 9.16%. So, this is a class imbalanced problem. I will deal with that in later section."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 17898 entries, 0 to 17897\n",
"Data columns (total 9 columns):\n",
"IP Mean 17898 non-null float64\n",
"IP Sd 17898 non-null float64\n",
"IP Kurtosis 17898 non-null float64\n",
"IP Skewness 17898 non-null float64\n",
"DM-SNR Mean 17898 non-null float64\n",
"DM-SNR Sd 17898 non-null float64\n",
"DM-SNR Kurtosis 17898 non-null float64\n",
"DM-SNR Skewness 17898 non-null float64\n",
"target_class 17898 non-null int64\n",
"dtypes: float64(8), int64(1)\n",
"memory usage: 1.2 MB\n"
]
}
],
"source": [
"# view summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in the dataset and all the variables are numerical variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore missing values in variables"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"IP Mean 0\n",
"IP Sd 0\n",
"IP Kurtosis 0\n",
"IP Skewness 0\n",
"DM-SNR Mean 0\n",
"DM-SNR Sd 0\n",
"DM-SNR Kurtosis 0\n",
"DM-SNR Skewness 0\n",
"target_class 0\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check for missing values in variables\n",
"\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of numerical variables\n",
"\n",
"\n",
"- There are 9 numerical variables in the dataset.\n",
"\n",
"\n",
"- 8 are continuous variables and 1 is discrete variable. \n",
"\n",
"\n",
"- The discrete variable is `target_class` variable. It is also the target variable.\n",
"\n",
"\n",
"- There are no missing values in the dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outliers in numerical variables"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>IP Mean</th>\n",
" <th>IP Sd</th>\n",
" <th>IP Kurtosis</th>\n",
" <th>IP Skewness</th>\n",
" <th>DM-SNR Mean</th>\n",
" <th>DM-SNR Sd</th>\n",
" <th>DM-SNR Kurtosis</th>\n",
" <th>DM-SNR Skewness</th>\n",
" <th>target_class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" <td>17898.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>111.08</td>\n",
" <td>46.55</td>\n",
" <td>0.48</td>\n",
" <td>1.77</td>\n",
" <td>12.61</td>\n",
" <td>26.33</td>\n",
" <td>8.30</td>\n",
" <td>104.86</td>\n",
" <td>0.09</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>25.65</td>\n",
" <td>6.84</td>\n",
" <td>1.06</td>\n",
" <td>6.17</td>\n",
" <td>29.47</td>\n",
" <td>19.47</td>\n",
" <td>4.51</td>\n",
" <td>106.51</td>\n",
" <td>0.29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>5.81</td>\n",
" <td>24.77</td>\n",
" <td>-1.88</td>\n",
" <td>-1.79</td>\n",
" <td>0.21</td>\n",
" <td>7.37</td>\n",
" <td>-3.14</td>\n",
" <td>-1.98</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>100.93</td>\n",
" <td>42.38</td>\n",
" <td>0.03</td>\n",
" <td>-0.19</td>\n",
" <td>1.92</td>\n",
" <td>14.44</td>\n",
" <td>5.78</td>\n",
" <td>34.96</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>115.08</td>\n",
" <td>46.95</td>\n",
" <td>0.22</td>\n",
" <td>0.20</td>\n",
" <td>2.80</td>\n",
" <td>18.46</td>\n",
" <td>8.43</td>\n",
" <td>83.06</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>127.09</td>\n",
" <td>51.02</td>\n",
" <td>0.47</td>\n",
" <td>0.93</td>\n",
" <td>5.46</td>\n",
" <td>28.43</td>\n",
" <td>10.70</td>\n",
" <td>139.31</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>192.62</td>\n",
" <td>98.78</td>\n",
" <td>8.07</td>\n",
" <td>68.10</td>\n",
" <td>223.39</td>\n",
" <td>110.64</td>\n",
" <td>34.54</td>\n",
" <td>1191.00</td>\n",
" <td>1.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" IP Mean IP Sd IP Kurtosis IP Skewness DM-SNR Mean DM-SNR Sd \\\n",
"count 17898.00 17898.00 17898.00 17898.00 17898.00 17898.00 \n",
"mean 111.08 46.55 0.48 1.77 12.61 26.33 \n",
"std 25.65 6.84 1.06 6.17 29.47 19.47 \n",
"min 5.81 24.77 -1.88 -1.79 0.21 7.37 \n",
"25% 100.93 42.38 0.03 -0.19 1.92 14.44 \n",
"50% 115.08 46.95 0.22 0.20 2.80 18.46 \n",
"75% 127.09 51.02 0.47 0.93 5.46 28.43 \n",
"max 192.62 98.78 8.07 68.10 223.39 110.64 \n",
"\n",
" DM-SNR Kurtosis DM-SNR Skewness target_class \n",
"count 17898.00 17898.00 17898.00 \n",
"mean 8.30 104.86 0.09 \n",
"std 4.51 106.51 0.29 \n",
"min -3.14 -1.98 0.00 \n",
"25% 5.78 34.96 0.00 \n",
"50% 8.43 83.06 0.00 \n",
"75% 10.70 139.31 0.00 \n",
"max 34.54 1191.00 1.00 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view summary statistics in numerical variables\n",
"\n",
"round(df.describe(),2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On closer inspection, we can suspect that all the continuous variables may contain outliers.\n",
"\n",
"\n",
"I will draw boxplots to visualise outliers in the above variables. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0,0.5,'DM-SNR Skewness')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1728x1440 with 8 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# draw boxplots to visualize outliers\n",
"\n",
"plt.figure(figsize=(24,20))\n",
"\n",
"\n",
"plt.subplot(4, 2, 1)\n",
"fig = df.boxplot(column='IP Mean')\n",
"fig.set_title('')\n",
"fig.set_ylabel('IP Mean')\n",
"\n",
"\n",
"plt.subplot(4, 2, 2)\n",
"fig = df.boxplot(column='IP Sd')\n",
"fig.set_title('')\n",
"fig.set_ylabel('IP Sd')\n",
"\n",
"\n",
"plt.subplot(4, 2, 3)\n",
"fig = df.boxplot(column='IP Kurtosis')\n",
"fig.set_title('')\n",
"fig.set_ylabel('IP Kurtosis')\n",
"\n",
"\n",
"plt.subplot(4, 2, 4)\n",
"fig = df.boxplot(column='IP Skewness')\n",
"fig.set_title('')\n",
"fig.set_ylabel('IP Skewness')\n",
"\n",
"\n",
"plt.subplot(4, 2, 5)\n",
"fig = df.boxplot(column='DM-SNR Mean')\n",
"fig.set_title('')\n",
"fig.set_ylabel('DM-SNR Mean')\n",
"\n",
"\n",
"plt.subplot(4, 2, 6)\n",
"fig = df.boxplot(column='DM-SNR Sd')\n",
"fig.set_title('')\n",
"fig.set_ylabel('DM-SNR Sd')\n",
"\n",
"\n",
"plt.subplot(4, 2, 7)\n",
"fig = df.boxplot(column='DM-SNR Kurtosis')\n",
"fig.set_title('')\n",
"fig.set_ylabel('DM-SNR Kurtosis')\n",
"\n",
"\n",
"plt.subplot(4, 2, 8)\n",
"fig = df.boxplot(column='DM-SNR Skewness')\n",
"fig.set_title('')\n",
"fig.set_ylabel('DM-SNR Skewness')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above boxplots confirm that there are lot of outliers in these variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handle outliers with SVMs\n",
"\n",
"\n",
"There are 2 variants of SVMs. They are `hard-margin variant of SVM` and `soft-margin variant of SVM`.\n",
"\n",
"\n",
"The `hard-margin variant of SVM` does not deal with outliers. In this case, we want to find the hyperplane with maximum margin such that every training point is correctly classified with margin at least 1. This technique does not handle outliers well.\n",
"\n",
"\n",
"Another version of SVM is called `soft-margin variant of SVM`. In this case, we can have a few points incorrectly classified or \n",
"classified with a margin less than 1. But for every such point, we have to pay a penalty in the form of `C` parameter, which controls the outliers. `Low C` implies we are allowing more outliers and `high C` implies less outliers.\n",
"\n",
"\n",
"The message is that since the dataset contains outliers, so the value of C should be high while training the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the distribution of variables\n",
"\n",
"\n",
"Now, I will plot the histograms to check distributions to find out if they are normal or skewed. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0,0.5,'Number of pulsar stars')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1728x1440 with 8 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot histogram to check distribution\n",
"\n",
"\n",
"plt.figure(figsize=(24,20))\n",
"\n",
"\n",
"plt.subplot(4, 2, 1)\n",
"fig = df['IP Mean'].hist(bins=20)\n",
"fig.set_xlabel('IP Mean')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"plt.subplot(4, 2, 2)\n",
"fig = df['IP Sd'].hist(bins=20)\n",
"fig.set_xlabel('IP Sd')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"plt.subplot(4, 2, 3)\n",
"fig = df['IP Kurtosis'].hist(bins=20)\n",
"fig.set_xlabel('IP Kurtosis')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"\n",
"plt.subplot(4, 2, 4)\n",
"fig = df['IP Skewness'].hist(bins=20)\n",
"fig.set_xlabel('IP Skewness')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"\n",
"plt.subplot(4, 2, 5)\n",
"fig = df['DM-SNR Mean'].hist(bins=20)\n",
"fig.set_xlabel('DM-SNR Mean')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"\n",
"plt.subplot(4, 2, 6)\n",
"fig = df['DM-SNR Sd'].hist(bins=20)\n",
"fig.set_xlabel('DM-SNR Sd')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"\n",
"plt.subplot(4, 2, 7)\n",
"fig = df['DM-SNR Kurtosis'].hist(bins=20)\n",
"fig.set_xlabel('DM-SNR Kurtosis')\n",
"fig.set_ylabel('Number of pulsar stars')\n",
"\n",
"\n",
"plt.subplot(4, 2, 8)\n",
"fig = df['DM-SNR Skewness'].hist(bins=20)\n",
"fig.set_xlabel('DM-SNR Skewness')\n",
"fig.set_ylabel('Number of pulsar stars')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that all the 8 continuous variables are skewed. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(['target_class'], axis=1)\n",
"\n",
"y = df['target_class']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# split X and y into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((14318, 8), (3580, 8))"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of X_train and X_test\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Feature Scaling"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"cols = X_train.columns"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"scaler = StandardScaler()\n",
"\n",
"X_train = scaler.fit_transform(X_train)\n",
"\n",
"X_test = scaler.transform(X_test)\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"X_train = pd.DataFrame(X_train, columns=[cols])"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"X_test = pd.DataFrame(X_test, columns=[cols])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>IP Mean</th>\n",
" <th>IP Sd</th>\n",
" <th>IP Kurtosis</th>\n",
" <th>IP Skewness</th>\n",
" <th>DM-SNR Mean</th>\n",
" <th>DM-SNR Sd</th>\n",
" <th>DM-SNR Kurtosis</th>\n",
" <th>DM-SNR Skewness</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" <td>1.431800e+04</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1.986604e-16</td>\n",
" <td>-6.757488e-16</td>\n",
" <td>2.125527e-17</td>\n",
" <td>3.581784e-17</td>\n",
" <td>-2.205248e-17</td>\n",
" <td>-1.583840e-16</td>\n",
" <td>-9.700300e-18</td>\n",
" <td>1.214786e-16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" <td>1.000035e+00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-4.035499e+00</td>\n",
" <td>-3.181033e+00</td>\n",
" <td>-2.185946e+00</td>\n",
" <td>-5.744051e-01</td>\n",
" <td>-4.239001e-01</td>\n",
" <td>-9.733707e-01</td>\n",
" <td>-2.455649e+00</td>\n",
" <td>-1.003411e+00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>-3.896291e-01</td>\n",
" <td>-6.069473e-01</td>\n",
" <td>-4.256221e-01</td>\n",
" <td>-3.188054e-01</td>\n",
" <td>-3.664918e-01</td>\n",
" <td>-6.125457e-01</td>\n",
" <td>-5.641035e-01</td>\n",
" <td>-6.627590e-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1.587461e-01</td>\n",
" <td>5.846646e-02</td>\n",
" <td>-2.453172e-01</td>\n",
" <td>-2.578142e-01</td>\n",
" <td>-3.372294e-01</td>\n",
" <td>-4.067482e-01</td>\n",
" <td>3.170446e-02</td>\n",
" <td>-2.059136e-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>6.267059e-01</td>\n",
" <td>6.501017e-01</td>\n",
" <td>-1.001238e-02</td>\n",
" <td>-1.419621e-01</td>\n",
" <td>-2.463724e-01</td>\n",
" <td>1.078934e-01</td>\n",
" <td>5.362759e-01</td>\n",
" <td>3.256217e-01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>3.151882e+00</td>\n",
" <td>7.621116e+00</td>\n",
" <td>7.008906e+00</td>\n",
" <td>1.054430e+01</td>\n",
" <td>7.025568e+00</td>\n",
" <td>4.292181e+00</td>\n",
" <td>5.818557e+00</td>\n",
" <td>1.024613e+01</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" IP Mean IP Sd IP Kurtosis IP Skewness DM-SNR Mean \\\n",
"count 1.431800e+04 1.431800e+04 1.431800e+04 1.431800e+04 1.431800e+04 \n",
"mean 1.986604e-16 -6.757488e-16 2.125527e-17 3.581784e-17 -2.205248e-17 \n",
"std 1.000035e+00 1.000035e+00 1.000035e+00 1.000035e+00 1.000035e+00 \n",
"min -4.035499e+00 -3.181033e+00 -2.185946e+00 -5.744051e-01 -4.239001e-01 \n",
"25% -3.896291e-01 -6.069473e-01 -4.256221e-01 -3.188054e-01 -3.664918e-01 \n",
"50% 1.587461e-01 5.846646e-02 -2.453172e-01 -2.578142e-01 -3.372294e-01 \n",
"75% 6.267059e-01 6.501017e-01 -1.001238e-02 -1.419621e-01 -2.463724e-01 \n",
"max 3.151882e+00 7.621116e+00 7.008906e+00 1.054430e+01 7.025568e+00 \n",
"\n",
" DM-SNR Sd DM-SNR Kurtosis DM-SNR Skewness \n",
"count 1.431800e+04 1.431800e+04 1.431800e+04 \n",
"mean -1.583840e-16 -9.700300e-18 1.214786e-16 \n",
"std 1.000035e+00 1.000035e+00 1.000035e+00 \n",
"min -9.733707e-01 -2.455649e+00 -1.003411e+00 \n",
"25% -6.125457e-01 -5.641035e-01 -6.627590e-01 \n",
"50% -4.067482e-01 3.170446e-02 -2.059136e-01 \n",
"75% 1.078934e-01 5.362759e-01 3.256217e-01 \n",
"max 4.292181e+00 5.818557e+00 1.024613e+01 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.describe()"
]
},
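{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scaling step above can be sketched by hand as follows. This is a minimal illustration on hypothetical toy values (the lists `train` and `test` are assumptions, not taken from the pulsar dataset). The mean and standard deviation are learned from the training data only and then reused for the test data, which is what `fit_transform` followed by `transform` does. `StandardScaler` divides by the population standard deviation, while `describe()` reports the sample standard deviation, which is why the table above shows a std of 1.000035 rather than exactly 1.\n",
"\n",
"```python\n",
"from statistics import mean, pstdev\n",
"\n",
"# Hypothetical toy values for one feature, not taken from the pulsar dataset\n",
"train = [2.0, 4.0, 6.0, 8.0]\n",
"test = [5.0, 10.0]\n",
"\n",
"# Learn the statistics from the training data only\n",
"mu = mean(train)\n",
"sigma = pstdev(train)  # population standard deviation (divides by n)\n",
"\n",
"# Apply the same transformation to both sets\n",
"train_scaled = [(x - mu) / sigma for x in train]\n",
"test_scaled = [(x - mu) / sigma for x in test]\n",
"\n",
"print(train_scaled)\n",
"print(test_scaled)\n",
"```\n",
"\n",
"Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step."
]
},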
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have the `X_train` dataset ready to be fed into the SVM classifier. I will do it as follows."
]
},
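{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Run SVM with default hyperparameters\n",
"\n",
"\n",
"Default hyperparameters means C=1.0, kernel=`rbf` and gamma=`auto` among other parameters. This holds for the scikit-learn version used here; from version 0.22 onward the default is gamma=`scale`."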
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with default hyperparameters: 0.9827\n"
]
}
],
"source": [
"# import SVC classifier\n",
"from sklearn.svm import SVC\n",
"\n",
"\n",
"# import metrics to compute accuracy\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"\n",
"# instantiate classifier with default hyperparameters\n",
"svc=SVC() \n",
"\n",
"\n",
"# fit classifier to training set\n",
"svc.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=svc.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with default hyperparameters: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run SVM with rbf kernel and C=100.0\n",
"\n",
"\n",
"We have seen that there are outliers in our dataset. A higher value of C penalizes margin violations more heavily, so the classifier tolerates fewer misclassified training points. \n",
"So, I will run SVM with kernel=`rbf` and C=100.0."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with rbf kernel and C=100.0 : 0.9832\n"
]
}
],
"source": [
"# instantiate classifier with rbf kernel and C=100\n",
"svc=SVC(C=100.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"svc.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=svc.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with rbf kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that we obtain a slightly higher accuracy with C=100.0 (0.9832 versus 0.9827 with the default C=1.0).\n",
"\n",
"Now, I will further increase the value of C to 1000.0 and check the accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run SVM with rbf kernel and C=1000.0\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with rbf kernel and C=1000.0 : 0.9816\n"
]
}
],
"source": [
"# instantiate classifier with rbf kernel and C=1000\n",
"svc=SVC(C=1000.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"svc.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=svc.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with rbf kernel and C=1000.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, we can see that the accuracy has decreased to 0.9816 with C=1000.0."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Run SVM with linear kernel\n",
"\n",
"\n",
"### Run SVM with linear kernel and C=1.0"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with linear kernel and C=1.0 : 0.9830\n"
]
}
],
"source": [
"# instantiate classifier with linear kernel and C=1.0\n",
"linear_svc=SVC(kernel='linear', C=1.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"linear_svc.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred_test=linear_svc.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with linear kernel and C=1.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred_test)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run SVM with linear kernel and C=100.0"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with linear kernel and C=100.0 : 0.9832\n"
]
}
],
"source": [
"# instantiate classifier with linear kernel and C=100.0\n",
"linear_svc100=SVC(kernel='linear', C=100.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"linear_svc100.fit(X_train, y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=linear_svc100.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with linear kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run SVM with linear kernel and C=1000.0"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with linear kernel and C=1000.0 : 0.9832\n"
]
}
],
"source": [
"# instantiate classifier with linear kernel and C=1000.0\n",
"linear_svc1000=SVC(kernel='linear', C=1000.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"linear_svc1000.fit(X_train, y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=linear_svc1000.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with linear kernel and C=1000.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that we obtain a slightly higher accuracy with C=100.0 and C=1000.0 (0.9832) as compared to C=1.0 (0.9830)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, **y_test** are the true class labels and **y_pred** are the predicted class labels in the test-set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare the train-set and test-set accuracy\n",
"\n",
"\n",
"Now, I will compare the train-set and test-set accuracy to check for overfitting."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 1, ..., 0, 0, 0], dtype=int64)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_train = linear_svc.predict(X_train)\n",
"\n",
"y_pred_train"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training-set accuracy score: 0.9783\n"
]
}
],
"source": [
"print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the training set and test-set accuracy are very much comparable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for overfitting and underfitting"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.9783\n",
"Test set score: 0.9830\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(linear_svc.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(linear_svc.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training-set accuracy score is 0.9783 while the test-set accuracy is 0.9830. These two values are quite comparable. So, there is no sign of overfitting. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare model accuracy with null accuracy\n",
"\n",
"\n",
"So, the model accuracy is 0.9832. But, we cannot say that our model is very good based on the above accuracy. We must compare it with the **null accuracy**. Null accuracy is the accuracy that could be achieved by always predicting the most frequent class.\n",
"\n",
"So, we should first check the class distribution in the test set. "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 3306\n",
"1 274\n",
"Name: target_class, dtype: int64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check class distribution in test set\n",
"\n",
"y_test.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the number of occurrences of the most frequent class `0` is 3306. So, we can calculate the null accuracy by dividing 3306 by the total number of occurrences."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null accuracy score: 0.9235\n"
]
}
],
"source": [
"# check null accuracy score\n",
"\n",
"null_accuracy = (3306/(3306+274))\n",
"\n",
"print('Null accuracy score: {0:0.4f}'. format(null_accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that our model accuracy score is 0.9830 but null accuracy score is 0.9235. So, we can conclude that our SVM classifier is doing a very good job in predicting the class labels."
]
},
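{
"cell_type": "markdown",
"metadata": {},
"source": [
"The null accuracy can also be derived from the class counts rather than typed in by hand. The sketch below is a minimal illustration that rebuilds a label list from the counts shown above (the list `y_true` is a stand-in for `y_test`, and the variable names are assumptions).\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Class counts copied from the test-set value_counts() output above\n",
"y_true = [0] * 3306 + [1] * 274\n",
"\n",
"# Find the most frequent class and how often it occurs\n",
"counts = Counter(y_true)\n",
"majority_class, majority_count = counts.most_common(1)[0]\n",
"\n",
"# Null accuracy: always predict the majority class\n",
"null_accuracy = majority_count / len(y_true)\n",
"\n",
"print('Majority class:', majority_class)\n",
"print('Null accuracy score: {0:0.4f}'.format(null_accuracy))\n",
"```\n",
"\n",
"Any model worth keeping should beat this baseline on the same test set."
]
},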
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Run SVM with polynomial kernel\n",
"\n",
"\n",
"### Run SVM with polynomial kernel and C=1.0"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with polynomial kernel and C=1.0 : 0.9807\n"
]
}
],
"source": [
"# instantiate classifier with polynomial kernel and C=1.0\n",
"poly_svc=SVC(kernel='poly', C=1.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"poly_svc.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=poly_svc.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with polynomial kernel and C=1.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run SVM with polynomial kernel and C=100.0"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with polynomial kernel and C=100.0 : 0.9824\n"
]
}
],
"source": [
"# instantiate classifier with polynomial kernel and C=100.0\n",
"poly_svc100=SVC(kernel='poly', C=100.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"poly_svc100.fit(X_train, y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=poly_svc100.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with polynomial kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The polynomial kernel gives slightly lower accuracy (0.9807 with C=1.0 and 0.9824 with C=100.0) than the rbf and linear kernels. It may be overfitting the training set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Run SVM with sigmoid kernel\n",
"\n",
"\n",
"### Run SVM with sigmoid kernel and C=1.0"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with sigmoid kernel and C=1.0 : 0.8858\n"
]
}
],
"source": [
"# instantiate classifier with sigmoid kernel and C=1.0\n",
"sigmoid_svc=SVC(kernel='sigmoid', C=1.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"sigmoid_svc.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=sigmoid_svc.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with sigmoid kernel and C=1.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run SVM with sigmoid kernel and C=100.0"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score with sigmoid kernel and C=100.0 : 0.8855\n"
]
}
],
"source": [
"# instantiate classifier with sigmoid kernel and C=100.0\n",
"sigmoid_svc100=SVC(kernel='sigmoid', C=100.0) \n",
"\n",
"\n",
"# fit classifier to training set\n",
"sigmoid_svc100.fit(X_train,y_train)\n",
"\n",
"\n",
"# make predictions on test set\n",
"y_pred=sigmoid_svc100.predict(X_test)\n",
"\n",
"\n",
"# compute and print accuracy score\n",
"print('Model accuracy score with sigmoid kernel and C=100.0 : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the sigmoid kernel performs poorly. Its accuracy (0.8858 with C=1.0 and 0.8855 with C=100.0) is below the null accuracy of 0.9235."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"We get maximum accuracy with `rbf` and `linear` kernel with C=100.0, and the accuracy is 0.9832. Based on the above analysis we can conclude that our classification model accuracy is very good. Our model is doing a very good job in terms of predicting the class labels.\n",
"\n",
"\n",
"But, this is not the whole picture. Here, we have an imbalanced dataset. The problem is that accuracy is an inadequate measure for quantifying predictive performance in the imbalanced dataset problem.\n",
"\n",
"\n",
"So, we must explore alternative metrics that provide better guidance in selecting models. In particular, we would like to know the underlying distribution of values and the type of errors our classifier is making. \n",
"\n",
"\n",
"One such tool to analyze the model performance in imbalanced classes problem is the `Confusion matrix`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Confusion matrix\n",
"\n",
"\n",
"A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
"\n",
"\n",
"Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-\n",
"\n",
"\n",
"**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
"\n",
"\n",
"**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
"\n",
"\n",
"**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
"\n",
"\n",
"\n",
"**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called **Type II error.**\n",
"\n",
"\n",
"\n",
"These four outcomes are summarized in a confusion matrix given below.\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[3289 17]\n",
" [ 44 230]]\n",
"\n",
"True Positives(TP) = 230\n",
"\n",
"True Negatives(TN) = 3289\n",
"\n",
"False Positives(FP) = 17\n",
"\n",
"False Negatives(FN) = 44\n"
]
}
],
"source": [
"# Print the Confusion Matrix and slice it into four pieces\n",
"# Rows are actual classes and columns are predicted classes, in label order [0, 1];\n",
"# class 1 (pulsar) is treated as the positive class\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm = confusion_matrix(y_test, y_pred_test)\n",
"\n",
"print('Confusion matrix\\n\\n', cm)\n",
"\n",
"print('\\nTrue Positives(TP) = ', cm[1,1])\n",
"\n",
"print('\\nTrue Negatives(TN) = ', cm[0,0])\n",
"\n",
"print('\\nFalse Positives(FP) = ', cm[0,1])\n",
"\n",
"print('\\nFalse Negatives(FN) = ', cm[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The confusion matrix shows `3289 + 230 = 3519 correct predictions` and `17 + 44 = 61 incorrect predictions`.\n",
"\n",
"\n",
"In this case, we have\n",
"\n",
"\n",
"- `True Positives` (Actual Positive:1 and Predict Positive:1) - 230\n",
"\n",
"\n",
"- `True Negatives` (Actual Negative:0 and Predict Negative:0) - 3289\n",
"\n",
"\n",
"- `False Positives` (Actual Negative:0 but Predict Positive:1) - 17 `(Type I error)`\n",
"\n",
"\n",
"- `False Negatives` (Actual Positive:1 but Predict Negative:0) - 44 `(Type II error)`"
]
},
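{
"cell_type": "markdown",
"metadata": {},
"source": [
"The layout that scikit-learn's `confusion_matrix` uses can be reproduced by hand on a small example. This is a minimal sketch on hypothetical labels (`y_true` and `y_pred` below are assumptions for illustration): rows are indexed by the actual class and columns by the predicted class, in sorted label order, so with class `1` as the positive class the four cells read TN, FP, FN, TP.\n",
"\n",
"```python\n",
"# Hypothetical labels: 1 = pulsar (positive), 0 = non-pulsar (negative)\n",
"y_true = [0, 0, 0, 0, 1, 1, 1, 0]\n",
"y_pred = [0, 0, 1, 0, 1, 0, 1, 0]\n",
"\n",
"# cm[i][j] counts samples whose actual class is i and predicted class is j\n",
"cm = [[0, 0], [0, 0]]\n",
"for actual, predicted in zip(y_true, y_pred):\n",
"    cm[actual][predicted] += 1\n",
"\n",
"# First row holds the actual negatives, second row the actual positives\n",
"tn, fp = cm[0]\n",
"fn, tp = cm[1]\n",
"\n",
"print(cm)\n",
"print('TN =', tn, 'FP =', fp, 'FN =', fn, 'TP =', tp)\n",
"```\n",
"\n",
"Reading the cells in this order avoids swapping the positive and negative classes when the positive class is the minority."
]
},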
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x20f9f09588>"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# visualize confusion matrix with seaborn heatmap\n",
"# rows of cm are actual classes and columns are predicted classes, in label order [0, 1]\n",
"\n",
"cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], \n",
" index=['Actual Negative:0', 'Actual Positive:1'])\n",
"\n",
"sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 17. Classification metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification Report\n",
"\n",
"\n",
"**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1** and **support** scores for the model. I describe these terms later in this section.\n",
"\n",
"We can print a classification report as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.99 0.99 0.99 3306\n",
" 1 0.93 0.84 0.88 274\n",
"\n",
" micro avg 0.98 0.98 0.98 3580\n",
" macro avg 0.96 0.92 0.94 3580\n",
"weighted avg 0.98 0.98 0.98 3580\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, y_pred_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification accuracy"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# class 1 (pulsar) is the positive class; cm rows are actual and columns are predicted\n",
"TP = cm[1,1]\n",
"TN = cm[0,0]\n",
"FP = cm[0,1]\n",
"FN = cm[1,0]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification accuracy : 0.9830\n"
]
}
],
"source": [
"# print classification accuracy\n",
"\n",
"classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)\n",
"\n",
"print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification error"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification error : 0.0170\n"
]
}
],
"source": [
"# print classification error\n",
"\n",
"classification_error = (FP + FN) / float(TP + TN + FP + FN)\n",
"\n",
"print('Classification error : {0:0.4f}'.format(classification_error))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision\n",
"\n",
"\n",
"**Precision** can be defined as the percentage of correctly predicted positive outcomes out of all the predicted positive outcomes. It can be given as the ratio of true positives (TP) to the sum of true and false positives (TP + FP). \n",
"\n",
"\n",
"So, **Precision** identifies the proportion of correctly predicted positive outcome. It is more concerned with the positive class than the negative class.\n",
"\n",
"\n",
"\n",
"Mathematically, precision can be defined as the ratio of `TP to (TP + FP)`.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision : 0.9312\n"
]
}
],
"source": [
"# print precision score for the positive class (1 = pulsar)\n",
"# TP / (TP + FP), where cm[1,1] is TP and cm[0,1] is FP\n",
"\n",
"precision = cm[1,1] / float(cm[1,1] + cm[0,1])\n",
"\n",
"\n",
"print('Precision : {0:0.4f}'.format(precision))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recall\n",
"\n",
"\n",
"Recall can be defined as the percentage of correctly predicted positive outcomes out of all the actual positive outcomes.\n",
"It can be given as the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN). **Recall** is also called **Sensitivity**.\n",
"\n",
"\n",
"**Recall** identifies the proportion of correctly predicted actual positives.\n",
"\n",
"\n",
"Mathematically, **recall** can be defined as the ratio of `TP to (TP + FN)`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
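{
"name": "stdout",
"output_type": "stream",
"text": [
"Recall or Sensitivity : 0.8394\n"
]
}
],
"source": [
"# TP / (TP + FN) for the positive class (1 = pulsar), where cm[1,1] is TP and cm[1,0] is FN\n",
"recall = cm[1,1] / float(cm[1,1] + cm[1,0])\n",
"\n",
"print('Recall or Sensitivity : {0:0.4f}'.format(recall))"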
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### True Positive Rate\n",
"\n",
"\n",
"**True Positive Rate** is synonymous with **Recall**.\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True Positive Rate : 0.8394\n"
]
}
],
"source": [
"# TP / (TP + FN), where cm[1,1] is TP and cm[1,0] is FN\n",
"true_positive_rate = cm[1,1] / float(cm[1,1] + cm[1,0])\n",
"\n",
"\n",
"print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### False Positive Rate"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False Positive Rate : 0.0051\n"
]
}
],
"source": [
"# FP / (FP + TN), where cm[0,1] is FP and cm[0,0] is TN\n",
"false_positive_rate = cm[0,1] / float(cm[0,1] + cm[0,0])\n",
"\n",
"\n",
"print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Specificity"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Specificity : 0.9949\n"
]
}
],
"source": [
"# TN / (TN + FP), where cm[0,0] is TN and cm[0,1] is FP\n",
"specificity = cm[0,0] / float(cm[0,0] + cm[0,1])\n",
"\n",
"print('Specificity : {0:0.4f}'.format(specificity))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### f1-score\n",
"\n",
"\n",
"**f1-score** is the harmonic mean of precision and recall. The best possible **f1-score** would be 1.0 and the worst \n",
"would be 0.0. Because it combines precision and recall, the **f1-score** of the minority class is often lower than the overall accuracy, as it is here (0.88 for class `1` against an accuracy of 0.98). The weighted average of `f1-score` should be used to \n",
"compare classifier models, not global accuracy.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Support\n",
"\n",
"\n",
"**Support** is the actual number of occurrences of the class in our dataset."
]
},
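{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked check of the definitions in this section, the sketch below computes precision, recall, specificity and f1-score from the four confusion-matrix counts, taking class `1` (pulsar) as the positive class. The lowercase variable names are illustrative assumptions. The results agree with the class `1` row of the classification report above (0.93, 0.84 and 0.88).\n",
"\n",
"```python\n",
"# Counts read from the confusion matrix above, with class 1 (pulsar) as positive\n",
"tn, fp, fn, tp = 3289, 17, 44, 230\n",
"\n",
"precision = tp / (tp + fp)      # share of predicted pulsars that are real pulsars\n",
"recall = tp / (tp + fn)         # share of real pulsars that were found\n",
"specificity = tn / (tn + fp)    # share of non-pulsars correctly rejected\n",
"f1 = 2 * precision * recall / (precision + recall)  # harmonic mean\n",
"\n",
"print('Precision   : {0:0.4f}'.format(precision))\n",
"print('Recall      : {0:0.4f}'.format(recall))\n",
"print('Specificity : {0:0.4f}'.format(specificity))\n",
"print('f1-score    : {0:0.4f}'.format(f1))\n",
"```\n",
"\n",
"The gap between the recall and the overall accuracy is the kind of detail that accuracy alone hides on an imbalanced dataset."
]
},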
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 18. ROC - AUC\n",
"\n",
"\n",
"\n",
"### ROC Curve\n",
"\n",
"\n",
"Another tool to measure the classification model performance visually is **ROC Curve**. ROC Curve stands for **Receiver Operating Characteristic Curve**. An **ROC Curve** is a plot which shows the performance of a classification model at various \n",
"classification threshold levels. \n",
"\n",
"\n",
"\n",
"The **ROC Curve** plots the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold levels.\n",
"\n",
"\n",
"\n",
"**True Positive Rate (TPR)** is also called **Recall**. It is defined as the ratio of `TP to (TP + FN)`.\n",
"\n",
"\n",
"\n",
"**False Positive Rate (FPR)** is defined as the ratio of `FP to (FP + TN)`.\n",
"\n",
"\n",
"\n",
"Each point on the ROC Curve gives the TPR (True Positive Rate) and FPR (False Positive Rate) at one classification threshold, so the curve as a whole summarizes the performance of the classifier across all threshold levels. If we lower the threshold, more items are classified as positive, which increases both True Positives (TP) and False Positives (FP).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot ROC Curve\n",
"\n",
"from sklearn.metrics import roc_curve\n",
"\n",
"fpr, tpr, thresholds = roc_curve(y_test, y_pred_test)\n",
"\n",
"plt.figure(figsize=(6,4))\n",
"\n",
"plt.plot(fpr, tpr, linewidth=2)\n",
"\n",
"plt.plot([0,1], [0,1], 'k--' )\n",
"\n",
"plt.rcParams['font.size'] = 12\n",
"\n",
"plt.title('ROC curve for Predicting a Pulsar Star classifier')\n",
"\n",
"plt.xlabel('False Positive Rate (1 - Specificity)')\n",
"\n",
"plt.ylabel('True Positive Rate (Sensitivity)')\n",
"\n",
"plt.show()\n"
]
},
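{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ROC curve helps us choose a threshold level that balances sensitivity and specificity for a particular context. Note that the plot above is drawn from hard class predictions (`y_pred_test`), so it has only one operating point between (0, 0) and (1, 1). Passing continuous scores, for example from `linear_svc.decision_function(X_test)`, would trace the full curve."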
]
},
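{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the threshold sweep concrete, here is a minimal hand-rolled sketch on hypothetical labels and scores (the lists `y_true` and `scores` are assumptions, and this is not scikit-learn's implementation of `roc_curve`). Each distinct score is used as a threshold, giving one (FPR, TPR) point, and the area under the resulting curve is computed with the trapezoidal rule.\n",
"\n",
"```python\n",
"# Hypothetical true labels and continuous classifier scores (higher = more positive)\n",
"y_true = [0, 0, 1, 0, 1, 1]\n",
"scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]\n",
"\n",
"positives = sum(y_true)\n",
"negatives = len(y_true) - positives\n",
"\n",
"# Sweep thresholds from high to low; each threshold gives one (FPR, TPR) point\n",
"points = [(0.0, 0.0)]\n",
"for threshold in sorted(set(scores), reverse=True):\n",
"    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)\n",
"    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)\n",
"    points.append((fp / negatives, tp / positives))\n",
"\n",
"# Area under the curve by the trapezoidal rule\n",
"auc = sum((x1 - x0) * (y0 + y1) / 2\n",
"          for (x0, y0), (x1, y1) in zip(points, points[1:]))\n",
"\n",
"print(points)\n",
"print('AUC: {0:0.4f}'.format(auc))\n",
"```\n",
"\n",
"Lowering the threshold moves the point up and to the right, which is the trade-off between sensitivity and specificity described above."
]
},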
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC AUC\n",
"\n",
"\n",
"**ROC AUC** stands for **Receiver Operating Characteristic - Area Under Curve**. It is a technique to compare classifier performance. In this technique, we measure the `area under the curve (AUC)`. A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. \n",
"\n",
"\n",
"So, **ROC AUC** is the percentage of the ROC plot that is underneath the curve."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ROC AUC : 0.9171\n"
]
}
],
"source": [
"# compute ROC AUC\n",
"\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"ROC_AUC = roc_auc_score(y_test, y_pred_test)\n",
"\n",
"print('ROC AUC : {:.4f}'.format(ROC_AUC))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"- ROC AUC is a single number summary of classifier performance. The higher the value, the better the classifier.\n",
"\n",
"- ROC AUC of our model is 0.9171, which is much closer to 1 than to the 0.5 of a purely random classifier. So, we can conclude that our classifier does a good job in classifying the pulsar star."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross validated ROC AUC : 0.9756\n"
]
}
],
"source": [
"# calculate cross-validated ROC AUC \n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"Cross_validated_ROC_AUC = cross_val_score(linear_svc, X_train, y_train, cv=10, scoring='roc_auc').mean()\n",
"\n",
"print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 19. Stratified k-fold Cross Validation with shuffle split\n",
"\n",
"\n",
"k-fold cross-validation is a very useful technique to evaluate model performance. But, it can be unreliable here because we have an imbalanced dataset: a fold may contain very few samples of the minority class. So, in the case of an imbalanced dataset, I will use another technique to evaluate model performance. It is called `stratified k-fold cross-validation`.\n",
"\n",
"\n",
"In `stratified k-fold cross-validation`, we split the data such that the proportions between classes are the same in each fold as they are in the whole dataset.\n",
"\n",
"\n",
"Moreover, I will shuffle the data before splitting so that the folds do not depend on the original order of the rows. Note that the cells below use `KFold` with `shuffle=True`, which shuffles the data but does not stratify it; `StratifiedKFold` is the scikit-learn splitter that preserves the class proportions in each fold."
]
},
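{
"cell_type": "markdown",
"metadata": {},
"source": [
"The stratification described above can be sketched with scikit-learn's `StratifiedKFold`. This is a minimal illustration on hypothetical labels (the arrays `X` and `y` below are placeholders, not the pulsar data): with 10 positives among 100 samples and 5 folds, every test fold should contain the same number of positives.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import StratifiedKFold\n",
"\n",
"# Hypothetical imbalanced labels: 90 negatives and 10 positives\n",
"y = np.array([0] * 90 + [1] * 10)\n",
"X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification\n",
"\n",
"skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"# Count the positives that land in each test fold\n",
"positives_per_fold = []\n",
"for train_idx, test_idx in skf.split(X, y):\n",
"    positives_per_fold.append(int(y[test_idx].sum()))\n",
"\n",
"print(positives_per_fold)\n",
"```\n",
"\n",
"A splitter like `skf` can be passed as the `cv` argument of `cross_val_score` in place of a plain `KFold` object."
]
},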
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stratified k-Fold Cross Validation with shuffle split with linear kernel"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"\n",
"\n",
"kfold=KFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"\n",
"linear_svc=SVC(kernel='linear')\n",
"\n",
"\n",
"linear_scores = cross_val_score(linear_svc, X, y, cv=kfold)\n"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Stratified cross-validation scores with linear kernel:\n",
"\n",
"[0.98296089 0.97458101 0.97988827 0.97876502 0.97848561]\n"
]
}
],
"source": [
"# print cross-validation scores with linear kernel\n",
"\n",
"print('Stratified cross-validation scores with linear kernel:\\n\\n{}'.format(linear_scores))"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average stratified cross-validation score with linear kernel:0.9789\n"
]
}
],
"source": [
"# print average cross-validation score with linear kernel\n",
"\n",
"print('Average stratified cross-validation score with linear kernel:{:.4f}'.format(linear_scores.mean()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stratified k-Fold Cross Validation with shuffle split with rbf kernel"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"rbf_svc=SVC(kernel='rbf')\n",
"\n",
"\n",
"rbf_scores = cross_val_score(rbf_svc, X, y, cv=kfold)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Stratified Cross-validation scores with rbf kernel:\n",
"\n",
"[0.92541899 0.91201117 0.90167598 0.90835429 0.90472199]\n"
]
}
],
"source": [
"# print cross-validation scores with rbf kernel\n",
"\n",
"print('Stratified Cross-validation scores with rbf kernel:\\n\\n{}'.format(rbf_scores))"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average stratified cross-validation score with rbf kernel:0.9104\n"
]
}
],
"source": [
"# print average cross-validation score with rbf kernel\n",
"\n",
"print('Average stratified cross-validation score with rbf kernel:{:.4f}'.format(rbf_scores.mean()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"I obtain higher average stratified k-fold cross-validation score of 0.9789 with linear kernel but the model accuracy is 0.9832.\n",
"So, stratified cross-validation technique does not help to improve the model performance."
]
},
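{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since accuracy alone can hide poor performance on the minority class, it may help to look at the spread across folds and at class-sensitive metrics as well. The following is a sketch under stated assumptions: the data is a synthetic imbalanced set generated with `make_classification` (not the pulsar data), the names `X_demo`, `y_demo` and `cv_demo` are hypothetical, and the metric list is only one reasonable choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import StratifiedKFold, cross_validate\n",
"from sklearn.svm import SVC\n",
"\n",
"\n",
"# hypothetical imbalanced data: roughly 10% positive class\n",
"X_demo, y_demo = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0)\n",
"\n",
"\n",
"cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"\n",
"# evaluate several metrics on the same folds in one pass\n",
"metrics = ['accuracy', 'recall', 'f1']\n",
"results = cross_validate(SVC(kernel='linear'), X_demo, y_demo, cv=cv_demo, scoring=metrics)\n",
"\n",
"\n",
"for metric in metrics:\n",
"    scores = results['test_' + metric]\n",
"    print('{}: mean={:.4f}, std={:.4f}'.format(metric, scores.mean(), scores.std()))"
]
},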
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 20. Hyperparameter Optimization using GridSearch CV"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=5, error_score='raise-deprecating',\n",
" estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma='auto_deprecated',\n",
" kernel='rbf', max_iter=-1, probability=False, random_state=None,\n",
" shrinking=True, tol=0.001, verbose=False),\n",
" fit_params=None, iid='warn', n_jobs=None,\n",
" param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, {'C': [1, 10, 100, 1000], 'kernel': ['poly'], 'degree': [2, 3, 4], 'gamma': [0.01, 0.02, 0.03, 0.04, 0.05]}],\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
" scoring='accuracy', verbose=0)"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import GridSearchCV\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"\n",
"# import SVC classifier\n",
"from sklearn.svm import SVC\n",
"\n",
"\n",
"# instantiate classifier with default hyperparameters with kernel=rbf, C=1.0 and gamma=auto\n",
"svc=SVC() \n",
"\n",
"\n",
"\n",
"# declare parameters for hyperparameter tuning\n",
"parameters = [ {'C':[1, 10, 100, 1000], 'kernel':['linear']},\n",
" {'C':[1, 10, 100, 1000], 'kernel':['rbf'], 'gamma':[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]},\n",
" {'C':[1, 10, 100, 1000], 'kernel':['poly'], 'degree': [2,3,4] ,'gamma':[0.01,0.02,0.03,0.04,0.05]} \n",
" ]\n",
"\n",
"\n",
"\n",
"\n",
"grid_search = GridSearchCV(estimator = svc, \n",
" param_grid = parameters,\n",
" scoring = 'accuracy',\n",
" cv = 5,\n",
" verbose=0)\n",
"\n",
"\n",
"grid_search.fit(X_train, y_train)\n"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GridSearch CV best score : 0.9793\n",
"\n",
"\n",
"Parameters that give the best results : \n",
"\n",
" {'C': 10, 'gamma': 0.3, 'kernel': 'rbf'}\n",
"\n",
"\n",
"Estimator that was chosen by the search : \n",
"\n",
" SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma=0.3, kernel='rbf',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False)\n"
]
}
],
"source": [
"# examine the best model\n",
"\n",
"\n",
"# best score achieved during the GridSearchCV\n",
"print('GridSearch CV best score : {:.4f}\\n\\n'.format(grid_search.best_score_))\n",
"\n",
"\n",
"# print parameters that give the best results\n",
"print('Parameters that give the best results :','\\n\\n', (grid_search.best_params_))\n",
"\n",
"\n",
"# print estimator that was chosen by the GridSearch\n",
"print('\\n\\nEstimator that was chosen by the search :','\\n\\n', (grid_search.best_estimator_))"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GridSearch CV score on test set: 0.9835\n"
]
}
],
"source": [
"# calculate GridSearch CV score on test set\n",
"\n",
"print('GridSearch CV score on test set: {0:0.4f}'.format(grid_search.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"- Our original model test accuracy is 0.9832 while GridSearch CV score on test-set is 0.9835.\n",
"\n",
"\n",
"- So, GridSearch CV helps to identify the parameters that will improve the performance for this particular model.\n",
"\n",
"\n",
"- Here, we should not confuse `best_score_` attribute of `grid_search` with the `score` method on the test-set. \n",
"\n",
"\n",
"- The `score` method on the test-set gives the generalization performance of the model. Using the `score` method, we employ a model trained on the whole training set.\n",
"\n",
"\n",
"- The `best_score_` attribute gives the mean cross-validation accuracy, with cross-validation performed on the training set."
]
},
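{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between `best_score_` and the `score` method can be seen in a small self-contained sketch. It assumes synthetic data from `make_classification` and a deliberately tiny parameter grid. The names `X_gs`, `y_gs` and `demo_search` are hypothetical and do not touch the variables used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import GridSearchCV, train_test_split\n",
"from sklearn.svm import SVC\n",
"\n",
"\n",
"# hypothetical data, split into training and held-out test sets\n",
"X_gs, y_gs = make_classification(n_samples=300, n_features=6, random_state=0)\n",
"X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(X_gs, y_gs, test_size=0.2, random_state=0)\n",
"\n",
"\n",
"demo_search = GridSearchCV(SVC(), param_grid={'C': [1, 10], 'gamma': [0.1, 0.5]}, scoring='accuracy', cv=5)\n",
"demo_search.fit(X_gs_train, y_gs_train)\n",
"\n",
"\n",
"# mean cross-validated accuracy of the best parameter combination (training data only)\n",
"print('best_score_ : {:.4f}'.format(demo_search.best_score_))\n",
"\n",
"\n",
"# accuracy of the refit best estimator on the held-out test set\n",
"print('test score  : {:.4f}'.format(demo_search.score(X_gs_test, y_gs_test)))"
]
},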
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 21. Results and conclusion\n",
"\n",
"\n",
"\n",
"1. There are outliers in our dataset. So, as I increase the value of C to limit fewer outliers, the accuracy increased. This is true with different kinds of kernels.\n",
"\n",
"2.\tWe get maximum accuracy with `rbf` and `linear` kernel with C=100.0 and the accuracy is 0.9832. So, we can conclude that our model is doing a very good job in terms of predicting the class labels. But, this is not true. Here, we have an imbalanced dataset. Accuracy is an inadequate measure for quantifying predictive performance in the imbalanced dataset problem. So, we must explore `confusion matrix` that provide better guidance in selecting models. \n",
"\n",
"3.\tROC AUC of our model is very close to 1. So, we can conclude that our classifier does a good job in classifying the pulsar star.\n",
"\n",
"4.\tI obtain higher average stratified k-fold cross-validation score of 0.9789 with linear kernel but the model accuracy is 0.9832. So, stratified cross-validation technique does not help to improve the model performance.\n",
"\n",
"5.\tOur original model test accuracy is 0.9832 while GridSearch CV score on test-set is 0.9835. So, GridSearch CV helps to identify the parameters that will improve the performance for this particular model.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}