Logistic Regression with Python and Scikit-Learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Logistic Regression with Python and Scikit-Learn\n",
"\n",
"\n",
"In this project, I implement Logistic Regression with Python and Scikit-Learn. I build a classifier to predict whether or not it will rain tomorrow in Australia by training a binary classification model using Logistic Regression. I have used the **Rain in Australia** dataset downloaded from the Kaggle website for this project."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"The table of contents for this project is as follows:-\n",
"\n",
"\n",
"1.\tIntroduction to Logistic Regression\n",
"2.\tLogistic Regression intuition\n",
"3.\tThe problem statement\n",
"4.\tDataset description\n",
"5.\tImport libraries\n",
"6.\tImport dataset\n",
"7.\tExploratory data analysis\n",
"8.\tDeclare feature vector and target variable\n",
"9.\tSplit data into separate training and test set\n",
"10.\tFeature engineering\n",
"11.\tFeature scaling\n",
"12.\tModel training\n",
"13.\tPredict results\n",
"14.\tCheck accuracy score\n",
"15.\tConfusion matrix\n",
"16.\tClassification metrices\n",
"17.\tAdjusting the threshold level\n",
"18.\tROC - AUC\n",
"19.\tRecursive feature elimination\n",
"20.\tk-Fold Cross Validation\n",
"21.\tHyperparameter optimization using GridSearch CV\n",
"22.\tResults and conclusion\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to Logistic Regression\n",
"\n",
"\n",
"When data scientists may come across a new classification problem, the first algorithm that may come across their mind is **Logistic Regression**. It is a supervised learning classification algorithm which is used to predict observations to a discrete set of classes. Practically, it is used to classify observations into different categories. Hence, its output is discrete in nature. **Logistic Regression** is also called **Logit Regression**. It is one of the most simple, straightforward and versatile classification algorithms which is used to solve classification problems."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Logistic Regression intuition\n",
"\n",
"\n",
"In statistics, the **Logistic Regression model** is a widely used statistical model which is primarily used for classification purposes. It means that given a set of observations, Logistic Regression algorithm helps us to classify these observations into two or more discrete classes. So, the target variable is discrete in nature.\n",
"\n",
"\n",
"Logistic Regression algorithm works by implementing a linear equation with independent or explanatory variables to predict a response value. This predicted response value, denoted by z is then converted into a probability value that lie between 0 and 1. We use the **sigmoid function** in order to map predicted values to probability values. This sigmoid function then maps any real value into a probability value between 0 and 1. \n",
"\n",
"\n",
"\n",
"The sigmoid function returns a probability value between 0 and 1. This probability value is then mapped to a discrete class which is either “0” or “1”. In order to map this probability value to a discrete class (pass/fail, yes/no, true/false), we select a threshold value. This threshold value is called **Decision boundary**. Above this threshold value, we will map the probability values into class 1 and below which we will map values into class 0.\n",
"\n",
"\n",
"Mathematically, it can be expressed as follows:-\n",
"\n",
"\n",
" p ≥ 0.5 => class = 1\n",
" \n",
" p < 0.5 => class = 0 \n",
"\n",
"\n",
"Generally, the decision boundary is set to 0.5. So, if the probability value is 0.8 (> 0.5), we will map this observation to class 1. Similarly, if the probability value is 0.2 (< 0.5), we will map this observation to class 0.\n",
"\n",
"\n",
"We can use our knowledge of `sigmoid function` and `decision boundary` to write a prediction function. A prediction function in logistic regression returns the probability of the observation being positive, `Yes` or `True`. We call this as `class 1` and it is denoted by `P(class = 1)`. If the probability inches closer to one, then we will be more confident about our model that the observation is in class 1.\n",
"\n",
"Logistic regression intuition is discussed in depth in the readme document."
]
},
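{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mapping described above can be sketched in a few lines of NumPy. This is only an illustration and not part of the original analysis: the helper names `sigmoid` and `predict_class`, the example scores and the default threshold of 0.5 are assumptions chosen for the demonstration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # map any real-valued score z to a probability in (0, 1)\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"def predict_class(z, threshold=0.5):\n",
"    # assign class 1 when P(class = 1) >= threshold, otherwise class 0\n",
"    return (sigmoid(z) >= threshold).astype(int)\n",
"\n",
"# hypothetical linear scores for three observations\n",
"z = np.array([-2.0, 0.0, 1.5])\n",
"\n",
"print(sigmoid(z))\n",
"print(predict_class(z))"
]
},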
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. The problem statement\n",
"\n",
"\n",
"In this project, I try to answer the question that whether or not it will rain tomorrow in Australia. I implement Logistic Regression with Python and Scikit-Learn. \n",
"\n",
"\n",
"To answer the question, I build a classifier to predict whether or not it will rain tomorrow in Australia by training a binary classification model using Logistic Regression. I have used the **Rain in Australia** dataset downloaded from the Kaggle website for this project."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dataset description\n",
"\n",
"\n",
"I have used the **Rain in Australia** data set downloaded from the Kaggle website.\n",
"\n",
"\n",
"I have downloaded this data set from the Kaggle website. The data set can be found at the following url:-\n",
"\n",
"\n",
"https://www.kaggle.com/jsphyg/weather-dataset-rattle-package\n",
"\n",
"\n",
"This dataset contains daily weather observations from numerous Australian weather stations. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/weatherAUS.csv'\n",
"\n",
"df = pd.read_csv(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about the data. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(142193, 24)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 142193 instances and 24 variables in the data set."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Location</th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindDir9am</th>\n",
" <th>...</th>\n",
" <th>Humidity3pm</th>\n",
" <th>Pressure9am</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>RainToday</th>\n",
" <th>RISK_MM</th>\n",
" <th>RainTomorrow</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2008-12-01</td>\n",
" <td>Albury</td>\n",
" <td>13.4</td>\n",
" <td>22.9</td>\n",
" <td>0.6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>44.0</td>\n",
" <td>W</td>\n",
" <td>...</td>\n",
" <td>22.0</td>\n",
" <td>1007.7</td>\n",
" <td>1007.1</td>\n",
" <td>8.0</td>\n",
" <td>NaN</td>\n",
" <td>16.9</td>\n",
" <td>21.8</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2008-12-02</td>\n",
" <td>Albury</td>\n",
" <td>7.4</td>\n",
" <td>25.1</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WNW</td>\n",
" <td>44.0</td>\n",
" <td>NNW</td>\n",
" <td>...</td>\n",
" <td>25.0</td>\n",
" <td>1010.6</td>\n",
" <td>1007.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>17.2</td>\n",
" <td>24.3</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2008-12-03</td>\n",
" <td>Albury</td>\n",
" <td>12.9</td>\n",
" <td>25.7</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WSW</td>\n",
" <td>46.0</td>\n",
" <td>W</td>\n",
" <td>...</td>\n",
" <td>30.0</td>\n",
" <td>1007.6</td>\n",
" <td>1008.7</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>23.2</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2008-12-04</td>\n",
" <td>Albury</td>\n",
" <td>9.2</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NE</td>\n",
" <td>24.0</td>\n",
" <td>SE</td>\n",
" <td>...</td>\n",
" <td>16.0</td>\n",
" <td>1017.6</td>\n",
" <td>1012.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.1</td>\n",
" <td>26.5</td>\n",
" <td>No</td>\n",
" <td>1.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2008-12-05</td>\n",
" <td>Albury</td>\n",
" <td>17.5</td>\n",
" <td>32.3</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>41.0</td>\n",
" <td>ENE</td>\n",
" <td>...</td>\n",
" <td>33.0</td>\n",
" <td>1010.8</td>\n",
" <td>1006.0</td>\n",
" <td>7.0</td>\n",
" <td>8.0</td>\n",
" <td>17.8</td>\n",
" <td>29.7</td>\n",
" <td>No</td>\n",
" <td>0.2</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 24 columns</p>\n",
"</div>"
],
"text/plain": [
" Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine \\\n",
"0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN \n",
"1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN \n",
"2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN \n",
"3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN \n",
"4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN \n",
"\n",
" WindGustDir WindGustSpeed WindDir9am ... Humidity3pm \\\n",
"0 W 44.0 W ... 22.0 \n",
"1 WNW 44.0 NNW ... 25.0 \n",
"2 WSW 46.0 W ... 30.0 \n",
"3 NE 24.0 SE ... 16.0 \n",
"4 W 41.0 ENE ... 33.0 \n",
"\n",
" Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday \\\n",
"0 1007.7 1007.1 8.0 NaN 16.9 21.8 No \n",
"1 1010.6 1007.8 NaN NaN 17.2 24.3 No \n",
"2 1007.6 1008.7 NaN 2.0 21.0 23.2 No \n",
"3 1017.6 1012.8 NaN NaN 18.1 26.5 No \n",
"4 1010.8 1006.0 7.0 8.0 17.8 29.7 No \n",
"\n",
" RISK_MM RainTomorrow \n",
"0 0.0 No \n",
"1 0.0 No \n",
"2 0.0 No \n",
"3 1.0 No \n",
"4 0.2 No \n",
"\n",
"[5 rows x 24 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',\n",
" 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',\n",
" 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',\n",
" 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',\n",
" 'Temp3pm', 'RainToday', 'RISK_MM', 'RainTomorrow'],\n",
" dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col_names = df.columns\n",
"\n",
"col_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drop RISK_MM variable\n",
"\n",
"It is given in the dataset description, that we should drop the `RISK_MM` feature variable from the dataset description. So, we \n",
"should drop it as follows-"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df.drop(['RISK_MM'], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 142193 entries, 0 to 142192\n",
"Data columns (total 23 columns):\n",
"Date 142193 non-null object\n",
"Location 142193 non-null object\n",
"MinTemp 141556 non-null float64\n",
"MaxTemp 141871 non-null float64\n",
"Rainfall 140787 non-null float64\n",
"Evaporation 81350 non-null float64\n",
"Sunshine 74377 non-null float64\n",
"WindGustDir 132863 non-null object\n",
"WindGustSpeed 132923 non-null float64\n",
"WindDir9am 132180 non-null object\n",
"WindDir3pm 138415 non-null object\n",
"WindSpeed9am 140845 non-null float64\n",
"WindSpeed3pm 139563 non-null float64\n",
"Humidity9am 140419 non-null float64\n",
"Humidity3pm 138583 non-null float64\n",
"Pressure9am 128179 non-null float64\n",
"Pressure3pm 128212 non-null float64\n",
"Cloud9am 88536 non-null float64\n",
"Cloud3pm 85099 non-null float64\n",
"Temp9am 141289 non-null float64\n",
"Temp3pm 139467 non-null float64\n",
"RainToday 140787 non-null object\n",
"RainTomorrow 142193 non-null object\n",
"dtypes: float64(16), object(7)\n",
"memory usage: 25.0+ MB\n"
]
}
],
"source": [
"# view summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Types of variables\n",
"\n",
"\n",
"In this section, I segregate the dataset into categorical and numerical variables. There are a mixture of categorical and numerical variables in the dataset. Categorical variables have data type object. Numerical variables have data type float64.\n",
"\n",
"\n",
"First of all, I will find categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 7 categorical variables\n",
"\n",
"The categorical variables are : ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']\n"
]
}
],
"source": [
"# find categorical variables\n",
"\n",
"categorical = [var for var in df.columns if df[var].dtype=='O']\n",
"\n",
"print('There are {} categorical variables\\n'.format(len(categorical)))\n",
"\n",
"print('The categorical variables are :', categorical)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Location</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindDir9am</th>\n",
" <th>WindDir3pm</th>\n",
" <th>RainToday</th>\n",
" <th>RainTomorrow</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2008-12-01</td>\n",
" <td>Albury</td>\n",
" <td>W</td>\n",
" <td>W</td>\n",
" <td>WNW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2008-12-02</td>\n",
" <td>Albury</td>\n",
" <td>WNW</td>\n",
" <td>NNW</td>\n",
" <td>WSW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2008-12-03</td>\n",
" <td>Albury</td>\n",
" <td>WSW</td>\n",
" <td>W</td>\n",
" <td>WSW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2008-12-04</td>\n",
" <td>Albury</td>\n",
" <td>NE</td>\n",
" <td>SE</td>\n",
" <td>E</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2008-12-05</td>\n",
" <td>Albury</td>\n",
" <td>W</td>\n",
" <td>ENE</td>\n",
" <td>NW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Location WindGustDir WindDir9am WindDir3pm RainToday \\\n",
"0 2008-12-01 Albury W W WNW No \n",
"1 2008-12-02 Albury WNW NNW WSW No \n",
"2 2008-12-03 Albury WSW W WSW No \n",
"3 2008-12-04 Albury NE SE E No \n",
"4 2008-12-05 Albury W ENE NW No \n",
"\n",
" RainTomorrow \n",
"0 No \n",
"1 No \n",
"2 No \n",
"3 No \n",
"4 No "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the categorical variables\n",
"\n",
"df[categorical].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of categorical variables\n",
"\n",
"\n",
"- There is a date variable. It is denoted by `Date` column.\n",
"\n",
"\n",
"- There are 6 categorical variables. These are given by `Location`, `WindGustDir`, `WindDir9am`, `WindDir3pm`, `RainToday` and `RainTomorrow`.\n",
"\n",
"\n",
"- There are two binary categorical variables - `RainToday` and `RainTomorrow`.\n",
"\n",
"\n",
"- `RainTomorrow` is the target variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore problems within categorical variables\n",
"\n",
"\n",
"First, I will explore the categorical variables.\n",
"\n",
"\n",
"### Missing values in categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date 0\n",
"Location 0\n",
"WindGustDir 9330\n",
"WindDir9am 10013\n",
"WindDir3pm 3778\n",
"RainToday 1406\n",
"RainTomorrow 0\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables\n",
"\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindGustDir 9330\n",
"WindDir9am 10013\n",
"WindDir3pm 3778\n",
"RainToday 1406\n",
"dtype: int64\n"
]
}
],
"source": [
"# print categorical variables containing missing values\n",
"\n",
"cat1 = [var for var in categorical if df[var].isnull().sum()!=0]\n",
"\n",
"print(df[cat1].isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are only 4 categorical variables in the dataset which contains missing values. These are `WindGustDir`, `WindDir9am`, `WindDir3pm` and `RainToday`."
]
},
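{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put missing-value counts like these in proportion, the share of missing values per column can be computed with `isnull().mean()`. The cell below is a minimal sketch on a small made-up frame: the variable names `toy` and `missing_share` and all the values are hypothetical, and only the column names mirror the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# hypothetical toy frame; the values are made up for illustration\n",
"toy = pd.DataFrame({\n",
"    'WindGustDir': ['W', np.nan, 'SE', 'N'],\n",
"    'RainToday': ['No', 'Yes', 'No', 'No'],\n",
"})\n",
"\n",
"# isnull() gives a boolean frame; its column-wise mean is the fraction of missing values\n",
"missing_share = toy.isnull().mean()\n",
"\n",
"print(missing_share)"
]
},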
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Frequency counts of categorical variables\n",
"\n",
"\n",
"Now, I will check the frequency counts of categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2014-10-12 49\n",
"2017-01-15 49\n",
"2013-10-02 49\n",
"2014-07-15 49\n",
"2014-02-19 49\n",
"2016-08-21 49\n",
"2014-07-03 49\n",
"2016-10-21 49\n",
"2013-03-11 49\n",
"2017-02-08 49\n",
"2014-11-17 49\n",
"2013-04-25 49\n",
"2014-11-19 49\n",
"2014-08-30 49\n",
"2014-01-07 49\n",
"2013-04-10 49\n",
"2017-03-16 49\n",
"2013-09-04 49\n",
"2016-08-16 49\n",
"2016-10-19 49\n",
"2014-08-20 49\n",
"2017-05-12 49\n",
"2014-01-16 49\n",
"2016-07-22 49\n",
"2017-01-22 49\n",
"2013-09-25 49\n",
"2013-06-02 49\n",
"2016-07-06 49\n",
"2014-04-21 49\n",
"2013-10-16 49\n",
" ..\n",
"2007-11-23 1\n",
"2008-01-15 1\n",
"2007-12-22 1\n",
"2007-11-08 1\n",
"2007-11-29 1\n",
"2008-01-29 1\n",
"2008-01-06 1\n",
"2007-11-02 1\n",
"2007-12-25 1\n",
"2008-01-28 1\n",
"2007-12-08 1\n",
"2007-11-09 1\n",
"2008-01-05 1\n",
"2007-11-26 1\n",
"2007-11-10 1\n",
"2007-11-20 1\n",
"2008-01-14 1\n",
"2007-12-03 1\n",
"2008-01-12 1\n",
"2007-11-03 1\n",
"2007-12-02 1\n",
"2008-01-31 1\n",
"2007-12-01 1\n",
"2007-11-06 1\n",
"2007-11-27 1\n",
"2007-12-19 1\n",
"2007-11-19 1\n",
"2007-12-30 1\n",
"2007-12-23 1\n",
"2008-01-09 1\n",
"Name: Date, Length: 3436, dtype: int64\n",
"Canberra 3418\n",
"Sydney 3337\n",
"Perth 3193\n",
"Darwin 3192\n",
"Hobart 3188\n",
"Brisbane 3161\n",
"Adelaide 3090\n",
"Bendigo 3034\n",
"Townsville 3033\n",
"AliceSprings 3031\n",
"MountGambier 3030\n",
"Launceston 3028\n",
"Ballarat 3028\n",
"Albany 3016\n",
"Albury 3011\n",
"PerthAirport 3009\n",
"MelbourneAirport 3009\n",
"Mildura 3007\n",
"SydneyAirport 3005\n",
"Nuriootpa 3002\n",
"Sale 3000\n",
"Watsonia 2999\n",
"Tuggeranong 2998\n",
"Portland 2996\n",
"Woomera 2990\n",
"Cairns 2988\n",
"Cobar 2988\n",
"Wollongong 2983\n",
"GoldCoast 2980\n",
"WaggaWagga 2976\n",
"NorfolkIsland 2964\n",
"Penrith 2964\n",
"Newcastle 2955\n",
"SalmonGums 2955\n",
"CoffsHarbour 2953\n",
"Witchcliffe 2952\n",
"Richmond 2951\n",
"Dartmoor 2943\n",
"NorahHead 2929\n",
"BadgerysCreek 2928\n",
"MountGinini 2907\n",
"Moree 2854\n",
"Walpole 2819\n",
"PearceRAAF 2762\n",
"Williamtown 2553\n",
"Melbourne 2435\n",
"Nhil 1569\n",
"Katherine 1559\n",
"Uluru 1521\n",
"Name: Location, dtype: int64\n",
"W 9780\n",
"SE 9309\n",
"E 9071\n",
"N 9033\n",
"SSE 8993\n",
"S 8949\n",
"WSW 8901\n",
"SW 8797\n",
"SSW 8610\n",
"WNW 8066\n",
"NW 8003\n",
"ENE 7992\n",
"ESE 7305\n",
"NE 7060\n",
"NNW 6561\n",
"NNE 6433\n",
"Name: WindGustDir, dtype: int64\n",
"N 11393\n",
"SE 9162\n",
"E 9024\n",
"SSE 8966\n",
"NW 8552\n",
"S 8493\n",
"W 8260\n",
"SW 8237\n",
"NNE 7948\n",
"NNW 7840\n",
"ENE 7735\n",
"ESE 7558\n",
"NE 7527\n",
"SSW 7448\n",
"WNW 7194\n",
"WSW 6843\n",
"Name: WindDir9am, dtype: int64\n",
"SE 10663\n",
"W 9911\n",
"S 9598\n",
"WSW 9329\n",
"SW 9182\n",
"SSE 9142\n",
"N 8667\n",
"WNW 8656\n",
"NW 8468\n",
"ESE 8382\n",
"E 8342\n",
"NE 8164\n",
"SSW 8010\n",
"NNW 7733\n",
"ENE 7724\n",
"NNE 6444\n",
"Name: WindDir3pm, dtype: int64\n",
"No 109332\n",
"Yes 31455\n",
"Name: RainToday, dtype: int64\n",
"No 110316\n",
"Yes 31877\n",
"Name: RainTomorrow, dtype: int64\n"
]
}
],
"source": [
"# view frequency of categorical variables\n",
"\n",
"for var in categorical: \n",
" \n",
" print(df[var].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2014-10-12 0.000345\n",
"2017-01-15 0.000345\n",
"2013-10-02 0.000345\n",
"2014-07-15 0.000345\n",
"2014-02-19 0.000345\n",
"2016-08-21 0.000345\n",
"2014-07-03 0.000345\n",
"2016-10-21 0.000345\n",
"2013-03-11 0.000345\n",
"2017-02-08 0.000345\n",
"2014-11-17 0.000345\n",
"2013-04-25 0.000345\n",
"2014-11-19 0.000345\n",
"2014-08-30 0.000345\n",
"2014-01-07 0.000345\n",
"2013-04-10 0.000345\n",
"2017-03-16 0.000345\n",
"2013-09-04 0.000345\n",
"2016-08-16 0.000345\n",
"2016-10-19 0.000345\n",
"2014-08-20 0.000345\n",
"2017-05-12 0.000345\n",
"2014-01-16 0.000345\n",
"2016-07-22 0.000345\n",
"2017-01-22 0.000345\n",
"2013-09-25 0.000345\n",
"2013-06-02 0.000345\n",
"2016-07-06 0.000345\n",
"2014-04-21 0.000345\n",
"2013-10-16 0.000345\n",
" ... \n",
"2007-11-23 0.000007\n",
"2008-01-15 0.000007\n",
"2007-12-22 0.000007\n",
"2007-11-08 0.000007\n",
"2007-11-29 0.000007\n",
"2008-01-29 0.000007\n",
"2008-01-06 0.000007\n",
"2007-11-02 0.000007\n",
"2007-12-25 0.000007\n",
"2008-01-28 0.000007\n",
"2007-12-08 0.000007\n",
"2007-11-09 0.000007\n",
"2008-01-05 0.000007\n",
"2007-11-26 0.000007\n",
"2007-11-10 0.000007\n",
"2007-11-20 0.000007\n",
"2008-01-14 0.000007\n",
"2007-12-03 0.000007\n",
"2008-01-12 0.000007\n",
"2007-11-03 0.000007\n",
"2007-12-02 0.000007\n",
"2008-01-31 0.000007\n",
"2007-12-01 0.000007\n",
"2007-11-06 0.000007\n",
"2007-11-27 0.000007\n",
"2007-12-19 0.000007\n",
"2007-11-19 0.000007\n",
"2007-12-30 0.000007\n",
"2007-12-23 0.000007\n",
"2008-01-09 0.000007\n",
"Name: Date, Length: 3436, dtype: float64\n",
"Canberra 0.024038\n",
"Sydney 0.023468\n",
"Perth 0.022455\n",
"Darwin 0.022448\n",
"Hobart 0.022420\n",
"Brisbane 0.022230\n",
"Adelaide 0.021731\n",
"Bendigo 0.021337\n",
"Townsville 0.021330\n",
"AliceSprings 0.021316\n",
"MountGambier 0.021309\n",
"Launceston 0.021295\n",
"Ballarat 0.021295\n",
"Albany 0.021211\n",
"Albury 0.021175\n",
"PerthAirport 0.021161\n",
"MelbourneAirport 0.021161\n",
"Mildura 0.021147\n",
"SydneyAirport 0.021133\n",
"Nuriootpa 0.021112\n",
"Sale 0.021098\n",
"Watsonia 0.021091\n",
"Tuggeranong 0.021084\n",
"Portland 0.021070\n",
"Woomera 0.021028\n",
"Cairns 0.021014\n",
"Cobar 0.021014\n",
"Wollongong 0.020979\n",
"GoldCoast 0.020957\n",
"WaggaWagga 0.020929\n",
"NorfolkIsland 0.020845\n",
"Penrith 0.020845\n",
"Newcastle 0.020782\n",
"SalmonGums 0.020782\n",
"CoffsHarbour 0.020768\n",
"Witchcliffe 0.020761\n",
"Richmond 0.020753\n",
"Dartmoor 0.020697\n",
"NorahHead 0.020599\n",
"BadgerysCreek 0.020592\n",
"MountGinini 0.020444\n",
"Moree 0.020071\n",
"Walpole 0.019825\n",
"PearceRAAF 0.019424\n",
"Williamtown 0.017954\n",
"Melbourne 0.017125\n",
"Nhil 0.011034\n",
"Katherine 0.010964\n",
"Uluru 0.010697\n",
"Name: Location, dtype: float64\n",
"W 0.068780\n",
"SE 0.065467\n",
"E 0.063794\n",
"N 0.063526\n",
"SSE 0.063245\n",
"S 0.062936\n",
"WSW 0.062598\n",
"SW 0.061867\n",
"SSW 0.060552\n",
"WNW 0.056726\n",
"NW 0.056283\n",
"ENE 0.056205\n",
"ESE 0.051374\n",
"NE 0.049651\n",
"NNW 0.046142\n",
"NNE 0.045241\n",
"Name: WindGustDir, dtype: float64\n",
"N 0.080123\n",
"SE 0.064434\n",
"E 0.063463\n",
"SSE 0.063055\n",
"NW 0.060144\n",
"S 0.059729\n",
"W 0.058090\n",
"SW 0.057928\n",
"NNE 0.055896\n",
"NNW 0.055136\n",
"ENE 0.054398\n",
"ESE 0.053153\n",
"NE 0.052935\n",
"SSW 0.052380\n",
"WNW 0.050593\n",
"WSW 0.048125\n",
"Name: WindDir9am, dtype: float64\n",
"SE 0.074990\n",
"W 0.069701\n",
"S 0.067500\n",
"WSW 0.065608\n",
"SW 0.064574\n",
"SSE 0.064293\n",
"N 0.060952\n",
"WNW 0.060875\n",
"NW 0.059553\n",
"ESE 0.058948\n",
"E 0.058667\n",
"NE 0.057415\n",
"SSW 0.056332\n",
"NNW 0.054384\n",
"ENE 0.054321\n",
"NNE 0.045319\n",
"Name: WindDir3pm, dtype: float64\n",
"No 0.768899\n",
"Yes 0.221213\n",
"Name: RainToday, dtype: float64\n",
"No 0.775819\n",
"Yes 0.224181\n",
"Name: RainTomorrow, dtype: float64\n"
]
}
],
"source": [
"# view frequency distribution of categorical variables\n",
"\n",
"for var in categorical: \n",
" \n",
" print(df[var].value_counts()/np.float(len(df)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Number of labels: cardinality\n",
"\n",
"\n",
"The number of labels within a categorical variable is known as **cardinality**. A high number of labels within a variable is known as **high cardinality**. High cardinality may pose some serious problems in the machine learning model. So, I will check for high cardinality."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date contains 3436 labels\n",
"Location contains 49 labels\n",
"WindGustDir contains 17 labels\n",
"WindDir9am contains 17 labels\n",
"WindDir3pm contains 17 labels\n",
"RainToday contains 3 labels\n",
"RainTomorrow contains 2 labels\n"
]
}
],
"source": [
"# check for cardinality in categorical variables\n",
"\n",
"for var in categorical:\n",
" \n",
" print(var, ' contains ', len(df[var].unique()), ' labels')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there is a `Date` variable which needs to be preprocessed. I will do preprocessing in the following section.\n",
"\n",
"\n",
"All the other variables contain relatively smaller number of variables."
]
},
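{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat about the label counts above: `Series.unique()` includes `NaN` among the values it returns, so a column with missing values reports one label more than it has real categories. The cell below is a small sketch of this behaviour on a made-up Series (the name `rain_today` and its values are hypothetical, not taken from the dataset). It compares `unique()` with `nunique()`, which ignores missing values by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# toy column with two real labels and one missing value\n",
"rain_today = pd.Series(['No', 'Yes', np.nan, 'No'])\n",
"\n",
"# unique() keeps NaN, so it reports three labels\n",
"print(len(rain_today.unique()))\n",
"\n",
"# nunique() drops NaN by default and reports the two real labels\n",
"print(rain_today.nunique())\n",
"\n",
"# dropna=False reproduces the count obtained with unique()\n",
"print(rain_today.nunique(dropna=False))"
]
},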
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Engineering of Date Variable"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('O')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Date'].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the data type of `Date` variable is object. I will parse the date currently coded as object into datetime format."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# parse the dates, currently coded as strings, into datetime format\n",
"\n",
"df['Date'] = pd.to_datetime(df['Date'])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2008\n",
"1 2008\n",
"2 2008\n",
"3 2008\n",
"4 2008\n",
"Name: Year, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# extract year from date\n",
"\n",
"df['Year'] = df['Date'].dt.year\n",
"\n",
"df['Year'].head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 12\n",
"1 12\n",
"2 12\n",
"3 12\n",
"4 12\n",
"Name: Month, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# extract month from date\n",
"\n",
"df['Month'] = df['Date'].dt.month\n",
"\n",
"df['Month'].head()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 4\n",
"4 5\n",
"Name: Day, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# extract day from date\n",
"\n",
"df['Day'] = df['Date'].dt.day\n",
"\n",
"df['Day'].head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 142193 entries, 0 to 142192\n",
"Data columns (total 26 columns):\n",
"Date 142193 non-null datetime64[ns]\n",
"Location 142193 non-null object\n",
"MinTemp 141556 non-null float64\n",
"MaxTemp 141871 non-null float64\n",
"Rainfall 140787 non-null float64\n",
"Evaporation 81350 non-null float64\n",
"Sunshine 74377 non-null float64\n",
"WindGustDir 132863 non-null object\n",
"WindGustSpeed 132923 non-null float64\n",
"WindDir9am 132180 non-null object\n",
"WindDir3pm 138415 non-null object\n",
"WindSpeed9am 140845 non-null float64\n",
"WindSpeed3pm 139563 non-null float64\n",
"Humidity9am 140419 non-null float64\n",
"Humidity3pm 138583 non-null float64\n",
"Pressure9am 128179 non-null float64\n",
"Pressure3pm 128212 non-null float64\n",
"Cloud9am 88536 non-null float64\n",
"Cloud3pm 85099 non-null float64\n",
"Temp9am 141289 non-null float64\n",
"Temp3pm 139467 non-null float64\n",
"RainToday 140787 non-null object\n",
"RainTomorrow 142193 non-null object\n",
"Year 142193 non-null int64\n",
"Month 142193 non-null int64\n",
"Day 142193 non-null int64\n",
"dtypes: datetime64[ns](1), float64(16), int64(3), object(6)\n",
"memory usage: 28.2+ MB\n"
]
}
],
"source": [
"# again view the summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are three additional columns created from `Date` variable. Now, I will drop the original `Date` variable from the dataset."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# drop the original Date variable\n",
"\n",
"df.drop('Date', axis=1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Location</th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindDir9am</th>\n",
" <th>WindDir3pm</th>\n",
" <th>...</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>RainToday</th>\n",
" <th>RainTomorrow</th>\n",
" <th>Year</th>\n",
" <th>Month</th>\n",
" <th>Day</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Albury</td>\n",
" <td>13.4</td>\n",
" <td>22.9</td>\n",
" <td>0.6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>44.0</td>\n",
" <td>W</td>\n",
" <td>WNW</td>\n",
" <td>...</td>\n",
" <td>1007.1</td>\n",
" <td>8.0</td>\n",
" <td>NaN</td>\n",
" <td>16.9</td>\n",
" <td>21.8</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Albury</td>\n",
" <td>7.4</td>\n",
" <td>25.1</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WNW</td>\n",
" <td>44.0</td>\n",
" <td>NNW</td>\n",
" <td>WSW</td>\n",
" <td>...</td>\n",
" <td>1007.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>17.2</td>\n",
" <td>24.3</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Albury</td>\n",
" <td>12.9</td>\n",
" <td>25.7</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WSW</td>\n",
" <td>46.0</td>\n",
" <td>W</td>\n",
" <td>WSW</td>\n",
" <td>...</td>\n",
" <td>1008.7</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>23.2</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Albury</td>\n",
" <td>9.2</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NE</td>\n",
" <td>24.0</td>\n",
" <td>SE</td>\n",
" <td>E</td>\n",
" <td>...</td>\n",
" <td>1012.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.1</td>\n",
" <td>26.5</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Albury</td>\n",
" <td>17.5</td>\n",
" <td>32.3</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>41.0</td>\n",
" <td>ENE</td>\n",
" <td>NW</td>\n",
" <td>...</td>\n",
" <td>1006.0</td>\n",
" <td>7.0</td>\n",
" <td>8.0</td>\n",
" <td>17.8</td>\n",
" <td>29.7</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 25 columns</p>\n",
"</div>"
],
"text/plain": [
" Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir \\\n",
"0 Albury 13.4 22.9 0.6 NaN NaN W \n",
"1 Albury 7.4 25.1 0.0 NaN NaN WNW \n",
"2 Albury 12.9 25.7 0.0 NaN NaN WSW \n",
"3 Albury 9.2 28.0 0.0 NaN NaN NE \n",
"4 Albury 17.5 32.3 1.0 NaN NaN W \n",
"\n",
" WindGustSpeed WindDir9am WindDir3pm ... Pressure3pm Cloud9am Cloud3pm \\\n",
"0 44.0 W WNW ... 1007.1 8.0 NaN \n",
"1 44.0 NNW WSW ... 1007.8 NaN NaN \n",
"2 46.0 W WSW ... 1008.7 NaN 2.0 \n",
"3 24.0 SE E ... 1012.8 NaN NaN \n",
"4 41.0 ENE NW ... 1006.0 7.0 8.0 \n",
"\n",
" Temp9am Temp3pm RainToday RainTomorrow Year Month Day \n",
"0 16.9 21.8 No No 2008 12 1 \n",
"1 17.2 24.3 No No 2008 12 2 \n",
"2 21.0 23.2 No No 2008 12 3 \n",
"3 18.1 26.5 No No 2008 12 4 \n",
"4 17.8 29.7 No No 2008 12 5 \n",
"\n",
"[5 rows x 25 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset again\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that the `Date` variable has been removed from the dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore Categorical Variables\n",
"\n",
"\n",
"Now, I will explore the categorical variables one by one. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 6 categorical variables\n",
"\n",
"The categorical variables are : ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']\n"
]
}
],
"source": [
"# find categorical variables\n",
"\n",
"categorical = [var for var in df.columns if df[var].dtype=='O']\n",
"\n",
"print('There are {} categorical variables\\n'.format(len(categorical)))\n",
"\n",
"print('The categorical variables are :', categorical)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 6 categorical variables in the dataset, and that the `Date` variable is no longer among them. First, I will check for missing values in the categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0\n",
"WindGustDir 9330\n",
"WindDir9am 10013\n",
"WindDir3pm 3778\n",
"RainToday 1406\n",
"RainTomorrow 0\n",
"dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check for missing values in categorical variables \n",
"\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `WindGustDir`, `WindDir9am`, `WindDir3pm` and `RainToday` variables contain missing values, while `Location` and `RainTomorrow` do not. I will explore these variables one by one."
]
},
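{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before going through the variables one by one, the label counts and missing-value counts can be gathered into a single table. The cell below is a minimal sketch, not part of the original analysis: the helper name `summarize_categoricals` is hypothetical, and it assumes `df` and `categorical` are defined as in the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"\n",
"def summarize_categoricals(frame):\n",
"    # hypothetical helper: returns one row per column of `frame`\n",
"    return pd.DataFrame({\n",
"        'n_labels': frame.nunique(),        # distinct labels, NaN excluded\n",
"        'n_missing': frame.isnull().sum(),  # number of missing values\n",
"        'pct_missing': (frame.isnull().mean() * 100).round(2)\n",
"    })\n",
"\n",
"\n",
"# assumes `df` and `categorical` exist from the earlier cells\n",
"summarize_categoricals(df[categorical])"
]
},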
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `Location` variable"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Location contains 49 labels\n"
]
}
],
"source": [
"# print number of labels in Location variable\n",
"\n",
"print('Location contains', len(df.Location.unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Albury', 'BadgerysCreek', 'Cobar', 'CoffsHarbour', 'Moree',\n",
" 'Newcastle', 'NorahHead', 'NorfolkIsland', 'Penrith', 'Richmond',\n",
" 'Sydney', 'SydneyAirport', 'WaggaWagga', 'Williamtown',\n",
" 'Wollongong', 'Canberra', 'Tuggeranong', 'MountGinini', 'Ballarat',\n",
" 'Bendigo', 'Sale', 'MelbourneAirport', 'Melbourne', 'Mildura',\n",
" 'Nhil', 'Portland', 'Watsonia', 'Dartmoor', 'Brisbane', 'Cairns',\n",
" 'GoldCoast', 'Townsville', 'Adelaide', 'MountGambier', 'Nuriootpa',\n",
" 'Woomera', 'Albany', 'Witchcliffe', 'PearceRAAF', 'PerthAirport',\n",
" 'Perth', 'SalmonGums', 'Walpole', 'Hobart', 'Launceston',\n",
" 'AliceSprings', 'Darwin', 'Katherine', 'Uluru'], dtype=object)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in location variable\n",
"\n",
"df.Location.unique()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Canberra 3418\n",
"Sydney 3337\n",
"Perth 3193\n",
"Darwin 3192\n",
"Hobart 3188\n",
"Brisbane 3161\n",
"Adelaide 3090\n",
"Bendigo 3034\n",
"Townsville 3033\n",
"AliceSprings 3031\n",
"MountGambier 3030\n",
"Launceston 3028\n",
"Ballarat 3028\n",
"Albany 3016\n",
"Albury 3011\n",
"PerthAirport 3009\n",
"MelbourneAirport 3009\n",
"Mildura 3007\n",
"SydneyAirport 3005\n",
"Nuriootpa 3002\n",
"Sale 3000\n",
"Watsonia 2999\n",
"Tuggeranong 2998\n",
"Portland 2996\n",
"Woomera 2990\n",
"Cairns 2988\n",
"Cobar 2988\n",
"Wollongong 2983\n",
"GoldCoast 2980\n",
"WaggaWagga 2976\n",
"NorfolkIsland 2964\n",
"Penrith 2964\n",
"Newcastle 2955\n",
"SalmonGums 2955\n",
"CoffsHarbour 2953\n",
"Witchcliffe 2952\n",
"Richmond 2951\n",
"Dartmoor 2943\n",
"NorahHead 2929\n",
"BadgerysCreek 2928\n",
"MountGinini 2907\n",
"Moree 2854\n",
"Walpole 2819\n",
"PearceRAAF 2762\n",
"Williamtown 2553\n",
"Melbourne 2435\n",
"Nhil 1569\n",
"Katherine 1559\n",
"Uluru 1521\n",
"Name: Location, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in Location variable\n",
"\n",
"df.Location.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Albany</th>\n",
" <th>Albury</th>\n",
" <th>AliceSprings</th>\n",
" <th>BadgerysCreek</th>\n",
" <th>Ballarat</th>\n",
" <th>Bendigo</th>\n",
" <th>Brisbane</th>\n",
" <th>Cairns</th>\n",
" <th>Canberra</th>\n",
" <th>Cobar</th>\n",
" <th>...</th>\n",
" <th>Townsville</th>\n",
" <th>Tuggeranong</th>\n",
" <th>Uluru</th>\n",
" <th>WaggaWagga</th>\n",
" <th>Walpole</th>\n",
" <th>Watsonia</th>\n",
" <th>Williamtown</th>\n",
" <th>Witchcliffe</th>\n",
" <th>Wollongong</th>\n",
" <th>Woomera</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 48 columns</p>\n",
"</div>"
],
"text/plain": [
" Albany Albury AliceSprings BadgerysCreek Ballarat Bendigo Brisbane \\\n",
"0 0 1 0 0 0 0 0 \n",
"1 0 1 0 0 0 0 0 \n",
"2 0 1 0 0 0 0 0 \n",
"3 0 1 0 0 0 0 0 \n",
"4 0 1 0 0 0 0 0 \n",
"\n",
" Cairns Canberra Cobar ... Townsville Tuggeranong Uluru \\\n",
"0 0 0 0 ... 0 0 0 \n",
"1 0 0 0 ... 0 0 0 \n",
"2 0 0 0 ... 0 0 0 \n",
"3 0 0 0 ... 0 0 0 \n",
"4 0 0 0 ... 0 0 0 \n",
"\n",
" WaggaWagga Walpole Watsonia Williamtown Witchcliffe Wollongong \\\n",
"0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 \n",
"\n",
" Woomera \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
"[5 rows x 48 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of Location variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.Location, drop_first=True).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `WindGustDir` variable"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindGustDir contains 17 labels\n"
]
}
],
"source": [
"# print number of labels in WindGustDir variable (unique() counts NaN as a label)\n",
"\n",
"print('WindGustDir contains', len(df['WindGustDir'].unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['W', 'WNW', 'WSW', 'NE', 'NNW', 'N', 'NNE', 'SW', 'ENE', 'SSE',\n",
" 'S', 'NW', 'SE', 'ESE', nan, 'E', 'SSW'], dtype=object)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in WindGustDir variable\n",
"\n",
"df['WindGustDir'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"W 9780\n",
"SE 9309\n",
"E 9071\n",
"N 9033\n",
"SSE 8993\n",
"S 8949\n",
"WSW 8901\n",
"SW 8797\n",
"SSW 8610\n",
"WNW 8066\n",
"NW 8003\n",
"ENE 7992\n",
"ESE 7305\n",
"NE 7060\n",
"NNW 6561\n",
"NNE 6433\n",
"Name: WindGustDir, dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in WindGustDir variable\n",
"\n",
"df.WindGustDir.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ENE</th>\n",
" <th>ESE</th>\n",
" <th>N</th>\n",
" <th>NE</th>\n",
" <th>NNE</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" <th>nan</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ENE ESE N NE NNE NNW NW S SE SSE SSW SW W WNW WSW NaN\n",
"0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0\n",
"1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0\n",
"2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n",
"3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0\n",
"4 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of WindGustDir variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# also add an additional dummy variable to indicate there was missing data\n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True).head()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ENE 7992\n",
"ESE 7305\n",
"N 9033\n",
"NE 7060\n",
"NNE 6433\n",
"NNW 6561\n",
"NW 8003\n",
"S 8949\n",
"SE 9309\n",
"SSE 8993\n",
"SSW 8610\n",
"SW 8797\n",
"W 9780\n",
"WNW 8066\n",
"WSW 8901\n",
"NaN 9330\n",
"dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sum the number of 1s per boolean variable over the rows of the dataset\n",
"# it will tell us how many observations we have for each category\n",
"\n",
"pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True).sum(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 9330 missing values in the `WindGustDir` variable."
]
},
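{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count of 17 labels reported above includes `NaN`, because `Series.unique()` treats a missing value as a value of its own. The cell below is a minimal sketch on a made-up toy series (not taken from the dataset). It contrasts `unique()` with `nunique()`, and shows what `drop_first=True` and `dummy_na=True` do together in `pd.get_dummies`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# hypothetical toy series, for illustration only\n",
"toy = pd.Series(['W', 'E', np.nan, 'W', 'N'])\n",
"\n",
"# unique() keeps NaN, so NaN is counted as a label\n",
"print('labels including NaN :', len(toy.unique()))\n",
"\n",
"# nunique() ignores NaN by default\n",
"print('labels excluding NaN :', toy.nunique())\n",
"\n",
"# drop_first=True drops the first category in sorted order ('E' here),\n",
"# dummy_na=True adds an indicator column for missing values\n",
"print(pd.get_dummies(toy, drop_first=True, dummy_na=True))"
]
},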
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `WindDir9am` variable"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindDir9am contains 17 labels\n"
]
}
],
"source": [
"# print number of labels in WindDir9am variable (unique() counts NaN as a label)\n",
"\n",
"print('WindDir9am contains', len(df['WindDir9am'].unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['W', 'NNW', 'SE', 'ENE', 'SW', 'SSE', 'S', 'NE', nan, 'SSW', 'N',\n",
" 'WSW', 'ESE', 'E', 'NW', 'WNW', 'NNE'], dtype=object)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in WindDir9am variable\n",
"\n",
"df['WindDir9am'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"N 11393\n",
"SE 9162\n",
"E 9024\n",
"SSE 8966\n",
"NW 8552\n",
"S 8493\n",
"W 8260\n",
"SW 8237\n",
"NNE 7948\n",
"NNW 7840\n",
"ENE 7735\n",
"ESE 7558\n",
"NE 7527\n",
"SSW 7448\n",
"WNW 7194\n",
"WSW 6843\n",
"Name: WindDir9am, dtype: int64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in WindDir9am variable\n",
"\n",
"df['WindDir9am'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ENE</th>\n",
" <th>ESE</th>\n",
" <th>N</th>\n",
" <th>NE</th>\n",
" <th>NNE</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" <th>nan</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ENE ESE N NE NNE NNW NW S SE SSE SSW SW W WNW WSW NaN\n",
"0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0\n",
"1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0\n",
"2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0\n",
"3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0\n",
"4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of WindDir9am variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# also add an additional dummy variable to indicate there was missing data\n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.WindDir9am, drop_first=True, dummy_na=True).head()"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ENE 7735\n",
"ESE 7558\n",
"N 11393\n",
"NE 7527\n",
"NNE 7948\n",
"NNW 7840\n",
"NW 8552\n",
"S 8493\n",
"SE 9162\n",
"SSE 8966\n",
"SSW 7448\n",
"SW 8237\n",
"W 8260\n",
"WNW 7194\n",
"WSW 6843\n",
"NaN 10013\n",
"dtype: int64"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sum the number of 1s per boolean variable over the rows of the dataset\n",
"# it will tell us how many observations we have for each category\n",
"\n",
"pd.get_dummies(df.WindDir9am, drop_first=True, dummy_na=True).sum(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 10013 missing values in the `WindDir9am` variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `WindDir3pm` variable"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindDir3pm contains 17 labels\n"
]
}
],
"source": [
"# print number of labels in WindDir3pm variable (unique() counts NaN as a label)\n",
"\n",
"print('WindDir3pm contains', len(df['WindDir3pm'].unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['WNW', 'WSW', 'E', 'NW', 'W', 'SSE', 'ESE', 'ENE', 'NNW', 'SSW',\n",
" 'SW', 'SE', 'N', 'S', 'NNE', nan, 'NE'], dtype=object)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in WindDir3pm variable\n",
"\n",
"df['WindDir3pm'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SE 10663\n",
"W 9911\n",
"S 9598\n",
"WSW 9329\n",
"SW 9182\n",
"SSE 9142\n",
"N 8667\n",
"WNW 8656\n",
"NW 8468\n",
"ESE 8382\n",
"E 8342\n",
"NE 8164\n",
"SSW 8010\n",
"NNW 7733\n",
"ENE 7724\n",
"NNE 6444\n",
"Name: WindDir3pm, dtype: int64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in WindDir3pm variable\n",
"\n",
"df['WindDir3pm'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ENE</th>\n",
" <th>ESE</th>\n",
" <th>N</th>\n",
" <th>NE</th>\n",
" <th>NNE</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" <th>nan</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ENE ESE N NE NNE NNW NW S SE SSE SSW SW W WNW WSW NaN\n",
"0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0\n",
"1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n",
"2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n",
"3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
"4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of WindDir3pm variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# also add an additional dummy variable to indicate there was missing data\n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.WindDir3pm, drop_first=True, dummy_na=True).head()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ENE 7724\n",
"ESE 8382\n",
"N 8667\n",
"NE 8164\n",
"NNE 6444\n",
"NNW 7733\n",
"NW 8468\n",
"S 9598\n",
"SE 10663\n",
"SSE 9142\n",
"SSW 8010\n",
"SW 9182\n",
"W 9911\n",
"WNW 8656\n",
"WSW 9329\n",
"NaN 3778\n",
"dtype: int64"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sum the number of 1s per boolean variable over the rows of the dataset\n",
"# it will tell us how many observations we have for each category\n",
"\n",
"pd.get_dummies(df.WindDir3pm, drop_first=True, dummy_na=True).sum(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 3778 missing values in the `WindDir3pm` variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `RainToday` variable"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RainToday contains 3 labels\n"
]
}
],
"source": [
"# print number of labels in RainToday variable (unique() counts NaN as a label)\n",
"\n",
"print('RainToday contains', len(df['RainToday'].unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['No', 'Yes', nan], dtype=object)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in RainToday variable\n",
"\n",
"df['RainToday'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"No 109332\n",
"Yes 31455\n",
"Name: RainToday, dtype: int64"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in RainToday variable\n",
"\n",
"df.RainToday.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Yes</th>\n",
" <th>nan</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Yes NaN\n",
"0 0 0\n",
"1 0 0\n",
"2 0 0\n",
"3 0 0\n",
"4 0 0"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of RainToday variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# also add an additional dummy variable to indicate there was missing data\n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.RainToday, drop_first=True, dummy_na=True).head()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Yes 31455\n",
"NaN 1406\n",
"dtype: int64"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sum the number of 1s per boolean variable over the rows of the dataset\n",
"# it will tell us how many observations we have for each category\n",
"\n",
"pd.get_dummies(df.RainToday, drop_first=True, dummy_na=True).sum(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 1406 missing values in the `RainToday` variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore Numerical Variables"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 19 numerical variables\n",
"\n",
"The numerical variables are : ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'Year', 'Month', 'Day']\n"
]
}
],
"source": [
"# find numerical variables\n",
"\n",
"numerical = [var for var in df.columns if df[var].dtype!='O']\n",
"\n",
"print('There are {} numerical variables\\n'.format(len(numerical)))\n",
"\n",
"print('The numerical variables are :', numerical)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindSpeed9am</th>\n",
" <th>WindSpeed3pm</th>\n",
" <th>Humidity9am</th>\n",
" <th>Humidity3pm</th>\n",
" <th>Pressure9am</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>Year</th>\n",
" <th>Month</th>\n",
" <th>Day</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>13.4</td>\n",
" <td>22.9</td>\n",
" <td>0.6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>44.0</td>\n",
" <td>20.0</td>\n",
" <td>24.0</td>\n",
" <td>71.0</td>\n",
" <td>22.0</td>\n",
" <td>1007.7</td>\n",
" <td>1007.1</td>\n",
" <td>8.0</td>\n",
" <td>NaN</td>\n",
" <td>16.9</td>\n",
" <td>21.8</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>7.4</td>\n",
" <td>25.1</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>44.0</td>\n",
" <td>4.0</td>\n",
" <td>22.0</td>\n",
" <td>44.0</td>\n",
" <td>25.0</td>\n",
" <td>1010.6</td>\n",
" <td>1007.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>17.2</td>\n",
" <td>24.3</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12.9</td>\n",
" <td>25.7</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>46.0</td>\n",
" <td>19.0</td>\n",
" <td>26.0</td>\n",
" <td>38.0</td>\n",
" <td>30.0</td>\n",
" <td>1007.6</td>\n",
" <td>1008.7</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>23.2</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>9.2</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>24.0</td>\n",
" <td>11.0</td>\n",
" <td>9.0</td>\n",
" <td>45.0</td>\n",
" <td>16.0</td>\n",
" <td>1017.6</td>\n",
" <td>1012.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.1</td>\n",
" <td>26.5</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>17.5</td>\n",
" <td>32.3</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>41.0</td>\n",
" <td>7.0</td>\n",
" <td>20.0</td>\n",
" <td>82.0</td>\n",
" <td>33.0</td>\n",
" <td>1010.8</td>\n",
" <td>1006.0</td>\n",
" <td>7.0</td>\n",
" <td>8.0</td>\n",
" <td>17.8</td>\n",
" <td>29.7</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed \\\n",
"0 13.4 22.9 0.6 NaN NaN 44.0 \n",
"1 7.4 25.1 0.0 NaN NaN 44.0 \n",
"2 12.9 25.7 0.0 NaN NaN 46.0 \n",
"3 9.2 28.0 0.0 NaN NaN 24.0 \n",
"4 17.5 32.3 1.0 NaN NaN 41.0 \n",
"\n",
" WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am \\\n",
"0 20.0 24.0 71.0 22.0 1007.7 \n",
"1 4.0 22.0 44.0 25.0 1010.6 \n",
"2 19.0 26.0 38.0 30.0 1007.6 \n",
"3 11.0 9.0 45.0 16.0 1017.6 \n",
"4 7.0 20.0 82.0 33.0 1010.8 \n",
"\n",
" Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm Year Month Day \n",
"0 1007.1 8.0 NaN 16.9 21.8 2008 12 1 \n",
"1 1007.8 NaN NaN 17.2 24.3 2008 12 2 \n",
"2 1008.7 NaN 2.0 21.0 23.2 2008 12 3 \n",
"3 1012.8 NaN NaN 18.1 26.5 2008 12 4 \n",
"4 1006.0 7.0 8.0 17.8 29.7 2008 12 5 "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the numerical variables\n",
"\n",
"df[numerical].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of numerical variables\n",
"\n",
"\n",
"- There are 19 numerical variables: 16 weather measurements plus the date parts `Year`, `Month` and `Day`.\n",
"\n",
"\n",
"- The weather measurements are `MinTemp`, `MaxTemp`, `Rainfall`, `Evaporation`, `Sunshine`, `WindGustSpeed`, `WindSpeed9am`, `WindSpeed3pm`, `Humidity9am`, `Humidity3pm`, `Pressure9am`, `Pressure3pm`, `Cloud9am`, `Cloud3pm`, `Temp9am` and `Temp3pm`.\n",
"\n",
"\n",
"- The 16 weather measurements are stored as `float64`, while `Year`, `Month` and `Day` are discrete integers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore problems within numerical variables\n",
"\n",
"\n",
"Now, I will explore the numerical variables.\n",
"\n",
"\n",
"### Missing values in numerical variables"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MinTemp 637\n",
"MaxTemp 322\n",
"Rainfall 1406\n",
"Evaporation 60843\n",
"Sunshine 67816\n",
"WindGustSpeed 9270\n",
"WindSpeed9am 1348\n",
"WindSpeed3pm 2630\n",
"Humidity9am 1774\n",
"Humidity3pm 3610\n",
"Pressure9am 14014\n",
"Pressure3pm 13981\n",
"Cloud9am 53657\n",
"Cloud3pm 57094\n",
"Temp9am 904\n",
"Temp3pm 2726\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables\n",
"\n",
"df[numerical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that 16 of the 19 numerical variables contain missing values. Only `Year`, `Month` and `Day` are complete."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outliers in numerical variables"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed \\\n",
"count 141556.0 141871.0 140787.0 81350.0 74377.0 132923.0 \n",
"mean 12.0 23.0 2.0 5.0 8.0 40.0 \n",
"std 6.0 7.0 8.0 4.0 4.0 14.0 \n",
"min -8.0 -5.0 0.0 0.0 0.0 6.0 \n",
"25% 8.0 18.0 0.0 3.0 5.0 31.0 \n",
"50% 12.0 23.0 0.0 5.0 8.0 39.0 \n",
"75% 17.0 28.0 1.0 7.0 11.0 48.0 \n",
"max 34.0 48.0 371.0 145.0 14.0 135.0 \n",
"\n",
" WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am \\\n",
"count 140845.0 139563.0 140419.0 138583.0 128179.0 \n",
"mean 14.0 19.0 69.0 51.0 1018.0 \n",
"std 9.0 9.0 19.0 21.0 7.0 \n",
"min 0.0 0.0 0.0 0.0 980.0 \n",
"25% 7.0 13.0 57.0 37.0 1013.0 \n",
"50% 13.0 19.0 70.0 52.0 1018.0 \n",
"75% 19.0 24.0 83.0 66.0 1022.0 \n",
"max 130.0 87.0 100.0 100.0 1041.0 \n",
"\n",
" Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm Year \\\n",
"count 128212.0 88536.0 85099.0 141289.0 139467.0 142193.0 \n",
"mean 1015.0 4.0 5.0 17.0 22.0 2013.0 \n",
"std 7.0 3.0 3.0 6.0 7.0 3.0 \n",
"min 977.0 0.0 0.0 -7.0 -5.0 2007.0 \n",
"25% 1010.0 1.0 2.0 12.0 17.0 2011.0 \n",
"50% 1015.0 5.0 5.0 17.0 21.0 2013.0 \n",
"75% 1020.0 7.0 7.0 22.0 26.0 2015.0 \n",
"max 1040.0 9.0 9.0 40.0 47.0 2017.0 \n",
"\n",
" Month Day \n",
"count 142193.0 142193.0 \n",
"mean 6.0 16.0 \n",
"std 3.0 9.0 \n",
"min 1.0 1.0 \n",
"25% 3.0 8.0 \n",
"50% 6.0 16.0 \n",
"75% 9.0 23.0 \n",
"max 12.0 31.0 \n"
]
}
],
"source": [
"# view summary statistics in numerical variables, rounded to whole numbers\n",
"\n",
"print(round(df[numerical].describe()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On closer inspection, we can see that the `Rainfall`, `Evaporation`, `WindSpeed9am` and `WindSpeed3pm` columns may contain outliers.\n",
"\n",
"\n",
"I will draw boxplots to visualise outliers in the above variables. "
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0,0.5,'WindSpeed3pm')"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 4 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# draw boxplots to visualize outliers\n",
"\n",
"plt.figure(figsize=(15,10))\n",
"\n",
"\n",
"plt.subplot(2, 2, 1)\n",
"fig = df.boxplot(column='Rainfall')\n",
"fig.set_title('')\n",
"fig.set_ylabel('Rainfall')\n",
"\n",
"\n",
"plt.subplot(2, 2, 2)\n",
"fig = df.boxplot(column='Evaporation')\n",
"fig.set_title('')\n",
"fig.set_ylabel('Evaporation')\n",
"\n",
"\n",
"plt.subplot(2, 2, 3)\n",
"fig = df.boxplot(column='WindSpeed9am')\n",
"fig.set_title('')\n",
"fig.set_ylabel('WindSpeed9am')\n",
"\n",
"\n",
"plt.subplot(2, 2, 4)\n",
"fig = df.boxplot(column='WindSpeed3pm')\n",
"fig.set_title('')\n",
"fig.set_ylabel('WindSpeed3pm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above boxplots confirm that there are a lot of outliers in these variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the distribution of variables\n",
"\n",
"\n",
"Now, I will plot histograms to check whether the distributions are normal or skewed. If a variable follows a normal distribution, I will do `Extreme Value Analysis`; if it is skewed, I will use the IQR (interquartile range) to find outliers."
]
},
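{
"cell_type": "markdown",
"metadata": {},
"source": [
"The decision rule described above can be sketched in code. This is only an illustration on synthetic data: the helper `outlier_bounds`, the skewness cut-off of 1.0 and the use of mean plus or minus 3 standard deviations for roughly normal columns are my assumptions, not part of the analysis in this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# synthetic columns (not the weather data): one roughly normal, one right-skewed\n",
"toy = pd.DataFrame({\n",
"    'normal_like': rng.normal(loc=20, scale=5, size=5000),\n",
"    'skewed': rng.exponential(scale=2, size=5000),\n",
"})\n",
"\n",
"def outlier_bounds(series, skew_threshold=1.0):\n",
"    # skew_threshold is an arbitrary cut-off chosen for this sketch\n",
"    if skew_threshold > abs(series.skew()):\n",
"        # roughly symmetric: mean plus or minus 3 standard deviations\n",
"        spread = 3 * series.std()\n",
"        return series.mean() - spread, series.mean() + spread\n",
"    # skewed: fences at 3 * IQR beyond the quartiles, as in this notebook\n",
"    q1, q3 = series.quantile(0.25), series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    return q1 - 3 * iqr, q3 + 3 * iqr\n",
"\n",
"bounds = {col: outlier_bounds(toy[col]) for col in toy.columns}\n",
"for col, (lower, upper) in bounds.items():\n",
"    print(col, round(toy[col].skew(), 2), round(lower, 2), round(upper, 2))\n",
"```\n",
"\n",
"A histogram remains the quickest visual check. The skewness statistic is just a way to make the same judgement repeatable across many columns."
]
},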
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0,0.5,'Frequency')"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 4 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot histograms to check the distributions\n",
"# (the y-axis of a histogram is the number of observations per bin)\n",
"\n",
"plt.figure(figsize=(15,10))\n",
"\n",
"\n",
"plt.subplot(2, 2, 1)\n",
"fig = df.Rainfall.hist(bins=10)\n",
"fig.set_xlabel('Rainfall')\n",
"fig.set_ylabel('Frequency')\n",
"\n",
"\n",
"plt.subplot(2, 2, 2)\n",
"fig = df.Evaporation.hist(bins=10)\n",
"fig.set_xlabel('Evaporation')\n",
"fig.set_ylabel('Frequency')\n",
"\n",
"\n",
"plt.subplot(2, 2, 3)\n",
"fig = df.WindSpeed9am.hist(bins=10)\n",
"fig.set_xlabel('WindSpeed9am')\n",
"fig.set_ylabel('Frequency')\n",
"\n",
"\n",
"plt.subplot(2, 2, 4)\n",
"fig = df.WindSpeed3pm.hist(bins=10)\n",
"fig.set_xlabel('WindSpeed3pm')\n",
"fig.set_ylabel('Frequency')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that all four variables are skewed. So, I will use the interquartile range (IQR) to find outliers."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rainfall outliers are values < -2.4000000000000004 or > 3.2\n"
]
}
],
"source": [
"# find outliers for Rainfall variable\n",
"\n",
"IQR = df.Rainfall.quantile(0.75) - df.Rainfall.quantile(0.25)\n",
"Lower_fence = df.Rainfall.quantile(0.25) - (IQR * 3)\n",
"Upper_fence = df.Rainfall.quantile(0.75) + (IQR * 3)\n",
"print('Rainfall outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For `Rainfall`, the minimum and maximum values are 0.0 and 371.0. So, the outliers are values > 3.2."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaporation outliers are values < -11.800000000000002 or > 21.800000000000004\n"
]
}
],
"source": [
"# find outliers for Evaporation variable\n",
"\n",
"IQR = df.Evaporation.quantile(0.75) - df.Evaporation.quantile(0.25)\n",
"Lower_fence = df.Evaporation.quantile(0.25) - (IQR * 3)\n",
"Upper_fence = df.Evaporation.quantile(0.75) + (IQR * 3)\n",
"print('Evaporation outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For `Evaporation`, the minimum and maximum values are 0.0 and 145.0. So, the outliers are values > 21.8."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindSpeed9am outliers are values < -29.0 or > 55.0\n"
]
}
],
"source": [
"# find outliers for WindSpeed9am variable\n",
"\n",
"IQR = df.WindSpeed9am.quantile(0.75) - df.WindSpeed9am.quantile(0.25)\n",
"Lower_fence = df.WindSpeed9am.quantile(0.25) - (IQR * 3)\n",
"Upper_fence = df.WindSpeed9am.quantile(0.75) + (IQR * 3)\n",
"print('WindSpeed9am outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For `WindSpeed9am`, the minimum and maximum values are 0.0 and 130.0. So, the outliers are values > 55.0."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindSpeed3pm outliers are values < -20.0 or > 57.0\n"
]
}
],
"source": [
"# find outliers for WindSpeed3pm variable\n",
"\n",
"IQR = df.WindSpeed3pm.quantile(0.75) - df.WindSpeed3pm.quantile(0.25)\n",
"Lower_fence = df.WindSpeed3pm.quantile(0.25) - (IQR * 3)\n",
"Upper_fence = df.WindSpeed3pm.quantile(0.75) + (IQR * 3)\n",
"print('WindSpeed3pm outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For `WindSpeed3pm`, the minimum and maximum values are 0.0 and 87.0. So, the outliers are values > 57.0."
]
},
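{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four cells above repeat the same calculation for each variable. A small helper can do it for any column. The sketch below uses a hypothetical function `iqr_fences` and invented values. Its default `k=3` matches the multiplier used above, which flags only extreme outliers (boxplots conventionally use 1.5).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def iqr_fences(series, k=3):\n",
"    # return (lower, upper) fences at k * IQR beyond the quartiles\n",
"    q1 = series.quantile(0.25)\n",
"    q3 = series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    return q1 - k * iqr, q3 + k * iqr\n",
"\n",
"# toy values (invented) standing in for a column such as Rainfall\n",
"values = pd.Series([0.0, 0.0, 0.0, 0.2, 0.4, 0.8, 1.0, 2.0, 40.0])\n",
"\n",
"lower, upper = iqr_fences(values)\n",
"\n",
"# anything outside the closed interval [lower, upper] is treated as an outlier\n",
"outliers = values[~values.between(lower, upper)]\n",
"print(lower, upper, outliers.tolist())\n",
"```\n",
"\n",
"For these toy values the quartiles are 0.0 and 1.0, so the fences are -3.0 and 4.0 and only the value 40.0 is flagged. Because the measurements here cannot be negative, only the upper fence matters in practice, as in the four variables above."
]
},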
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Declare feature vector and target variable"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"X = df.drop(['RainTomorrow'], axis=1)\n",
"\n",
"y = df['RainTomorrow']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Split data into separate training and test set"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"# split X and y into training and testing sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)\n"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((113754, 24), (28439, 24))"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the shape of X_train and X_test\n",
"\n",
"X_train.shape, X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. Feature Engineering\n",
"\n",
"\n",
"**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.\n",
"\n",
"\n",
"First, I will display the categorical and numerical variables again separately."
]
},
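{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to testing the dtype of each column in a list comprehension, pandas provides `select_dtypes`. The following is a minimal sketch on an invented frame (the name `toy` and its values are assumptions, not rows from the dataset).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy frame with two text columns and two numeric columns (invented values)\n",
"toy = pd.DataFrame({\n",
"    'Location': ['Albury', 'Sydney'],\n",
"    'MinTemp': [13.4, 7.4],\n",
"    'RainToday': ['No', 'Yes'],\n",
"    'Year': [2008, 2008],\n",
"})\n",
"\n",
"# 'number' covers both integer and float columns\n",
"numerical_cols = toy.select_dtypes(include='number').columns.tolist()\n",
"categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()\n",
"\n",
"print(numerical_cols)\n",
"print(categorical_cols)\n",
"```\n",
"\n",
"Selecting on `number` keeps the split independent of how text columns happen to be stored."
]
},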
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location object\n",
"MinTemp float64\n",
"MaxTemp float64\n",
"Rainfall float64\n",
"Evaporation float64\n",
"Sunshine float64\n",
"WindGustDir object\n",
"WindGustSpeed float64\n",
"WindDir9am object\n",
"WindDir3pm object\n",
"WindSpeed9am float64\n",
"WindSpeed3pm float64\n",
"Humidity9am float64\n",
"Humidity3pm float64\n",
"Pressure9am float64\n",
"Pressure3pm float64\n",
"Cloud9am float64\n",
"Cloud3pm float64\n",
"Temp9am float64\n",
"Temp3pm float64\n",
"RainToday object\n",
"Year int64\n",
"Month int64\n",
"Day int64\n",
"dtype: object"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check data types in X_train\n",
"\n",
"X_train.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display categorical variables\n",
"\n",
"categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']\n",
"\n",
"categorical"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['MinTemp',\n",
" 'MaxTemp',\n",
" 'Rainfall',\n",
" 'Evaporation',\n",
" 'Sunshine',\n",
" 'WindGustSpeed',\n",
" 'WindSpeed9am',\n",
" 'WindSpeed3pm',\n",
" 'Humidity9am',\n",
" 'Humidity3pm',\n",
" 'Pressure9am',\n",
" 'Pressure3pm',\n",
" 'Cloud9am',\n",
" 'Cloud3pm',\n",
" 'Temp9am',\n",
" 'Temp3pm',\n",
" 'Year',\n",
" 'Month',\n",
" 'Day']"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# display numerical variables\n",
"\n",
"numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']\n",
"\n",
"numerical"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Engineering missing values in numerical variables\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MinTemp 495\n",
"MaxTemp 264\n",
"Rainfall 1139\n",
"Evaporation 48718\n",
"Sunshine 54314\n",
"WindGustSpeed 7367\n",
"WindSpeed9am 1086\n",
"WindSpeed3pm 2094\n",
"Humidity9am 1449\n",
"Humidity3pm 2890\n",
"Pressure9am 11212\n",
"Pressure3pm 11186\n",
"Cloud9am 43137\n",
"Cloud3pm 45768\n",
"Temp9am 740\n",
"Temp3pm 2171\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables in X_train\n",
"\n",
"X_train[numerical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MinTemp 142\n",
"MaxTemp 58\n",
"Rainfall 267\n",
"Evaporation 12125\n",
"Sunshine 13502\n",
"WindGustSpeed 1903\n",
"WindSpeed9am 262\n",
"WindSpeed3pm 536\n",
"Humidity9am 325\n",
"Humidity3pm 720\n",
"Pressure9am 2802\n",
"Pressure3pm 2795\n",
"Cloud9am 10520\n",
"Cloud3pm 11326\n",
"Temp9am 164\n",
"Temp3pm 555\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables in X_test\n",
"\n",
"X_test[numerical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MinTemp 0.0044\n",
"MaxTemp 0.0023\n",
"Rainfall 0.01\n",
"Evaporation 0.4283\n",
"Sunshine 0.4775\n",
"WindGustSpeed 0.0648\n",
"WindSpeed9am 0.0095\n",
"WindSpeed3pm 0.0184\n",
"Humidity9am 0.0127\n",
"Humidity3pm 0.0254\n",
"Pressure9am 0.0986\n",
"Pressure3pm 0.0983\n",
"Cloud9am 0.3792\n",
"Cloud3pm 0.4023\n",
"Temp9am 0.0065\n",
"Temp3pm 0.0191\n"
]
}
],
"source": [
"# print the fraction of missing values in the numerical variables in the training set\n",
"\n",
"for col in numerical:\n",
"    if X_train[col].isnull().mean()>0:\n",
"        print(col, round(X_train[col].isnull().mean(),4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assumption\n",
"\n",
"\n",
"I assume that the data are missing completely at random (MCAR). Two common ways to impute missing values are mean or median imputation and random sample imputation. When there are outliers in the dataset, median imputation is preferable because the median is robust to outliers. So, I will use median imputation.\n",
"\n",
"\n",
"I will fill the missing values in each column with that column's median. The medians should be computed on the training set only and then used to fill missing values in both the training and test sets. This keeps information from the test set out of the fitting process (data leakage), which would otherwise make the test results look better than they really are."
]
},
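{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea of learning the medians on the training set and reusing them on the test set can be sketched on a toy example. The frames `train` and `test` and their values are invented for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# toy train/test frames (invented values) with missing entries\n",
"train = pd.DataFrame({'Rainfall': [0.0, 0.2, np.nan, 1.0, 50.0],\n",
"                      'Humidity3pm': [40.0, np.nan, 55.0, 60.0, 65.0]})\n",
"test = pd.DataFrame({'Rainfall': [np.nan, 0.4],\n",
"                     'Humidity3pm': [70.0, np.nan]})\n",
"\n",
"# learn the medians from the training set only (NaN values are skipped)\n",
"medians = train.median()\n",
"\n",
"# reuse the same training medians to fill both sets\n",
"train_filled = train.fillna(medians)\n",
"test_filled = test.fillna(medians)\n",
"\n",
"print(medians)\n",
"print(test_filled)\n",
"```\n",
"\n",
"The single large value 50.0 barely moves the `Rainfall` median (0.6), whereas it would pull the mean up to 12.8. This is the robustness to outliers mentioned above."
]
},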
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"# impute missing values in X_train and X_test with respective column median in X_train\n",
"# assign the result back instead of calling fillna(inplace=True) on a selected column,\n",
"# because newer pandas versions may not propagate that change to the DataFrame\n",
"\n",
"for df1 in [X_train, X_test]:\n",
"    for col in numerical:\n",
"        col_median = X_train[col].median()\n",
"        df1[col] = df1[col].fillna(col_median)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MinTemp 0\n",
"MaxTemp 0\n",
"Rainfall 0\n",
"Evaporation 0\n",
"Sunshine 0\n",
"WindGustSpeed 0\n",
"WindSpeed9am 0\n",
"WindSpeed3pm 0\n",
"Humidity9am 0\n",
"Humidity3pm 0\n",
"Pressure9am 0\n",
"Pressure3pm 0\n",
"Cloud9am 0\n",
"Cloud3pm 0\n",
"Temp9am 0\n",
"Temp3pm 0\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check again missing values in numerical variables in X_train\n",
"\n",
"X_train[numerical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MinTemp 0\n",
"MaxTemp 0\n",
"Rainfall 0\n",
"Evaporation 0\n",
"Sunshine 0\n",
"WindGustSpeed 0\n",
"WindSpeed9am 0\n",
"WindSpeed3pm 0\n",
"Humidity9am 0\n",
"Humidity3pm 0\n",
"Pressure9am 0\n",
"Pressure3pm 0\n",
"Cloud9am 0\n",
"Cloud3pm 0\n",
"Temp9am 0\n",
"Temp3pm 0\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in numerical variables in X_test\n",
"\n",
"X_test[numerical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that there are no missing values in the numerical columns of training and test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Engineering missing values in categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0.000000\n",
"WindGustDir 0.065114\n",
"WindDir9am 0.070134\n",
"WindDir3pm 0.026443\n",
"RainToday 0.010013\n",
"dtype: float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the fraction of missing values in the categorical variables in the training set\n",
"\n",
"X_train[categorical].isnull().mean()"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindGustDir 0.06511419378659213\n",
"WindDir9am 0.07013379749283542\n",
"WindDir3pm 0.026443026179299188\n",
"RainToday 0.01001283471350458\n"
]
}
],
"source": [
"# print categorical variables with missing data\n",
"\n",
"for col in categorical:\n",
"    if X_train[col].isnull().mean()>0:\n",
"        print(col, (X_train[col].isnull().mean()))"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# impute missing categorical variables with the most frequent value (mode) from X_train\n",
"# assign the result back rather than using fillna(inplace=True) on a selected column\n",
"\n",
"for df2 in [X_train, X_test]:\n",
"    for col in ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']:\n",
"        df2[col] = df2[col].fillna(X_train[col].mode()[0])"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0\n",
"WindGustDir 0\n",
"WindDir9am 0\n",
"WindDir3pm 0\n",
"RainToday 0\n",
"dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables in X_train\n",
"\n",
"X_train[categorical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0\n",
"WindGustDir 0\n",
"WindDir9am 0\n",
"WindDir3pm 0\n",
"RainToday 0\n",
"dtype: int64"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables in X_test\n",
"\n",
"X_test[categorical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final check, I will look for any remaining missing values in `X_train` and `X_test`."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0\n",
"MinTemp 0\n",
"MaxTemp 0\n",
"Rainfall 0\n",
"Evaporation 0\n",
"Sunshine 0\n",
"WindGustDir 0\n",
"WindGustSpeed 0\n",
"WindDir9am 0\n",
"WindDir3pm 0\n",
"WindSpeed9am 0\n",
"WindSpeed3pm 0\n",
"Humidity9am 0\n",
"Humidity3pm 0\n",
"Pressure9am 0\n",
"Pressure3pm 0\n",
"Cloud9am 0\n",
"Cloud3pm 0\n",
"Temp9am 0\n",
"Temp3pm 0\n",
"RainToday 0\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in X_train\n",
"\n",
"X_train.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0\n",
"MinTemp 0\n",
"MaxTemp 0\n",
"Rainfall 0\n",
"Evaporation 0\n",
"Sunshine 0\n",
"WindGustDir 0\n",
"WindGustSpeed 0\n",
"WindDir9am 0\n",
"WindDir3pm 0\n",
"WindSpeed9am 0\n",
"WindSpeed3pm 0\n",
"Humidity9am 0\n",
"Humidity3pm 0\n",
"Pressure9am 0\n",
"Pressure3pm 0\n",
"Cloud9am 0\n",
"Cloud3pm 0\n",
"Temp9am 0\n",
"Temp3pm 0\n",
"RainToday 0\n",
"Year 0\n",
"Month 0\n",
"Day 0\n",
"dtype: int64"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in X_test\n",
"\n",
"X_test.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are no missing values in X_train and X_test."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Engineering outliers in numerical variables\n",
"\n",
"\n",
"We have seen that the `Rainfall`, `Evaporation`, `WindSpeed9am` and `WindSpeed3pm` columns contain outliers. I will use a top-coding approach: values above the upper fence found earlier are capped at that fence, so the extreme values are limited while every row is kept."
]
},
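{
"cell_type": "markdown",
"metadata": {},
"source": [
"Top-coding can also be written with the built-in `Series.clip` method. The following is a minimal sketch on invented values (the frames `train` and `test` are assumptions for illustration). It uses the `Rainfall` cap of 3.2 found earlier.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy training and test columns (invented values)\n",
"train = pd.DataFrame({'Rainfall': [0.0, 0.6, 2.4, 3.2, 18.0]})\n",
"test = pd.DataFrame({'Rainfall': [0.0, 1.0, 120.0]})\n",
"\n",
"# upper fence found for Rainfall earlier in this notebook\n",
"cap = 3.2\n",
"\n",
"for frame in (train, test):\n",
"    # clip(upper=cap) replaces values above the cap with the cap itself\n",
"    frame['Rainfall'] = frame['Rainfall'].clip(upper=cap)\n",
"\n",
"print(train['Rainfall'].tolist())\n",
"print(test['Rainfall'].tolist())\n",
"```\n",
"\n",
"Values at or below the cap are left unchanged and the number of rows stays the same, which is the difference between capping outliers and dropping them."
]
},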
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"# top-coding: replace every value above `top` with `top` itself\n",
"def max_value(df3, variable, top):\n",
"    return np.where(df3[variable]>top, top, df3[variable])\n",
"\n",
"# the caps are the upper fences (3 * IQR rule) found during exploration\n",
"for df3 in [X_train, X_test]:\n",
"    df3['Rainfall'] = max_value(df3, 'Rainfall', 3.2)\n",
"    df3['Evaporation'] = max_value(df3, 'Evaporation', 21.8)\n",
"    df3['WindSpeed9am'] = max_value(df3, 'WindSpeed9am', 55)\n",
"    df3['WindSpeed3pm'] = max_value(df3, 'WindSpeed3pm', 57)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3.2, 3.2)"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.Rainfall.max(), X_test.Rainfall.max()"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(21.8, 21.8)"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.Evaporation.max(), X_test.Evaporation.max()"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(55.0, 55.0)"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.WindSpeed9am.max(), X_test.WindSpeed9am.max()"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(57.0, 57.0)"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.WindSpeed3pm.max(), X_test.WindSpeed3pm.max()"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindSpeed9am</th>\n",
" <th>WindSpeed3pm</th>\n",
" <th>Humidity9am</th>\n",
" <th>Humidity3pm</th>\n",
" <th>Pressure9am</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>Year</th>\n",
" <th>Month</th>\n",
" <th>Day</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>12.193497</td>\n",
" <td>23.237216</td>\n",
" <td>0.675080</td>\n",
" <td>5.151606</td>\n",
" <td>8.041154</td>\n",
" <td>39.884074</td>\n",
" <td>13.978155</td>\n",
" <td>18.614756</td>\n",
" <td>68.867486</td>\n",
" <td>51.509547</td>\n",
" <td>1017.640649</td>\n",
" <td>1015.241101</td>\n",
" <td>4.651801</td>\n",
" <td>4.703588</td>\n",
" <td>16.995062</td>\n",
" <td>21.688643</td>\n",
" <td>2012.759727</td>\n",
" <td>6.404021</td>\n",
" <td>15.710419</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>6.388279</td>\n",
" <td>7.094149</td>\n",
" <td>1.183837</td>\n",
" <td>2.823707</td>\n",
" <td>2.769480</td>\n",
" <td>13.116959</td>\n",
" <td>8.806558</td>\n",
" <td>8.685862</td>\n",
" <td>18.935587</td>\n",
" <td>20.530723</td>\n",
" <td>6.738680</td>\n",
" <td>6.675168</td>\n",
" <td>2.292726</td>\n",
" <td>2.117847</td>\n",
" <td>6.463772</td>\n",
" <td>6.855649</td>\n",
" <td>2.540419</td>\n",
" <td>3.427798</td>\n",
" <td>8.796821</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-8.200000</td>\n",
" <td>-4.800000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>6.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>980.500000</td>\n",
" <td>977.100000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-7.200000</td>\n",
" <td>-5.400000</td>\n",
" <td>2007.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>7.600000</td>\n",
" <td>18.000000</td>\n",
" <td>0.000000</td>\n",
" <td>4.000000</td>\n",
" <td>8.200000</td>\n",
" <td>31.000000</td>\n",
" <td>7.000000</td>\n",
" <td>13.000000</td>\n",
" <td>57.000000</td>\n",
" <td>37.000000</td>\n",
" <td>1013.500000</td>\n",
" <td>1011.000000</td>\n",
" <td>3.000000</td>\n",
" <td>4.000000</td>\n",
" <td>12.300000</td>\n",
" <td>16.700000</td>\n",
" <td>2011.000000</td>\n",
" <td>3.000000</td>\n",
" <td>8.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>12.000000</td>\n",
" <td>22.600000</td>\n",
" <td>0.000000</td>\n",
" <td>4.800000</td>\n",
" <td>8.500000</td>\n",
" <td>39.000000</td>\n",
" <td>13.000000</td>\n",
" <td>19.000000</td>\n",
" <td>70.000000</td>\n",
" <td>52.000000</td>\n",
" <td>1017.600000</td>\n",
" <td>1015.200000</td>\n",
" <td>5.000000</td>\n",
" <td>5.000000</td>\n",
" <td>16.700000</td>\n",
" <td>21.100000</td>\n",
" <td>2013.000000</td>\n",
" <td>6.000000</td>\n",
" <td>16.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>16.800000</td>\n",
" <td>28.200000</td>\n",
" <td>0.600000</td>\n",
" <td>5.400000</td>\n",
" <td>8.700000</td>\n",
" <td>46.000000</td>\n",
" <td>19.000000</td>\n",
" <td>24.000000</td>\n",
" <td>83.000000</td>\n",
" <td>65.000000</td>\n",
" <td>1021.800000</td>\n",
" <td>1019.400000</td>\n",
" <td>6.000000</td>\n",
" <td>6.000000</td>\n",
" <td>21.500000</td>\n",
" <td>26.300000</td>\n",
" <td>2015.000000</td>\n",
" <td>9.000000</td>\n",
" <td>23.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>33.900000</td>\n",
" <td>48.100000</td>\n",
" <td>3.200000</td>\n",
" <td>21.800000</td>\n",
" <td>14.500000</td>\n",
" <td>135.000000</td>\n",
" <td>55.000000</td>\n",
" <td>57.000000</td>\n",
" <td>100.000000</td>\n",
" <td>100.000000</td>\n",
" <td>1041.000000</td>\n",
" <td>1039.600000</td>\n",
" <td>9.000000</td>\n",
" <td>8.000000</td>\n",
" <td>40.200000</td>\n",
" <td>46.700000</td>\n",
" <td>2017.000000</td>\n",
" <td>12.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" MinTemp MaxTemp Rainfall Evaporation \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 12.193497 23.237216 0.675080 5.151606 \n",
"std 6.388279 7.094149 1.183837 2.823707 \n",
"min -8.200000 -4.800000 0.000000 0.000000 \n",
"25% 7.600000 18.000000 0.000000 4.000000 \n",
"50% 12.000000 22.600000 0.000000 4.800000 \n",
"75% 16.800000 28.200000 0.600000 5.400000 \n",
"max 33.900000 48.100000 3.200000 21.800000 \n",
"\n",
" Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 8.041154 39.884074 13.978155 18.614756 \n",
"std 2.769480 13.116959 8.806558 8.685862 \n",
"min 0.000000 6.000000 0.000000 0.000000 \n",
"25% 8.200000 31.000000 7.000000 13.000000 \n",
"50% 8.500000 39.000000 13.000000 19.000000 \n",
"75% 8.700000 46.000000 19.000000 24.000000 \n",
"max 14.500000 135.000000 55.000000 57.000000 \n",
"\n",
" Humidity9am Humidity3pm Pressure9am Pressure3pm \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 68.867486 51.509547 1017.640649 1015.241101 \n",
"std 18.935587 20.530723 6.738680 6.675168 \n",
"min 0.000000 0.000000 980.500000 977.100000 \n",
"25% 57.000000 37.000000 1013.500000 1011.000000 \n",
"50% 70.000000 52.000000 1017.600000 1015.200000 \n",
"75% 83.000000 65.000000 1021.800000 1019.400000 \n",
"max 100.000000 100.000000 1041.000000 1039.600000 \n",
"\n",
" Cloud9am Cloud3pm Temp9am Temp3pm \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 4.651801 4.703588 16.995062 21.688643 \n",
"std 2.292726 2.117847 6.463772 6.855649 \n",
"min 0.000000 0.000000 -7.200000 -5.400000 \n",
"25% 3.000000 4.000000 12.300000 16.700000 \n",
"50% 5.000000 5.000000 16.700000 21.100000 \n",
"75% 6.000000 6.000000 21.500000 26.300000 \n",
"max 9.000000 8.000000 40.200000 46.700000 \n",
"\n",
" Year Month Day \n",
"count 113754.000000 113754.000000 113754.000000 \n",
"mean 2012.759727 6.404021 15.710419 \n",
"std 2.540419 3.427798 8.796821 \n",
"min 2007.000000 1.000000 1.000000 \n",
"25% 2011.000000 3.000000 8.000000 \n",
"50% 2013.000000 6.000000 16.000000 \n",
"75% 2015.000000 9.000000 23.000000 \n",
"max 2017.000000 12.000000 31.000000 "
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train[numerical].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The summary above shows that the outliers in the `Rainfall`, `Evaporation`, `WindSpeed9am` and `WindSpeed3pm` columns are now capped."
]
},
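{
"cell_type": "markdown",
"metadata": {},
"source": [
"The capping itself was applied before this summary. As a minimal sketch of the idea, assuming a simple upper bound per column, `Series.clip` gives the same kind of result. The toy values below are made up for illustration; only the cap of 3.2 is taken from the `Rainfall` maximum reported above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy values; 3.2 is the Rainfall maximum reported by describe() above\n",
"rainfall = pd.Series([0.0, 0.6, 2.0, 15.4, 120.0])\n",
"cap = 3.2\n",
"\n",
"# Values above the cap are replaced by the cap; the rest are left unchanged\n",
"capped = rainfall.clip(upper=cap)\n",
"\n",
"print(capped.tolist())\n",
"```\n",
"\n",
"After capping, the column maximum equals the cap, which is why `max` for `Rainfall` reads 3.2 in the table."
]
},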
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categorical"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Location</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindDir9am</th>\n",
" <th>WindDir3pm</th>\n",
" <th>RainToday</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>110803</th>\n",
" <td>Witchcliffe</td>\n",
" <td>S</td>\n",
" <td>SSE</td>\n",
" <td>S</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87289</th>\n",
" <td>Cairns</td>\n",
" <td>ENE</td>\n",
" <td>SSE</td>\n",
" <td>SE</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>134949</th>\n",
" <td>AliceSprings</td>\n",
" <td>E</td>\n",
" <td>NE</td>\n",
" <td>N</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85553</th>\n",
" <td>Cairns</td>\n",
" <td>ESE</td>\n",
" <td>SSE</td>\n",
" <td>E</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16110</th>\n",
" <td>Newcastle</td>\n",
" <td>W</td>\n",
" <td>N</td>\n",
" <td>SE</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Location WindGustDir WindDir9am WindDir3pm RainToday\n",
"110803 Witchcliffe S SSE S No\n",
"87289 Cairns ENE SSE SE Yes\n",
"134949 AliceSprings E NE N No\n",
"85553 Cairns ESE SSE E No\n",
"16110 Newcastle W N SE No"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train[categorical].head()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"# encode RainToday variable\n",
"\n",
"import category_encoders as ce\n",
"\n",
"encoder = ce.BinaryEncoder(cols=['RainToday'])\n",
"\n",
"X_train = encoder.fit_transform(X_train)\n",
"\n",
"X_test = encoder.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>RainToday_0</th>\n",
" <th>RainToday_1</th>\n",
" <th>Location</th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>...</th>\n",
" <th>Humidity3pm</th>\n",
" <th>Pressure9am</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>Year</th>\n",
" <th>Month</th>\n",
" <th>Day</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>110803</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Witchcliffe</td>\n",
" <td>13.9</td>\n",
" <td>22.6</td>\n",
" <td>0.2</td>\n",
" <td>4.8</td>\n",
" <td>8.5</td>\n",
" <td>S</td>\n",
" <td>41.0</td>\n",
" <td>...</td>\n",
" <td>55.0</td>\n",
" <td>1013.9</td>\n",
" <td>1013.4</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
" <td>18.8</td>\n",
" <td>20.4</td>\n",
" <td>2014</td>\n",
" <td>4</td>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87289</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Cairns</td>\n",
" <td>22.4</td>\n",
" <td>29.4</td>\n",
" <td>2.0</td>\n",
" <td>6.0</td>\n",
" <td>6.3</td>\n",
" <td>ENE</td>\n",
" <td>33.0</td>\n",
" <td>...</td>\n",
" <td>59.0</td>\n",
" <td>1016.9</td>\n",
" <td>1013.1</td>\n",
" <td>7.0</td>\n",
" <td>5.0</td>\n",
" <td>26.4</td>\n",
" <td>27.5</td>\n",
" <td>2015</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>134949</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>AliceSprings</td>\n",
" <td>9.7</td>\n",
" <td>36.2</td>\n",
" <td>0.0</td>\n",
" <td>11.4</td>\n",
" <td>12.3</td>\n",
" <td>E</td>\n",
" <td>31.0</td>\n",
" <td>...</td>\n",
" <td>2.0</td>\n",
" <td>1018.1</td>\n",
" <td>1013.6</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>28.5</td>\n",
" <td>35.0</td>\n",
" <td>2014</td>\n",
" <td>10</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85553</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Cairns</td>\n",
" <td>20.5</td>\n",
" <td>30.1</td>\n",
" <td>0.0</td>\n",
" <td>8.8</td>\n",
" <td>11.1</td>\n",
" <td>ESE</td>\n",
" <td>37.0</td>\n",
" <td>...</td>\n",
" <td>53.0</td>\n",
" <td>1014.1</td>\n",
" <td>1010.8</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>27.3</td>\n",
" <td>29.4</td>\n",
" <td>2010</td>\n",
" <td>10</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16110</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Newcastle</td>\n",
" <td>16.8</td>\n",
" <td>29.2</td>\n",
" <td>0.0</td>\n",
" <td>4.8</td>\n",
" <td>8.5</td>\n",
" <td>W</td>\n",
" <td>39.0</td>\n",
" <td>...</td>\n",
" <td>53.0</td>\n",
" <td>1017.6</td>\n",
" <td>1015.2</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" <td>22.2</td>\n",
" <td>27.0</td>\n",
" <td>2012</td>\n",
" <td>11</td>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 25 columns</p>\n",
"</div>"
],
"text/plain": [
" RainToday_0 RainToday_1 Location MinTemp MaxTemp Rainfall \\\n",
"110803 0 1 Witchcliffe 13.9 22.6 0.2 \n",
"87289 1 0 Cairns 22.4 29.4 2.0 \n",
"134949 0 1 AliceSprings 9.7 36.2 0.0 \n",
"85553 0 1 Cairns 20.5 30.1 0.0 \n",
"16110 0 1 Newcastle 16.8 29.2 0.0 \n",
"\n",
" Evaporation Sunshine WindGustDir WindGustSpeed ... Humidity3pm \\\n",
"110803 4.8 8.5 S 41.0 ... 55.0 \n",
"87289 6.0 6.3 ENE 33.0 ... 59.0 \n",
"134949 11.4 12.3 E 31.0 ... 2.0 \n",
"85553 8.8 11.1 ESE 37.0 ... 53.0 \n",
"16110 4.8 8.5 W 39.0 ... 53.0 \n",
"\n",
" Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm Year \\\n",
"110803 1013.9 1013.4 5.0 5.0 18.8 20.4 2014 \n",
"87289 1016.9 1013.1 7.0 5.0 26.4 27.5 2015 \n",
"134949 1018.1 1013.6 1.0 1.0 28.5 35.0 2014 \n",
"85553 1014.1 1010.8 2.0 3.0 27.3 29.4 2010 \n",
"16110 1017.6 1015.2 5.0 8.0 22.2 27.0 2012 \n",
"\n",
" Month Day \n",
"110803 4 25 \n",
"87289 11 2 \n",
"134949 10 19 \n",
"85553 10 30 \n",
"16110 11 8 \n",
"\n",
"[5 rows x 25 columns]"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
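"We can see that two additional variables, `RainToday_0` and `RainToday_1`, are created from the `RainToday` variable.\n",
"\n",
"Now, I will create the `X_train` training set by combining the numerical columns, the encoded `RainToday` columns and one-hot encoded versions of the remaining categorical variables."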
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"X_train = pd.concat([X_train[numerical], X_train[['RainToday_0', 'RainToday_1']],\n",
" pd.get_dummies(X_train.Location), \n",
" pd.get_dummies(X_train.WindGustDir),\n",
" pd.get_dummies(X_train.WindDir9am),\n",
" pd.get_dummies(X_train.WindDir3pm)], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindSpeed9am</th>\n",
" <th>WindSpeed3pm</th>\n",
" <th>Humidity9am</th>\n",
" <th>Humidity3pm</th>\n",
" <th>...</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>110803</th>\n",
" <td>13.9</td>\n",
" <td>22.6</td>\n",
" <td>0.2</td>\n",
" <td>4.8</td>\n",
" <td>8.5</td>\n",
" <td>41.0</td>\n",
" <td>20.0</td>\n",
" <td>28.0</td>\n",
" <td>65.0</td>\n",
" <td>55.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87289</th>\n",
" <td>22.4</td>\n",
" <td>29.4</td>\n",
" <td>2.0</td>\n",
" <td>6.0</td>\n",
" <td>6.3</td>\n",
" <td>33.0</td>\n",
" <td>7.0</td>\n",
" <td>19.0</td>\n",
" <td>71.0</td>\n",
" <td>59.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>134949</th>\n",
" <td>9.7</td>\n",
" <td>36.2</td>\n",
" <td>0.0</td>\n",
" <td>11.4</td>\n",
" <td>12.3</td>\n",
" <td>31.0</td>\n",
" <td>15.0</td>\n",
" <td>11.0</td>\n",
" <td>6.0</td>\n",
" <td>2.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85553</th>\n",
" <td>20.5</td>\n",
" <td>30.1</td>\n",
" <td>0.0</td>\n",
" <td>8.8</td>\n",
" <td>11.1</td>\n",
" <td>37.0</td>\n",
" <td>22.0</td>\n",
" <td>19.0</td>\n",
" <td>59.0</td>\n",
" <td>53.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16110</th>\n",
" <td>16.8</td>\n",
" <td>29.2</td>\n",
" <td>0.0</td>\n",
" <td>4.8</td>\n",
" <td>8.5</td>\n",
" <td>39.0</td>\n",
" <td>0.0</td>\n",
" <td>7.0</td>\n",
" <td>72.0</td>\n",
" <td>53.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 118 columns</p>\n",
"</div>"
],
"text/plain": [
" MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed \\\n",
"110803 13.9 22.6 0.2 4.8 8.5 41.0 \n",
"87289 22.4 29.4 2.0 6.0 6.3 33.0 \n",
"134949 9.7 36.2 0.0 11.4 12.3 31.0 \n",
"85553 20.5 30.1 0.0 8.8 11.1 37.0 \n",
"16110 16.8 29.2 0.0 4.8 8.5 39.0 \n",
"\n",
" WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm ... NNW NW S \\\n",
"110803 20.0 28.0 65.0 55.0 ... 0 0 1 \n",
"87289 7.0 19.0 71.0 59.0 ... 0 0 0 \n",
"134949 15.0 11.0 6.0 2.0 ... 0 0 0 \n",
"85553 22.0 19.0 59.0 53.0 ... 0 0 0 \n",
"16110 0.0 7.0 72.0 53.0 ... 0 0 0 \n",
"\n",
" SE SSE SSW SW W WNW WSW \n",
"110803 0 0 0 0 0 0 0 \n",
"87289 1 0 0 0 0 0 0 \n",
"134949 0 0 0 0 0 0 0 \n",
"85553 0 0 0 0 0 0 0 \n",
"16110 1 0 0 0 0 0 0 \n",
"\n",
"[5 rows x 118 columns]"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, I will create the `X_test` testing set."
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"X_test = pd.concat([X_test[numerical], X_test[['RainToday_0', 'RainToday_1']],\n",
" pd.get_dummies(X_test.Location), \n",
" pd.get_dummies(X_test.WindGustDir),\n",
" pd.get_dummies(X_test.WindDir9am),\n",
" pd.get_dummies(X_test.WindDir3pm)], axis=1)"
]
},
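{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling `pd.get_dummies` separately on the training and testing sets works here because both end up with the same 118 columns. If a category were missing from one of the sets, the columns would no longer line up. A minimal sketch of one way to guard against that, using hypothetical toy frames rather than the notebook data, is to reindex the test dummies to the training columns:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frames: the test set is missing the category NW\n",
"train = pd.DataFrame({'WindGustDir': ['N', 'NW', 'S']})\n",
"test = pd.DataFrame({'WindGustDir': ['S', 'N']})\n",
"\n",
"train_d = pd.get_dummies(train.WindGustDir)\n",
"test_d = pd.get_dummies(test.WindGustDir)\n",
"\n",
"# Align the test dummies to the training columns; absent categories become all-zero columns\n",
"test_d = test_d.reindex(columns=train_d.columns, fill_value=0)\n",
"\n",
"print(list(train_d.columns) == list(test_d.columns))\n",
"```\n",
"\n",
"The same `reindex` call also drops any category that appears only in the test set, so the model always sees the columns it was trained on."
]
},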
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindSpeed9am</th>\n",
" <th>WindSpeed3pm</th>\n",
" <th>Humidity9am</th>\n",
" <th>Humidity3pm</th>\n",
" <th>...</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>86232</th>\n",
" <td>17.4</td>\n",
" <td>29.0</td>\n",
" <td>0.0</td>\n",
" <td>3.6</td>\n",
" <td>11.1</td>\n",
" <td>33.0</td>\n",
" <td>11.0</td>\n",
" <td>19.0</td>\n",
" <td>63.0</td>\n",
" <td>61.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57576</th>\n",
" <td>6.8</td>\n",
" <td>14.4</td>\n",
" <td>0.8</td>\n",
" <td>0.8</td>\n",
" <td>8.5</td>\n",
" <td>46.0</td>\n",
" <td>17.0</td>\n",
" <td>22.0</td>\n",
" <td>80.0</td>\n",
" <td>55.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124071</th>\n",
" <td>10.1</td>\n",
" <td>15.4</td>\n",
" <td>3.2</td>\n",
" <td>4.8</td>\n",
" <td>8.5</td>\n",
" <td>31.0</td>\n",
" <td>13.0</td>\n",
" <td>9.0</td>\n",
" <td>70.0</td>\n",
" <td>61.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>117955</th>\n",
" <td>14.4</td>\n",
" <td>33.4</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>11.6</td>\n",
" <td>41.0</td>\n",
" <td>9.0</td>\n",
" <td>17.0</td>\n",
" <td>40.0</td>\n",
" <td>23.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>133468</th>\n",
" <td>6.8</td>\n",
" <td>14.3</td>\n",
" <td>3.2</td>\n",
" <td>0.2</td>\n",
" <td>7.3</td>\n",
" <td>28.0</td>\n",
" <td>15.0</td>\n",
" <td>13.0</td>\n",
" <td>92.0</td>\n",
" <td>47.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 118 columns</p>\n",
"</div>"
],
"text/plain": [
" MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed \\\n",
"86232 17.4 29.0 0.0 3.6 11.1 33.0 \n",
"57576 6.8 14.4 0.8 0.8 8.5 46.0 \n",
"124071 10.1 15.4 3.2 4.8 8.5 31.0 \n",
"117955 14.4 33.4 0.0 8.0 11.6 41.0 \n",
"133468 6.8 14.3 3.2 0.2 7.3 28.0 \n",
"\n",
" WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm ... NNW NW S \\\n",
"86232 11.0 19.0 63.0 61.0 ... 0 0 0 \n",
"57576 17.0 22.0 80.0 55.0 ... 0 0 1 \n",
"124071 13.0 9.0 70.0 61.0 ... 0 0 0 \n",
"117955 9.0 17.0 40.0 23.0 ... 0 0 0 \n",
"133468 15.0 13.0 92.0 47.0 ... 0 0 0 \n",
"\n",
" SE SSE SSW SW W WNW WSW \n",
"86232 0 0 0 0 0 0 0 \n",
"57576 0 0 0 0 0 0 0 \n",
"124071 0 1 0 0 0 0 0 \n",
"117955 0 0 0 1 0 0 0 \n",
"133468 0 0 0 0 0 0 0 \n",
"\n",
"[5 rows x 118 columns]"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have the training and testing sets ready for model building. Before that, we should map all the feature variables onto the same scale. This is called **feature scaling**. I will do it as follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 11. Feature Scaling"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindSpeed9am</th>\n",
" <th>WindSpeed3pm</th>\n",
" <th>Humidity9am</th>\n",
" <th>Humidity3pm</th>\n",
" <th>...</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>...</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>12.193497</td>\n",
" <td>23.237216</td>\n",
" <td>0.675080</td>\n",
" <td>5.151606</td>\n",
" <td>8.041154</td>\n",
" <td>39.884074</td>\n",
" <td>13.978155</td>\n",
" <td>18.614756</td>\n",
" <td>68.867486</td>\n",
" <td>51.509547</td>\n",
" <td>...</td>\n",
" <td>0.054530</td>\n",
" <td>0.060288</td>\n",
" <td>0.067259</td>\n",
" <td>0.101605</td>\n",
" <td>0.064059</td>\n",
" <td>0.056402</td>\n",
" <td>0.064464</td>\n",
" <td>0.069334</td>\n",
" <td>0.060798</td>\n",
" <td>0.065483</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>6.388279</td>\n",
" <td>7.094149</td>\n",
" <td>1.183837</td>\n",
" <td>2.823707</td>\n",
" <td>2.769480</td>\n",
" <td>13.116959</td>\n",
" <td>8.806558</td>\n",
" <td>8.685862</td>\n",
" <td>18.935587</td>\n",
" <td>20.530723</td>\n",
" <td>...</td>\n",
" <td>0.227061</td>\n",
" <td>0.238021</td>\n",
" <td>0.250471</td>\n",
" <td>0.302130</td>\n",
" <td>0.244860</td>\n",
" <td>0.230698</td>\n",
" <td>0.245578</td>\n",
" <td>0.254022</td>\n",
" <td>0.238960</td>\n",
" <td>0.247378</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-8.200000</td>\n",
" <td>-4.800000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>6.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>7.600000</td>\n",
" <td>18.000000</td>\n",
" <td>0.000000</td>\n",
" <td>4.000000</td>\n",
" <td>8.200000</td>\n",
" <td>31.000000</td>\n",
" <td>7.000000</td>\n",
" <td>13.000000</td>\n",
" <td>57.000000</td>\n",
" <td>37.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>12.000000</td>\n",
" <td>22.600000</td>\n",
" <td>0.000000</td>\n",
" <td>4.800000</td>\n",
" <td>8.500000</td>\n",
" <td>39.000000</td>\n",
" <td>13.000000</td>\n",
" <td>19.000000</td>\n",
" <td>70.000000</td>\n",
" <td>52.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>16.800000</td>\n",
" <td>28.200000</td>\n",
" <td>0.600000</td>\n",
" <td>5.400000</td>\n",
" <td>8.700000</td>\n",
" <td>46.000000</td>\n",
" <td>19.000000</td>\n",
" <td>24.000000</td>\n",
" <td>83.000000</td>\n",
" <td>65.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>33.900000</td>\n",
" <td>48.100000</td>\n",
" <td>3.200000</td>\n",
" <td>21.800000</td>\n",
" <td>14.500000</td>\n",
" <td>135.000000</td>\n",
" <td>55.000000</td>\n",
" <td>57.000000</td>\n",
" <td>100.000000</td>\n",
" <td>100.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 118 columns</p>\n",
"</div>"
],
"text/plain": [
" MinTemp MaxTemp Rainfall Evaporation \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 12.193497 23.237216 0.675080 5.151606 \n",
"std 6.388279 7.094149 1.183837 2.823707 \n",
"min -8.200000 -4.800000 0.000000 0.000000 \n",
"25% 7.600000 18.000000 0.000000 4.000000 \n",
"50% 12.000000 22.600000 0.000000 4.800000 \n",
"75% 16.800000 28.200000 0.600000 5.400000 \n",
"max 33.900000 48.100000 3.200000 21.800000 \n",
"\n",
" Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 8.041154 39.884074 13.978155 18.614756 \n",
"std 2.769480 13.116959 8.806558 8.685862 \n",
"min 0.000000 6.000000 0.000000 0.000000 \n",
"25% 8.200000 31.000000 7.000000 13.000000 \n",
"50% 8.500000 39.000000 13.000000 19.000000 \n",
"75% 8.700000 46.000000 19.000000 24.000000 \n",
"max 14.500000 135.000000 55.000000 57.000000 \n",
"\n",
" Humidity9am Humidity3pm ... NNW \\\n",
"count 113754.000000 113754.000000 ... 113754.000000 \n",
"mean 68.867486 51.509547 ... 0.054530 \n",
"std 18.935587 20.530723 ... 0.227061 \n",
"min 0.000000 0.000000 ... 0.000000 \n",
"25% 57.000000 37.000000 ... 0.000000 \n",
"50% 70.000000 52.000000 ... 0.000000 \n",
"75% 83.000000 65.000000 ... 0.000000 \n",
"max 100.000000 100.000000 ... 1.000000 \n",
"\n",
" NW S SE SSE \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 0.060288 0.067259 0.101605 0.064059 \n",
"std 0.238021 0.250471 0.302130 0.244860 \n",
"min 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.000000 0.000000 0.000000 0.000000 \n",
"50% 0.000000 0.000000 0.000000 0.000000 \n",
"75% 0.000000 0.000000 0.000000 0.000000 \n",
"max 1.000000 1.000000 1.000000 1.000000 \n",
"\n",
" SSW SW W WNW \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 0.056402 0.064464 0.069334 0.060798 \n",
"std 0.230698 0.245578 0.254022 0.238960 \n",
"min 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.000000 0.000000 0.000000 0.000000 \n",
"50% 0.000000 0.000000 0.000000 0.000000 \n",
"75% 0.000000 0.000000 0.000000 0.000000 \n",
"max 1.000000 1.000000 1.000000 1.000000 \n",
"\n",
" WSW \n",
"count 113754.000000 \n",
"mean 0.065483 \n",
"std 0.247378 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 0.000000 \n",
"max 1.000000 \n",
"\n",
"[8 rows x 118 columns]"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.describe()"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"cols = X_train.columns"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"scaler = MinMaxScaler()\n",
"\n",
"X_train = scaler.fit_transform(X_train)\n",
"\n",
"X_test = scaler.transform(X_test)\n"
]
},
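{
"cell_type": "markdown",
"metadata": {},
"source": [
"`MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`, where `min` and `max` are learned from the data passed to `fit`. The sketch below checks this on a toy column; the three values are the `MinTemp` minimum, median and maximum from the summary above, and the variable names are illustrative only.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"# Toy column built from the MinTemp min, median and max reported by describe()\n",
"x = np.array([[-8.2], [12.0], [33.9]])\n",
"\n",
"# Manual min-max scaling, column by column\n",
"manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))\n",
"\n",
"# The same transformation through scikit-learn\n",
"scaled = MinMaxScaler().fit_transform(x)\n",
"\n",
"print(np.allclose(manual, scaled))\n",
"print(scaled.ravel())\n",
"```\n",
"\n",
"Fitting the scaler on the training set only and then calling `transform` on the testing set, as done above, keeps information from the testing set out of the scaling parameters."
]
},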
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"X_train = pd.DataFrame(X_train, columns=cols)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"X_test = pd.DataFrame(X_test, columns=cols)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindSpeed9am</th>\n",
" <th>WindSpeed3pm</th>\n",
" <th>Humidity9am</th>\n",
" <th>Humidity3pm</th>\n",
" <th>...</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>...</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" <td>113754.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.484406</td>\n",
" <td>0.530004</td>\n",
" <td>0.210962</td>\n",
" <td>0.236312</td>\n",
" <td>0.554562</td>\n",
" <td>0.262667</td>\n",
" <td>0.254148</td>\n",
" <td>0.326575</td>\n",
" <td>0.688675</td>\n",
" <td>0.515095</td>\n",
" <td>...</td>\n",
" <td>0.054530</td>\n",
" <td>0.060288</td>\n",
" <td>0.067259</td>\n",
" <td>0.101605</td>\n",
" <td>0.064059</td>\n",
" <td>0.056402</td>\n",
" <td>0.064464</td>\n",
" <td>0.069334</td>\n",
" <td>0.060798</td>\n",
" <td>0.065483</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.151741</td>\n",
" <td>0.134105</td>\n",
" <td>0.369949</td>\n",
" <td>0.129528</td>\n",
" <td>0.190999</td>\n",
" <td>0.101682</td>\n",
" <td>0.160119</td>\n",
" <td>0.152384</td>\n",
" <td>0.189356</td>\n",
" <td>0.205307</td>\n",
" <td>...</td>\n",
" <td>0.227061</td>\n",
" <td>0.238021</td>\n",
" <td>0.250471</td>\n",
" <td>0.302130</td>\n",
" <td>0.244860</td>\n",
" <td>0.230698</td>\n",
" <td>0.245578</td>\n",
" <td>0.254022</td>\n",
" <td>0.238960</td>\n",
" <td>0.247378</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.375297</td>\n",
" <td>0.431002</td>\n",
" <td>0.000000</td>\n",
" <td>0.183486</td>\n",
" <td>0.565517</td>\n",
" <td>0.193798</td>\n",
" <td>0.127273</td>\n",
" <td>0.228070</td>\n",
" <td>0.570000</td>\n",
" <td>0.370000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.479810</td>\n",
" <td>0.517958</td>\n",
" <td>0.000000</td>\n",
" <td>0.220183</td>\n",
" <td>0.586207</td>\n",
" <td>0.255814</td>\n",
" <td>0.236364</td>\n",
" <td>0.333333</td>\n",
" <td>0.700000</td>\n",
" <td>0.520000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.593824</td>\n",
" <td>0.623819</td>\n",
" <td>0.187500</td>\n",
" <td>0.247706</td>\n",
" <td>0.600000</td>\n",
" <td>0.310078</td>\n",
" <td>0.345455</td>\n",
" <td>0.421053</td>\n",
" <td>0.830000</td>\n",
" <td>0.650000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 118 columns</p>\n",
"</div>"
],
"text/plain": [
" MinTemp MaxTemp Rainfall Evaporation \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 0.484406 0.530004 0.210962 0.236312 \n",
"std 0.151741 0.134105 0.369949 0.129528 \n",
"min 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.375297 0.431002 0.000000 0.183486 \n",
"50% 0.479810 0.517958 0.000000 0.220183 \n",
"75% 0.593824 0.623819 0.187500 0.247706 \n",
"max 1.000000 1.000000 1.000000 1.000000 \n",
"\n",
" Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 0.554562 0.262667 0.254148 0.326575 \n",
"std 0.190999 0.101682 0.160119 0.152384 \n",
"min 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.565517 0.193798 0.127273 0.228070 \n",
"50% 0.586207 0.255814 0.236364 0.333333 \n",
"75% 0.600000 0.310078 0.345455 0.421053 \n",
"max 1.000000 1.000000 1.000000 1.000000 \n",
"\n",
" Humidity9am Humidity3pm ... NNW \\\n",
"count 113754.000000 113754.000000 ... 113754.000000 \n",
"mean 0.688675 0.515095 ... 0.054530 \n",
"std 0.189356 0.205307 ... 0.227061 \n",
"min 0.000000 0.000000 ... 0.000000 \n",
"25% 0.570000 0.370000 ... 0.000000 \n",
"50% 0.700000 0.520000 ... 0.000000 \n",
"75% 0.830000 0.650000 ... 0.000000 \n",
"max 1.000000 1.000000 ... 1.000000 \n",
"\n",
" NW S SE SSE \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 0.060288 0.067259 0.101605 0.064059 \n",
"std 0.238021 0.250471 0.302130 0.244860 \n",
"min 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.000000 0.000000 0.000000 0.000000 \n",
"50% 0.000000 0.000000 0.000000 0.000000 \n",
"75% 0.000000 0.000000 0.000000 0.000000 \n",
"max 1.000000 1.000000 1.000000 1.000000 \n",
"\n",
" SSW SW W WNW \\\n",
"count 113754.000000 113754.000000 113754.000000 113754.000000 \n",
"mean 0.056402 0.064464 0.069334 0.060798 \n",
"std 0.230698 0.245578 0.254022 0.238960 \n",
"min 0.000000 0.000000 0.000000 0.000000 \n",
"25% 0.000000 0.000000 0.000000 0.000000 \n",
"50% 0.000000 0.000000 0.000000 0.000000 \n",
"75% 0.000000 0.000000 0.000000 0.000000 \n",
"max 1.000000 1.000000 1.000000 1.000000 \n",
"\n",
" WSW \n",
"count 113754.000000 \n",
"mean 0.065483 \n",
"std 0.247378 \n",
"min 0.000000 \n",
"25% 0.000000 \n",
"50% 0.000000 \n",
"75% 0.000000 \n",
"max 1.000000 \n",
"\n",
"[8 rows x 118 columns]"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `X_train` dataset is now ready to be fed into the Logistic Regression classifier. I train the model as follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 12. Model training"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l2', random_state=0, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# train a logistic regression model on the training set\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"\n",
"# instantiate the model\n",
"logreg = LogisticRegression(solver='liblinear', random_state=0)\n",
"\n",
"\n",
"# fit the model\n",
"logreg.fit(X_train, y_train)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 13. Predict results"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['No', 'No', 'No', ..., 'No', 'No', 'Yes'], dtype=object)"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_test = logreg.predict(X_test)\n",
"\n",
"y_pred_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### predict_proba method\n",
"\n",
"\n",
"The **predict_proba** method returns the predicted probability of each class for every observation, in array form. The columns follow the order of `logreg.classes_`, which here is `['No', 'Yes']`.\n",
"\n",
"Column `0` is the probability of no rain and column `1` is the probability of rain."
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.91387232, 0.83563172, 0.82035588, ..., 0.97674036, 0.7985333 ,\n",
" 0.3073458 ])"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# probability of getting output as 0 - no rain\n",
"\n",
"logreg.predict_proba(X_test)[:,0]"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.08612768, 0.16436828, 0.17964412, ..., 0.02325964, 0.2014667 ,\n",
" 0.6926542 ])"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# probability of getting output as 1 - rain\n",
"\n",
"logreg.predict_proba(X_test)[:,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 14. Check accuracy score"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model accuracy score: 0.8501\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, **y_test** holds the true class labels and **y_pred_test** holds the predicted class labels for the test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare the train-set and test-set accuracy\n",
"\n",
"\n",
"Now, I will compare the train-set and test-set accuracy to check for overfitting."
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['No', 'No', 'No', ..., 'No', 'No', 'No'], dtype=object)"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred_train = logreg.predict(X_train)\n",
"\n",
"y_pred_train"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training-set accuracy score: 0.8476\n"
]
}
],
"source": [
"print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check for overfitting and underfitting"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.8476\n",
"Test set score: 0.8501\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(logreg.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(logreg.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training-set accuracy score is 0.8476 while the test-set accuracy is 0.8501. These two values are quite comparable, so there is no sign of overfitting. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Logistic Regression, the default value is C = 1. It gives approximately 85% accuracy on both the training and the test set. Because the two scores are so close, the model is not overfitting, and it may instead be underfitting, i.e. regularized too strongly to capture all the structure in the data. \n",
"\n",
"To test this, I will increase C, which weakens the regularization, and fit a more flexible model."
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l2', random_state=0, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fit the Logistic Regression model with C=100\n",
"\n",
"# instantiate the model\n",
"logreg100 = LogisticRegression(C=100, solver='liblinear', random_state=0)\n",
"\n",
"\n",
"# fit the model\n",
"logreg100.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.8478\n",
"Test set score: 0.8505\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(logreg100.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(logreg100.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that C=100 gives a slightly higher training-set accuracy (0.8478 versus 0.8476) and test-set accuracy (0.8505 versus 0.8501). The gain is very small, so a more complex model performs only marginally better here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I will investigate what happens if we use a more regularized model than the default of C=1, by setting C=0.01."
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l2', random_state=0, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fit the Logistic Regression model with C=0.01\n",
"\n",
"# instantiate the model\n",
"logreg001 = LogisticRegression(C=0.01, solver='liblinear', random_state=0)\n",
"\n",
"\n",
"# fit the model\n",
"logreg001.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.8409\n",
"Test set score: 0.8448\n"
]
}
],
"source": [
"# print the scores on training and test set\n",
"\n",
"print('Training set score: {:.4f}'.format(logreg001.score(X_train, y_train)))\n",
"\n",
"print('Test set score: {:.4f}'.format(logreg001.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, if we use a more regularized model by setting C=0.01, both the training-set and test-set accuracy decrease relative to the default parameters (0.8409 and 0.8448 versus 0.8476 and 0.8501)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare model accuracy with null accuracy\n",
"\n",
"\n",
"So, the model accuracy is 0.8501. But, we cannot say that our model is very good based on the above accuracy. We must compare it with the **null accuracy**. Null accuracy is the accuracy that could be achieved by always predicting the most frequent class.\n",
"\n",
"So, we should first check the class distribution in the test set. "
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"No 22067\n",
"Yes 6372\n",
"Name: RainTomorrow, dtype: int64"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check class distribution in test set\n",
"\n",
"y_test.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the most frequent class, `No`, occurs 22067 times. So, we can calculate the null accuracy by dividing 22067 by the total number of observations in the test set."
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null accuracy score: 0.7759\n"
]
}
],
"source": [
"# check null accuracy score\n",
"\n",
"null_accuracy = (22067/(22067+6372))\n",
"\n",
"print('Null accuracy score: {0:0.4f}'.format(null_accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that our model accuracy score is 0.8501 while the null accuracy score is 0.7759. So, our Logistic Regression model predicts the class labels about 7.4 percentage points more accurately than always predicting the most frequent class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the above analysis, the classification accuracy of our model is clearly better than the null accuracy.\n",
"\n",
"\n",
"But accuracy alone does not show how the predictions are distributed across the classes. Also, it does not tell us anything about the types of errors our classifier is making. \n",
"\n",
"\n",
"We have another tool called `Confusion matrix` that addresses both points."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 15. Confusion matrix\n",
"\n",
"\n",
"A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.\n",
"\n",
"\n",
"Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-\n",
"\n",
"\n",
"**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.\n",
"\n",
"\n",
"**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.\n",
"\n",
"\n",
"**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**\n",
"\n",
"\n",
"\n",
"**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This type of error is called **Type II error.**\n",
"\n",
"\n",
"\n",
"These four outcomes are summarized in a confusion matrix given below.\n"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[20892 1175]\n",
" [ 3087 3285]]\n",
"\n",
"True Positives(TP) = 3285\n",
"\n",
"True Negatives(TN) = 20892\n",
"\n",
"False Positives(FP) = 1175\n",
"\n",
"False Negatives(FN) = 3087\n"
]
}
],
"source": [
"# Print the Confusion Matrix and slice it into four pieces\n",
"\n",
"# Rows are actual classes and columns are predicted classes, both in the order ['No', 'Yes'].\n",
"# 'Yes' (rain tomorrow) is taken as the positive class.\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm = confusion_matrix(y_test, y_pred_test)\n",
"\n",
"print('Confusion matrix\\n\\n', cm)\n",
"\n",
"print('\\nTrue Positives(TP) = ', cm[1,1])\n",
"\n",
"print('\\nTrue Negatives(TN) = ', cm[0,0])\n",
"\n",
"print('\\nFalse Positives(FP) = ', cm[0,1])\n",
"\n",
"print('\\nFalse Negatives(FN) = ', cm[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The confusion matrix shows `20892 + 3285 = 24177 correct predictions` and `3087 + 1175 = 4262 incorrect predictions`.\n",
"\n",
"\n",
"Taking `Yes` (rain tomorrow) as the positive class (1) and `No` as the negative class (0), we have\n",
"\n",
"\n",
"- `True Positives` (Actual Positive:1 and Predict Positive:1) - 3285\n",
"\n",
"\n",
"- `True Negatives` (Actual Negative:0 and Predict Negative:0) - 20892\n",
"\n",
"\n",
"- `False Positives` (Actual Negative:0 but Predict Positive:1) - 1175 `(Type I error)`\n",
"\n",
"\n",
"- `False Negatives` (Actual Positive:1 but Predict Negative:0) - 3087 `(Type II error)`"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0xacc3104f60>"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# visualize confusion matrix with seaborn heatmap\n",
"# (rows of cm are actual classes, columns are predicted classes)\n",
"\n",
"cm_matrix = pd.DataFrame(data=cm, columns=['Predicted No', 'Predicted Yes'], \n",
" index=['Actual No', 'Actual Yes'])\n",
"\n",
"sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 16. Classification metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification Report\n",
"\n",
"\n",
"**Classification report** is another way to evaluate the classification model performance. It displays the **precision**, **recall**, **f1-score** and **support** for each class. I describe these terms later in this section.\n",
"\n",
"We can print a classification report as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" No 0.87 0.95 0.91 22067\n",
" Yes 0.74 0.52 0.61 6372\n",
"\n",
" micro avg 0.85 0.85 0.85 28439\n",
" macro avg 0.80 0.73 0.76 28439\n",
"weighted avg 0.84 0.85 0.84 28439\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, y_pred_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification accuracy"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [],
"source": [
"# Take 'Yes' (rain tomorrow) as the positive class.\n",
"# Rows of cm are actual classes and columns are predicted classes, in the order ['No', 'Yes'].\n",
"TP = cm[1,1]\n",
"TN = cm[0,0]\n",
"FP = cm[0,1]\n",
"FN = cm[1,0]"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification accuracy : 0.8501\n"
]
}
],
"source": [
"# print classification accuracy\n",
"\n",
"classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)\n",
"\n",
"print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification error"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classification error : 0.1499\n"
]
}
],
"source": [
"# print classification error\n",
"\n",
"classification_error = (FP + FN) / float(TP + TN + FP + FN)\n",
"\n",
"print('Classification error : {0:0.4f}'.format(classification_error))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision\n",
"\n",
"\n",
"**Precision** can be defined as the percentage of correctly predicted positive outcomes out of all the predicted positive outcomes. It can be given as the ratio of true positives (TP) to the sum of true and false positives (TP + FP). \n",
"\n",
"\n",
"So, **Precision** identifies the proportion of correctly predicted positive outcomes. It is more concerned with the positive class than the negative class. Here the positive class is `Yes` (rain tomorrow).\n",
"\n",
"\n",
"\n",
"Mathematically, precision can be defined as the ratio of `TP to (TP + FP).`\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision : 0.7365\n"
]
}
],
"source": [
"# print precision score\n",
"\n",
"precision = TP / float(TP + FP)\n",
"\n",
"\n",
"print('Precision : {0:0.4f}'.format(precision))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recall\n",
"\n",
"\n",
"Recall can be defined as the percentage of correctly predicted positive outcomes out of all the actual positive outcomes.\n",
"It can be given as the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN). **Recall** is also called **Sensitivity**.\n",
"\n",
"\n",
"**Recall** identifies the proportion of actual positives that are correctly predicted.\n",
"\n",
"\n",
"Mathematically, recall can be given as the ratio of `TP to (TP + FN).`\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recall or Sensitivity : 0.5155\n"
]
}
],
"source": [
"recall = TP / float(TP + FN)\n",
"\n",
"print('Recall or Sensitivity : {0:0.4f}'.format(recall))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### True Positive Rate\n",
"\n",
"\n",
"**True Positive Rate** is synonymous with **Recall**.\n"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"True Positive Rate : 0.5155\n"
]
}
],
"source": [
"true_positive_rate = TP / float(TP + FN)\n",
"\n",
"\n",
"print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### False Positive Rate\n",
"\n",
"\n",
"**False Positive Rate** is the proportion of actual negatives that are incorrectly predicted as positive. Mathematically, it is the ratio of `FP to (FP + TN).`"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False Positive Rate : 0.0532\n"
]
}
],
"source": [
"false_positive_rate = FP / float(FP + TN)\n",
"\n",
"\n",
"print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Specificity\n",
"\n",
"\n",
"**Specificity** is the proportion of actual negatives that are correctly predicted as negative. Mathematically, it is the ratio of `TN to (TN + FP)`, which equals 1 minus the False Positive Rate."
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Specificity : 0.9468\n"
]
}
],
"source": [
"specificity = TN / (TN + FP)\n",
"\n",
"print('Specificity : {0:0.4f}'.format(specificity))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### f1-score\n",
"\n",
"\n",
"**f1-score** is the harmonic mean of precision and recall. The best possible **f1-score** is 1.0 and the worst \n",
"is 0.0. Because the harmonic mean is pulled towards the smaller of the two values, **f1-score** is low whenever either precision or recall is low, so on an imbalanced dataset it is often lower than accuracy. When the classes are imbalanced, the weighted average of `f1-score` is a more informative way to \n",
"compare classifier models than global accuracy.\n",
"\n"
]
},
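{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick worked check of this definition, the formula below uses `P` and `R` as shorthand for precision and recall (my own notation, not scikit-learn's) and plugs in the rounded values that the classification report above gives for the `Yes` class.\n",
"\n",
"```latex\n",
"F_1 = \\frac{2 \\, P \\, R}{P + R}\n",
"    = \\frac{2 \\times 0.74 \\times 0.52}{0.74 + 0.52}\n",
"    = \\frac{0.7696}{1.26}\n",
"    \\approx 0.61\n",
"```\n",
"\n",
"This agrees with the f1-score of 0.61 that the report shows for `Yes`."
]
},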
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Support\n",
"\n",
"\n",
"**Support** is the actual number of occurrences of the class in our dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 17. Adjusting the threshold level"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.91387232, 0.08612768],\n",
" [0.83563172, 0.16436828],\n",
" [0.82035588, 0.17964412],\n",
" [0.99025882, 0.00974118],\n",
" [0.95726809, 0.04273191],\n",
" [0.97994232, 0.02005768],\n",
" [0.17838588, 0.82161412],\n",
" [0.23482434, 0.76517566],\n",
" [0.90050811, 0.09949189],\n",
" [0.85480088, 0.14519912]])"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the first 10 predicted probabilities of two classes- 0 and 1\n",
"\n",
"y_pred_prob = logreg.predict_proba(X_test)[0:10]\n",
"\n",
"y_pred_prob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observations\n",
"\n",
"\n",
"- In each row, the numbers sum to 1.\n",
"\n",
"\n",
"- There are 2 columns which correspond to 2 classes - 0 and 1.\n",
"\n",
" - Class 0 - predicted probability that there is no rain tomorrow. \n",
" \n",
" - Class 1 - predicted probability that there is rain tomorrow.\n",
" \n",
" \n",
"- Importance of predicted probabilities\n",
"\n",
" - We can rank the observations by probability of rain or no rain.\n",
"\n",
"\n",
"- predict_proba and predict\n",
"\n",
" - `predict_proba` predicts the probabilities \n",
" \n",
" - `predict` chooses the class with the highest probability \n",
" \n",
" \n",
"- Classification threshold level\n",
"\n",
" - For a binary problem, this amounts to a classification threshold level of 0.5 on the probability of class 1. \n",
" \n",
" - Class 1 - rain is predicted if the probability of rain > 0.5. \n",
" \n",
" - Class 0 - no rain is predicted if the probability of rain < 0.5. \n",
" \n"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Prob of - No rain tomorrow (0)</th>\n",
" <th>Prob of - Rain tomorrow (1)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.913872</td>\n",
" <td>0.086128</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.835632</td>\n",
" <td>0.164368</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.820356</td>\n",
" <td>0.179644</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.990259</td>\n",
" <td>0.009741</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.957268</td>\n",
" <td>0.042732</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.979942</td>\n",
" <td>0.020058</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.178386</td>\n",
" <td>0.821614</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0.234824</td>\n",
" <td>0.765176</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0.900508</td>\n",
" <td>0.099492</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0.854801</td>\n",
" <td>0.145199</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Prob of - No rain tomorrow (0) Prob of - Rain tomorrow (1)\n",
"0 0.913872 0.086128\n",
"1 0.835632 0.164368\n",
"2 0.820356 0.179644\n",
"3 0.990259 0.009741\n",
"4 0.957268 0.042732\n",
"5 0.979942 0.020058\n",
"6 0.178386 0.821614\n",
"7 0.234824 0.765176\n",
"8 0.900508 0.099492\n",
"9 0.854801 0.145199"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# store the probabilities in dataframe\n",
"\n",
"y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - No rain tomorrow (0)', 'Prob of - Rain tomorrow (1)'])\n",
"\n",
"y_pred_prob_df"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.08612768, 0.16436828, 0.17964412, 0.00974118, 0.04273191,\n",
" 0.02005768, 0.82161412, 0.76517566, 0.09949189, 0.14519912])"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the first 10 predicted probabilities for class 1 - Probability of rain\n",
"\n",
"logreg.predict_proba(X_test)[0:10, 1]"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"# store the predicted probabilities for class 1 - Probability of rain\n",
"\n",
"y_pred1 = logreg.predict_proba(X_test)[:, 1]"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0,0.5,'Frequency')"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot histogram of predicted probabilities\n",
"\n",
"\n",
"# adjust the font size \n",
"plt.rcParams['font.size'] = 12\n",
"\n",
"\n",
"# plot histogram with 10 bins\n",
"plt.hist(y_pred1, bins = 10)\n",
"\n",
"\n",
"# set the title of predicted probabilities\n",
"plt.title('Histogram of predicted probabilities of rain')\n",
"\n",
"\n",
"# set the x-axis limit\n",
"plt.xlim(0,1)\n",
"\n",
"\n",
"# set the axis labels\n",
"plt.xlabel('Predicted probabilities of rain')\n",
"plt.ylabel('Frequency')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Observations\n",
"\n",
"\n",
"- We can see that the above histogram is highly positively skewed.\n",
"\n",
"\n",
"- The first column tells us that there are approximately 15000 observations with probability between 0.0 and 0.1.\n",
"\n",
"\n",
"- There is a small number of observations with probability > 0.5.\n",
"\n",
"\n",
"- So, only this small number of observations is predicted to have rain tomorrow.\n",
"\n",
"\n",
"- For the majority of observations, the model predicts that there will be no rain tomorrow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lower the threshold"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"With 0.1 threshold the Confusion Matrix is \n",
"\n",
" [[12726 9341]\n",
" [ 547 5825]] \n",
"\n",
" with 18551 correct predictions, \n",
"\n",
" 9341 Type I errors( False Positives), \n",
"\n",
" 547 Type II errors( False Negatives), \n",
"\n",
" Accuracy score: 0.6523084496641935 \n",
"\n",
" Sensitivity: 0.9141556811048337 \n",
"\n",
" Specificity: 0.5766982371867494 \n",
"\n",
" ==================================================== \n",
"\n",
"\n",
"With 0.2 threshold the Confusion Matrix is \n",
"\n",
" [[17067 5000]\n",
" [ 1233 5139]] \n",
"\n",
" with 22206 correct predictions, \n",
"\n",
" 5000 Type I errors( False Positives), \n",
"\n",
" 1233 Type II errors( False Negatives), \n",
"\n",
" Accuracy score: 0.7808291430781673 \n",
"\n",
" Sensitivity: 0.806497175141243 \n",
"\n",
" Specificity: 0.7734173199800607 \n",
"\n",
" ==================================================== \n",
"\n",
"\n",
"With 0.3 threshold the Confusion Matrix is \n",
"\n",
" [[19080 2987]\n",
" [ 1873 4499]] \n",
"\n",
" with 23579 correct predictions, \n",
"\n",
" 2987 Type I errors( False Positives), \n",
"\n",
" 1873 Type II errors( False Negatives), \n",
"\n",
" Accuracy score: 0.8291079151868912 \n",
"\n",
" Sensitivity: 0.7060577526679221 \n",
"\n",
" Specificity: 0.8646395069560883 \n",
"\n",
" ==================================================== \n",
"\n",
"\n",
"With 0.4 threshold the Confusion Matrix is \n",
"\n",
" [[20191 1876]\n",
" [ 2517 3855]] \n",
"\n",
" with 24046 correct predictions, \n",
"\n",
" 1876 Type I errors( False Positives), \n",
"\n",
" 2517 Type II errors( False Negatives), \n",
"\n",
" Accuracy score: 0.845529027040332 \n",
"\n",
" Sensitivity: 0.6049905838041432 \n",
"\n",
" Specificity: 0.9149861784565188 \n",
"\n",
" ==================================================== \n",
"\n",
"\n"
]
}
],
"source": [
"from sklearn.preprocessing import binarize\n",
"\n",
"for i in range(1,5):\n",
" \n",
" cm1=0\n",
" \n",
" y_pred1 = logreg.predict_proba(X_test)[:,1]\n",
" \n",
" y_pred1 = y_pred1.reshape(-1,1)\n",
" \n",
" y_pred2 = binarize(y_pred1, i/10)\n",
" \n",
" y_pred2 = np.where(y_pred2 == 1, 'Yes', 'No')\n",
" \n",
" cm1 = confusion_matrix(y_test, y_pred2)\n",
" \n",
" print ('With',i/10,'threshold the Confusion Matrix is ','\\n\\n',cm1,'\\n\\n',\n",
" \n",
" 'with',cm1[0,0]+cm1[1,1],'correct predictions, ', '\\n\\n', \n",
" \n",
" cm1[0,1],'Type I errors( False Positives), ','\\n\\n',\n",
" \n",
" cm1[1,0],'Type II errors( False Negatives), ','\\n\\n',\n",
" \n",
" 'Accuracy score: ', (accuracy_score(y_test, y_pred2)), '\\n\\n',\n",
" \n",
" 'Sensitivity: ',cm1[1,1]/(float(cm1[1,1]+cm1[1,0])), '\\n\\n',\n",
" \n",
" 'Specificity: ',cm1[0,0]/(float(cm1[0,0]+cm1[0,1])),'\\n\\n',\n",
" \n",
" '====================================================', '\\n\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"- In binary problems, the threshold of 0.5 is used by default to convert predicted probabilities into class predictions.\n",
"\n",
"\n",
"- Threshold can be adjusted to increase sensitivity or specificity. \n",
"\n",
"\n",
"- Sensitivity and specificity have an inverse relationship. Increasing one would always decrease the other and vice versa.\n",
"\n",
"\n",
"- We can see that increasing the threshold level results in increased accuracy.\n",
"\n",
"\n",
"- Adjusting the threshold level should be one of the last step you do in the model-building process."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 18. ROC - AUC\n",
"\n",
"\n",
"\n",
"### ROC Curve\n",
"\n",
"\n",
"Another tool to measure the classification model performance visually is **ROC Curve**. ROC Curve stands for **Receiver Operating Characteristic Curve**. An **ROC Curve** is a plot which shows the performance of a classification model at various \n",
"classification threshold levels. \n",
"\n",
"\n",
"\n",
"The **ROC Curve** plots the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold levels.\n",
"\n",
"\n",
"\n",
"**True Positive Rate (TPR)** is also called **Recall**. It is defined as the ratio of `TP to (TP + FN).`\n",
"\n",
"\n",
"\n",
"**False Positive Rate (FPR)** is defined as the ratio of `FP to (FP + TN).`\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"In the ROC Curve, we will focus on the TPR (True Positive Rate) and FPR (False Positive Rate) of a single point. This will give us the general performance of the ROC curve which consists of the TPR and FPR at various threshold levels. So, an ROC Curve plots TPR vs FPR at different classification threshold levels. If we lower the threshold levels, it may result in more items being classified as positve. It will increase both True Positives (TP) and False Positives (FP).\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# plot ROC Curve\n",
"\n",
"from sklearn.metrics import roc_curve\n",
"\n",
"fpr, tpr, thresholds = roc_curve(y_test, y_pred1, pos_label = 'Yes')\n",
"\n",
"plt.figure(figsize=(6,4))\n",
"\n",
"plt.plot(fpr, tpr, linewidth=2)\n",
"\n",
"plt.plot([0,1], [0,1], 'k--' )\n",
"\n",
"plt.rcParams['font.size'] = 12\n",
"\n",
"plt.title('ROC curve for RainTomorrow classifier')\n",
"\n",
"plt.xlabel('False Positive Rate (1 - Specificity)')\n",
"\n",
"plt.ylabel('True Positive Rate (Sensitivity)')\n",
"\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curve help us to choose a threshold level that balances sensitivity and specificity for a particular context."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC AUC\n",
"\n",
"\n",
"**ROC AUC** stands for **Receiver Operating Characteristic - Area Under Curve**. It is a technique to compare classifier performance. In this technique, we measure the `area under the curve (AUC)`. A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. \n",
"\n",
"\n",
"So, **ROC AUC** is the percentage of the ROC plot that is underneath the curve."
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ROC AUC : 0.8729\n"
]
}
],
"source": [
"# compute ROC AUC\n",
"\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"ROC_AUC = roc_auc_score(y_test, y_pred1)\n",
"\n",
"print('ROC AUC : {:.4f}'.format(ROC_AUC))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"- ROC AUC is a single number summary of classifier performance. The higher the value, the better the classifier.\n",
"\n",
"- ROC AUC of our model approaches towards 1. So, we can conclude that our classifier does a good job in predicting whether it will rain tomorrow or not."
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross validated ROC AUC : 0.8695\n"
]
}
],
"source": [
"# calculate cross-validated ROC AUC \n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"Cross_validated_ROC_AUC = cross_val_score(logreg, X_train, y_train, cv=5, scoring='roc_auc').mean()\n",
"\n",
"print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model evaluation and improvement\n",
"\n",
"\n",
"\n",
"In this section, I will employ several techniques to improve the model performance. I will discuss 3 techniques which are used in practice for performance improvement. These are `recursive feature elimination`, `k-fold cross validation` and `hyperparameter optimization using GridSearchCV`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 19. Recursive Feature Elimination with Cross Validation\n",
"\n",
"\n",
"`Recursive feature elimination (RFE)` is a feature selection technique that helps us to select best features from the given number of features. At first, the model is built on all the given features. Then, it removes the least useful predictor and build the model again. This process is repeated until all the unimportant features are removed from the model.\n",
"\n",
"\n",
"`Recursive Feature Elimination with Cross-Validated (RFECV) feature selection` technique selects the best subset of features for the estimator by removing 0 to N features iteratively using recursive feature elimination. Then it selects the best subset based on the accuracy or cross-validation score or roc-auc of the model. Recursive feature elimination technique eliminates n features from a model by fitting the model multiple times and at each step, removing the weakest features.\n",
"\n",
"\n",
"I will use this technique to select best features from this model."
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_selection import RFECV\n",
"\n",
"rfecv = RFECV(estimator=logreg, step=1, cv=5, scoring='accuracy')\n",
"\n",
"rfecv = rfecv.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Optimal number of features : 112\n"
]
}
],
"source": [
"print(\"Optimal number of features : %d\" % rfecv.n_features_)"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l2', random_state=0, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform the training data\n",
"\n",
"X_train_rfecv = rfecv.transform(X_train)\n",
"\n",
"\n",
"# train classifier\n",
"\n",
"logreg.fit(X_train_rfecv, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [],
"source": [
"# test classifier on test data\n",
"\n",
"X_test_rfecv = rfecv.transform(X_test)\n",
"\n",
"y_pred_rfecv = logreg.predict(X_test_rfecv)"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Classifier score: 0.8500\n"
]
}
],
"source": [
"# print mean accuracy on transformed test data and labels\n",
"\n",
"print (\"Classifier score: {:.4f}\".format(logreg.score(X_test_rfecv,y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our original model accuracy score is 0.8501 whereas accuracy score after RFECV is 0.8500. So, we can obtain approximately similar accuracy but with reduced or optimal set of features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confusion-matrix revisited\n",
"\n",
"\n",
"I will again plot the confusion-matrix for this model to get an idea of errors our model is making."
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion matrix\n",
"\n",
" [[20893 1174]\n",
" [ 3091 3281]]\n",
"\n",
"True Positives(TP1) = 20893\n",
"\n",
"True Negatives(TN1) = 3281\n",
"\n",
"False Positives(FP1) = 1174\n",
"\n",
"False Negatives(FN1) = 3091\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm1 = confusion_matrix(y_test, y_pred_rfecv)\n",
"\n",
"print('Confusion matrix\\n\\n', cm1)\n",
"\n",
"print('\\nTrue Positives(TP1) = ', cm1[0,0])\n",
"\n",
"print('\\nTrue Negatives(TN1) = ', cm1[1,1])\n",
"\n",
"print('\\nFalse Positives(FP1) = ', cm1[0,1])\n",
"\n",
"print('\\nFalse Negatives(FN1) = ', cm1[1,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that in the original model, we have FP = 1175 whereas FP1 = 1174. So, we get approximately same number of false positives. Also, FN = 3087 whereas FN1 = 3091. So, we get slightly higher false negatives."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 20. k-Fold Cross Validation"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cross-validation scores:[0.84690783 0.84624852 0.84633642 0.84958903 0.84773626]\n"
]
}
],
"source": [
"# Applying 10-Fold Cross Validation\n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"scores = cross_val_score(logreg, X_train, y_train, cv = 5, scoring='accuracy')\n",
"\n",
"print('Cross-validation scores:{}'.format(scores))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can summarize the cross-validation accuracy by calculating its mean."
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average cross-validation score: 0.8474\n"
]
}
],
"source": [
"# compute Average cross-validation score\n",
"\n",
"print('Average cross-validation score: {:.4f}'.format(scores.mean()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our, original model score is found to be 0.8476. The average cross-validation score is 0.8474. So, we can conclude that cross-validation does not result in performance improvement."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 21. Hyperparameter Optimization using GridSearch CV"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=5, error_score='raise-deprecating',\n",
" estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l2', random_state=0, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False),\n",
" fit_params=None, iid='warn', n_jobs=None,\n",
" param_grid=[{'penalty': ['l1', 'l2']}, {'C': [1, 10, 100, 1000]}],\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
" scoring='accuracy', verbose=0)"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"\n",
"parameters = [{'penalty':['l1','l2']}, \n",
" {'C':[1, 10, 100, 1000]}]\n",
"\n",
"\n",
"\n",
"grid_search = GridSearchCV(estimator = logreg, \n",
" param_grid = parameters,\n",
" scoring = 'accuracy',\n",
" cv = 5,\n",
" verbose=0)\n",
"\n",
"\n",
"grid_search.fit(X_train, y_train)\n"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GridSearch CV best score : 0.8474\n",
"\n",
"\n",
"Parameters that give the best results : \n",
"\n",
" {'penalty': 'l1'}\n",
"\n",
"\n",
"Estimator that was chosen by the search : \n",
"\n",
" LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='warn',\n",
" n_jobs=None, penalty='l1', random_state=0, solver='liblinear',\n",
" tol=0.0001, verbose=0, warm_start=False)\n"
]
}
],
"source": [
"# examine the best model\n",
"\n",
"# best score achieved during the GridSearchCV\n",
"print('GridSearch CV best score : {:.4f}\\n\\n'.format(grid_search.best_score_))\n",
"\n",
"# print parameters that give the best results\n",
"print('Parameters that give the best results :','\\n\\n', (grid_search.best_params_))\n",
"\n",
"# print estimator that was chosen by the GridSearch\n",
"print('\\n\\nEstimator that was chosen by the search :','\\n\\n', (grid_search.best_estimator_))"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GridSearch CV score on test set: 0.8507\n"
]
}
],
"source": [
"# calculate GridSearch CV score on test set\n",
"\n",
"print('GridSearch CV score on test set: {0:0.4f}'.format(grid_search.score(X_test, y_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comments\n",
"\n",
"\n",
"- Our original model test accuracy is 0.8501 while GridSearch CV accuracy is 0.8507.\n",
"\n",
"\n",
"- We can see that GridSearch CV improve the performance for this particular model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 22. Results and Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1.\tThe logistic regression model accuracy score is 0.8501. So, the model does a very good job in predicting whether or not it will rain tomorrow in Australia.\n",
"\n",
"2.\tSmall number of observations predict that there will be rain tomorrow. Majority of observations predict that there will be no rain tomorrow.\n",
"\n",
"3.\tThe model shows no signs of overfitting.\n",
"\n",
"4.\tIncreasing the value of C results in higher test set accuracy and also a slightly increased training set accuracy. So, we can conclude that a more complex model should perform better.\n",
"\n",
"5.\tIncreasing the threshold level results in increased accuracy.\n",
"\n",
"6.\tROC AUC of our model approaches towards 1. So, we can conclude that our classifier does a good job in predicting whether it will rain tomorrow or not.\n",
"\n",
"7.\tOur original model accuracy score is 0.8501 whereas accuracy score after RFECV is 0.8500. So, we can obtain approximately similar accuracy but with reduced set of features.\n",
"\n",
"8.\tIn the original model, we have FP = 1175 whereas FP1 = 1174. So, we get approximately same number of false positives. Also, FN = 3087 whereas FN1 = 3091. So, we get slighly higher false negatives.\n",
"\n",
"9.\tOur, original model score is found to be 0.8476. The average cross-validation score is 0.8474. So, we can conclude that cross-validation does not result in performance improvement.\n",
"\n",
"10.\tOur original model test accuracy is 0.8501 while GridSearch CV accuracy is 0.8507. We can see that GridSearch CV improve the performance for this particular model.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}