{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Logistic Regression with Python and Scikit-Learn\n",
"\n",
"\n",
"In this project, I implement Logistic Regression with Python and Scikit-Learn. I build a binary classification model to predict whether or not it will rain tomorrow in Australia, using the **Rain in Australia** dataset downloaded from the Kaggle website."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"The table of contents for this project is as follows:-\n",
"\n",
"\n",
"1.\tIntroduction to Logistic Regression\n",
"2.\tLogistic Regression intuition\n",
"3.\tThe problem statement\n",
"4.\tDataset description\n",
"5.\tImport libraries\n",
"6.\tImport dataset\n",
"7.\tExploratory data analysis\n",
"8.\tDeclare feature vector and target variable\n",
"9.\tSplit data into separate training and test sets\n",
"10.\tFeature engineering\n",
"11.\tFeature scaling\n",
"12.\tModel training\n",
"13.\tPredict results\n",
"14.\tCheck accuracy score\n",
"15.\tConfusion matrix\n",
"16.\tClassification metrics\n",
"17.\tAdjusting the threshold level\n",
"18.\tROC - AUC\n",
"19.\tRecursive feature elimination\n",
"20.\tk-Fold Cross Validation\n",
"21.\tHyperparameter optimization using GridSearch CV\n",
"22.\tResults and conclusion\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction to Logistic Regression\n",
"\n",
"\n",
"When data scientists come across a new classification problem, the first algorithm that may come to mind is **Logistic Regression**. It is a supervised learning algorithm used to assign observations to a discrete set of classes, so its output is discrete in nature. **Logistic Regression** is also called **Logit Regression**. It is one of the simplest, most straightforward and versatile classification algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Logistic Regression intuition\n",
"\n",
"\n",
"In statistics, the **Logistic Regression model** is a widely used statistical model, primarily used for classification purposes. Given a set of observations, the Logistic Regression algorithm helps us classify them into two or more discrete classes. So, the target variable is discrete in nature.\n",
"\n",
"\n",
"The Logistic Regression algorithm works by fitting a linear equation on the independent or explanatory variables to predict a response value. This predicted response value, denoted by z, is then converted into a probability value that lies between 0 and 1. We use the **sigmoid function** for this: it maps any real value to a probability value between 0 and 1. \n",
"\n",
"\n",
"\n",
"The sigmoid function returns a probability value between 0 and 1, which is then mapped to a discrete class, either “0” or “1”. To map this probability value to a discrete class (pass/fail, yes/no, true/false), we select a threshold value, called the **decision boundary**. Probability values above this threshold are mapped to class 1, and values below it are mapped to class 0.\n",
"\n",
"\n",
"Mathematically, it can be expressed as follows:-\n",
"\n",
"\n",
" p ≥ 0.5 => class = 1\n",
" \n",
" p < 0.5 => class = 0 \n",
"\n",
"\n",
"Generally, the decision boundary is set to 0.5. So, if the probability value is 0.8 (> 0.5), we will map this observation to class 1. Similarly, if the probability value is 0.2 (< 0.5), we will map this observation to class 0.\n",
"\n",
"\n",
"We can use our knowledge of the `sigmoid function` and `decision boundary` to write a prediction function. A prediction function in logistic regression returns the probability of the observation being positive, `Yes` or `True`. We call this `class 1` and it is denoted by `P(class = 1)`. The closer the probability is to 1, the more confident we are that the observation belongs to class 1.\n",
"\n",
"Logistic regression intuition is discussed in depth in the readme document."
]
},
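{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mapping described above can be sketched in a few lines of NumPy. This is an illustrative example only (the function and variable names here are hypothetical, not part of the model built later): the sigmoid converts a linear score z into a probability, and the 0.5 decision boundary converts that probability into a class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# sigmoid maps any real-valued score z to a probability in (0, 1)\n",
"def sigmoid(z):\n",
"    return 1 / (1 + np.exp(-z))\n",
"\n",
"# decision boundary at 0.5: p >= 0.5 => class 1, p < 0.5 => class 0\n",
"z = np.array([-2.0, 0.0, 1.5])\n",
"p = sigmoid(z)\n",
"predicted_class = (p >= 0.5).astype(int)\n",
"\n",
"print(p)\n",
"print(predicted_class)"
]
},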
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. The problem statement\n",
"\n",
"\n",
"In this project, I try to answer the question of whether or not it will rain tomorrow in Australia. I implement Logistic Regression with Python and Scikit-Learn. \n",
"\n",
"\n",
"To answer the question, I build a classifier to predict whether or not it will rain tomorrow in Australia by training a binary classification model using Logistic Regression. I have used the **Rain in Australia** dataset downloaded from the Kaggle website for this project."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Dataset description\n",
"\n",
"\n",
"I have used the **Rain in Australia** data set downloaded from the Kaggle website.\n",
"\n",
"\n",
"The data set can be found at the following url:-\n",
"\n",
"\n",
"https://www.kaggle.com/jsphyg/weather-dataset-rattle-package\n",
"\n",
"\n",
"This dataset contains daily weather observations from numerous Australian weather stations. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Import dataset"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data = 'C:/datasets/weatherAUS.csv'\n",
"\n",
"df = pd.read_csv(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Exploratory data analysis\n",
"\n",
"\n",
"Now, I will explore the data to gain insights about it."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(142193, 24)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view dimensions of dataset\n",
"\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 142193 instances and 24 variables in the data set."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Location</th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindDir9am</th>\n",
" <th>...</th>\n",
" <th>Humidity3pm</th>\n",
" <th>Pressure9am</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>RainToday</th>\n",
" <th>RISK_MM</th>\n",
" <th>RainTomorrow</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2008-12-01</td>\n",
" <td>Albury</td>\n",
" <td>13.4</td>\n",
" <td>22.9</td>\n",
" <td>0.6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>44.0</td>\n",
" <td>W</td>\n",
" <td>...</td>\n",
" <td>22.0</td>\n",
" <td>1007.7</td>\n",
" <td>1007.1</td>\n",
" <td>8.0</td>\n",
" <td>NaN</td>\n",
" <td>16.9</td>\n",
" <td>21.8</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2008-12-02</td>\n",
" <td>Albury</td>\n",
" <td>7.4</td>\n",
" <td>25.1</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WNW</td>\n",
" <td>44.0</td>\n",
" <td>NNW</td>\n",
" <td>...</td>\n",
" <td>25.0</td>\n",
" <td>1010.6</td>\n",
" <td>1007.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>17.2</td>\n",
" <td>24.3</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2008-12-03</td>\n",
" <td>Albury</td>\n",
" <td>12.9</td>\n",
" <td>25.7</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WSW</td>\n",
" <td>46.0</td>\n",
" <td>W</td>\n",
" <td>...</td>\n",
" <td>30.0</td>\n",
" <td>1007.6</td>\n",
" <td>1008.7</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>23.2</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2008-12-04</td>\n",
" <td>Albury</td>\n",
" <td>9.2</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NE</td>\n",
" <td>24.0</td>\n",
" <td>SE</td>\n",
" <td>...</td>\n",
" <td>16.0</td>\n",
" <td>1017.6</td>\n",
" <td>1012.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.1</td>\n",
" <td>26.5</td>\n",
" <td>No</td>\n",
" <td>1.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2008-12-05</td>\n",
" <td>Albury</td>\n",
" <td>17.5</td>\n",
" <td>32.3</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>41.0</td>\n",
" <td>ENE</td>\n",
" <td>...</td>\n",
" <td>33.0</td>\n",
" <td>1010.8</td>\n",
" <td>1006.0</td>\n",
" <td>7.0</td>\n",
" <td>8.0</td>\n",
" <td>17.8</td>\n",
" <td>29.7</td>\n",
" <td>No</td>\n",
" <td>0.2</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 24 columns</p>\n",
"</div>"
],
"text/plain": [
" Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine \\\n",
"0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN \n",
"1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN \n",
"2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN \n",
"3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN \n",
"4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN \n",
"\n",
" WindGustDir WindGustSpeed WindDir9am ... Humidity3pm \\\n",
"0 W 44.0 W ... 22.0 \n",
"1 WNW 44.0 NNW ... 25.0 \n",
"2 WSW 46.0 W ... 30.0 \n",
"3 NE 24.0 SE ... 16.0 \n",
"4 W 41.0 ENE ... 33.0 \n",
"\n",
" Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday \\\n",
"0 1007.7 1007.1 8.0 NaN 16.9 21.8 No \n",
"1 1010.6 1007.8 NaN NaN 17.2 24.3 No \n",
"2 1007.6 1008.7 NaN 2.0 21.0 23.2 No \n",
"3 1017.6 1012.8 NaN NaN 18.1 26.5 No \n",
"4 1010.8 1006.0 7.0 8.0 17.8 29.7 No \n",
"\n",
" RISK_MM RainTomorrow \n",
"0 0.0 No \n",
"1 0.0 No \n",
"2 0.0 No \n",
"3 1.0 No \n",
"4 0.2 No \n",
"\n",
"[5 rows x 24 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',\n",
" 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',\n",
" 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',\n",
" 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',\n",
" 'Temp3pm', 'RainToday', 'RISK_MM', 'RainTomorrow'],\n",
" dtype='object')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col_names = df.columns\n",
"\n",
"col_names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Drop RISK_MM variable\n",
"\n",
"It is given in the dataset description that we should drop the `RISK_MM` feature variable from the dataset. So, we should drop it as follows:-"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df.drop(['RISK_MM'], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 142193 entries, 0 to 142192\n",
"Data columns (total 23 columns):\n",
"Date 142193 non-null object\n",
"Location 142193 non-null object\n",
"MinTemp 141556 non-null float64\n",
"MaxTemp 141871 non-null float64\n",
"Rainfall 140787 non-null float64\n",
"Evaporation 81350 non-null float64\n",
"Sunshine 74377 non-null float64\n",
"WindGustDir 132863 non-null object\n",
"WindGustSpeed 132923 non-null float64\n",
"WindDir9am 132180 non-null object\n",
"WindDir3pm 138415 non-null object\n",
"WindSpeed9am 140845 non-null float64\n",
"WindSpeed3pm 139563 non-null float64\n",
"Humidity9am 140419 non-null float64\n",
"Humidity3pm 138583 non-null float64\n",
"Pressure9am 128179 non-null float64\n",
"Pressure3pm 128212 non-null float64\n",
"Cloud9am 88536 non-null float64\n",
"Cloud3pm 85099 non-null float64\n",
"Temp9am 141289 non-null float64\n",
"Temp3pm 139467 non-null float64\n",
"RainToday 140787 non-null object\n",
"RainTomorrow 142193 non-null object\n",
"dtypes: float64(16), object(7)\n",
"memory usage: 25.0+ MB\n"
]
}
],
"source": [
"# view summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Types of variables\n",
"\n",
"\n",
"In this section, I segregate the variables in the dataset into categorical and numerical. The dataset contains a mixture of both: categorical variables have data type object, while numerical variables have data type float64.\n",
"\n",
"\n",
"First of all, I will find categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 7 categorical variables\n",
"\n",
"The categorical variables are : ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']\n"
]
}
],
"source": [
"# find categorical variables\n",
"\n",
"categorical = [var for var in df.columns if df[var].dtype=='O']\n",
"\n",
"print('There are {} categorical variables\\n'.format(len(categorical)))\n",
"\n",
"print('The categorical variables are :', categorical)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Location</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindDir9am</th>\n",
" <th>WindDir3pm</th>\n",
" <th>RainToday</th>\n",
" <th>RainTomorrow</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2008-12-01</td>\n",
" <td>Albury</td>\n",
" <td>W</td>\n",
" <td>W</td>\n",
" <td>WNW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2008-12-02</td>\n",
" <td>Albury</td>\n",
" <td>WNW</td>\n",
" <td>NNW</td>\n",
" <td>WSW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2008-12-03</td>\n",
" <td>Albury</td>\n",
" <td>WSW</td>\n",
" <td>W</td>\n",
" <td>WSW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2008-12-04</td>\n",
" <td>Albury</td>\n",
" <td>NE</td>\n",
" <td>SE</td>\n",
" <td>E</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2008-12-05</td>\n",
" <td>Albury</td>\n",
" <td>W</td>\n",
" <td>ENE</td>\n",
" <td>NW</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Location WindGustDir WindDir9am WindDir3pm RainToday \\\n",
"0 2008-12-01 Albury W W WNW No \n",
"1 2008-12-02 Albury WNW NNW WSW No \n",
"2 2008-12-03 Albury WSW W WSW No \n",
"3 2008-12-04 Albury NE SE E No \n",
"4 2008-12-05 Albury W ENE NW No \n",
"\n",
" RainTomorrow \n",
"0 No \n",
"1 No \n",
"2 No \n",
"3 No \n",
"4 No "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# view the categorical variables\n",
"\n",
"df[categorical].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of categorical variables\n",
"\n",
"\n",
"- There is a date variable. It is denoted by the `Date` column.\n",
"\n",
"\n",
"- There are 6 categorical variables. These are given by `Location`, `WindGustDir`, `WindDir9am`, `WindDir3pm`, `RainToday` and `RainTomorrow`.\n",
"\n",
"\n",
"- There are two binary categorical variables - `RainToday` and `RainTomorrow`.\n",
"\n",
"\n",
"- `RainTomorrow` is the target variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore problems within categorical variables\n",
"\n",
"\n",
"First, I will explore the categorical variables.\n",
"\n",
"\n",
"### Missing values in categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date 0\n",
"Location 0\n",
"WindGustDir 9330\n",
"WindDir9am 10013\n",
"WindDir3pm 3778\n",
"RainToday 1406\n",
"RainTomorrow 0\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check missing values in categorical variables\n",
"\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindGustDir 9330\n",
"WindDir9am 10013\n",
"WindDir3pm 3778\n",
"RainToday 1406\n",
"dtype: int64\n"
]
}
],
"source": [
"# print categorical variables containing missing values\n",
"\n",
"cat1 = [var for var in categorical if df[var].isnull().sum()!=0]\n",
"\n",
"print(df[cat1].isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are only 4 categorical variables in the dataset which contain missing values. These are `WindGustDir`, `WindDir9am`, `WindDir3pm` and `RainToday`."
]
},
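{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common way to handle missing values in categorical variables is to impute them with the most frequent category (the mode). The snippet below is only an illustrative sketch on a small hypothetical DataFrame, not the imputation strategy of this project:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical mini-DataFrame with missing categorical values\n",
"demo = pd.DataFrame({'WindGustDir': ['W', 'W', None, 'SE'],\n",
"                     'RainToday': ['No', None, 'No', 'Yes']})\n",
"\n",
"# fill each column's missing entries with that column's mode\n",
"for col in demo.columns:\n",
"    demo[col] = demo[col].fillna(demo[col].mode()[0])\n",
"\n",
"demo.isnull().sum()"
]
},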
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Frequency counts of categorical variables\n",
"\n",
"\n",
"Now, I will check the frequency counts of categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2014-10-12 49\n",
"2017-01-15 49\n",
"2013-10-02 49\n",
"2014-07-15 49\n",
"2014-02-19 49\n",
"2016-08-21 49\n",
"2014-07-03 49\n",
"2016-10-21 49\n",
"2013-03-11 49\n",
"2017-02-08 49\n",
"2014-11-17 49\n",
"2013-04-25 49\n",
"2014-11-19 49\n",
"2014-08-30 49\n",
"2014-01-07 49\n",
"2013-04-10 49\n",
"2017-03-16 49\n",
"2013-09-04 49\n",
"2016-08-16 49\n",
"2016-10-19 49\n",
"2014-08-20 49\n",
"2017-05-12 49\n",
"2014-01-16 49\n",
"2016-07-22 49\n",
"2017-01-22 49\n",
"2013-09-25 49\n",
"2013-06-02 49\n",
"2016-07-06 49\n",
"2014-04-21 49\n",
"2013-10-16 49\n",
" ..\n",
"2007-11-23 1\n",
"2008-01-15 1\n",
"2007-12-22 1\n",
"2007-11-08 1\n",
"2007-11-29 1\n",
"2008-01-29 1\n",
"2008-01-06 1\n",
"2007-11-02 1\n",
"2007-12-25 1\n",
"2008-01-28 1\n",
"2007-12-08 1\n",
"2007-11-09 1\n",
"2008-01-05 1\n",
"2007-11-26 1\n",
"2007-11-10 1\n",
"2007-11-20 1\n",
"2008-01-14 1\n",
"2007-12-03 1\n",
"2008-01-12 1\n",
"2007-11-03 1\n",
"2007-12-02 1\n",
"2008-01-31 1\n",
"2007-12-01 1\n",
"2007-11-06 1\n",
"2007-11-27 1\n",
"2007-12-19 1\n",
"2007-11-19 1\n",
"2007-12-30 1\n",
"2007-12-23 1\n",
"2008-01-09 1\n",
"Name: Date, Length: 3436, dtype: int64\n",
"Canberra 3418\n",
"Sydney 3337\n",
"Perth 3193\n",
"Darwin 3192\n",
"Hobart 3188\n",
"Brisbane 3161\n",
"Adelaide 3090\n",
"Bendigo 3034\n",
"Townsville 3033\n",
"AliceSprings 3031\n",
"MountGambier 3030\n",
"Launceston 3028\n",
"Ballarat 3028\n",
"Albany 3016\n",
"Albury 3011\n",
"PerthAirport 3009\n",
"MelbourneAirport 3009\n",
"Mildura 3007\n",
"SydneyAirport 3005\n",
"Nuriootpa 3002\n",
"Sale 3000\n",
"Watsonia 2999\n",
"Tuggeranong 2998\n",
"Portland 2996\n",
"Woomera 2990\n",
"Cairns 2988\n",
"Cobar 2988\n",
"Wollongong 2983\n",
"GoldCoast 2980\n",
"WaggaWagga 2976\n",
"NorfolkIsland 2964\n",
"Penrith 2964\n",
"Newcastle 2955\n",
"SalmonGums 2955\n",
"CoffsHarbour 2953\n",
"Witchcliffe 2952\n",
"Richmond 2951\n",
"Dartmoor 2943\n",
"NorahHead 2929\n",
"BadgerysCreek 2928\n",
"MountGinini 2907\n",
"Moree 2854\n",
"Walpole 2819\n",
"PearceRAAF 2762\n",
"Williamtown 2553\n",
"Melbourne 2435\n",
"Nhil 1569\n",
"Katherine 1559\n",
"Uluru 1521\n",
"Name: Location, dtype: int64\n",
"W 9780\n",
"SE 9309\n",
"E 9071\n",
"N 9033\n",
"SSE 8993\n",
"S 8949\n",
"WSW 8901\n",
"SW 8797\n",
"SSW 8610\n",
"WNW 8066\n",
"NW 8003\n",
"ENE 7992\n",
"ESE 7305\n",
"NE 7060\n",
"NNW 6561\n",
"NNE 6433\n",
"Name: WindGustDir, dtype: int64\n",
"N 11393\n",
"SE 9162\n",
"E 9024\n",
"SSE 8966\n",
"NW 8552\n",
"S 8493\n",
"W 8260\n",
"SW 8237\n",
"NNE 7948\n",
"NNW 7840\n",
"ENE 7735\n",
"ESE 7558\n",
"NE 7527\n",
"SSW 7448\n",
"WNW 7194\n",
"WSW 6843\n",
"Name: WindDir9am, dtype: int64\n",
"SE 10663\n",
"W 9911\n",
"S 9598\n",
"WSW 9329\n",
"SW 9182\n",
"SSE 9142\n",
"N 8667\n",
"WNW 8656\n",
"NW 8468\n",
"ESE 8382\n",
"E 8342\n",
"NE 8164\n",
"SSW 8010\n",
"NNW 7733\n",
"ENE 7724\n",
"NNE 6444\n",
"Name: WindDir3pm, dtype: int64\n",
"No 109332\n",
"Yes 31455\n",
"Name: RainToday, dtype: int64\n",
"No 110316\n",
"Yes 31877\n",
"Name: RainTomorrow, dtype: int64\n"
]
}
],
"source": [
"# view frequency of categorical variables\n",
"\n",
"for var in categorical: \n",
" \n",
" print(df[var].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2014-10-12 0.000345\n",
"2017-01-15 0.000345\n",
"2013-10-02 0.000345\n",
"2014-07-15 0.000345\n",
"2014-02-19 0.000345\n",
"2016-08-21 0.000345\n",
"2014-07-03 0.000345\n",
"2016-10-21 0.000345\n",
"2013-03-11 0.000345\n",
"2017-02-08 0.000345\n",
"2014-11-17 0.000345\n",
"2013-04-25 0.000345\n",
"2014-11-19 0.000345\n",
"2014-08-30 0.000345\n",
"2014-01-07 0.000345\n",
"2013-04-10 0.000345\n",
"2017-03-16 0.000345\n",
"2013-09-04 0.000345\n",
"2016-08-16 0.000345\n",
"2016-10-19 0.000345\n",
"2014-08-20 0.000345\n",
"2017-05-12 0.000345\n",
"2014-01-16 0.000345\n",
"2016-07-22 0.000345\n",
"2017-01-22 0.000345\n",
"2013-09-25 0.000345\n",
"2013-06-02 0.000345\n",
"2016-07-06 0.000345\n",
"2014-04-21 0.000345\n",
"2013-10-16 0.000345\n",
" ... \n",
"2007-11-23 0.000007\n",
"2008-01-15 0.000007\n",
"2007-12-22 0.000007\n",
"2007-11-08 0.000007\n",
"2007-11-29 0.000007\n",
"2008-01-29 0.000007\n",
"2008-01-06 0.000007\n",
"2007-11-02 0.000007\n",
"2007-12-25 0.000007\n",
"2008-01-28 0.000007\n",
"2007-12-08 0.000007\n",
"2007-11-09 0.000007\n",
"2008-01-05 0.000007\n",
"2007-11-26 0.000007\n",
"2007-11-10 0.000007\n",
"2007-11-20 0.000007\n",
"2008-01-14 0.000007\n",
"2007-12-03 0.000007\n",
"2008-01-12 0.000007\n",
"2007-11-03 0.000007\n",
"2007-12-02 0.000007\n",
"2008-01-31 0.000007\n",
"2007-12-01 0.000007\n",
"2007-11-06 0.000007\n",
"2007-11-27 0.000007\n",
"2007-12-19 0.000007\n",
"2007-11-19 0.000007\n",
"2007-12-30 0.000007\n",
"2007-12-23 0.000007\n",
"2008-01-09 0.000007\n",
"Name: Date, Length: 3436, dtype: float64\n",
"Canberra 0.024038\n",
"Sydney 0.023468\n",
"Perth 0.022455\n",
"Darwin 0.022448\n",
"Hobart 0.022420\n",
"Brisbane 0.022230\n",
"Adelaide 0.021731\n",
"Bendigo 0.021337\n",
"Townsville 0.021330\n",
"AliceSprings 0.021316\n",
"MountGambier 0.021309\n",
"Launceston 0.021295\n",
"Ballarat 0.021295\n",
"Albany 0.021211\n",
"Albury 0.021175\n",
"PerthAirport 0.021161\n",
"MelbourneAirport 0.021161\n",
"Mildura 0.021147\n",
"SydneyAirport 0.021133\n",
"Nuriootpa 0.021112\n",
"Sale 0.021098\n",
"Watsonia 0.021091\n",
"Tuggeranong 0.021084\n",
"Portland 0.021070\n",
"Woomera 0.021028\n",
"Cairns 0.021014\n",
"Cobar 0.021014\n",
"Wollongong 0.020979\n",
"GoldCoast 0.020957\n",
"WaggaWagga 0.020929\n",
"NorfolkIsland 0.020845\n",
"Penrith 0.020845\n",
"Newcastle 0.020782\n",
"SalmonGums 0.020782\n",
"CoffsHarbour 0.020768\n",
"Witchcliffe 0.020761\n",
"Richmond 0.020753\n",
"Dartmoor 0.020697\n",
"NorahHead 0.020599\n",
"BadgerysCreek 0.020592\n",
"MountGinini 0.020444\n",
"Moree 0.020071\n",
"Walpole 0.019825\n",
"PearceRAAF 0.019424\n",
"Williamtown 0.017954\n",
"Melbourne 0.017125\n",
"Nhil 0.011034\n",
"Katherine 0.010964\n",
"Uluru 0.010697\n",
"Name: Location, dtype: float64\n",
"W 0.068780\n",
"SE 0.065467\n",
"E 0.063794\n",
"N 0.063526\n",
"SSE 0.063245\n",
"S 0.062936\n",
"WSW 0.062598\n",
"SW 0.061867\n",
"SSW 0.060552\n",
"WNW 0.056726\n",
"NW 0.056283\n",
"ENE 0.056205\n",
"ESE 0.051374\n",
"NE 0.049651\n",
"NNW 0.046142\n",
"NNE 0.045241\n",
"Name: WindGustDir, dtype: float64\n",
"N 0.080123\n",
"SE 0.064434\n",
"E 0.063463\n",
"SSE 0.063055\n",
"NW 0.060144\n",
"S 0.059729\n",
"W 0.058090\n",
"SW 0.057928\n",
"NNE 0.055896\n",
"NNW 0.055136\n",
"ENE 0.054398\n",
"ESE 0.053153\n",
"NE 0.052935\n",
"SSW 0.052380\n",
"WNW 0.050593\n",
"WSW 0.048125\n",
"Name: WindDir9am, dtype: float64\n",
"SE 0.074990\n",
"W 0.069701\n",
"S 0.067500\n",
"WSW 0.065608\n",
"SW 0.064574\n",
"SSE 0.064293\n",
"N 0.060952\n",
"WNW 0.060875\n",
"NW 0.059553\n",
"ESE 0.058948\n",
"E 0.058667\n",
"NE 0.057415\n",
"SSW 0.056332\n",
"NNW 0.054384\n",
"ENE 0.054321\n",
"NNE 0.045319\n",
"Name: WindDir3pm, dtype: float64\n",
"No 0.768899\n",
"Yes 0.221213\n",
"Name: RainToday, dtype: float64\n",
"No 0.775819\n",
"Yes 0.224181\n",
"Name: RainTomorrow, dtype: float64\n"
]
}
],
"source": [
"# view frequency distribution of categorical variables\n",
"\n",
"for var in categorical: \n",
" \n",
" print(df[var].value_counts()/len(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Number of labels: cardinality\n",
"\n",
"\n",
"The number of labels within a categorical variable is known as **cardinality**. A high number of labels within a variable is known as **high cardinality**. High cardinality may pose some serious problems in the machine learning model. So, I will check for high cardinality."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date contains 3436 labels\n",
"Location contains 49 labels\n",
"WindGustDir contains 17 labels\n",
"WindDir9am contains 17 labels\n",
"WindDir3pm contains 17 labels\n",
"RainToday contains 3 labels\n",
"RainTomorrow contains 2 labels\n"
]
}
],
"source": [
"# check for cardinality in categorical variables\n",
"\n",
"for var in categorical:\n",
" \n",
" print(var, ' contains ', len(df[var].unique()), ' labels')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there is a `Date` variable which needs to be preprocessed. I will do preprocessing in the following section.\n",
"\n",
"\n",
"All the other variables contain a relatively small number of labels."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Engineering of Date Variable"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('O')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Date'].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the data type of the `Date` variable is object. I will parse the dates, currently coded as strings, into datetime format."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# parse the dates, currently coded as strings, into datetime format\n",
"\n",
"df['Date'] = pd.to_datetime(df['Date'])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2008\n",
"1 2008\n",
"2 2008\n",
"3 2008\n",
"4 2008\n",
"Name: Year, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# extract year from date\n",
"\n",
"df['Year'] = df['Date'].dt.year\n",
"\n",
"df['Year'].head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 12\n",
"1 12\n",
"2 12\n",
"3 12\n",
"4 12\n",
"Name: Month, dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# extract month from date\n",
"\n",
"df['Month'] = df['Date'].dt.month\n",
"\n",
"df['Month'].head()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 3\n",
"3 4\n",
"4 5\n",
"Name: Day, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# extract day from date\n",
"\n",
"df['Day'] = df['Date'].dt.day\n",
"\n",
"df['Day'].head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 142193 entries, 0 to 142192\n",
"Data columns (total 26 columns):\n",
"Date 142193 non-null datetime64[ns]\n",
"Location 142193 non-null object\n",
"MinTemp 141556 non-null float64\n",
"MaxTemp 141871 non-null float64\n",
"Rainfall 140787 non-null float64\n",
"Evaporation 81350 non-null float64\n",
"Sunshine 74377 non-null float64\n",
"WindGustDir 132863 non-null object\n",
"WindGustSpeed 132923 non-null float64\n",
"WindDir9am 132180 non-null object\n",
"WindDir3pm 138415 non-null object\n",
"WindSpeed9am 140845 non-null float64\n",
"WindSpeed3pm 139563 non-null float64\n",
"Humidity9am 140419 non-null float64\n",
"Humidity3pm 138583 non-null float64\n",
"Pressure9am 128179 non-null float64\n",
"Pressure3pm 128212 non-null float64\n",
"Cloud9am 88536 non-null float64\n",
"Cloud3pm 85099 non-null float64\n",
"Temp9am 141289 non-null float64\n",
"Temp3pm 139467 non-null float64\n",
"RainToday 140787 non-null object\n",
"RainTomorrow 142193 non-null object\n",
"Year 142193 non-null int64\n",
"Month 142193 non-null int64\n",
"Day 142193 non-null int64\n",
"dtypes: datetime64[ns](1), float64(16), int64(3), object(6)\n",
"memory usage: 28.2+ MB\n"
]
}
],
"source": [
"# again view the summary of dataset\n",
"\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are three additional columns created from `Date` variable. Now, I will drop the original `Date` variable from the dataset."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# drop the original Date variable\n",
"\n",
"df.drop('Date', axis=1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Location</th>\n",
" <th>MinTemp</th>\n",
" <th>MaxTemp</th>\n",
" <th>Rainfall</th>\n",
" <th>Evaporation</th>\n",
" <th>Sunshine</th>\n",
" <th>WindGustDir</th>\n",
" <th>WindGustSpeed</th>\n",
" <th>WindDir9am</th>\n",
" <th>WindDir3pm</th>\n",
" <th>...</th>\n",
" <th>Pressure3pm</th>\n",
" <th>Cloud9am</th>\n",
" <th>Cloud3pm</th>\n",
" <th>Temp9am</th>\n",
" <th>Temp3pm</th>\n",
" <th>RainToday</th>\n",
" <th>RainTomorrow</th>\n",
" <th>Year</th>\n",
" <th>Month</th>\n",
" <th>Day</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Albury</td>\n",
" <td>13.4</td>\n",
" <td>22.9</td>\n",
" <td>0.6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>44.0</td>\n",
" <td>W</td>\n",
" <td>WNW</td>\n",
" <td>...</td>\n",
" <td>1007.1</td>\n",
" <td>8.0</td>\n",
" <td>NaN</td>\n",
" <td>16.9</td>\n",
" <td>21.8</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Albury</td>\n",
" <td>7.4</td>\n",
" <td>25.1</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WNW</td>\n",
" <td>44.0</td>\n",
" <td>NNW</td>\n",
" <td>WSW</td>\n",
" <td>...</td>\n",
" <td>1007.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>17.2</td>\n",
" <td>24.3</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Albury</td>\n",
" <td>12.9</td>\n",
" <td>25.7</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>WSW</td>\n",
" <td>46.0</td>\n",
" <td>W</td>\n",
" <td>WSW</td>\n",
" <td>...</td>\n",
" <td>1008.7</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>23.2</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Albury</td>\n",
" <td>9.2</td>\n",
" <td>28.0</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NE</td>\n",
" <td>24.0</td>\n",
" <td>SE</td>\n",
" <td>E</td>\n",
" <td>...</td>\n",
" <td>1012.8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.1</td>\n",
" <td>26.5</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Albury</td>\n",
" <td>17.5</td>\n",
" <td>32.3</td>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>W</td>\n",
" <td>41.0</td>\n",
" <td>ENE</td>\n",
" <td>NW</td>\n",
" <td>...</td>\n",
" <td>1006.0</td>\n",
" <td>7.0</td>\n",
" <td>8.0</td>\n",
" <td>17.8</td>\n",
" <td>29.7</td>\n",
" <td>No</td>\n",
" <td>No</td>\n",
" <td>2008</td>\n",
" <td>12</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 25 columns</p>\n",
"</div>"
],
"text/plain": [
" Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir \\\n",
"0 Albury 13.4 22.9 0.6 NaN NaN W \n",
"1 Albury 7.4 25.1 0.0 NaN NaN WNW \n",
"2 Albury 12.9 25.7 0.0 NaN NaN WSW \n",
"3 Albury 9.2 28.0 0.0 NaN NaN NE \n",
"4 Albury 17.5 32.3 1.0 NaN NaN W \n",
"\n",
" WindGustSpeed WindDir9am WindDir3pm ... Pressure3pm Cloud9am Cloud3pm \\\n",
"0 44.0 W WNW ... 1007.1 8.0 NaN \n",
"1 44.0 NNW WSW ... 1007.8 NaN NaN \n",
"2 46.0 W WSW ... 1008.7 NaN 2.0 \n",
"3 24.0 SE E ... 1012.8 NaN NaN \n",
"4 41.0 ENE NW ... 1006.0 7.0 8.0 \n",
"\n",
" Temp9am Temp3pm RainToday RainTomorrow Year Month Day \n",
"0 16.9 21.8 No No 2008 12 1 \n",
"1 17.2 24.3 No No 2008 12 2 \n",
"2 21.0 23.2 No No 2008 12 3 \n",
"3 18.1 26.5 No No 2008 12 4 \n",
"4 17.8 29.7 No No 2008 12 5 \n",
"\n",
"[5 rows x 25 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# preview the dataset again\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can see that the `Date` variable has been removed from the dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore Categorical Variables\n",
"\n",
"\n",
"Now, I will explore the categorical variables one by one. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 6 categorical variables\n",
"\n",
"The categorical variables are : ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']\n"
]
}
],
"source": [
"# find categorical variables\n",
"\n",
"categorical = [var for var in df.columns if df[var].dtype=='O']\n",
"\n",
"print('There are {} categorical variables\\n'.format(len(categorical)))\n",
"\n",
"print('The categorical variables are :', categorical)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 6 categorical variables in the dataset. The `Date` variable has been removed. First, I will check for missing values in the categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Location 0\n",
"WindGustDir 9330\n",
"WindDir9am 10013\n",
"WindDir3pm 3778\n",
"RainToday 1406\n",
"RainTomorrow 0\n",
"dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check for missing values in categorical variables \n",
"\n",
"df[categorical].isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `WindGustDir`, `WindDir9am`, `WindDir3pm` and `RainToday` variables contain missing values. I will explore these variables one by one."
]
},
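{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch (my addition, not part of the original analysis), the missing counts can also be expressed as percentages of all rows, which is often easier to interpret. The toy frame below only illustrates the pattern:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# toy data standing in for the real categorical columns\n",
"toy = pd.DataFrame({'WindGustDir': ['W', np.nan, 'SE', np.nan],\n",
"                    'RainToday':   ['No', 'Yes', np.nan, 'No']})\n",
"\n",
"# fraction of missing values per column, shown as a percentage\n",
"toy.isnull().mean().mul(100).round(2)\n",
"```\n",
"\n",
"On the full dataset, the equivalent call would be `df[categorical].isnull().mean().mul(100).round(2)`."
]
},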
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `Location` variable"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Location contains 49 labels\n"
]
}
],
"source": [
"# print number of labels in Location variable\n",
"\n",
"print('Location contains', len(df.Location.unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Albury', 'BadgerysCreek', 'Cobar', 'CoffsHarbour', 'Moree',\n",
" 'Newcastle', 'NorahHead', 'NorfolkIsland', 'Penrith', 'Richmond',\n",
" 'Sydney', 'SydneyAirport', 'WaggaWagga', 'Williamtown',\n",
" 'Wollongong', 'Canberra', 'Tuggeranong', 'MountGinini', 'Ballarat',\n",
" 'Bendigo', 'Sale', 'MelbourneAirport', 'Melbourne', 'Mildura',\n",
" 'Nhil', 'Portland', 'Watsonia', 'Dartmoor', 'Brisbane', 'Cairns',\n",
" 'GoldCoast', 'Townsville', 'Adelaide', 'MountGambier', 'Nuriootpa',\n",
" 'Woomera', 'Albany', 'Witchcliffe', 'PearceRAAF', 'PerthAirport',\n",
" 'Perth', 'SalmonGums', 'Walpole', 'Hobart', 'Launceston',\n",
" 'AliceSprings', 'Darwin', 'Katherine', 'Uluru'], dtype=object)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in location variable\n",
"\n",
"df.Location.unique()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Canberra 3418\n",
"Sydney 3337\n",
"Perth 3193\n",
"Darwin 3192\n",
"Hobart 3188\n",
"Brisbane 3161\n",
"Adelaide 3090\n",
"Bendigo 3034\n",
"Townsville 3033\n",
"AliceSprings 3031\n",
"MountGambier 3030\n",
"Launceston 3028\n",
"Ballarat 3028\n",
"Albany 3016\n",
"Albury 3011\n",
"PerthAirport 3009\n",
"MelbourneAirport 3009\n",
"Mildura 3007\n",
"SydneyAirport 3005\n",
"Nuriootpa 3002\n",
"Sale 3000\n",
"Watsonia 2999\n",
"Tuggeranong 2998\n",
"Portland 2996\n",
"Woomera 2990\n",
"Cairns 2988\n",
"Cobar 2988\n",
"Wollongong 2983\n",
"GoldCoast 2980\n",
"WaggaWagga 2976\n",
"NorfolkIsland 2964\n",
"Penrith 2964\n",
"Newcastle 2955\n",
"SalmonGums 2955\n",
"CoffsHarbour 2953\n",
"Witchcliffe 2952\n",
"Richmond 2951\n",
"Dartmoor 2943\n",
"NorahHead 2929\n",
"BadgerysCreek 2928\n",
"MountGinini 2907\n",
"Moree 2854\n",
"Walpole 2819\n",
"PearceRAAF 2762\n",
"Williamtown 2553\n",
"Melbourne 2435\n",
"Nhil 1569\n",
"Katherine 1559\n",
"Uluru 1521\n",
"Name: Location, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in Location variable\n",
"\n",
"df.Location.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Albany</th>\n",
" <th>Albury</th>\n",
" <th>AliceSprings</th>\n",
" <th>BadgerysCreek</th>\n",
" <th>Ballarat</th>\n",
" <th>Bendigo</th>\n",
" <th>Brisbane</th>\n",
" <th>Cairns</th>\n",
" <th>Canberra</th>\n",
" <th>Cobar</th>\n",
" <th>...</th>\n",
" <th>Townsville</th>\n",
" <th>Tuggeranong</th>\n",
" <th>Uluru</th>\n",
" <th>WaggaWagga</th>\n",
" <th>Walpole</th>\n",
" <th>Watsonia</th>\n",
" <th>Williamtown</th>\n",
" <th>Witchcliffe</th>\n",
" <th>Wollongong</th>\n",
" <th>Woomera</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 48 columns</p>\n",
"</div>"
],
"text/plain": [
" Albany Albury AliceSprings BadgerysCreek Ballarat Bendigo Brisbane \\\n",
"0 0 1 0 0 0 0 0 \n",
"1 0 1 0 0 0 0 0 \n",
"2 0 1 0 0 0 0 0 \n",
"3 0 1 0 0 0 0 0 \n",
"4 0 1 0 0 0 0 0 \n",
"\n",
" Cairns Canberra Cobar ... Townsville Tuggeranong Uluru \\\n",
"0 0 0 0 ... 0 0 0 \n",
"1 0 0 0 ... 0 0 0 \n",
"2 0 0 0 ... 0 0 0 \n",
"3 0 0 0 ... 0 0 0 \n",
"4 0 0 0 ... 0 0 0 \n",
"\n",
" WaggaWagga Walpole Watsonia Williamtown Witchcliffe Wollongong \\\n",
"0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 \n",
"\n",
" Woomera \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
"\n",
"[5 rows x 48 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of Location variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.Location, drop_first=True).head()"
]
},
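{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief note on `drop_first=True`: for a variable with k categories, it returns k-1 dummy columns, because membership in the dropped category is implied when all the remaining dummies are 0. Keeping all k dummies would introduce perfect multicollinearity (the dummy variable trap) in linear models such as Logistic Regression. A hypothetical mini-example:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series(['a', 'b', 'c', 'a'])\n",
"\n",
"# 3 categories -> 2 dummy columns ('b' and 'c'); 'a' is the implied baseline\n",
"pd.get_dummies(s, drop_first=True)\n",
"```\n"
]
},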
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `WindGustDir` variable"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindGustDir contains 17 labels\n"
]
}
],
"source": [
"# print number of labels in WindGustDir variable\n",
"\n",
"print('WindGustDir contains', len(df['WindGustDir'].unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['W', 'WNW', 'WSW', 'NE', 'NNW', 'N', 'NNE', 'SW', 'ENE', 'SSE',\n",
" 'S', 'NW', 'SE', 'ESE', nan, 'E', 'SSW'], dtype=object)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in WindGustDir variable\n",
"\n",
"df['WindGustDir'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"W 9780\n",
"SE 9309\n",
"E 9071\n",
"N 9033\n",
"SSE 8993\n",
"S 8949\n",
"WSW 8901\n",
"SW 8797\n",
"SSW 8610\n",
"WNW 8066\n",
"NW 8003\n",
"ENE 7992\n",
"ESE 7305\n",
"NE 7060\n",
"NNW 6561\n",
"NNE 6433\n",
"Name: WindGustDir, dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in WindGustDir variable\n",
"\n",
"df.WindGustDir.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ENE</th>\n",
" <th>ESE</th>\n",
" <th>N</th>\n",
" <th>NE</th>\n",
" <th>NNE</th>\n",
" <th>NNW</th>\n",
" <th>NW</th>\n",
" <th>S</th>\n",
" <th>SE</th>\n",
" <th>SSE</th>\n",
" <th>SSW</th>\n",
" <th>SW</th>\n",
" <th>W</th>\n",
" <th>WNW</th>\n",
" <th>WSW</th>\n",
" <th>nan</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ENE ESE N NE NNE NNW NW S SE SSE SSW SW W WNW WSW NaN\n",
"0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0\n",
"1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0\n",
"2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0\n",
"3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0\n",
"4 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# let's do One Hot Encoding of WindGustDir variable\n",
"# get k-1 dummy variables after One Hot Encoding \n",
"# also add an additional dummy variable to indicate there was missing data\n",
"# preview the dataset with head() method\n",
"\n",
"pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True).head()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ENE 7992\n",
"ESE 7305\n",
"N 9033\n",
"NE 7060\n",
"NNE 6433\n",
"NNW 6561\n",
"NW 8003\n",
"S 8949\n",
"SE 9309\n",
"SSE 8993\n",
"SSW 8610\n",
"SW 8797\n",
"W 9780\n",
"WNW 8066\n",
"WSW 8901\n",
"NaN 9330\n",
"dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# sum the number of 1s per boolean variable over the rows of the dataset\n",
"# it will tell us how many observations we have for each category\n",
"\n",
"pd.get_dummies(df.WindGustDir, drop_first=True, dummy_na=True).sum(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that there are 9330 missing values in the `WindGustDir` variable, which matches the count returned by `isnull().sum()` earlier."
]
},
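{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a sketch of mine, using toy data rather than the real dataset), the count under the `NaN` dummy should always equal the raw number of missing values in the column, because `dummy_na=True` adds an indicator column that is 1 exactly where the value is missing:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"s = pd.Series(['W', np.nan, 'SE', np.nan, 'W'])\n",
"dummies = pd.get_dummies(s, dummy_na=True)\n",
"\n",
"# the last column is the NaN indicator; its sum equals the raw missing count\n",
"dummies.iloc[:, -1].sum() == s.isnull().sum()\n",
"```\n",
"\n",
"On the full dataset, `df.WindGustDir.isnull().sum()` should likewise return 9330."
]
},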
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Explore `WindDir9am` variable"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WindDir9am contains 17 labels\n"
]
}
],
"source": [
"# print number of labels in WindDir9am variable\n",
"\n",
"print('WindDir9am contains', len(df['WindDir9am'].unique()), 'labels')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['W', 'NNW', 'SE', 'ENE', 'SW', 'SSE', 'S', 'NE', nan, 'SSW', 'N',\n",
" 'WSW', 'ESE', 'E', 'NW', 'WNW', 'NNE'], dtype=object)"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check labels in WindDir9am variable\n",
"\n",
"df['WindDir9am'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"N 11393\n",
"SE 9162\n",
"E 9024\n",
"SSE 8966\n",
"NW 8552\n",
"S 8493\n",
"W 8260\n",
"SW 8237\n",
"NNE 7948\n",
"NNW 7840\n",
"ENE 7735\n",
"ESE 7558\n",
"NE 7527\n",
"SSW 7448\n",
"WNW 7194\n",
"WSW 6843\n",
"Name: WindDir9am, dtype: int64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check frequency distribution of values in WindDir9am variable\n",
"\n",
"df['WindDir9am'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ENE</th>\n",
" <th>ESE</th>\n",
" <th>N</th>\n",
" <th>NE</th>\n",
" <th>NNE</th>\n",