{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Preprocessing Project - Dealing with Missing Numerical Values\n",
"\n",
"\n",
"In this project, I discuss various data preprocessing techniques for handling missing numerical values. The project is organized into the sections listed in the table of contents below:\n",
"\n",
"\n",
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction\n",
"\n",
"2.\tSource dataset\n",
"\n",
"3.\tDealing with missing numerical values\n",
"\n",
"4.\tDrop missing values with dropna()\n",
"\n",
"5.\tFill missing values with a test statistic\n",
"\n",
"6.\tFill missing values with Imputer\n",
"\n",
"7.\tBuild a prediction model\n",
"\n",
"8.\tKNN Imputation\n",
"\n",
"9.\tCheck with ASSERT statement\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction\n",
"\n",
"\n",
"Over the last few decades, Machine Learning (ML) has gained immense popularity in solving real world business problems. It has emerged as a technology tool for companies looking to boost productivity and profit. ML practitioners source real world data and write algorithms to solve business problems. The success of ML algorithms depends on the quality of the data: it must be free from errors and discrepancies, and it must adhere to a standard that ML algorithms can accept. But this does not happen in reality.\n",
"\n",
"In reality, data has its own limitations. It is dirty: incomplete, noisy and inconsistent. Incomplete data has missing values or lacks certain attributes. Noisy data contains errors and outliers and hence does not produce the desired results. Inconsistent data contains discrepancies or duplicate records.\n",
"\n",
"So, ML practitioners must transform raw data into standardized data suitable for ML algorithms. This involves cleaning, transforming and standardizing the data to remove its inadequacies and irregularities. These actions are collectively known as **Data Preprocessing**. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Source dataset\n",
"\n",
"\n",
"I have used wiki4HE.csv data set for this project. I have downloaded this data set from the UCI Machine Learning Repository. The data set describes survey results of faculty members from two Spanish universities on teaching uses of Wikipedia. \n",
"\n",
"\n",
"The dataset contains 53 attributes and 913 instances. Out of the 53 attributes, 4 are of numeric data types and 49 are of text or character data types. \n",
"\n",
"\n",
"The data set can be found at the following URL:\n",
"\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/wiki4HE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Dealing with missing numerical values\n",
"\n",
"\n",
"It is very common, when working with real world data, for a data scientist to come across missing values. These could result from an error prone data entry process, flawed data collection methods, values that are not applicable, fields left blank in a survey, or respondents declining to answer. Whatever the reason, the data scientist must handle these missing values carefully, because simply ignoring them gives wrong results. The key decision is whether to delete the missing values or replace them with a suitable statistic. The first step in dealing with missing values properly is to identify them. \n",
"\n",
"\n",
"The initial inspection of the data helps us detect whether there are missing values in the data set. This can be done through Exploratory Data Analysis (EDA). So, it is important that a data scientist performs EDA to identify missing values correctly.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"\n",
"import numpy as np\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import the dataset\n",
"\n",
"dataset = \"C:/project_datasets/wiki4HE.csv\"\n",
"\n",
"df = pd.read_csv(dataset, sep = ';')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### Exploratory Data Analysis (EDA)\n",
"\n",
"\n",
"Below is the list of commands to identify missing values with EDA.\n",
"\n",
"\n",
"1.\t`df.head()`\n",
"\n",
"This will output the first five rows of the dataset. It gives us a quick view of the presence of ‘NaN’, ‘?’, ‘-1’, ‘0’ or blank spaces “” in the dataset. If required, we can view more rows by specifying the number of rows inside the parentheses. \n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" AGE GENDER DOMAIN PhD YEARSEXP UNIVERSITY UOC_POSITION OTHER_POSITION \\\n",
"0 40 0 2 1 14 1 2 ? \n",
"1 42 0 5 1 18 1 2 ? \n",
"2 37 0 4 1 13 1 3 ? \n",
"3 40 0 4 0 13 1 3 ? \n",
"4 51 0 6 0 8 1 3 ? \n",
"\n",
" OTHERSTATUS USERWIKI ... BI2 Inc1 Inc2 Inc3 Inc4 Exp1 Exp2 Exp3 Exp4 Exp5 \n",
"0 ? 0 ... 3 5 5 5 5 4 4 4 1 2 \n",
"1 ? 0 ... 2 4 4 3 4 2 2 4 2 4 \n",
"2 ? 0 ... 1 5 3 5 5 2 2 2 1 3 \n",
"3 ? 0 ... 3 3 4 4 3 4 4 3 3 4 \n",
"4 ? 1 ... 5 5 5 4 4 5 5 5 4 4 \n",
"\n",
"[5 rows x 53 columns]\n"
]
}
],
"source": [
"# View the first 5 rows of the dataset\n",
"\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that there are lots of missing values in the dataset. The columns **OTHER_POSITION** and **OTHERSTATUS** contain missing values.\n",
"\n",
"The column **GENDER** contains zeros. This might be because **Male** is encoded as 1 and **Female** is encoded as 0.\n",
"\n",
"We need to explore the dataset further to confirm which columns contain the missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. `df.info()`\n",
"\n",
"This command is quite useful in detecting missing values in the dataset. It tells us the total number of entries and the number of non-null observations in each column. When the number of non-null observations in a column is less than the total number of entries, we know that column contains missing values.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 53 columns):\n",
"AGE 913 non-null int64\n",
"GENDER 913 non-null int64\n",
"DOMAIN 913 non-null object\n",
"PhD 913 non-null int64\n",
"YEARSEXP 913 non-null object\n",
"UNIVERSITY 913 non-null int64\n",
"UOC_POSITION 913 non-null object\n",
"OTHER_POSITION 913 non-null object\n",
"OTHERSTATUS 913 non-null object\n",
"USERWIKI 913 non-null object\n",
"PU1 913 non-null object\n",
"PU2 913 non-null object\n",
"PU3 913 non-null object\n",
"PEU1 913 non-null object\n",
"PEU2 913 non-null object\n",
"PEU3 913 non-null object\n",
"ENJ1 913 non-null object\n",
"ENJ2 913 non-null object\n",
"Qu1 913 non-null object\n",
"Qu2 913 non-null object\n",
"Qu3 913 non-null object\n",
"Qu4 913 non-null object\n",
"Qu5 913 non-null object\n",
"Vis1 913 non-null object\n",
"Vis2 913 non-null object\n",
"Vis3 913 non-null object\n",
"Im1 913 non-null object\n",
"Im2 913 non-null object\n",
"Im3 913 non-null object\n",
"SA1 913 non-null object\n",
"SA2 913 non-null object\n",
"SA3 913 non-null object\n",
"Use1 913 non-null object\n",
"Use2 913 non-null object\n",
"Use3 913 non-null object\n",
"Use4 913 non-null object\n",
"Use5 913 non-null object\n",
"Pf1 913 non-null object\n",
"Pf2 913 non-null object\n",
"Pf3 913 non-null object\n",
"JR1 913 non-null object\n",
"JR2 913 non-null object\n",
"BI1 913 non-null object\n",
"BI2 913 non-null object\n",
"Inc1 913 non-null object\n",
"Inc2 913 non-null object\n",
"Inc3 913 non-null object\n",
"Inc4 913 non-null object\n",
"Exp1 913 non-null object\n",
"Exp2 913 non-null object\n",
"Exp3 913 non-null object\n",
"Exp4 913 non-null object\n",
"Exp5 913 non-null object\n",
"dtypes: int64(4), object(49)\n",
"memory usage: 378.1+ KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the dataframe\n",
"\n",
"print(df.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"The above command shows that there are no missing values in the data set. But this is not true. The dataset does contain missing values. This may be because the missing values are encoded in ways that pandas does not recognize. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode missing numerical values\n",
"\n",
"\n",
"Missing values can be encoded in different ways. They may appear as ‘NaN’, ‘NA’, ‘?’, zero ‘0’, ‘xx’, minus one ‘-1’ or a blank space “ ”. We need various pandas methods to deal with missing values, but pandas only recognizes missing values encoded as ‘NaN’. So, before anything else, we should convert or replace all the ‘?’, zeros ‘0’, ‘xx’, minus ones ‘-1’ or blank spaces “ ” with ‘NaN’.\n",
"\n",
"\n",
"### Convert '?' to ‘NaN’\n",
"\n",
"`df[df == '?'] = np.nan`\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Convert '?' to 'NaN'\n",
"\n",
"df[df == '?'] = np.nan"
]
},
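The same cleaning can also be done at load time: `read_csv` accepts an `na_values` parameter that tells pandas which additional strings to treat as missing. A minimal sketch, using a hypothetical two-row extract (not the real wiki4HE file) held in memory:

```python
import io

import pandas as pd

# Hypothetical two-row extract in the same ';'-separated format
csv_data = "AGE;USERWIKI\n40;?\n42;0\n"

# na_values='?' makes pandas parse '?' as NaN while reading
df_demo = pd.read_csv(io.StringIO(csv_data), sep=';', na_values='?')

print(df_demo['USERWIKI'].isnull().sum())  # 1
```

With this approach, the separate `df[df == '?'] = np.nan` step is unnecessary, and the affected columns are parsed as numeric straight away.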
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" AGE GENDER DOMAIN PhD YEARSEXP UNIVERSITY UOC_POSITION OTHER_POSITION \\\n",
"0 40 0 2 1 14 1 2 NaN \n",
"1 42 0 5 1 18 1 2 NaN \n",
"2 37 0 4 1 13 1 3 NaN \n",
"3 40 0 4 0 13 1 3 NaN \n",
"4 51 0 6 0 8 1 3 NaN \n",
"\n",
" OTHERSTATUS USERWIKI ... BI2 Inc1 Inc2 Inc3 Inc4 Exp1 Exp2 Exp3 Exp4 Exp5 \n",
"0 NaN 0 ... 3 5 5 5 5 4 4 4 1 2 \n",
"1 NaN 0 ... 2 4 4 3 4 2 2 4 2 4 \n",
"2 NaN 0 ... 1 5 3 5 5 2 2 2 1 3 \n",
"3 NaN 0 ... 3 3 4 4 3 4 4 3 3 4 \n",
"4 NaN 1 ... 5 5 5 4 4 5 5 5 4 4 \n",
"\n",
"[5 rows x 53 columns]\n"
]
}
],
"source": [
"# View the first 5 rows of the dataset again\n",
"\n",
"print(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 53 columns):\n",
"AGE 913 non-null int64\n",
"GENDER 913 non-null int64\n",
"DOMAIN 911 non-null object\n",
"PhD 913 non-null int64\n",
"YEARSEXP 890 non-null object\n",
"UNIVERSITY 913 non-null int64\n",
"UOC_POSITION 800 non-null object\n",
"OTHER_POSITION 652 non-null object\n",
"OTHERSTATUS 373 non-null object\n",
"USERWIKI 909 non-null object\n",
"PU1 906 non-null object\n",
"PU2 902 non-null object\n",
"PU3 908 non-null object\n",
"PEU1 909 non-null object\n",
"PEU2 899 non-null object\n",
"PEU3 816 non-null object\n",
"ENJ1 906 non-null object\n",
"ENJ2 896 non-null object\n",
"Qu1 906 non-null object\n",
"Qu2 903 non-null object\n",
"Qu3 898 non-null object\n",
"Qu4 891 non-null object\n",
"Qu5 884 non-null object\n",
"Vis1 841 non-null object\n",
"Vis2 796 non-null object\n",
"Vis3 905 non-null object\n",
"Im1 891 non-null object\n",
"Im2 893 non-null object\n",
"Im3 856 non-null object\n",
"SA1 902 non-null object\n",
"SA2 901 non-null object\n",
"SA3 902 non-null object\n",
"Use1 899 non-null object\n",
"Use2 896 non-null object\n",
"Use3 904 non-null object\n",
"Use4 890 non-null object\n",
"Use5 898 non-null object\n",
"Pf1 902 non-null object\n",
"Pf2 907 non-null object\n",
"Pf3 899 non-null object\n",
"JR1 886 non-null object\n",
"JR2 860 non-null object\n",
"BI1 881 non-null object\n",
"BI2 870 non-null object\n",
"Inc1 878 non-null object\n",
"Inc2 878 non-null object\n",
"Inc3 876 non-null object\n",
"Inc4 871 non-null object\n",
"Exp1 900 non-null object\n",
"Exp2 902 non-null object\n",
"Exp3 900 non-null object\n",
"Exp4 899 non-null object\n",
"Exp5 900 non-null object\n",
"dtypes: int64(4), object(49)\n",
"memory usage: 378.1+ KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the dataframe again\n",
"\n",
"print(df.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation** \n",
"\n",
"Now, we can see that there are lots of columns containing missing values. We should view the column names of the dataframe. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['AGE', 'GENDER', 'DOMAIN', 'PhD', 'YEARSEXP', 'UNIVERSITY',\n",
" 'UOC_POSITION', 'OTHER_POSITION', 'OTHERSTATUS', 'USERWIKI', 'PU1',\n",
" 'PU2', 'PU3', 'PEU1', 'PEU2', 'PEU3', 'ENJ1', 'ENJ2', 'Qu1', 'Qu2',\n",
" 'Qu3', 'Qu4', 'Qu5', 'Vis1', 'Vis2', 'Vis3', 'Im1', 'Im2', 'Im3', 'SA1',\n",
" 'SA2', 'SA3', 'Use1', 'Use2', 'Use3', 'Use4', 'Use5', 'Pf1', 'Pf2',\n",
" 'Pf3', 'JR1', 'JR2', 'BI1', 'BI2', 'Inc1', 'Inc2', 'Inc3', 'Inc4',\n",
" 'Exp1', 'Exp2', 'Exp3', 'Exp4', 'Exp5'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# View the column names of dataframe\n",
"\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. `df.describe()`\n",
"\n",
"This will display summary statistics of the numeric features. The most important statistic here is the minimum value. If we see -1 or 0 among observations where such values are implausible, we can suspect missing values.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" AGE GENDER PhD UNIVERSITY\n",
"count 913.000000 913.000000 913.000000 913.000000\n",
"mean 42.246440 0.424973 0.464403 1.123768\n",
"std 8.058418 0.494610 0.499005 0.329497\n",
"min 23.000000 0.000000 0.000000 1.000000\n",
"25% 36.000000 0.000000 0.000000 1.000000\n",
"50% 42.000000 0.000000 0.000000 1.000000\n",
"75% 47.000000 1.000000 1.000000 1.000000\n",
"max 69.000000 1.000000 1.000000 2.000000\n"
]
}
],
"source": [
"# View the descriptive statistics of the dataframe\n",
"\n",
"print(df.describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see there are four columns of integer data types - **AGE**, **GENDER**, **PhD** and **UNIVERSITY**. \n",
"\n",
"In the **AGE** column, the minimum and maximum values are 23 and 69. The median value is 42 and the count is 913. We do not suspect any missing values in this column. \n",
"\n",
"A similar explanation applies to the **PhD** and **UNIVERSITY** columns.\n",
"\n",
"The **GENDER** column has only two possible values, 0 and 1. This is reasonable because 0 encodes female and 1 encodes male. \n",
"\n",
"So, we do not find any missing values in the above four columns. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4.\t`df.isnull()`\n",
"\n",
"The above command checks whether each cell in the dataframe contains a missing value. If it does, the command returns True for that cell; otherwise it returns False. \n",
"\n",
"\n",
"5.\t`df.isnull().sum()`\n",
"\n",
"The above command returns the total number of missing values in each column in the data set.\n",
"\n",
"\n",
"\t"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AGE 0\n",
"GENDER 0\n",
"DOMAIN 2\n",
"PhD 0\n",
"YEARSEXP 23\n",
"UNIVERSITY 0\n",
"UOC_POSITION 113\n",
"OTHER_POSITION 261\n",
"OTHERSTATUS 540\n",
"USERWIKI 4\n",
"PU1 7\n",
"PU2 11\n",
"PU3 5\n",
"PEU1 4\n",
"PEU2 14\n",
"PEU3 97\n",
"ENJ1 7\n",
"ENJ2 17\n",
"Qu1 7\n",
"Qu2 10\n",
"Qu3 15\n",
"Qu4 22\n",
"Qu5 29\n",
"Vis1 72\n",
"Vis2 117\n",
"Vis3 8\n",
"Im1 22\n",
"Im2 20\n",
"Im3 57\n",
"SA1 11\n",
"SA2 12\n",
"SA3 11\n",
"Use1 14\n",
"Use2 17\n",
"Use3 9\n",
"Use4 23\n",
"Use5 15\n",
"Pf1 11\n",
"Pf2 6\n",
"Pf3 14\n",
"JR1 27\n",
"JR2 53\n",
"BI1 32\n",
"BI2 43\n",
"Inc1 35\n",
"Inc2 35\n",
"Inc3 37\n",
"Inc4 42\n",
"Exp1 13\n",
"Exp2 11\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View missing values in each column in the dataset\n",
"\n",
"print(df.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that the **YEARSEXP** column contains 23 missing values. The data set description states that this column denotes the number of years of university teaching experience and that its data type is numeric. But the df.info() command shows that it is of object data type. So, we need to change its data type.\n",
"\n",
"Similarly, the last five columns **Exp1**, **Exp2**, **Exp3**, **Exp4** and **Exp5** denote the number of years of experience. They contain 13, 11, 13, 14 and 13 missing values respectively. They should have numeric data types, but the df.info() command shows that they are of object data type. So, we need to change their data types as well.\n",
"\n",
"All the other columns are of text data types.\n",
"\n",
"So, we need to subset these columns from the above dataset."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Subset the dataframe df with above columns\n",
"\n",
"df_sub = df[['YEARSEXP', 'Exp1', 'Exp2', 'Exp3', 'Exp4', 'Exp5']]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP object\n",
"Exp1 object\n",
"Exp2 object\n",
"Exp3 object\n",
"Exp4 object\n",
"Exp5 object\n",
"dtype: object\n"
]
}
],
"source": [
"# Check the data types of columns of df_sub\n",
"\n",
"print(df_sub.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that the columns of the df_sub dataframe are of object data type. We should convert them to a numeric data type."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Convert columns of df_sub into numeric data types\n",
"\n",
"df_sub = df_sub.apply(pd.to_numeric)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP float64\n",
"Exp1 float64\n",
"Exp2 float64\n",
"Exp3 float64\n",
"Exp4 float64\n",
"Exp5 float64\n",
"dtype: object\n"
]
}
],
"source": [
"print(df_sub.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that all the columns of df_sub dataframe are converted to float64 numeric data types. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 6 columns):\n",
"YEARSEXP 890 non-null float64\n",
"Exp1 900 non-null float64\n",
"Exp2 902 non-null float64\n",
"Exp3 900 non-null float64\n",
"Exp4 899 non-null float64\n",
"Exp5 900 non-null float64\n",
"dtypes: float64(6)\n",
"memory usage: 42.9 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the dataframe df_sub\n",
"\n",
"print(df_sub.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. `isna()` and `notna()` functions to detect ‘NA’ values\n",
"\n",
"Pandas provides `isna()` and `notna()` functions to detect ‘NA’ values. These are also methods on Series and DataFrame objects.\n",
"\n",
"\n",
"Examples of `isna()` and `notna()` commands\n",
"\n",
"\n",
"*detect ‘NA’ values in the dataframe*\t\n",
"\n",
"`df.isna()`\n",
"\n",
"\n",
"*detect ‘NA’ values in a particular column in the dataframe*\n",
"\n",
"`pd.isna(df[col_name])`\n",
"\n",
"`df[col_name].notna()`\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 23\n",
"Exp1 13\n",
"Exp2 11\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View the number of missing values in each column of dataframe df_sub\n",
"\n",
"print(df_sub.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that columns **YEARSEXP**, **Exp1**, **Exp2**, **Exp3**, **Exp4** and **Exp5** contain 23, 13, 11, 13, 14 and 13 missing values respectively.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handle missing values\n",
"\n",
"There are several methods to handle missing values. Each method has its own advantages and disadvantages. The choice of method is subjective and depends on the nature of the data and the missing values. The options available for handling missing values are summarized below:\n",
"\n",
"**•\tDrop missing values with dropna()**\n",
"\n",
"**•\tFill missing values with a test statistic**\n",
"\n",
"**•\tFill missing values with Imputer**\n",
"\n",
"**•\tBuild a Prediction Model**\n",
"\n",
"**•\tKNN Imputation**\n",
"\n",
"\n",
"\n",
"I have discussed each method in the sections below:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.\tDrop missing values with dropna()\n",
"\n",
"This is the easiest method to handle missing values. In this method, we drop the rows or columns of a data set that contain missing values. \n",
"\n",
"\n",
"drop labels or rows from a data set containing missing values\n",
"\n",
"`df.dropna (axis = 0)`\n",
"\n",
"\n",
"\n",
"drop columns from a data set containing missing values\n",
"\n",
"`df.dropna(axis = 1)`\n",
"\n",
"\n",
"This is the Pandas dataframe **dropna()** method. An equivalent **dropna()** method is available for Series with same functionality.\n",
"\n",
"\n",
"\n",
"To drop a specific column from the dataframe, we can use drop() method of Pandas dataframe.\n",
"\n",
"\n",
"\n",
"### drop col_name column from Pandas dataframe\n",
"\n",
"\n",
"`df.drop('col_name', axis = 1)` \n",
"\n",
"\n",
"\n",
"**A note about axis parameter** \n",
"\n",
"\n",
"Axis value may contain (0 or ‘index’) or (1 or ‘columns’). Its default value is 0.\n",
"\n",
"We set axis = 0 or ‘index’ to drop rows which contain missing values.\n",
"\n",
"We set axis = 1 or ‘columns’ to drop columns which contain missing values.\n",
"\n",
"\n",
"\n",
"After dropping the missing values, we can again check for missing values and the dimensions of the dataframe.\n",
"\n",
"\n",
"\n",
"again check the missing values in each column\n",
"\n",
"`df.isnull().sum()`\n",
"\n",
"\n",
"again check the dimensions of the dataset\n",
"\n",
"`df.shape`\n",
"\n",
"\n",
"\n",
"But, this method has one disadvantage: it risks losing useful information. Suppose there are lots of missing values in our dataset. If we drop them, we may end up throwing away valuable information along with the missing data. So, dropping is only advised when there are few missing values in the dataset.\n",
"\n",
"\n",
"So, it's better to develop an imputation strategy so that we can impute missing values with the mean or the median of the row or column containing the missing values.\n",
"\n"
]
},
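The effect of the axis parameter described above can be sketched on a tiny hypothetical frame (not part of the wiki4HE data):

```python
import numpy as np
import pandas as pd

# One NaN in column 'A', row 1
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [4.0, 5.0, 6.0]})

rows_kept = demo.dropna(axis=0)  # drops the row that contains NaN
cols_kept = demo.dropna(axis=1)  # drops the column that contains NaN

print(rows_kept.shape)  # (2, 2)
print(cols_kept.shape)  # (3, 1)
```

Neither call modifies `demo` itself; dropna() returns a new object unless `inplace=True` is passed.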
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Copy the dataframe df_sub\n",
"\n",
"df1 = df_sub.copy()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 23\n",
"Exp1 13\n",
"Exp2 11\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View the number of missing values in each column of dataframe df1\n",
"\n",
"print(df1.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"The column **Exp2** contains the fewest missing values. So, I will drop that column from df1."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Drop column Exp2 from df1\n",
"\n",
"df1 = df1.drop('Exp2', axis = 1)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" YEARSEXP Exp1 Exp3 Exp4 Exp5\n",
"0 14.0 4.0 4.0 1.0 2.0\n",
"1 18.0 2.0 4.0 2.0 4.0\n",
"2 13.0 2.0 2.0 1.0 3.0\n",
"3 13.0 4.0 3.0 3.0 4.0\n",
"4 8.0 5.0 5.0 4.0 4.0\n"
]
}
],
"source": [
"# View the first five rows of dataframe df1\n",
"\n",
"print(df1.head())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 5 columns):\n",
"YEARSEXP 890 non-null float64\n",
"Exp1 900 non-null float64\n",
"Exp3 900 non-null float64\n",
"Exp4 899 non-null float64\n",
"Exp5 900 non-null float64\n",
"dtypes: float64(5)\n",
"memory usage: 35.7 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of dataframe df1\n",
"\n",
"print(df1.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Conclusion**\n",
"\n",
"I have dropped the **Exp2** column from the dataframe df1 with df1.drop() command."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Fill missing values with a test statistic\n",
"\n",
"In this method, we fill the missing values with a test statistic such as the mean, median or mode of the feature the missing value belongs to. One can also specify a forward-fill to propagate the previous value forward, or a back-fill to propagate the next value backward.\n",
"\n",
"\n",
"Filling missing values with a test statistic like median\n",
"\n",
"`median = df['col_name'].median()`\n",
"\n",
"`df['col_name'].fillna(value = median, inplace = True )`\n",
"\n",
"\n",
"\n",
"We can also use replace() in place of fillna()\n",
"\n",
"`df['col_name'].replace(to_replace = np.nan, value = median, inplace = True)`\n",
"\n",
"\n",
"If we choose this method, then we should compute the median value on the training set and use it to fill the missing values in the training set. Then we should save the median value that we have computed. Later, we will replace missing values in the test set with the median value to evaluate the system.\n"
]
},
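The forward-fill and back-fill options mentioned above can be sketched on a small hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # propagate the previous value forward
backward = s.bfill()  # propagate the next value backward

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
```

These are useful for ordered data such as time series, where neighbouring observations are likely to be similar.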
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Copy the df1 dataframe\n",
"\n",
"df2 = df1.copy()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 23\n",
"Exp1 13\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View the number of missing values in each column of dataframe df2\n",
"\n",
"print(df2.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that the **YEARSEXP** column contains 23 missing values. I will fill the missing values in the **YEARSEXP** column with the median of that column."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Fill missing values in YEARSEXP column with median of YEARSEXP column.\n",
"\n",
"median = df2['YEARSEXP'].median()\n",
"\n",
"df2['YEARSEXP'].fillna(value = median, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 5 columns):\n",
"YEARSEXP 913 non-null float64\n",
"Exp1 900 non-null float64\n",
"Exp3 900 non-null float64\n",
"Exp4 899 non-null float64\n",
"Exp5 900 non-null float64\n",
"dtypes: float64(5)\n",
"memory usage: 35.7 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of df2 dataframe \n",
"\n",
"print(df2.info())"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 0\n",
"Exp1 13\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# Again view the number of missing values in each column of dataframe df2\n",
"\n",
"print(df2.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"I have filled all the missing values of the **YEARSEXP** column with its median value. Now, this column has no missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.\tFill missing values with Imputer\n",
"\n",
"Scikit-Learn provides the Imputer class to deal with missing values. In this method, we replace each missing value with the mean value of the entire feature column. This can be done as shown in the following code:\n",
"\n",
"\n",
"`from sklearn.preprocessing import Imputer`\n",
"\n",
"`imp = Imputer(missing_values='NaN', strategy='mean', axis=0)`\n",
"\n",
"\n",
"`imputed_data = imp.fit_transform(df)`\n",
"\n",
"`imputed_data`\n",
"\n",
"\n",
"Here, I have replaced each ‘NaN’ value with the corresponding mean value. The mean value is separately calculated for each feature column. If instead of axis = 0, we set axis = 1, then mean values are calculated for each row. \n",
"\n",
"\n",
"Other options for strategy parameter are ‘median’ or ‘most_frequent’. The ‘most_frequent’ parameter replaces the missing values with the most frequent value. It is useful for imputing categorical feature values.\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[14. 4. 4. 1. 2.]\n",
" [18. 2. 4. 2. 4.]\n",
" [13. 2. 2. 1. 3.]\n",
" ...\n",
" [ 9. 5. 5. 4. 1.]\n",
" [10. 4. 2. 1. 1.]\n",
" [12. 2. 3. 1. 1.]]\n"
]
}
],
"source": [
"# Fill missing values with Imputer\n",
"\n",
"from sklearn.preprocessing import Imputer\n",
"\n",
"imp = Imputer(missing_values='NaN', strategy='mean', axis=0)\n",
"\n",
"df2 = imp.fit_transform(df2)\n",
"\n",
"print(df2)\n"
]
},
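Note that in scikit-learn 0.22 and later, the Imputer class used above was removed; its replacement is `sklearn.impute.SimpleImputer` (same strategies, no axis parameter). A minimal equivalent sketch on hypothetical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# strategy='mean' replaces each NaN with its column's mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imp.fit_transform(X)

print(X_imputed[1, 0])  # 4.0 (mean of 1.0 and 7.0)
```

As with Imputer, fit_transform() returns a NumPy array, so the dataframe must be rebuilt afterwards if column names are needed.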
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Imputer's fit_transform() returns a NumPy array.\n",
"\n",
"# So, we need to convert it back into a dataframe.\n",
"\n",
"columnnames = ['YEARSEXP', 'Exp1', 'Exp3', 'Exp4', 'Exp5']\n",
"\n",
"df2 = pd.DataFrame(df2, columns = columnnames)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" YEARSEXP Exp1 Exp3 Exp4 Exp5\n",
"0 14.0 4.0 4.0 1.0 2.0\n",
"1 18.0 2.0 4.0 2.0 4.0\n",
"2 13.0 2.0 2.0 1.0 3.0\n",
"3 13.0 4.0 3.0 3.0 4.0\n",
"4 8.0 5.0 5.0 4.0 4.0\n"
]
}
],
"source": [
"# View the first 5 rows of imputed dataframe df2\n",
"\n",
"print(df2.head())"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 5 columns):\n",
"YEARSEXP 913 non-null float64\n",
"Exp1 913 non-null float64\n",
"Exp3 913 non-null float64\n",
"Exp4 913 non-null float64\n",
"Exp5 913 non-null float64\n",
"dtypes: float64(5)\n",
"memory usage: 35.7 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the imputed dataframe df2\n",
"\n",
"print(df2.info())"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 0\n",
"Exp1 0\n",
"Exp3 0\n",
"Exp4 0\n",
"Exp5 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Agian check that there are no missing values in df2\n",
"\n",
"print(df2.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that there are no missing numerical values in the columns of dataframe df2. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.\tBuild a prediction model\n",
"\n",
"We can build a prediction model to handle missing values. In this method, we divide our data set into two sets – training set and test set. Training set does not contain any missing values and test set contains missing values. The variable containing missing values can be treated as a target variable. Next, we create a model to predict target variable and use it to populate missing values of test data set. \n"
]
},
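  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of this approach (on a small toy dataframe, not the survey data used above), we can train a simple regression model on the complete rows and use its predictions to fill the missing target values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Toy dataframe - YEARSEXP plays the role of the variable with missing values\n",
    "toy = pd.DataFrame({'YEARSEXP': [14, 18, np.nan, 13, np.nan, 8],\n",
    "                    'Exp1':     [4, 2, 2, 4, 5, 5],\n",
    "                    'Exp3':     [4, 4, 2, 3, 2, 5]})\n",
    "\n",
    "# Training set: rows where the target is present; test set: rows where it is missing\n",
    "train = toy[toy['YEARSEXP'].notnull()]\n",
    "test = toy[toy['YEARSEXP'].isnull()]\n",
    "\n",
    "# Fit the model on the complete rows and predict the missing target values\n",
    "model = LinearRegression()\n",
    "model.fit(train[['Exp1', 'Exp3']], train['YEARSEXP'])\n",
    "toy.loc[toy['YEARSEXP'].isnull(), 'YEARSEXP'] = model.predict(test[['Exp1', 'Exp3']])\n",
    "\n",
    "print(toy['YEARSEXP'].isnull().sum())\n"
   ]
  },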
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8.\tKNN Imputation\n",
"\n",
"In this method, the missing values of an attribute are imputed using the given number of attributes that are mostly similar to the attribute whose values are missing. The similarity of attributes is determined using a distance function.\n",
"\n",
"The above two mmethods are more sophisticated methods to deal with missing numerical values. Hence, I will not go into much detail.\n"
]
},
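  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For completeness, here is a minimal sketch of KNN imputation using scikit-learn's KNNImputer (available from scikit-learn 0.22 onwards), on a small toy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.impute import KNNImputer\n",
    "\n",
    "X = np.array([[1.0, 2.0, np.nan],\n",
    "              [3.0, 4.0, 3.0],\n",
    "              [np.nan, 6.0, 5.0],\n",
    "              [8.0, 8.0, 7.0]])\n",
    "\n",
    "# Each missing entry is replaced by the mean of that feature\n",
    "# across the n_neighbors rows closest in the observed features\n",
    "imputer = KNNImputer(n_neighbors=2)\n",
    "X_imputed = imputer.fit_transform(X)\n",
    "\n",
    "print(X_imputed)\n"
   ]
  },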
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Check with ASSERT statement\n",
"\n",
"\n",
"Finally, we can check for missing values programmatically. If we drop or fill missing values, we expect no missing values. We can write an assert statement to verify this. So, we can use an assert statement to programmatically check that no missing or unexpected ‘0’ value is present. This gives confidence that our code is running properly.\n",
"Assert statement will return nothing if the value being tested is true and will throw an AssertionError if the value is false.\n",
"\n",
"Asserts\n",
"\n",
"•\tassert 1 == 1 (return Nothing if the value is True)\n",
"\n",
"•\tassert 1 == 2 (return AssertionError if the value is False)\n",
"\n",
"\n",
"\n",
"assert that there are no missing values in the dataframe\n",
"\n",
"`assert pd.notnull(df).all().all()`\n",
"\n",
"\n",
"assert that there are no missing values for a particular column in dataframe\n",
"\n",
"`assert df.column_name.notnull().all()`\n",
"\n",
"\n",
"assert all values are greater than 0\n",
"\n",
"`assert (df >=0).all().all()`\n",
"\n",
"\n",
"assert no entry in a column is equal to 0\n",
"\n",
"`assert (df['column_name']!=0).all().all()`\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# assert that there are no missing values in the dataframe df2\n",
"\n",
"assert pd.notnull(df2).all().all()\n",
"\n",
"\n",
"# When I run the above command, it returns nothing. Hence the assert statement is true. \n",
"\n",
"# So, there are no missing values in dataframe df2.\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# assert that there are no missing values for a particular column in the dataframe\n",
"\n",
"assert df2['YEARSEXP'].notnull().all()\n",
"\n",
"assert df2['Exp1'].notnull().all()\n",
"\n",
"assert df2['Exp3'].notnull().all()\n",
"\n",
"assert df2['Exp4'].notnull().all()\n",
"\n",
"assert df2['Exp5'].notnull().all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"When I run the above commands, it returns nothing. Hence the assert statements are true. Hence, there are no missing values in df2 dataframe.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes our discussion on missing values."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}