{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Preprocessing Project - Dealing with Missing Numerical Values\n",
"\n",
"\n",
"In this project, I discuss various data preprocessing techniques for handling missing numerical values. The project is organized into the sections listed in the table of contents below:\n",
"\n",
"\n",
"## Table of Contents\n",
"\n",
"\n",
"1.\tIntroduction\n",
"\n",
"2.\tSource dataset\n",
"\n",
"3.\tDealing with missing numerical values\n",
"\n",
"4.\tDrop missing values with dropna()\n",
"\n",
"5.\tFill missing values with a test statistic\n",
"\n",
"6.\tFill missing values with Imputer\n",
"\n",
"7.\tBuild a prediction model\n",
"\n",
"8.\tKNN Imputation\n",
"\n",
"9.\tCheck with ASSERT statement\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Introduction\n",
"\n",
"\n",
"Over the last few decades, Machine Learning (ML) has gained immense popularity in solving real world business problems. It has emerged as a technology tool for companies looking to boost productivity and profit. ML practitioners source real world data and write algorithms to solve business problems. The success of ML algorithms depends on the quality of the data: it must be free from errors and discrepancies, and it must adhere to a standard that ML algorithms can accept. But this does not happen in reality.\n",
"\n",
"In reality, data has its own limitations. It is dirty: incomplete, noisy and inconsistent. Incomplete data has missing values or lacks certain attributes. Noisy data contains errors and outliers and hence does not produce the desired results. Inconsistent data contains discrepancies or duplicate records.\n",
"\n",
"So, ML practitioners must transform raw data into standardized data suitable for ML algorithms. This involves cleaning, transforming and standardizing the data to remove its inadequacies and irregularities. These actions are collectively known as **Data Preprocessing**. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Source dataset\n",
"\n",
"\n",
"I have used wiki4HE.csv data set for this project. I have downloaded this data set from the UCI Machine Learning Repository. The data set describes survey results of faculty members from two Spanish universities on teaching uses of Wikipedia. \n",
"\n",
"\n",
"The dataset contains 53 attributes and 913 instances. Out of the 53 attributes, 4 are of numeric data types and 49 are of text or character data types. \n",
"\n",
"\n",
"The data set can be found at the following URL:\n",
"\n",
"\n",
"https://archive.ics.uci.edu/ml/datasets/wiki4HE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Dealing with missing numerical values\n",
"\n",
"\n",
"It is very common, when working with real world data, for a data scientist to come across missing values. These could result from an error prone data entry process, flawed data collection methods, values that are not applicable, fields left blank in a survey, or respondents declining to answer. Whatever the reason, the data scientist must handle these missing values carefully, because simply ignoring them gives wrong results. The key decision is whether to delete the missing values or replace them with a suitable statistic. The first step in dealing with missing values properly is to identify them. \n",
"\n",
"\n",
"The initial inspection of the data helps us detect whether there are missing values in the data set. This can be done through Exploratory Data Analysis (EDA). So, it is important that a data scientist performs EDA to identify missing values correctly.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"\n",
"import numpy as np\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import the dataset\n",
"\n",
"dataset = \"C:/project_datasets/wiki4HE.csv\"\n",
"\n",
"df = pd.read_csv(dataset, sep = ';')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### Exploratory Data Analysis (EDA)\n",
"\n",
"\n",
"Below is the list of commands to identify missing values with EDA.\n",
"\n",
"\n",
"1.\t`df.head()`\n",
"\n",
"This will output the first five rows of the dataset. It gives us a quick view of the presence of ‘NaN’, ‘?’, ‘-1’, ‘0’ or blank spaces “” in the dataset. If required, we can view more rows by specifying the number of rows inside the parentheses. \n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" AGE GENDER DOMAIN PhD YEARSEXP UNIVERSITY UOC_POSITION OTHER_POSITION \\\n",
"0 40 0 2 1 14 1 2 ? \n",
"1 42 0 5 1 18 1 2 ? \n",
"2 37 0 4 1 13 1 3 ? \n",
"3 40 0 4 0 13 1 3 ? \n",
"4 51 0 6 0 8 1 3 ? \n",
"\n",
" OTHERSTATUS USERWIKI ... BI2 Inc1 Inc2 Inc3 Inc4 Exp1 Exp2 Exp3 Exp4 Exp5 \n",
"0 ? 0 ... 3 5 5 5 5 4 4 4 1 2 \n",
"1 ? 0 ... 2 4 4 3 4 2 2 4 2 4 \n",
"2 ? 0 ... 1 5 3 5 5 2 2 2 1 3 \n",
"3 ? 0 ... 3 3 4 4 3 4 4 3 3 4 \n",
"4 ? 1 ... 5 5 5 4 4 5 5 5 4 4 \n",
"\n",
"[5 rows x 53 columns]\n"
]
}
],
"source": [
"# View the first 5 rows of the dataset\n",
"\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that there are lots of missing values in the dataset. The columns **OTHER_POSITION** and **OTHERSTATUS** contain missing values.\n",
"\n",
"The column **GENDER** contains zeros. This might be because **Male** is encoded as 1 and **Female** is encoded as 0.\n",
"\n",
"We need to explore the dataset further to confirm which columns contain the missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. `df.info()`\n",
"\n",
"This command is quite useful in detecting missing values in the dataset. It tells us the total number of entries and the number of non-null observations in each column. When the number of non-null observations in a column is less than the total number of entries, we know that column contains missing values.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 53 columns):\n",
"AGE 913 non-null int64\n",
"GENDER 913 non-null int64\n",
"DOMAIN 913 non-null object\n",
"PhD 913 non-null int64\n",
"YEARSEXP 913 non-null object\n",
"UNIVERSITY 913 non-null int64\n",
"UOC_POSITION 913 non-null object\n",
"OTHER_POSITION 913 non-null object\n",
"OTHERSTATUS 913 non-null object\n",
"USERWIKI 913 non-null object\n",
"PU1 913 non-null object\n",
"PU2 913 non-null object\n",
"PU3 913 non-null object\n",
"PEU1 913 non-null object\n",
"PEU2 913 non-null object\n",
"PEU3 913 non-null object\n",
"ENJ1 913 non-null object\n",
"ENJ2 913 non-null object\n",
"Qu1 913 non-null object\n",
"Qu2 913 non-null object\n",
"Qu3 913 non-null object\n",
"Qu4 913 non-null object\n",
"Qu5 913 non-null object\n",
"Vis1 913 non-null object\n",
"Vis2 913 non-null object\n",
"Vis3 913 non-null object\n",
"Im1 913 non-null object\n",
"Im2 913 non-null object\n",
"Im3 913 non-null object\n",
"SA1 913 non-null object\n",
"SA2 913 non-null object\n",
"SA3 913 non-null object\n",
"Use1 913 non-null object\n",
"Use2 913 non-null object\n",
"Use3 913 non-null object\n",
"Use4 913 non-null object\n",
"Use5 913 non-null object\n",
"Pf1 913 non-null object\n",
"Pf2 913 non-null object\n",
"Pf3 913 non-null object\n",
"JR1 913 non-null object\n",
"JR2 913 non-null object\n",
"BI1 913 non-null object\n",
"BI2 913 non-null object\n",
"Inc1 913 non-null object\n",
"Inc2 913 non-null object\n",
"Inc3 913 non-null object\n",
"Inc4 913 non-null object\n",
"Exp1 913 non-null object\n",
"Exp2 913 non-null object\n",
"Exp3 913 non-null object\n",
"Exp4 913 non-null object\n",
"Exp5 913 non-null object\n",
"dtypes: int64(4), object(49)\n",
"memory usage: 378.1+ KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the dataframe\n",
"\n",
"print(df.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"The above command shows that there are no missing values in the data set. But this is not true. The dataset does contain missing values. This may be because the missing values are encoded in ways that pandas does not recognize. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Encode missing numerical values\n",
"\n",
"\n",
"Missing values can be encoded in different ways. They may appear as ‘NaN’, ‘NA’, ‘?’, zero ‘0’, ‘xx’, minus one ‘-1’ or a blank space “ ”. We need various pandas methods to deal with missing values, but pandas only recognizes missing values encoded as ‘NaN’. So, before anything else, we should convert or replace all the ‘?’, zeros ‘0’, ‘xx’, minus ones ‘-1’ or blank spaces “ ” with ‘NaN’.\n",
"\n",
"\n",
"### Convert '?' to ‘NaN’\n",
"\n",
"`df[df == '?'] = np.nan`\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Convert '?' to 'NaN'\n",
"\n",
"df[df == '?'] = np.nan"
]
},
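The same cleaning can also be done at load time: `read_csv` accepts an `na_values` parameter that tells pandas which additional strings to treat as missing. A minimal sketch, using a hypothetical two-row extract (not the real wiki4HE file) held in memory:

```python
import io

import pandas as pd

# Hypothetical two-row extract in the same ';'-separated format
csv_data = "AGE;USERWIKI\n40;?\n42;0\n"

# na_values='?' makes pandas parse '?' as NaN while reading
df_demo = pd.read_csv(io.StringIO(csv_data), sep=';', na_values='?')

print(df_demo['USERWIKI'].isnull().sum())  # 1
```

With this approach, the separate `df[df == '?'] = np.nan` step is unnecessary, and the affected columns are parsed as numeric straight away.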
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" AGE GENDER DOMAIN PhD YEARSEXP UNIVERSITY UOC_POSITION OTHER_POSITION \\\n",
"0 40 0 2 1 14 1 2 NaN \n",
"1 42 0 5 1 18 1 2 NaN \n",
"2 37 0 4 1 13 1 3 NaN \n",
"3 40 0 4 0 13 1 3 NaN \n",
"4 51 0 6 0 8 1 3 NaN \n",
"\n",
" OTHERSTATUS USERWIKI ... BI2 Inc1 Inc2 Inc3 Inc4 Exp1 Exp2 Exp3 Exp4 Exp5 \n",
"0 NaN 0 ... 3 5 5 5 5 4 4 4 1 2 \n",
"1 NaN 0 ... 2 4 4 3 4 2 2 4 2 4 \n",
"2 NaN 0 ... 1 5 3 5 5 2 2 2 1 3 \n",
"3 NaN 0 ... 3 3 4 4 3 4 4 3 3 4 \n",
"4 NaN 1 ... 5 5 5 4 4 5 5 5 4 4 \n",
"\n",
"[5 rows x 53 columns]\n"
]
}
],
"source": [
"# View the first 5 rows of the dataset again\n",
"\n",
"print(df.head())"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 53 columns):\n",
"AGE 913 non-null int64\n",
"GENDER 913 non-null int64\n",
"DOMAIN 911 non-null object\n",
"PhD 913 non-null int64\n",
"YEARSEXP 890 non-null object\n",
"UNIVERSITY 913 non-null int64\n",
"UOC_POSITION 800 non-null object\n",
"OTHER_POSITION 652 non-null object\n",
"OTHERSTATUS 373 non-null object\n",
"USERWIKI 909 non-null object\n",
"PU1 906 non-null object\n",
"PU2 902 non-null object\n",
"PU3 908 non-null object\n",
"PEU1 909 non-null object\n",
"PEU2 899 non-null object\n",
"PEU3 816 non-null object\n",
"ENJ1 906 non-null object\n",
"ENJ2 896 non-null object\n",
"Qu1 906 non-null object\n",
"Qu2 903 non-null object\n",
"Qu3 898 non-null object\n",
"Qu4 891 non-null object\n",
"Qu5 884 non-null object\n",
"Vis1 841 non-null object\n",
"Vis2 796 non-null object\n",
"Vis3 905 non-null object\n",
"Im1 891 non-null object\n",
"Im2 893 non-null object\n",
"Im3 856 non-null object\n",
"SA1 902 non-null object\n",
"SA2 901 non-null object\n",
"SA3 902 non-null object\n",
"Use1 899 non-null object\n",
"Use2 896 non-null object\n",
"Use3 904 non-null object\n",
"Use4 890 non-null object\n",
"Use5 898 non-null object\n",
"Pf1 902 non-null object\n",
"Pf2 907 non-null object\n",
"Pf3 899 non-null object\n",
"JR1 886 non-null object\n",
"JR2 860 non-null object\n",
"BI1 881 non-null object\n",
"BI2 870 non-null object\n",
"Inc1 878 non-null object\n",
"Inc2 878 non-null object\n",
"Inc3 876 non-null object\n",
"Inc4 871 non-null object\n",
"Exp1 900 non-null object\n",
"Exp2 902 non-null object\n",
"Exp3 900 non-null object\n",
"Exp4 899 non-null object\n",
"Exp5 900 non-null object\n",
"dtypes: int64(4), object(49)\n",
"memory usage: 378.1+ KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the dataframe again\n",
"\n",
"print(df.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation** \n",
"\n",
"Now, we can see that there are lots of columns containing missing values. We should view the column names of the dataframe. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['AGE', 'GENDER', 'DOMAIN', 'PhD', 'YEARSEXP', 'UNIVERSITY',\n",
" 'UOC_POSITION', 'OTHER_POSITION', 'OTHERSTATUS', 'USERWIKI', 'PU1',\n",
" 'PU2', 'PU3', 'PEU1', 'PEU2', 'PEU3', 'ENJ1', 'ENJ2', 'Qu1', 'Qu2',\n",
" 'Qu3', 'Qu4', 'Qu5', 'Vis1', 'Vis2', 'Vis3', 'Im1', 'Im2', 'Im3', 'SA1',\n",
" 'SA2', 'SA3', 'Use1', 'Use2', 'Use3', 'Use4', 'Use5', 'Pf1', 'Pf2',\n",
" 'Pf3', 'JR1', 'JR2', 'BI1', 'BI2', 'Inc1', 'Inc2', 'Inc3', 'Inc4',\n",
" 'Exp1', 'Exp2', 'Exp3', 'Exp4', 'Exp5'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# View the column names of dataframe\n",
"\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. `df.describe()`\n",
"\n",
"This will display summary statistics of the numeric features. The most important statistic here is the minimum value. If we see -1 or 0 among observations where such values are implausible, we can suspect missing values.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" AGE GENDER PhD UNIVERSITY\n",
"count 913.000000 913.000000 913.000000 913.000000\n",
"mean 42.246440 0.424973 0.464403 1.123768\n",
"std 8.058418 0.494610 0.499005 0.329497\n",
"min 23.000000 0.000000 0.000000 1.000000\n",
"25% 36.000000 0.000000 0.000000 1.000000\n",
"50% 42.000000 0.000000 0.000000 1.000000\n",
"75% 47.000000 1.000000 1.000000 1.000000\n",
"max 69.000000 1.000000 1.000000 2.000000\n"
]
}
],
"source": [
"# View the descriptive statistics of the dataframe\n",
"\n",
"print(df.describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see there are four columns of integer data types - **AGE**, **GENDER**, **PhD** and **UNIVERSITY**. \n",
"\n",
"In the **AGE** column, the minimum and maximum values are 23 and 69. The median value is 42 and the count is 913. We do not suspect any missing values in this column. \n",
"\n",
"A similar explanation applies to the **PhD** and **UNIVERSITY** columns.\n",
"\n",
"The **GENDER** column has only two possible values, 0 and 1. This is reasonable because 0 encodes female and 1 encodes male. \n",
"\n",
"So, we do not find any missing values in the above four columns. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4.\t`df.isnull()`\n",
"\n",
"The above command checks whether each cell in the dataframe contains a missing value. If it does, the command returns True for that cell; otherwise it returns False. \n",
"\n",
"\n",
"5.\t`df.isnull().sum()`\n",
"\n",
"The above command returns the total number of missing values in each column in the data set.\n",
"\n",
"\n",
"\t"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AGE 0\n",
"GENDER 0\n",
"DOMAIN 2\n",
"PhD 0\n",
"YEARSEXP 23\n",
"UNIVERSITY 0\n",
"UOC_POSITION 113\n",
"OTHER_POSITION 261\n",
"OTHERSTATUS 540\n",
"USERWIKI 4\n",
"PU1 7\n",
"PU2 11\n",
"PU3 5\n",
"PEU1 4\n",
"PEU2 14\n",
"PEU3 97\n",
"ENJ1 7\n",
"ENJ2 17\n",
"Qu1 7\n",
"Qu2 10\n",
"Qu3 15\n",
"Qu4 22\n",
"Qu5 29\n",
"Vis1 72\n",
"Vis2 117\n",
"Vis3 8\n",
"Im1 22\n",
"Im2 20\n",
"Im3 57\n",
"SA1 11\n",
"SA2 12\n",
"SA3 11\n",
"Use1 14\n",
"Use2 17\n",
"Use3 9\n",
"Use4 23\n",
"Use5 15\n",
"Pf1 11\n",
"Pf2 6\n",
"Pf3 14\n",
"JR1 27\n",
"JR2 53\n",
"BI1 32\n",
"BI2 43\n",
"Inc1 35\n",
"Inc2 35\n",
"Inc3 37\n",
"Inc4 42\n",
"Exp1 13\n",
"Exp2 11\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View missing values in each column in the dataset\n",
"\n",
"print(df.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that the **YEARSEXP** column contains 23 missing values. The data set description states that this column denotes the number of years of university teaching experience and that its data type is numeric. But the df.info() command shows that it is of object data type. So, we need to change its data type.\n",
"\n",
"Similarly, the last five columns **Exp1**, **Exp2**, **Exp3**, **Exp4** and **Exp5** denote the number of years of experience. They contain 13, 11, 13, 14 and 13 missing values respectively. They should have numeric data types, but the df.info() command shows that they are of object data type. So, we need to change their data types as well.\n",
"\n",
"All the other columns are of text data types.\n",
"\n",
"So, we need to subset these columns from the above dataset."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Subset the dataframe df with above columns\n",
"\n",
"df_sub = df[['YEARSEXP', 'Exp1', 'Exp2', 'Exp3', 'Exp4', 'Exp5']]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP object\n",
"Exp1 object\n",
"Exp2 object\n",
"Exp3 object\n",
"Exp4 object\n",
"Exp5 object\n",
"dtype: object\n"
]
}
],
"source": [
"# Check the data types of columns of df_sub\n",
"\n",
"print(df_sub.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that the columns of the df_sub dataframe are of object data type. We should convert them to a numeric data type."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Convert columns of df_sub into numeric data types\n",
"\n",
"df_sub = df_sub.apply(pd.to_numeric)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP float64\n",
"Exp1 float64\n",
"Exp2 float64\n",
"Exp3 float64\n",
"Exp4 float64\n",
"Exp5 float64\n",
"dtype: object\n"
]
}
],
"source": [
"print(df_sub.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that all the columns of df_sub dataframe are converted to float64 numeric data types. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 6 columns):\n",
"YEARSEXP 890 non-null float64\n",
"Exp1 900 non-null float64\n",
"Exp2 902 non-null float64\n",
"Exp3 900 non-null float64\n",
"Exp4 899 non-null float64\n",
"Exp5 900 non-null float64\n",
"dtypes: float64(6)\n",
"memory usage: 42.9 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the dataframe df_sub\n",
"\n",
"print(df_sub.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. `isna()` and `notna()` functions to detect ‘NA’ values\n",
"\n",
"Pandas provides `isna()` and `notna()` functions to detect ‘NA’ values. These are also methods on Series and DataFrame objects.\n",
"\n",
"\n",
"Examples of `isna()` and `notna()` commands\n",
"\n",
"\n",
"*detect ‘NA’ values in the dataframe*\t\n",
"\n",
"`df.isna()`\n",
"\n",
"\n",
"*detect ‘NA’ values in a particular column in the dataframe*\n",
"\n",
"`pd.isna(df[col_name])`\n",
"\n",
"`df[col_name].notna()`\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 23\n",
"Exp1 13\n",
"Exp2 11\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View the number of missing values in each column of dataframe df_sub\n",
"\n",
"print(df_sub.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that columns **YEARSEXP**, **Exp1**, **Exp2**, **Exp3**, **Exp4** and **Exp5** contain 23, 13, 11, 13, 14 and 13 missing values respectively.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handle missing values\n",
"\n",
"There are several methods to handle missing values. Each method has its own advantages and disadvantages. The choice of method is subjective and depends on the nature of the data and the missing values. The options available for handling missing values are summarized below:\n",
"\n",
"**•\tDrop missing values with dropna()**\n",
"\n",
"**•\tFill missing values with a test statistic**\n",
"\n",
"**•\tFill missing values with Imputer**\n",
"\n",
"**•\tBuild a Prediction Model**\n",
"\n",
"**•\tKNN Imputation**\n",
"\n",
"\n",
"\n",
"I have discussed each method in the sections below:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.\tDrop missing values with dropna()\n",
"\n",
"This is the easiest method to handle missing values. In this method, we drop the rows or columns of a data set that contain missing values. \n",
"\n",
"\n",
"drop labels or rows from a data set containing missing values\n",
"\n",
"`df.dropna (axis = 0)`\n",
"\n",
"\n",
"\n",
"drop columns from a data set containing missing values\n",
"\n",
"`df.dropna(axis = 1)`\n",
"\n",
"\n",
"This is the Pandas dataframe **dropna()** method. An equivalent **dropna()** method is available for Series with same functionality.\n",
"\n",
"\n",
"\n",
"To drop a specific column from the dataframe, we can use drop() method of Pandas dataframe.\n",
"\n",
"\n",
"\n",
"### drop col_name column from Pandas dataframe\n",
"\n",
"\n",
"`df.drop('col_name', axis = 1)` \n",
"\n",
"\n",
"\n",
"**A note about axis parameter** \n",
"\n",
"\n",
"Axis value may contain (0 or ‘index’) or (1 or ‘columns’). Its default value is 0.\n",
"\n",
"We set axis = 0 or ‘index’ to drop rows which contain missing values.\n",
"\n",
"We set axis = 1 or ‘columns’ to drop columns which contain missing values.\n",
"\n",
"\n",
"\n",
"After dropping the missing values, we can again check for missing values and the dimensions of the dataframe.\n",
"\n",
"\n",
"\n",
"again check the missing values in each column\n",
"\n",
"`df.isnull().sum()`\n",
"\n",
"\n",
"again check the dimensions of the dataset\n",
"\n",
"`df.shape`\n",
"\n",
"\n",
"\n",
"But, this method has one disadvantage: it risks losing useful information. Suppose there are lots of missing values in our dataset. If we drop them, we may end up throwing away valuable information along with the missing data. So, dropping is only advised when there are few missing values in the dataset.\n",
"\n",
"\n",
"So, it's better to develop an imputation strategy so that we can impute missing values with the mean or the median of the row or column containing the missing values.\n",
"\n"
]
},
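The effect of the axis parameter described above can be sketched on a tiny hypothetical frame (not part of the wiki4HE data):

```python
import numpy as np
import pandas as pd

# One NaN in column 'A', row 1
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [4.0, 5.0, 6.0]})

rows_kept = demo.dropna(axis=0)  # drops the row that contains NaN
cols_kept = demo.dropna(axis=1)  # drops the column that contains NaN

print(rows_kept.shape)  # (2, 2)
print(cols_kept.shape)  # (3, 1)
```

Neither call modifies `demo` itself; dropna() returns a new object unless `inplace=True` is passed.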
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Copy the dataframe df_sub\n",
"\n",
"df1 = df_sub.copy()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 23\n",
"Exp1 13\n",
"Exp2 11\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View the number of missing values in each column of dataframe df1\n",
"\n",
"print(df1.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"The column **Exp2** contains the fewest missing values. So, I will drop that column from df1."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Drop column Exp2 from df1\n",
"\n",
"df1 = df1.drop('Exp2', axis = 1)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" YEARSEXP Exp1 Exp3 Exp4 Exp5\n",
"0 14.0 4.0 4.0 1.0 2.0\n",
"1 18.0 2.0 4.0 2.0 4.0\n",
"2 13.0 2.0 2.0 1.0 3.0\n",
"3 13.0 4.0 3.0 3.0 4.0\n",
"4 8.0 5.0 5.0 4.0 4.0\n"
]
}
],
"source": [
"# View the first five rows of dataframe df1\n",
"\n",
"print(df1.head())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 5 columns):\n",
"YEARSEXP 890 non-null float64\n",
"Exp1 900 non-null float64\n",
"Exp3 900 non-null float64\n",
"Exp4 899 non-null float64\n",
"Exp5 900 non-null float64\n",
"dtypes: float64(5)\n",
"memory usage: 35.7 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of dataframe df1\n",
"\n",
"print(df1.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Conclusion**\n",
"\n",
"I have dropped the **Exp2** column from the dataframe df1 with df1.drop() command."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Fill missing values with a test statistic\n",
"\n",
"In this method, we fill the missing values with a test statistic such as the mean, median or mode of the feature the missing value belongs to. One can also specify a forward-fill to propagate the previous value forward, or a back-fill to propagate the next value backward.\n",
"\n",
"\n",
"Filling missing values with a test statistic like median\n",
"\n",
"`median = df['col_name'].median()`\n",
"\n",
"`df['col_name'].fillna(value = median, inplace = True )`\n",
"\n",
"\n",
"\n",
"We can also use replace() in place of fillna()\n",
"\n",
"`df['col_name'].replace(to_replace = np.nan, value = median, inplace = True)`\n",
"\n",
"\n",
"If we choose this method, then we should compute the median value on the training set and use it to fill the missing values in the training set. Then we should save the median value that we have computed. Later, we will replace missing values in the test set with the median value to evaluate the system.\n"
]
},
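The forward-fill and back-fill options mentioned above can be sketched on a small hypothetical Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # propagate the previous value forward
backward = s.bfill()  # propagate the next value backward

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
```

These are useful for ordered data such as time series, where neighbouring observations are likely to be similar.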
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Copy the df1 dataframe\n",
"\n",
"df2 = df1.copy()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 23\n",
"Exp1 13\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# View the number of missing values in each column of dataframe df2\n",
"\n",
"print(df2.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that the **YEARSEXP** column contains 23 missing values. I will fill the missing values in the **YEARSEXP** column with the median of that column."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Fill missing values in YEARSEXP column with median of YEARSEXP column.\n",
"\n",
"median = df2['YEARSEXP'].median()\n",
"\n",
"df2['YEARSEXP'].fillna(value = median, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 5 columns):\n",
"YEARSEXP 913 non-null float64\n",
"Exp1 900 non-null float64\n",
"Exp3 900 non-null float64\n",
"Exp4 899 non-null float64\n",
"Exp5 900 non-null float64\n",
"dtypes: float64(5)\n",
"memory usage: 35.7 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of df2 dataframe \n",
"\n",
"print(df2.info())"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 0\n",
"Exp1 13\n",
"Exp3 13\n",
"Exp4 14\n",
"Exp5 13\n",
"dtype: int64\n"
]
}
],
"source": [
"# Again view the number of missing values in each column of dataframe df2\n",
"\n",
"print(df2.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"I have filled all the missing values of the **YEARSEXP** column with its median value. Now, this column has no missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6.\tFill missing values with Imputer\n",
"\n",
"Scikit-Learn provides the Imputer class to deal with missing values. In this method, we replace each missing value with the mean value of the entire feature column. This can be done as shown in the following code:\n",
"\n",
"\n",
"`from sklearn.preprocessing import Imputer`\n",
"\n",
"`imp = Imputer(missing_values='NaN', strategy='mean', axis=0)`\n",
"\n",
"\n",
"`imputed_data = imp.fit_transform(df)`\n",
"\n",
"`imputed_data`\n",
"\n",
"\n",
"Here, I have replaced each ‘NaN’ value with the corresponding mean value. The mean value is separately calculated for each feature column. If instead of axis = 0, we set axis = 1, then mean values are calculated for each row. \n",
"\n",
"\n",
"Other options for strategy parameter are ‘median’ or ‘most_frequent’. The ‘most_frequent’ parameter replaces the missing values with the most frequent value. It is useful for imputing categorical feature values.\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[14. 4. 4. 1. 2.]\n",
" [18. 2. 4. 2. 4.]\n",
" [13. 2. 2. 1. 3.]\n",
" ...\n",
" [ 9. 5. 5. 4. 1.]\n",
" [10. 4. 2. 1. 1.]\n",
" [12. 2. 3. 1. 1.]]\n"
]
}
],
"source": [
"# Fill missing values with Imputer\n",
"\n",
"from sklearn.preprocessing import Imputer\n",
"\n",
"imp = Imputer(missing_values='NaN', strategy='mean', axis=0)\n",
"\n",
"df2 = imp.fit_transform(df2)\n",
"\n",
"print(df2)\n"
]
},
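Note that in scikit-learn 0.22 and later, the Imputer class used above was removed; its replacement is `sklearn.impute.SimpleImputer` (same strategies, no axis parameter). A minimal equivalent sketch on hypothetical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# strategy='mean' replaces each NaN with its column's mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_imputed = imp.fit_transform(X)

print(X_imputed[1, 0])  # 4.0 (mean of 1.0 and 7.0)
```

As with Imputer, fit_transform() returns a NumPy array, so the dataframe must be rebuilt afterwards if column names are needed.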
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Imputer's fit_transform() returns a NumPy array.\n",
"\n",
"# So, we need to convert it back into a dataframe.\n",
"\n",
"columnnames = ['YEARSEXP', 'Exp1', 'Exp3', 'Exp4', 'Exp5']\n",
"\n",
"df2 = pd.DataFrame(df2, columns = columnnames)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" YEARSEXP Exp1 Exp3 Exp4 Exp5\n",
"0 14.0 4.0 4.0 1.0 2.0\n",
"1 18.0 2.0 4.0 2.0 4.0\n",
"2 13.0 2.0 2.0 1.0 3.0\n",
"3 13.0 4.0 3.0 3.0 4.0\n",
"4 8.0 5.0 5.0 4.0 4.0\n"
]
}
],
"source": [
"# View the first 5 rows of imputed dataframe df2\n",
"\n",
"print(df2.head())"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 913 entries, 0 to 912\n",
"Data columns (total 5 columns):\n",
"YEARSEXP 913 non-null float64\n",
"Exp1 913 non-null float64\n",
"Exp3 913 non-null float64\n",
"Exp4 913 non-null float64\n",
"Exp5 913 non-null float64\n",
"dtypes: float64(5)\n",
"memory usage: 35.7 KB\n",
"None\n"
]
}
],
"source": [
"# View the summary of the imputed dataframe df2\n",
"\n",
"print(df2.info())"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"YEARSEXP 0\n",
"Exp1 0\n",
"Exp3 0\n",
"Exp4 0\n",
"Exp5 0\n",
"dtype: int64\n"
]
}
],
"source": [
"# Agian check that there are no missing values in df2\n",
"\n",
"print(df2.isnull().sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"We can see that there are no missing numerical values in the columns of dataframe df2. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.\tBuild a prediction model\n",
"\n",
"We can build a prediction model to handle missing values. In this method, we divide our data set into two sets – training set and test set. Training set does not contain any missing values and test set contains missing values. The variable containing missing values can be treated as a target variable. Next, we create a model to predict target variable and use it to populate missing values of test data set. \n"
]
},
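  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of this approach (on a small toy dataframe, not the survey data used above), we can train a simple regression model on the complete rows and use its predictions to fill the missing target values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Toy dataframe - YEARSEXP plays the role of the variable with missing values\n",
    "toy = pd.DataFrame({'YEARSEXP': [14, 18, np.nan, 13, np.nan, 8],\n",
    "                    'Exp1':     [4, 2, 2, 4, 5, 5],\n",
    "                    'Exp3':     [4, 4, 2, 3, 2, 5]})\n",
    "\n",
    "# Training set: rows where the target is present; test set: rows where it is missing\n",
    "train = toy[toy['YEARSEXP'].notnull()]\n",
    "test = toy[toy['YEARSEXP'].isnull()]\n",
    "\n",
    "# Fit the model on the complete rows and predict the missing target values\n",
    "model = LinearRegression()\n",
    "model.fit(train[['Exp1', 'Exp3']], train['YEARSEXP'])\n",
    "toy.loc[toy['YEARSEXP'].isnull(), 'YEARSEXP'] = model.predict(test[['Exp1', 'Exp3']])\n",
    "\n",
    "print(toy['YEARSEXP'].isnull().sum())\n"
   ]
  },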
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8.\tKNN Imputation\n",
"\n",
"In this method, the missing values of an attribute are imputed using the given number of attributes that are mostly similar to the attribute whose values are missing. The similarity of attributes is determined using a distance function.\n",
"\n",
"The above two mmethods are more sophisticated methods to deal with missing numerical values. Hence, I will not go into much detail.\n"
]
},
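  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For completeness, here is a minimal sketch of KNN imputation using scikit-learn's KNNImputer (available from scikit-learn 0.22 onwards), on a small toy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.impute import KNNImputer\n",
    "\n",
    "X = np.array([[1.0, 2.0, np.nan],\n",
    "              [3.0, 4.0, 3.0],\n",
    "              [np.nan, 6.0, 5.0],\n",
    "              [8.0, 8.0, 7.0]])\n",
    "\n",
    "# Each missing entry is replaced by the mean of that feature\n",
    "# across the n_neighbors rows closest in the observed features\n",
    "imputer = KNNImputer(n_neighbors=2)\n",
    "X_imputed = imputer.fit_transform(X)\n",
    "\n",
    "print(X_imputed)\n"
   ]
  },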
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Check with ASSERT statement\n",
"\n",
"\n",
"Finally, we can check for missing values programmatically. If we drop or fill missing values, we expect no missing values. We can write an assert statement to verify this. So, we can use an assert statement to programmatically check that no missing or unexpected ‘0’ value is present. This gives confidence that our code is running properly.\n",
"Assert statement will return nothing if the value being tested is true and will throw an AssertionError if the value is false.\n",
"\n",
"Asserts\n",
"\n",
"•\tassert 1 == 1 (return Nothing if the value is True)\n",
"\n",
"•\tassert 1 == 2 (return AssertionError if the value is False)\n",
"\n",
"\n",
"\n",
"assert that there are no missing values in the dataframe\n",
"\n",
"`assert pd.notnull(df).all().all()`\n",
"\n",
"\n",
"assert that there are no missing values for a particular column in dataframe\n",
"\n",
"`assert df.column_name.notnull().all()`\n",
"\n",
"\n",
"assert all values are greater than 0\n",
"\n",
"`assert (df >=0).all().all()`\n",
"\n",
"\n",
"assert no entry in a column is equal to 0\n",
"\n",
"`assert (df['column_name']!=0).all().all()`\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# assert that there are no missing values in the dataframe df2\n",
"\n",
"assert pd.notnull(df2).all().all()\n",
"\n",
"\n",
"# When I run the above command, it returns nothing. Hence the assert statement is true. \n",
"\n",
"# So, there are no missing values in dataframe df2.\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# assert that there are no missing values for a particular column in the dataframe\n",
"\n",
"assert df2['YEARSEXP'].notnull().all()\n",
"\n",
"assert df2['Exp1'].notnull().all()\n",
"\n",
"assert df2['Exp3'].notnull().all()\n",
"\n",
"assert df2['Exp4'].notnull().all()\n",
"\n",
"assert df2['Exp5'].notnull().all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Interpretation**\n",
"\n",
"When I run the above commands, it returns nothing. Hence the assert statements are true. Hence, there are no missing values in df2 dataframe.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes our discussion on missing values."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}