AhmedAlhallag/Lab1.2_DataCleaning(MissingValues)-Copy1.ipynb

## Lab1.2_DataCleaning(MissingValues)-Copy1.ipynb
{
  "cells": [
    {
      "metadata": {},
      "id": "e36c3286",
      "cell_type": "markdown",
      "source": "# KH4005CEM Lab 1.2: Detecting & Handling Missing Values "
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "61c76ffa",
      "cell_type": "code",
      "source": "# Note: This step assumes that you grabed the values (as an .xlsx file or as a .csv file) from somewhere\n# either from kaggle, or web scraped\n\n\n# importing Vlues dataset using pandas library \nimport pandas as pd \n\n# df = pd.read_csv(\"dataset.csv\")\ndf = pd.read_csv(\"dataset_InLab.csv\")\n",
      "execution_count": 272,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "d4c52155",
      "cell_type": "code",
      "source": "print(df)\ndf\n",
      "execution_count": 270,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0       ?  0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3          ?      4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7          ?      5  999999999.0       1  0.40         yes\n"
        },
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td>?</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>?</td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>?</td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0       ?  0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3          ?      4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7          ?      5  999999999.0       1  0.40         yes"
          },
          "execution_count": 270,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "86e7150c",
      "cell_type": "markdown",
      "source": "## Intro to Regular Expression\n- Check the regex quick guide on Notion Page\n- Replace any missing valus, such as: special symbols, empty strings, speacial words (NA, null, etc..) with a pythonic NaN data type\n- NaN (Not a Number) data types are defined in the numpy library\n- Converting missing values to NaN types will allow you to generate a summary of columns containing missing values"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "c1b7ec40",
      "cell_type": "code",
      "source": "import numpy as np\n\n# replace empty strings and Question marks with a nan\ndf.replace(r\"\\?|^$\",\"\", regex=True)\n",
      "execution_count": 273,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td></td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td></td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td></td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0          0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3                 4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7                 5  999999999.0       1  0.40         yes"
          },
          "execution_count": 273,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "00630714",
      "cell_type": "markdown",
      "source": "###  Generate a summary report of Missing Values"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "157abfd3",
      "cell_type": "code",
      "source": "df.isna().sum()\n",
      "execution_count": 256,
      "outputs": [
        {
          "data": {
            "text/plain": "ID             0\nCity           2\nItems          0\nPrice          0\nPayment        1\nTax            0\nTransaction    0\ndtype: int64"
          },
          "execution_count": 256,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "9b8ddaea",
      "cell_type": "markdown",
      "source": "### Grab list of column names that contains missing values"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "06712d5b",
      "cell_type": "code",
      "source": "df.columns[df.isna().any()]\n# Convert it into a list\ndf.columns[df.isna().any()].to_list()\n",
      "execution_count": 257,
      "outputs": [
        {
          "data": {
            "text/plain": "['City', 'Payment']"
          },
          "execution_count": 257,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "8d42062c",
      "cell_type": "code",
      "source": "df",
      "execution_count": 258,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td>NaN</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>NaN</td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>NaN</td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0     NaN  0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3        NaN      4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7        NaN      5  999999999.0       1  0.40         yes"
          },
          "execution_count": 258,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "2e9637ea",
      "cell_type": "markdown",
      "source": "### Put all of the previous work to Action \n#### Create a function that replaces missing values & one that detects them"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "c7b54d79",
      "cell_type": "code",
      "source": "def detect_missing(df):\n    df = convert_missing_to(df, np.nan)\n    cols_with_missing = df.columns[df.isna().any()].tolist()\n    if cols_with_missing:\n        return True, cols_with_missing \n    return False, None\n\ndef convert_missing_to(df, cri):\n    return df.replace(r\"\\?|^$\",cri, regex=True)\n\n    \n    ",
      "execution_count": 274,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "1a11b3ca",
      "cell_type": "code",
      "source": "# Step 1: detect?\ndetect_missing(df)\n\n# Step 2: convert missing to NaN\ndf_withNan = convert_missing_to(df,np.nan)\n\n# Step 3: Generate summary \ndf_withNan.isna().sum()\n\n",
      "execution_count": 275,
      "outputs": [
        {
          "data": {
            "text/plain": "ID             0\nCity           2\nItems          0\nPrice          0\nPayment        1\nTax            0\nTransaction    0\ndtype: int64"
          },
          "execution_count": 275,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "23cfa9a9",
      "cell_type": "markdown",
      "source": "## Handle Missing "
    },
    {
      "metadata": {},
      "id": "c735068c",
      "cell_type": "markdown",
      "source": "### To handle missing data, we can get the mode for categorical columns (\"City\") or the mean for numerical columns (Payment)\n- Note: there are other ways as well. We will focus on the statistical ones for this lab"
    },
    {
      "metadata": {},
      "id": "6ff7b5b0",
      "cell_type": "markdown",
      "source": "### Before we get the mode for the City column, we need to make sure missing values are removed (generating a temporary version of the dataframe to get the mode from)"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "fdebfaeb",
      "cell_type": "code",
      "source": "# convert missing to empty (empty strings are easier to exclude than the nans)\ndf_empty = convert_missing_to(df,\"\")",
      "execution_count": 276,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "0b12fe66",
      "cell_type": "code",
      "source": "df_empty",
      "execution_count": 277,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td></td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td></td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td></td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0          0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3                 4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7                 5  999999999.0       1  0.40         yes"
          },
          "execution_count": 277,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "c06cfff7",
      "cell_type": "code",
      "source": "# create a copy of the City column that do not contain missing values (delete missing values)\n\nCity_unique = []\nfor val in df_empty[\"City\"].to_list():\n    if val:\n        City_unique.append(val)",
      "execution_count": 278,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "2f766454",
      "cell_type": "code",
      "source": "City_unique",
      "execution_count": 279,
      "outputs": [
        {
          "data": {
            "text/plain": "['Colombo', 'Singapore', 'Tokyo', 'Melbourne', 'Singapore']"
          },
          "execution_count": 279,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "c40b786e",
      "cell_type": "code",
      "source": "# import stats from scipy to get the mode method \nfrom scipy import stats",
      "execution_count": 280,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "a447658c",
      "cell_type": "code",
      "source": "mode_city = stats.mode(City_unique)[0][0]\n",
      "execution_count": 281,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "6b848725",
      "cell_type": "code",
      "source": "# generate another final copy of the dataframe, and replace the missing values from the original table with an \n# alias that is unique to us \ndf_final = convert_missing_to(df,\"TO BE REPLACED\")\n\n# now replace that alias with the mode value\ndf_final['City'] = df_final['City'].apply( lambda x: x.replace(\"TO BE REPLACED\",mode_city))",
      "execution_count": 282,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "174c6ba5",
      "cell_type": "code",
      "source": "df_final",
      "execution_count": 283,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td>TO BE REPLACED</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>Singapore</td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price         Payment   Tax Transaction\n0   1    Colombo      3        500.0  TO BE REPLACED  0.25         yes\n1   2  Singapore      2       1500.0               0  0.31         yes\n2   3  Singapore      4       2000.0               1  1.20         yes\n3   4      Tokyo   4300       1300.0               #  0.80          no\n4   5  Melbourne      6          1.3              --  0.50          no\n5   6  Singapore      4       1800.0               1  0.30          no\n6   7  Singapore      5  999999999.0               1  0.40         yes"
          },
          "execution_count": 283,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "b1b87bc7",
      "cell_type": "markdown",
      "source": "### Activity: Do the same steps for the 'Payment\" Column\n- Note: you would want to replace missing values (after aliasing them) with the \"mean\" this time! since these are numerical.\n- We can continue on the df_empty and operate on the payment column"
    },
    {
      "metadata": {},
      "id": "3056e1a7",
      "cell_type": "markdown",
      "source": "### Detecting & Handling Noise/Outliers (next lab)"
    }
  ],
  "metadata": {
    "_draft": {
      "nbviewer_url": "https://gist.github.com/d6ee8c6dfd782c0838174a17223d28aa"
    },
    "gist": {
      "id": "d6ee8c6dfd782c0838174a17223d28aa",
      "data": {
        "description": "Desktop/DataCleaningLab2_1_NoSol.ipynb",
        "public": true
      }
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.8.8",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
	{
	"cells": [
	{
	"metadata": {},
	"id": "e36c3286",
	"cell_type": "markdown",
	"source": "# KH4005CEM Lab 1.2: Detecting & Handling Missing Values "
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "61c76ffa",
	"cell_type": "code",
	"source": "# Note: This step assumes that you grabed the values (as an .xlsx file or as a .csv file) from somewhere\n# either from kaggle, or web scraped\n\n\n# importing Vlues dataset using pandas library \nimport pandas as pd \n\n# df = pd.read_csv(\"dataset.csv\")\ndf = pd.read_csv(\"dataset_InLab.csv\")\n",
	"execution_count": 272,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "d4c52155",
	"cell_type": "code",
	"source": "print(df)\ndf\n",
	"execution_count": 270,
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 ? 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 ? 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 ? 5 999999999.0 1 0.40 yes\n"
	},
	{
	"data": {
	"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>?</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>?</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>?</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
	"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 ? 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 ? 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 ? 5 999999999.0 1 0.40 yes"
	},
	"execution_count": 270,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"id": "86e7150c",
	"cell_type": "markdown",
	"source": "## Intro to Regular Expression\n- Check the regex quick guide on Notion Page\n- Replace any missing valus, such as: special symbols, empty strings, speacial words (NA, null, etc..) with a pythonic NaN data type\n- NaN (Not a Number) data types are defined in the numpy library\n- Converting missing values to NaN types will allow you to generate a summary of columns containing missing values"
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "c1b7ec40",
	"cell_type": "code",
	"source": "import numpy as np\n\n# replace empty strings and Question marks with a nan\ndf.replace(r\"\\?\|^$\",\"\", regex=True)\n",
	"execution_count": 273,
	"outputs": [
	{
	"data": {
	"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td></td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td></td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td></td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
	"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 5 999999999.0 1 0.40 yes"
	},
	"execution_count": 273,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"id": "00630714",
	"cell_type": "markdown",
	"source": "### Generate a summary report of Missing Values"
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "157abfd3",
	"cell_type": "code",
	"source": "df.isna().sum()\n",
	"execution_count": 256,
	"outputs": [
	{
	"data": {
	"text/plain": "ID 0\nCity 2\nItems 0\nPrice 0\nPayment 1\nTax 0\nTransaction 0\ndtype: int64"
	},
	"execution_count": 256,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"id": "9b8ddaea",
	"cell_type": "markdown",
	"source": "### Grab list of column names that contains missing values"
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "06712d5b",
	"cell_type": "code",
	"source": "df.columns[df.isna().any()]\n# Convert it into a list\ndf.columns[df.isna().any()].to_list()\n",
	"execution_count": 257,
	"outputs": [
	{
	"data": {
	"text/plain": "['City', 'Payment']"
	},
	"execution_count": 257,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "8d42062c",
	"cell_type": "code",
	"source": "df",
	"execution_count": 258,
	"outputs": [
	{
	"data": {
	"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>NaN</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>NaN</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>NaN</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
	"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 NaN 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 NaN 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 NaN 5 999999999.0 1 0.40 yes"
	},
	"execution_count": 258,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"id": "2e9637ea",
	"cell_type": "markdown",
	"source": "### Put all of the previous work to Action \n#### Create a function that replaces missing values & one that detects them"
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "c7b54d79",
	"cell_type": "code",
	"source": "def detect_missing(df):\n df = convert_missing_to(df, np.nan)\n cols_with_missing = df.columns[df.isna().any()].tolist()\n if cols_with_missing:\n return True, cols_with_missing \n return False, None\n\ndef convert_missing_to(df, cri):\n return df.replace(r\"\\?\|^$\",cri, regex=True)\n\n \n ",
	"execution_count": 274,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "1a11b3ca",
	"cell_type": "code",
	"source": "# Step 1: detect?\ndetect_missing(df)\n\n# Step 2: convert missing to NaN\ndf_withNan = convert_missing_to(df,np.nan)\n\n# Step 3: Generate summary \ndf_withNan.isna().sum()\n\n",
	"execution_count": 275,
	"outputs": [
	{
	"data": {
	"text/plain": "ID 0\nCity 2\nItems 0\nPrice 0\nPayment 1\nTax 0\nTransaction 0\ndtype: int64"
	},
	"execution_count": 275,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"id": "23cfa9a9",
	"cell_type": "markdown",
	"source": "## Handle Missing "
	},
	{
	"metadata": {},
	"id": "c735068c",
	"cell_type": "markdown",
	"source": "### To handle missing data, we can get the mode for categorical columns (\"City\") or the mean for numerical columns (Payment)\n- Note: there are other ways as well. We will focus on the statistical ones for this lab"
	},
	{
	"metadata": {},
	"id": "6ff7b5b0",
	"cell_type": "markdown",
	"source": "### Before we get the mode for the City column, we need to make sure missing values are removed (generating a temporary version of the dataframe to get the mode from)"
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "fdebfaeb",
	"cell_type": "code",
	"source": "# convert missing to empty (empty strings are easier to exclude than the nans)\ndf_empty = convert_missing_to(df,\"\")",
	"execution_count": 276,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "0b12fe66",
	"cell_type": "code",
	"source": "df_empty",
	"execution_count": 277,
	"outputs": [
	{
	"data": {
	"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td></td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td></td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td></td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
	"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 5 999999999.0 1 0.40 yes"
	},
	"execution_count": 277,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "c06cfff7",
	"cell_type": "code",
	"source": "# create a copy of the City column that do not contain missing values (delete missing values)\n\nCity_unique = []\nfor val in df_empty[\"City\"].to_list():\n if val:\n City_unique.append(val)",
	"execution_count": 278,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "2f766454",
	"cell_type": "code",
	"source": "City_unique",
	"execution_count": 279,
	"outputs": [
	{
	"data": {
	"text/plain": "['Colombo', 'Singapore', 'Tokyo', 'Melbourne', 'Singapore']"
	},
	"execution_count": 279,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "c40b786e",
	"cell_type": "code",
	"source": "# import stats from scipy to get the mode method \nfrom scipy import stats",
	"execution_count": 280,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "a447658c",
	"cell_type": "code",
	"source": "mode_city = stats.mode(City_unique)[0][0]\n",
	"execution_count": 281,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "6b848725",
	"cell_type": "code",
	"source": "# generate another final copy of the dataframe, and replace the missing values from the original table with an \n# alias that is unique to us \ndf_final = convert_missing_to(df,\"TO BE REPLACED\")\n\n# now replace that alias with the mode value\ndf_final['City'] = df_final['City'].apply( lambda x: x.replace(\"TO BE REPLACED\",mode_city))",
	"execution_count": 282,
	"outputs": []
	},
	{
	"metadata": {
	"trusted": false
	},
	"id": "174c6ba5",
	"cell_type": "code",
	"source": "df_final",
	"execution_count": 283,
	"outputs": [
	{
	"data": {
	"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>TO BE REPLACED</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>Singapore</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>Singapore</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
	"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 TO BE REPLACED 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 Singapore 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 Singapore 5 999999999.0 1 0.40 yes"
	},
	"execution_count": 283,
	"metadata": {},
	"output_type": "execute_result"
	}
	]
	},
	{
	"metadata": {},
	"id": "b1b87bc7",
	"cell_type": "markdown",
	"source": "### Activity: Do the same steps for the 'Payment\" Column\n- Note: you would want to replace missing values (after aliasing them) with the \"mean\" this time! since these are numerical.\n- We can continue on the df_empty and operate on the payment column"
	},
	{
	"metadata": {},
	"id": "3056e1a7",
	"cell_type": "markdown",
	"source": "### Detecting & Handling Noise/Outliers (next lab)"
	}
	],
	"metadata": {
	"_draft": {
	"nbviewer_url": "https://gist.github.com/d6ee8c6dfd782c0838174a17223d28aa"
	},
	"gist": {
	"id": "d6ee8c6dfd782c0838174a17223d28aa",
	"data": {
	"description": "Desktop/DataCleaningLab2_1_NoSol.ipynb",
	"public": true
	}
	},
	"kernelspec": {
	"name": "python3",
	"display_name": "Python 3",
	"language": "python"
	},
	"language_info": {
	"name": "python",
	"version": "3.8.8",
	"mimetype": "text/x-python",
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"pygments_lexer": "ipython3",
	"nbconvert_exporter": "python",
	"file_extension": ".py"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 5
	}