AhmedAlhallag/Lab 2_Data Cleaning Noise_NoSol.ipynb

## Lab 2_Data Cleaning Noise_NoSol.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "### Lab 2: Data Cleaning (Detecting & Handling Outliers)"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "42964d7c",
      "cell_type": "code",
      "source": "import pandas as pd\n\ndf = pd.read_csv(\"dataset_InLab_v2.csv\")\n  ",
      "execution_count": 208,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "f917546d",
      "cell_type": "code",
      "source": "df",
      "execution_count": 209,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td>%</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>?</td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>?</td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0       %  0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3          ?      4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7          ?      5  999999999.0       1  0.40         yes"
          },
          "execution_count": 209,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "254aa548",
      "cell_type": "code",
      "source": "import numpy as np",
      "execution_count": 4,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "3c1964f0",
      "cell_type": "code",
      "source": "def find_iqr(col):\n    q3, q1 = np.percentile(col, [75, 25])\n    return q1, q3, q3 - q1 ",
      "execution_count": 85,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "ab0689a5",
      "cell_type": "code",
      "source": "# find the iqr for Itsms column\nq3, q1, iqr = find_iqr(df['Items'])",
      "execution_count": 204,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "9c9e299c",
      "cell_type": "code",
      "source": "def get_outliers(col):\n    q1,q3, iqr =  find_iqr(col)\n    outs = []\n    for x in col:\n        if x<(q1-1.5*iqr) or x>(q3+1.5*iqr):\n            outs.append(x)\n    return outs ",
      "execution_count": 205,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "f65f52a8",
      "cell_type": "code",
      "source": "\nout_items = get_outliers(df[\"Items\"])\nout_price = get_outliers(df[\"Price\"])",
      "execution_count": 210,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "812ec8c8",
      "cell_type": "code",
      "source": "print(out_items, out_price) ",
      "execution_count": 211,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "[4300] [999999999.0]\n"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "b44b865b",
      "cell_type": "code",
      "source": "# test replcaing a value in a column in the DF\ndf['Items'].replace([4300],200)\n\n\n# Important notes:\n# 1) [For multiple replacements] if you had multiple outliers within the same columns you can do the following:\ndf['Items'].replace([4300, 8928, 7861], 200)\n                    ",
      "execution_count": 219,
      "outputs": [
        {
          "data": {
            "text/plain": "0      3\n1      2\n2      4\n3    200\n4      6\n5      4\n6      5\nName: Items, dtype: int64"
          },
          "execution_count": 219,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "f11e58f4",
      "cell_type": "code",
      "source": "df # unchanged!",
      "execution_count": 218,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3</td>\n      <td>500.0</td>\n      <td>%</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>?</td>\n      <td>4</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4300</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>?</td>\n      <td>5</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo      3        500.0       %  0.25         yes\n1   2  Singapore      2       1500.0       0  0.31         yes\n2   3          ?      4       2000.0       1  1.20         yes\n3   4      Tokyo   4300       1300.0       #  0.80          no\n4   5  Melbourne      6          1.3      --  0.50          no\n5   6  Singapore      4       1800.0       1  0.30          no\n6   7          ?      5  999999999.0       1  0.40         yes"
          },
          "execution_count": 218,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "3226472b",
      "cell_type": "code",
      "source": "def find_replace(col, outs, rep):\n        return col.replace([outs], rep)    ",
      "execution_count": 220,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "ee60d747",
      "cell_type": "code",
      "source": "# Prepare replacement: Get the mean for the Items column\nitem_unique = []\n\nfor val in df['Items']:\n    if val not in out_items:\n        item_unique.append(val)",
      "execution_count": 222,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "ef73822b",
      "cell_type": "code",
      "source": "item_unique",
      "execution_count": 223,
      "outputs": [
        {
          "data": {
            "text/plain": "[3, 2, 4, 6, 4, 5]"
          },
          "execution_count": 223,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "fdf9a8ed",
      "cell_type": "code",
      "source": "mean_items = np.mean(item_unique)",
      "execution_count": 227,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "5c8c749b",
      "cell_type": "code",
      "source": "mean_items",
      "execution_count": 228,
      "outputs": [
        {
          "data": {
            "text/plain": "4.0"
          },
          "execution_count": 228,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "de32bc59",
      "cell_type": "code",
      "source": "# Handle Outliers just like missing values, replae and persist the change\n\ndf['Items'] = find_replace(df[\"Items\"], out_items, mean_items)",
      "execution_count": 229,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "c5f694af",
      "cell_type": "code",
      "source": "df # notice: 4300 is not there anymore",
      "execution_count": 231,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3.0</td>\n      <td>500.0</td>\n      <td>%</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2.0</td>\n      <td>1500.0</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>?</td>\n      <td>4.0</td>\n      <td>2000.0</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4.0</td>\n      <td>1300.0</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6.0</td>\n      <td>1.3</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4.0</td>\n      <td>1800.0</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>?</td>\n      <td>5.0</td>\n      <td>999999999.0</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items        Price Payment   Tax Transaction\n0   1    Colombo    3.0        500.0       %  0.25         yes\n1   2  Singapore    2.0       1500.0       0  0.31         yes\n2   3          ?    4.0       2000.0       1  1.20         yes\n3   4      Tokyo    4.0       1300.0       #  0.80          no\n4   5  Melbourne    6.0          1.3      --  0.50          no\n5   6  Singapore    4.0       1800.0       1  0.30          no\n6   7          ?    5.0  999999999.0       1  0.40         yes"
          },
          "execution_count": 231,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "889b76fd",
      "cell_type": "markdown",
      "source": "### Activity: Do the same steps for the Price's outlier value"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "44748f6a",
      "cell_type": "code",
      "source": "# Do the same step for Price's outlier\n# 1- Detect the outliers in Price\n# 2- Grab the mean for the Price column WITHOUT the outliers effect\n# 3- Handle Outliers as if they were missing values (repelace them with the mean)",
      "execution_count": 232,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "id": "2d7fc9bd",
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "id": "6a8c8925",
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "id": "16597de7",
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "id": "e3dc93e1",
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": true
      },
      "id": "56029df2",
      "cell_type": "code",
      "source": "",
      "execution_count": null,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "eb94c7c9",
      "cell_type": "code",
      "source": "# final result with no outliers in Price nor in items\ndf",
      "execution_count": 239,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>City</th>\n      <th>Items</th>\n      <th>Price</th>\n      <th>Payment</th>\n      <th>Tax</th>\n      <th>Transaction</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>Colombo</td>\n      <td>3.0</td>\n      <td>500.00</td>\n      <td>%</td>\n      <td>0.25</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>Singapore</td>\n      <td>2.0</td>\n      <td>1500.00</td>\n      <td>0</td>\n      <td>0.31</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>?</td>\n      <td>4.0</td>\n      <td>2000.00</td>\n      <td>1</td>\n      <td>1.20</td>\n      <td>yes</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>Tokyo</td>\n      <td>4.0</td>\n      <td>1300.00</td>\n      <td>#</td>\n      <td>0.80</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>Melbourne</td>\n      <td>6.0</td>\n      <td>1.30</td>\n      <td>--</td>\n      <td>0.50</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>6</td>\n      <td>Singapore</td>\n      <td>4.0</td>\n      <td>1800.00</td>\n      <td>1</td>\n      <td>0.30</td>\n      <td>no</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>7</td>\n      <td>?</td>\n      <td>5.0</td>\n      <td>1183.55</td>\n      <td>1</td>\n      <td>0.40</td>\n      <td>yes</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
            "text/plain": "   ID       City  Items    Price Payment   Tax Transaction\n0   1    Colombo    3.0   500.00       %  0.25         yes\n1   2  Singapore    2.0  1500.00       0  0.31         yes\n2   3          ?    4.0  2000.00       1  1.20         yes\n3   4      Tokyo    4.0  1300.00       #  0.80          no\n4   5  Melbourne    6.0     1.30      --  0.50          no\n5   6  Singapore    4.0  1800.00       1  0.30          no\n6   7          ?    5.0  1183.55       1  0.40         yes"
          },
          "execution_count": 239,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "dfccf02a",
      "cell_type": "code",
      "source": "df.to_excel(\"noise_cleaned.xlsx\",index=False)",
      "execution_count": 240,
      "outputs": []
    },
    {
      "metadata": {},
      "id": "18d40566",
      "cell_type": "markdown",
      "source": "### [Self-Paced] Assessment 1: Scraping (Using Only Pandas, no automation) & Cleaning IMDB's Top Movies Dataset "
    },
    {
      "metadata": {},
      "id": "723276d3",
      "cell_type": "markdown",
      "source": "- (Without using Selenium; NOT THE BEST WAY!) Scrape the IMBD's Top Movies using the following:"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "839f346d",
      "cell_type": "code",
      "source": "df = pd.read_html('https://www.imdb.com/chart/top/')",
      "execution_count": 244,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "21155477",
      "cell_type": "code",
      "source": "df = df[0]\ndf",
      "execution_count": 245,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Unnamed: 0</th>\n      <th>Rank &amp; Title</th>\n      <th>IMDb Rating</th>\n      <th>Your Rating</th>\n      <th>Unnamed: 4</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>NaN</td>\n      <td>1.  The Shawshank Redemption  (1994)</td>\n      <td>9.2</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>NaN</td>\n      <td>2.  The Godfather  (1972)</td>\n      <td>9.2</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>NaN</td>\n      <td>3.  The Dark Knight  (2008)</td>\n      <td>9.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>NaN</td>\n      <td>4.  The Godfather Part II  (1974)</td>\n      <td>9.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>NaN</td>\n      <td>5.  12 Angry Men  (1957)</td>\n      <td>9.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>245</th>\n      <td>NaN</td>\n      <td>246.  Dersu Uzala  (1975)</td>\n      <td>8.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>246</th>\n      <td>NaN</td>\n      <td>247.  The Help  (2011)</td>\n      <td>8.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>247</th>\n      <td>NaN</td>\n      <td>248.  Aladdin  (1992)</td>\n      <td>8.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>248</th>\n      <td>NaN</td>\n      <td>249.  Gandhi  (1982)</td>\n      <td>8.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>249</th>\n      <td>NaN</td>\n      <td>250.  Dances with Wolves  (1990)</td>\n      <td>8.0</td>\n      <td>12345678910 NOT YET RELEASED  Seen</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n<p>250 rows × 5 columns</p>\n</div>",
            "text/plain": "     Unnamed: 0                          Rank & Title  IMDb Rating  \\\n0           NaN  1.  The Shawshank Redemption  (1994)          9.2   \n1           NaN             2.  The Godfather  (1972)          9.2   \n2           NaN           3.  The Dark Knight  (2008)          9.0   \n3           NaN     4.  The Godfather Part II  (1974)          9.0   \n4           NaN              5.  12 Angry Men  (1957)          9.0   \n..          ...                                   ...          ...   \n245         NaN             246.  Dersu Uzala  (1975)          8.0   \n246         NaN                247.  The Help  (2011)          8.0   \n247         NaN                 248.  Aladdin  (1992)          8.0   \n248         NaN                  249.  Gandhi  (1982)          8.0   \n249         NaN      250.  Dances with Wolves  (1990)          8.0   \n\n                            Your Rating  Unnamed: 4  \n0    12345678910 NOT YET RELEASED  Seen         NaN  \n1    12345678910 NOT YET RELEASED  Seen         NaN  \n2    12345678910 NOT YET RELEASED  Seen         NaN  \n3    12345678910 NOT YET RELEASED  Seen         NaN  \n4    12345678910 NOT YET RELEASED  Seen         NaN  \n..                                  ...         ...  \n245  12345678910 NOT YET RELEASED  Seen         NaN  \n246  12345678910 NOT YET RELEASED  Seen         NaN  \n247  12345678910 NOT YET RELEASED  Seen         NaN  \n248  12345678910 NOT YET RELEASED  Seen         NaN  \n249  12345678910 NOT YET RELEASED  Seen         NaN  \n\n[250 rows x 5 columns]"
          },
          "execution_count": 245,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "2c999367",
      "cell_type": "markdown",
      "source": "#### Notice the following problems that needs to be cleaned\n- 'Unnamed: 0', 'Unnamed: 4' and 'Your Rating' columns should be deleted entirely\n- \"Rank & Title\" Column needs to be split into two columns: \"Rank\", \"Title\"\n- Thus, Rank should contain the numerical values and title will hold only the title values\n- Furthermore, \"Title\" needs to be split into: \"Title\", \"Year\" \n"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "6446b179",
      "cell_type": "code",
      "source": "# Hint: \n# To split a column into two based on a delimiter (. in this example)\n# df[ [col1, col2] ] = df[ col ].str.split(\".\", expand=True)\n\n# if the delimiter had multiple occurences in a row, and you want to split baed on the first occurence only, add \n# the 'n=1' key argument\n\n# df[ [col1, col2] ] = df[ col ].str.split(\".\", expand=True, n = 1)\n\n\n",
      "execution_count": 246,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "adda2068",
      "cell_type": "code",
      "source": "",
      "execution_count": 252,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "1f5f79a5",
      "cell_type": "code",
      "source": "",
      "execution_count": 254,
      "outputs": []
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "18b0336b",
      "cell_type": "code",
      "source": "# Hint:\n# To drop columns in-place use the following:\n# df.drop(columns=[col1, col2, ...], inplace = True)\n\n\n",
      "execution_count": 259,
      "outputs": []
    },
    {
      "metadata": {},
      "id": "95a7216d",
      "cell_type": "markdown",
      "source": "#### Final outcome you should achieve"
    },
    {
      "metadata": {
        "trusted": false
      },
      "id": "dfa63292",
      "cell_type": "code",
      "source": "df",
      "execution_count": 257,
      "outputs": [
        {
          "data": {
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>IMDb Rating</th>\n      <th>Year</th>\n      <th>Rank</th>\n      <th>Title</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>9.2</td>\n      <td>1994</td>\n      <td>1</td>\n      <td>The Shawshank Redemption</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>9.2</td>\n      <td>1972</td>\n      <td>2</td>\n      <td>The Godfather</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>9.0</td>\n      <td>2008</td>\n      <td>3</td>\n      <td>The Dark Knight</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>9.0</td>\n      <td>1974</td>\n      <td>4</td>\n      <td>The Godfather Part II</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>9.0</td>\n      <td>1957</td>\n      <td>5</td>\n      <td>12 Angry Men</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>245</th>\n      <td>8.0</td>\n      <td>1975</td>\n      <td>246</td>\n      <td>Dersu Uzala</td>\n    </tr>\n    <tr>\n      <th>246</th>\n      <td>8.0</td>\n      <td>2011</td>\n      <td>247</td>\n      <td>The Help</td>\n    </tr>\n    <tr>\n      <th>247</th>\n      <td>8.0</td>\n      <td>1992</td>\n      <td>248</td>\n      <td>Aladdin</td>\n    </tr>\n    <tr>\n      <th>248</th>\n      <td>8.0</td>\n      <td>1982</td>\n      <td>249</td>\n      <td>Gandhi</td>\n    </tr>\n    <tr>\n      <th>249</th>\n      <td>8.0</td>\n      <td>1990</td>\n      <td>250</td>\n      <td>Dances with Wolves</td>\n    </tr>\n  </tbody>\n</table>\n<p>250 rows × 4 columns</p>\n</div>",
            "text/plain": "     IMDb Rating  Year Rank                         Title\n0            9.2  1994    1    The Shawshank Redemption  \n1            9.2  1972    2               The Godfather  \n2            9.0  2008    3             The Dark Knight  \n3            9.0  1974    4       The Godfather Part II  \n4            9.0  1957    5                12 Angry Men  \n..           ...   ...  ...                           ...\n245          8.0  1975  246                 Dersu Uzala  \n246          8.0  2011  247                    The Help  \n247          8.0  1992  248                     Aladdin  \n248          8.0  1982  249                      Gandhi  \n249          8.0  1990  250          Dances with Wolves  \n\n[250 rows x 4 columns]"
          },
          "execution_count": 257,
          "metadata": {},
          "output_type": "execute_result"
        }
      ]
    },
    {
      "metadata": {},
      "id": "0c5732b4",
      "cell_type": "markdown",
      "source": "### [Self-Paced] Assessment 2: Scraping & Cleaning a Pokemon Dataset for an Online Game-Card Trading Platform   "
    },
    {
      "metadata": {},
      "id": "4f730047",
      "cell_type": "markdown",
      "source": "- Scrape the following website(using Selenium Automation is a MUST): https://pokemondb.net/pokedex/all\n- Clean it when necessarya nd store the result in an excel sheet\n- Watch the recording on Selnium Web Scraping + Yelp project to get a more detailed overview "
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.8.8",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "gist": {
      "id": "",
      "data": {
        "description": "Desktop/Lab 2_Data Cleaning Noise-Copy1.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}