Skip to content

Instantly share code, notes, and snippets.

@AhmedAlhallag
Created February 23, 2023 05:04
Show Gist options
  • Save AhmedAlhallag/7635580787ea226dc090429dcbddd06a to your computer and use it in GitHub Desktop.
Save AhmedAlhallag/7635580787ea226dc090429dcbddd06a to your computer and use it in GitHub Desktop.
Desktop/Lab 2_Data Cleaning Noise-Copy1.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "### Lab 2: Data Cleaning (Detecting & Handling Outliers)"
},
{
"metadata": {
"trusted": false
},
"id": "42964d7c",
"cell_type": "code",
"source": "import pandas as pd\n\ndf = pd.read_csv(\"dataset_InLab_v2.csv\")\n ",
"execution_count": 208,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "f917546d",
"cell_type": "code",
"source": "df",
"execution_count": 209,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>%</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>?</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>?</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 % 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 ? 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 ? 5 999999999.0 1 0.40 yes"
},
"execution_count": 209,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "254aa548",
"cell_type": "code",
"source": "import numpy as np",
"execution_count": 4,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "3c1964f0",
"cell_type": "code",
"source": "def find_iqr(col):\n q3, q1 = np.percentile(col, [75, 25])\n return q1, q3, q3 - q1 ",
"execution_count": 85,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "ab0689a5",
"cell_type": "code",
"source": "# find the iqr for Itsms column\nq3, q1, iqr = find_iqr(df['Items'])",
"execution_count": 204,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "9c9e299c",
"cell_type": "code",
"source": "def get_outliers(col):\n q1,q3, iqr = find_iqr(col)\n outs = []\n for x in col:\n if x<(q1-1.5*iqr) or x>(q3+1.5*iqr):\n outs.append(x)\n return outs ",
"execution_count": 205,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "f65f52a8",
"cell_type": "code",
"source": "\nout_items = get_outliers(df[\"Items\"])\nout_price = get_outliers(df[\"Price\"])",
"execution_count": 210,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "812ec8c8",
"cell_type": "code",
"source": "print(out_items, out_price) ",
"execution_count": 211,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[4300] [999999999.0]\n"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "b44b865b",
"cell_type": "code",
"source": "# test replcaing a value in a column in the DF\ndf['Items'].replace([4300],200)\n\n\n# Important notes:\n# 1) [For multiple replacements] if you had multiple outliers within the same columns you can do the following:\ndf['Items'].replace([4300, 8928, 7861], 200)\n ",
"execution_count": 219,
"outputs": [
{
"data": {
"text/plain": "0 3\n1 2\n2 4\n3 200\n4 6\n5 4\n6 5\nName: Items, dtype: int64"
},
"execution_count": 219,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "f11e58f4",
"cell_type": "code",
"source": "df # unchanged!",
"execution_count": 218,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>%</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>?</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>?</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 % 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 ? 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 ? 5 999999999.0 1 0.40 yes"
},
"execution_count": 218,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "3226472b",
"cell_type": "code",
"source": "def find_replace(col, outs, rep):\n return col.replace([outs], rep) ",
"execution_count": 220,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "ee60d747",
"cell_type": "code",
"source": "# Prepare replacement: Get the mean for the Items column\nitem_unique = []\n\nfor val in df['Items']:\n if val not in out_items:\n item_unique.append(val)",
"execution_count": 222,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "ef73822b",
"cell_type": "code",
"source": "item_unique",
"execution_count": 223,
"outputs": [
{
"data": {
"text/plain": "[3, 2, 4, 6, 4, 5]"
},
"execution_count": 223,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "fdf9a8ed",
"cell_type": "code",
"source": "mean_items = np.mean(item_unique)",
"execution_count": 227,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "5c8c749b",
"cell_type": "code",
"source": "mean_items",
"execution_count": 228,
"outputs": [
{
"data": {
"text/plain": "4.0"
},
"execution_count": 228,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "de32bc59",
"cell_type": "code",
"source": "# Handle Outliers just like missing values, replae and persist the change\n\ndf['Items'] = find_replace(df[\"Items\"], out_items, mean_items)",
"execution_count": 229,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "c5f694af",
"cell_type": "code",
"source": "df # notice: 4300 is not there anymore",
"execution_count": 231,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3.0</td>\n <td>500.0</td>\n <td>%</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2.0</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>?</td>\n <td>4.0</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4.0</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6.0</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4.0</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>?</td>\n <td>5.0</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3.0 500.0 % 0.25 yes\n1 2 Singapore 2.0 1500.0 0 0.31 yes\n2 3 ? 4.0 2000.0 1 1.20 yes\n3 4 Tokyo 4.0 1300.0 # 0.80 no\n4 5 Melbourne 6.0 1.3 -- 0.50 no\n5 6 Singapore 4.0 1800.0 1 0.30 no\n6 7 ? 5.0 999999999.0 1 0.40 yes"
},
"execution_count": 231,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "889b76fd",
"cell_type": "markdown",
"source": "### Activity: Do the same steps for the Price's outlier value"
},
{
"metadata": {
"trusted": false
},
"id": "44748f6a",
"cell_type": "code",
"source": "# Do the same step for Price's outlier\n# 1- Detect the outliers in Price\n# 2- Grab the mean for the Price column WITHOUT the outliers effect\n# 3- Handle Outliers as if they were missing values (repelace them with the mean)",
"execution_count": 232,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"id": "2d7fc9bd",
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"id": "6a8c8925",
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"id": "16597de7",
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"id": "e3dc93e1",
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"id": "56029df2",
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "eb94c7c9",
"cell_type": "code",
"source": "# final result with no outliers in Price nor in items\ndf",
"execution_count": 239,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3.0</td>\n <td>500.00</td>\n <td>%</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2.0</td>\n <td>1500.00</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>?</td>\n <td>4.0</td>\n <td>2000.00</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4.0</td>\n <td>1300.00</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6.0</td>\n <td>1.30</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4.0</td>\n <td>1800.00</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>?</td>\n <td>5.0</td>\n <td>1183.55</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3.0 500.00 % 0.25 yes\n1 2 Singapore 2.0 1500.00 0 0.31 yes\n2 3 ? 4.0 2000.00 1 1.20 yes\n3 4 Tokyo 4.0 1300.00 # 0.80 no\n4 5 Melbourne 6.0 1.30 -- 0.50 no\n5 6 Singapore 4.0 1800.00 1 0.30 no\n6 7 ? 5.0 1183.55 1 0.40 yes"
},
"execution_count": 239,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "dfccf02a",
"cell_type": "code",
"source": "df.to_excel(\"noise_cleaned.xlsx\",index=False)",
"execution_count": 240,
"outputs": []
},
{
"metadata": {},
"id": "18d40566",
"cell_type": "markdown",
"source": "### [Self-Paced] Assessment 1: Scraping (Using Only Pandas, no automation) & Cleaning IMDB's Top Movies Dataset "
},
{
"metadata": {},
"id": "723276d3",
"cell_type": "markdown",
"source": "- (Without using Selenium; NOT THE BEST WAY!) Scrape the IMBD's Top Movies using the following:"
},
{
"metadata": {
"trusted": false
},
"id": "839f346d",
"cell_type": "code",
"source": "df = pd.read_html('https://www.imdb.com/chart/top/')",
"execution_count": 244,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "21155477",
"cell_type": "code",
"source": "df = df[0]\ndf",
"execution_count": 245,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Unnamed: 0</th>\n <th>Rank &amp; Title</th>\n <th>IMDb Rating</th>\n <th>Your Rating</th>\n <th>Unnamed: 4</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>NaN</td>\n <td>1. The Shawshank Redemption (1994)</td>\n <td>9.2</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>1</th>\n <td>NaN</td>\n <td>2. The Godfather (1972)</td>\n <td>9.2</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>2</th>\n <td>NaN</td>\n <td>3. The Dark Knight (2008)</td>\n <td>9.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>3</th>\n <td>NaN</td>\n <td>4. The Godfather Part II (1974)</td>\n <td>9.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>4</th>\n <td>NaN</td>\n <td>5. 12 Angry Men (1957)</td>\n <td>9.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>245</th>\n <td>NaN</td>\n <td>246. Dersu Uzala (1975)</td>\n <td>8.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>246</th>\n <td>NaN</td>\n <td>247. The Help (2011)</td>\n <td>8.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>247</th>\n <td>NaN</td>\n <td>248. Aladdin (1992)</td>\n <td>8.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>248</th>\n <td>NaN</td>\n <td>249. Gandhi (1982)</td>\n <td>8.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>249</th>\n <td>NaN</td>\n <td>250. Dances with Wolves (1990)</td>\n <td>8.0</td>\n <td>12345678910 NOT YET RELEASED Seen</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n<p>250 rows × 5 columns</p>\n</div>",
"text/plain": " Unnamed: 0 Rank & Title IMDb Rating \\\n0 NaN 1. The Shawshank Redemption (1994) 9.2 \n1 NaN 2. The Godfather (1972) 9.2 \n2 NaN 3. The Dark Knight (2008) 9.0 \n3 NaN 4. The Godfather Part II (1974) 9.0 \n4 NaN 5. 12 Angry Men (1957) 9.0 \n.. ... ... ... \n245 NaN 246. Dersu Uzala (1975) 8.0 \n246 NaN 247. The Help (2011) 8.0 \n247 NaN 248. Aladdin (1992) 8.0 \n248 NaN 249. Gandhi (1982) 8.0 \n249 NaN 250. Dances with Wolves (1990) 8.0 \n\n Your Rating Unnamed: 4 \n0 12345678910 NOT YET RELEASED Seen NaN \n1 12345678910 NOT YET RELEASED Seen NaN \n2 12345678910 NOT YET RELEASED Seen NaN \n3 12345678910 NOT YET RELEASED Seen NaN \n4 12345678910 NOT YET RELEASED Seen NaN \n.. ... ... \n245 12345678910 NOT YET RELEASED Seen NaN \n246 12345678910 NOT YET RELEASED Seen NaN \n247 12345678910 NOT YET RELEASED Seen NaN \n248 12345678910 NOT YET RELEASED Seen NaN \n249 12345678910 NOT YET RELEASED Seen NaN \n\n[250 rows x 5 columns]"
},
"execution_count": 245,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "2c999367",
"cell_type": "markdown",
"source": "#### Notice the following problems that needs to be cleaned\n- 'Unnamed: 0', 'Unnamed: 4' and 'Your Rating' columns should be deleted entirely\n- \"Rank & Title\" Column needs to be split into two columns: \"Rank\", \"Title\"\n- Thus, Rank should contain the numerical values and title will hold only the title values\n- Furthermore, \"Title\" needs to be split into: \"Title\", \"Year\" \n"
},
{
"metadata": {
"trusted": false
},
"id": "6446b179",
"cell_type": "code",
"source": "# Hint: \n# To split a column into two based on a delimiter (. in this example)\n# df[ [col1, col2] ] = df[ col ].str.split(\".\", expand=True)\n\n# if the delimiter had multiple occurences in a row, and you want to split baed on the first occurence only, add \n# the 'n=1' key argument\n\n# df[ [col1, col2] ] = df[ col ].str.split(\".\", expand=True, n = 1)\n\n\n",
"execution_count": 246,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "adda2068",
"cell_type": "code",
"source": "",
"execution_count": 252,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "1f5f79a5",
"cell_type": "code",
"source": "",
"execution_count": 254,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "18b0336b",
"cell_type": "code",
"source": "# Hint:\n# To drop columns in-place use the following:\n# df.drop(columns=[col1, col2, ...], inplace = True)\n\n\n",
"execution_count": 259,
"outputs": []
},
{
"metadata": {},
"id": "95a7216d",
"cell_type": "markdown",
"source": "#### Final outcome you should achieve"
},
{
"metadata": {
"trusted": false
},
"id": "dfa63292",
"cell_type": "code",
"source": "df",
"execution_count": 257,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>IMDb Rating</th>\n <th>Year</th>\n <th>Rank</th>\n <th>Title</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>9.2</td>\n <td>1994</td>\n <td>1</td>\n <td>The Shawshank Redemption</td>\n </tr>\n <tr>\n <th>1</th>\n <td>9.2</td>\n <td>1972</td>\n <td>2</td>\n <td>The Godfather</td>\n </tr>\n <tr>\n <th>2</th>\n <td>9.0</td>\n <td>2008</td>\n <td>3</td>\n <td>The Dark Knight</td>\n </tr>\n <tr>\n <th>3</th>\n <td>9.0</td>\n <td>1974</td>\n <td>4</td>\n <td>The Godfather Part II</td>\n </tr>\n <tr>\n <th>4</th>\n <td>9.0</td>\n <td>1957</td>\n <td>5</td>\n <td>12 Angry Men</td>\n </tr>\n <tr>\n <th>...</th>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n <td>...</td>\n </tr>\n <tr>\n <th>245</th>\n <td>8.0</td>\n <td>1975</td>\n <td>246</td>\n <td>Dersu Uzala</td>\n </tr>\n <tr>\n <th>246</th>\n <td>8.0</td>\n <td>2011</td>\n <td>247</td>\n <td>The Help</td>\n </tr>\n <tr>\n <th>247</th>\n <td>8.0</td>\n <td>1992</td>\n <td>248</td>\n <td>Aladdin</td>\n </tr>\n <tr>\n <th>248</th>\n <td>8.0</td>\n <td>1982</td>\n <td>249</td>\n <td>Gandhi</td>\n </tr>\n <tr>\n <th>249</th>\n <td>8.0</td>\n <td>1990</td>\n <td>250</td>\n <td>Dances with Wolves</td>\n </tr>\n </tbody>\n</table>\n<p>250 rows × 4 columns</p>\n</div>",
"text/plain": " IMDb Rating Year Rank Title\n0 9.2 1994 1 The Shawshank Redemption \n1 9.2 1972 2 The Godfather \n2 9.0 2008 3 The Dark Knight \n3 9.0 1974 4 The Godfather Part II \n4 9.0 1957 5 12 Angry Men \n.. ... ... ... ...\n245 8.0 1975 246 Dersu Uzala \n246 8.0 2011 247 The Help \n247 8.0 1992 248 Aladdin \n248 8.0 1982 249 Gandhi \n249 8.0 1990 250 Dances with Wolves \n\n[250 rows x 4 columns]"
},
"execution_count": 257,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "0c5732b4",
"cell_type": "markdown",
"source": "### [Self-Paced] Assessment 2: Scraping & Cleaning a Pokemon Dataset for an Online Game-Card Trading Platform "
},
{
"metadata": {},
"id": "4f730047",
"cell_type": "markdown",
"source": "- Scrape the following website(using Selenium Automation is a MUST): https://pokemondb.net/pokedex/all\n- Clean it when necessarya nd store the result in an excel sheet\n- Watch the recording on Selnium Web Scraping + Yelp project to get a more detailed overview "
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.8.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "Desktop/Lab 2_Data Cleaning Noise-Copy1.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment