Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save AhmedAlhallag/69847114b7789934833d81327cc7b642 to your computer and use it in GitHub Desktop.
Save AhmedAlhallag/69847114b7789934833d81327cc7b642 to your computer and use it in GitHub Desktop.
Desktop/DataCleaningLab2_1_NoSol.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {},
"id": "e36c3286",
"cell_type": "markdown",
"source": "# KH4005CEM Lab 1.2: Detecting & Handling Missing Values "
},
{
"metadata": {
"trusted": false
},
"id": "61c76ffa",
"cell_type": "code",
"source": "# Note: This step assumes that you grabed the values (as an .xlsx file or as a .csv file) from somewhere\n# either from kaggle, or web scraped\n\n\n# importing Vlues dataset using pandas library \nimport pandas as pd \n\n# df = pd.read_csv(\"dataset.csv\")\ndf = pd.read_csv(\"dataset_InLab.csv\")\n",
"execution_count": 272,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "d4c52155",
"cell_type": "code",
"source": "print(df)\ndf\n",
"execution_count": 270,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 ? 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 ? 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 ? 5 999999999.0 1 0.40 yes\n"
},
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>?</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>?</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>?</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 ? 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 ? 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 ? 5 999999999.0 1 0.40 yes"
},
"execution_count": 270,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "86e7150c",
"cell_type": "markdown",
"source": "## Intro to Regular Expression\n- Check the regex quick guide on Notion Page\n- Replace any missing valus, such as: special symbols, empty strings, speacial words (NA, null, etc..) with a pythonic NaN data type\n- NaN (Not a Number) data types are defined in the numpy library\n- Converting missing values to NaN types will allow you to generate a summary of columns containing missing values"
},
{
"metadata": {
"trusted": false
},
"id": "c1b7ec40",
"cell_type": "code",
"source": "import numpy as np\n\n# replace empty strings and Question marks with a nan\ndf.replace(r\"\\?|^$\",\"\", regex=True)\n",
"execution_count": 273,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td></td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td></td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td></td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 5 999999999.0 1 0.40 yes"
},
"execution_count": 273,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "00630714",
"cell_type": "markdown",
"source": "### Generate a summary report of Missing Values"
},
{
"metadata": {
"trusted": false
},
"id": "157abfd3",
"cell_type": "code",
"source": "df.isna().sum()\n",
"execution_count": 256,
"outputs": [
{
"data": {
"text/plain": "ID 0\nCity 2\nItems 0\nPrice 0\nPayment 1\nTax 0\nTransaction 0\ndtype: int64"
},
"execution_count": 256,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "9b8ddaea",
"cell_type": "markdown",
"source": "### Grab list of column names that contains missing values"
},
{
"metadata": {
"trusted": false
},
"id": "06712d5b",
"cell_type": "code",
"source": "df.columns[df.isna().any()]\n# Convert it into a list\ndf.columns[df.isna().any()].to_list()\n",
"execution_count": 257,
"outputs": [
{
"data": {
"text/plain": "['City', 'Payment']"
},
"execution_count": 257,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "8d42062c",
"cell_type": "code",
"source": "df",
"execution_count": 258,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>NaN</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>NaN</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>NaN</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 NaN 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 NaN 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 NaN 5 999999999.0 1 0.40 yes"
},
"execution_count": 258,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "2e9637ea",
"cell_type": "markdown",
"source": "### Put all of the previous work to Action \n#### Create a function that replaces missing values & one that detects them"
},
{
"metadata": {
"trusted": false
},
"id": "c7b54d79",
"cell_type": "code",
"source": "def detect_missing(df):\n df = convert_missing_to(df, np.nan)\n cols_with_missing = df.columns[df.isna().any()].tolist()\n if cols_with_missing:\n return True, cols_with_missing \n return False, None\n\ndef convert_missing_to(df, cri):\n return df.replace(r\"\\?|^$\",cri, regex=True)\n\n \n ",
"execution_count": 274,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "1a11b3ca",
"cell_type": "code",
"source": "# Step 1: detect?\ndetect_missing(df)\n\n# Step 2: convert missing to NaN\ndf_withNan = convert_missing_to(df,np.nan)\n\n# Step 3: Generate summary \ndf_withNan.isna().sum()\n\n",
"execution_count": 275,
"outputs": [
{
"data": {
"text/plain": "ID 0\nCity 2\nItems 0\nPrice 0\nPayment 1\nTax 0\nTransaction 0\ndtype: int64"
},
"execution_count": 275,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "23cfa9a9",
"cell_type": "markdown",
"source": "## Handle Missing "
},
{
"metadata": {},
"id": "c735068c",
"cell_type": "markdown",
"source": "### To handle missing data, we can get the mode for categorical columns (\"City\") or the mean for numerical columns (Payment)\n- Note: there are other ways as well. We will focus on the statistical ones for this lab"
},
{
"metadata": {},
"id": "6ff7b5b0",
"cell_type": "markdown",
"source": "### Before we get the mode for the City column, we need to make sure missing values are removed (generating a temporary version of the dataframe to get the mode from)"
},
{
"metadata": {
"trusted": false
},
"id": "fdebfaeb",
"cell_type": "code",
"source": "# convert missing to empty (empty strings are easier to exclude than the nans)\ndf_empty = convert_missing_to(df,\"\")",
"execution_count": 276,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "0b12fe66",
"cell_type": "code",
"source": "df_empty",
"execution_count": 277,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td></td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td></td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td></td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 5 999999999.0 1 0.40 yes"
},
"execution_count": 277,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "c06cfff7",
"cell_type": "code",
"source": "# create a copy of the City column that do not contain missing values (delete missing values)\n\nCity_unique = []\nfor val in df_empty[\"City\"].to_list():\n if val:\n City_unique.append(val)",
"execution_count": 278,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "2f766454",
"cell_type": "code",
"source": "City_unique",
"execution_count": 279,
"outputs": [
{
"data": {
"text/plain": "['Colombo', 'Singapore', 'Tokyo', 'Melbourne', 'Singapore']"
},
"execution_count": 279,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": false
},
"id": "c40b786e",
"cell_type": "code",
"source": "# import stats from scipy to get the mode method \nfrom scipy import stats",
"execution_count": 280,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "a447658c",
"cell_type": "code",
"source": "mode_city = stats.mode(City_unique)[0][0]\n",
"execution_count": 281,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "6b848725",
"cell_type": "code",
"source": "# generate another final copy of the dataframe, and replace the missing values from the original table with an \n# alias that is unique to us \ndf_final = convert_missing_to(df,\"TO BE REPLACED\")\n\n# now replace that alias with the mode value\ndf_final['City'] = df_final['City'].apply( lambda x: x.replace(\"TO BE REPLACED\",mode_city))",
"execution_count": 282,
"outputs": []
},
{
"metadata": {
"trusted": false
},
"id": "174c6ba5",
"cell_type": "code",
"source": "df_final",
"execution_count": 283,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>ID</th>\n <th>City</th>\n <th>Items</th>\n <th>Price</th>\n <th>Payment</th>\n <th>Tax</th>\n <th>Transaction</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>Colombo</td>\n <td>3</td>\n <td>500.0</td>\n <td>TO BE REPLACED</td>\n <td>0.25</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>Singapore</td>\n <td>2</td>\n <td>1500.0</td>\n <td>0</td>\n <td>0.31</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>Singapore</td>\n <td>4</td>\n <td>2000.0</td>\n <td>1</td>\n <td>1.20</td>\n <td>yes</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>Tokyo</td>\n <td>4300</td>\n <td>1300.0</td>\n <td>#</td>\n <td>0.80</td>\n <td>no</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>Melbourne</td>\n <td>6</td>\n <td>1.3</td>\n <td>--</td>\n <td>0.50</td>\n <td>no</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6</td>\n <td>Singapore</td>\n <td>4</td>\n <td>1800.0</td>\n <td>1</td>\n <td>0.30</td>\n <td>no</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7</td>\n <td>Singapore</td>\n <td>5</td>\n <td>999999999.0</td>\n <td>1</td>\n <td>0.40</td>\n <td>yes</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " ID City Items Price Payment Tax Transaction\n0 1 Colombo 3 500.0 TO BE REPLACED 0.25 yes\n1 2 Singapore 2 1500.0 0 0.31 yes\n2 3 Singapore 4 2000.0 1 1.20 yes\n3 4 Tokyo 4300 1300.0 # 0.80 no\n4 5 Melbourne 6 1.3 -- 0.50 no\n5 6 Singapore 4 1800.0 1 0.30 no\n6 7 Singapore 5 999999999.0 1 0.40 yes"
},
"execution_count": 283,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"id": "b1b87bc7",
"cell_type": "markdown",
"source": "### Activity: Do the same steps for the 'Payment\" Column\n- Note: you would want to replace missing values (after aliasing them) with the \"mean\" this time! since these are numerical.\n- We can continue on the df_empty and operate on the payment column"
},
{
"metadata": {},
"id": "3056e1a7",
"cell_type": "markdown",
"source": "### Detecting & Handling Noise/Outliers (next lab)"
}
],
"metadata": {
"_draft": {
"nbviewer_url": "https://gist.github.com/d6ee8c6dfd782c0838174a17223d28aa"
},
"gist": {
"id": "d6ee8c6dfd782c0838174a17223d28aa",
"data": {
"description": "Desktop/DataCleaningLab2_1_NoSol.ipynb",
"public": true
}
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.8.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment