@iniakunhuda
Created August 28, 2020 16:27
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data preprocessing is the first stage of a data-processing workflow, carried out before the data is fed to a machine learning algorithm. The data we work with day to day, whether it comes from a database, an Excel file, or another source, is usually raw and imperfect. For example, a dataset may contain missing values, values whose type differs from the rest of the column, and so on. These problems need to be resolved first so that the data is easier to work with and the output matches what we expect."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This article discusses several cases, including:\n",
"* Importing libraries\n",
"* Importing the dataset\n",
"* Handling missing data in the dataset\n",
"* Converting string data into categories\n",
"* Splitting the dataset into training and test sets\n",
"* Feature scaling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time I will use Jupyter Notebook, launched from Anaconda Navigator, to make the material easier to follow. If you have not installed Anaconda yet, you can read the Anaconda installation tutorial at this URL. If you have never studied Python, it is better to learn basic Python syntax first so that you can easily follow the program we are going to build.\n",
"\n",
"Create a new Python file / Jupyter notebook on your computer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The library we use is **Pandas**, which processes our dataset. A library is imported as follows:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*as* defines an **alias** so that we can refer to the library by a shorter name. The alias is up to us, but pandas is conventionally aliased as *pd*. Likewise, if we want to use the **numpy** library for numerical processing, we can import it in the same way: **import numpy as np**\n",
"\n",
"If pandas is not installed yet, you can install it with **pip** or **conda** (Anaconda)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Install via pip. The comment sits on its own line because a trailing comment is passed to pip as an argument.\n",
"%pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Install via conda (Anaconda). The -c flag expects a channel name, so it is dropped here.\n",
"%conda install -y pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset we will use comes from the file **Data.csv**, which can be downloaded at (link). Put the dataset in the same folder / directory as your Python file."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"dt = pd.read_csv(\"Data.csv\")  # load Data.csv and store it in the variable dt"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>Salary</th>\n",
" <th>Purchased</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>France</td>\n",
" <td>44.0</td>\n",
" <td>72000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Spain</td>\n",
" <td>27.0</td>\n",
" <td>48000.0</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Germany</td>\n",
" <td>30.0</td>\n",
" <td>54000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Spain</td>\n",
" <td>38.0</td>\n",
" <td>61000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Germany</td>\n",
" <td>40.0</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Country Age Salary Purchased\n",
"0 France 44.0 72000.0 No\n",
"1 Spain 27.0 48000.0 Yes\n",
"2 Germany 30.0 54000.0 No\n",
"3 Spain 38.0 61000.0 No\n",
"4 Germany 40.0 NaN Yes"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt.head(5)  # show the first 5 rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the first 5 rows show, there is a missing value in the **Salary** column in the 5th row. However, we do not yet know how many values are missing in each column. To check whether other columns also contain missing values, we can use the following function:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country 0\n",
"Age 1\n",
"Salary 1\n",
"Purchased 0\n",
"dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt.isna().sum()  # count the missing values in every column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From this we can see that there is 1 missing value in the **Age** column and another one in the **Salary** column. Missing data can strongly affect the rest of the data if we do not handle it first, and the resulting output may be biased. What does that mean? Suppose we filter for Germany and want to see the lowest salary in that country. The result we get is not **NaN**, because pandas skips missing values by default, even though the missing value could actually have been the largest or the smallest."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>Salary</th>\n",
" <th>Purchased</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Germany</td>\n",
" <td>30.0</td>\n",
" <td>54000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Germany</td>\n",
" <td>40.0</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Germany</td>\n",
" <td>50.0</td>\n",
" <td>83000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Country Age Salary Purchased\n",
"2 Germany 30.0 54000.0 No\n",
"4 Germany 40.0 NaN Yes\n",
"8 Germany 50.0 83000.0 No"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"german = dt[dt['Country'] == \"Germany\"]\n",
"german"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country Germany\n",
"Age 30\n",
"Salary 54000\n",
"Purchased No\n",
"dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"german.min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So how do we handle these missing values? Missing data can arise for several reasons: a user forgot to fill in a field, data was lost during download, a program error occurred, and so on. We will go through how to handle missing data step by step."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding & handling missing data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will learn how to:\n",
"* Find missing data\n",
" * Standard missing values (NaN)\n",
" * Other missing values\n",
"* Handle missing data with several methods, including:\n",
" * Deleting rows\n",
" * Replacing with the mean / median / mode\n",
" * Replacing with a unique category\n",
" * Predicting the missing values with LinearRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Finding standard missing values (NaN)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country 0\n",
"Age 1\n",
"Salary 1\n",
"Purchased 0\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt.isna().sum()  # all columns"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"5 False\n",
"6 True\n",
"7 False\n",
"8 False\n",
"9 False\n",
"Name: Age, dtype: bool"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt['Age'].isnull()  # a specific column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Other missing values\n",
"The missing values we look for here are the \"empty placeholders\" that we define ourselves beforehand, for example the strings n/a, NA, na, empty, and so on."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>Salary</th>\n",
" <th>Purchased</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>France</td>\n",
" <td>44.0</td>\n",
" <td>72000</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Spain</td>\n",
" <td>27.0</td>\n",
" <td>48000</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Germany</td>\n",
" <td>30.0</td>\n",
" <td>--</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Spain</td>\n",
" <td>38.0</td>\n",
" <td>61000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Germany</td>\n",
" <td>40.0</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>France</td>\n",
" <td>NaN</td>\n",
" <td>58000</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Spain</td>\n",
" <td>NaN</td>\n",
" <td>52000</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>France</td>\n",
" <td>48.0</td>\n",
" <td>na</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Germany</td>\n",
" <td>50.0</td>\n",
" <td>83000</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>France</td>\n",
" <td>37.0</td>\n",
" <td>67000</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Country Age Salary Purchased\n",
"0 France 44.0 72000 No\n",
"1 Spain 27.0 48000 Yes\n",
"2 Germany 30.0 -- No\n",
"3 Spain 38.0 61000 NaN\n",
"4 Germany 40.0 NaN Yes\n",
"5 France NaN 58000 Yes\n",
"6 Spain NaN 52000 No\n",
"7 France 48.0 na Yes\n",
"8 Germany 50.0 83000 No\n",
"9 France 37.0 67000 Yes"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"Data2.csv\")\n",
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>Salary</th>\n",
" <th>Purchased</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>France</td>\n",
" <td>44.0</td>\n",
" <td>72000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Spain</td>\n",
" <td>27.0</td>\n",
" <td>48000.0</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Germany</td>\n",
" <td>30.0</td>\n",
" <td>NaN</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Spain</td>\n",
" <td>38.0</td>\n",
" <td>61000.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Germany</td>\n",
" <td>40.0</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>France</td>\n",
" <td>NaN</td>\n",
" <td>58000.0</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Spain</td>\n",
" <td>NaN</td>\n",
" <td>52000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>France</td>\n",
" <td>48.0</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Germany</td>\n",
" <td>50.0</td>\n",
" <td>83000.0</td>\n",
" <td>No</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>France</td>\n",
" <td>37.0</td>\n",
" <td>67000.0</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Country Age Salary Purchased\n",
"0 France 44.0 72000.0 No\n",
"1 Spain 27.0 48000.0 Yes\n",
"2 Germany 30.0 NaN No\n",
"3 Spain 38.0 61000.0 NaN\n",
"4 Germany 40.0 NaN Yes\n",
"5 France NaN 58000.0 Yes\n",
"6 Spain NaN 52000.0 No\n",
"7 France 48.0 NaN Yes\n",
"8 Germany 50.0 83000.0 No\n",
"9 France 37.0 67000.0 Yes"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build a list of missing-value markers\n",
"missing_values = [\"n/a\", \"na\", \"--\"]\n",
"df_ms = pd.read_csv(\"Data2.csv\", na_values = missing_values)\n",
"df_ms.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The missing-value markers we specified are converted to NaN by pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling missing data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Deleting rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first and easiest way to clean the data is to delete the rows that contain missing values. Although it is easy, this approach is considered ineffective when the dataset is small, because the remaining data may be biased and the output may be off. Rows are deleted as follows:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Country 0\n",
"Age 0\n",
"Salary 0\n",
"Purchased 0\n",
"dtype: int64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
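],
"source": [
"# dropna() returns a new DataFrame, so dt keeps its missing values for the later examples\n",
"dt_dropped = dt.dropna()\n",
"dt_dropped.isnull().sum()"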
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Advantages**\n",
"* With the rows containing missing values removed, the model can be very accurate (if the dataset is large)\n",
"* Fast and easy\n",
"\n",
"**Disadvantages**\n",
"* Loss of information and data\n",
"* Works poorly when the share of missing values is high (say 30%) relative to the whole dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Replacing with the mean / median / mode"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This approach applies when the missing values are numeric (integer, float), such as age and salary. We can compute the mean, median, or mode of the column and use it in place of the missing values. This is often preferable to deleting rows, because no rows are lost and the filled-in values stay close to the rest of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 44.0\n",
"1 27.0\n",
"2 30.0\n",
"3 38.0\n",
"4 40.0\n",
"5 35.0\n",
"6 NaN\n",
"7 48.0\n",
"8 50.0\n",
"9 37.0\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt['Age'].head(10)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"38.77777777777778"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt['Age'].mean()  # can be swapped for .median(), or .mode() for the mode"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 44.000000\n",
"1 27.000000\n",
"2 30.000000\n",
"3 38.000000\n",
"4 40.000000\n",
"5 35.000000\n",
"6 38.777778\n",
"7 48.000000\n",
"8 50.000000\n",
"9 37.000000\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"# np.nan replaces np.NaN, an alias that NumPy 2.0 removed\n",
"dt['Age'].replace(np.nan, dt['Age'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Advantages**\n",
"* A better option when the dataset is small\n",
"* It avoids the data loss caused by deleting rows and columns\n",
"\n",
"**Disadvantages**\n",
"* The resulting data may be biased"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Replacing with a unique category"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This approach is useful for categorical, non-numeric data such as strings. For example, if the Purchased column turns out to contain missing values, we can replace them with the string \"U\" to mean \"Unknown\". It can also be useful in cases where we want to fill every missing value with a default such as No."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 No\n",
"1 Yes\n",
"2 No\n",
"3 No\n",
"4 Yes\n",
"5 Yes\n",
"6 No\n",
"7 Yes\n",
"8 No\n",
"9 Yes\n",
"Name: Purchased, dtype: object"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt['Purchased'].fillna(\"U\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Advantages**\n",
"* No data is lost; the missing values are replaced with a unique category\n",
"\n",
"**Disadvantages**\n",
"* Adds an extra feature to the model once the column is encoded, which can hurt performance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Predicting the missing values with LinearRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can predict the null values with the help of a machine learning algorithm. This method can give better accuracy, unless the missing values are expected to have very high variance. We will use LinearRegression to replace the NaN in the Age column based on Country, Salary, and Purchased. This approach is most useful when the other columns / features are not missing, so that they can serve as independent variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Advantages**\n",
"* Produces unbiased estimates of the model parameters\n",
"\n",
"**Disadvantages**\n",
"* Bias also arises when an incomplete conditioning set is used for a categorical variable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"* https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e\n",
"* https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/\n",
"* https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4\n",
"* https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b\n",
"* https://machinelearningmastery.com/handle-missing-data-python/\n",
"* https://medium.com/analytics-vidhya/data-cleaning-and-preprocessing-a4b751f4066f"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
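The notebook's last method, predicting missing values with LinearRegression, is described without any code. Below is a minimal sketch of one way it could look, assuming scikit-learn is installed. The ten rows are re-typed inline from the notebook's outputs so the block runs on its own. The helper name `impute_age_with_regression`, the one-hot encoding with `pd.get_dummies`, and the choice to fill Salary with its mean before fitting are illustrative assumptions, not the author's method.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# The ten rows the notebook appears to load from Data.csv, re-typed here
# so the sketch is self-contained.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""


def impute_age_with_regression(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` whose missing Age values are predicted.

    Hypothetical helper: the predictors are Country, Salary and Purchased.
    """
    result = frame.copy()

    # LinearRegression cannot fit on NaN predictors, so Salary is filled
    # with its mean first (an assumption made for this sketch).
    features = result[["Country", "Salary", "Purchased"]].copy()
    features["Salary"] = features["Salary"].fillna(features["Salary"].mean())

    # One-hot encode the string columns so that every predictor is numeric.
    features = pd.get_dummies(
        features, columns=["Country", "Purchased"], dtype=float
    )

    known = result["Age"].notna()
    model = LinearRegression()
    model.fit(features[known], result.loc[known, "Age"])

    # Predict only the rows where Age is missing.
    if (~known).any():
        result.loc[~known, "Age"] = model.predict(features[~known])
    return result


dt = pd.read_csv(io.StringIO(CSV_TEXT))
dt_imputed = impute_age_with_regression(dt)
print(dt_imputed[["Country", "Age", "Salary"]])
```

With only nine complete rows the prediction is a rough estimate at best, so on a dataset this small it is worth comparing the result against the plain mean or median before relying on it.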
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
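The notebook demonstrates mean imputation on the rows listed above and only mentions the median and mode variants in a comment. The sketch below shows those two variants. It re-types the rows inline so that it runs on its own, and the variable names are illustrative. One detail worth noting: `Series.mode()` returns a Series, because several values can tie, so the sketch takes its first entry before passing it to `fillna`.

```python
import io

import pandas as pd

# The rows listed above, re-typed with explicit commas so that the empty
# cells are unambiguous.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dt = pd.read_csv(io.StringIO(CSV_TEXT))

# The median is less sensitive to outliers than the mean.
age_median = dt["Age"].median()
age_filled_median = dt["Age"].fillna(age_median)

# mode() returns every value that ties for most frequent, sorted ascending.
# All ages here are unique, so iloc[0] is simply the smallest age; the mode
# is more meaningful for columns with repeated values.
age_mode = dt["Age"].mode().iloc[0]
age_filled_mode = dt["Age"].fillna(age_mode)

print("median:", age_median)
print("mode (first entry):", age_mode)
print(age_filled_median)
```

Neither call modifies `dt`; each `fillna` returns a new Series, which keeps the original column available for comparing the strategies.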
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,--,No
Spain,38,61000,NA
Germany,40,,Yes
France,NA,58000,Yes
Spain,,52000,No
France,48,na,Yes
Germany,50,83000,No
France,37,67000,Yes