Last active April 5, 2022
"cell_type": "markdown",
"source": [
Based on issue [#151]( **Outlier Test**
"Issue references:\n",
"- Te Chow, V. (2010). Applied Hydrology. Tata McGraw-Hill Education. pp. 403-405.\n",
"- Limantara, Lily M. (2018): Rekayasa Hidrologi, Edisi Revisi. Penerbit Andi Offset, Yogyakarta. (pp. 89-93)\n",
"Issue description:\n",
"- Find the outliers in the data.\n",
"- Build the appendix table used to look up the value of $K_n$.\n",
"  - Table `t_rel_Kn_n`: the relationship between $K_n$ and the number of data points $N$.\n",
"- Write a function that reads the data and returns the lower and upper boundaries.\n",
"- Also check whether the data contain outliers. If so, report which entries are outliers.\n"
"cell_type": "markdown",
"source": [
"cell_type": "code",
"source": [
"import numpy as np\n",
"import pandas as pd"
"cell_type": "code",
"source": [
"# contoh diambil dari buku\n",
"# Limantara, Lily M. (2018): Rekayasa Hidrologi, Edisi Revisi. \n",
"# Penerbit Andi Offset, Yogyakarta. (hal. 90-91)\n",
"_hujan = np.array([2818, 2542, 1949, 1842, 1748, 1737, 1605, 1558, 1433, 1264])\n",
"_index = np.array([2010, 2013, 2008, 2012, 2011, 2014, 2009, 2007, 2015, 2006])\n",
"data = pd.DataFrame(\n",
" data=np.stack([_index, _hujan], axis=1), \n",
" columns=['Tahun','Hujan']\n",
"# ubah kolom tahun jadi datetime, dan index\n",
"data.Tahun = pd.to_datetime(data.Tahun, format=\"%Y\")\n",
"data.set_index('Tahun', inplace=True)\n",
"output_type": "execute_result",
"data": {
"text/plain": [
" Hujan\n",
"Tahun \n",
"2010-01-01 2818\n",
"2013-01-01 2542\n",
"2008-01-01 1949\n",
"2012-01-01 1842\n",
"2011-01-01 1748\n",
"2014-01-01 1737\n",
"2009-01-01 1605\n",
"2007-01-01 1558\n",
"2015-01-01 1433\n",
"2006-01-01 1264"
"cell_type": "markdown",
"source": [
"# TABEL\n",
"Nilai tabel mengikuti referensi buku. \n",
"- Tabel `t_rel_Kn_n` diambil pada buku Te Chow, V. (2010). Applied Hydrology. p404. \n",
"Tabel yang digunakan dalam perhitungan dibangkitkan dengan kode dibawah ini:"
"cell_type": "code",
"execution_count": 3,
"outputs": [
"output_type": "execute_result",
"data": {
"text/plain": [
" N Kn\n",
"0 10.0 2.036\n",
"1 11.0 2.088\n",
"2 12.0 2.134\n",
"3 13.0 2.175\n",
"4 14.0 2.213"
"source": [
"_N = np.array(\n",
" [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, \n",
" 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, \n",
" 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, \n",
" 130, 140]\n",
"_Kn = np.array(\n",
" [2.036, 2.088, 2.134, 2.175, 2.213, 2.247, 2.279, 2.309, 2.335, 2.361, \n",
" 2.385, 2.408, 2.429, 2.448, 2.467, 2.486, 2.502, 2.519, 2.534, 2.549, \n",
" 2.563, 2.577, 2.591, 2.604, 2.616, 2.628, 2.639, 2.65, 2.661, 2.671, \n",
" 2.682, 2.692, 2.7, 2.71, 2.719, 2.727, 2.736, 2.744, 2.753, 2.76, \n",
" 2.768, 2.804, 2.837, 2.866, 2.893, 2.917, 2.94, 2.961, 2.981, 3, \n",
" 3.017, 3.049, 3.078, 3.104, 3.129]\n",
"t_rel_Kn_n = pd.DataFrame(np.stack((_N, _Kn), axis=1), columns=['N', 'Kn'])\n",
"cell_type": "markdown",
"source": [
"# KODE"
"cell_type": "code",
"source": [
"def find_Kn(n, table=t_rel_Kn_n):\n",
" if n < 10 or n > 140:\n",
" raise ValueError('Jumlah data diluar batas bawah (10) / batas atas (140)')\n",
" else:\n",
" N = table['N'].to_numpy()\n",
" Kn = table['Kn'].to_numpy()\n",
" return np.interp(n, N, Kn)\n",
"def calc_boundary(df, col=None, result='value', show_stat=False):\n",
" col = df.columns[0] if col is None else col\n",
" \n",
" x = df[col].to_numpy()\n",
" n = x.size\n",
" xlog = np.log10(x)\n",
" xlogmean = xlog.mean()\n",
" xlogstd = xlog.std(ddof=1)\n",
" Kn = find_Kn(n)\n",
" # higher\n",
" y_h = xlogmean + Kn*xlogstd\n",
" val_h = 10**y_h\n",
" # lower\n",
" y_l = xlogmean - Kn*xlogstd\n",
" val_l = 10**y_l\n",
" if show_stat:\n",
" print(\n",
" f'Statistik:',\n",
" f'N = {n}',\n",
" f'Mean (log) = {xlogmean:.5f}',\n",
" f'Std (log) = {xlogstd:.5f}',\n",
" f'Lower (val) = {val_l:.5f}',\n",
" f'Higher (val) = {val_h:.5f}',\n",
" sep='\\n', end='\\n\\n'\n",
" )\n",
" if result.lower() == 'value':\n",
" return (val_l, val_h)\n",
" elif result.lower() == 'log':\n",
" return (y_l, y_h)\n",
"def find_outlier(df, col=None, verbose=False, **kwargs):\n",
" \n",
" low, high = calc_boundary(df, col, **kwargs)\n",
" \n",
" col = df.columns[0] if col is None else col\n",
" masklow = df[col] < low\n",
" maskhigh = df[col] > high\n",
" mask = masklow | maskhigh\n",
" \n",
" if verbose and masklow.sum():\n",
" print(f'Ada outlier dibawah batas bawah sebanyak {masklow.sum()}.')\n",
" if verbose and maskhigh.sum():\n",
" print(f'Ada outlier diatas batas atas sebanyak {maskhigh.sum()}.')\n",
" def check_outlier(x):\n",
" if x < low:\n",
" return \"lower\"\n",
" elif x > high:\n",
" return \"higher\"\n",
" else:\n",
" return pd.NA\n",
" if mask.sum() != 0:\n",
" new_df = df.copy()\n",
" new_df['outlier'] = df[col].apply(check_outlier)\n",
" return new_df[[col, 'outlier']]\n",
" else:\n",
" print(\"Tidak ada Outlier\")\n",
" return None\n"
"cell_type": "markdown",
"source": [
"cell_type": "markdown",
"source": [
"## Fungsi `find_Kn(n)`\n",
"Fungsi ini digunakan untuk mencari nilai $K_n$ untuk perhitungan uji outlier.\n",
"- Argumen fungsi:\n",
" - `n`: jumlah data $\\left(10 \\le \\mathbb{N} \\le 140\\right)$. Diluar batasan tersebut akan menghasilkan peringatan `ValueError`. "
"cell_type": "code",
"source": [
"data": {
"cell_type": "code",
"source": [
" find_Kn(141)\n",
"except ValueError:\n",
" print(\"Hasil akan error\")"
"cell_type": "markdown",
"source": [
"## Fungsi `calc_boundary(df, col=None, result='value', show_stat=False)`\n",
"Fungsi `calc_boundary(...)` digunakan untuk mencari nilai batas bawah dan batas atas outlier dari data. Keluaran fungsi ini berupa _tuple_ dengan bentuk `(low_boundary, high_boundary)`.\n",
"- Argumen Posisi:\n",
" - `df`: dataset dalam objek `pandas.DataFrame`.\n",
"- Argumen Opsional:\n",
" - `col=None`: nama kolom data yang akan dicek outlier. Jika `None` maka dipilih kolom pertama dari dataframe.\n",
" - `result=\"value\"`: keluaran berupa nilai batasan aktual dengan skala original. Jika menggunakan `log`, keluaran berupa nilai batasan dalam skala logaritmik.\n",
" - `show_stat=False`: jika `True` akan menampilkan nilai statistik berupa jumlah data, rata-rata, standar deviasi, batasan bawah dan atas dalam skala original. "
"cell_type": "code",
"source": [
"data": {
"cell_type": "code",
"source": [
"calc_boundary(data, col='Hujan', result='log', show_stat=True)"
"text": [
"data": {
"cell_type": "markdown",
"source": [
"## Fungsi `find_outlier(df, col=None, verbose=False, **kwargs)`\n",
"Fungsi `find_outlier(...)` digunakan untuk memeriksa apakah data memiliki outlier atau tidak dan memberi keluaran berupa dataframe yang telah ditandai data mana saja yang dikategorikan outlier.\n",
"- Argumen Posisi:\n",
" - `df`: dataset dalam objek `pandas.DataFrame`.\n",
"- Argumen Opsional:\n",
" - `col=None`: nama kolom data yang akan dicek outlier. Jika `None` maka dipilih kolom pertama dari dataframe.\n",
" - `verbose=False`: memberi informasi tambahan jika memiliki outlier dan seberapa banyak. \n",
" - `**kwargs`: _keyword arguments_ dari fungsi `.calc_boundary()`. "
"metadata": {
"source": [
"cell_type": "code",
"source": [
"find_outlier(data, show_stat=True)"
"cell_type": "code",
"source": [
"# contoh data dengan outlier\n",
"data2 = data.copy()\n",
"data2.loc['2012'] = 4000"
"cell_type": "code",
"source": [
"find_outlier(data2, 'Hujan', verbose=True)"
"output_type": "execute_result",
"data": {
"text/plain": [
" Hujan outlier\n",
"Tahun \n",
"2010-01-01 2818 <NA>\n",
"2013-01-01 2542 <NA>\n",
"2008-01-01 1949 <NA>\n",
"2012-01-01 4000 higher\n",
"2011-01-01 1748 <NA>\n",
"2014-01-01 1737 <NA>\n",
"2009-01-01 1605 <NA>\n",
"2007-01-01 1558 <NA>\n",
"2015-01-01 1433 <NA>\n",
"2006-01-01 1264 <NA>"
"cell_type": "code",
"source": [
"data2.loc['2012'] = 40\n",
"find_outlier(data2, show_stat=True)"
"text": [
"data": {
"text/plain": [
" Hujan outlier\n",
"Tahun \n",
"2010-01-01 2818 <NA>\n",
"2013-01-01 2542 <NA>\n",
"2008-01-01 1949 <NA>\n",
"2012-01-01 40 lower\n",
"2011-01-01 1748 <NA>\n",
"2014-01-01 1737 <NA>\n",
"2009-01-01 1605 <NA>\n",
"2007-01-01 1558 <NA>\n",
"2015-01-01 1433 <NA>\n",
"2006-01-01 1264 <NA>"
"cell_type": "markdown",
"source": [
"# Changelog\n",
"- 20220303 - 1.0.0 - Initial\n",
"#### Copyright &copy; 2022 [Taruma Sakti Megariansyah](\n",
Source code in this notebook is licensed under an [MIT License]( Data in this notebook is licensed under a [Creative Commons Attribution 4.0 International]( 
