@DavidKatz-il
Created December 29, 2020 11:20
The notebook for NaN Analysis post (https://dev.to/sephib/nans-bites-17kk).
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Background\n",
"While working on a health-related classification project we encountered a very large [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix), due to the vast number of health/lab tests that were available. After a meeting with the business domain experts, we understood that our initial data preprocessing for removing missing data (NaN) was faulty. \n",
"In this post, we share the pitfall that we ran into and our process for identifying features whose missing values are problematic in classification problems. \n",
"\n",
"# The Pitfall \n",
"We had over 2K features across 40K patients. We knew that most of the features had a significant share of `NaN`s, so we used common methods, such as [scikit-learn's VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) and [caret's near zero variance](http://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors) functions, to remove features dominated by `NaN` values. \n",
"We were left with fewer than 70 features and ran our base model to see if our classifier could predict better than random. After displaying the [feature importance](https://catboost.ai/docs/concepts/python-reference_catboostclassifier_get_feature_importance.html#python-reference_catboostclassifier_get_feature_importance) from our `CatBoost` model, concerns were raised regarding some of the features. \n",
"So we went back and did some homework... \n",
"\n",
"While re-analyzing the remaining features, we saw that although they had passed \n",
"our initial `NaN` tests, we had never checked how the `NaN`s were distributed across our classes: in some features the `NaN`s were concentrated in a specific class rather than spread evenly across the classes."
]
},
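{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the pitfall concrete, here is a minimal sketch on hypothetical toy data. The labels, the counts and the `nan_share` helper are illustrative assumptions, not part of our project data. It shows a feature whose overall share of missing values looks acceptable while most of the missing values sit in a single class.\n",
"\n",
"```python\n",
"# Hypothetical toy data: (label, value) pairs where None marks a missing value.\n",
"rows = (\n",
"    [('lived', 1.0)] * 54 + [('lived', None)] * 6\n",
"    + [('died', 1.0)] * 16 + [('died', None)] * 24\n",
")\n",
"\n",
"\n",
"def nan_share(pairs):\n",
"    # Fraction of entries whose value is missing.\n",
"    pairs = list(pairs)\n",
"    return sum(value is None for _, value in pairs) / len(pairs)\n",
"\n",
"\n",
"overall = nan_share(rows)\n",
"per_class = {\n",
"    label: nan_share(p for p in rows if p[0] == label)\n",
"    for label in ('lived', 'died')\n",
"}\n",
"\n",
"print(f'overall NaN share: {overall:.2f}')\n",
"for label, share in per_class.items():\n",
"    print(f'{label}: {share:.2f}')\n",
"```\n",
"\n",
"With these made-up counts the overall share is 30%, which a global threshold such as 35% would accept, yet the share is 10% in one class and 60% in the other."
]
},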
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data sample \n",
"To demonstrate this process let's look at an example dataset - the [Horse Colic](https://archive.ics.uci.edu/ml/datasets/Horse+Colic) dataset. \n",
"This dataset includes the outcome/survival of horses diagnosed with colic disease based upon their past medical histories. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df.shape: (299, 28)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>surgery</th>\n",
" <th>Age</th>\n",
" <th>Hospital_Number</th>\n",
" <th>rectal_temperature</th>\n",
" <th>pulse</th>\n",
" <th>respiratory_rate</th>\n",
" <th>temperature_of_extremities</th>\n",
" <th>peripheral_pulse</th>\n",
" <th>mucous_membranes</th>\n",
" <th>capillary_refill_time</th>\n",
" <th>...</th>\n",
" <th>packed_cell_volume</th>\n",
" <th>total_protein</th>\n",
" <th>abdominocentesis_appearance</th>\n",
" <th>abdomcentesis_total_protein</th>\n",
" <th>outcome</th>\n",
" <th>surgical_lesion</th>\n",
" <th>type_of_lesion1</th>\n",
" <th>type_of_lesion2</th>\n",
" <th>type_of_lesion3</th>\n",
" <th>cp_data</th>\n",
" </tr>\n",
" <tr>\n",
" <th>ID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>530101</td>\n",
" <td>38.50</td>\n",
" <td>66</td>\n",
" <td>28</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>45.00</td>\n",
" <td>8.40</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>died</td>\n",
" <td>2</td>\n",
" <td>11300</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>534817</td>\n",
" <td>39.2</td>\n",
" <td>88</td>\n",
" <td>20</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>50</td>\n",
" <td>85</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>euthanized</td>\n",
" <td>2</td>\n",
" <td>2208</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" surgery Age Hospital_Number rectal_temperature pulse respiratory_rate \\\n",
"ID \n",
"0 2 1 530101 38.50 66 28 \n",
"1 1 1 534817 39.2 88 20 \n",
"\n",
" temperature_of_extremities peripheral_pulse mucous_membranes \\\n",
"ID \n",
"0 3 3 NaN \n",
"1 NaN NaN 4 \n",
"\n",
" capillary_refill_time ... packed_cell_volume total_protein \\\n",
"ID ... \n",
"0 2 ... 45.00 8.40 \n",
"1 1 ... 50 85 \n",
"\n",
" abdominocentesis_appearance abdomcentesis_total_protein outcome \\\n",
"ID \n",
"0 NaN NaN died \n",
"1 2 2 euthanized \n",
"\n",
" surgical_lesion type_of_lesion1 type_of_lesion2 type_of_lesion3 cp_data \n",
"ID \n",
"0 2 11300 0 0 2 \n",
"1 2 2208 0 0 2 \n",
"\n",
"[2 rows x 28 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from itertools import combinations\n",
"\n",
"pd.options.display.float_format = \"{:,.2f}\".format\n",
"\n",
"names = \"surgery,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,pain,peristalsis,abdominal distension,nasogastric tube,nasogastric reflux,nasogastric reflux PH,rectal examination,abdomen,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion,type of lesion1,type of lesion2,type of lesion3,cp_data\"\n",
"names = names.replace(\" \", \"_\").split(\",\")\n",
"file_path = (\n",
"    \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv\"\n",
")\n",
"df = pd.read_csv(file_path, names=names)\n",
"label_col = \"outcome\"\n",
"\n",
"\n",
"def preprocess_df(df):\n",
"    df.columns = [c.strip() for c in df.columns]\n",
"    df = df.replace(\"?\", np.nan).replace(\"nan\", np.nan)\n",
"    df = df[~(df[\"outcome\"].isna())].copy()  # clean up label column\n",
"    df.index.name = \"ID\"\n",
"    df[label_col] = (\n",
"        df[label_col]\n",
"        .astype(str)\n",
"        .replace({\"1\": \"lived\", \"2\": \"died\", \"3\": \"euthanized\"})\n",
"    )\n",
"    return df\n",
"\n",
"\n",
"df = preprocess_df(df)\n",
"print(f\"df.shape: {df.shape}\")\n",
"df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we will check how many features have `NaN` values."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The number of features with NaN values is 19\n"
]
}
],
"source": [
"# There are 28 features in this dataset\n",
"def number_of_features_with_NaN(df):\n",
"    _s_na = df.isna().sum()\n",
"    return len(_s_na[_s_na > 0])\n",
"\n",
"\n",
"print(f\"The number of features with NaN values is {number_of_features_with_NaN(df)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Near Zero Variance\n",
"\n",
"Simulating our initial workflow, we will remove features with NaN values using our Python implementation of the [near zero variance](http://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors) function from R's caret library (with its default values). \n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After processing the dataset via `near_zero_variance` we are left with 18 features with NaN values.\n",
"\n"
]
}
],
"source": [
"def near_zero_variance(df, cols=None, frq_cut=95 / 5, unique_cut=10):\n",
"    # `cols is None` avoids the ambiguous truth value of a pandas Index\n",
"    if cols is None:\n",
"        cols = df.columns\n",
"    drop_cols = []\n",
"    for col in cols:\n",
"        # value_counts sorts by frequency, so index 0 is the most common value\n",
"        val_count = list(\n",
"            df[col].value_counts(dropna=False, normalize=True).to_dict().items()\n",
"        )\n",
"        if len(val_count) == 1:  # a single distinct value: zero variance\n",
"            drop_cols.append(col)\n",
"            continue\n",
"        lunique = len(val_count)\n",
"        percent_unique = 100 * lunique / len(df[col])\n",
"        freq_ratio = val_count[0][1] / val_count[1][1] + 1e-5\n",
"\n",
"        if freq_ratio > frq_cut and percent_unique <= unique_cut:\n",
"            drop_cols.append(col)\n",
"    return df[[c for c in df.columns if c not in drop_cols]]\n",
"\n",
"\n",
"df_nzr = near_zero_variance(df)\n",
"print(\n",
"    f\"After processing the dataset via `near_zero_variance` we are left with {number_of_features_with_NaN(df_nzr)} features with NaN values.\\n\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deeper NaN Analysis\n",
"Since we are only interested in understanding the missing values in the dataset, we can look at how many `NaN` values each feature contains.\n",
"Let's now plot, for each `NaN` percentage threshold, how many of the remaining features have at least that share of `NaN`s."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def plot_percent_nan_in_features(df):\n",
"    nan = df.isna().sum() / len(df)\n",
"    feature_nan_threshold = {\n",
"        round(threshold, 2): len(df.columns) - len(nan[nan < threshold])\n",
"        for threshold in np.arange(0, 1.01, 0.05)\n",
"    }\n",
"    _df = pd.DataFrame.from_dict(\n",
"        feature_nan_threshold, orient=\"index\", columns=[\"num_of_features\"]\n",
"    )\n",
"    _df.plot(\n",
"        kind=\"bar\",\n",
"        ylabel=\"Number of Features\",\n",
"        xlabel=\"Percentage of NaNs in feature\",\n",
"        figsize=(10, 5),\n",
"        grid=True,\n",
"    )\n",
"\n",
"\n",
"plot_percent_nan_in_features(df_nzr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the above plot, we can see that 2 features have more than 40% `NaN`s and 6 features have more than 25%. \n",
"\n",
"We can see that some of the features have a very high share of `NaN` values. \n",
"\n",
"Let's now remove these problematic features.\n",
"For this example we will set a threshold of 35%. \n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After drop_features_above_threshold_max_na the number of features with NaNs that we are left with are : 14\n"
]
}
],
"source": [
"threshold_max_na = 0.35\n",
"\n",
"\n",
"def drop_features_above_threshold_max_na(df, threshold_max_na=threshold_max_na):\n",
"    nan = df.isna().sum() / len(df)\n",
"    nan_threshold = nan[nan > threshold_max_na]\n",
"    df = df.drop(columns=nan_threshold.index).copy()\n",
"    return df\n",
"\n",
"\n",
"df_nzr_threshold = drop_features_above_threshold_max_na(df_nzr)\n",
"print(\n",
"    f\"After drop_features_above_threshold_max_na the number of features with NaNs that we are left with are : {number_of_features_with_NaN(df_nzr_threshold)}\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may assume that we have removed the problematic features and can now impute our NaN data and run our pipeline/model. \n",
"\n",
"But before we do so let's look a bit more closely at our classification. \n",
"\n",
"## NaN Distribution Among the Classifier Labels \n",
"Looking at the classifier label feature `outcome` we can see the following distribution:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>num</th>\n",
" <th>percent</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>lived</th>\n",
" <td>178</td>\n",
" <td>0.60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>died</th>\n",
" <td>77</td>\n",
" <td>0.26</td>\n",
" </tr>\n",
" <tr>\n",
" <th>euthanized</th>\n",
" <td>44</td>\n",
" <td>0.15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" num percent\n",
"lived 178 0.60\n",
"died 77 0.26\n",
"euthanized 44 0.15"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def create_value_counts_df(df, s2=pd.Series(dtype=float), col=None, check_na=False):\n",
"    if col:\n",
"        s1 = df[col].value_counts(dropna=False)\n",
"    if check_na:\n",
"        s1 = df.isna().sum()\n",
"        s2 = df.isna().sum() / len(df)\n",
"    if s2.empty:\n",
"        s2 = df[col].value_counts(normalize=True, dropna=False)\n",
"    df = pd.concat([s1, s2], axis=1)\n",
"    df.columns = [\"num\", \"percent\"]\n",
"    return df.sort_values(by=\"percent\", ascending=False)\n",
"\n",
"\n",
"df_labels_counts = create_value_counts_df(df, col=label_col)\n",
"df_labels_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the distribution of the classes is uneven. \n",
"Our class distribution is approximately 60%, 26%, 15% between the lived, died, euthanized classes (respectively). \n",
"\n",
"**But how are our NaNs distributed**? \n",
"\n",
"What is the distribution of `NaN`s in each feature in relation to our classification field?\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>all_sum_na</th>\n",
" <th>lived_sum_na</th>\n",
" <th>died_sum_na</th>\n",
" <th>euthanized_sum_na</th>\n",
" <th>all_percentage_na</th>\n",
" <th>lived_percentage_na</th>\n",
" <th>died_percentage_na</th>\n",
" <th>euthanized_percentage_na</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>rectal_temperature</th>\n",
" <td>60</td>\n",
" <td>26</td>\n",
" <td>24</td>\n",
" <td>10</td>\n",
" <td>0.20</td>\n",
" <td>0.15</td>\n",
" <td>0.31</td>\n",
" <td>0.23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pulse</th>\n",
" <td>24</td>\n",
" <td>12</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>0.08</td>\n",
" <td>0.07</td>\n",
" <td>0.14</td>\n",
" <td>0.02</td>\n",
" </tr>\n",
" <tr>\n",
" <th>respiratory_rate</th>\n",
" <td>58</td>\n",
" <td>31</td>\n",
" <td>19</td>\n",
" <td>8</td>\n",
" <td>0.19</td>\n",
" <td>0.17</td>\n",
" <td>0.25</td>\n",
" <td>0.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>temperature_of_extremities</th>\n",
" <td>56</td>\n",
" <td>32</td>\n",
" <td>13</td>\n",
" <td>11</td>\n",
" <td>0.19</td>\n",
" <td>0.18</td>\n",
" <td>0.17</td>\n",
" <td>0.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>peripheral_pulse</th>\n",
" <td>69</td>\n",
" <td>39</td>\n",
" <td>18</td>\n",
" <td>12</td>\n",
" <td>0.23</td>\n",
" <td>0.22</td>\n",
" <td>0.23</td>\n",
" <td>0.27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mucous_membranes</th>\n",
" <td>47</td>\n",
" <td>28</td>\n",
" <td>11</td>\n",
" <td>8</td>\n",
" <td>0.16</td>\n",
" <td>0.16</td>\n",
" <td>0.14</td>\n",
" <td>0.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>capillary_refill_time</th>\n",
" <td>32</td>\n",
" <td>19</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" <td>0.11</td>\n",
" <td>0.11</td>\n",
" <td>0.13</td>\n",
" <td>0.07</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pain</th>\n",
" <td>55</td>\n",
" <td>34</td>\n",
" <td>12</td>\n",
" <td>9</td>\n",
" <td>0.18</td>\n",
" <td>0.19</td>\n",
" <td>0.16</td>\n",
" <td>0.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>peristalsis</th>\n",
" <td>44</td>\n",
" <td>22</td>\n",
" <td>15</td>\n",
" <td>7</td>\n",
" <td>0.15</td>\n",
" <td>0.12</td>\n",
" <td>0.19</td>\n",
" <td>0.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>abdominal_distension</th>\n",
" <td>56</td>\n",
" <td>31</td>\n",
" <td>14</td>\n",
" <td>11</td>\n",
" <td>0.19</td>\n",
" <td>0.17</td>\n",
" <td>0.18</td>\n",
" <td>0.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nasogastric_tube</th>\n",
" <td>104</td>\n",
" <td>62</td>\n",
" <td>25</td>\n",
" <td>17</td>\n",
" <td>0.35</td>\n",
" <td>0.35</td>\n",
" <td>0.32</td>\n",
" <td>0.39</td>\n",
" </tr>\n",
" <tr>\n",
" <th>rectal_examination</th>\n",
" <td>102</td>\n",
" <td>56</td>\n",
" <td>26</td>\n",
" <td>20</td>\n",
" <td>0.34</td>\n",
" <td>0.31</td>\n",
" <td>0.34</td>\n",
" <td>0.45</td>\n",
" </tr>\n",
" <tr>\n",
" <th>packed_cell_volume</th>\n",
" <td>29</td>\n",
" <td>13</td>\n",
" <td>8</td>\n",
" <td>8</td>\n",
" <td>0.10</td>\n",
" <td>0.07</td>\n",
" <td>0.10</td>\n",
" <td>0.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>total_protein</th>\n",
" <td>33</td>\n",
" <td>13</td>\n",
" <td>12</td>\n",
" <td>8</td>\n",
" <td>0.11</td>\n",
" <td>0.07</td>\n",
" <td>0.16</td>\n",
" <td>0.18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" all_sum_na lived_sum_na died_sum_na \\\n",
"rectal_temperature 60 26 24 \n",
"pulse 24 12 11 \n",
"respiratory_rate 58 31 19 \n",
"temperature_of_extremities 56 32 13 \n",
"peripheral_pulse 69 39 18 \n",
"mucous_membranes 47 28 11 \n",
"capillary_refill_time 32 19 10 \n",
"pain 55 34 12 \n",
"peristalsis 44 22 15 \n",
"abdominal_distension 56 31 14 \n",
"nasogastric_tube 104 62 25 \n",
"rectal_examination 102 56 26 \n",
"packed_cell_volume 29 13 8 \n",
"total_protein 33 13 12 \n",
"\n",
" euthanized_sum_na all_percentage_na \\\n",
"rectal_temperature 10 0.20 \n",
"pulse 1 0.08 \n",
"respiratory_rate 8 0.19 \n",
"temperature_of_extremities 11 0.19 \n",
"peripheral_pulse 12 0.23 \n",
"mucous_membranes 8 0.16 \n",
"capillary_refill_time 3 0.11 \n",
"pain 9 0.18 \n",
"peristalsis 7 0.15 \n",
"abdominal_distension 11 0.19 \n",
"nasogastric_tube 17 0.35 \n",
"rectal_examination 20 0.34 \n",
"packed_cell_volume 8 0.10 \n",
"total_protein 8 0.11 \n",
"\n",
" lived_percentage_na died_percentage_na \\\n",
"rectal_temperature 0.15 0.31 \n",
"pulse 0.07 0.14 \n",
"respiratory_rate 0.17 0.25 \n",
"temperature_of_extremities 0.18 0.17 \n",
"peripheral_pulse 0.22 0.23 \n",
"mucous_membranes 0.16 0.14 \n",
"capillary_refill_time 0.11 0.13 \n",
"pain 0.19 0.16 \n",
"peristalsis 0.12 0.19 \n",
"abdominal_distension 0.17 0.18 \n",
"nasogastric_tube 0.35 0.32 \n",
"rectal_examination 0.31 0.34 \n",
"packed_cell_volume 0.07 0.10 \n",
"total_protein 0.07 0.16 \n",
"\n",
" euthanized_percentage_na \n",
"rectal_temperature 0.23 \n",
"pulse 0.02 \n",
"respiratory_rate 0.18 \n",
"temperature_of_extremities 0.25 \n",
"peripheral_pulse 0.27 \n",
"mucous_membranes 0.18 \n",
"capillary_refill_time 0.07 \n",
"pain 0.20 \n",
"peristalsis 0.16 \n",
"abdominal_distension 0.25 \n",
"nasogastric_tube 0.39 \n",
"rectal_examination 0.45 \n",
"packed_cell_volume 0.18 \n",
"total_protein 0.18 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def get_na_per_label(df, label_col, labels=df_labels_counts.index):\n",
"    labels = list(labels)  # use the argument rather than the global index\n",
"    cols = [c for c in df.columns if c != label_col]\n",
"    # one row per feature: total NaN count, then the NaN count within each class\n",
"    sum_na = (\n",
"        [df[col].isna().sum()]\n",
"        + [df[df[label_col] == label][col].isna().sum() for label in labels]\n",
"        for col in cols\n",
"    )\n",
"    df_sum_na = pd.DataFrame(\n",
"        columns=[f\"{col}_sum_na\" for col in [\"all\"] + labels],\n",
"        data=sum_na,\n",
"        index=cols,\n",
"    )\n",
"    df_sum_na[\"all_percentage_na\"] = df.isna().sum() / len(df)\n",
"    for label in labels:\n",
"        # per-class NaN share = NaN count in the class / size of the class\n",
"        df_sum_na[f\"{label}_percentage_na\"] = (\n",
"            df_sum_na[f\"{label}_sum_na\"] / df_labels_counts.loc[label, \"num\"]\n",
"        )\n",
"    return df_sum_na\n",
"\n",
"\n",
"def get_na_cols(df):\n",
"    _s_na = df.isna().sum()\n",
"    na_cols = _s_na[_s_na > 0].index\n",
"    na_cols = list(na_cols) + [label_col]\n",
"    return na_cols\n",
"\n",
"na_cols = get_na_cols(df_nzr_threshold)\n",
"df_sum_na = get_na_per_label(df_nzr_threshold[na_cols], label_col)\n",
"df_sum_na"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the `NaN`s are not evenly distributed across the classes, _e.g._ in the `rectal_temperature` feature the share of `NaN`s in the `died` class (31%) is about twice the share in the `lived` class (15%), with the `euthanized` class (23%) in between. \n"
]
},
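{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the percentages in the table, the per-class `NaN` shares of `rectal_temperature` can be recomputed by hand. This is only a sketch: the counts are copied from the two tables above, and the dictionary names are our own illustrative choices.\n",
"\n",
"```python\n",
"# Counts copied from the tables above for rectal_temperature.\n",
"class_sizes = {'lived': 178, 'died': 77, 'euthanized': 44}\n",
"nan_counts = {'lived': 26, 'died': 24, 'euthanized': 10}\n",
"\n",
"# Per-class NaN share = NaN count in the class / number of rows in the class.\n",
"nan_share = {label: nan_counts[label] / class_sizes[label] for label in class_sizes}\n",
"for label, share in nan_share.items():\n",
"    print(f'{label}: {share:.2f}')\n",
"\n",
"# How much more often the value is missing for horses that died than for those that lived.\n",
"died_vs_lived = nan_share['died'] / nan_share['lived']\n",
"print(f'died / lived: {died_vs_lived:.2f}')\n",
"```\n",
"\n",
"The raw counts for `lived` (26) and `died` (24) are almost equal, but because the classes differ in size the shares differ by roughly a factor of two."
]
},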
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's assume that we don't want to remove any feature that has less than 15% `NaN`s, no matter how its `NaN`s are distributed across the classification field."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>all_sum_na</th>\n",
" <th>lived_sum_na</th>\n",
" <th>died_sum_na</th>\n",
" <th>euthanized_sum_na</th>\n",
" <th>all_percentage_na</th>\n",
" <th>lived_percentage_na</th>\n",
" <th>died_percentage_na</th>\n",
" <th>euthanized_percentage_na</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>rectal_temperature</th>\n",
" <td>60</td>\n",
" <td>26</td>\n",
" <td>24</td>\n",
" <td>10</td>\n",
" <td>0.20</td>\n",
" <td>0.15</td>\n",
" <td>0.31</td>\n",
" <td>0.23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>respiratory_rate</th>\n",
" <td>58</td>\n",
" <td>31</td>\n",
" <td>19</td>\n",
" <td>8</td>\n",
" <td>0.19</td>\n",
" <td>0.17</td>\n",
" <td>0.25</td>\n",
" <td>0.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>temperature_of_extremities</th>\n",
" <td>56</td>\n",
" <td>32</td>\n",
" <td>13</td>\n",
" <td>11</td>\n",
" <td>0.19</td>\n",
" <td>0.18</td>\n",
" <td>0.17</td>\n",
" <td>0.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>peripheral_pulse</th>\n",
" <td>69</td>\n",
" <td>39</td>\n",
" <td>18</td>\n",
" <td>12</td>\n",
" <td>0.23</td>\n",
" <td>0.22</td>\n",
" <td>0.23</td>\n",
" <td>0.27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mucous_membranes</th>\n",
" <td>47</td>\n",
" <td>28</td>\n",
" <td>11</td>\n",
" <td>8</td>\n",
" <td>0.16</td>\n",
" <td>0.16</td>\n",
" <td>0.14</td>\n",
" <td>0.18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pain</th>\n",
" <td>55</td>\n",
" <td>34</td>\n",
" <td>12</td>\n",
" <td>9</td>\n",
" <td>0.18</td>\n",
" <td>0.19</td>\n",
" <td>0.16</td>\n",
" <td>0.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>abdominal_distension</th>\n",
" <td>56</td>\n",
" <td>31</td>\n",
" <td>14</td>\n",
" <td>11</td>\n",
" <td>0.19</td>\n",
" <td>0.17</td>\n",
" <td>0.18</td>\n",
" <td>0.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nasogastric_tube</th>\n",
" <td>104</td>\n",
" <td>62</td>\n",
" <td>25</td>\n",
" <td>17</td>\n",
" <td>0.35</td>\n",
" <td>0.35</td>\n",
" <td>0.32</td>\n",
" <td>0.39</td>\n",
" </tr>\n",
" <tr>\n",
" <th>rectal_examination</th>\n",
" <td>102</td>\n",
" <td>56</td>\n",
" <td>26</td>\n",
" <td>20</td>\n",
" <td>0.34</td>\n",
" <td>0.31</td>\n",
" <td>0.34</td>\n",
" <td>0.45</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" all_sum_na lived_sum_na died_sum_na \\\n",
"rectal_temperature 60 26 24 \n",
"respiratory_rate 58 31 19 \n",
"temperature_of_extremities 56 32 13 \n",
"peripheral_pulse 69 39 18 \n",
"mucous_membranes 47 28 11 \n",
"pain 55 34 12 \n",
"abdominal_distension 56 31 14 \n",
"nasogastric_tube 104 62 25 \n",
"rectal_examination 102 56 26 \n",
"\n",
" euthanized_sum_na all_percentage_na \\\n",
"rectal_temperature 10 0.20 \n",
"respiratory_rate 8 0.19 \n",
"temperature_of_extremities 11 0.19 \n",
"peripheral_pulse 12 0.23 \n",
"mucous_membranes 8 0.16 \n",
"pain 9 0.18 \n",
"abdominal_distension 11 0.19 \n",
"nasogastric_tube 17 0.35 \n",
"rectal_examination 20 0.34 \n",
"\n",
" lived_percentage_na died_percentage_na \\\n",
"rectal_temperature 0.15 0.31 \n",
"respiratory_rate 0.17 0.25 \n",
"temperature_of_extremities 0.18 0.17 \n",
"peripheral_pulse 0.22 0.23 \n",
"mucous_membranes 0.16 0.14 \n",
"pain 0.19 0.16 \n",
"abdominal_distension 0.17 0.18 \n",
"nasogastric_tube 0.35 0.32 \n",
"rectal_examination 0.31 0.34 \n",
"\n",
" euthanized_percentage_na \n",
"rectal_temperature 0.23 \n",
"respiratory_rate 0.18 \n",
"temperature_of_extremities 0.25 \n",
"peripheral_pulse 0.27 \n",
"mucous_membranes 0.18 \n",
"pain 0.20 \n",
"abdominal_distension 0.25 \n",
"nasogastric_tube 0.39 \n",
"rectal_examination 0.45 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"threshold_min_na = 0.15\n",
"mask_threshold_min_na = df_sum_na[\"all_percentage_na\"] > threshold_min_na\n",
"df_sum_min_na = df_sum_na[mask_threshold_min_na].copy()\n",
"df_sum_min_na"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can further analyse our data. Let's look at the ratio of the `NaN` percentages between each pair of classes."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 864x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def create_ratio_between_classes(df, label_classes):\n",
"    for label_a, label_b in combinations(label_classes, 2):\n",
"        df[f\"ratio_percentage_{label_a}_{label_b}\"] = (\n",
"            df[f\"{label_a}_percentage_na\"] / df[f\"{label_b}_percentage_na\"]\n",
"        )\n",
"    col_ratio = [col for col in df.columns if \"ratio_percentage\" in col]\n",
"    return df[col_ratio]\n",
"\n",
"\n",
"create_ratio_between_classes(df_sum_min_na, df_labels_counts.index).plot(\n",
" figsize=(12, 5), rot=45, grid=True\n",
");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ratios close to 1 mean that the two classes have a similar percentage of `NaN`s, whereas ratios further from 1 mean that the `NaN`s are unevenly distributed between those classes. \n",
"\n",
"We can set a lower and an upper threshold for filtering out the problematic features. If all of a feature's ratios are between these limits, we keep the feature. If any ratio falls outside these limits, we assume that the `NaN`s are unevenly distributed and remove the feature. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['rectal_examination',\n",
" 'abdominal_distension',\n",
" 'rectal_temperature',\n",
" 'temperature_of_extremities',\n",
" 'respiratory_rate']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def get_features_outside_threshold(\n",
"    df,\n",
"    lt_ratio_threshold=0.7,\n",
"    gt_ratio_threshold=1.3,\n",
"):\n",
"    features_to_drop = []\n",
"    for col in [col for col in df.columns if \"ratio\" in col]:\n",
"        mask_ratio_threshold = ~(\n",
"            df[col].between(lt_ratio_threshold, gt_ratio_threshold)\n",
"        )\n",
"        features_to_drop.extend(mask_ratio_threshold[mask_ratio_threshold].index)\n",
"    features_to_drop = list(set(features_to_drop))\n",
"    return features_to_drop\n",
"\n",
"\n",
"features_to_drop = get_features_outside_threshold(df_sum_min_na)\n",
"features_to_drop"
]
},
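{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why one feature is kept and another is dropped, the ratio filter can be traced by hand for two features. This is a stdlib-only sketch rather than the pandas code above: the counts are copied from the earlier tables, the bounds 0.7 and 1.3 are the defaults of `get_features_outside_threshold`, and the helper names `pairwise_ratios` and `is_evenly_distributed` are hypothetical.\n",
"\n",
"```python\n",
"from itertools import combinations\n",
"\n",
"# Class sizes and per-class NaN counts copied from the tables above.\n",
"class_sizes = {'lived': 178, 'died': 77, 'euthanized': 44}\n",
"nan_counts = {\n",
"    'nasogastric_tube': {'lived': 62, 'died': 25, 'euthanized': 17},\n",
"    'rectal_examination': {'lived': 56, 'died': 26, 'euthanized': 20},\n",
"}\n",
"\n",
"\n",
"def pairwise_ratios(counts):\n",
"    # Ratio of per-class NaN shares for every pair of classes.\n",
"    share = {label: counts[label] / class_sizes[label] for label in class_sizes}\n",
"    return {(a, b): share[a] / share[b] for a, b in combinations(class_sizes, 2)}\n",
"\n",
"\n",
"def is_evenly_distributed(counts, low=0.7, high=1.3):\n",
"    # Keep a feature only if every pairwise ratio lies within [low, high].\n",
"    return all(low <= r <= high for r in pairwise_ratios(counts).values())\n",
"\n",
"\n",
"for feature, counts in nan_counts.items():\n",
"    print(feature, is_evenly_distributed(counts))\n",
"```\n",
"\n",
"For `rectal_examination` the lived/euthanized ratio is about 0.69, just below the lower bound, which is why it appears in the list of features to drop, while all three ratios of `nasogastric_tube` stay inside the bounds."
]
},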
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After removing the features from get_features_outside_threshold function we are left with 9 features with `NaN`s that we are going to impute\n"
]
}
],
"source": [
"df_for_model = df_nzr_threshold.drop(columns=features_to_drop)\n",
"print(\n",
"    f\"After removing the features from get_features_outside_threshold function we are left with {number_of_features_with_NaN(df_for_model)} features with `NaN`s that we are going to impute\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary \n",
"\n",
"This post describes a pitfall of analysing `NaN`s for feature selection. \n",
"\n",
"Simple filtering methods do not always perform as expected, and extra care should be taken when working with sparse matrices.\n",
"\n",
"We can analyse `NaN`s within features at multiple levels: \n",
"1. At the global level - i.e. the total amount of NaNs within a feature (both for removing and for keeping features) \n",
"2. At the label/classification level - i.e. the relative distribution of NaNs per class.\n",
"\n",
"Finally, we recommend trying out the [missingno package](https://github.com/ResidentMario/missingno) for graphical analysis of `NaN` values.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}