Handling missing values

"Missingness" is almost always informative by itself, and we should tell our algorithm if a value is missing. Even if we build a model to impute our values, we are not adding any real information. We’re just reinforcing the patterns already provided by other features.

Basically, there are three categories of missing data:

  • MCAR (Missing Completely At Random), where the pattern of missingness is statistically independent of the data record. Example: you have a data set on a piece of paper and you spill coffee on it, destroying part of the data.

  • MAR (Missing At Random), where the probability of missingness depends on the observable components of the record. MCAR is a special case of MAR. Example: if a child does not attend an educational assessment because the child is (genuinely) ill, this might be predictable from other data we have about the child's health, but it would not be related to what we would have measured had the child not been ill.

  • MNAR (Missing Not At Random), which covers every case that is not MAR, i.e. when the missingness is related to the missing value itself. Example: a person does not attend a drug test because the person took drugs the night before.

Let's see a few strategies to impute missing values, i.e. to infer them from the known part of the data.

Univariate Feature Imputation

We can rely on scikit-learn's SimpleImputer class, which provides a few strategies for imputing missing values, such as imputing with a constant value or with a statistic (mean, median, or most frequent value) of each column.

Let's see how we can replace missing values (np.nan) using the mean value of the columns that contain the missing values.

https://gist.github.com/c92e55fb567d48be9bed812cf13aee6b
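
The code behind the gist link isn't reproduced in this export; a minimal sketch consistent with the output below (the array used for fitting is an assumption, chosen so that the column means are 7, 3.5, and 6) might look like this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fitting learns each column's mean, ignoring NaNs:
# here the means are 7, 3.5, and 6
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Each NaN is replaced by the mean of its column
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp.transform(X))
```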

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]

SimpleImputer can also be used in conjunction with pandas, and in particular with data represented as strings or categoricals, by using the most_frequent or constant strategy:

https://gist.github.com/dc90bd9c7d1666e12c90c2b79936c60d
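
Again, the gist itself isn't visible here; a sketch that matches the output below, assuming a small categorical DataFrame, could be:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Categorical data with missing entries in both columns
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

# most_frequent fills each column with its mode ("a" and "y" here)
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
```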

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]

Multivariate Feature Imputation

Another approach is to use the IterativeImputer class, which models each feature with missing values as a function of the other features. How does each iteration work? At each step, one feature column is designated as the output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for the rows where y is known, and is then used to predict the missing values of y. This is done for each feature in turn, in an iterative fashion.

https://gist.github.com/cdc5ad045de12964d7e143f1488993b9
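
A sketch that reproduces the output below (the fitting data is an assumption, chosen so that the second feature is roughly twice the first; note that IterativeImputer is still experimental and must be enabled explicitly):

```python
import numpy as np
# IterativeImputer is experimental: this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# Fit on data where the second column is about twice the first
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

# The learned relationship (y = 2 * x) is used to fill the NaNs
X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(np.round(imp.transform(X_test)))
```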

[[ 1.  2.]
 [ 6. 12.]
 [ 3.  6.]]

IterativeImputer is very flexible: it allows you to use a variety of estimators. If you want to delve further into this class, take a look at Imputing missing values with variants of IterativeImputer.

Nearest Neighbors Imputation

The KNNImputer class offers imputation for filling in missing values using the k-Nearest Neighbors approach. By default it uses a Euclidean distance metric that supports missing values, nan_euclidean_distances.

The following code snippet shows how to replace missing values using the mean feature value of the two nearest neighbors of samples with missing values:

https://gist.github.com/6ae18b9038fc612c456dd0d9d5865cfc
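
A sketch that reproduces the two arrays printed below (the input X is echoed first, then its imputed version):

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
print(X)

# Each NaN is replaced by the mean of that feature over the two
# nearest neighbors, measured with the nan_euclidean metric
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```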

[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
[[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  8.  7. ]]

References & Additional Material

  • Preprocessing Data
  • Feature Selection
  • Encoding Categorical Features
  • Feature Importance
  • Handling Missing Values
