"Missingness" is almost always informative by itself, and we should tell our algorithm if a value is missing. Even if we build a model to impute our values, we are not adding any real information. We’re just reinforcing the patterns already provided by other features.
There are essentially three categories of missing data:

- MCAR (Missing Completely At Random): the pattern of missingness is statistically independent of the data record. Example: you have a data set on a piece of paper and you spill coffee on it, destroying part of the data.
- MAR (Missing At Random): the probability distribution of the pattern of missingness is functionally dependent on the observable components of the record; MCAR is a special case of MAR. Example: if a child does not attend an educational assessment because the child is (genuinely) ill, this might be predictable from other data we have about the child's health, but it would not be related to what we would have measured had the child not been ill.
- MNAR (Missing Not At Random): any case that is not MAR, i.e. the missingness is specifically related to what is missing. Example: a person does not attend a drug test because they took drugs the night before.
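Since missingness itself carries signal, it is often worth preserving it explicitly alongside any imputation. A minimal sketch of one way to do this with scikit-learn's `add_indicator` option of `SimpleImputer`, which appends a binary "was missing" column for each feature that had missing values (the data values here are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: the second feature has one missing value
X = np.array([[7.0, 2.0],
              [4.0, np.nan],
              [10.0, 5.0]])

# add_indicator=True appends a 0/1 column flagging where values were missing
imp = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imp.fit_transform(X)
print(Xt)
# [[ 7.   2.   0. ]
#  [ 4.   3.5  1. ]
#  [10.   5.   0. ]]
```

The downstream model now sees both the imputed value (3.5, the column mean) and the fact that it was imputed.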
Let's see a few strategies to impute missing values, i.e. to infer them from the known part of the data.
We can rely on scikit-learn's SimpleImputer class, which provides a few strategies for imputing missing values, such as imputing with a constant value or with a statistic of the column (mean, median, etc.).
Let's see how we can replace missing values (np.nan) using the mean value of the columns that contain the missing values.
https://gist.github.com/c92e55fb567d48be9bed812cf13aee6b
[[ 7. 2. 3. ]
[ 4. 3.5 6. ]
[10. 3.5 9. ]]
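The gist itself is not reproduced here, so the exact fitting data is an assumption; the following sketch produces the output shown above (the imputer learns column means of 7.0, 3.5, and 6.0 from the training matrix, then fills the missing entries of a new matrix):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fit on training data: the learned column means are 7.0, 3.5, and 6.0
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Transform new data: each np.nan is replaced by its column's mean
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
Xt = imp.transform(X)
print(Xt)
# [[ 7.   2.   3. ]
#  [ 4.   3.5  6. ]
#  [10.   3.5  9. ]]
```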
SimpleImputer can also be used in conjunction with pandas, and in particular with data represented as strings or categoricals, by using the most_frequent or constant strategy:
https://gist.github.com/dc90bd9c7d1666e12c90c2b79936c60d
[['a' 'x']
['a' 'y']
['a' 'y']
['b' 'y']]
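A sketch consistent with the output above, assuming a categorical DataFrame where each column has one missing entry; most_frequent fills each hole with that column's mode ('a' and 'y' respectively):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical data with one missing entry per column
df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

# most_frequent replaces NaN with the mode of each column
imp = SimpleImputer(strategy="most_frequent")
out = imp.fit_transform(df)
print(out)
# [['a' 'x']
#  ['a' 'y']
#  ['a' 'y']
#  ['b' 'y']]
```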
Another approach is to use the IterativeImputer class, which models each feature with missing values as a function of the other features. How does each iteration work? At each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X; a regressor is fit on (X, y) using the rows where y is known, and is then used to predict the missing values of y. This is done for each feature in an iterative, round-robin fashion.
https://gist.github.com/cdc5ad045de12964d7e143f1488993b9
[[ 1. 2.]
[ 6. 12.]
[ 3. 6.]]
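A sketch that reproduces the output above, assuming training data in which the second feature is roughly twice the first (the imputer's default regressor learns this linear relationship and uses it in both directions):

```python
import numpy as np
# IterativeImputer is still experimental: this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical training data where feature 2 ≈ 2 × feature 1
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

# Each missing value is predicted from the other feature
X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
Xt = np.round(imp.transform(X_test))
print(Xt)
# [[ 1.  2.]
#  [ 6. 12.]
#  [ 3.  6.]]
```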
IterativeImputer is very flexible: it allows you to use a variety of estimators. If you want to delve further into this class, take a look at Imputing missing values with variants of IterativeImputer.
The KNNImputer class offers imputation for filling in missing values using the k-Nearest Neighbors approach. By default it uses a Euclidean distance metric that supports missing values, nan_euclidean_distances.
The following code snippet shows how to replace missing values using the mean feature value of the two nearest neighbors of samples with missing values:
https://gist.github.com/6ae18b9038fc612c456dd0d9d5865cfc
[[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
[[1. 2. 4. ]
[3. 4. 3. ]
[5.5 6. 5. ]
[8. 8. 7. ]]
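A sketch consistent with the input and output printed above: with n_neighbors=2, each missing entry is replaced by the mean of that feature over the two nearest rows (nearness measured on the coordinates both rows have observed). For example, the missing value in the first row becomes (3 + 5) / 2 = 4:

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]

# Each nan is filled with the mean of that feature from the
# two nearest neighbors, using nan-aware Euclidean distance
imputer = KNNImputer(n_neighbors=2, weights="uniform")
Xt = imputer.fit_transform(X)
print(Xt)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```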