Too often, we take imputation for granted. At level one, we apply imputation strategies as if we're debugging, i.e. we run `df.fillna(df.mean())` because we just want something that lets the pipeline train and run inference. At level two, we put a little more TLC into EDA and experimentation, ending up with motivated imputation strategies, i.e. thinking critically about the interaction between imputation and the other pieces of our pipeline, informed mostly by validation loss.
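As a quick illustration of level two, here is a minimal sketch (my own toy example, not from the talk) that compares fill strategies by validation score, using scikit-learn's `SimpleImputer` on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data with roughly 10% of values knocked out at random.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan

# "Level two": pick an imputation strategy by how it affects validation score.
for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy), Ridge())
    print(strategy, round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```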
In this talk from SciPy 2018 by Dillon Niederhut, I found a proposal for level three. The problem with level two is that imputation erases information: namely, the distribution of missing vs. non-missing values. If we don't record that distribution, the reporting of results can't be fully transparent. Niederhut calls for data scientists to push publishing conventions toward reporting the missingness at which you found the data as a minimal requirement for interpreting results.
The three regimes of missingness are MCAR, MAR, and MNAR:
- MCAR: Missing Completely at Random. Here, an imp is going around deleting things at a roll of the dice, just to get on your nerves. As Niederhut explains, there is a very high burden of proof for believing that your data is MCAR.
- MAR: Missing at Random. Here, you can show that a feature's missingness is correlated with missingness elsewhere in the data. This is the only one that was easy to script up, and you can view some code below.
- MNAR: Missing Not at Random. Here, information _within a feature_ can explain the missingness of that feature. Showing exactly what that explanation is is itself a serious inference problem, even on a case-by-case basis, so we probably shouldn't expect a general solution. (A toy simulation of all three regimes follows this list.)
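To make the regimes concrete, here is a toy simulation (my own, not from the talk; the column names are made up) that knocks values out of a synthetic dataframe under each regime:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(10, 1, 1000),
                   "age": rng.integers(18, 90, 1000).astype(float)})

# MCAR: the imp deletes income values purely at random.
mcar = df.copy()
mcar.loc[rng.random(len(df)) < 0.2, "income"] = np.nan

# MAR: income goes missing depending on another observed feature (age).
mar = df.copy()
mar.loc[mar["age"] > 65, "income"] = np.nan

# MNAR: income goes missing depending on income itself (high earners decline to answer).
mnar = df.copy()
mnar.loc[mnar["income"] > mnar["income"].quantile(0.8), "income"] = np.nan
```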
All imputation introduces bias, including just dropping rows. If you have an MCAR feature, filling with the mean is a reasonable compromise (but the bargain is that if you believe your feature is MCAR, it's probably wishful thinking). You can show a feature is MAR with respect to other features by a simple script involving the correlation matrix of `df.isna().astype(int)` (which is what I did), and something like `from fancyimpute import IterativeImputer` will be the most successful. If a feature is MNAR, dropping is better than filling with the mean.
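Here is a minimal sketch of iterative imputation on a toy dataframe. I'm importing `IterativeImputer` from scikit-learn, where the fancyimpute estimator was later merged; treat the exact import path as an assumption about your installed versions:

```python
import numpy as np
import pandas as pd

# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "c": [5.0, 4.0, 3.0, np.nan, 1.0, 0.0],
})

# Each feature with missing values is modeled as a function of the other features.
filled = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                      columns=df.columns)
print(filled)
```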
To find the MAR correspondence between features, take the correlation matrix of the binary matrix that marks whether or not a value is null.
```python
# Correlation between the missingness indicators of each pair of features.
missingness_corr: pd.DataFrame = df.isna().astype(int).corr()
```
While much of EDA asks when values are correlated, understanding missingness requires us to ask when missingness is correlated.
Introduce a subjective strength parameter `corr_strength: float`, a value in (0, 1), that decides whether a correlation is "strong enough" to be logged.
For a given feature `featu`:

```python
# Features whose missingness correlates with featu's (including featu itself).
xs = [k for k, v in (abs(missingness_corr[featu]) > corr_strength).items() if v]
if len(xs) > 1:
    missing_correlates = [x for x in xs if x != featu]
```

We now have a list of features whose missingness correlates with that of `featu`.
If you were writing a report or a missingness tracker, you might print `'MAR(' + ', '.join(missing_correlates) + ')'` for each feature.
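Pulling the snippets above together, a tracker might look something like this (my own consolidation; the function name and the fallback message are hypothetical):

```python
import pandas as pd

def missingness_report(df: pd.DataFrame, corr_strength: float = 0.5) -> None:
    """Print, for each feature with missing values, the features whose
    missingness correlates with it above corr_strength."""
    missingness_corr = df.isna().astype(int).corr()
    for featu in df.columns:
        if not df[featu].isna().any():
            continue
        xs = [k for k, v in (abs(missingness_corr[featu]) > corr_strength).items() if v]
        missing_correlates = [x for x in xs if x != featu]
        if missing_correlates:
            print(featu + ': MAR(' + ', '.join(missing_correlates) + ')')
        else:
            print(featu + ': no missingness correlates found (MCAR or MNAR?)')
```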
A tracker like this would help you while doing data science by suggesting imputation strategies, and help you when talking about your data science by letting you report the regime of missingness in which you found your data.
Missingness is important, and the easiest/fastest imputation strategies can be seductive. Analysis and experiment can show which of the three main regimes of missingness you're working with, feature by feature. MAR is easy to script up and learn about analytically for your data. For MNAR and MCAR, the best you can do is experiment (unless you're a level 89 stats wizard with maxed-out arcanery). Getting into the habit of writing loggers and trackers to accompany you in EDA is a good idea. Reporting the missingness at which you found the data is an important part of your results.
Further resource: