farrajota/missing_value_imputation.md

Missing values in data

Types of missing values

Missing completely at random (MCAR)

MCAR exists when missing values are randomly distributed across all observations.
Missingness in a given variable does not depend on any other variable, whether observed or unobserved.
MCAR can be confirmed by dividing respondents into those with and without missing data, then using t-tests
of mean differences on income, age, gender, and other key variables to establish that the two groups do not
differ significantly on any variable in the model, including the dependent variable. If missing data are MCAR
in a sufficiently large sample, cases with missing values may be dropped listwise from the analysis without
biasing the estimates. If dropping MCAR cases appreciably reduces sample size, however, standard errors will
be increased, increasing the chance of Type II error (false negative inferences; statistical power is
diminished). MCAR, however, is unusual.
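
The group comparison described above can be sketched in a few lines. This is a minimal illustration rather than a full MCAR test: the survey rows, the meaning of the columns (age, income), and the helper `welch_t` are all hypothetical, and only the t statistic is computed (no p-value). A large absolute value would suggest that missingness on income is related to age, which argues against MCAR.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Hypothetical survey rows: (age, income); None marks a missing income.
rows = [
    (25, 30000), (32, None), (41, 52000), (29, 31000),
    (55, None), (38, 47000), (47, None), (23, 28000),
    (60, None), (35, 40000),
]

# Split respondents into those with and without missing income,
# then compare the two groups on another observed variable (age).
age_missing = [age for age, income in rows if income is None]
age_observed = [age for age, income in rows if income is not None]

t_stat = welch_t(age_missing, age_observed)
print(f"Welch t for age (missing vs observed income): {t_stat:.2f}")
```

In practice the same comparison would be repeated for each key variable, with a proper significance test from a statistics package.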

Missing At Random (MAR)

The phrase “missing at random” is misleading since MAR data reflect a systematic rather than random pattern
of missingness. Data are missing at random (MAR) when (1) they are not MCAR, as indicated by a significant Little’s
MCAR test; and (2) missingness may be predicted by other observed variables and does not depend on any
unobserved variables. If missingness may be well predicted from observed variables, then multiple imputation
(MI) is appropriate. In fact, MI assumes MAR as defined here. Listwise deletion will introduce bias if data
are MAR. MAR is much more common than MCAR.
To elaborate, for MAR data, missingness is not independent of the values of other variables in the model but
is predictable by them. This implies that for MAR to be demonstrated, it must be assumed that missingness
does not depend on unobserved variables. This assumption may be wrong (this is the model specification problem).
If missingness is not predictable from observed variables, data are “missing not at random”.
MAR is a spectrum, depending on how much of missingness can be explained by other observed variables. A pure
MAR example would be if there were test scores, test1 and test2, representing scores on two sequential tests.
If students scoring 90 or greater on test1 were excused from test2, and if there were no other dropouts,
missingness on test2 would be completely determined by the test1 variable. At the other end of the spectrum,
in a large dataset it might happen that missingness on a given variable was significantly related to another
observed variable (hence not MCAR) but the relation was so trivial in effect size that missingness could not
be predicted from that variable. The point on this spectrum where prediction ceases to be useful is the point
separating MAR from MNAR.
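
The pure MAR example above can be made concrete with a small sketch. The scores are made up for illustration, and `predicted_missing` is a hypothetical helper that encodes the excusal rule; the point is only that an observed variable (test1) predicts missingness on test2 perfectly.

```python
# Hypothetical (test1, test2) scores. Students scoring 90 or more on test1
# were excused from test2, so their test2 score is missing (None).
scores = [
    (95, None), (72, 68), (90, None), (85, 88),
    (60, 71), (99, None), (89, 80),
]


def predicted_missing(test1):
    """Rule from the example: test2 is missing exactly when test1 >= 90."""
    return test1 >= 90


# Fraction of students whose test2 missingness is predicted correctly
# from the observed test1 score alone.
hits = sum(predicted_missing(t1) == (t2 is None) for t1, t2 in scores)
accuracy = hits / len(scores)
print(f"missingness predicted from test1: {accuracy:.0%}")
```

With real data the prediction is rarely perfect; the lower the accuracy, the closer the data sit to the MNAR end of the spectrum.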

Missing Not At Random (MNAR)

Missing not at random (MNAR), also called non-ignorable missingness, is the most problematic form. It exists
when missing values are neither MCAR nor MAR. This happens when missingness depends at least in part on
unobserved variables (which is why observed variables fail to predict missingness, making data not MAR). Under
MNAR conditions, variables in the dataset are inadequate predictors of missingness because the variable with
missing cases is insufficiently correlated with other variables in the dataset, undermining the effectiveness
of the usual imputation methods, including multiple imputation (MI).
One approach to non-ignorable missingness is to impute values based on data otherwise external to the research
design, for instance estimating race from Census block data associated with the address of the respondent.
However, although the missingness cannot be ignored, there is no well-accepted method of dealing with
non-ignorable missingness.
Sources


http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf

Strategies for treating MCAR, MAR and MNAR missing values

MCAR


listwise deletion (if the % of missing values is not large)
imputation

MAR


imputation
imputation + dummy variable (create a dummy variable containing a mask of values filled vs not filled)
feature deletion (remove variable from the data)

MNAR


feature engineering + imputation (find the cause of missing values and then use imputation - special cases)
fill value with outlier / new category + dummy variable (fill the missing values with a different value and
then create a dummy variable containing a mask of values filled vs not filled)
feature deletion (remove variable from the data)
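
The "imputation + dummy variable" and "fill value with outlier + dummy variable" strategies listed above can be sketched as follows. This is a minimal illustration using mean imputation; the function `impute_with_indicator`, the `income` values, and the sentinel `-1.0` are assumptions made for the example, not part of any particular library.

```python
import statistics


def impute_with_indicator(values, fill=None):
    """Fill missing entries (None) and return a 0/1 mask of filled positions.

    If `fill` is None, the mean of the observed values is used.
    """
    observed = [v for v in values if v is not None]
    if fill is None:
        fill = statistics.mean(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]  # dummy variable
    return filled, was_missing


# Hypothetical feature with two missing values.
income = [30.0, None, 50.0, 40.0, None]

# Strategy 1: mean imputation plus a dummy variable marking filled values.
mean_filled, mask = impute_with_indicator(income)

# Strategy 2: fill with an out-of-range sentinel plus the same dummy variable.
sentinel_filled, _ = impute_with_indicator(income, fill=-1.0)

print(mean_filled, mask)
print(sentinel_filled)
```

Keeping the mask as a separate feature lets a downstream model distinguish real values from filled ones, which is the purpose of the dummy variable in both strategies.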

Sources


https://www.theanalysisfactor.com/missing-data-mechanism/
http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf

When imputation should not be used


If data are MCAR, imputation may not be needed.
If missingness is due to unmeasured variables related to the dependent variable, data are MNAR and should not be imputed.
Imputation assumes data are MAR and should not be used with sparse data. Sparse data occur when missingness is non-random,
such as a shopping cart survey of items purchased (coded 1) or not purchased (coded 0), because the null response (0) is
non-random, due to unmeasured factors possibly not even known to the shopper.
Imputation should not be used to impute all the data for a subject.
Imputation should not be used for a missing value for a given observation if that observation is also missing values on
predictively critical variables in the imputation model. While this is difficult to check for each value to be imputed,
a table of missing value patterns will show how many cases missing on a given variable also have missing values on other
variables. In some cases this may lead a researcher to reject imputation.
Imputation should not be used if over 50% of data are missing (some authors use lower cutoffs, such as 20%).
Imputation is used with cross-sectional or historical data and is not appropriate for imputing future data in a time series.
Use of imputation is suspect if it generates values outside valid ranges.
Imputation based on a single pass is not acceptable due to the probabilistic nature of imputation. While as few as 3 – 5
imputations may suffice for reliability, today 20 – 100 or more imputations are usual.
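
Two of the checks above, the table of missing value patterns and the share of missing data per variable, can be sketched as follows. The records, the column names, and the helpers `missing_patterns` and `missing_fraction` are hypothetical, and the 50% cutoff is simply the one quoted in the list.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "education": 16},
    {"age": None, "income": 48000, "education": 12},
    {"age": 29, "income": None, "education": None},
    {"age": 41, "income": None, "education": 14},
    {"age": 38, "income": None, "education": None},
]
columns = ["age", "income", "education"]


def missing_patterns(rows, cols):
    """Count rows per missingness pattern (tuple of the missing column names)."""
    return Counter(tuple(c for c in cols if row[c] is None) for row in rows)


def missing_fraction(rows, col):
    """Share of rows in which `col` is missing."""
    return sum(row[col] is None for row in rows) / len(rows)


# Pattern table: shows which variables tend to be missing together.
patterns = missing_patterns(records, columns)
for pattern, count in patterns.most_common():
    print(pattern or ("complete",), count)

# Flag variables whose missing share exceeds the 50% cutoff.
too_sparse = [c for c in columns if missing_fraction(records, c) > 0.5]
print("over 50% missing:", too_sparse)
```

Here income and education are missing together in two rows, so education would be a poor predictor when imputing income for those cases.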

Sources


http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf
