@farrajota
Created October 25, 2018 13:32
How to treat missing values in your data

Missing values in data

Types of missing values

Missing completely at random (MCAR)

Data are MCAR when missing values are randomly distributed across all observations: missingness in a given variable does not depend on any other variable, whether observed or unobserved. MCAR can be checked by dividing respondents into those with and without missing data, then using t-tests of mean differences on income, age, gender, and other key variables to establish that the two groups do not differ significantly on any variable in the model, including the dependent variable. Such tests can only detect dependence on observed variables. If missing data are MCAR in a sufficiently large sample, cases with missing values may be dropped listwise from the analysis without biasing the estimates. If dropping MCAR cases appreciably reduces sample size, however, standard errors will increase, raising the chance of Type II error (false negative inferences; statistical power is diminished). In practice, MCAR is unusual.
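
The group comparison described above could be sketched as follows. This is a minimal illustration, assuming pandas and NumPy are available; the data, the column names, and the helpers `welch_t` and `compare_missing_groups` are hypothetical, and Welch's t statistic is computed by hand rather than taken from a statistics package.

```python
import numpy as np
import pandas as pd


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def compare_missing_groups(df, target, covariates):
    """For each covariate, compare cases with and without `target` missing."""
    is_missing = df[target].isna()
    stats = {}
    for col in covariates:
        with_na = df.loc[is_missing, col].dropna()
        without_na = df.loc[~is_missing, col].dropna()
        stats[col] = welch_t(with_na, without_na)
    return pd.Series(stats, name="welch_t")


# Hypothetical data: income values are deleted completely at random,
# so the data are MCAR by construction.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
})
df.loc[rng.random(n) < 0.2, "income"] = np.nan

t_stats = compare_missing_groups(df, "income", ["age"])
print(t_stats)
```

A t statistic far from zero on any covariate would be evidence against MCAR; values near zero are consistent with MCAR but do not prove it.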

Missing At Random (MAR)

The phrase “missing at random” is misleading, since MAR data reflect a systematic rather than random pattern of missingness. Data are missing at random (MAR) when (1) they are not MCAR, as indicated by a significant Little’s MCAR test; and (2) missingness can be predicted from other observed variables and does not depend on any unobserved variables. If missingness is well predicted from observed variables, then multiple imputation (MI) is appropriate. In fact, MI assumes MAR as defined here. Listwise deletion will introduce bias if data are MAR. MAR is much more common than MCAR.

To elaborate, for MAR data, missingness is not independent of the values of other variables in the model but is predictable by them. This implies that for MAR to be demonstrated, it must be assumed that missingness does not depend on unobserved variables. This assumption may be wrong (this is the model specification problem). If missingness is not predictable from observed variables, data are “missing not at random”.

MAR is a spectrum, depending on how much of the missingness can be explained by other observed variables. A pure MAR example would be two variables, test1 and test2, holding scores on two sequential tests. If students scoring 90 or greater on test1 were excused from test2, and if there were no other dropouts, missingness on test2 would be completely determined by the test1 variable. At the other end of the spectrum, in a large dataset it might happen that missingness on a given variable was significantly related to another observed variable (hence not MCAR) but the relation was so trivial in effect size that missingness could not be predicted from that variable. The point on this spectrum where prediction ceases to be useful is the point separating MAR from MNAR.
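
The pure MAR example above can be simulated in a few lines. This is a sketch under stated assumptions: the scores are randomly generated, and the score range and the noise added to test2 are arbitrary choices made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Hypothetical scores: test1 from 50 to 100, test2 is test1 plus some noise.
test1 = rng.integers(50, 101, n)
test2 = np.clip(test1 + rng.integers(-10, 11, n), 0, 100).astype(float)

# Students scoring 90 or greater on test1 are excused from test2.
test2[test1 >= 90] = np.nan
df = pd.DataFrame({"test1": test1, "test2": test2})

# Missingness on test2 is a deterministic function of the observed test1.
excused = df["test1"] >= 90
missing = df["test2"].isna()
print(pd.crosstab(excused, missing,
                  rownames=["test1 >= 90"], colnames=["test2 missing"]))
```

The cross-tabulation has no off-diagonal counts: knowing test1 is enough to know whether test2 is missing.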

Missing Not At Random (MNAR)

Missing not at random (MNAR), also called non-ignorable missingness, is the most problematic form. It exists when missing values are neither MCAR nor MAR, which happens when missingness depends at least in part on unobserved variables (which is why observed variables fail to predict missingness, making the data not MAR). Under MNAR conditions, variables in the dataset are inadequate predictors of missingness because the variable with missing cases is insufficiently correlated with other variables in the dataset, undermining the effectiveness of the usual imputation methods, including multiple imputation (MI).
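
To make the problem concrete, here is a small simulation, assuming NumPy and pandas. All variables and parameters are hypothetical: non-response on income is made more likely as income itself rises, while the only observed covariate (age) is generated independently of both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 5000
age = rng.normal(40, 10, n)       # observed, unrelated to income in this setup
income = rng.normal(50, 15, n)    # true values, partly unobserved

# Probability of non-response rises with income itself (MNAR by construction).
p_missing = 1.0 / (1.0 + np.exp(-(income - 60.0) / 5.0))
is_missing = rng.random(n) < p_missing

df = pd.DataFrame({"age": age, "income": np.where(is_missing, np.nan, income)})

true_mean = income.mean()
observed_mean = df["income"].mean()   # pandas skips NaN by default
age_vs_missing = np.corrcoef(age, is_missing.astype(float))[0, 1]

print(f"true mean income:      {true_mean:.2f}")
print(f"observed mean income:  {observed_mean:.2f}")
print(f"corr(age, missing):    {age_vs_missing:.3f}")
```

The observed mean understates the true mean, and the observed covariate carries essentially no information about which values are missing, so an imputation model built on it cannot correct the bias.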

One approach to non-ignorable missingness is to impute values based on data otherwise external to the research design, for instance estimating race from Census block data associated with the respondent's address. However, although such missingness cannot be ignored, there is no well-accepted method of dealing with it.

Strategies for treating MCAR, MAR and MNAR missing values

MCAR

  • listwise deletion (if the % of missing values is not large)
  • imputation

MAR

  • imputation
  • imputation + dummy variable (create a dummy variable containing a mask of values filled vs not filled)
  • feature deletion (remove variable from the data)
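
The "imputation + dummy variable" strategy might look like the sketch below, assuming pandas. The helper `impute_with_indicator` and the `_was_missing` suffix are hypothetical names, and the median is only a stand-in for whatever imputation model is actually used.

```python
import numpy as np
import pandas as pd


def impute_with_indicator(df, column):
    """Fill NaNs in `column` and add a 0/1 mask of which values were filled."""
    out = df.copy()
    # Record the mask before filling, otherwise the information is lost.
    out[f"{column}_was_missing"] = out[column].isna().astype(int)
    # Median fill is a placeholder for the chosen imputation method.
    out[column] = out[column].fillna(out[column].median())
    return out


df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan]})
result = impute_with_indicator(df, "age")
print(result)
```

Keeping the mask as its own column lets a downstream model learn from the fact that a value was missing, not only from the filled-in value.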

MNAR

  • feature engineering + imputation (find the cause of missing values and then use imputation - special cases)
  • fill value with outlier / new category + dummy variable (fill the missing values with a distinct value, such as an out-of-range number or a new category, and then create a dummy variable containing a mask of values filled vs not filled)
  • feature deletion (remove variable from the data)
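
The "new category + dummy variable" option can be sketched as follows, assuming pandas. The helper `fill_with_sentinel`, the column names, and the fill values ("Missing", -1) are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_sentinel(df, column, fill_value):
    """Replace NaNs with a distinct value and add a 0/1 mask column."""
    out = df.copy()
    out[f"{column}_was_missing"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out


df = pd.DataFrame({
    "employment": ["employed", None, "student", None],
    "hours": [40.0, np.nan, 20.0, 35.0],
})

# Categorical variable: missing becomes its own category.
result = fill_with_sentinel(df, "employment", "Missing")
# Numeric variable: missing becomes a value outside the valid range.
result = fill_with_sentinel(result, "hours", -1.0)
print(result)
```

An out-of-range numeric fill suits tree-based models better than linear ones, since a linear model would treat -1 as a real magnitude.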

When imputation should not be used

  • If data are MCAR, imputation may not be needed.
  • If missingness is due to unmeasured variables related to the dependent variable, data are MNAR and should not be imputed.
  • Imputation assumes data are MAR and should not be used with sparse data. Sparse data occur when missingness is non-random, such as a shopping cart survey of items purchased (coded 1) or not purchased (coded 0), because the null response (0) is non-random, due to unmeasured factors possibly not even known to the shopper.
  • Imputation should not be used to impute all the data for a subject.
  • Imputation should not be used for a missing value for a given observation if that observation is also missing values on predictively critical variables in the imputation model. While this is difficult to check for each value to be imputed, a table of missing value patterns will show how many cases missing on a given variable also have missing values on other variables. In some cases this may lead a researcher to reject imputation.
  • Imputation should not be used if over 50% of data are missing (some authors use lower cutoffs, such as 20%).
  • Imputation is used with cross-sectional or historical data and is not appropriate for imputing future data in a time series.
  • Use of imputation is suspect if it generates values outside valid ranges.
  • Imputation based on a single pass is not acceptable due to the probabilistic nature of imputation. While as few as 3 – 5 imputations may suffice for reliability, today 20 – 100 or more imputations are usual.
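
Several of the checks in this list can be run before any imputation is attempted. The sketch below assumes pandas; the helpers `missingness_report` and `missing_patterns` are hypothetical names, the data are made up, and the 50% cutoff is the one quoted above.

```python
import numpy as np
import pandas as pd


def missingness_report(df, max_fraction=0.5):
    """Fraction of missing values per column, flagged against a cutoff."""
    fraction = df.isna().mean()
    return pd.DataFrame({
        "fraction_missing": fraction,
        "over_cutoff": fraction > max_fraction,
    })


def missing_patterns(df):
    """Count how often each combination of missing columns occurs."""
    return df.isna().value_counts()


df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31, 52],
    "income": [np.nan, np.nan, 55, np.nan, np.nan, 61],
    "city": ["A", None, "B", None, "A", "C"],
})

report = missingness_report(df)
patterns = missing_patterns(df)
# Subjects with no observed data at all should not be imputed.
all_missing_rows = df.index[df.isna().all(axis=1)].tolist()

print(report)
print(patterns)
print("rows with every value missing:", all_missing_rows)
```

The pattern table shows which variables tend to be missing together, which is the information needed to judge whether the predictors in an imputation model are themselves available for the cases being imputed.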
