Data cleaning guidelines for multivariate data exploration (using Python's scipy stack).

Missing data

  • A Simple Example of a Missing Data Analysis
  • Understanding the Reasons Leading to Missing Data
  • Ignorable Missing Data
  • Other Types of Missing Data Processes
  • Examining the Patterns of Missing Data
  • Diagnosing the Randomness of the Missing Data Process

Rules of Thumb

(How Much Missing Data is Too Much?)

  • Missing data under 10 percent for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.)
  • The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for missing data
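
A minimal pandas sketch for checking these thresholds (the DataFrame and its columns are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"q1": [1.0, np.nan, 3.0, 4.0],
                   "q2": [np.nan, 2.0, np.nan, 4.0]})

pct_missing_per_variable = df.isnull().mean() * 100    # % missing per variable
pct_missing_per_case = df.isnull().mean(axis=1) * 100  # % missing per case
n_complete_cases = len(df.dropna())                    # cases with no missing data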

(Deletions Based on Missing Data)

  • Variables with as little as 15 percent missing data are candidates for deletion, but higher levels of missing data (20% to 30%) can often be remedied
  • Be sure the overall decrease in missing data is large enough to justify deleting an individual variable or case
  • Cases with missing data for dependent variable(s) typically are deleted to avoid any artificial increase in relationships with independent variables
  • When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable
  • Always consider performing the analysis both with and without the deleted cases or variables to identify any marked differences

Imputation of missing data

Rules of Thumb

(Imputation of Missing Data)

  • Under 10%: Any of the imputation methods can be applied when missing data are this low, although the complete case method has been shown to be the least preferred
  • 10% to 20%: The increased presence of missing data makes the all-available, hot deck case substitution, and regression methods most preferred for MCAR data, whereas model-based methods are necessary for MAR missing data processes
  • Over 20%: If it is deemed necessary to impute missing data when the level is over 20 percent, the preferred methods are:
    • The regression method for MCAR situations
    • Model-based methods when MAR missing data occur
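
As a minimal illustration, complete-case deletion and mean substitution can be sketched with pandas (the DataFrame below is hypothetical); regression and model-based imputation require dedicated tooling such as scikit-learn's IterativeImputer:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0],
                   "income": [50.0, 62.0, np.nan, 58.0]})

complete_case = df.dropna()          # complete case approach (listwise deletion)
mean_imputed = df.fillna(df.mean())  # mean substitution, a simple imputation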

Outliers

  • Detecting Outliers:
    • Univariate Detection
      • Find observations that fall in the extremes of the distribution and that are truly distinctive
        • A typical approach is to convert the values to z scores, with mean 0 and std 1
    • Bivariate Detection
      • scatterplots (find (groups of) observations that deviate from the main cluster)
      • Limit the number of scatterplots to dependent vs independent variables
    • Multivariate Detection
      • Mahalanobis D^2 measure
import numpy as np


def MahalanobisDist(x, y):
    """
    Computes the Mahalanobis distance (D) of each (x, y) observation
    pair from the centroid of the two variables.
    """
    covariance_xy = np.cov(x, y)                     # 2x2 covariance matrix
    inv_covariance_xy = np.linalg.inv(covariance_xy)
    xy_mean = np.mean(x), np.mean(y)
    x_diff = np.asarray(x) - xy_mean[0]              # deviations from the mean
    y_diff = np.asarray(y) - xy_mean[1]
    diff_xy = np.transpose([x_diff, y_diff])         # N x 2 matrix of deviations

    md = []
    for diff in diff_xy:
        # D = sqrt(d' * S^-1 * d) for each deviation vector d
        md.append(np.sqrt(np.dot(np.dot(diff, inv_covariance_xy), diff)))
    return md


def MD_removeOutliers(x, y, threshold=1.5):
    """
    Flags observations whose Mahalanobis distance exceeds `threshold`
    times the mean distance, and removes them from the pair of variables.
    """
    MD = MahalanobisDist(x, y)
    range_threshold = np.mean(MD) * threshold
    nx, ny, outliers = [], [], []
    for i in range(len(MD)):
        if MD[i] <= range_threshold:
            nx.append(x[i])        # keep the pair
            ny.append(y[i])
        else:
            outliers.append(i)     # position of removed pair
    return (np.array(nx), np.array(ny), np.array(outliers))
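
A hypothetical usage of the two helpers above, on synthetic data:

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
nx, ny, outlier_idx = MD_removeOutliers(x, y, threshold=1.5)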
  • Outlier Description and Profiling
  • Retention or Deletion of the Outlier
    • Exclude only observations that are truly non-representative of the population (can use a combination of univariate + bivariate + multivariate detections)

Rules of Thumb

(Outlier Detection)

  • Univariate methods: Examine all metric variables to identify unique or extreme observations
    • For small samples (80 or fewer observations), outliers typically are defined as cases with standard scores of 2.5 or greater
    • For larger sample sizes, increase the threshold value of standard scores up to 4
    • If standard scores are not used, identify cases falling outside the ranges of 2.5 versus 4 standard deviations, depending on the sample size
  • Bivariate methods: Focus their use on specific variable relationships, such as the independent versus dependent variables
    • Use scatterplots with confidence intervals at a specified alpha level
  • Multivariate methods: Best suited for examining a complete variate, such as the independent variables in regression or the variables in factor analysis
    • Threshold levels for the D^2/df measure should be conservative (.005 or .001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples
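
A minimal sketch of the univariate standard-score check (the helper function itself is hypothetical; the thresholds follow the rules of thumb above):

import numpy as np

def zscore_outliers(x, threshold=2.5):
    """Return indices of cases whose |z score| exceeds the threshold.

    Use ~2.5 for small samples (80 or fewer observations) and up to 4
    for larger samples.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)  # standardize to mean 0, std 1
    return np.where(np.abs(z) > threshold)[0]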

Testing the Assumptions of Multivariate Analysis

Assessing Individual Variables Versus the Variate

Normality

(Statistical models and tests depend on the normal distribution assumption)

  • Graphical Analysis of Normality:
    • histograms of distribution
    • normal probability plot (stats.probplot())
  • Statistical Tests of Normality
    • kurtosis (peakedness or flatness of distribution)
    • skewness (distribution shifted to the left or right side)
  • Remedies for Nonnormality:
    • Data transformations
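
A minimal sketch of these checks with scipy (the sample here is synthetic and deliberately skewed):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(0).lognormal(size=200)  # positively skewed sample

stats.probplot(x, plot=plt)    # normal probability (Q-Q) plot
plt.show()

print(stats.skewtest(x))       # is the skewness consistent with a normal?
print(stats.kurtosistest(x))   # is the kurtosis consistent with a normal?
print(stats.normaltest(x))     # omnibus test combining both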

Homoscedasticity

(often due to nonnormality of one of the variables)

  • Graphical Tests of Equal Variance Dispersion
    • scatter plots (check for cone or diamond shapes)
    • box plots (check for the same degree of variation via box length and whiskers)
  • Statistical Tests for Homoscedasticity (e.g., the Levene test; see the sketch after this list)
  • Remedies for Heteroscedasticity
    • Data transformations
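
A minimal sketch using scipy's Levene test (the two groups here are synthetic, purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.0, scale=3.0, size=50)  # deliberately wider spread

# Levene's test: H0 = equal variances; center='median' makes it robust to nonnormality
stat, p = stats.levene(group_a, group_b, center='median')
print(stat, p)  # a small p-value is evidence of heteroscedasticity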

Linearity

  • Graphical Tests using scatter plots
  • Examine residuals of a simple regression analysis
  • Remedies:
    • Transform one or both variables to achieve linearity
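
A sketch of the residual check with scipy (synthetic data, deliberately nonlinear):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = x ** 2 + rng.normal(scale=5.0, size=200)

# Fit a simple linear regression and inspect the residuals
slope, intercept, r, p, se = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

plt.scatter(x, residuals)  # a curved band here signals nonlinearity
plt.axhline(0)
plt.show()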

Absence of Correlated Errors

  • Identifying Correlated Errors
    • Find the cause affecting the relationship between groups
      • Group values for a variable and examine for any patterns (differences between groups)
  • Remedies for Correlated Errors
    • Include the omitted causal factor into the multivariate analysis (add a variable that represents the omitted factor)

Rules of Thumb

(Testing Statistical Assumptions)

  • Normality can have serious effects in small samples (fewer than 50 cases), but the impact effectively diminishes when sample sizes reach 200 cases or more
  • Most cases of heteroscedasticity are a result of nonnormality in one or more variables; thus, remedying normality may not be needed due to sample size, but may be needed to equalize the variance
  • Nonlinear relationships can be well defined, but seriously understated unless the data are transformed to a linear pattern or explicit model components are used to represent the nonlinear portion of the relationship
  • Correlated errors arise from a process that must be treated much like missing data; that is, the researcher must first define the causes among variables, either internal or external to the dataset; if they are not found and remedied, serious biases can occur in the results, many times unknown to the researcher

Transforming data

(Transforming to achieve normality, homoscedasticity and linearity)

  • box cox (better overall, used for any type of skewed distributions)
  • log (positively skewed distributions)
  • inverse (1/N) (cone opens to the right)
  • square root (positively skewed distributions)
  • squared or cube (negatively skewed distributions)
  • arcsin (for proportions)
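
A minimal sketch of these transformations with numpy/scipy (the sample is synthetic; Box-Cox requires strictly positive values):

import numpy as np
from scipy import stats

x = np.random.default_rng(0).lognormal(size=500)  # positively skewed, all positive

x_boxcox, lmbda = stats.boxcox(x)   # picks the best power transform automatically
x_log = np.log(x)                   # log: positive skew
x_sqrt = np.sqrt(x)                 # square root: positive skew (milder)
x_inv = 1.0 / x                     # inverse (1/N)
x_sq = x ** 2                       # square: negative skew

p = np.clip(x / x.max(), 0.0, 1.0)  # arcsine applies to proportions in [0, 1]
p_arcsin = np.arcsin(np.sqrt(p))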

Rules of Thumb

(Transforming Data)

  • To judge the potential impact of a transformation, calculate the ratio of the variable's mean to its standard deviation:
    • Noticeable effects should occur when the ratio is less than 4
    • When the transformation can be performed on either of two variables, select the variable with the smallest ratio
  • Transformations should be applied to the independent variables except in the case of heteroscedasticity
  • Heteroscedasticity can be remedied only by transformation of the dependent variable in a dependence relationship; if a heteroscedastic relationship is also nonlinear, the dependent variable, and perhaps the independent variables, must be transformed
  • Transformations may change the interpretation of the variables; for example, transforming variables by taking their logarithm translates the relationship into a measure of proportional change (elasticity); always be sure to explore thoroughly the possible interpretations of the transformed variables
  • Use variables in their original format (untransformed) when profiling or interpreting results