Data cleaning guidelines for multivariate data exploration (using Python's scipy stack).

Missing data

  • A Simple Example of a Missing Data Analysis
  • Understanding the Reasons Leading to Missing Data
  • Ignorable Missing Data
  • Other Types of Missing Data Processes
  • Examining the Patterns of Missing Data
  • Diagnosing the Randomness of the Missing Data Process

Rules of Thumb

(How Much Missing Data is Too Much?)

  • Missing data under 10 percent for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.)
  • The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for missing data
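
A minimal pandas sketch for checking these thresholds (the DataFrame and its columns are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"q1": [1.0, np.nan, 3.0, 4.0],
                   "q2": [np.nan, 2.0, np.nan, 4.0]})

pct_missing_per_variable = df.isnull().mean() * 100    # % missing per variable
pct_missing_per_case = df.isnull().mean(axis=1) * 100  # % missing per case
n_complete_cases = len(df.dropna())                    # cases with no missing data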

(Deletions Based on Missing Data)

  • Variables with as little as 15 percent missing data are candidates for deletion, but higher levels of missing data (20% to 30%) can often be remedied
  • Be sure the overall decrease in missing data is large enough to justify deleting an individual variable or case
  • Cases with missing data for dependent variable(s) typically are deleted to avoid any artificial increase in relationships with independent variables
  • When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable
  • Always consider performing the analysis both with and without the deleted cases or variables to identify any marked differences

Imputation of missing data

Rules of Thumb

(Imputation of Missing Data)

  • Under 10%: Any of the imputation methods can be applied when missing data are this low, although the complete case method has been shown to be the least preferred
  • 10% to 20%: The increased presence of missing data makes the all-available, hot deck case substitution, and regression methods most preferred for MCAR data, whereas model-based methods are necessary for MAR missing data processes
  • Over 20%: If it is deemed necessary to impute missing data when the level is over 20 percent, the preferred methods are:
    • The regression method for MCAR situations
    • Model-based methods when MAR missing data occur
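
As a minimal illustration, complete-case deletion and mean substitution can be sketched with pandas (the DataFrame below is hypothetical); regression and model-based imputation require dedicated tooling such as scikit-learn's IterativeImputer:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0],
                   "income": [50.0, 62.0, np.nan, 58.0]})

complete_case = df.dropna()          # complete case approach (listwise deletion)
mean_imputed = df.fillna(df.mean())  # mean substitution, a simple imputation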

Outliers

  • Detecting Outliers:
    • Univariate Detection
      • Find observations that fall in the extremes of the distribution and that are truly distinctive
        • A typical approach is to convert the values to z scores, with mean 0 and std 1
    • Bivariate Detection
      • scatterplots (find (groups of) observations that deviate from the main cluster)
      • Limit the number of scatterplots to dependent vs independent variables
    • Multivariate Detection
      • Mahalanobis D^2 measure
import numpy as np


def MahalanobisDist(x, y):
    """
    Computes the Mahalanobis distance (D) of each (x, y) observation
    pair from the centroid of the two variables.
    """
    covariance_xy = np.cov(x, y)                     # 2x2 covariance matrix
    inv_covariance_xy = np.linalg.inv(covariance_xy)
    xy_mean = np.mean(x), np.mean(y)
    x_diff = np.asarray(x) - xy_mean[0]              # deviations from the mean
    y_diff = np.asarray(y) - xy_mean[1]
    diff_xy = np.transpose([x_diff, y_diff])         # N x 2 matrix of deviations

    md = []
    for diff in diff_xy:
        # D = sqrt(d' * S^-1 * d) for each deviation vector d
        md.append(np.sqrt(np.dot(np.dot(diff, inv_covariance_xy), diff)))
    return md


def MD_removeOutliers(x, y, threshold=1.5):
    """
    Flags observations whose Mahalanobis distance exceeds `threshold`
    times the mean distance, and removes them from the pair of variables.
    """
    MD = MahalanobisDist(x, y)
    range_threshold = np.mean(MD) * threshold
    nx, ny, outliers = [], [], []
    for i in range(len(MD)):
        if MD[i] <= range_threshold:
            nx.append(x[i])        # keep the pair
            ny.append(y[i])
        else:
            outliers.append(i)     # position of removed pair
    return (np.array(nx), np.array(ny), np.array(outliers))
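
A hypothetical usage of the two helpers above, on synthetic data:

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
nx, ny, outlier_idx = MD_removeOutliers(x, y, threshold=1.5)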
  • Outlier Description and Profiling
  • Retention or Deletion of the Outlier
    • Exclude only observations that are truly non-representative of the population (can use a combination of univariate + bivariate + multivariate detections)

Rules of Thumb

(Outlier Detection)

  • Univariate methods: Examine all metric variables to identify unique or extreme observations
    • For small samples (80 or fewer observations), outliers typically are defined as cases with standard scores of 2.5 or greater
    • For larger sample sizes, increase the threshold value of standard scores up to 4
    • If standard scores are not used, identify cases falling outside the ranges of 2.5 versus 4 standard deviations, depending on the sample size
  • Bivariate methods: Focus their use on specific variable relationships, such as the independent versus dependent variables
    • Use scatterplots with confidence intervals at a specified alpha level
  • Multivariate methods: Best suited for examining a complete variate, such as the independent variables in regression or the variables in factor analysis
    • Threshold levels for the D^2/df measure should be conservative (.005 or .001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples
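
A minimal sketch of the univariate standard-score check (the helper function itself is hypothetical; the thresholds follow the rules of thumb above):

import numpy as np

def zscore_outliers(x, threshold=2.5):
    """Return indices of cases whose |z score| exceeds the threshold.

    Use ~2.5 for small samples (80 or fewer observations) and up to 4
    for larger samples.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)  # standardize to mean 0, std 1
    return np.where(np.abs(z) > threshold)[0]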

Testing the Assumptions of Multivariate Analysis

Assessing Individual Variables Versus the Variate

Normality

(Statistical models and tests depend on the normal distribution assumption)

  • Graphical Analysis of Normality:
    • histograms of distribution
    • normal probability plot (stats.probplot())
  • Statistical Tests of Normality
    • kurtosis (peakedness or flatness of distribution)
    • skewness (distribution shifted to the left or right side)
  • Remedies for Nonnormality:
    • Data transformations
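
A minimal sketch of these checks with scipy (the sample here is synthetic and deliberately skewed):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(0).lognormal(size=200)  # positively skewed sample

stats.probplot(x, plot=plt)    # normal probability (Q-Q) plot
plt.show()

print(stats.skewtest(x))       # is the skewness consistent with a normal?
print(stats.kurtosistest(x))   # is the kurtosis consistent with a normal?
print(stats.normaltest(x))     # omnibus test combining both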

Homoscedasticity

(often due to nonnormality of one of the variables)

  • Graphical Tests of Equal Variance Dispersion
    • scatter plots (check for cone or diamond shapes)
    • box plots (check for the same degree of variation via box length and whiskers)
  • Statistical Tests for Homoscedasticity (e.g., the Levene test; see the sketch after this list)
  • Remedies for Heteroscedasticity
    • Data transformations
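
A minimal sketch using scipy's Levene test (the two groups here are synthetic, purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.0, scale=3.0, size=50)  # deliberately wider spread

# Levene's test: H0 = equal variances; center='median' makes it robust to nonnormality
stat, p = stats.levene(group_a, group_b, center='median')
print(stat, p)  # a small p-value is evidence of heteroscedasticity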

Linearity

  • Graphical Tests using scatter plots
  • Examine residuals of a simple regression analysis
  • Remedies:
    • Transform one or both variables to achieve linearity
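
A sketch of the residual check with scipy (synthetic data, deliberately nonlinear):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = x ** 2 + rng.normal(scale=5.0, size=200)

# Fit a simple linear regression and inspect the residuals
slope, intercept, r, p, se = stats.linregress(x, y)
residuals = y - (intercept + slope * x)

plt.scatter(x, residuals)  # a curved band here signals nonlinearity
plt.axhline(0)
plt.show()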

Absence of Correlated Errors

  • Identifying Correlated Errors
    • Find the cause affecting the relationship between groups
      • Group values for a variable and examine for any patterns (differences between groups)
  • Remedies for Correlated Errors
    • Include the omitted causal factor into the multivariate analysis (add a variable that represents the omitted factor)

Rules of Thumb

(Testing Statistical Assumptions)

  • Normality can have serious effects in small samples (fewer than 50 cases), but the impact effectively diminishes when sample sizes reach 200 cases or more
  • Most cases of heteroscedasticity are a result of nonnormality in one or more variables; thus, remedying normality may not be needed due to sample size, but may be needed to equalize the variance
  • Nonlinear relationships can be well defined, but seriously understated unless the data are transformed to a linear pattern or explicit model components are used to represent the nonlinear portion of the relationship
  • Correlated errors arise from a process that must be treated much like missing data; that is, the researcher must first define the causes among variables, either internal or external to the dataset; if they are not found and remedied, serious biases can occur in the results, many times unknown to the researcher

Transforming data

(Transforming to achieve normality, homoscedasticity and linearity)

  • box cox (better overall, used for any type of skewed distributions)
  • log (positively skewed distributions)
  • inverse (1/N) (cone opens to the right)
  • square root (positively skewed distributions)
  • squared or cube (negatively skewed distributions)
  • arcsin (for proportions)
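
A minimal sketch of these transformations with numpy/scipy (the sample is synthetic; Box-Cox requires strictly positive values):

import numpy as np
from scipy import stats

x = np.random.default_rng(0).lognormal(size=500)  # positively skewed, all positive

x_boxcox, lmbda = stats.boxcox(x)   # picks the best power transform automatically
x_log = np.log(x)                   # log: positive skew
x_sqrt = np.sqrt(x)                 # square root: positive skew (milder)
x_inv = 1.0 / x                     # inverse (1/N)
x_sq = x ** 2                       # square: negative skew

p = np.clip(x / x.max(), 0.0, 1.0)  # arcsine applies to proportions in [0, 1]
p_arcsin = np.arcsin(np.sqrt(p))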

Rules of Thumb

(Transforming Data)

  • To judge the potential impact of a transformation, calculate the ratio of the variable's mean to its standard deviation:
    • Noticeable effects should occur when the ratio is less than 4
    • When the transformation can be performed on either of two variables, select the variable with the smallest ratio
  • Transformations should be applied to the independent variables except in the case of heteroscedasticity
  • Heteroscedasticity can be remedied only by transformation of the dependent variable in a dependence relationship; if a heteroscedastic relationship is also nonlinear, the dependent variable, and perhaps the independent variables, must be transformed
  • Transformations may change the interpretation of the variables; for example, transforming variables by taking their logarithm translates the relationship into a measure of proportional change (elasticity); always be sure to explore thoroughly the possible interpretations of the transformed variables
  • Use variables in their original format (untransformed) when profiling or interpreting results