The following is a list of some of the most commonly used missing data techniques, with brief descriptions and a
summary of their advantages and disadvantages.
Listwise or Casewise Deletion: Omission of all records that contain missing data for any one
variable. Considered a safe and conservative approach, it is the default solution in many statistical packages.
Its advantages are its simplicity and minimal computational time. When data is not missing at random,
however, the results will be biased. In addition, it can reduce the variance and inflate the R² value,
as well as reduce statistical power, since standard errors and t-tests are a function of sample size. Unless all
missing-data mechanisms have been included in the model or there is strong certainty concerning the randomness of the missing data, this method
is generally not recommended.
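
As a minimal sketch of listwise deletion, the snippet below drops every record that has at least one missing value. The records, the field names, and the use of `None` as the missing-value marker are illustrative assumptions, not taken from any particular package.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 29, "income": None, "score": 6.4},
    {"age": 41, "income": 61000, "score": None},
    {"age": 38, "income": 58000, "score": 8.0},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = listwise_delete(records)
print(len(records), "records before,", len(complete), "after")
```

Half of the toy sample is lost even though each variable is missing only once, which illustrates the loss of power described above.
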
Pairwise Deletion: Use of a correlation matrix in which the correlation between each pair of variables is
calculated from all cases that have valid data for those two variables. Again, this technique works well under MAR or
MCAR conditions and is a more powerful solution than listwise deletion. Like listwise deletion, however, when data is not
missing at random the results may be strongly biased. In addition, the resulting correlation matrix may not be
suitable for further analysis such as multiple regression or cluster analysis, which assume a true correlation
matrix whose coefficients are mutually consistent and transitive.
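
The idea can be sketched as follows: each correlation is computed from whichever cases have both variables observed, so different coefficients may rest on different subsets of the sample. The function name `pairwise_corr` and the toy series are hypothetical.

```python
import math


def pairwise_corr(x, y):
    """Pearson correlation over the cases where both x and y are observed.

    Returns the coefficient and the number of cases it is based on.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy), n


# Hypothetical variables; None marks a missing value.
x = [1.0, 2.0, 3.0, 4.0, None, 6.0]
y = [2.1, 3.9, None, 8.2, 9.8, 12.1]
z = [None, 1.0, 0.5, None, 2.5, 3.0]

for name, (u, v) in {"x,y": (x, y), "x,z": (x, z), "y,z": (y, z)}.items():
    r, n = pairwise_corr(u, v)
    print(f"r({name}) = {r:.3f} from n = {n} cases")
```

Because each coefficient comes from a different set of cases, nothing forces the assembled matrix to behave like a correlation matrix computed from a single sample, which is the concern raised above.
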
Mean Substitution: Replacement of all missing instances of a given variable with the mean value for
that variable. Mean substitution is a good solution when data is both missing at random and approximately normally
distributed. It also has an advantage over pairwise deletion in that it produces a more internally
consistent correlation matrix. Like listwise deletion, however, as the proportion of missing data increases
it can reduce the variance and inflate the R² value. Moreover, it yields an artificially inflated sample size and biased
estimates, even if the missing data occurs at random.
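
A small sketch makes the variance reduction concrete: every filled-in value sits exactly at the mean, so it adds to the sample size without adding any spread. The helper name `mean_substitute` and the values are illustrative assumptions.

```python
import statistics


def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


# Hypothetical variable with two missing entries.
values = [4.0, None, 6.0, 8.0, None, 10.0]
observed = [v for v in values if v is not None]
filled = mean_substitute(values)

print("sample variance, observed cases only:", statistics.variance(observed))
print("sample variance, after substitution: ", statistics.variance(filled))
```

The sum of squared deviations is unchanged while the divisor grows, so the variance after substitution is always smaller whenever any value was missing.
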
Imputation by Regression: Prediction of the missing data from a regression equation that uses
all other relevant variables as predictors. The advantage of this method is that it preserves the variance and
covariance of the variables with missing data. The danger is that if standard errors are ignored when predicting the
missing values, it may inflate the predictive power of the model, since the missing values of the dependent variables
are treated as if they had been perfectly predicted.
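
The sketch below illustrates the point with a single predictor. It fits a least-squares line on the complete cases and fills in missing `y` values either deterministically (exactly on the line) or stochastically, by adding a random draw scaled by the residual standard deviation. The stochastic variant is one common way to avoid treating imputed values as perfectly predicted; it is an assumption of this example rather than something prescribed above, and all names and data are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def regression_impute(xs, ys, stochastic=False, rng=None):
    """Fill missing ys (None) from a regression of y on x fitted to complete cases."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox, oy = zip(*pairs)
    intercept, slope = fit_line(ox, oy)
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    # Residual standard deviation with n - 2 degrees of freedom (at least 1).
    resid_sd = (sum(e * e for e in residuals) / max(len(pairs) - 2, 1)) ** 0.5
    rng = rng or random.Random(0)
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x
            if stochastic:
                y += rng.gauss(0.0, resid_sd)  # restore some residual scatter
        filled.append(y)
    return filled


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, None, 8.2, None, 11.9]
deterministic = regression_impute(xs, ys)
stochastic = regression_impute(xs, ys, stochastic=True)
print(deterministic)
print(stochastic)
```
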
Hot Deck Imputation: Replacement of missing values with values randomly selected from a pool
of similar complete cases. Because the replacement values are randomly selected, hot deck imputation introduces the
variations seen in the pool of complete cases, resulting in less tendency toward the mean. The two main areas of
concern are (1) selecting valid characteristic sets for identifying the potential pools containing values with
reasonable variance, and (2) ensuring that characteristic sets will allow for large enough donor pools with reasonable
variance. The technique has been used extensively by government agencies and has been widely accepted as providing
accurate samples of the study population.
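
A minimal sketch of the donor-pool idea follows. Cases are grouped by a set of matching characteristics, and each missing value is replaced by a randomly chosen observed value from the same group. The function name, the `match_on` parameter, and the toy rows are hypothetical, and real hot deck procedures use more elaborate matching rules.

```python
import random


def hot_deck_impute(rows, target, match_on, rng=None):
    """Fill missing `target` values from donors that share the `match_on` fields."""
    rng = rng or random.Random(0)
    # Build one donor pool per combination of matching characteristics.
    pools = {}
    for row in rows:
        if row[target] is not None:
            key = tuple(row[k] for k in match_on)
            pools.setdefault(key, []).append(row[target])
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[target] is None:
            donors = pools.get(tuple(row[k] for k in match_on))
            if donors:  # value stays missing if no donor is available
                new_row[target] = rng.choice(donors)
        filled.append(new_row)
    return filled


rows = [
    {"region": "north", "income": 50},
    {"region": "north", "income": 54},
    {"region": "north", "income": None},
    {"region": "south", "income": 31},
    {"region": "south", "income": None},
]
for row in hot_deck_impute(rows, "income", ["region"]):
    print(row)
```

The "south" pool here contains a single donor, which shows concern (2) above: a narrow characteristic set can leave pools too small to supply any variation.
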
Expectation Maximization (EM) Algorithm: A two-step iterative approach that estimates the
parameters of a model starting from an initial guess. Each iteration consists of two steps: (1) an expectation step
that finds the distribution for the missing data based on the known values for the observed variables and the current
estimate of the parameters; and (2) a maximization step that re-estimates the parameters with the expected values
substituted for the missing data. An elegant and powerful approach, it nevertheless requires specialized programming that can be quite time intensive.
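
The two steps can be sketched for a deliberately simple case: two variables assumed to be bivariate normal, with `x` fully observed and `y` missing for some cases. These assumptions, the function name `em_bivariate`, and the data are illustrative only; general-purpose EM implementations handle arbitrary missing-data patterns.

```python
def em_bivariate(xs, ys, iterations=100):
    """EM estimates of means, variances and covariance when some ys are None."""
    n = len(xs)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # x is complete, so its mean and variance never need updating.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Initial guess for the y parameters from the complete cases.
    mu_y = sum(y for _, y in observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for _, y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        # E-step: expected y and y^2 for missing cases, given x and current parameters.
        beta = s_xy / s_xx
        cond_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + beta * (x - mu_x)
                eyy = ey * ey + cond_var
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: re-estimate the parameters from the expected sufficient statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_yy, s_xy


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, None, 8.0, None, 12.0]
mu_x, mu_y, s_xx, s_yy, s_xy = em_bivariate(xs, ys)
print("estimated mean of y:", round(mu_y, 4))
```

Note that the expectation step carries the conditional variance along with the expected value, so the re-estimated variance is not shrunk the way it would be if the expected values were simply plugged in as data.
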
Raw Maximum Likelihood or Full Information Maximum Likelihood (FIML) Method: Typically represented
as a covariance matrix of the variables and a vector of means, this method uses all available information about the
observed data, including the means and variances based on the available data points for each variable. The advantage
over the EM method is that it allows for the direct computation of appropriate standard errors and test statistics.
The disadvantage is that it is difficult to include additional variables that would improve the accuracy of the parameter
estimates but are not used in the final statistical model as predictors or outcomes. It also
requires specialized programming and can be very time intensive.
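
To illustrate how every available data point is used, the sketch below evaluates a casewise log-likelihood for two variables assumed to be bivariate normal: a complete case contributes the joint density, while a case with one value missing contributes the marginal density of the value it does have. The function names and data are hypothetical, and the optimizer that would search for the maximizing means and covariances is omitted.

```python
import math


def case_loglik(x, y, mu, cov):
    """Log-likelihood contribution of one case, using only its observed values."""
    mx, my = mu
    (sxx, sxy), (_, syy) = cov
    if x is not None and y is not None:
        # Joint bivariate normal density for a complete case.
        det = sxx * syy - sxy ** 2
        dx, dy = x - mx, y - my
        quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)
    if x is not None:
        # Only x observed: marginal normal density of x.
        return -0.5 * (math.log(2 * math.pi * sxx) + (x - mx) ** 2 / sxx)
    if y is not None:
        # Only y observed: marginal normal density of y.
        return -0.5 * (math.log(2 * math.pi * syy) + (y - my) ** 2 / syy)
    return 0.0  # a case with nothing observed carries no information


def fiml_loglik(data, mu, cov):
    """Sum of the casewise contributions; this is the quantity to be maximized."""
    return sum(case_loglik(x, y, mu, cov) for x, y in data)


data = [(1.0, 2.1), (2.0, None), (None, 5.8), (4.0, 8.3), (5.0, 9.7)]
near = fiml_loglik(data, (3.0, 6.5), ((2.5, 4.5), (4.5, 9.0)))
far = fiml_loglik(data, (0.0, 0.0), ((2.5, 4.5), (4.5, 9.0)))
print("log-likelihood near the data:", near)
print("log-likelihood far from it:  ", far)
```
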
Multiple Imputation: Much like the EM algorithm, multiple imputation generates a maximum
likelihood-based covariance matrix and vector of means, but it also introduces statistical uncertainty into the model and
uses that uncertainty to replicate the natural variability found among the complete case data. It then imputes actual
data values to fill in the incomplete data points in the data matrix, similar to the hot deck method. The difference
is that it requires construction of five to ten data sets with imputed values, each of which is analyzed individually.
The results are then combined into one summary set of findings. Once again, although very powerful, it can be a very
time intensive method.
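
The combining step is commonly carried out with Rubin's rules, which are not spelled out above and are assumed here for illustration. The pooled estimate is the average of the per-data-set estimates, and its variance adds the average within-imputation variance to the between-imputation variance, the latter inflated by a factor of (1 + 1/m) for m imputed data sets. The estimates and standard errors below are made-up values.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)     # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Hypothetical regression coefficient and squared standard error
# from each of five imputed data sets.
estimates = [0.52, 0.47, 0.55, 0.50, 0.49]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]
coef, se = pool_rubin(estimates, variances)
print(f"pooled estimate = {coef:.3f}, standard error = {se:.3f}")
```

The between-imputation term is what carries the uncertainty about the missing values into the final standard error; analyzing a single imputed data set would omit it.
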