

Missing Data Methods

The following is a list of some of the most commonly used missing data techniques, with brief descriptions and a summary of their advantages and disadvantages.

Listwise or Casewise Deletion – Omission of all records that contain missing data for any one variable. Considered a safe and conservative approach, it is the default solution in many statistical packages. Its advantages are its simplicity and its minimal computational time. When data are not missing at random, however, the results will be biased. In addition, it can reduce the variance and inflate the R² value, and it reduces statistical power, since standard errors and t-tests are a function of sample size. Unless all missing data mechanisms have been accounted for or there is strong certainty that the data are missing at random, this method is generally not recommended.
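
As a minimal sketch of the idea, the Python snippet below drops every record that has at least one missing value. The records, the variable names, and the use of None as the missing-value marker are all hypothetical choices made for illustration.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "score": 7.1},
    {"age": 41, "income": None, "score": 6.4},
    {"age": None, "income": 48000, "score": 8.0},
    {"age": 29, "income": 61000, "score": 5.9},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete_cases = listwise_delete(records)
print(len(records), "records before,", len(complete_cases), "after deletion")
```

Half of the toy sample is lost even though only two individual values are missing, which is the loss of power described above.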

Pairwise Deletion – Use of a correlation matrix in which the correlation between each pair of variables is calculated from all cases that have valid data for those two variables. Again, this technique works well under missing at random (MAR) or missing completely at random (MCAR) conditions, and it is a more powerful solution than listwise deletion because it discards fewer observations. Like listwise deletion, however, when data are not missing at random the results may be strongly biased. In addition, because each coefficient is based on a different subset of cases, the resulting correlation matrix may not be suitable for further analysis such as multiple regression or cluster analysis, which assume a true correlation matrix whose coefficients are mutually consistent and transitive.
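
The following Python sketch illustrates pairwise deletion on made-up data. The helper pairwise_corr and the three toy variables are assumptions introduced for this example. It returns the number of cases used alongside each coefficient, to make visible that different entries of the matrix rest on different subsets of the sample.

```python
from math import sqrt


def pairwise_corr(x, y):
    """Pearson correlation over the cases where both values are observed.

    Returns the coefficient and the number of cases it is based on.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n


# Hypothetical toy data; None marks a missing value.
data = {
    "x": [1.0, 2.0, 3.0, 4.0, None, 6.0],
    "y": [2.0, None, 6.0, 8.0, 10.0, 12.0],
    "z": [5.0, 4.0, None, 2.0, 1.0, None],
}

names = list(data)
corr = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        corr[(a, b)] = pairwise_corr(data[a], data[b])

for pair, (r, n) in corr.items():
    print(pair, "r =", round(r, 3), "based on n =", n)
```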

Mean Substitution – Replacement of all missing instances of a given variable with the mean value for that variable. Mean substitution is a reasonable solution when data are both missing at random and approximately normally distributed. It also has an advantage over pairwise deletion in that it produces an internally consistent correlation matrix. Like listwise deletion, however, it can reduce the variance and inflate the R² value as the proportion of missing data increases. Moreover, it results in an artificially inflated sample size, which biases the results even if the data are missing at random.
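
A small Python sketch, using invented values, shows both the mechanics of mean substitution and the shrinkage in variance it causes. The variable names are illustrative only.

```python
from statistics import mean, variance

# Hypothetical toy variable; None marks a missing value.
values = [4.0, None, 6.0, 8.0, None, 10.0]

observed = [v for v in values if v is not None]
fill = mean(observed)

# Every missing entry receives the same value: the observed mean.
imputed = [fill if v is None else v for v in values]

print("mean of observed values:", fill)
print("sample variance, observed only:", variance(observed))
print("sample variance, after substitution:", variance(imputed))
```

The mean is unchanged, but the filled-in values add cases without adding any spread, so the variance falls while the apparent sample size rises from four to six.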

Imputation by Regression – Prediction of the missing data from a regression equation that uses all other relevant variables as predictors. The advantage of this method is that it preserves the variance and covariance of the variables with missing data. The danger is that if the error in the predictions is ignored when the missing values are filled in, the predictive power of the model may be inflated, since the missing values of the dependent variable are treated as if they had been perfectly predicted.
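
The sketch below, in Python with made-up data and a single predictor, contrasts two ways of filling in values from a fitted regression line. The first uses the predicted value alone. The second adds a random draw with the residual standard deviation, which is one common way to keep the imputed values from looking perfectly predicted. The helper fit_line, the data, and the normal-noise assumption are all illustrative choices rather than a prescribed procedure.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data; y is missing for two cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 11.8]

complete = [(a, b) for a, b in zip(x, y) if b is not None]
slope, intercept = fit_line([a for a, _ in complete], [b for _, b in complete])

# Residual standard deviation from the complete cases (n - 2 degrees of freedom).
residuals = [b - (intercept + slope * a) for a, b in complete]
resid_sd = sqrt(sum(r ** 2 for r in residuals) / (len(complete) - 2))

# Deterministic imputation: missing values sit exactly on the regression line.
deterministic = [intercept + slope * a if b is None else b for a, b in zip(x, y)]

# Stochastic imputation: add random noise so prediction error is not ignored.
rng = random.Random(0)  # fixed seed so the run is repeatable
stochastic = [
    intercept + slope * a + rng.gauss(0, resid_sd) if b is None else b
    for a, b in zip(x, y)
]

print("deterministic:", [round(v, 2) for v in deterministic])
print("stochastic:   ", [round(v, 2) for v in stochastic])
```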

Hot Deck Imputation – Replacement of missing values with randomly selected values drawn from a pool of similar complete cases. Because the replacement values are randomly selected, hot deck imputation introduces the variation seen in the pool of complete cases, resulting in less tendency toward the mean. The two main areas of concern are (1) selecting valid characteristic sets for identifying the potential donor pools, and (2) ensuring that those characteristic sets allow for donor pools that are large enough and have reasonable variance. The technique has been used extensively by government agencies and has been widely accepted as providing accurate samples of the study population.
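
A minimal Python sketch of the procedure follows. It assumes a single matching characteristic (a hypothetical "region" field) defines the donor pools. Real applications usually match on several characteristics, and the function name hot_deck and the toy records are invented for this example.

```python
import random

# Hypothetical toy records; None marks a missing income.
records = [
    {"region": "north", "income": 50},
    {"region": "north", "income": 54},
    {"region": "north", "income": None},
    {"region": "south", "income": 31},
    {"region": "south", "income": None},
    {"region": "south", "income": 35},
]


def hot_deck(rows, key, target, rng):
    """Fill missing `target` values with a random donor value from the same `key` group."""
    # Build one donor pool per group from the cases with an observed value.
    pools = {}
    for row in rows:
        if row[target] is not None:
            pools.setdefault(row[key], []).append(row[target])

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original records are left untouched
        if new_row[target] is None:
            donors = pools.get(new_row[key])
            if donors:  # if the pool is empty, the value stays missing
                new_row[target] = rng.choice(donors)
        filled.append(new_row)
    return filled


completed = hot_deck(records, "region", "income", random.Random(42))
for row in completed:
    print(row)
```

With only two donors per pool, as here, the imputed values can take only two possible values each, which illustrates why the size and variance of the donor pools matter.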

Expectation Maximization (EM) Algorithm – An iterative approach that estimates the parameters of a model starting from an initial guess. Each iteration consists of two steps: (1) an expectation step that finds the distribution of the missing data based on the known values of the observed variables and the current estimate of the parameters; and (2) a maximization step that re-estimates the parameters, with the missing data replaced by their expected values from the first step. The two steps are repeated until the estimates converge. An elegant and powerful approach, it also requires specialized programming that can be quite time intensive.
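
To make the iteration concrete, here is a minimal Python sketch for one simple case: two variables assumed to be jointly normal, where x is fully observed and y is missing for some cases. The model, the starting values, the fixed iteration count, and the function name em_bivariate are all assumptions made for illustration. The expectation step computes the expected values of y, y squared, and x times y for each case with missing y, and the maximization step re-estimates the mean, variance, and covariance from those expected quantities.

```python
def em_bivariate(x, y, iterations=200):
    """EM estimates for a bivariate normal model when some y values are missing.

    x must be fully observed; None marks a missing y.
    """
    n = len(x)
    # x is complete, so its mean and variance can be estimated directly.
    mu_x = sum(x) / n
    s_xx = sum((a - mu_x) ** 2 for a in x) / n

    # Initial guess: moments of the observed y values and zero covariance.
    y_obs = [b for b in y if b is not None]
    mu_y = sum(y_obs) / len(y_obs)
    s_yy = sum((b - mu_y) ** 2 for b in y_obs) / len(y_obs)
    s_xy = 0.0

    for _ in range(iterations):
        # E-step: expected y, y^2 and x*y for each case under current parameters.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy  # variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        for a, b in zip(x, y):
            if b is None:
                ey = mu_y + beta * (a - mu_x)
                eyy = ey ** 2 + resid_var
            else:
                ey, eyy = b, b ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += a * ey

        # M-step: re-estimate the parameters from the expected quantities.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y

    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_yy": s_yy, "s_xy": s_xy}


# Hypothetical toy data; y is missing for two cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, None, 11.8]

est = em_bivariate(x, y)
print({name: round(value, 4) for name, value in est.items()})
```

A production implementation would stop when the change in the estimates falls below a tolerance rather than after a fixed number of iterations.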

Raw Maximum Likelihood or Full Information Maximum Likelihood (FIML) Method – With results typically represented as a covariance matrix of the variables and a vector of means, this method uses all available information in the observed data, including the means and variances based on the available data points for each variable. The advantage over the EM method is that it allows for the direct computation of appropriate standard errors and test statistics. The disadvantage is that it is difficult to include additional variables that would improve the accuracy of the parameter estimates but that are not used in the final statistical model as predictors or outcomes. It also requires specialized programming and can be very time intensive.
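
The core of the approach can be sketched as a casewise log-likelihood in which each record contributes according to whichever variables it has observed. The Python function below does this for a two-variable normal model. The model, the function name fiml_loglik, the toy rows, and the two candidate parameter sets are assumptions for illustration. In practice the function would be handed to a numerical optimizer to find the maximizing parameters, a step not shown here.

```python
from math import log, pi


def fiml_loglik(rows, mu_x, mu_y, s_xx, s_yy, s_xy):
    """Observed-data log-likelihood under a bivariate normal model.

    Each (x, y) row contributes the density of the variables it has observed;
    None marks a missing value.
    """
    det = s_xx * s_yy - s_xy ** 2
    total = 0.0
    for x, y in rows:
        if x is not None and y is not None:
            # Complete case: bivariate normal log-density.
            dx, dy = x - mu_x, y - mu_y
            quad = (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det
            total += -log(2 * pi) - 0.5 * log(det) - 0.5 * quad
        elif x is not None:
            # Only x observed: univariate normal log-density of x.
            total += -0.5 * log(2 * pi * s_xx) - 0.5 * (x - mu_x) ** 2 / s_xx
        elif y is not None:
            # Only y observed: univariate normal log-density of y.
            total += -0.5 * log(2 * pi * s_yy) - 0.5 * (y - mu_y) ** 2 / s_yy
        # A row with nothing observed contributes nothing.
    return total


# Hypothetical toy rows with values missing on either variable.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (None, 9.5), (6.0, 11.8)]

# Two hypothetical candidate parameter sets; the better one scores higher.
good = fiml_loglik(rows, mu_x=3.2, mu_y=7.0, s_xx=3.0, s_yy=12.0, s_xy=5.8)
poor = fiml_loglik(rows, mu_x=0.0, mu_y=0.0, s_xx=1.0, s_yy=1.0, s_xy=0.0)
print("log-likelihood, plausible parameters:  ", round(good, 2))
print("log-likelihood, implausible parameters:", round(poor, 2))
```

No record is discarded and no value is filled in: the partially observed rows simply contribute less information.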

Multiple Imputation – Much like the EM algorithm, multiple imputation generates a maximum likelihood-based covariance matrix and vector of means. It then introduces statistical uncertainty into the model and uses that uncertainty to replicate the natural variability found among the complete case data. Next, it imputes actual data values to fill in the incomplete data points in the data matrix, similar to the hot deck method. The difference is that it requires the construction of five to 10 data sets with imputed values, each of which is analyzed individually. The results are then combined into one summary set of findings. Once again, although very powerful, it can be a very time intensive method.
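
The overall workflow can be sketched in Python as follows. This is a deliberately simplified illustration on made-up data: the imputations come from a single regression line plus random residual noise, whereas a fuller implementation would also vary the regression parameters themselves from one imputed data set to the next. The combining step uses Rubin's rules, a standard way of pooling the results that the description above does not name. The function pool and all data values are assumptions for the example.

```python
import random
from math import sqrt
from statistics import mean, variance


def pool(estimates, variances):
    """Combine per-data-set results with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variability across imputed data sets
    return mean(estimates), within + (1 + 1 / m) * between


# Hypothetical toy data; y is missing for three cases.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, None, 8.2, None, 11.8, 14.1, None]

# Fit a regression of y on x using the complete cases.
complete = [(a, b) for a, b in zip(x, y) if b is not None]
mean_x = mean([a for a, _ in complete])
mean_y = mean([b for _, b in complete])
slope = (sum((a - mean_x) * (b - mean_y) for a, b in complete)
         / sum((a - mean_x) ** 2 for a, _ in complete))
intercept = mean_y - slope * mean_x
resid_sd = sqrt(sum((b - intercept - slope * a) ** 2 for a, b in complete)
                / (len(complete) - 2))

rng = random.Random(1)  # fixed seed so the run is repeatable
M = 5                   # number of imputed data sets
estimates, variances = [], []
for _ in range(M):
    # Build one completed data set with noisy regression predictions.
    completed = [
        b if b is not None else intercept + slope * a + rng.gauss(0, resid_sd)
        for a, b in zip(x, y)
    ]
    # Analyze it: here the analysis is simply the mean of y and its variance.
    estimates.append(mean(completed))
    variances.append(variance(completed) / len(completed))

pooled_mean, total_var = pool(estimates, variances)
print("pooled mean of y:", round(pooled_mean, 3))
print("pooled standard error:", round(sqrt(total_var), 3))
```

The between-data-set term is what carries the uncertainty due to the missing values into the final standard error.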