The Curse of the Missing Data
Posted by Yong Kim
Introduction
Regardless of the field, incomplete or missing data is almost always a problem when it comes to data analysis. It invariably raises the level of complexity in any examination, as a result of which many analysts take the "head in the sand" approach, doing little to evaluate its impact. Still, failure to properly mitigate its influence can cause substantial bias, diminishing, if not completely nullifying, the value of your results.
Missing data can occur for a variety of reasons. With surveys, for example, subjects may find certain questions inapplicable. In other cases, the questions may be relevant but the given responses are not. Furthermore, some subjects may simply refuse to answer certain types of questions, and this is to say nothing of the strictly technical pitfalls associated with corrupt databases, merging data from various sources, and input error.
Whatever the reason, however, missing data requires the analyst to consider additional issues. In many instances, identifying variables that explain the cause of missing data can help to mitigate the bias. For example, individuals with little education and fewer economic resources may skip questions about income. Such explanatory variables are known as "mechanism" variables, and by including them in your models you can eliminate the bias caused by some types of missing data. In fact, theoretically, if all the mechanism variables associated with a particular piece of missing data can be identified and included in a model as controls, the impact of the missing data can be statistically adjusted to the point where it is “ignorable” (Little & Rubin, 1987). In practice, however, it is extremely unlikely that mechanism variables can be identified for all cases of missing data.
Prediction vs. Estimation
A second practical question is whether the goal is estimation or prediction. In his paper "Prediction With Missing Inputs," Warren S. Sarle of the SAS Institute points out that missing data methods have traditionally been developed from the perspective of estimating parameters and computing test statistics. Although statistically sound, these traditional methods are not necessarily the most effective approaches in situations such as data mining, where the main emphasis is on generalization, i.e. the ability of a model to make good predictions.
For example, in cases where the proportion of missing data is fairly low, a common approach for estimation is to throw out the records with missing data. However, when you need to make predictions for cases with missing inputs, this is not an option. In fact, many popular missing data algorithms are inadequate for prediction models, either because they depend on arbitrary factors such as the order of cases in the dataset or a pseudorandom number sequence (e.g. hot deck imputation), or because they are computationally intensive and therefore too slow for real-time computation (e.g. multiple imputation and maximum likelihood). In addition, most traditional methods expect the data to be summarized as a covariance matrix or a vector of means, while in data mining, data sets are usually supplied as rows of individual records.
At the same time, there are a number of methods that are inappropriate for estimation but excellent for prediction. Two examples are ordinary least squares (OLS) with single imputation of conditional or unconditional means, and the use of dummy variables to flag records with missing data. Essentially, any method used for estimation of the complete model can be combined with methods of prediction based on the complete model. As Sarle points out, “instead of estimating one complete model, you can estimate many different models for different combinations of nonmissing inputs. And there are methods intended only for prediction that do not even try to estimate the complete model.”
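As a rough illustration of the two prediction-oriented ideas mentioned above (unconditional mean imputation and dummy variables that flag missing inputs), here is a minimal Python sketch. The function names, the column names, and the tiny training set are all hypothetical, and the sketch stops at preparing the inputs; any regression routine could then be fit on the resulting columns.

```python
from statistics import mean

def fit_means(rows, columns):
    """Learn the unconditional mean of each column from its observed (non-None) values."""
    return {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}

def impute_with_flags(row, means):
    """Return a copy of `row` in which each missing value is replaced by the
    training mean, plus a 0/1 indicator column recording what was filled in."""
    out = dict(row)
    for c, m in means.items():
        missing = row.get(c) is None
        out[c] = m if missing else row[c]
        out[c + "_missing"] = 1 if missing else 0
    return out

# Hypothetical training records; None marks a missing value.
train = [
    {"age": 30, "income": 40.0},
    {"age": 45, "income": None},
    {"age": 50, "income": 60.0},
]
means = fit_means(train, ["age", "income"])

# A new case with a missing input can still be scored, because the means
# come from the training data rather than from the case itself.
new_case = {"age": None, "income": 55.0}
scored = impute_with_flags(new_case, means)
print(scored)
```

The means are learned once from the training records and reused for every new case, which is what makes the approach cheap enough for real-time scoring. The same simplicity is why it is a poor choice for estimation: filling in a single mean understates the variability of the imputed variable.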
MCAR, MAR, and Nonignorable
All that being said, estimation nevertheless has a place in the real world, especially in research involving survey data such as panel studies. In these types of studies, nonresponse is common, especially on topics such as income, education, motivation, and crime. Consequently, the use of missing data methods is critical for reaching valid conclusions. As mentioned above, choosing the appropriate method will depend on the assumptions one makes about the missing data mechanism.
Little and Rubin divide these mechanisms into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR), and Nonignorable.
Missing data is considered to be MCAR when, given two variables, A and B, the probability of response is independent of both A and B. In other words, “missingness” is not related to the specified variables. As an example, suppose weight and age are variables of interest for a particular study. If the likelihood that a person will provide his or her weight information is the same for all individuals regardless of their weight or age, then the missing data is considered to be MCAR. This is the most restrictive of the three conditions. Generally, you can check whether MCAR is plausible by comparing the distribution of the observed data between respondents and nonrespondents. If MCAR is plausible, then Listwise or Casewise Deletion (omission of all records that contain missing data for any one variable) can be a good choice. Listwise deletion is considered a safe and conservative approach and is the default solution in many statistical packages. Its advantages are its simplicity and its minimal computational cost. When data is not MCAR, however, the results can be biased, and other, more robust methods might be more appropriate.
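The comparison of respondents and nonrespondents, followed by listwise deletion, might look something like the Python sketch below. The records are made up for illustration, and the Welch t statistic is just one convenient way to compare the two groups, not a method prescribed by the sources cited here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical survey records: age is always observed, weight is sometimes missing (None).
records = [
    {"age": 25, "weight": 70}, {"age": 30, "weight": None},
    {"age": 35, "weight": 80}, {"age": 40, "weight": 75},
    {"age": 45, "weight": None}, {"age": 50, "weight": 85},
    {"age": 55, "weight": 90}, {"age": 60, "weight": None},
]

# Compare the observed variable (age) between people who reported weight and those who did not.
resp_age = [r["age"] for r in records if r["weight"] is not None]
nonresp_age = [r["age"] for r in records if r["weight"] is None]
t = welch_t(resp_age, nonresp_age)
print(f"Welch t for age, respondents vs. nonrespondents: {t:.2f}")

# Listwise (casewise) deletion: keep only records with no missing values.
complete = [r for r in records if all(v is not None for v in r.values())]
print(f"{len(complete)} of {len(records)} records kept")
```

A large absolute t value is evidence against MCAR. A small one does not prove MCAR, since the check can only use variables that were actually observed and says nothing about whether response depends on the unreported weights themselves.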
Missing data is considered to be MAR when, given a fully observed variable A and a variable B that is subject to missingness, the probability of response depends on A but not on B. Again using the example of weight and age, if the likelihood that a person will provide his or her weight varies with the person’s age but, among people of the same age, does not vary with weight, then the missing data is considered to be MAR. Most missing data methods are designed under this assumption.
Lastly, missing data is considered to be nonignorable when, with A fully observed and B subject to missingness, the probability of response depends on B itself and possibly on A as well. In other words, missingness is nonrandom and cannot be predicted from the observed variables alone. An example of this would be if the likelihood of an individual providing his or her weight varied according to the person’s weight within each age category. Typically this is the hardest condition to deal with and, unfortunately, the most likely to occur as well. Consequently, there is much work being done in the area of nonignorable missing data.
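To make the practical difference between the three mechanisms concrete, the following Python simulation is a sketch under invented assumptions: age is always observed, weight may be missing, and the population model and response probabilities are arbitrary choices for illustration. It compares the bias of the complete-case (listwise deletion) mean of weight with a simple adjustment that averages within age groups.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: weight rises with age, plus noise.
people = []
for _ in range(20000):
    age = random.uniform(20, 70)
    weight = 50 + 0.5 * age + random.gauss(0, 10)
    people.append((age, weight))

# Each mechanism gives the probability that a person reports weight.
mechanisms = {
    "MCAR": lambda age, weight: 0.7,                                    # depends on nothing
    "MAR": lambda age, weight: 0.9 if age < 45 else 0.5,                # depends on observed age only
    "nonignorable": lambda age, weight: 0.9 if weight < 72.5 else 0.5,  # depends on weight itself
}

def stratified_mean(respondents, everyone):
    """Average respondents' weight within each age group, then weight the
    groups by their share of the full sample (age is always observed)."""
    total = 0.0
    for in_group in (lambda a: a < 45, lambda a: a >= 45):
        group_mean = mean(w for a, w in respondents if in_group(a))
        share = sum(1 for a, _ in everyone if in_group(a)) / len(everyone)
        total += share * group_mean
    return total

true_mean = mean(w for _, w in people)
bias = {}
for name, p_respond in mechanisms.items():
    respondents = [(a, w) for a, w in people if random.random() < p_respond(a, w)]
    bias[name] = {
        "complete_case": mean(w for _, w in respondents) - true_mean,
        "age_adjusted": stratified_mean(respondents, people) - true_mean,
    }

for name, b in bias.items():
    print(f"{name:13s} complete-case bias {b['complete_case']:+.2f}  "
          f"age-adjusted bias {b['age_adjusted']:+.2f}")
```

In this setup the complete-case mean should be close to unbiased only under MCAR. Under MAR, adjusting for age should remove most of the bias, because age is the mechanism variable. Under the nonignorable mechanism the bias should remain even after the adjustment, since response depends on the unobserved weights themselves.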
Conclusion
Dealing with missing data is a fact of life and the source of many headaches, but developments in missing data algorithms for both prediction and parameter estimation are providing some relief. Still, they are no substitute for careful planning. When it comes to missing data, prevention is the best medicine.