Roderick Rose

Handling Missing Data

Posted on February 16, 2006

Roderick Rose suggests multiple imputation as a way of dealing with some types of missing data.


Q: Why should I think about how to handle missing data?
A: Incomplete or missing data are almost always a problem in data analysis. Respondents often leave items blank on questionnaires or decline to answer during interviews. Sometimes the proportion of missing data can be quite sizeable. Failing to account for missing data can cause substantial bias, so missing data represent a serious threat to internal validity.

The most appropriate way to handle missing data will depend upon how data points are missing. There are three types of missing data mechanisms:

MCAR (missing completely at random)
  • The probability of having a missing value for X is unrelated to the value of X itself or to any other variables in the data set.

  • Example: Income is MCAR if people who do not report their income have, on average, the same income (and the same values on each of the other variables in the data set) as people who do report their income.

MAR (missing at random)
  • The probability of having a missing value for X can be explained by other observed variables in the data set and, once those variables are taken into account, is unrelated to the value of X itself.

  • Example: Income is MAR if the probability of missing data on income depends on marital status, but within each category of marital status (single, married, divorced, etc.), the probability of missing income is unrelated to the value of income. Marital status must be observed.

NMAR (not missing at random)
  • The probability of having a missing value for X depends on the value of X itself and cannot be fully explained by other variables in the data set.

  • Example: Income is NMAR if households with low incomes are less likely to report their income even after adjusting for other variables.
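The three mechanisms can be illustrated with a small simulation. This is only a sketch under invented assumptions: the income distributions, the missingness probabilities, and the function name `simulate` are all hypothetical and do not come from any real survey. It compares the mean of the incomes that remain observed with the mean of all incomes.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return (mean of all incomes, mean of observed incomes) for one mechanism."""
    rng = random.Random(seed)
    incomes, observed = [], []
    for _ in range(n):
        married = rng.random() < 0.5
        # Hypothetical incomes: married respondents earn more on average.
        income = rng.gauss(60 if married else 40, 10)
        if mechanism == "MCAR":
            p_missing = 0.3                           # same chance for everyone
        elif mechanism == "MAR":
            p_missing = 0.5 if married else 0.1       # depends only on marital status
        else:  # "NMAR"
            p_missing = 0.5 if income < 50 else 0.1   # depends on income itself
        incomes.append(income)
        if rng.random() >= p_missing:
            observed.append(income)
    return statistics.mean(incomes), statistics.mean(observed)

for mechanism in ("MCAR", "MAR", "NMAR"):
    all_mean, observed_mean = simulate(mechanism)
    print(mechanism, round(all_mean, 1), round(observed_mean, 1))
```

In this setup the observed mean stays close to the full mean only under MCAR. Under MAR the overall observed mean drifts, but within each marital status group it would not, because marital status is observed and can be adjusted for. Under NMAR no observed variable accounts for the gap.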


Q: How do I handle missing data?
A: With MCAR, some researchers choose to discard subjects with missing items or surveys, or use ad hoc procedures such as substituting the sample mean of each item or the mean of each scale for the missing values. Mean substitution is not recommended, however, because it artificially reduces standard errors and increases the chance of observing a significant effect when none exists. A likelihood ratio test for MCAR exists (Little, 1988), but your data will probably not pass that test because MCAR is unlikely in most scenarios. When MCAR cannot be assumed, multiple imputation may be a good option.
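Why mean substitution shrinks standard errors can be seen in a few lines. The sketch below uses made-up data (1,000 simulated values with about 30% deleted completely at random). Every filled-in value sits exactly at the mean, so it adds nothing to the sum of squared deviations while still being counted as an observation.

```python
import random
import statistics

rng = random.Random(1)
complete = [rng.gauss(50, 10) for _ in range(1000)]
# MCAR: delete about 30% of the values regardless of their size.
observed = [x if rng.random() >= 0.3 else None for x in complete]

present = [x for x in observed if x is not None]
fill = statistics.mean(present)
mean_filled = [fill if x is None else x for x in observed]

# Standard deviation before and after mean substitution.
sd_observed = statistics.stdev(present)
sd_filled = statistics.stdev(mean_filled)

# Naive standard errors of the mean: the filled data set looks more
# precise than the data actually collected can justify.
se_observed = sd_observed / len(present) ** 0.5
se_filled = sd_filled / len(mean_filled) ** 0.5
print(sd_observed > sd_filled, se_observed > se_filled)
```

The mean itself is unchanged by the substitution; only the apparent variability and the apparent sample size are distorted.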

Q: What is multiple imputation?
A: In multiple imputation, missing values for any variable are predicted using existing values from other variables. The predicted or imputed values are substituted for the missing values, resulting in an imputed data set. This process is performed multiple times, producing multiple imputed data sets. The desired statistical analysis is then carried out on each imputed data set, producing multiple analysis results, which are combined to produce one overall result. The quality of the data generated with multiple imputation depends on assumptions about the missing data, the analytical model, and the model used to impute the data. The imputed values produced from an imputation model are not guesses about what a particular missing value might be. The goal is to create an imputed data set that maintains the overall variability in the population while preserving relationships with other variables.
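The impute, analyze, and combine cycle can be sketched as follows. Several things here are assumptions rather than part of the text above: the data are simulated, the imputation model is a single linear regression with random noise added to each prediction, and the results are combined with Rubin's rules, a widely used pooling method that the text does not name. A full implementation would also redraw the regression parameters for each imputed data set, so treat this as a simplified illustration.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; also returns the residual SD."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = math.sqrt(sum(e * e for e in resid) / (len(xs) - 2))
    return a, b, sd

def impute_once(xs, ys, rng):
    """Fill each missing y with a regression prediction plus random noise."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    return [y if y is not None else a + b * x + rng.gauss(0, sd) for x, y in zip(xs, ys)]

def pool(estimates, variances):
    """Rubin's rules: pooled estimate and its standard error."""
    m = len(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    return statistics.mean(estimates), math.sqrt(within + (1 + 1 / m) * between)

rng = random.Random(3)
xs = [rng.gauss(12, 2) for _ in range(500)]              # hypothetical years of education
full = [20 + 3 * x + rng.gauss(0, 5) for x in xs]        # hypothetical income
ys = [y if rng.random() >= 0.3 else None for y in full]  # about 30% of income missing

estimates, variances = [], []
for _ in range(5):                                       # five imputed data sets
    completed = impute_once(xs, ys, rng)
    estimates.append(statistics.mean(completed))         # the analysis: mean income
    variances.append(statistics.variance(completed) / len(completed))
pooled_mean, pooled_se = pool(estimates, variances)
print(round(pooled_mean, 2), round(pooled_se, 2))
```

The random noise in `impute_once` is what keeps the imputed values from understating variability, and the between-imputation term in `pool` carries the extra uncertainty due to the missing data into the final standard error.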

Q: What are the steps to multiple imputation?
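A: First, don't assume you have to use multiple imputation. Multiple imputation can reduce bias in MAR situations and, to a lesser extent, in NMAR situations; because standard imputation models assume MAR, some bias usually remains when data are NMAR. Listwise deletion (deleting any case missing any variable) is simpler and gives unbiased estimates when data are MCAR, at the cost of a smaller sample, and maximum likelihood techniques may work under some circumstances with MAR data.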

Second, you need to know what your analytical model will be. The analytical model will be used for testing hypotheses. Knowing your analytical model will allow you to establish and document the X, Y, and Z variables in your imputation model. X variables are fully-observed variables that will be used in the analytical model. Y variables are analytical variables as well, but they may contain missing observations. You must use all X variables that are available and all Y variables in the imputation model in order to reduce bias. Z variables are those in your data set that will not be used in the analysis but can improve the quality of the imputed data if they are included in your imputation model. Your Z variables may have missing observations. You should include Z variables that have relatively low rates of missingness and are highly informative about the missing data.

Example: Assume we want to test the influence of gender and marital status on income, and we have a data set containing education, marital status, gender, and income, with education, marital status, and income missing on a subset of observations. X contains gender (fully-observed analytical variable); Y contains marital status and income (partly-missing analytical variables); and Z contains education (partly-missing and non-analytical, but possibly informative about income).
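The X, Y, and Z bookkeeping for this example can be written down as a small helper. The function name `classify_variables`, the column names, and the three toy records are hypothetical. The helper only sorts variables into roles; deciding which Z candidates are informative enough to keep remains a judgment call.

```python
def classify_variables(rows, analysis_vars):
    """Split column names into X, Y, and Z roles for an imputation model."""
    names = list(rows[0].keys())
    has_missing = {n: any(r[n] is None for r in rows) for n in names}
    roles = {"X": [], "Y": [], "Z": []}
    for name in names:
        if name not in analysis_vars:
            roles["Z"].append(name)   # not analyzed, but possibly informative
        elif has_missing[name]:
            roles["Y"].append(name)   # analyzed and partly missing
        else:
            roles["X"].append(name)   # analyzed and fully observed
    return roles

# Toy records; None marks a missing value.
rows = [
    {"gender": "F", "marital_status": "married", "income": 52000, "education": 16},
    {"gender": "M", "marital_status": None, "income": 41000, "education": None},
    {"gender": "F", "marital_status": "single", "income": None, "education": 12},
]
roles = classify_variables(rows, analysis_vars={"gender", "marital_status", "income"})
print(roles)
```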

Third, exploratory multivariate methods used to establish a good imputation model may themselves be compromised by the missing data. Be wary of large models (too many variables, too many missing data points) and models with very few degrees of freedom. Alternatively, use methods that rely only on pairwise deletion, which means that a case is excluded only from calculations involving variables for which it has missing data.
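The difference between pairwise and listwise deletion can be sketched with a correlation computed on a tiny invented data set. The helper names `pearson` and `pairwise_corr` and the six records are hypothetical. Each correlation uses every case observed on both variables in the pair, while listwise deletion keeps only cases that are complete on every variable.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pairwise_corr(rows, a, b):
    """Correlate a and b on all cases observed on both; return (r, n used)."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)

# Toy records; None marks a missing value.
rows = [
    {"income": 30, "education": 10, "age": 25},
    {"income": 45, "education": 12, "age": None},
    {"income": None, "education": 16, "age": 40},
    {"income": 60, "education": 16, "age": 50},
    {"income": 52, "education": None, "age": 45},
    {"income": 38, "education": 12, "age": 30},
]
r_ie, n_ie = pairwise_corr(rows, "income", "education")
r_ia, n_ia = pairwise_corr(rows, "income", "age")
# Listwise deletion keeps only the cases with no missing value at all.
n_listwise = sum(all(v is not None for v in r.values()) for r in rows)
print(n_ie, n_ia, n_listwise)
```

Because each pair of variables may draw on a different subset of cases, it helps to report the number of cases behind each statistic.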

Based on a presentation at the Society for Social Work and Research Annual Conference in San Antonio, TX, in January 2006. Referenced article: Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198-1202.

 
