Roderick Rose

Handling Missing Data

Posted on February 16, 2006

Roderick Rose suggests multiple imputation as a way of dealing with some types of missing data.


Q: Why is it important to consider missing data early in the research process?
A: Incomplete or missing data are almost always a problem in data analysis. Respondents often leave items blank on questionnaires or decline to give a response during interviews. Sometimes the proportion of missing data can be quite sizable. Failing to account for missing data can substantially bias parameter estimates and can also reduce efficiency and power. Missing data represent a serious threat to internal validity.

The most appropriate way to handle missing data will depend upon how data points are missing, specifically, to what extent they are completely or conditionally random. There are three types of missing data distributions, also known as missing data mechanisms:

MCAR (missing completely at random)

  • The probability of having a missing value for X is unrelated to the value of X itself or to any other variables in the data set.

  • Example: Income is MCAR if people who do not report their income have the same income on average as people who do report income. The subjects with the missing data have to match the subjects who have data on each of the other variables in the data set, like marital status. In other words, there is no relationship at all between non-response and the values of other variables.


MAR (missing at random)

  • This is actually a systematic type of missingness. The missing value for X can be explained by other variables in the data set. After accounting for these other variables, however, the missing values are random. This is why the data are called "missing at random"; a better term might be "conditionally random."

  • Example: Income is MAR if the probability of missing data on income depends on an observed variable like marital status, but within each category of marital status (single, married, divorced, etc.), the probability of missing income is unrelated to the value of income. Marital status must be observed.


NMAR (not missing at random)

  • This is also a systematic type of missingness. The missing value is not completely random and cannot be completely explained by other variables in the data set. Even after the relationships between missing and observed data are accounted for, missing values remain conditioned on other unobserved factors, including the missing values themselves.

  • Example: Income is NMAR if households with low incomes are less likely to report their income even after researchers adjust for other observed variables.


Understanding these three types of distributions is critical, but the distinctions between MAR and NMAR are less important in practice than in theory.
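A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the income distribution, the marital-status effect, and the missingness probabilities are all invented assumptions, not values from any study.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: married respondents earn more on average.
n = 20000
married = [rng.random() < 0.5 for _ in range(n)]
income = [rng.gauss(60 if m else 40, 10) for m in married]  # in $1,000s

def observed_mean(p_missing):
    """Mean income among respondents whose income was not deleted."""
    kept = [y for y, p in zip(income, p_missing) if rng.random() >= p]
    return statistics.fmean(kept)

# MCAR: everyone has the same 30% chance of a missing income.
mcar = observed_mean([0.3] * n)

# MAR: missingness depends only on the observed marital status.
mar = observed_mean([0.5 if m else 0.1 for m in married])

# NMAR: missingness depends on the income value itself.
nmar = observed_mean([0.6 if y < 50 else 0.1 for y in income])

true_mean = statistics.fmean(income)
print(f"true {true_mean:.1f}  MCAR {mcar:.1f}  MAR {mar:.1f}  NMAR {nmar:.1f}")
```

In this setup the mean of the observed incomes stays close to the true mean only under MCAR. Under MAR the overall observed mean is off, but within each marital-status group the observed mean is still close to that group's true mean, which is what an imputation model that conditions on marital status can exploit.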

Q: How do I handle missing data?
A: Some researchers may choose to discard surveys with missing items or perform ad hoc procedures such as substituting the sample mean of each missing item or the mean of each scale for the missing values. Mean substitution is not recommended, however, because it artificially reduces standard errors and so increases the chance of observing a significant effect when none exists.
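To see why mean substitution shrinks standard errors, consider this minimal sketch. The scores and the 30% missingness rate are hypothetical; the point is only that filled-in means add cases without adding any spread.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical item scores; about 30% are then deleted completely at random.
complete = [rng.gauss(50, 10) for _ in range(1000)]
observed = [y for y in complete if rng.random() >= 0.3]
n_missing = len(complete) - len(observed)

# Mean substitution: every missing score is replaced by the observed mean.
fill = statistics.fmean(observed)
mean_filled = observed + [fill] * n_missing

def standard_error(values):
    """Naive standard error of the mean: sd / sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(mean_filled)
se_observed = standard_error(observed)
se_filled = standard_error(mean_filled)

print(f"SD: observed {sd_observed:.2f}, after mean fill {sd_filled:.2f}")
print(f"SE: observed {se_observed:.3f}, after mean fill {se_filled:.3f}")
```

The filled-in values sit exactly at the mean, so the standard deviation drops while the apparent sample size grows. Both effects push the standard error down, even though no new information has been collected.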

It is actually ideal if the missing values are MCAR, because listwise deletion (deleting any case missing any variable) does not produce biased estimates when data are MCAR. A likelihood ratio test for MCAR exists (Little, 1988), but most data will not pass that test because MCAR is unlikely in most scenarios. At that point, multiple imputation may be a good option.

Q: What is multiple imputation?
A: In multiple imputation, missing values for any variable are predicted using existing values from other variables. The predicted or imputed values are substituted for the missing values, resulting in an imputed data set. This process is performed multiple times, producing multiple imputed data sets. The creation of multiple versions of the data set with different imputed values in each version helps to account for the uncertainty of the missing values.
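The idea of producing several imputed data sets can be sketched as follows. This is a simplified illustration, not the procedure any particular imputation package uses: it assumes one partly missing variable (income), one fully observed predictor (education), a linear relationship between them, and a bootstrap refit to make the prediction model vary from copy to copy. All data are simulated.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical data: income (in $1,000s) rises with years of education.
n = 500
education = [rng.gauss(14, 2) for _ in range(n)]
income = [5 + 3 * e + rng.gauss(0, 6) for e in education]
# Delete about 30% of incomes; None marks a missing value.
income_obs = [y if rng.random() >= 0.3 else None for y in income]

complete_cases = [(e, y) for e, y in zip(education, income_obs) if y is not None]

def fit_line(pairs):
    """Least-squares slope, intercept, and residual SD for (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    return slope, intercept, statistics.stdev(residuals)

def impute_once():
    """Return one completed copy of income_obs."""
    # Refit on a bootstrap sample so the prediction model differs per copy.
    boot = rng.choices(complete_cases, k=len(complete_cases))
    slope, intercept, resid_sd = fit_line(boot)
    # Each missing income gets a predicted value plus random residual noise.
    return [
        y if y is not None else intercept + slope * e + rng.gauss(0, resid_sd)
        for e, y in zip(education, income_obs)
    ]

m = 5  # number of imputed data sets
imputed_sets = [impute_once() for _ in range(m)]
```

Observed incomes are identical in all five copies; only the imputed values differ. That variation across copies is what carries the uncertainty about the missing values into the later analysis.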

The desired statistical analysis is then carried out on each imputed data set, producing multiple analysis results, which are combined to produce one overall result. The standard error of the overall result is inflated to account for the variation between the multiple versions of the imputed data, making the uncertainty about the missing values an explicit part of the hypothesis testing.
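The combining step can be sketched with the pooling rules commonly attributed to Rubin, which is an assumption here since the text does not name a specific rule. The estimates and standard errors below are made-up numbers standing in for the results of the same analysis run on five imputed data sets.

```python
import statistics

def pool(estimates, std_errors):
    """Combine per-data-set results into one estimate and standard error."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.fmean([se ** 2 for se in std_errors])
    # Between-imputation variance: how much the estimates disagree.
    between = statistics.variance(estimates)
    # Total variance inflates the within part by the between part.
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5

# Hypothetical regression coefficients from m = 5 imputed data sets.
estimates = [2.10, 2.35, 1.95, 2.20, 2.40]
std_errors = [0.50, 0.52, 0.49, 0.51, 0.50]

pooled_estimate, pooled_se = pool(estimates, std_errors)
print(f"pooled estimate {pooled_estimate:.3f}, pooled SE {pooled_se:.3f}")
```

The pooled standard error is larger than any of the individual standard errors because the disagreement among the five estimates is added in. That is the inflation the answer describes.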

The quality of the data generated with multiple imputation depends on assumptions about the missing data, the analytical model, and the model used to impute the data. The imputed values produced from an imputation model are not guesses about what a particular missing value might be. The goal is to create an imputed data set that maintains the overall variability in the population while preserving relationships with other variables.

Q: Should I always use multiple imputation with missing data?
A: Don't assume you have to use multiple imputation. Multiple imputation reduces bias in MAR and NMAR situations. Listwise deletion (deleting any case missing any variable) is possible when data are MCAR, and maximum likelihood techniques may work under some circumstances with MAR data. The Little test can tell you whether it is likely your data are MCAR or not.

Q: What are the steps to multiple imputation?
A: Once you have determined that imputation is necessary, you need to know what your analytical model will be. The analytical model will be used for testing hypotheses. Knowing your analytical model will allow you to establish and document the X, Y, and Z variables in your imputation model.

  • X variables: fully-observed variables that will be used in the analytical model

  • Y variables: analytical variables that may contain missing observations

  • Z variables: those in your data set that will not be used in the analysis but can improve the quality of the imputed data if they are included in your imputation model

You must use all X variables and all Y variables in the imputation model in order to preserve associations inherent in the analysis model; otherwise, parameters may end up biased. Z variables, also known as auxiliary variables, may have missing observations. You should include Z variables that are highly informative of the missing data and have a relatively low rate of missingness.

Example: Assume we want to test the influence of gender and marital status on income, and we have a data set containing education, marital status, gender, and income, with education, marital status, and income missing on a subset of observations.
  • X contains gender (fully-observed analytical variable)

  • Y contains marital status and income (partly-missing analytical variables)

  • Z contains education (partly-missing non-analytical, but possibly informative about income)
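One way to document the X, Y, and Z variables is to derive the roles directly from the data. The sketch below is a hypothetical helper, not part of any imputation package; it uses a four-row toy data set with None marking missing values and follows the definitions above.

```python
# Toy data set for the example; None marks a missing value.
rows = [
    {"gender": "F", "marital_status": "married", "income": 52.0, "education": 16},
    {"gender": "M", "marital_status": None, "income": 41.0, "education": 12},
    {"gender": "F", "marital_status": "single", "income": None, "education": None},
    {"gender": "M", "marital_status": "divorced", "income": 38.5, "education": 14},
]
analysis_vars = {"gender", "marital_status", "income"}

def classify(rows, analysis_vars):
    """Label each variable X, Y, or Z for the imputation model."""
    roles = {}
    for name in rows[0]:
        has_missing = any(row[name] is None for row in rows)
        if name not in analysis_vars:
            roles[name] = "Z"  # auxiliary: not in the analytical model
        elif has_missing:
            roles[name] = "Y"  # analytical variable with missing values
        else:
            roles[name] = "X"  # fully observed analytical variable
    return roles

roles = classify(rows, analysis_vars)

# All X and Y variables go into the imputation model; Z variables are a choice.
chosen_auxiliary = {"education"}
imputation_model = [
    name for name, role in roles.items()
    if role in ("X", "Y") or name in chosen_auxiliary
]
print(roles)
print(imputation_model)
```

Writing the roles down this way makes it easy to confirm that no analytical variable has been left out of the imputation model.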

Next, consider the distributions of the variables used in the imputation model. Usually, imputation software assumes multivariate normality of the variables, which disqualifies categorical data and multilevel data of all types. Specialized software is available for imputing categorical and multilevel data. Fortunately, imputation has been shown to be robust to large deviations from the multivariate normality assumption.

Q: Are there ways besides buying special software to use multivariate normal imputation with categorical and multilevel data?
A: Certain tricks can be employed to make categorical data and multilevel data behave in a way consistent with multivariate normality.

Categorical data can be converted to binary (e.g., if you have a 5-category categorical variable, it can be converted to 4 dummy variables), and binary variables can be imputed under the assumption that they are distributed normally.
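The dummy-variable conversion might look like the sketch below. The category labels and the helper name dummy_code are invented for illustration. Missing values are kept as None in every dummy column so that they remain available for imputation.

```python
def dummy_code(values, reference):
    """Turn one categorical variable into k - 1 indicator columns."""
    levels = sorted({v for v in values if v is not None} - {reference})
    columns = {}
    for level in levels:
        # 1 if the case is in this category, 0 if not, None if unknown.
        columns[f"is_{level}"] = [
            None if v is None else int(v == level) for v in values
        ]
    return columns

# Hypothetical 5-category marital status variable with one missing value.
marital = ["single", "married", None, "divorced", "widowed", "separated", "married"]
dummies = dummy_code(marital, reference="single")

for name, column in dummies.items():
    print(name, column)
```

With "single" as the reference category, the five categories become four dummy variables, and a case in the reference category is coded 0 on all four.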

Longitudinal data--a type of multilevel data whereby repeated measures are collected on study cases--can be imputed using multivariate normal software by using one record per case, with a separate variable for each repeated measure. After imputation, the data can be converted to the period-by-case format required for analysis.
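The reshaping for longitudinal data can be sketched as follows. The variable names (id, wave, score) and the three-wave toy data are assumptions for the example; the imputation step itself is left out.

```python
# Period-by-case ("long") format: one record per case per wave.
long_rows = [
    {"id": 1, "wave": 1, "score": 10.0},
    {"id": 1, "wave": 2, "score": 12.0},
    {"id": 1, "wave": 3, "score": 15.0},
    {"id": 2, "wave": 1, "score": 9.0},
    # Case 2 has no record for wave 2.
    {"id": 2, "wave": 3, "score": 14.0},
]
waves = [1, 2, 3]

def to_wide(rows, waves):
    """One record per case, with a separate variable for each wave."""
    wide = {}
    for r in rows:
        case = wide.setdefault(r["id"], {f"score_w{w}": None for w in waves})
        case[f"score_w{r['wave']}"] = r["score"]
    return [{"id": case_id, **cols} for case_id, cols in wide.items()]

def to_long(wide_rows, waves):
    """Back to one record per case per wave, for analysis."""
    return [
        {"id": r["id"], "wave": w, "score": r[f"score_w{w}"]}
        for r in wide_rows
        for w in waves
    ]

wide_rows = to_wide(long_rows, waves)
# ... impute the None values in wide_rows here ...
analysis_rows = to_long(wide_rows, waves)
print(wide_rows)
```

In the one-record-per-case layout, the missing wave shows up as an explicit empty cell (score_w2 for case 2), which is what lets the imputation model use the other waves to predict it.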

Clustered data can be imputed by using “by” processing: conducting a separate imputation for each cluster of cases.
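A rough sketch of "by" processing is shown below. The schools and scores are invented, and impute_cluster is only a placeholder that fills each gap with a random draw from the same cluster's observed values; in practice you would call your actual imputation routine at that point.

```python
import random
from collections import defaultdict

rng = random.Random(4)

# Hypothetical clustered data: students nested in two schools.
rows = [
    {"school": "A", "score": 71.0},
    {"school": "A", "score": None},
    {"school": "A", "score": 84.0},
    {"school": "B", "score": 45.0},
    {"school": "B", "score": 52.0},
    {"school": "B", "score": None},
]

def impute_cluster(values):
    """Placeholder imputer: draw each missing value from the observed ones."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

# Split the cases by cluster.
by_cluster = defaultdict(list)
for row in rows:
    by_cluster[row["school"]].append(row)

# Run a separate imputation within each cluster.
for cluster_rows in by_cluster.values():
    filled = impute_cluster([r["score"] for r in cluster_rows])
    for r, value in zip(cluster_rows, filled):
        r["score"] = value

print(rows)
```

Because each cluster is imputed on its own, a missing score in school A can only be filled from school A's information. The trade-off is that every cluster needs enough observed cases to support its own imputation.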

Q: What is important to keep in mind when choosing an imputation model?
A: Exploratory multivariate methods for establishing a good imputation model may themselves be compromised by the missing data. Be wary of large models (too many variables, too many missing data points) and models with very few degrees of freedom. The overall robustness of the imputation model is important in determining whether it reduces parameter bias or worsens it.

Toward this goal, the model should be tested for collinearity before imputation. The VIF (Variance Inflation Factor) is a standard tool for assessing collinearity; VIFs approaching 10 usually indicate harmful collinearity. However, because the VIF is a multivariate procedure that listwise deletes every observation with a missing value, the VIF may be artificially inflated when data are missing.
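As a rough illustration of a VIF check, the sketch below covers only the special case of two predictors, where the VIF for each is 1 / (1 - r^2) and r is their correlation. The variables and their relationship are simulated assumptions. The code also reports how many cases listwise deletion removes before the VIF is computed.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical predictors: experience is strongly related to education.
n = 400
education = [rng.gauss(14, 2) for _ in range(n)]
experience = [30 - 1.5 * e + rng.gauss(0, 1.5) for e in education]

# Delete about 20% of each variable at random; None marks a missing value.
education_obs = [e if rng.random() >= 0.2 else None for e in education]
experience_obs = [x if rng.random() >= 0.2 else None for x in experience]

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Listwise deletion: keep only cases observed on both predictors.
complete = [
    (a, b) for a, b in zip(education_obs, experience_obs)
    if a is not None and b is not None
]
n_dropped = n - len(complete)

xs, ys = zip(*complete)
r = pearson_r(xs, ys)
vif = 1 / (1 - r ** 2)  # two-predictor special case
print(f"dropped {n_dropped} of {n} cases; r = {r:.2f}; VIF = {vif:.1f}")
```

With more than two predictors, each VIF comes from regressing one predictor on all the others, so a single missing value on any of them removes the whole case. That is why the count of dropped cases is worth reporting alongside the VIF.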

Statistics such as the fraction of missing information and the relative efficiency, which are calculated for each variable after imputation, can help you judge how robust the model is for your data. These statistics are usually available in imputation software.
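For readers who want to see roughly what those statistics measure, here is a sketch using the standard large-sample formulas: the share of total variance due to missing data, and relative efficiency computed as 1 / (1 + fraction / m) for m imputations. Imputation software typically applies a degrees-of-freedom correction to the first quantity, so its output may differ slightly. The estimates and standard errors are made-up numbers.

```python
import statistics

def missing_info_ratio(estimates, std_errors):
    """Large-sample share of the total variance that is due to missing data."""
    m = len(estimates)
    within = statistics.fmean([se ** 2 for se in std_errors])
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return (1 + 1 / m) * between / total

def relative_efficiency(fraction_missing_info, m):
    """Efficiency of m imputations relative to an infinite number."""
    return 1 / (1 + fraction_missing_info / m)

# Hypothetical coefficient estimates from m = 5 imputed data sets.
estimates = [2.10, 2.35, 1.95, 2.20, 2.40]
std_errors = [0.50, 0.52, 0.49, 0.51, 0.50]

fmi = missing_info_ratio(estimates, std_errors)
print(f"approximate fraction of missing information: {fmi:.3f}")
for m in (3, 5, 10, 20):
    print(f"m = {m:2d}: relative efficiency {relative_efficiency(fmi, m):.3f}")
```

A high fraction of missing information for a variable suggests the imputation model says little about its missing values. Relative efficiency shows how much precision is given up by using a small number of imputations.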

The process described above is presented in detail in Rose & Fraser (2008).

Based on presentation at the Society for Social Work and Research Annual Conference in San Antonio, TX in January 2006.

Referenced articles:

Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198-1202.

Rose, R. A., & Fraser, M. W. (2008). A simplified framework for using multiple imputation in social work research. Social Work Research.

 
