Missing Data and Imputation Methods

Patrick Schober; Thomas R Vetter

doi:10.1213/ANE.0000000000005068

Anesth Analg. 2020 Oct 20;131(5):1419–1420.

PMCID: PMC7553195 PMID: 33079865

KEY POINT:

Missing data reduce statistical power, may bias the analysis results, and thus should be appropriately described and addressed in any research report.

Related Article, see p 1421

In this issue of Anesthesia & Analgesia, Guglielminotti and Li¹ report results of a retrospective cohort study on the relationship between general anesthesia for cesarean delivery and postpartum depression. In their dataset, a variable amount of data was missing for several variables, which the authors addressed by multiple imputation.

Missing data pose several problems for the data analysis, in particular, loss of statistical power and potential for bias. It is thus important that researchers clearly disclose which and how much data are missing. The underlying cause or mechanism for missing data has important implications (1) for the potential to bias the analysis results and (2) for the techniques that can be used to address the missing data problem. Three such mechanisms are distinguished:

Missing completely at random (MCAR): The probability of missingness is unrelated to the values of the specific variable with the missing values and also unrelated to other variable(s).
Missing at random (MAR): The probability of missingness is unrelated to the values of the specific variable with the missing values, but it is systematically related to other variable(s). For example, when blood pressure measurements are missing more often in patients with severe comorbidities (eg, because these patients are less mobile and thus less likely to show up in clinic), the mechanism would be compatible with MAR.
Missing not at random (MNAR): The probability of missing data for a specific variable is systematically related to the values of this variable itself. For example, if individuals with a higher salary are less likely to report their income in a questionnaire, the probability of missingness is related to income level itself.
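
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, effect sizes, and missingness probabilities are assumptions chosen for the example, not values from any study. It generates a comorbidity score and a correlated blood pressure, deletes blood pressure values under each mechanism, and compares the mean of the remaining values with the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: a comorbidity score and a blood pressure that rises with it.
comorbidity = rng.normal(size=n)
blood_pressure = 120 + 10 * comorbidity + rng.normal(scale=10, size=n)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# MCAR: every value has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the fully observed comorbidity score;
# given comorbidity, it is unrelated to the blood pressure value.
miss_mar = rng.random(n) < sigmoid(-1 + 1.5 * comorbidity)
# MNAR: missingness depends on the blood pressure value itself.
miss_mnar = rng.random(n) < sigmoid(-1 + 0.15 * (blood_pressure - 120))

true_mean = blood_pressure.mean()
observed_means = {
    "MCAR": blood_pressure[~miss_mcar].mean(),
    "MAR": blood_pressure[~miss_mar].mean(),
    "MNAR": blood_pressure[~miss_mnar].mean(),
}
for name, mean in observed_means.items():
    print(f"{name}: observed mean {mean:.1f} vs full-data mean {true_mean:.1f}")
```

Under these assumptions, the observed mean stays close to the full-data mean only in the MCAR scenario. In the MAR and MNAR scenarios, high values are missing more often, so the mean of the observed values is too low.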

Determining the mechanism for missing data is not straightforward. Statistical tests, such as Little's test, are available to test the null hypothesis that the data are MCAR. However, while a significant test statistic does provide evidence refuting MCAR, a nonsignificant test does not provide evidence for MCAR. Moreover, there is no way to test whether the missingness is related to the variable itself without knowing the values of the missing data. It is therefore only possible to reject the MCAR assumption, but not to test for a particular mechanism. Reasoning about why data are missing, and about how this could be related to variables in the dataset, helps in making assumptions about the missing data mechanism.
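
The asymmetry described above can be illustrated with a simple check. The sketch below does not implement Little's test. As a simpler stand-in, it computes a Welch t statistic that compares a fully observed covariate between subjects with and without a missing value on another variable. The function name `welch_t` and all simulated quantities are assumptions made for this example.

```python
import numpy as np


def welch_t(x, missing):
    """Welch t statistic comparing covariate x between rows where another
    variable is missing and rows where it is observed."""
    a, b = x[missing], x[~missing]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)  # fully observed covariate
y = rng.normal(size=n)  # variable with missing values, independent of x here

miss_mar = rng.random(n) < 1 / (1 + np.exp(-x))   # missingness depends on x
miss_mnar = rng.random(n) < 1 / (1 + np.exp(-y))  # missingness depends on y itself

t_mar = welch_t(x, miss_mar)
t_mnar = welch_t(x, miss_mnar)
print(f"t statistic, missingness driven by x: {t_mar:.1f}")
print(f"t statistic, missingness driven by y: {t_mnar:.1f}")
```

In the first scenario the large statistic refutes MCAR. In the second scenario the data are MNAR by construction, yet the check on the observed covariate finds nothing, because the values that drive the missingness are exactly the ones that are not available.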

The most common approach to deal with missing data, often used by default by statistical software, is to exclude study subjects with incomplete data for any variable of interest from the analysis. This approach, called listwise deletion or complete case analysis, produces unbiased estimates if the data are MCAR, but it may result in bias when data are MAR or MNAR. The bias and loss of power are minimal when the proportion of missing data is trivial. However, for any nontrivial amount of missing data (say, >5%), in particular when the MCAR assumption is not plausible, listwise deletion is not recommended.
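
The loss of power from listwise deletion can be larger than the per-variable missingness suggests, because a subject is dropped when any one variable is missing. The sketch below assumes a hypothetical dataset with 10 variables, each missing independently and completely at random in 5% of subjects, and counts how many complete cases remain.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_variables, p_missing = 10_000, 10, 0.05

# Hypothetical dataset; each cell is set to missing with probability 5% (MCAR).
data = rng.normal(size=(n_subjects, n_variables))
data[rng.random(data.shape) < p_missing] = np.nan

# Listwise deletion keeps only rows without any missing value.
complete = ~np.isnan(data).any(axis=1)
kept_fraction = complete.mean()
expected_fraction = (1 - p_missing) ** n_variables

print(f"complete cases kept: {kept_fraction:.1%}")
print(f"expected under independence: {expected_fraction:.1%}")
```

Under these assumptions only about 60% of subjects remain (0.95 to the power of 10), even though no single variable is missing in more than about 5% of them.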

Traditional approaches to deal with missing data attempt to impute (fill in) the missing values by single estimates of the respective value, for example, by (1) using the mean value of the observed values; (2) estimating the value from a regression model; or (3) using observed values from a “similar” subject. However, even when data are MCAR, most single imputation methods provide biased estimates and incorrect standard errors, and are thus seldom appropriate.
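
One reason single imputation gives incorrect standard errors can be seen with mean imputation. The sketch below uses simulated values with an assumed mean of 50, standard deviation of 10, and 40% MCAR missingness. It fills every gap with the observed mean and compares the spread before and after.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

values = rng.normal(loc=50, scale=10, size=n)
missing = rng.random(n) < 0.4  # MCAR, 40% missing (assumed for illustration)

observed = values[~missing]

# Mean imputation: replace every missing value with the observed mean.
imputed = values.copy()
imputed[missing] = observed.mean()

sd_observed = observed.std(ddof=1)
sd_imputed = imputed.std(ddof=1)
print(f"SD of observed values:      {sd_observed:.2f}")
print(f"SD after mean imputation:   {sd_imputed:.2f}")
```

The mean is unchanged, but the imputed values add no variability, so the standard deviation shrinks. Any standard error computed from the filled-in dataset then treats the sample as larger and less variable than the information actually collected.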

In contrast to single imputation, multiple imputation creates multiple copies of the dataset, in which an algorithm imputes missing data based on the available data, with different estimates in each copy of the dataset (Figure). As conventionally recommended, Guglielminotti and Li¹ imputed 5 datasets. More recently, a larger number of imputations (between 20 and 100, depending on the amount of missing data) are typically recommended.

Figure. Schematic overview of the 3 steps involved in multiple imputation of missing study data. In step 1, multiple datasets are created (nos. 1, 2, 3…m), each with different estimates of the missing data. In step 2, each imputed dataset is analyzed. In step 3, the results obtained in step 2 are pooled to obtain an overall estimate.

The multiple imputation model must at a minimum contain the outcome variable(s) as well as all independent variables, including interactions, that are to be used in the subsequent analysis of the relationship between the variables. Auxiliary variables, which are not of direct interest for the analysis but are related to variables with missing data or to the probability that other data are missing, are also commonly included. After the imputation step, each of the multiple datasets is analyzed with the same techniques that would also have been used to analyze an originally complete dataset, such as simple regression techniques, mixed-effects models, or Cox proportional hazards models.²⁻⁴ In the final step, estimates from each analysis are pooled to generate an overall result. If properly performed, this provides an unbiased estimate under the assumption that data are MCAR or MAR.
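
The three steps (impute, analyze, pool) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by Guglielminotti and Li and not a replacement for dedicated software. The data are simulated, `y` is missing at random given a fully observed covariate `x`, the quantity of interest is the mean of `y`, and the helper `impute_once` is hypothetical. It refits a linear regression on a bootstrap sample of the complete cases so that each imputed dataset also reflects uncertainty in the imputation model, and it adds random residual noise to each imputed value. Pooling uses Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 2_000, 20  # subjects, number of imputed datasets

# Simulated data: the true mean of y is 2.0; y is MAR given x.
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)
missing = rng.random(n) < 1 / (1 + np.exp(-x))
y_obs = np.where(missing, np.nan, y)


def impute_once(x, y_obs, rng):
    """Step 1: create one completed dataset by stochastic regression
    imputation, refitting the model on a bootstrap sample of complete cases."""
    obs = np.flatnonzero(~np.isnan(y_obs))
    boot = rng.choice(obs, size=obs.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], deg=1)
    resid_sd = np.std(y_obs[boot] - (intercept + slope * x[boot]), ddof=2)
    filled = y_obs.copy()
    mis = np.isnan(y_obs)
    noise = rng.normal(scale=resid_sd, size=mis.sum())
    filled[mis] = intercept + slope * x[mis] + noise
    return filled


# Step 2: analyze each completed dataset (here: mean of y and its variance).
estimates, variances = [], []
for _ in range(m):
    completed = impute_once(x, y_obs, rng)
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / n)

# Step 3: pool with Rubin's rules.
q_bar = np.mean(estimates)               # pooled estimate
within = np.mean(variances)              # average within-imputation variance
between = np.var(estimates, ddof=1)      # between-imputation variance
total_var = within + (1 + 1 / m) * between
pooled_se = np.sqrt(total_var)

complete_case_mean = np.nanmean(y_obs)
print(f"complete case mean: {complete_case_mean:.2f}")
print(f"pooled MI estimate: {q_bar:.2f} (SE {pooled_se:.3f})")
```

In this simulation the complete case mean is too low because subjects with high `x`, and therefore high `y`, are missing more often, whereas the pooled estimate is close to the true value of 2.0. The pooled variance exceeds the average within-imputation variance because the between-imputation term carries the extra uncertainty caused by the missing values.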

Full Information Maximum Likelihood estimation is an alternative to multiple imputation. This technique is not yet widely accessible to researchers, but it is likely to gain popularity in the future. None of the techniques discussed here produce unbiased results when the data are MNAR, and the assumption of MAR is difficult if not impossible to verify in practice. Various statistical approaches, such as the selection model and the pattern mixture model, have been developed to analyze data that are MNAR. However, these also rely heavily on untestable assumptions and are not commonly used.

It is important to realize that there is no universally useful and accepted technique to handle missing data and that statistical methods, including multiple imputation, do not necessarily solve the missing data problem. The MAR assumption is likely often violated, such that bias is still quite possible. Clearly, the best approach to address missing data is to avoid them in the first place, and researchers should thus make strong efforts to collect a dataset that is as complete as possible. Little et al⁵ provide useful guidance on how clinical trials can be designed and conducted to limit missing data, and similar considerations should also be applied to observational research.

REFERENCES

1. Guglielminotti J, Li G. Exposure to general anesthesia for cesarean delivery and odds of severe postpartum depression requiring hospitalization. Anesth Analg. 2020;131:1421–1429.
2. Vetter TR, Schober P. Regression: the apple does not fall far from the tree. Anesth Analg. 2018;127:277–283.
3. Schober P, Vetter TR. Repeated measures designs and analysis of longitudinal data: if at first you do not succeed-try, try again. Anesth Analg. 2018;127:569–575.
4. Schober P, Vetter TR. Survival analysis and interpretation of time-to-event data: the tortoise and the hare. Anesth Analg. 2018;127:792–798.
5. Little RJ, D’Agostino R, Cohen ML, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012;367:1355–1360.
