Author manuscript; available in PMC: 2021 Dec 1.
Published in final edited form as: J Addict Med. 2020 Dec;14(6):454–456. doi: 10.1097/ADM.0000000000000644

Addressing missing data in substance use research: a review and data justice-based approach

Caroline King 1,2, Honora Englander 3,4, Kelsey C Priest 1, P Todd Korthuis 4,5, Sterling McPherson 6
PMCID: PMC7483132  NIHMSID: NIHMS1558114  PMID: 32142055

Abstract

Missing data in substance use disorder (SUD) research can pose a challenge as researchers attempt to publish reliable findings based on the limited available information. Tools to address missing data exist, but are underused and may not address all types of missingness. Missing data are more than a statistical problem: for underserved populations and people with SUDs who may have missing data for a myriad of reasons, missing data represents missing stories and information that can have real-world impacts on system and policy-level decision making. This paper reviews types of missing data and, through a data justice lens, asserts the importance of the increased use and development of statistical tools to handle missing data in SUD research.

Keywords: missing data, substance use disorder research, multiple imputation

Introduction

Missing data in substance use disorder (SUD) research can pose a challenge, as researchers attempt to publish reliable findings based on limited information. Biostatisticians have long advocated for robust statistical methods to address missing data (1, 2), yet evidence-based and expert-recommended tools are underused. In a review of survey-research articles published in three leading epidemiology journals, nearly 81% simply dropped participants with missing data from the analysis, and most did not examine the type of missing data present (3). Some tools for addressing missing data are easy to use, while the most challenging missing data problems do not yet have readily implementable solutions; even so, getting authors simply to report the amount of missing data remains a challenge.

Where data are missing, people are missing. The concept of data justice recognizes that statistical analyses are not purely technical, but instead inextricably contextualized within the social, political, economic, and cultural landscape underlying the data collection and subsequent analyses (4). Thus, missing data should be understood as more than a statistical problem, particularly for underserved and marginalized populations. Missing data may represent omitted information with potential real-world implications for systems and policy-related decisions. Conducting more just analyses in SUD research necessitates not only an enhanced understanding of missing data, but intentional reflection by researchers on the potential implications that missing data may have on the interpretation of study findings within the broader socio-political environment.

As the U.S. drug overdose epidemic continues, and additional research is conducted to support people with SUDs, researchers must learn what statistical and design tools are available and when to use them to address missing data. This paper reviews types of missing data and makes a case rooted in data justice for the use and development of applied statistical tools to handle missing data.

Missing Data Types

A three-category typology has become the standard for describing missing data types, each with its own assumptions about the underlying cause of the missing data: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR) (5). Unfortunately, it is not possible to test what kind of missing data exists in a data set (6). Differentiating types of missing data is best understood by examining the how and why of missing values. MCAR data are missing for reasons that have nothing to do with the study participants or data collection (e.g., surveys randomly lost in the mail); this type is uncommon. MAR data are missing largely, if not entirely, because of variables that were collected (e.g., missing mailed survey responses because participants are homeless, and homelessness information was collected as part of the study). MNAR data are missing because of variables that were not collected as part of the research project, or because of the value of the missing variable itself. For example, if survey participants do not respond to questions about drug use because their drug use has increased, the data are missing because of something the question itself sought to capture, making these data MNAR. Importantly, in our MAR example, if we had not collected information about homelessness, our missing data would not be MAR but rather MNAR. The terms used to describe missing data are useful in directing researchers to potential solutions, but are not immutable truths for any analysis; classification reflects how the analysis and data collection were conducted and which missing data mechanism is the most plausible.
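The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the cohort, the "use score" outcome, and every probability in it are invented assumptions, not values from any study. It applies each mechanism to the same hypothetical cohort and compares the mean among responders with the mean in the full cohort.

```python
import random
import statistics

def simulate_cohort(n=20000, seed=0):
    """Return (homeless, use_score) pairs for a hypothetical survey cohort."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        homeless = rng.random() < 0.3
        # Assumption for illustration: use scores run higher among homeless participants.
        use_score = rng.gauss(7.0 if homeless else 5.0, 1.0)
        cohort.append((homeless, use_score))
    return cohort

# Each mechanism maps a participant to an (assumed) probability of nonresponse.
MECHANISMS = {
    # MCAR: nonresponse is unrelated to the participant.
    "MCAR": lambda homeless, use_score: 0.3,
    # MAR: nonresponse depends on a variable that was collected.
    "MAR": lambda homeless, use_score: 0.6 if homeless else 0.1,
    # MNAR: nonresponse depends on the value that ends up missing.
    "MNAR": lambda homeless, use_score: 0.6 if use_score > 6.0 else 0.1,
}

def observed_mean(cohort, mechanism, seed=1):
    """Mean use score among participants who respond under the given mechanism."""
    rng = random.Random(seed)
    p_missing = MECHANISMS[mechanism]
    responses = [u for h, u in cohort if rng.random() >= p_missing(h, u)]
    return statistics.mean(responses)

cohort = simulate_cohort()
true_mean = statistics.mean(u for _, u in cohort)
results = {name: observed_mean(cohort, name) for name in MECHANISMS}

print(f"full-cohort mean: {true_mean:.2f}")
for name, value in results.items():
    print(f"{name}: mean among responders {value:.2f}")
```

Under these assumptions the MCAR mean should land close to the full-cohort mean, while the MAR and MNAR means fall below it. The difference between the latter two is that the MAR bias could be corrected using the recorded homelessness variable, whereas the MNAR bias cannot be corrected from the collected data alone.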

Addressing Missing Data

Ignoring missing data and using “complete case” analysis can be appropriate if data can be assumed to be MCAR (7), though this condition is rare. Complete case analysis requires participants to have complete, non-missing data on every variable in the analysis, including all covariates; participants missing even one piece of information are dropped. It is rarely appropriate to use complete-case analysis for either MAR or MNAR data, in part because it often substantially reduces statistical power but also because it assumes that the people who are “dropped” are similar to those who remain in the study. Under the conditions of MAR or MNAR data, this is rarely true.
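A minimal sketch of how complete-case analysis shrinks a sample follows. The records and variable names are hypothetical, and `None` is assumed to mark a missing value.

```python
def complete_cases(records, variables):
    """Keep only records with a non-missing value for every analysis variable."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]

# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "homeless": False, "use_days": 12},
    {"id": 2, "age": None, "homeless": True, "use_days": 20},
    {"id": 3, "age": 51, "homeless": True, "use_days": None},
    {"id": 4, "age": 28, "homeless": False, "use_days": 3},
    {"id": 5, "age": 45, "homeless": None, "use_days": 9},
]

analysis_vars = ["age", "homeless", "use_days"]
kept = complete_cases(records, analysis_vars)
dropped_ids = [r["id"] for r in records if r not in kept]

print(f"kept {len(kept)} of {len(records)} participants; dropped ids {dropped_ids}")
```

In this toy example a single missing covariate is enough to remove a participant, so three of the five records are lost even though each of them has most of its data.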

There are several approaches to replacing missing values. Single imputation is a method in which each missing value is replaced with a single value and the analysis is re-run. Single imputation covers a variety of methods, including Last Observation Carried Forward, in which a participant’s missing value is assumed to be identical to their most recent observed value. Another approach is to impute a measure of central tendency, in which the group’s mean, median or mode replaces all missing values in that group. Single imputation is commonly used in substance use research when participants are missing follow-up urine toxicology screen results, and the result is imputed as “positive” (8). These approaches can generate substantial bias, particularly when participants who are missing data differ from those who are not (9).
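The two single-imputation methods described above can be sketched in a few lines. The function names and the example values are illustrative assumptions; `None` stands for a missing value.

```python
import statistics

def locf(series):
    """Last Observation Carried Forward: fill each gap with the latest observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

def mean_impute(values):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    center = statistics.mean(observed)
    return [center if v is None else v for v in values]

# Hypothetical repeated measure for one participant (days of use per week).
weekly_use_days = [5, 4, None, None, 1, None]
print(locf(weekly_use_days))

# Hypothetical scores for one group at a single time point.
scores = [2.0, None, 4.0, None, 6.0]
imputed = mean_impute(scores)
print(imputed)

# Mean imputation pulls every filled value to the center, shrinking the spread.
print(statistics.stdev([2.0, 4.0, 6.0]), statistics.stdev(imputed))
```

Both functions treat the filled-in values as if they had been observed. The shrinking standard deviation in the last line is one visible symptom: the completed data look more certain than they are.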

Another challenge is that data are often missing for multiple reasons. Across a dataset, data could be MAR, MNAR and MCAR. If the data are likely MAR, multiple evidence-based methods of addressing missingness can be successfully applied. Multiple imputation generates estimates for missing values, based on similar patients with complete data, and adds random error to the process to account for the uncertainty associated with the missing data. Multiple imputation has been used successfully in missing data research in SUD trials, and has been superior to complete-case analysis, single-imputation techniques, and several other statistical approaches when handling MAR data. Maximum likelihood methods perform almost identically to multiple imputation under the same conditions of missingness; these methods do not generate new datasets and instead make use of all available data (10).
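To show the moving parts of multiple imputation, here is a deliberately simplified sketch using only the Python standard library. It is not the procedure of any particular package: it assumes one fully observed covariate, a linear imputation model, and a bootstrap refit in each round as a stand-in for drawing model parameters, and it pools the results with Rubin's rules. The data are simulated under an assumed MAR mechanism. Real analyses would normally rely on an established, validated implementation.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(sse / (len(xs) - 2))

def multiple_imputation_mean(xs, ys, m=20, seed=0):
    """Estimate the mean of y (None = missing) and its SE from m imputed datasets."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, within = [], []
    for _ in range(m):
        # Refit on a bootstrap resample so uncertainty in the model carries through.
        boot = [rng.choice(observed) for _ in observed]
        a, b, sd = fit_line(*zip(*boot))
        # Impute each missing y from cases with the same x, plus random error.
        completed = [y if y is not None else a + b * x + rng.gauss(0.0, sd)
                     for x, y in zip(xs, ys)]
        estimates.append(statistics.mean(completed))
        within.append(statistics.variance(completed) / len(completed))
    pooled = statistics.mean(estimates)
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    total_var = statistics.mean(within) + (1 + 1 / m) * statistics.variance(estimates)
    return pooled, math.sqrt(total_var)

# Hypothetical MAR data: homelessness (x) is always observed and drives nonresponse.
rng = random.Random(42)
xs, full_ys, ys = [], [], []
for _ in range(5000):
    homeless = 1 if rng.random() < 0.3 else 0
    y = rng.gauss(5.0 + 2.0 * homeless, 1.0)
    missing = rng.random() < (0.6 if homeless else 0.1)
    xs.append(homeless)
    full_ys.append(y)
    ys.append(None if missing else y)

full_mean = statistics.mean(full_ys)
cc_mean = statistics.mean(y for y in ys if y is not None)
mi_mean, mi_se = multiple_imputation_mean(xs, ys)

print(f"full data: {full_mean:.2f}")
print(f"complete case: {cc_mean:.2f}")
print(f"multiple imputation: {mi_mean:.2f} (SE {mi_se:.3f})")
```

Because the variable that drives nonresponse is in the imputation model, the pooled estimate should sit much closer to the full-data mean than the complete-case estimate does. The between-imputation term in the pooled variance is what distinguishes this from single imputation: disagreement among the imputed datasets is counted as uncertainty rather than hidden.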

We recognize the ease of using complete-case analysis to handle any type of missing data. Anecdotally, researchers may state that they prefer not to interfere or tamper with their data, which could make it feel less “true.” However, the missing data itself has already made it more difficult to observe the “truth” sought in the original analysis. Studies show multiple imputation is superior to complete-case analysis in estimating what the results could have been if data were not missing (11). Further, a primary goal of any analysis is to use data to represent phenomena we observe. In doing so, we attempt to minimize error and bias in analyses, thus providing a more accurate representation of the phenomena. Multiple imputation and maximum likelihood are superior to single imputation and complete-case analysis in reducing bias, and thus these tools are preferred when data are MAR (12). Conveniently, multiple imputation packages, tools, and tutorials exist for many frequently used statistical programs (e.g., SAS, Stata, SPSS, R).

Of all missing data types, MNAR data can be the most challenging to address, and yet are likely common in SUD research. Limited tools exist to address this kind of missingness appropriately, and these tools are rarely covered in introductory and clinical-research focused statistics courses. Additionally, not all tools are easy to implement in the aforementioned statistical packages and often put significant onus on the user to specify the model correctly. For example, MNAR models come with their own assumptions that can be challenging to demonstrate as being met prior to conducting such an analysis. However, MNAR models continue to advance both in accessibility and in ease of assumption demonstration and should be considered as additional tools for researchers facing missing data that may be MNAR.

Researchers conducting clinical trials, observational studies, or survey-based studies all face missing data challenges. Clinical trial protocols require researchers to have a plan to address missing data a priori; in contrast, observational researchers may publish research without demonstrating that their missing data plan was constructed in advance. Often in clinical trials, researchers will pick a main method of handling missing data (such as complete case analysis) for their primary analysis, and conduct sensitivity analyses using single imputation and multiple imputation methods to see if their results differ from those obtained using the primary tool. In general, prespecifying analysis plans with a primary analysis tool, conducting sensitivity analyses using other missing data approaches, and publishing those results if they differ from the primary analysis, can improve the rigor and reliability of published addiction research, and clarify how assumptions made about the data affected analysis results.
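A sensitivity analysis of this kind can be laid out as a small table of estimates. The sketch below is a toy illustration: the urine screen results are hypothetical, the choice of complete-case analysis as the primary method and the 0.05 tolerance are assumptions, and only single-imputation alternatives are shown to keep it short.

```python
def positive_rate(results, missing_as=None):
    """Share of positive screens; missing_as=None drops missing results (complete case)."""
    if missing_as is None:
        values = [r for r in results if r is not None]
    else:
        values = [missing_as if r is None else r for r in results]
    return sum(values) / len(values)

def sensitivity_table(results, tolerance=0.05):
    """Compare the prespecified primary estimate with alternative missing data methods."""
    primary = positive_rate(results)  # assumed primary analysis: complete case
    alternatives = {
        "missing imputed as positive": positive_rate(results, missing_as=True),
        "missing imputed as negative": positive_rate(results, missing_as=False),
    }
    table = {"complete case (primary)": (primary, False)}
    for name, estimate in alternatives.items():
        # Flag any method whose estimate departs from the primary by more than tolerance.
        table[name] = (estimate, abs(estimate - primary) > tolerance)
    return table

# Hypothetical follow-up screens: True = positive, False = negative, None = missing.
screens = [True] * 20 + [False] * 40 + [None] * 40
table = sensitivity_table(screens)

for name, (estimate, differs) in table.items():
    flag = "  <- differs from primary" if differs else ""
    print(f"{name}: {estimate:.2f}{flag}")
```

With 40% of screens missing in this example, the estimate moves substantially depending on the assumption made, which is exactly the situation in which reporting the alternative results alongside the primary one matters.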

The case for data justice and development of applied statistical tools for missing data

Generating valid SUD research relies on addressing missing data challenges, even in simple data collection schemes like baseline and post-intervention surveys. Researchers must interpret and address missing data in ways that acknowledge the intricate relationships to the social, political, economic and cultural landscape that permeates our analyses (4). These seemingly external domains are where research implications will be debated and acted upon; thus, creating reliable analyses that reflect observed phenomena most closely can help prioritize, and more justly inform, decision making post hoc.

The framework of data justice builds upon the tenet of justice identified by Beauchamp & Childress (1979) in their moral analytical framework in medical ethics (13). Here, justice names an “obligation of fairness” in how risks and benefits are distributed (13). Because the validity of research illustrating the needs of people with SUDs may differ based on how missing data are addressed, a more just research approach that attempts to reduce error and bias due to missing data is warranted.

When researchers ignore missing data, they exclude patient experiences that are critically relevant to the research question. They may be ignoring the most marginalized or medically vulnerable of those sampled. Ignoring this data removes the responsibility of those interpreting research results post-analysis to serve this group, and may further reinforce disparities (14).

Further, though MNAR data are challenging to address statistically, researchers should still pursue MNAR methods to realize more just data analyses. While it is challenging and potentially expensive to enact MNAR data methods, more just research accepts these costs, and funding sources should be explored to improve the ease of accessing statistical experts, even in unfunded work. One common approach to handling MNAR data is to use “worst case” single imputation. “Worst case” is designed to assume that the patient has done the “worst” thing researchers would expect them to do; above, we mentioned the common method of imputing positive urine drug screens for all patients who miss follow-up appointments. This perpetuates further stigmatization of an already marginalized population and is neither just nor statistically sound. Better MNAR models should be explored by researchers facing missing data that may be MNAR.
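One way to see why “worst case” imputation is an extreme assumption is to place it on a continuum. The sketch below uses a delta-adjustment style sensitivity analysis, a technique introduced here for illustration rather than one described in this paper: it assumes the positive rate among participants with missing screens equals the observed rate plus a shift, delta, and reports the expected overall rate for several values of delta. The screen results and the delta values are hypothetical.

```python
def adjusted_positive_rate(results, delta):
    """Expected overall positive rate if missing screens are positive at (observed rate + delta)."""
    observed = [r for r in results if r is not None]
    n_missing = len(results) - len(observed)
    observed_rate = sum(observed) / len(observed)
    # Keep the assumed rate among missing participants inside [0, 1].
    assumed_missing_rate = min(1.0, max(0.0, observed_rate + delta))
    expected_positives = sum(observed) + assumed_missing_rate * n_missing
    return expected_positives / len(results)

# Hypothetical follow-up screens: True = positive, False = negative, None = missing.
screens = [True] * 20 + [False] * 40 + [None] * 40

for delta in (0.0, 0.1, 0.2, 0.4, 1.0):
    rate = adjusted_positive_rate(screens, delta)
    print(f"delta = {delta:.1f}: overall positive rate {rate:.2f}")
```

A delta of zero treats participants with missing screens like those with observed screens, and a delta large enough to push the assumed rate to 1.0 reproduces “worst case” imputation. Reporting how conclusions change across the range, rather than fixing the most stigmatizing endpoint, makes the MNAR assumption explicit and open to scrutiny.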

Future clinical scientists, implementation scientists, and health services evaluators should typify their missing data, and address missingness, either statistically or reflectively in their study limitations. Much research has explored upstream interventions to prevent missing data (15); however, it is rare for a study to have no missing data, despite best efforts to achieve completeness. Further statistical and methodologic research is needed to improve the use of existing statistical tools, and to create new tools to support less biased analyses. Realizing more just and ethical data frameworks will mean using existing ways, and identifying novel ways, to address missing data among underserved populations in study design and analyses.

Financial Disclosures

CK and TK were supported by grants from the National Institutes of Health, National Institute on Drug Abuse (UG1DA015815 / R01DA037441). KCP was supported by a grant from the National Institute on Drug Abuse (F30 DA044700UG). SMM is supported by NIH grants P20MD006871, UG1DA013714, R01EY027476, N44DA162246, R01AA022070, R01AA020248, P60AA026112, R41AA026793, N44DA171210, R01AG042467, VA grant I01HX002518 and CDC grant 75D301-19-Q-69877.

This publication was also made possible with support from the Oregon Clinical and Translational Research Institute (OCTRI), grant number UL1TR002369 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Declarations: Dr. McPherson has received research funding from Ringful Health, Managed Health Connections, Providence St. Joseph Health, Consistent Care, the Bristol-Myers Squibb Foundation, and the Orthopedic Specialty Institute. He has also received consultation fees from the US Attorney’s Office of the Eastern District of Washington. This funding is in no way related to the data reported here. Dr. Korthuis serves as principal investigator for NIH-funded studies that receive donated study medications from Alkermes and Indivior.

References

  • 1. Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–6.
  • 2. Dziura JD, Post LA, Zhao Q, Fu Z, Peduzzi P. Strategies for dealing with missing data in clinical trials: from design to analysis. Yale J Biol Med. 2013;86(3):343–58.
  • 3. Eekhout I, de Boer RM, Twisk J, de Vet H, Heymans MW. Missing data: a systematic review of how they are reported and handled. Epidemiology. 2012;23(5):729–32.
  • 4. Dencik L, Hintz A, Redden J, Trere E. Exploring Data Justice: Conceptions, Applications and Directions. Information, Communication & Society. 2019;22(7):873–81.
  • 5. Mack C, Su Z, Westreich D. Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide, Third Edition. AHRQ Methods for Effective Health Care. Rockville (MD); 2018.
  • 6. Mack C, Su Z, Westreich D. Managing missing data in patient registries: addendum to registries for evaluating patient outcomes: a user’s guide, third edition. Agency for Healthcare Research and Quality. 2018.
  • 7. Enders C. Applied Missing Data Analysis. New York: Guilford Press; 2010.
  • 8. McPherson S, Barbosa-Leiker C, Burns GL, Howell D, Roll J. Missing data in substance abuse treatment research: current methods and modern approaches. Exp Clin Psychopharmacol. 2012;20(3):243–50.
  • 9. Zhang Z. Missing data imputation: focusing on single imputation. Ann Transl Med. 2016;4(1):9.
  • 10. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Methods. 2002;7(2):147–77.
  • 11. van der Heijden GJ, Donders AR, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59(10):1102–9.
  • 12. Harel O, Mitchell EM, Perkins NJ, Cole SR, Tchetgen Tchetgen EJ, Sun B, et al. Multiple Imputation for Incomplete Data in Epidemiologic Studies. Am J Epidemiol. 2018;187(3):576–84.
  • 13. Beauchamp TL, Childress JF. Principles of biomedical ethics. 1st ed. New York: Oxford University Press; 1979.
  • 14. Arndt S. Stereotyping and the treatment of missing data for drug and alcohol clinical trials. Subst Abuse Treat Prev Policy. 2009;4:2.
  • 15. Little RJ, D’Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012;367(14):1355–60.
