smdi: an R package to perform structural missing data investigations on partially observed confounders in real-world evidence studies

Janick Weberpals; Sudha R Raman; Pamela A Shaw; Hana Lee; Bradley G Hammill; Sengwee Toh; John G Connolly; Kimberly J Dandreo; Fang Tian; Wei Liu; Jie Li; José J Hernández-Muñoz; Robert J Glynn; Rishi J Desai

JAMIA Open. 2024 Jan 31;7(1):ooae008. doi: 10.1093/jamiaopen/ooae008

Janick Weberpals ¹,✉, Sudha R Raman ², Pamela A Shaw ³, Hana Lee ⁴, Bradley G Hammill ⁵, Sengwee Toh ⁶, John G Connolly ⁷, Kimberly J Dandreo ⁸, Fang Tian ⁹, Wei Liu ¹⁰, Jie Li ¹¹, José J Hernández-Muñoz ¹², Robert J Glynn ¹³, Rishi J Desai ¹⁴

¹ Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02120, United States

² Department of Population Health Sciences, Duke University School of Medicine, Durham, NC 27701, United States

³ Biostatistics Division, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States

⁴ Office of Biostatistics, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States

⁵ Department of Population Health Sciences, Duke University School of Medicine, Durham, NC 27701, United States

⁶ Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA 02215, United States

⁷ Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA 02215, United States

⁸ Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA 02215, United States

⁹ Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States

¹⁰ Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States

¹¹ Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States

¹² Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States

¹³ Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02120, United States

¹⁴ Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02120, United States

✉ Corresponding author: Janick Weberpals, RPh, PhD, Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, 1620 Tremont Street, Suite 3030-R, Boston, MA 02120 (jweberpals@bwh.harvard.edu)

PMCID: PMC10833461 PMID: 38304248

Abstract

Objectives

Partially observed confounder data pose a major challenge in statistical analyses aimed to inform causal inference using electronic health records (EHRs). While analytic approaches such as imputation are available, assumptions on underlying missingness patterns and mechanisms must be verified. We aimed to develop a toolkit to streamline missing data diagnostics to guide choice of analytic approaches based on meeting necessary assumptions.

Materials and methods

We developed the smdi (structural missing data investigations) R package based on results of a previous simulation study which considered structural assumptions of common missing data mechanisms in EHR.

Results

smdi enables users to run principled missing data investigations on partially observed confounders and implements functions to visualize, describe, and infer potential missingness patterns and mechanisms based on observed data.

Conclusions

The smdi R package is freely available on CRAN and can provide valuable insights into underlying missingness patterns and mechanisms and thereby help improve the robustness of real-world evidence studies.

Keywords: missing data, confounder, EHR, R, software, real-world evidence

Background and significance

Administrative health insurance claims databases and electronic health records (EHRs) are important data sources to generate real-world evidence (RWE) when they are found fit-for-purpose for the study question at hand. While claims databases have traditionally been the backbone of most pharmacoepidemiologic studies, a notable drawback is their inability to capture important clinical prognostic factors such as vital signs and laboratory values. To overcome this limitation, substantial initiatives are underway, for instance in the FDA Sentinel Initiative,¹ to link claims databases and EHRs to generate RWE and complement data from randomized controlled trials (RCTs).¹,² Because they capture clinical details, EHRs can significantly improve the ability to mitigate imbalances in prognostic factors between treatment groups.³ However, prognostic factors derived from EHRs are often only partially observed, which poses a challenge to the statistical analysis and can bias treatment effect estimates if not handled appropriately.⁴⁻⁶

In order to inform decisions about the most appropriate analytic approach, it is useful to investigate the potential patterns and mechanisms that underlie the partially observed confounder (POC) data (see definitions box).⁷⁻⁹ Existing guidance frameworks have suggested various routine diagnostics to investigate missing data patterns and mechanisms. These methods comprise standard procedures such as comparing baseline characteristics and outcomes between patients with and without the POC,¹⁰⁻¹⁴ checking the ability to predict missingness,¹¹ and assessing if causal relationships between variables and their missingness are recoverable based on available data¹⁵ using directed acyclic graphs¹⁶,¹⁷ or M-graphs.¹⁸ However, these methods have so far only been described and tested in isolation from each other and no unified principled approach exists. In addition, the practical implementation of such diagnostics is time-consuming and consequently infrequently performed.¹⁹⁻²¹

Definitions: Basic missing data taxonomies.

Patterns (adapted from Van Buuren⁷)
• Monotone pattern: If Y_j is the jth column in a dataset Y, a missing data pattern is said to be monotone if the variables Y_j can be ordered such that if Y_j is missing then all variables Y_k with k > j are also missing. This can occur, for example, in longitudinal studies with drop-out.
• Non-monotone pattern: If the pattern is not monotone, it is called non-monotone or general.
Mechanisms¹¹
• Missing completely at random (MCAR): The missingness does not depend on any other observed or unobserved covariate(s).
• Missing at random (MAR): The missingness depends on, and can be explained by, other observed covariates.
• Missing not at random (MNAR): The missingness depends on unobserved covariate(s). For example, the missingness may be explained by other covariate(s) which is/are not observed in the underlying dataset (MNAR_unmeasured). The missingness can also just depend on the actual value of the partially observed covariate itself (MNAR_value).
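
To make the three mechanisms in the box above concrete, the following base R sketch simulates one covariate and sets it to missing under MCAR, MAR, and MNAR_value. All variable names, the sample size, and the coefficients are illustrative assumptions; none of them are taken from the smdi package or its synthetic dataset.

```r
# Illustrative simulation of missingness mechanisms
# (all names and numbers are assumptions made for this sketch)
set.seed(42)
n   <- 5000
age <- rnorm(n, mean = 60, sd = 10)          # fully observed covariate
x   <- rnorm(n, mean = 0.03 * age, sd = 1)   # covariate that will be partially observed

# MCAR: every patient has the same 30% probability of a missing value
miss_mcar <- rbinom(n, 1, 0.3)

# MAR: the missingness probability depends only on the observed covariate age
miss_mar <- rbinom(n, 1, plogis(-1 + 0.08 * (age - 60)))

# MNAR_value: the missingness probability depends on the value of x itself
miss_mnar <- rbinom(n, 1, plogis(-1 + 1.0 * (x - mean(x))))

sim <- data.frame(
  age,
  x_mcar = ifelse(miss_mcar == 1, NA, x),
  x_mar  = ifelse(miss_mar == 1, NA, x),
  x_mnar = ifelse(miss_mnar == 1, NA, x)
)

# Proportion of missing values per simulated variable
prop_missing <- colMeans(is.na(sim))
print(round(prop_missing, 2))
```

In a real dataset only the resulting NA values are visible, not the rule that generated them, which is why the mechanism has to be inferred indirectly.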

Considering these limitations, we have recently developed and evaluated a principled approach combining multiple missing data diagnostics²² using a database linkage from the Mass General Brigham Research Patient Data Registry EHR in Boston²³ linked with Medicare fee-for-service claims data.²⁴ The results of this large-scale study revealed that the combination of these diagnostics effectively identified underlying mechanisms and provided helpful guidance for the choice of appropriate analytic methods to handle POC data.

Objective

To streamline the implementation of routine missing data diagnostics for POC data in RWE studies, we developed the smdi (structural missing data investigations) R package.²⁵

Methods

The smdi R package was written in the R language (version 4.2.1). The package is available on the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/smdi) and GitLab (https://gitlab-scm.partners.org/janickweberpals/smdi) and can be installed via install.packages("smdi"). To ensure quality, we implemented comprehensive unit tests with a coverage of 95.81% and established automated R CMD checks²⁶ via continuous integration and deployment. Additional resources such as documentation and vignettes are provided on the package website: https://janickweberpals.gitlab-pages.partners.org/smdi.

Results

Main package functions

Figure 1 illustrates the recommended workflow to systematically approach diagnostics on POCs.

The workflow is generally categorized into descriptives, pattern diagnostics, and inferential diagnostics on potentially underlying missingness mechanisms. In this section, we cover the principles behind the main package functions, a selection of parameters users can specify, the returned results and how these can be interpreted. Examples are illustrated using a synthetic dataset that is part of the package and simulates an oncology cohort with a binary exposure, a time-to-event outcome and several baseline confounders and prognostic covariates including 3 POCs (EGFR and PD-L1 [biomarkers] and ECOG [performance score]) following a MAR, MNAR, and MCAR mechanism, respectively (more details: https://janickweberpals.gitlab-pages.partners.org/smdi/articles/a_data_generation.html).

For all functions in the smdi package, a dataframe is expected (data parameter) as input with a format where one row represents one unique patient and the columns represent relevant variables, ie, exposure, outcome, fully observed covariates, and the POCs. Any non-informative columns, for example, patient identifiers, should be dropped from the dataframe before calling the functions. Throughout all functions, users have the option to specify a vector with the column name(s) of the POC(s) that should be investigated (covar parameter). If nothing is specified, all functions automatically consider any variable in the dataframe that exhibits at least one missing value.
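
The expected input layout can be sketched in base R as follows. The toy cohort and its column names are hypothetical. The last step only mimics the default behavior described above, in which every column with at least one missing value is treated as a POC; it is not the package's internal code.

```r
# Toy cohort with one row per patient; column names are hypothetical
cohort <- data.frame(
  patient_id = 1:6,
  exposure   = c(0, 1, 0, 1, 1, 0),
  age        = c(54, 61, 47, 70, 66, 59),
  lab_value  = c(1.2, NA, 0.9, NA, 1.5, 1.1),
  biomarker  = c("pos", "neg", NA, "pos", "neg", "neg")
)

# Drop non-informative columns such as patient identifiers
analytic <- cohort[, setdiff(names(cohort), "patient_id")]

# Columns with at least one missing value are the candidate POCs
poc_candidates <- names(analytic)[colSums(is.na(analytic)) > 0]
print(poc_candidates)
```

A character vector such as poc_candidates, or a subset of it, is the kind of value the covar parameter is described as accepting.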

Details on missingness assumptions, key statistical principles, and further information on all functions can be found in the Supplementary Methods and in the documentation of each respective function which can be accessed in R by preceding the function name with a question mark, eg:

? smdi_diagnose()

Descriptives and pattern diagnostics

As a first step to explore the missingness in new datasets, the smdi package provides a few basic functions to describe and summarize missingness across all covariates. The smdi_summarize() function returns the amount and proportion of missing observations, which can also be stratified by a grouping variable. The smdi_vis() function returns a corresponding bar chart plot (example Figure 2A).
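
For intuition, the kind of summary described here can be reproduced in a few lines of base R. This is a minimal sketch on simulated data: the helper summarize_missing() and all column names are hypothetical and do not reflect how smdi_summarize() is implemented.

```r
# Simulated data with two partially observed markers (names are hypothetical)
set.seed(1)
n <- 200
df <- data.frame(
  exposure = rbinom(n, 1, 0.5),
  age      = rnorm(n, 60, 10),
  marker_a = rnorm(n),
  marker_b = rnorm(n)
)
df$marker_a[sample(n, 40)] <- NA
df$marker_b[sample(n, 10)] <- NA

# Number and percentage of missing values for each column that has any
summarize_missing <- function(data) {
  n_miss <- colSums(is.na(data))
  n_miss <- n_miss[n_miss > 0]
  data.frame(
    covariate = names(n_miss),
    n_miss    = as.integer(n_miss),
    pct_miss  = round(100 * as.numeric(n_miss) / nrow(data), 1),
    row.names = NULL
  )
}

overall <- summarize_missing(df)
print(overall)

# The same summary stratified by a grouping variable (here: exposure)
by_exposure <- lapply(split(df, df$exposure), summarize_missing)
print(by_exposure)
```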

Figure 2. — Exemplary visual outputs of the (A) smdi_vis(), (B) gg_miss_upset(), (C) smdi_asmd(), and (D) smdi_rf() functions, respectively. Sub-figure (A) displays the proportion of missing observations for each partially observed covariate stratified by exposure. The upset plot in sub-figure (B) demonstrates how a monotone missingness pattern between partially observed covariates can be visually inspected using a set visualization technique.²⁸ Sub-figure (C) illustrates absolute standardized mean differences (ASMDs) in patient characteristics between patients with and without a value observed for the PD-L1 (pdl1_num) biomarker as a measure of imbalance. Sub-figure (D) demonstrates the variable importance of fully observed covariates for predicting missingness in the partially observed ECOG performance score variable (ecog_cat).

To visually inspect potential missing data patterns, we re-exported the gg_miss_upset() function of the naniar package.²⁷ This function uses a set visualization technique to visually infer potential (non-)monotone patterns based on the number of intersecting missing observations across all POCs.²⁸ For example, a monotone pattern could be visually evident if, for a set of 2 or more lab variables which are typically measured together as part of a lab panel (eg, renal or liver panel), the missingness of one lab is indicative of the missingness in the other lab and hence all or the majority of combinations of cells are missing together (example Figure 2B). The md.pattern() function, a re-export of the mice package,²⁹ provides a similar functionality and returns a matrix displaying the frequency of each observed missing data pattern.
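
The monotone definition from the definitions box can also be checked programmatically. The sketch below is written in base R and is independent of gg_miss_upset() and md.pattern(); the helper is_monotone() and the toy lab variables are hypothetical. It orders the variables by their number of missing values and verifies that, within each row, missingness never switches back to observed.

```r
# Check whether a missing data pattern is monotone (hypothetical helper)
is_monotone <- function(data) {
  m <- is.na(data) * 1                        # 1 = missing, 0 = observed
  m <- m[, order(colSums(m)), drop = FALSE]   # fewest missing values first
  # monotone if no row goes from missing back to observed
  all(apply(m, 1, function(r) all(diff(r) >= 0)))
}

# lab2 is missing whenever lab1 is missing: monotone
panel <- data.frame(
  lab1 = c(1.0, 2.1, NA, NA, 0.7),
  lab2 = c(3.2, NA, NA, NA, 1.9)
)

# each lab is sometimes missing while the other is observed: non-monotone
general <- data.frame(
  lab1 = c(1.0, NA, 0.4, NA),
  lab2 = c(NA, 2.2, 1.3, NA)
)

# Frequency of each observed pattern (digits follow the column order lab1, lab2)
pattern_counts <- table(apply(is.na(panel) * 1, 1, paste, collapse = ""))
print(pattern_counts)

print(is_monotone(panel))
print(is_monotone(general))
```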

Inferential three group diagnostics

The core functions to infer potentially underlying missingness mechanisms are categorized into 3 group diagnostics based on their general analytic properties (Table 1).

Table 1.

Overview of the main functions in smdi to characterize potential underlying missingness mechanisms.

Function	Description	Generic S3 print() output	Object output	Interpretation
Group 1 Diagnostics—Comparing the distribution of observed covariates between patients with versus without a value for the partially observed covariate
smdi_asmd()	Computes the absolute standardized mean differences (ASMDs) of patient characteristics between patients with versus without a value for the partially observed covariate(s)	Aggregated summary table of the average/median and minimum/maximum ASMD range for all specified partially observed covariates	- Detailed Table 1 illustrating distributions and individual ASMD for each compared patient characteristic - ggplot2 graph illustrating the individual ASMD for each compared patient characteristic in descending order - Aggregate summary of the average/median and minimum/maximum ASMD range for the selected partially observed covariate	- ASMD <0.1: no imbalances in observed patient characteristics; missingness may be likely completely at random or not at random (∼MCAR, ∼MNAR) - ASMD >0.1: imbalances in observed patient characteristics; missingness may be likely at random (∼MAR)
smdi_hotelling()	Computes Hotelling’s multivariate t-test for each partially observed covariate, examining patient differences conditional on having an observed covariate value or not.	Aggregated summary table of the Hotelling’s test P-values for all specified partially observed covariates	Detailed Hotelling test statistics	High test statistics and low P-values indicate differences in baseline covariate distributions and null hypothesis would be rejected (∼MAR)
smdi_little()	Computes a single global chi-square test statistic across all partially observed covariates with a null hypothesis that the data are missing completely at random.	Detailed Little’s test statistics	Detailed Little’s test statistics	High test statistics and low P-values indicate differences in baseline covariate distributions and null hypothesis would be rejected (∼MAR)
Group 2 Diagnostics—Assessing the ability to predict missingness based on observed covariates
smdi_rf()	Trains and fits a random forest classification model to assess the ability to predict missingness indicator for the partially observed covariate(s).	Aggregated summary table with the area under the receiver operating characteristic curve (AUC) value for each partially observed covariate	- Individual AUC value - ggplot2 figure illustrating the variable importance for the prediction made expressed by the mean decrease in accuracy per predictor - Estimated out-of-bag (OOB) error	- AUC values ∼ 0.5 indicate completely random or not at random prediction (∼MCAR, ∼MNAR) - Values meaningfully above 0.5 indicate stronger relationships between covariates and missingness (∼MAR)
Group 3 Diagnostics—Evaluates whether missingness of a covariate is associated with the outcome
smdi_outcome()	Fits outcome model (linear, glm, or proportional hazards depending on the outcome under study) with the missingness indicator of the partially observed covariate(s). The estimates are computed both as a univariate model (just considering the missingness indicator) and an adjusted model with all covariates in the dataset.	Aggregated summary table with the univariate and adjusted estimate for each partially observed covariate	Aggregated summary table with the univariate and adjusted estimate for each partially observed covariate	- No association in either univariate or adjusted model and no meaningful difference in the log HR after full adjustment (∼MCAR). - Association in univariate but not fully adjusted model (∼MAR). - Meaningful difference in the log HR also after full adjustment (∼MNAR).

Group 1 diagnostics

The aim of the smdi_asmd(), smdi_hotelling(), and smdi_little() functions is to explore dissimilarities in patient characteristics between those with and without observed values for the POC. According to Rubin’s framework,⁸ when missingness is at random (MAR), it can be explained by observed covariates. Consequently, significant differences in patient characteristics would be expected under a MAR mechanism between strata of patients with and without the POC. If the missingness depends only on unobserved factors (missing not at random [MNAR]) or does not depend on either observed or unobserved covariates (missing completely at random [MCAR]), differences should not be observable.

To quantify such differences, the smdi_asmd() function computes absolute standardized mean differences (ASMDs) of observed patient characteristics.³⁰⁻³² The function returns an asmd object which displays an aggregated summary of the average or median ASMD along with a corresponding range of minimum and maximum ASMDs for each POC. The object also contains a detailed “Table 1” and a plot³³ for each POC, displaying the distributions of observed covariates and the resulting ASMDs between patients with and without an observed value for the POC (example Figure 2C).
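
As a worked illustration of the quantity involved, the sketch below computes ASMDs by hand in base R for a simulated biomarker whose missingness depends on age. It uses the common pooled-variance definition and treats a binary covariate as numeric, which is an assumption of this sketch; smdi_asmd() may handle categorical variables differently. All names and coefficients are hypothetical.

```r
set.seed(123)
n      <- 2000
age    <- rnorm(n, 60, 10)
female <- rbinom(n, 1, 0.5)

# MAR-like missingness: older patients are more likely to lack the biomarker
biomarker <- rnorm(n)
biomarker[rbinom(n, 1, plogis(-1 + 0.1 * (age - 60))) == 1] <- NA

# Absolute standardized mean difference between patients with and
# without an observed value, using the pooled standard deviation
asmd <- function(x, missing) {
  m1 <- mean(x[missing]); m0 <- mean(x[!missing])
  v1 <- var(x[missing]);  v0 <- var(x[!missing])
  abs(m1 - m0) / sqrt((v1 + v0) / 2)
}

miss  <- is.na(biomarker)
asmds <- c(age = asmd(age, miss), female = asmd(female, miss))
print(round(asmds, 3))

# Aggregated summary analogous to the median and min/max range described above
summary_asmd <- c(median = median(asmds), min = min(asmds), max = max(asmds))
print(round(summary_asmd, 3))
```

Because age drives the missingness in this simulation, its ASMD is expected to exceed the 0.1 threshold listed in Table 1, whereas the covariate unrelated to missingness should stay close to zero.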

The smdi_hotelling() and smdi_little() functions complement the smdi_asmd() function by examining differences in patient characteristics as a formal statistical hypothesis test. Hotelling’s test¹²,³⁴ formalizes this as a multivariate t-test, which means that smdi_hotelling() returns a test statistic and P-value for each POC. In contrast, smdi_little()¹³,²⁷ computes a single global chi-square test statistic and P-value across all POCs with the null hypothesis that the data are (globally) MCAR.
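
The idea behind the multivariate t-test can be sketched with the textbook two-sample Hotelling T² statistic in base R. The helper hotelling_t2() and the simulated covariates are hypothetical, and smdi_hotelling() may rely on a different implementation; the sketch only shows what is being tested.

```r
# Textbook two-sample Hotelling T^2 test (hypothetical helper)
hotelling_t2 <- function(X, group) {
  X  <- as.matrix(X)
  X1 <- X[group, , drop = FALSE]    # patients with a missing POC value
  X0 <- X[!group, , drop = FALSE]   # patients with an observed POC value
  n1 <- nrow(X1); n0 <- nrow(X0); p <- ncol(X)
  d  <- colMeans(X1) - colMeans(X0)
  # pooled covariance matrix
  S  <- ((n1 - 1) * cov(X1) + (n0 - 1) * cov(X0)) / (n1 + n0 - 2)
  t2 <- (n1 * n0 / (n1 + n0)) * as.numeric(t(d) %*% solve(S) %*% d)
  # transform T^2 to an F statistic with p and n1 + n0 - p - 1 degrees of freedom
  f  <- t2 * (n1 + n0 - p - 1) / (p * (n1 + n0 - 2))
  list(t2 = t2, f = f,
       p_value = pf(f, p, n1 + n0 - p - 1, lower.tail = FALSE))
}

set.seed(99)
n <- 1500
X <- data.frame(age = rnorm(n, 60, 10), bmi = rnorm(n, 27, 4))

# Missingness indicators: one depends on age (MAR), one is purely random (MCAR)
miss_mar  <- rbinom(n, 1, plogis(-1 + 0.08 * (X$age - 60))) == 1
miss_mcar <- rbinom(n, 1, 0.3) == 1

res_mar  <- hotelling_t2(X, miss_mar)
res_mcar <- hotelling_t2(X, miss_mcar)
print(c(p_mar = res_mar$p_value, p_mcar = res_mcar$p_value))
```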

Applying group 1 diagnostics to the synthetic example dataset would indicate that the ECOG POC (median ASMD 0.03, min-max 0.00-0.07, P-value .78) does not show any differences in observed patient characteristics between patients with and without an observed value for ECOG, which would give evidence for an MCAR mechanism (Figure 3 bottom, Group 1 diagnostics—orange boxes). Conversely, in the case of EGFR and PD-L1, the group 1 diagnostics display larger differences and hence may rather underlie a MAR or MNAR mechanism (Figures 2C and 3).

Figure 3. — Example of how smdi diagnostics can be applied to compute and compare diagnostic parameters of partially observed covariates to expected parameters of common missingness mechanisms based on a former large-scale simulation study.²²

Group 2 diagnostics

Group 2 diagnostics assess the ability to predict missingness based on observed covariates via the smdi_rf() function. This function trains and fits a random forest classification model¹¹,³⁵ to predict the missingness indicator of each POC, using exposure, outcome, follow-up time, covariates, and the missingness indicators of the other POCs as predictors. If the resulting area under the receiver operating characteristic curve (AUC) is meaningfully >0.5, this would give some evidence for MAR and against MCAR being the underlying missingness mechanism. Values close to 0.5 would indicate that the model is unable to discriminate missing versus observed values based on available data; this could be due to a mechanism that is close to MCAR or one where the missingness is associated with unmeasured data (MNAR).

The function returns an object of class rf which generically prints an overview of the AUC value of each POC. The AUC is based on the prediction made in the respective test dataset, which is sampled as part of the function and for which the train-test split ratio, number of trees, and CPU cores to parallelize over can be specified (train_test_ratio, ntree, and n_cores parameters, respectively).³⁵,³⁶ The rf object further returns a graph for each POC displaying the relative importance of the predictors in the training dataset expressed as the mean decrease in accuracy (example Figure 2D). This metric can be valuable for interpreting and identifying strong predictors of missingness. It quantifies how much the accuracy of the prediction (ie, the ratio of correct predictions to the total number of predictions made) would decrease if we excluded a specific predictor from the model. In case of inflated AUC values (>0.9), the function prompts a message to the user reporting the most important predictor. If in such a scenario missingness in another POC is identified as a perfect predictor, a monotone missing data pattern is likely present, in which case it is recommended to run the diagnostics for each POC independently rather than jointly.
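
The logic of this diagnostic can be illustrated without a random forest. The base R sketch below swaps in a plain logistic regression as the classifier, so that it runs without additional packages, and computes the test-set AUC with the rank-based (Mann-Whitney) formula. The simulated data, the auc() helper, and the split ratio of 0.7 are assumptions of this sketch and not package defaults.

```r
set.seed(2024)
n      <- 4000
age    <- rnorm(n, 60, 10)
comorb <- rbinom(n, 1, 0.3)
dat <- data.frame(
  age, comorb,
  # missingness indicator that depends on observed covariates (MAR)
  miss_mar  = rbinom(n, 1, plogis(-1 + 0.08 * (age - 60) + 0.8 * comorb)),
  # missingness indicator that is purely random (MCAR)
  miss_mcar = rbinom(n, 1, 0.3)
)

# Rank-based AUC: probability that a missing case receives a higher score
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Train-test split; the variable name mirrors the parameter mentioned in the text
train_test_ratio <- 0.7
train_id <- sample(n, size = floor(train_test_ratio * n))
train <- dat[train_id, ]
test  <- dat[-train_id, ]

# Fit on the training data, evaluate discrimination on the held-out test data
test_auc <- sapply(c("miss_mar", "miss_mcar"), function(m) {
  fit <- glm(reformulate(c("age", "comorb"), response = m),
             family = binomial(), data = train)
  auc(predict(fit, newdata = test, type = "response"), test[[m]])
})
print(round(test_auc, 2))
```

The MAR indicator should yield an AUC clearly above 0.5, while the MCAR indicator should stay near 0.5, which mirrors the interpretation given for smdi_rf().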

Figure 3 (Group 2 diagnostics—blue boxes), for example, illustrates the AUC values of the output of smdi_rf() when applied to the synthetic example dataset. Since the missingness of the EGFR POC follows a true MAR mechanism, the resulting AUC of 0.63 is expectedly meaningfully higher than what is observed for ECOG (0.51) and PD-L1 (0.52) which follow a true MCAR and MNAR mechanism, respectively.

Group 3 diagnostics

The third group of diagnostics with the smdi_outcome() function examines the association of the missingness indicator of the POC and the outcome under study. The function computes both a univariate model and a model adjusted for all other covariates in the dataset. In simulations, we discerned distinct patterns in both univariate and adjusted associations between the missing indicator and the outcome, closely mirroring simulated missingness mechanisms (Figure 3, top).²² As expected, under a MCAR mechanism the simulation suggested no difference in the outcome between patients with and without a value for the POC. Under MAR, given that missingness can be sufficiently explained by observed covariates, a spurious association in the univariate model disappeared after adjustment. If the missingness followed any MNAR mechanism, an association was observed regardless of adjustment.

smdi_outcome() supports multiple outcome regression types including linear regression (lm³⁷) for continuous outcomes, Cox proportional hazards model (coxph³⁸) for time-to-event outcomes, and generalized linear regression models (glm³⁷) for which the family of conditional distributions of the outcome can be selected using the glm_family parameter (the default is binomial(link="logit")). Besides the regression type (model parameter) and the glm_family (in case of a glm model), users need to specify the column containing the outcome using the form_lhs parameter (eg, Surv(eventtime, status) in case of a Cox model). The function returns a table with univariate and adjusted beta coefficients and 95% CIs for each POC.
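
The univariate versus adjusted comparison can be sketched with a binary outcome and glm() in base R. This corresponds to the glm option with the binomial(link="logit") family mentioned above, not to the Cox model used in the paper's example. The simulated data and the extract() helper are hypothetical, and the intervals are simple Wald intervals.

```r
set.seed(7)
n   <- 5000
age <- rnorm(n, 60, 10)
poc <- rnorm(n)   # value of the partially observed covariate

# MAR: missingness of the POC is driven by age only
miss <- rbinom(n, 1, plogis(-1 + 0.1 * (age - 60)))

# The outcome depends on age and the POC value, not on the missingness itself
outcome <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 60) + 0.5 * poc))
dat <- data.frame(outcome, age, miss)

# Univariate model: missingness indicator only
fit_uni <- glm(outcome ~ miss, family = binomial(link = "logit"), data = dat)
# Adjusted model: missingness indicator plus the fully observed covariate
fit_adj <- glm(outcome ~ miss + age, family = binomial(link = "logit"), data = dat)

# Beta coefficient of the missingness indicator with a Wald 95% CI
extract <- function(fit) {
  est <- coef(summary(fit))["miss", c("Estimate", "Std. Error")]
  c(beta  = est[["Estimate"]],
    lower = est[["Estimate"]] - qnorm(0.975) * est[["Std. Error"]],
    upper = est[["Estimate"]] + qnorm(0.975) * est[["Std. Error"]])
}

result <- rbind(univariate = extract(fit_uni), adjusted = extract(fit_adj))
print(round(result, 2))
```

Under this simulated MAR mechanism the univariate association is expected to shrink toward zero after adjustment, which is the pattern described for MAR above.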

Demonstrating the utilization of smdi_outcome() using the example dataset, the derived logHR coefficients for the missingness indicators of the POCs (Figure 3, bottom, Group 3 diagnostics—green boxes) align with the anticipated outcomes from our simulations.²² Specifically, EGFR manifests no discernible difference in the outcome after adjustment for fully observed covariates (logHR −0.01, 95% CI, −0.10 to 0.09), suggesting a MAR mechanism. ECOG exhibits no distinction in either the unadjusted or adjusted model (logHR −0.06, −0.16 to 0.03), indicating MCAR. Conversely, PD-L1 showcases differences in the outcome in both models, suggesting an MNAR context.

smdi_diagnose() to compute all three group diagnostics

Finally, the smdi_diagnose() function enables users to compute all of the above-discussed group diagnostics within a single function call.

# minimal example of a smdi_diagnose() function call
library(smdi)

smdi_diagnose(
  data = smdi_data,
  model = "cox",
  form_lhs = "Surv(eventtime, status)",
  n_cores = 3
)

The function returns an object of class smdi containing a table with the results of all diagnostics for each specified POC and Little’s test P-value across all covariates (Table 2). By cross-checking all resulting diagnostic parameters against the expected estimates, as illustrated in the above examples (Figure 3),²² the diagnostics can provide valuable insights into underlying missingness mechanisms and thereby help elucidate if analytic approaches such as imputation analyses are viable options.

Table 2.

Example output of the smdi_diagnose() function applied to the exemplary smdi_data dataset.

Covariate	ASMD (min/max)ᵃ	P Hotellingᵃ	AUCᵇ	Beta univariate (95% CI)ᶜ	Beta adjusted (95% CI)ᶜ
ecog_cat	0.029 (0.003, 0.071)	.783	0.510	−0.06 (−0.16 to 0.03)	−0.06 (−0.16 to 0.03)
egfr_cat	0.243 (0.010, 0.485)	<.001	0.629	0.06 (−0.03 to 0.15)	−0.01 (−0.10 to 0.09)
pdl1_num	0.062 (0.019, 0.338)	<.001	0.516	0.12 (0.01 to 0.23)	0.11 (−0.00 to 0.22)

In this example, ECOG performance score (ecog_cat) shows no imbalances in patient characteristics between patients with and without an observed value (absolute standardized mean difference [ASMD] 0.029, P[Hotelling] .783, group 1 diagnostic). Additionally, missingness cannot be predicted well (AUC = 0.510, group 2 diagnostic) and no difference in the outcome can be observed between patients with and without ecog_cat (log HR −0.06 [95% CI, −0.16 to 0.03], group 3 diagnostic). Accordingly, the missingness diagnostics indicate that ECOG follows a missing completely at random (MCAR) mechanism. Similarly, the EGFR (egfr_cat) and PD-L1 (pdl1_num) biomarker variables can be interpreted as following a missing at random (MAR) and a missing not at random value (MNAR_value) mechanism, respectively. See also Figure 3. P little: <.001.

Abbreviations: ASMD, median absolute standardized mean difference across all covariates; AUC, area under the curve; beta, beta coefficient; CI, confidence interval; max, maximum; min, minimum.

ᵃ Group 1 diagnostic: Differences in patient characteristics between patients with and without covariate.

ᵇ Group 2 diagnostic: Ability to predict missingness.

ᶜ Group 3 diagnostic: Assessment if missingness is associated with the outcome (univariate, adjusted).

The smdi_style_gt() function is an ancillary function that takes an object of class smdi and produces a formatted and publication-ready gt table³⁹ which can be seamlessly exported to different file formats (eg, .docx, .pdf, etc.) for reports or manuscripts.

Discussion

Missing data are ubiquitous in RWE studies involving EHR and may introduce bias if not handled appropriately. To address this issue, we developed the smdi R package to streamline routine diagnostics of missing data.

The package should be used with certain limitations in mind. Most importantly, the true underlying mechanism causing the missing data can never be inferred with absolute certainty from the observed data. Therefore, it is important to complement diagnostic results with substantive expert knowledge to factor in how covariates are measured in routine care, which could be system-specific, and contextualize potential reasons for missingness. This collaborative approach allows for a contextualized understanding of potential causes for missing data in EHR.

Conclusions

The smdi R package is a powerful and convenient tool to implement principled routine missing data diagnostics in RWE studies. This will improve the robustness of studies involving POCs by elucidating if certain analytic approaches are viable for a given dataset.

Acknowledgments

We would like to thank all beta testers and attendees of the Division of Pharmacoepidemiology and Pharmacoeconomics Methods Incubator who gave valuable feedback on early versions of the smdi R package.

Author contributions

J.W. designed and developed the smdi R package and drafted the manuscript. S.R.R., P.A.S., H.L., B.G.H., S.T., J.G.C., K.J.D., F.T., W.L., J.L., J.J.H., R.J.G., and R.J.D. contributed to the conception, design, and interpretation and provided important feedback. All authors critically reviewed the manuscript for important intellectual content and approved of the final version of the manuscript.

Supplementary material

Supplementary material is available at JAMIA Open online.

Funding

This project was supported by the US Food and Drug Administration (FDA) (Master Agreement 75F40119D10037).

Conflicts of interest

The FDA approved the study protocol, statistical analysis plan and reviewed and approved this manuscript. Coauthors from the FDA participated in the results interpretation and in the preparation and decision to submit the manuscript for publication. The FDA had no role in data collection, management, or analysis. The views expressed are those of the authors and not necessarily those of the US FDA. J.W. reports prior employment by Hoffmann-La Roche and previously held shares in Hoffmann-La Roche. P.A.S. is a named inventor on a patent licensed to Novartis by the University of Pennsylvania for an unrelated project. S.T. serves as a consultant for Pfizer, Inc. and TriNetX, LLC. R.J.G. has received research funding through his employer from Amarin, Kowa, Novartis, and Pfizer. R.J.D. reports serving as Principal Investigator on investigator-initiated grants to the Brigham and Women’s Hospital from Novartis, Vertex, and Bristol-Myers-Squibb on unrelated projects. All remaining authors report no disclosures or conflicts of interest.

Data availability

The R package presented in this study and corresponding data can be downloaded from the Comprehensive R Archive Network (CRAN) via install.packages("smdi") (version 0.2.2 at time of manuscript submission) or from https://janickweberpals.gitlab-pages.partners.org/smdi. This manuscript was written using Quarto version 1.3.433 (https://quarto.org/) and R version 4.1.2. All R code, materials, and dependencies can be found at https://gitlab-scm.partners.org/drugepi/smdi-manuscript or https://github.com/janickweberpals/smdi-manuscript.

References

1. Desai RJ, Matheny ME, Johnson K, et al. Broadening the reach of the FDA sentinel system: a roadmap for integrating electronic health record data in a causal analysis framework. NPJ Digit Med. 2021;4(1):170. doi:10.1038/s41746-021-00542-0
2. United States Food and Drug Administration. Framework for FDA’s Real World Evidence Program. United States Food and Drug Administration; 2018. Accessed June 30, 2023. https://www.fda.gov/downloads/ScienceResearch/SpecialTopics/RealWorldEvidence/UCM627769.pdf
3. Asfaw A, Ascha M, Yerram P, et al. SA27 comparison of comorbidity indices between electronic health records (EHR) derived database and claims data among patients with metastatic breast cancer. Value Health. 2022;25(12):S488.
4. Gorelick MH. Bias arising from missing data in predictive models. J Clin Epidemiol. 2006;59(10):1115-1123.
5. Ayilara OF, Zhang L, Sajobi TT, et al. Impact of missing data on bias and precision when estimating change in patient-reported outcomes from a clinical registry. Health Qual Life Outcomes. 2019;17(1):106. doi:10.1186/s12955-019-1181-2
6. Groenwold RHH, Dekkers OM. Missing data: the impact of what is not there. Eur J Endocrinol. 2020;183(4):E7-E9.
7. Van Buuren S. Flexible Imputation of Missing Data. CRC Press; 2018. https://stefvanbuuren.name/fimd/missing-data-pattern.html
8. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581-592.
9. Little RJ, Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons; 2019.
10. Lee KJ, Tilling KM, Cornish RP, et al.; STRATOS Initiative. Framework for the treatment and reporting of missing data in observational studies: the treatment and reporting of missing data in observational studies framework. J Clin Epidemiol. 2021;134:79-88.
11. Sondhi A, Weberpals J, Yerram P, et al. A systematic approach towards missing lab data in electronic health records: a case study in non-small cell lung cancer and multiple myeloma. CPT Pharmacometrics Syst Pharmacol. 2023;12(9):1201-1212. doi:10.1002/psp4.12998
12. Hotelling H. The generalization of Student’s ratio. Ann Math Statist. 1931;2(3):360-378.
13. Little RJA. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83(404):1198-1202.
14. Pedersen A, Mikkelsen E, Cronin-Fenton D, et al. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157-166.
15. Madley-Dowd P, Hughes R, Tilling K, et al. The proportion of missing data should not be used to guide decisions on multiple imputation. J Clin Epidemiol. 2019;110:63-73.
16. Lee KJ, Carlin JB, Simpson JA, et al. Assumptions and analysis planning in studies with missing data in multiple variables: moving beyond the MCAR/MAR/MNAR classification. Int J Epidemiol. 2023;52(4):1268-1275. doi:10.1093/ije/dyad008
17. Moreno-Betancur M, Lee KJ, Leacy FP, et al. Canonical causal diagrams to guide the treatment of missing data in epidemiologic studies. Am J Epidemiol. 2018;187(12):2705-2715.
18. Mohan K, Pearl J. Graphical models for processing missing data. J Am Stat Assoc. 2021;116(534):1023-1037.
19. Carroll OU, Morris TP, Keogh RH. How are missing data in covariates handled in observational time-to-event studies in oncology? A systematic review. BMC Med Res Methodol. 2020;20(1):134. doi:10.1186/s12874-020-01018-7
20. Wood AM, White IR, Thompson SG. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials. 2004;1(4):368-376.
21. Harel O, Pellowski J, Kalichman S. Are we missing the importance of missing values in HIV prevention randomized clinical trials? Review and recommendations. AIDS Behav. 2012;16(6):1382-1393.
22. Weberpals J, Raman SR, Shaw PA, et al. A Principled Approach to Characterize and Analyze Partially Observed Confounder Data from Electronic Health Records. 2023. (in review)
23. Nalichowski R, Keogh D, Chueh HC, et al. Calculating the benefits of a research patient data repository. AMIA Annu Symp Proc. 2006;2006:1044.
24. CMS resdac. Accessed November 16, 2023. https://resdac.org/
25. Weberpals J. smdi: perform structural missing data investigations. 2023. https://CRAN.R-project.org/package=smdi
26. Wickham H, Bryan J. R Packages. O’Reilly Media, Inc.; 2023.
27. Tierney N, Cook D. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. J Stat Softw. 2023;105(7):105. doi:10.18637/jss.v105.i07
28. Ruddle RA, Adnan M, Hall M. Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data. BMJ Open. 2022;12(11):e064887.
29. van Buuren S, Groothuis-Oudshoorn K. Mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1-67.
30. Schober P, Vetter TR. Correct baseline comparisons in a randomized trial. Anesth Analg. 2019;129(3):639.
31. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46(3):399-424.
32. Yoshida K, Bartel A. Tableone: create ‘table 1’ to describe baseline characteristics with or without propensity score weights. 2022. https://CRAN.R-project.org/package=tableone
33. Wickham H. ggplot2: elegant graphics for data analysis. 2016. https://ggplot2.tidyverse.org
34. Curran J, Hersh T. Hotelling: Hotelling’s t² test and variants. 2021. https://CRAN.R-project.org/package=Hotelling
35. Liaw A, Wiener M. Classification and regression by randomForest. 2002;2:18-22. https://CRAN.R-project.org/doc/Rnews/
36. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.
37. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; 2022. https://www.R-project.org/
38. Therneau TM. A package for survival analysis in R. 2023. https://CRAN.R-project.org/package=survival
39. Iannone R, Cheng J, Schloerke B, et al. Gt: easily create presentation-ready display tables. 2023. https://CRAN.R-project.org/package=gt


