Abstract
Background
Missing data in confounding variables present a frequent challenge in generating evidence using real-world data, including electronic health records (EHR). Our objective was to apply a recently published toolkit for characterizing missing data patterns and, based on the toolkit results about likely missingness mechanisms, to illustrate the decision-making process for analyses in an empirical case example.
Methods
We utilized the Structural Missing Data Investigations (SMDI) toolkit to characterize missing data patterns in the context of a pharmacoepidemiology study comparing cardiovascular outcomes of initiating sodium-glucose-cotransporter-2 inhibitors (SGLT2i) and dipeptidyl peptidase‐4 inhibitors (DPP‐4i) among older adults. The study used a linked EHR-Medicare claims dataset from Duke Health patients (2015–2017), focusing on partially observed confounders from EHR data (HbA1c lab and body mass index [BMI] values). Our analysis incorporated SMDI's descriptive functions and diagnostic tests to explore missingness patterns and determine missingness mitigation approaches. We used findings from these investigations to inform estimation of adjusted hazard ratios comparing the two classes of medications.
Results
High levels of missingness were noted for important confounding variables including HbA1c (63.6%) and BMI (16.5%). Diagnostic tests produced output that described: 1) the distributions of patient characteristics, exposure, and outcome between patients with and without an observed value of the partially observed covariate, 2) the ability to predict missingness based on observed covariates, and 3) whether the missingness of a partially observed covariate is differential with respect to the outcome. There was evidence that missingness could be sufficiently described using observed data, which allowed multiple imputation by chained equations using random forests to address missing confounder data in estimating treatment effects. Multiple imputation resulted in improved alignment of effect estimates with previous studies.
Conclusions
We were able to demonstrate the practical application of the SMDI toolkit in a real-world setting. Application of the SMDI toolkit and the resulting insights of potential missingness patterns can inform the choice of appropriate analytic methods and increase transparency of research methods in handling missing data. This type of approach can inform analytic decision making and may increase our ability to generate evidence from real-world data.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12874-024-02330-2.
Keywords: Missing data, Electronic health records, Real-world evidence
Introduction
Data from electronic health records (EHR) are increasingly used to generate real-world evidence about health care delivery and interventions [1, 2]. However, since EHR are generated from the dynamic and heterogeneous workflows of clinical health care, and not in carefully planned prospective studies or clinical trials, use of EHR data for research is inherently challenged by missing data.
Despite the proliferation of EHR-based observational research, prevailing practices in addressing missing data lag behind established guidance [3–5]. For example, complete case analysis, i.e., restricting analysis to only observations without missing values and assuming that observations are missing completely at random (MCAR), is still the most common approach to missing data in clinical research [6, 7]. This method relies on the assumption that missing data are independent of observed and unobserved covariates. However, this assumption is not realistic in the majority of cases, as missing EHR values likely depend on observed or unobserved factors. More appropriate methods, such as imputation that uses collected variables to predict missing values, can lead to less biased results and less uncertainty [8]. However, bias can still arise if the missing values differ from the observed values in a way that cannot be estimated from existing information. Best practices suggest that the choice of an appropriate analysis strategy should be based on a number of a priori considerations, including the missingness mechanisms present in the data [5]. Evidence from real-world data requires clear explanations for how missing data are quantified and addressed during statistical analysis [9, 10].
To move towards more appropriate analytic approaches for real-world data, researchers need study-specific knowledge about the underlying patterns of missing data. Existing guidance has suggested straightforward diagnostic tests to elucidate missing data patterns [4, 11–13]; however, these tests have not been integrated into a cohesive, easily implemented statistical package for the broader non-specialist audience. The objective of this paper is to apply a recently published R package for systematically characterizing missing data patterns [14] to an empirical case example and to illustrate the decision-making process for analyses based on that package’s output. Though exposure, outcome, and confounder data may all be partially observed in EHR data, we focus here on a case study where partially observed confounder data from EHR are used to supplement claims data, as in the proposed use case for EHR data in the US Food and Drug Administration (FDA) Sentinel System [15].
Methods
The Structural Missing Data Investigations (SMDI) R package was designed to provide researchers and statisticians with an integrated and user-friendly interface to assess and inform analyses. The SMDI package was developed as part of an FDA-funded Sentinel Innovation Center study that more deeply investigated the performance of diagnostic procedures to characterize underlying missingness patterns and assessed the performance of various analytic approaches to address partially observed EHR confounders [16]. The SMDI R toolkit was developed to efficiently carry out these diagnostic tests in an analytic dataset.
The R toolkit consists of two parts, descriptive functions and diagnostic tests which provide evidence for underlying missingness mechanisms. The toolkit is available on the comprehensive R archive network (CRAN) [17], GitLab (https://janickweberpals.gitlab-pages.partners.org/smdi) and in two previous publications [14, 16]. With substantive knowledge about the context and provenance of the data (i.e. how the data are generated), the results of these diagnostics can provide evidence for certain missingness patterns and suggest possible missingness mitigation approaches.
Empirical case example construction
To illustrate the application of this package to a typical pharmacoepidemiology study, our empirical case example was designed as a retrospective cohort study among older adults to compare new users of sodium-glucose-cotransporter-2 inhibitors (SGLT2i) to new users of dipeptidyl peptidase‐4 inhibitors (DPP‐4i). The outcome of interest was major adverse cardiovascular events (MACE), which included all-cause mortality, stroke, or myocardial infarction occurring over the first year of use. From recent trial and observational investigations, we expected that initiation of SGLT2i medication would be associated with a lower risk of MACE in comparison to DPP-4i initiation [18–21]. Claims data do not routinely have information on some cardiovascular risk factors that could be potential confounders. EHR data may contain these clinical measurements; however, EHR data are often incomplete [22, 23]. In this empirical case example, we used level of hemoglobin A1c (HbA1c) and body mass index (BMI, kg/m²) from EHR data to supplement Medicare fee-for-service claims data in an effort to better estimate the comparative effectiveness of these alternative treatments.
Data sources
We used Duke EHR-Medicare claims linked data that included Duke Health patients who were covered by Medicare and who were residents of North or South Carolina between 2013 and 2017. Medicare data were used to define the cohort, exposure, outcome, and most covariates. Additional partially observed covariates of interest were derived from EHR linked data.
Study population
Patients were included if they initiated treatment with either an SGLT2i or a DPP-4i medication between October 1, 2015, and December 31, 2017, without use of SGLT2i or DPP‐4i medications in the prior 12 months (baseline period). The index date was defined as the patient’s first prescription fill date. Patients must have been enrolled in fee-for-service Medicare Parts A and B and Medicare Part D throughout the baseline period. Additionally, a diagnosis of type 2 diabetes (T2D) was required, as defined by at least one T2D diagnosis code on a Medicare claim from any encounter setting within the baseline period. To confirm active engagement with the healthcare system, we used EHR data to determine if patients had at least one encounter of any type in the Duke EHR system within the baseline period. Exclusion criteria included missing age or gender, a history of type 1 diabetes, secondary or gestational diabetes, malignancy, end-stage renal disease, human immunodeficiency virus (HIV), skilled nursing facility admission, or organ transplant in the baseline period (see Additional File 1 Section A for all definitions). Patients were followed from index date until occurrence of outcome, death, 365 days, end of Medicare enrollment or the study end date (Dec. 31st, 2017), whichever came first.
Exposure, outcome, and covariate definition
Medicare claims data were used to identify patient exposure (to SGLT2i or a DPP-4i medication), outcomes (myocardial infarction, stroke, or all cause death), and patient characteristics. Patient characteristics selected a priori as potential confounders included demographics (measured at index date), clinical conditions (cardiovascular and other comorbidities, general health indices such as comorbidity score and frailty index, measured in the 12 month baseline period), recent diabetes medication exposure (sulfonylureas, metformin, insulin, measured in the prior 4 months), recent cardiovascular medication exposure (anticoagulant, antiplatelets, statins, antihypertensive medications, measured in the 12 month baseline period), and health care utilization (e.g. hospitalizations, emergency department visits, measured in the 12 month baseline period). These covariates were defined by ICD-9/10-CM diagnosis and procedure codes, HCPCS/CPT, and medication names (see Additional File 1 Section A for all definitions). Additional covariates derived from EHR data included HbA1c lab results (continuous) and BMI (categorical: < 18.5 underweight, 18.5–24.9 healthy weight, 25.0–29.9 overweight, ≥ 30 obese) [24]. These covariates were chosen as potentially important confounders and were expected to have some missing values.
Statistical analysis
SMDI: data structure and formatting
We generated the analytic dataset for use with the SMDI toolkit, which included the primary exposure, outcome, and necessary covariates. The package can accommodate a single outcome and one of a number of analytic strategies (e.g., generalized linear models and proportional hazards (time-to-event) models). All relevant data cleaning and variable definitions were completed, and the analytic dataset was structured such that each row represented a unique patient and each column represented a variable: exposure, outcome, the fully observed covariates, and the partially observed covariates.
SMDI: descriptive functions and missingness diagnostic tests
We used the SMDI descriptive functions to visualize the proportions of missing observations and investigate missing data patterns. We used the SMDI missingness diagnostics to 1) compare the distributions of patient characteristics, exposure, and outcome between patients with and without an observed value of the partially observed covariate, 2) assess the ability to predict missingness based on observed covariates, and 3) estimate whether the missingness of a partially observed covariate is differential with respect to the outcome. We used these results to determine a possible missingness mechanism for each partially observed covariate and to inform our missingness mitigation approach (Table 1).
Table 1.
SMDI functions
| Descriptive Functions | |
|---|---|
| smdi_check_covar | Identifies which covariates exhibit missingness |
| smdi_na_indicator | Creates a binary missing indicator variable for missing observations for each partially observed variable |
| smdi_summarize | Summarizes and visualizes missingness |
| smdi_vis | Produces the proportion missing for each variable under study, with an option to stratify by exposure |
| gg_miss_upsetᵃ | Identifies the number of patients who have missingness in multiple variables |
| md.patternᵃ | Produces a table displaying the missing pattern |
| Missingness Diagnostics | |
| Group 1 diagnostics: smdi_asmd | Computes the absolute standardized mean differences (ASMD) of patient characteristics between patients with versus without a value for the partially observed covariate(s) and provides the median (min/max) and a plot for individual covariates |
| Group 1 diagnostics: smdi_hotelling | Computes Hotelling's multivariate t-test [25] for each partially observed covariate, examining patient differences conditional on having an observed covariate value or not. Yields a test statistic, with the null hypothesis that there are no differences in the baseline covariate distributions |
| Group 1 diagnostics: smdi_little | Computes Little’s test [26], a single global chi-square test statistic across all partially observed covariates, with the null hypothesis that the data are missing completely at random |
| Group 2 diagnostics: smdi_rf | Trains and fits a random forest classification model to assess the ability to predict missingness in the partially observed covariate and provides an AUC value and a variable importance plot based on mean decrease in accuracy per predictor |
| Group 3 diagnostics: smdi_outcome | Fits an outcome model with the missingness indicator of the partially observed covariate(s) and provides estimates from a univariate model (considering only the missingness indicator) and an adjusted model with all covariates |
| Group 1–3 diagnostics: smdi_diagnose | Provides a summary table with results from the above missingness diagnostics |

ᵃ gg_miss_upset() and md.pattern() are re-exports from the naniar [27] and mice [28] packages, respectively. When using the toolkit, a specific variable (column name) can be specified by the investigator using the covar parameter; if none is specified, all functions will automatically consider any variable in the dataset that exhibits at least one missing value. Each of the functions can be called separately.
We described baseline characteristics, the number of events, and incidence rates for each of the two study groups. We adopted an intention-to-treat approach, which defined exposure at the index date, and fit an unadjusted Cox proportional hazards model with medication as the exposure and MACE as the outcome. We then fit Cox proportional hazards models adjusted for 36 demographic and clinical characteristics, with and without the two partially observed covariates. Unadjusted and adjusted hazard ratios (HRs) and 95% confidence intervals (CI) were estimated. The proportional hazards assumption was tested for all of the above models.
This study was determined to be exempt by the Duke University Institutional Review Board (IRB) (study number Pro00111511). Data use agreement approval to reuse the linked data was received July 25, 2023.
Results
We identified 2,102 eligible patients who initiated an SGLT2i medication (n = 387) or DPP-4i medication (n = 1,715) (Fig. 1). The distribution of baseline characteristics showed that SGLT2i initiators were younger, more likely to be male, and had fewer comorbidities (see Additional File 1 Section B). Most patients (94.6% and 91.5% of the SGLT2i and DPP-4i groups, respectively) were censored. The mean (SE) follow-up time was 267 (116) days and 282 (107) days in the SGLT2i and DPP‐4i groups, respectively. The median (IQR) follow-up time was 339 (192, 354) days and 343 (229, 354) days in the SGLT2i and DPP‐4i groups, respectively. The crude MACE incidence rates were 88.2 and 134.7 events/1,000 person-years, respectively. These rates reflect 25 events during 283.4 person-years in the SGLT2i group and 179 events during 1,328.9 person-years in the DPP-4i group.
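The crude incidence rates reported above follow directly from the event counts and person-years. The short Python sketch below reproduces that arithmetic; the function name is illustrative only and is not part of the SMDI toolkit.

```python
def incidence_rate_per_1000(events: int, person_years: float) -> float:
    """Crude incidence rate expressed as events per 1,000 person-years."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * events / person_years


# Event counts and person-years as reported in the Results.
sglt2i_rate = incidence_rate_per_1000(25, 283.4)
dpp4i_rate = incidence_rate_per_1000(179, 1328.9)

print(f"SGLT2i: {sglt2i_rate:.1f} events/1,000 person-years")  # 88.2
print(f"DPP-4i: {dpp4i_rate:.1f} events/1,000 person-years")   # 134.7
```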
Fig. 1.
Patient flow diagram
SMDI descriptive functions
The three descriptive functions identified that HbA1c and BMI were covariates with at least one missing value, created a binary missing indicator variable for each, and output a summary of these values. We observed that HbA1c was missing in 64% of study participants (72% in SGLT2i group compared to 62% in the DPP-4i group). For BMI, overall, 16% of participants were missing values, and this was similar between exposure groups (Fig. 2, see Additional File 1 Section C for SMDI commands).
Fig. 2.
SMDI descriptive output: percentage missing. For each variable with missing values, SMDI can display the overall proportion missing and the proportion by exposure/treatment group
The gg_miss_upset function enabled us to assess the intersection of missingness across variables. A monotone missingness pattern is when missingness in one variable is associated with missingness in another variable. For instance, since height and weight are usually assessed simultaneously, it is highly likely that weight will be missing when height is missing. In our empirical case example, of study participants who were missing at least one value (n = 1,375, 65.4%), about 22.4% (308/1,375) were missing both HbA1c and BMI (Fig. 3). Since monotonicity was not observed, we proceeded to apply the SMDI to a dataset that included both partially observed covariates. In situations where monotonicity is observed, researchers may choose to assess each partially observed covariate separately to avoid distorted values in the missingness diagnostic tests.
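To make the idea of intersecting missingness concrete, the following Python sketch tallies how often each combination of missing covariates occurs, which is conceptually what the upset plot and the missing data pattern table display. It uses a small set of made-up records (not study data), and the helper `missing_patterns` is a hypothetical name introduced for illustration; it is not the implementation used by gg_miss_upset or md.pattern.

```python
from collections import Counter


def missing_patterns(rows, covariates):
    """Count how often each combination of missing covariates occurs."""
    patterns = Counter()
    for row in rows:
        # A pattern is the tuple of covariate names that are missing (None).
        missing = tuple(name for name in covariates if row.get(name) is None)
        patterns[missing] += 1
    return patterns


# Toy records for illustration only.
rows = [
    {"hba1c": 7.1, "bmi": 31.0},
    {"hba1c": None, "bmi": 27.5},
    {"hba1c": None, "bmi": None},
    {"hba1c": 6.4, "bmi": None},
    {"hba1c": None, "bmi": 24.0},
]

patterns = missing_patterns(rows, ["hba1c", "bmi"])
any_missing = sum(n for pattern, n in patterns.items() if pattern)
both_missing = patterns[("hba1c", "bmi")]
share_both = both_missing / any_missing

# The same ratio using the counts reported in the text (308 of 1,375).
reported_share = 308 / 1375
print(f"toy share: {share_both:.3f}, reported share: {reported_share:.3f}")
```

If one covariate were missing whenever the other was missing, the count for the joint pattern would equal the count of patients missing the first covariate, which is the kind of structure the text describes as a concern for the diagnostics.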
Fig. 3.
SMDI descriptive functions – BMI and HbA1c missingness. The gg_miss_upset and md.pattern functions examine the possibility of monotonicity in the missingness patterns between variables. As missingness patterns across variables exhibit more monotonicity, covariates that exhibit a monotone pattern may result in inflated AUC values in Group 2 diagnostic tests. In situations where missingness of one covariate may perfectly predict missingness in another covariate, researchers should apply the missingness diagnostics for each partially observed covariate independently. The exception is Little’s test, which is intended to be used when there are multiple partially observed covariates
Missingness diagnostic tests
We used the SMDI toolkit to assess both covariates of interest simultaneously. The smdi_diagnose function runs all of the missingness diagnostic functions and summarizes the most important test results for each partially observed confounder in a single table (Table 2).
Table 2.
smdi_diagnose results for multiple partially observed covariates
| Covariate | Group 1: ASMD median (min, max)ᵃ | Group 1: p Hotellingᵃ | Group 2: AUCᵇ | Group 3: LogHR univariate (95% CI)ᶜ | Group 3: LogHR adjusted (95% CI)ᶜ |
|---|---|---|---|---|---|
| BMI | 0.092 (0.002, 0.244) | < .001 | 0.506 | -0.03 (-0.40, 0.34) | 0.07 (-0.33, 0.48) |
| HbA1c | 0.072 (0.001, 0.312) | < .001 | 0.643 | 0.09 (-0.20, 0.37) | -0.04 (-0.37, 0.29) |

Little’s test: Calculated for all partially observed covariates jointly, p value: < .001
Abbreviations: ASMD, median absolute standardized mean difference across all covariates; AUC, area under the curve; LogHR, model beta coefficient (log hazard ratio); CI, confidence interval; max, maximum; min, minimum
ᵃ Group 1 diagnostics: Differences in patient characteristics between patients with and without covariate
ᵇ Group 2 diagnostic: Ability to predict missingness
ᶜ Group 3 diagnostics: Assessment if missingness is associated with the outcome (univariate, adjusted)
Group 1 diagnostic tests aim to quantify differences in the distribution of covariates between the populations who are and are not missing values for the partially observed covariate(s). The absolute standardized mean differences (ASMD) for all the covariates are summarized in two ways: as the median ASMD (with minimum and maximum) and visually in a plot of the values for individual covariates, ordered from smallest to largest. Values below and above 0.1 are distinguished by color; values under 0.1 indicate a small difference in the prevalence or mean of the covariate [29]. The median (min, max) ASMD for HbA1c was 0.072 (0.001, 0.312). The plot of the individual covariate ASMD values showed that about 60% of the values were under the 0.1 threshold [29]. Of those over 0.1, all but two were under 0.2; the two remaining ASMD values were close to 0.3 (total internal medicine visits and use of sulfonylurea medication). For BMI, the pattern was similar: the median (min, max) ASMD was 0.092 (0.002, 0.244), about 40% of the ASMD values were under 0.1, the remainder were between 0.1 and 0.2 except for three values between 0.2 and 0.25, and the highest value was observed for the Charlson comorbidity score (Fig. 4). The smdi_asmd function also produces a table that indicates the direction and magnitude of the univariate differences (see Additional File 1 Section D). From this table, we observed that those with an HbA1c value had a higher mean number of internal medicine visits than those without an HbA1c value.
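As an illustration of the quantity being summarized, the Python sketch below computes an absolute standardized mean difference for one continuous covariate between patients with and without an observed value of a partially observed covariate. It assumes the common pooled-standard-deviation definition; the exact formula used internally by smdi_asmd is not restated in this paper, so this should be read as a generic sketch with made-up numbers rather than the package's implementation.

```python
from math import sqrt
from statistics import mean, variance


def asmd(values_observed, values_missing):
    """Absolute standardized mean difference for a continuous covariate.

    Uses |mean_1 - mean_0| / sqrt((var_1 + var_0) / 2), a common definition
    (assumed here; not taken from the smdi source code).
    """
    pooled_sd = sqrt((variance(values_observed) + variance(values_missing)) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(mean(values_observed) - mean(values_missing)) / pooled_sd


# Toy example: number of visits for patients with vs. without an HbA1c value.
visits_hba1c_observed = [4, 6, 5, 7, 8]
visits_hba1c_missing = [3, 4, 2, 5, 6]

value = asmd(visits_hba1c_observed, visits_hba1c_missing)
flag = "above" if value > 0.1 else "below"
print(f"ASMD = {value:.3f} ({flag} the 0.1 threshold)")
```

Repeating this calculation for every fully observed covariate and taking the median, minimum, and maximum gives a summary of the kind reported in Table 2.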
Fig. 4.
Group 1 diagnostics. ASMD plot for a) HbA1c, and b) BMI. ASMD values under 0.1 indicate that the underlying pattern of missingness is not associated with other observed covariates and may be completely at random (MCAR), whereas ASMD values greater than 0.1 provide evidence against MCAR
Group 1 diagnostic tests also include Hotelling’s multivariate t-test and Little’s chi-square test for differences in the covariate distributions between groups with and without a value for the partially observed covariate. In our empirical case example, tests for both covariates had a p value < 0.001, which indicates significant differences in the distribution of observed baseline characteristics.
The Group 2 diagnostic assesses the ability of the dataset variables to predict missingness. The smdi_rf function trains and fits a random forest model to assess the ability to predict missingness based on the observed covariates and produces an AUC value for each partially observed covariate (0.51 for BMI and 0.64 for HbA1c). The covariate importance plots show the mean decrease in accuracy for each covariate (i.e., the degree to which the accuracy of the prediction [# of correct predictions/total # of predictions made] would decrease had we left out this specific predictor). For HbA1c, the plot indicated that BMI missingness and total internal medicine visits in the past year were most important for predicting HbA1c missingness. Results for BMI indicate that even the highest values of mean decrease in accuracy were low (< 0.015), meaning that none of the variables were particularly important for predicting missingness (Fig. 5).
Fig. 5.
Group 2 diagnostics. Covariate importance for a) predicting HbA1c missingness, b) predicting BMI missingness
The Group 3 diagnostic examined the crude and adjusted association between the missingness of the partially observed covariate and the outcome under study. In our empirical case example, the Group 3 diagnostic results for HbA1c and BMI yielded unadjusted and adjusted estimates close to the null value, with CIs that included the null.
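Because Table 2 reports the Group 3 results on the log hazard ratio scale, it can help to exponentiate them when judging how close they are to the null (a hazard ratio of 1). The Python sketch below does this for the Table 2 values; the helper name is illustrative only.

```python
from math import exp


def to_hazard_ratio(log_hr, ci_low, ci_high):
    """Exponentiate a log hazard ratio and its confidence limits."""
    return tuple(round(exp(v), 2) for v in (log_hr, ci_low, ci_high))


# Log hazard ratios (95% CI) for the missingness indicators, from Table 2.
group3 = {
    ("BMI", "univariate"): (-0.03, -0.40, 0.34),
    ("BMI", "adjusted"): (0.07, -0.33, 0.48),
    ("HbA1c", "univariate"): (0.09, -0.20, 0.37),
    ("HbA1c", "adjusted"): (-0.04, -0.37, 0.29),
}

for (covariate, model), estimate in group3.items():
    hr, low, high = to_hazard_ratio(*estimate)
    print(f"{covariate} ({model}): HR {hr:.2f} (95% CI {low:.2f}, {high:.2f})")
```

On the hazard ratio scale, all four point estimates lie between roughly 0.96 and 1.09, and every interval contains 1.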
Interpreting SMDI results to inform analytic decision-making
Each test in Group 1 was able to provide some evidence about the underlying missingness mechanism (Table 3) [16]. The median ASMD for both partially observed covariates was < 0.1 (evidence that supports MCAR); however, the plot of individual covariate ASMDs showed that many variables had an ASMD greater than 0.1, and the Little’s test p-value was low (both pieces of evidence against MCAR).
Table 3.
Expected SMDI results under various missingness mechanisms [16]
ᵃ Very sensitive to sample size
ᵇ Diagnostic results depend on the strength of the unmeasured confounder and its correlation with observed (auxiliary) covariates
In general, it is important to note that the Hotelling’s/Little’s test results (i.e., rejection of the null hypothesis that the missingness mechanism is MCAR) are sensitive to small differences that may not be apparent in the median ASMD and to assumptions about the underlying data. Though we found the median ASMD to be under 0.1, evidence from the ASMD plot can indicate the extent to which other covariates could be used to recover some of the missingness. In other words, strong evidence against MAR would entail observing an ASMD median below 0.1 and few, if any, variables with an individual ASMD > 0.1. Group 1 diagnostic results should be interpreted in the context of both the ASMD and the Hotelling’s/Little’s statistics, and MCAR or MNAR considered only when Group 1 results contain little to no evidence to the contrary. Collectively, these results provided evidence against MCAR for both partially observed covariates and indicated that existing covariates have information that may be leveraged to inform missingness mitigation techniques.
Group 2 diagnostics consist of a single test that yields AUC values that range from 0.5 to 1.0. A value of 0.5 indicates a complete lack of ability to predict missingness, and higher values indicate stronger relationships between covariates and missingness. In our empirical case example, the AUC for BMI was 0.51, and the low mean decrease in accuracy values of even the strongest predictors indicate that there likely are no informative covariates. These results for BMI align with the expected results for a MCAR or MNAR missingness mechanism (Table 3) [16]. In comparison to the maximum AUC values in a true simulated MAR mechanism (~ 0.59) [16], the observed AUC of 0.64 for HbA1c indicates strong evidence for MAR and against MCAR/MNAR. The relatively higher covariate importance of two variables, BMI missingness and total internal medicine visits, also supports this mechanism. This may align with a possible clinical explanation, in that a higher frequency of internal medicine visits and missingness of BMI could reflect a more intensive treatment regimen where HbA1c is more likely to be measured regularly. Since a few observed covariates are able to predict missingness relatively well, we interpret that the underlying missingness mechanism for HbA1c may be missing at random (MAR).
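The interpretation of the AUC scale can be illustrated with a small Python sketch that computes the AUC directly from predicted missingness scores as the probability that a randomly chosen patient with a missing value receives a higher score than a randomly chosen patient with an observed value. This is the standard rank-based definition of the AUC; the scores are made up, and the sketch does not reproduce the random forest fitted by smdi_rf.

```python
def auc(scores_missing, scores_observed):
    """Rank-based AUC: P(score of a missing case > score of an observed case).

    Ties count as one half, so a model that assigns everyone the same score
    yields exactly 0.5.
    """
    wins = 0.0
    for s_missing in scores_missing:
        for s_observed in scores_observed:
            if s_missing > s_observed:
                wins += 1.0
            elif s_missing == s_observed:
                wins += 0.5
    return wins / (len(scores_missing) * len(scores_observed))


# Uninformative predictions: every patient gets the same predicted probability.
auc_uninformative = auc([0.6, 0.6, 0.6], [0.6, 0.6, 0.6])

# Informative predictions: missing cases tend to receive higher scores.
auc_informative = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

print(f"uninformative: {auc_uninformative:.2f}, informative: {auc_informative:.2f}")
```

An AUC near 0.5, as observed for BMI, therefore means the observed covariates rank patients with and without a missing value about as well as chance.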
The Group 3 diagnostic examines the association between the missingness of the partially observed covariate and the outcome under study and produces a crude and an adjusted LogHR. In general, examining the resulting values and the differences between the two models in both the point estimates and the confidence intervals provides evidence for various missingness mechanisms. For example, no apparent association in either the crude or the adjusted setting provides evidence for MCAR, as the missingness is not associated with the outcome with or without accounting for the observed covariates. If an association is present in the crude model but not in the adjusted model, this indicates that the missingness may follow a MAR pattern explained by the observed covariates. If an association remains after adjustment, i.e., the association between missingness and the outcome cannot be explained by observed covariates, this may be indicative of an MNAR mechanism. It is important to note that only the Group 3 results can distinguish between MCAR and MNAR.
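The reading of Group 3 results described above can be written down as a simple rule of thumb. The Python sketch below encodes it under one simplifying assumption that is ours and not the toolkit's: an "association" is taken to mean that the 95% confidence interval for the LogHR excludes zero. In practice the point estimates and interval widths should be weighed alongside the other diagnostics and substantive knowledge, so this is an aid to reading, not a decision procedure.

```python
def includes_null(ci, null=0.0):
    """True if the confidence interval (low, high) contains the null value."""
    low, high = ci
    return low <= null <= high


def group3_reading(crude_ci, adjusted_ci):
    """Heuristic reading of crude and adjusted LogHR intervals (assumption:
    an association is declared when the interval excludes zero)."""
    crude_association = not includes_null(crude_ci)
    adjusted_association = not includes_null(adjusted_ci)
    if adjusted_association:
        return "association remains after adjustment: may indicate MNAR"
    if crude_association:
        return "association only in crude model: consistent with MAR"
    return "no apparent association: consistent with MCAR"


# 95% CIs for the LogHR of the missingness indicators, taken from Table 2.
bmi_reading = group3_reading((-0.40, 0.34), (-0.33, 0.48))
hba1c_reading = group3_reading((-0.20, 0.37), (-0.37, 0.29))
print("BMI:", bmi_reading)
print("HbA1c:", hba1c_reading)
```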
In our empirical case example, for both HbA1c and BMI, the unadjusted and adjusted estimates were close to zero, providing some evidence for MCAR. In our prior simulation study, we consistently observed that under MNAR or MAR mechanisms, the Group 3 diagnostics resulted in crude estimates indicating an association with the outcome, which was not observed in our study. We note, however, that the confidence intervals for both covariates were wide and included the null value.
In summary, the results of the SMDI (in particular the Little’s/Hotelling’s test p value < 0.05 and the relatively high AUC) provided evidence that the missingness mechanism of HbA1c was likely MAR. Therefore, for HbA1c, we have the ability to use the distribution of measured covariates to improve imputation of missing values. The SMDI results for BMI indicated a MCAR mechanism (median ASMD < 0.1 and AUC ~ 0.5). However, the ASMD plot and Hotelling’s/Little’s test results indicate that we could also, to a lesser extent, leverage observed variables to better impute missing data.
Use case example results
The crude hazard ratio comparing SGLT2i initiators with DPP-4i initiators was 0.64 (95% CI: 0.43, 0.98). The hazard ratio adjusting for all covariates except the partially observed EHR covariates was 0.91 (95% CI: 0.58, 1.41; n = 2,102). Adjusting for demographic and clinical characteristics including the partially observed EHR covariates (which resulted in a complete case analysis, deleting 1,375 observations due to missingness in the EHR covariates of interest) showed a marked but uncertain reduction in MACE for SGLT2i initiators compared to DPP-4i initiators (hazard ratio [HR]: 0.50; 95% CI: 0.16, 1.60; n = 737). Using the MICE random forest approach to missingness mitigation yielded an HR of 0.90 (95% CI: 0.58, 1.41; n = 2,102).
Discussion
In our application of the SMDI in this use case, we aimed to leverage linked EHR data to better account for potential confounders. We were able to apply the SMDI toolkit to systematically describe missingness patterns and provide evidence for the underlying missingness mechanism for each partially observed covariate and utilize an appropriate missingness approach.
The descriptive functions of the SMDI alerted us to the large proportion of the population who were missing an HbA1c value, and the relatively smaller proportion who were missing BMI values. As missingness in both covariates exceeded the 5% or 10% level that is typically considered consequential [30], further investigation was needed. The impact of missingness on research results may depend more on the missingness patterns, and on the availability of observed data to impute missing values, than on the proportion of data missing [12].
In terms of using these results to guide analytic decisions, the SMDI provided evidence that, for the two partially observed covariates, there was little indication that the underlying missingness mechanism is MNAR. MNAR, where missingness is systematically related to unobserved data, presents the most challenges to missingness mitigation as compared to MCAR or MAR [31]. While the Group 2 diagnostic for BMI showed only poor predictiveness, there were still a few covariates that exhibited imbalances and associations with the BMI missingness indicator. In addition, the Group 3 diagnostic did not show any significant difference in the time to MACE between patients with and without an observed BMI value, which would not be expected under a strong MNAR scenario. SMDI diagnostics of HbA1c showed stronger relationships between missingness and covariates indicating intensified diabetic treatment (e.g., higher frequency of internal medicine visits and concomitant use of sulfonylureas), and no significant differences in outcomes between patients with and without an observed HbA1c value, which indicates a potential MAR mechanism. Considering these two interpretations, we felt confident that the missingness of both EHR confounders is unlikely to be affected by unobserved factors, and so we could use imputation methods.
We observed in our example that using a complete case analysis decreased sample size and appeared to bias the effect estimate away from the null. Our simulation paper in a linked claims-EHR database found that imputation using a random forest algorithm was consistently able to better recover information from existing data, even in settings with underlying MCAR patterns [32]. Therefore, we elected to use the random forest multivariate imputation by chained equations (MICE) approach to estimate the HR under study and leverage the ability of other variables to better impute missing BMI and HbA1c values.
Our study estimate is broadly similar to those from previous randomized trials, which estimated a 14% decreased risk of MACE for patients given the SGLT2i canagliflozin as compared to placebo (HR 0.86; 95% CI, 0.75 to 0.97). Similar results have been reported from observational studies [18, 19], including a recent study that found reductions in MACE among new users of SGLT2i vs DPP-4i (HR, 0.85; 95% CI, 0.75–0.95) [33]. In comparison to these existing estimates, our complete case analysis results, which showed a much larger but very uncertain benefit for SGLT2i medication, were likely strongly biased. We included confounders that reflected the degree of primary and secondary health care utilization, markers of disease severity, and comorbidities, and we recognize that these types of variables have also been recommended for use as covariates in the imputation process [34]. For each of the partially observed covariates, there was evidence that the other variables could help predict missingness. Accordingly, the model using the recommended approach for missing data was able to use much more data to generate an estimate. In this case, the addition of partially observed covariates from the EHR did not change the results when compared to a claims-data-only analysis. We suggest that either the chosen EHR confounders were not strong prognostic factors after adjusting for all claims confounders, or the missing values could not be predicted well by the imputation model. These results are likely to vary depending on the use case. Although we did not see marked differences after adding imputed values for the partially observed covariates, the SMDI approach could still be useful to assess EHR variables and potentially to inform decisions about the utility of additional variables.
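For readers comparing estimates across studies, a hazard ratio below 1 translates into a relative reduction of (1 − HR) × 100%. The short sketch below applies this to the observational estimate cited above (HR 0.85; 95% CI, 0.75–0.95); the function name is illustrative only.

```python
def relative_reduction_percent(hazard_ratio):
    """Relative reduction in hazard, in percent, implied by a hazard ratio."""
    return (1.0 - hazard_ratio) * 100.0


# Point estimate and confidence limits reported for SGLT2i vs DPP-4i [33].
point = round(relative_reduction_percent(0.85), 1)
largest = round(relative_reduction_percent(0.75), 1)
smallest = round(relative_reduction_percent(0.95), 1)

print(f"{point}% reduction (95% CI, {smallest}% to {largest}%)")
```

Note that the lower confidence limit of the HR corresponds to the largest plausible reduction and the upper limit to the smallest.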
In this paper, we have demonstrated the practical application of the SMDI toolkit and provided a pragmatic example of how to analyze, interpret, and act on SMDI results in the context of a pharmacoepidemiology study. This example, together with the previous, more detailed development and package publications [14], facilitates a better understanding of the descriptive and missingness mechanism identification process and aids the interpretation of the results derived from SMDI. Additionally, this example helps researchers acknowledge and describe missingness in their specific setting and informs analytic decisions regarding missing data and sensitivity analyses. This approach not only streamlines methodological options but may also improve the transparency of the analytic choices researchers make when using real-world data.
Several limitations to our empirical case example and application of the SMDI toolkit warrant consideration. First, our analysis did not comprehensively address missingness in exposure, outcome, and population selection, or the many analytic decisions and specifications (e.g., assumptions, the specification of the imputation model, the number of imputations, and convergence criteria) that may be required when using multiple imputation approaches, all of which may affect validity and the potential for bias. For these topics, we refer readers to related work, e.g., on sensitivity analyses under MNAR mechanisms [35–37]. Second, the small sample size potentially limited the power of the diagnostic tests and the precision of the overall study estimates. Third, our study focuses on linked claims-EHR data, where the linked population may not be representative of the source populations. Despite these considerations, which are common in many real-world situations, the SMDI toolkit was able to provide useful information about how to minimize the impact of missing data, and its function and performance would be expected to be comparable in a variety of settings beyond those presented here, particularly when used in conjunction with other best practices.
As we demonstrated in our empirical case example, toolkit results do not always provide unequivocal evidence of a single missingness mechanism across the various diagnostics, and researchers will have to consider that multiple missingness mechanisms may exist at once. Using clinical knowledge and understanding of how data are generated, both in general data types and at specific sites when using EHR data, is crucial for interpreting these data correctly. Additionally, since there is always potential for bias due to missing confounder data, and to avoid the complete case analyses that are the default in regression analyses from usual statistical software, using these approaches consistently to evaluate and address missing data is recommended.
In conclusion, application of the SMDI toolkit in a real-world setting and the resulting insights about potential missingness patterns can suggest the choice of appropriate analytic methods and increase transparency of research methods. This type of approach can improve our understanding of real-world data and may facilitate our ability to generate evidence from real-world data studies.
Supplementary Information
Acknowledgements
N/A.
Abbreviations
- AUC: Area under the ROC curve
- ASMD: Absolute standardized mean differences
- BMI: Body mass index
- DPP‐4i: Dipeptidyl peptidase‐4 inhibitors
- EHR: Electronic health records
- FDA: US Food and Drug Administration
- HR: Hazard ratio
- MACE: Major adverse cardiovascular events
- MAR: Missing at random
- MCAR: Missing completely at random
- MICE: Multivariate imputation by chained equations
- MNAR: Missing not at random
- SGLT2i: Sodium-glucose-cotransporter-2 inhibitors
- SMDI: Structural Missing Data Investigations
Authors’ contributions
SRR and BGH designed the study and worked with VN to apply the toolkit and produce the results; SRR, BGH, and VN drafted the manuscript. JW, PAS, HL, VN, ST, JGC, KJD, FT, WL, JL, JJH, RJG, and RJD contributed to the conception, design, and interpretation and provided important feedback. All authors critically reviewed the manuscript for important intellectual content and approved the final version of the manuscript.
Funding
This project was supported by Master Agreement 75F40119D10037 from the US FDA. The FDA played a role in the design of the study, interpretation of results and in writing the manuscript.
Data availability
Data supporting this study were obtained with approvals from each data source and a data use agreement that enabled individual-level linkage. Access to the data would require similar permissions from each data steward as well as new data use and data sharing agreements.
Declarations
Ethics approval and consent to participate
This study was deemed to not meet the definition of research by the Duke University Health System Institutional Review Board.
Consent for publication
This study was deemed to not meet the definition of research by the Duke University Health System Institutional Review Board.
Competing interests
The US Food and Drug Administration (FDA) approved the study protocol, statistical analysis plan and reviewed and approved this manuscript. Coauthors from the FDA participated in the results interpretation and in the preparation and decision to submit the manuscript for publication. The FDA had no role in data collection, management, or analysis. Janick Weberpals reports prior employment by Hoffmann-La Roche and previously held shares in Hoffmann-La Roche. Pamela Shaw is a named inventor on a patent licensed to Novartis by the University of Pennsylvania for an unrelated project. Sengwee Toh serves as a consultant for Pfizer, Inc. and TriNetX, LLC. on unrelated projects. Robert J Glynn has received research funding through his employer from Amarin, Kowa, Novartis, and Pfizer. Dr. Desai reports serving as Principal Investigator on investigator-initiated grants to the Brigham and Women’s Hospital from Novartis, Vertex, and Bristol-Myers-Squibb on unrelated projects. All remaining authors report no disclosures or conflicts of interest.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Franklin JM, Platt R, Dreyer NA, London AJ, Simon GE, Watanabe JH, et al. When can nonrandomized studies support valid inference regarding effectiveness or safety of new medical treatments? Clin Pharmacol Ther. 2022;111(1):108–15.
- 2. Orsini LS, Berger M, Crown W, Daniel G, Eichler HG, Goettsch W, et al. Improving transparency to build trust in real-world secondary data studies for hypothesis testing-why, what, and how: recommendations and a road map from the real-world evidence transparency initiative. Value Health. 2020;23(9):1128–36.
- 3. Hunt NB, Gardarsdottir H, Bazelier MT, Klungel OH, Pajouheshnia R. A systematic review of how missing data are handled and reported in multi-database pharmacoepidemiologic studies. Pharmacoepidemiol Drug Saf. 2021;30(7):819–26.
- 4. Moreno-Betancur M, Lee KJ, Leacy FP, White IR, Simpson JA, Carlin JB. Canonical causal diagrams to guide the treatment of missing data in epidemiologic studies. Am J Epidemiol. 2018;187(12):2705–15.
- 5. Lee KJ, Tilling KM, Cornish RP, Little RJA, Bell ML, Goetghebeur E, et al. Framework for the treatment and reporting of missing data in observational studies: the treatment and reporting of missing data in observational studies framework. J Clin Epidemiol. 2021;134:79–88.
- 6. Bell ML, Fiero M, Horton NJ, Hsu CH. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodol. 2014;14:118.
- 7. Eekhout I, de Boer RM, Twisk JW, de Vet HC, Heymans MW. Missing data: a systematic review of how they are reported and handled. Epidemiology. 2012;23(5):729–32.
- 8. Ross RK, Breskin A, Westreich D. When is a complete-case approach to missing data valid? The importance of effect-measure modification. Am J Epidemiol. 2020;189(12):1583–9.
- 9. Wang SV, Pinheiro S, Hua W, Arlett P, Uyama Y, Berlin JA, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.
- 10. Callahan A, Shah NH, Chen JH. Research and reporting considerations for observational studies using electronic health record data. Ann Intern Med. 2020;172(11 Suppl):S79–S84.
- 11. Lee KJ, Carlin JB, Simpson JA, Moreno-Betancur M. Assumptions and analysis planning in studies with missing data in multiple variables: moving beyond the MCAR/MAR/MNAR classification. Int J Epidemiol. 2023;52(4):1268–75.
- 12. Madley-Dowd P, Hughes R, Tilling K, Heron J. The proportion of missing data should not be used to guide decisions on multiple imputation. J Clin Epidemiol. 2019;110:63–73.
- 13. Mohan K, Pearl J. Graphical models for processing missing data. J Am Stat Assoc. 2021;116:1023–37.
- 14. Weberpals J, Raman SR, Shaw PA, Lee H, Hammill BG, Toh S, et al. smdi: an R package to perform structural missing data investigations on partially observed confounders in real-world evidence studies. JAMIA Open. 2024;7(1):ooae008.
- 15. Desai RJ, Matheny ME, Johnson K, Marsolo K, Curtis LH, Nelson JC, et al. Broadening the reach of the FDA Sentinel system: a roadmap for integrating electronic health record data in a causal analysis framework. NPJ Digit Med. 2021;4(1):170.
- 16. Weberpals J, Raman SR, Shaw PA, Lee H, Russo M, Hammill BG, et al. A principled approach to characterize and analyze partially observed confounder data from electronic health records. Clin Epidemiol. 2024;16:329–43.
- 17. Weberpals J. smdi: perform structural missing data investigations. Comprehensive R Archive Network. Available from: https://CRAN.R-project.org/package=smdi. Cited 2024 2/7.
- 18. Patorno E, Pawar A, Franklin JM, Najafzadeh M, Déruaz-Luyet A, Brodovicz KG, et al. Empagliflozin and the risk of heart failure hospitalization in routine clinical care. Circulation. 2019;139(25):2822–30.
- 19. Patorno E, Pawar A, Wexler DJ, Glynn RJ, Bessette LG, Paik JM, et al. Effectiveness and safety of empagliflozin in routine care patients: results from the EMPagliflozin compaRative effectIveness and SafEty (EMPRISE) study. Diabetes Obes Metab. 2022;24(3):442–54.
- 20. Zinman B, Wanner C, Lachin JM, Fitchett D, Bluhmki E, Hantel S, et al. Empagliflozin, cardiovascular outcomes, and mortality in type 2 diabetes. N Engl J Med. 2015;373(22):2117–28.
- 21. Zou CY, Liu XK, Sang YQ, Wang B, Liang J. Effects of SGLT2 inhibitors on cardiovascular outcomes and mortality in type 2 diabetes: a meta-analysis. Medicine (Baltimore). 2019;98(49):e18245.
- 22. Haneuse S, Arterburn D, Daniels MJ. Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Netw Open. 2021;4(2):e210184.
- 23. Tan ALM, Getzen EJ, Hutch MR, Strasser ZH, Gutiérrez-Sacristán A, Le TT, et al. Informative missingness: what can we learn from patterns in missing laboratory data in the electronic health record? J Biomed Inform. 2023;139:104306.
- 24. About Adult BMI: Centers for Disease Control and Prevention; 2024. Available from: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html.
- 25. Hotelling H. The generalization of Student’s ratio. Ann Math Stat. 1931;2(3):360–78.
- 26. Little RJA. A test of missing completely at random for multivariate data with missing values. J Am Stat Assoc. 1988;83(404):1198–202.
- 27. Tierney N, Cook D. Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. J Stat Softw. 2023;105(7):1–31.
- 28. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw. 2011;45(3):1–67.
- 29. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–107.
- 30. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res. 1999;8(1):3–15.
- 31. Heymans MW, Twisk JWR. Handling missing data in clinical research. J Clin Epidemiol. 2022;151:185–8.
- 32. Weberpals J, Raman SR, Shaw PA, Lee H, Hammill BG, Toh S, et al. A principled approach to characterize and analyze partially observed confounder data from electronic health records. 2024.
- 33. D’Andrea E, Wexler DJ, Kim SC, Paik JM, Alt E, Patorno E. Comparing effectiveness and safety of SGLT2 inhibitors vs DPP-4 inhibitors in patients with type 2 diabetes and varying baseline HbA1c levels. JAMA Intern Med. 2023;183(3):242–54.
- 34. Wells BJ, Chagin KM, Nowacki AS, Kattan MW. Strategies for handling missing data in electronic health record derived data. EGEMS (Wash DC). 2013;1(3):1035.
- 35. Tompsett DM, Leacy F, Moreno-Betancur M, Heron J, White IR. On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice. Stat Med. 2018;37(15):2338–53.
- 36. Weberpals J. NARFCS Sensitivity Analysis 2023. Available from: https://janickweberpals.gitlab-pages.partners.org/smdi/articles/d_narfcs_sensitivity_analysis.html#illustrative-example.
- 37. van Buuren S. Flexible Imputation of Missing Data: Chapman & Hall/CRC Press; 2018. Available from: https://stefvanbuuren.name/fimd/sec-sensitivity.html. Cited 2024 June 6.