Abstract
Evaluating appropriate methodologies for imputation of missing outcome data from electronic medical records (EMRs) is crucial but lacking for observational studies. Using US EMR in people with type 2 diabetes treated over 12 and 24 months with dipeptidyl peptidase 4 inhibitors (DPP-4i, n = 38,483) and glucagon-like peptide 1 receptor agonists (GLP-1RA, n = 8,977), predictors of missingness of disease biomarker (HbA1c) were explored. Robustness of multiple imputation (MI) by chained equations, two-fold MI (MI-2F) and MI with Monte Carlo Markov Chain were compared to complete case analyses for drawing inferences. Compared to younger people (age quartile Q1), those in age quartile Q3 and Q4 were less likely to have missing HbA1c by 25–32% (range of OR CI: 0.55–0.88) at 6-month follow-up and by 26–39% (range of OR CI: 0.50–0.80) at 12-month follow-up. People with HbA1c ≥ 7.5% at baseline were 12% (OR CI: 0.83, 0.93) and 14% (OR CI: 0.77, 0.97) less likely to have missing data at 6-month follow-up in the DPP-4i and GLP-1RA groups, respectively. All imputation methods provided similar HbA1c distributions during follow-up as observed with complete case analyses. The clinical inferences based on absolute change in HbA1c and by proportion of people reducing HbA1c to a clinically acceptable level (≤ 7%) were also similar between imputed data and complete case analyses. MI-2F method provided marginally smaller mean difference between observed and imputed data with relatively smaller standard error of difference, compared to other methods, while evaluating for consistency through artificial within-sample analyses. The established MI techniques can be reliably employed for missing outcome data imputations in large EMR-based relational databases, leading to efficiently designing and drawing robust clinical inferences in pharmaco-epidemiological studies.
Supplementary Information
The online version contains supplementary material available at 10.1007/s41666-022-00119-w.
Keywords: Missing data, Multiple imputation, Electronic medical records, Comparative effectiveness studies, Real-world evidence
Introduction
There is ever increasing emphasis on evaluating the comparative effectiveness of therapies at population level, with significant push from the regulatory bodies to use electronic medical records (EMRs) and claims data for real-world evidence generation [1]. Advances in the design and implementation of large EMRs from national primary/ambulatory care databases have created new opportunities in real-world data based clinical and pharmaco-epidemiological studies [2]. These databases have been extensively used to evaluate the risk factor changes in people with different clinical conditions [3–6]. However, one of the critical problems with EMR data, as with all longitudinal observational data, is the issue of missing data [7–10], challenging our ability to draw robust inferences from comparative effectiveness and outcome studies. This is particularly crucial in the context of regulatory considerations on real-world evidence [11, 12].
The data entry in EMRs depends on the nature and level of engagement between the individual and the clinical service provider. For example, depending upon the health system, a person with diabetes would generally be advised to get blood tests done every 6 months for the assessment of various risk factors including glucose level and lipids. However, given the severity of the disease state and the nature of anti-diabetes drug (ADD) titration, the person might need laboratory tests done more frequently. In primary care settings, it is hypothesized that younger people and those with lower risk profiles are less likely to visit the service provider leading to less frequent risk factor measures. The missing data may also arise simply because a person failed to attend the scheduled consultations. These aspects complicate the assertion about the nature of missing data in EMRs, making it difficult to appropriately differentiate between random and non-random missingness patterns. Although the problem of having significant proportions of missing data in longitudinal studies can be minimised through careful design, it is almost unavoidable in most clinical and epidemiological studies [13–15].
The inference drawn from a comparative effectiveness or safety/outcome study may be compromised when individuals have missing data on health indicators, and inadequate handling of the missing data can lead to substantial bias in the inferences drawn [12, 14, 15]. Hence, before investigating and imputing for the missing data, understanding the mechanisms behind the missing data is crucial. In practice, incomplete data are typically considered as missing at random (MAR) even if they may not be [14, 16]. In most EMRs, some variables would be expected to partially explain the variation in missingness, which supports imputation under an MAR setting [14]. A previous study reported that applying standard imputation to EMR data that are not missing at random (NMAR), without an explicit NMAR model, might produce biased estimates, although the bias might not be large [17].
Multiple imputation (MI) of missing data, compared to single imputation, accounts for the statistical uncertainty in missing values. MI can lead to consistent, asymptotically normal and efficient estimates for a dataset with an MAR missingness pattern, which makes it very attractive [10, 18]. Several statistical and machine learning methods, including MI techniques, have been used to deal with the complex problem of missing data [8, 9, 19]. There is a strong body of literature on the methodological and application aspects of MI of missing values [10, 16, 18, 20, 21]. However, studies addressing the fundamental aspects of missingness patterns in risk factor data from EMRs, and the practical implications of such missingness while conducting comparative effectiveness studies, are scarce [3, 20].
Using nationally representative EMR from the USA, the aims of this study were to (1) evaluate the association of different patient-level characteristics with the likelihood of missingness of the risk factor or outcome data and (2) investigate the performance of three MI techniques for the imputation for missing longitudinal clinical risk factor data in the context of evaluating comparative therapeutic effectiveness. Three MI techniques for missing disease biomarker data (glycated haemoglobin, HbA1c) were compared in people with type 2 diabetes (T2DM), treated with dipeptidyl peptidase-4 inhibitor (DPP-4i) and glucagon-like peptide-1 receptor agonist (GLP-1RA) under a new-user design setup. The robustness of drawing clinical inferences with imputed data and complete cases (CC) was explored for comparison of effectiveness of the therapies at population level.
Methods
Data
The Centricity Electronic Medical Records (CEMR) incorporate patient-level data from over 40,000 independent physician practices, academic medical centres, hospitals and large integrated delivery networks covering all states of the USA. The similarity of the general population characteristics and cardiometabolic risk factors in the CEMR database with those reported in the US national health surveys has been reported [22]. For example, diabetes prevalence (7.1% identified by diagnostic codes) is similar to National Diabetes Statistics (6.7% diagnosed diabetes in 2014) [23].
The database has been extensively used for academic research [4, 24]. Using CEMR database, a robust methodology for the extraction and assessment of longitudinal patient-level medication data was developed, and a detailed account of anti-diabetic drug (ADD) use in the US population was reported [25]. The clinically driven machine learning-based algorithms to identify people with type 2 diabetes from EMRs have been described [26, 27].
For more than 34 million individuals, longitudinal EMRs were available from 1995 until April 2016. This database contains comprehensive patient-level information on demographics, anthropometric, clinical and laboratory variables including age, sex, ethnicity, and longitudinal measures of HbA1c. Medication data includes brand names and doses for individual medications prescribed, along with start/stop dates and specific fields to track treatment alterations. The CEMR also contains patient reported medications, including prescriptions received outside the EMR network and over-the-counter medications [24, 25].
Study Population
The T2DM study cohort was selected under the following conditions: (1) diagnosis of T2DM, (2) 18–80 years old at the date of treatment initiation (baseline), (3) no missing data on age and sex, and (4) available baseline HbA1c measure. A robust, clinically guided and iterative machine learning algorithm that uses the disease diagnosis (ICD-9, ICD-10, and SNOMED-CT) codes, prescriptions for insulin, ADDs, HbA1c, and age was implemented to identify people with diabetes and to distinguish between its types [27]. In a new-user study design set-up, the study cohort was selected from the time of initiation of DPP-4i or GLP-1RA when added to first-line metformin (baseline). The number of people with a minimum of 12/24 months of follow-up post initiation of DPP-4i and GLP-1RA was 38,483/23,859 and 8,977/5,312, respectively. These people were receiving the respective therapies for a minimum of 1 year. HbA1c measures at baseline, 6, 12, 18, and 24 months were obtained as the nearest valid measure (4.1–30%) within 3 months either side of the time point.
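The rule for assigning an HbA1c value to each follow-up time point can be illustrated with a minimal sketch. It is not the authors' extraction code: the function names, the 91-day approximation of "3 months" and the example records are all assumptions made for illustration; only the valid range (4.1–30%) and the "nearest measure within 3 months either side" rule come from the text.

```python
from datetime import date

VALID_RANGE = (4.1, 30.0)   # valid HbA1c (%) range stated in the text
WINDOW_DAYS = 91            # assumption: "3 months" approximated as 91 days


def add_months(d, months):
    """Shift a date by whole months, clamping the day to 28 for simplicity."""
    total = d.month - 1 + months
    return date(d.year + total // 12, total % 12 + 1, min(d.day, 28))


def nearest_valid_hba1c(measurements, baseline, month):
    """Return the valid HbA1c closest to baseline + `month` months, or None.

    `measurements` is a list of (date, value) pairs for one person.
    """
    target = add_months(baseline, month)
    candidates = [
        (abs((d - target).days), value)
        for d, value in measurements
        if VALID_RANGE[0] <= value <= VALID_RANGE[1]
        and abs((d - target).days) <= WINDOW_DAYS
    ]
    # Smallest distance wins; None marks the time point as missing
    return min(candidates)[1] if candidates else None


# Hypothetical records for one person
baseline = date(2014, 1, 15)
records = [
    (date(2014, 1, 15), 8.2),
    (date(2014, 6, 1), 7.4),     # 44 days before the 6-month target
    (date(2014, 8, 1), 7.1),     # 17 days after the 6-month target
    (date(2014, 12, 20), 35.0),  # outside the valid range, ignored
]
six_month = nearest_valid_hba1c(records, baseline, 6)
twelve_month = nearest_valid_hba1c(records, baseline, 12)
```

In this example the 6-month value resolves to the closer of the two in-window measurements, while the 12-month value is missing because the only nearby record is outside the valid range.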
Statistical Methods
To identify factors associated with missingness of HbA1c measures during follow-up, the likelihood of missingness of HbA1c at 6- and 12-month follow-up from baseline was estimated for each treatment group (DPP-4i and GLP-1RA) using logistic regression, adjusting for age, sex, comorbidities (cardiovascular disease (CVD), microvascular disease including chronic kidney disease (CKD)), baseline HbA1c ≥ 7.5% and use of other ADDs. Data-driven age quartiles were considered: Q1 (18–50 years), Q2 (50–58 years), Q3 (58–66 years) and Q4 (66–80 years).
The missing HbA1c (%) measurements at 6-, 12-, 18- and 24-month follow-up were imputed using two-fold MI (MI-2F), MI by chained equations (MICE) and MI with Monte Carlo Markov Chain (MI-MCMC), and the results were compared with CC analysis [21]. Although various MI techniques are available in the literature, we chose these three methods because they can handle arbitrary missing data patterns, reduce collinearity and provide flexibility in imputing for monotone or non-monotone missingness. Details of these three methods are provided in Appendix 1.
The STATA commands twofold, mi impute with option chained and mi impute with option mvn were used to implement the MI-2F, MICE and MI-MCMC procedures, respectively [28]. These functions are well validated and used extensively in various fields of study, including with EMR data [29–31]. We set 25 imputations in all methods and aggregated them as a mean for the main analyses. Imputations were conducted for those with non-missing baseline HbA1c and at least 2 non-missing observations over the 24-month follow-up. CC analysis considered only the non-missing HbA1c at the same time points. For all imputations, imputed values were adjusted for age, sex, diabetes duration, and addition of any third-line ADD within 2 years of follow-up.
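As a simplified illustration of what an odds ratio for missingness measures, the sketch below computes an unadjusted odds ratio with a Wald (Woolf) confidence interval from a 2×2 table. This is deliberately not the adjusted logistic regression used in the study; the function name and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald (Woolf) 95% CI from a 2x2 table.

    a: comparison group, outcome missing   b: comparison group, outcome observed
    c: reference group, outcome missing    d: reference group, outcome observed
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper


# Hypothetical counts, not taken from the study:
# older quartile: 200 missing, 800 observed; Q1: 300 missing, 700 observed
or_, lo, hi = odds_ratio_ci(a=200, b=800, c=300, d=700)
```

An interval lying entirely below 1, as in this made-up example, would indicate lower odds of a missing measurement in the comparison group than in the reference group.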
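For readers unfamiliar with how results from multiple imputed datasets are combined, the following sketch applies Rubin's rules: the pooled estimate is the mean across imputations, and the pooled variance adds the between-imputation variance to the average within-imputation variance. The text states only that imputations were aggregated as a mean, so the variance pooling shown here is the standard approach rather than a confirmed detail of the study; the function name and numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared SEs with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5     # pooled estimate and pooled standard error


# Hypothetical mean HbA1c changes (%) from m = 5 imputed datasets
est = [-1.00, -1.02, -0.99, -1.01, -0.98]
var = [0.0001] * 5
pooled, pooled_se = pool_rubin(est, var)
```

The pooled standard error is larger than any single within-imputation standard error, which is how MI carries the uncertainty about the missing values into the final inference.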
Basic statistics were presented by number (percentage), mean (SD), mean (95% CI) or median (first quartile, third quartile) separately for the two treatment groups, as appropriate. Both unadjusted and adjusted change in HbA1c (%) at 6 and 12 months by the two treatment groups were evaluated, the adjustment factors being age and diabetes duration at baseline, sex and time to second-line ADD from first-line metformin. Among people with baseline HbA1c ≥ 7.5%, logistic regression was used to evaluate the odds of reducing HbA1c below 7% (glucose management target in those with T2DM) at 6 and 12 months of follow-up in the GLP-1RA group compared to the DPP-4i group. Treatment status is usually not randomized in observational comparative effectiveness studies, which implies that the outcome and treatment are not necessarily independent. To avoid this issue, we used a propensity-score-based inverse probability weighted treatment effects model adjusting/balancing for age, sex, diabetes duration at baseline and time to second-line ADD to make treatment and outcome independent conditioning on those covariates [32].
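The idea behind inverse probability weighting can be sketched as follows, assuming that propensity scores (the estimated probability of receiving the treatment given the covariates) have already been obtained, for example from a logistic regression. This is a minimal illustration with normalised weights, not the treatment effects model fitted in the study; the function name and data are hypothetical.

```python
def ipw_mean_difference(outcomes, treated, propensity):
    """Inverse-probability-weighted difference in mean outcome (treated - control).

    Uses normalised weights: 1/p for treated people, 1/(1 - p) for controls.
    """
    num_t = den_t = num_c = den_c = 0.0
    for y, t, p in zip(outcomes, treated, propensity):
        if t:
            w = 1.0 / p
            num_t += w * y
            den_t += w
        else:
            w = 1.0 / (1.0 - p)
            num_c += w * y
            den_c += w
    return num_t / den_t - num_c / den_c


# Hypothetical HbA1c changes (%), treatment flags and propensity scores
y = [-1.2, -0.8, -1.0, -0.6]
t = [1, 1, 0, 0]
p = [0.5, 0.5, 0.5, 0.5]
effect = ipw_mean_difference(y, t, p)
```

With equal propensity scores the weights cancel and the result is the plain difference in group means; with unequal scores, people who were unlikely to receive the treatment they actually received count for more.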
To evaluate the within-sample consistency of the three imputation methods, among those with complete HbA1c observations at baseline, 6 and 12 months, artificial random missing data samples (10 sets of samples) were generated at 6- and 12-month follow-up, with combinations of missingness proportions of 20–30% at the two follow-up time points. The mean differences between imputed and observed complete case HbA1c (%) at 6 and 12 months were graphed against the standard errors (SE) of the differences. Also, the trajectory of mean (95% CI) HbA1c (%) at baseline and follow-up for the two treatment groups was plotted for complete cases and for imputed data from the three imputation methods.
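The within-sample check described above can be sketched as a mask-impute-compare loop. In the sketch below the data are synthetic and the imputer is a simple mean fill used purely as a placeholder; it is not one of the three MI methods evaluated in the study, and all function names are hypothetical.

```python
import random
from statistics import mean, stdev


def mask_at_random(values, proportion, rng):
    """Return a copy of `values` with a random `proportion` set to None."""
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Placeholder imputer (NOT a study method): fill gaps with the observed mean."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def difference_summary(truth, masked, imputed):
    """Mean and standard error of (imputed - true) over the masked positions."""
    diffs = [i - t for t, m, i in zip(truth, masked, imputed) if m is None]
    return mean(diffs), stdev(diffs) / len(diffs) ** 0.5


rng = random.Random(2022)
# Synthetic complete 6-month HbA1c (%) values for 500 people
truth = [round(rng.gauss(7.0, 1.3), 1) for _ in range(500)]

results = []
for proportion in (0.20, 0.25, 0.30):
    masked = mask_at_random(truth, proportion, rng)
    imputed = mean_impute(masked)
    results.append((proportion, *difference_summary(truth, masked, imputed)))
```

Each entry of `results` holds a missingness proportion with the corresponding mean difference and its standard error, which are the two quantities plotted against each other in the consistency analysis.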
Results
We present our findings in this section. First, we describe the characteristics of people in our study cohort and describe the missing HbA1c data. We then present the possible association of various characteristics at therapy initiation with the likelihood of missing HbA1c data at 6 and 12 months of follow-up. Finally, we compare the performance of the three imputation methods on estimating the mean change in HbA1c and the likelihood of reducing HbA1c below 7%.
The basic characteristics of people at the time of initiation of DPP-4i or GLP-1RA with minimum 12- and 24-month follow-up are presented in Table 1. In the DPP-4i and GLP-1RA groups with minimum 12-month therapy exposure, respectively, the mean (SD) age was 58 (12) and 54 (11) years, 49% and 35% were male, and 27,205 out of 38,483 (71%) and 6908 out of 8977 (77%) of the people were White Caucasian. Median (Q1, Q3) HbA1c at baseline in the DPP-4i and GLP-1RA groups was 7.5 (6.8, 8.8)% and 7.1 (6.5, 8.3)%, respectively.
Table 1.
Basic statistics and missingness of HbA1c (%) by treatment group with a minimum 12-month and 24-month treatment duration with DPP-4i and GLP-1RA
| | DPP-4i (minimum 12-month treatment duration) | GLP-1RA (minimum 12-month treatment duration) | DPP-4i (minimum 24-month treatment duration) | GLP-1RA (minimum 24-month treatment duration) |
|---|---|---|---|---|
| N | 38,483 | 8977 | 23,859 | 5312 |
| Age in years† | 58 (12) | 54 (11) | 58 (12) | 54 (11) |
| Male‡ | 18,721 (49) | 3136 (35) | 11,672 (49) | 1834 (35) |
| Ethnicity‡ | | | | |
| White | 27,205 (71) | 6908 (77) | 16,901 (71) | 4131 (78) |
| Black | 4393 (11) | 711 (8) | 2682 (11) | 412 (8) |
| Asian | 828 (2) | 100 (1) | 541 (2) | 65 (1) |
| Others/unknown | 6057 (16) | 1258 (14) | 3735 (16) | 704 (13) |
| HbA1c (%)§ | | | | |
| Baseline | 8.1 (6.5, 18.8) | 7.7 (6.5, 16.5) | 8.1 (6.5, 18.8) | 7.7 (6.5, 16.5) |
| 6 months | 7.1 (5, 17.7) | 6.9 (5, 15.9) | 7.1 (5, 17.7) | 6.9 (5, 15.9) |
| 12 months | 7.2 (5, 17.9) | 7.0 (5, 15.6) | 7.2 (5, 17.9) | 7.0 (5, 15.6) |
| 18 months | - | - | 7.2 (5, 17.5) | 7.1 (5, 17.4) |
| 24 months | - | - | 7.3 (5, 17.9) | 7.1 (5, 16.7) |
| HbA1c (%)¶ | | | | |
| Baseline | 7.5 (6.8, 8.8) | 7.1 (6.5, 8.3) | 7.5 (6.8, 8.8) | 7.1 (6.5, 8.3) |
| 6 months | 6.8 (6.3, 7.5) | 6.6 (6, 7.4) | 6.8 (6.3, 7.5) | 6.6 (6, 7.4) |
| 12 months | 6.9 (6.3, 7.7) | 6.6 (6, 7.5) | 6.9 (6.3, 7.7) | 6.6 (6, 7.5) |
| 18 months | - | - | 6.9 (6.3, 7.7) | 6.7 (6.1, 7.6) |
| 24 months | - | - | 6.9 (6.3, 7.8) | 6.8 (6.1, 7.6) |
| HbA1c (%)‡ – Missingness | | | | |
| Baseline | 0 | 0 | 0 | 0 |
| 6 months | 6622 (28) | 1553 (31) | 4110 (28) | 883 (29) |
| 12 months | 6904 (30) | 1643 (32) | 4093 (28) | 945 (31) |
| 18 months | - | - | 4518 (31) | 1032 (33) |
| 24 months | - | - | 4680 (32) | 1050 (34) |
Unless stated otherwise; †: Mean (SD); ‡: N (%); §: Mean (Min, Max); ¶: Median (Q1, Q3)
There were no missing data on HbA1c at baseline by design. The proportions of missing HbA1c data for people with a minimum 12 and 24 months of treatment are presented for every 6 months of follow-up in Table 1. Among people with a minimum treatment duration of 12 months, proportions of missing HbA1c ranged from 28 to 32% in the two treatment groups. Similar missing proportions (28–34%) were observed at 24-month follow-up in people with a minimum of 24 months of treatment.
The possible association of various characteristics of people at therapy initiation with the likelihood of missing HbA1c in the study cohort at 6 and 12 months of follow-up is presented in Table 2. Age at treatment initiation had a significant influence on the likelihood of missing HbA1c. Among people treated with DPP-4i, compared to the youngest people (age quartile Q1), older people in age quartiles Q3 and Q4 were 25% (OR: 0.75; 95% CI: 0.70, 0.81) and 30% (OR: 0.70; 95% CI: 0.64, 0.76) less likely to have missing HbA1c measures at 6 months of follow-up, respectively. In the GLP-1RA group, older people in age quartiles Q3 and Q4 were 32% (OR: 0.68; 95% CI: 0.58, 0.80) and 31% (OR: 0.69; 95% CI: 0.55, 0.88) less likely to have missing HbA1c measures at 6 months of follow-up, respectively. The association of older age with a lower likelihood of missing HbA1c was similar at 12 months of follow-up and in both treatment groups.
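The percentages quoted alongside the odds ratios follow from a simple conversion, shown here as a worked example for the DPP-4i Q3 estimate at 6 months. Strictly, the result is a relative reduction in the odds of a missing measurement, which is not identical to a reduction in probability when the outcome is as common as it is here.

```latex
\text{reduction in odds} = (1 - \mathrm{OR}) \times 100\%
                         = (1 - 0.75) \times 100\% = 25\%
```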
Table 2.
Odds ratio (95% CI) and p values for likelihood of missingness of HbA1c (%) measure at 6 and 12 months of follow-up in people treated with DPP-4i and GLP1-RA adjusted for age quartiles (Q1, Q2, Q3, Q4), sex, pre-existing CVD and microvascular diseases and people with baseline HbA1c ≥ 7.5%
| | DPP-4i, 6-month follow-up | p values | GLP-1RA, 6-month follow-up | p values | DPP-4i, 12-month follow-up | p values | GLP-1RA, 12-month follow-up | p values |
|---|---|---|---|---|---|---|---|---|
| Age quartiles (vs Q1) | | | | | | | | |
| Q2 | 0.84 (0.78, 0.90) | < 0.001 | 0.80 (0.70, 0.91) | < 0.001 | 0.79 (0.74, 0.85) | < 0.001 | 0.69 (0.60, 0.78) | < 0.001 |
| Q3 | 0.75 (0.70, 0.81) | < 0.001 | 0.68 (0.58, 0.80) | < 0.001 | 0.74 (0.69, 0.80) | < 0.001 | 0.61 (0.52, 0.71) | < 0.001 |
| Q4 | 0.70 (0.64, 0.76) | < 0.001 | 0.69 (0.55, 0.88) | < 0.001 | 0.66 (0.60, 0.72) | < 0.001 | 0.62 (0.50, 0.78) | < 0.001 |
| Male vs female | 1.02 (0.97, 1.08) | 0.48 | 1.05 (0.94, 1.19) | 0.39 | 1.03 (0.98, 1.09) | 0.28 | 0.87 (0.78, 0.98) | 0.024 |
| Cardiovascular disease (CVD) | 1.06 (0.98, 1.14) | 0.16 | 1.03 (0.86, 1.23) | 0.76 | 1.08 (1.00, 1.16) | 0.06 | 1.13 (0.95, 1.35) | 0.17 |
| Microvascular disease | 1.01 (0.88, 1.16) | 0.88 | 1.02 (0.70, 1.50) | 0.30 | 0.92 (0.80, 1.06) | 0.26 | 0.99 (0.68, 1.43) | 0.95 |
| HbA1c ≥ 7.5% at baseline | 0.88 (0.83, 0.93) | < 0.001 | 0.86 (0.77, 0.97) | 0.011 | 1.00 (0.94, 1.05) | 0.81 | 0.94 (0.84, 1.05) | 0.24 |
Sex and pre-existing CVD or microvascular disease did not have a significant influence on the likelihood of missingness of HbA1c at 6- or 12-month follow-up, with the exception of a lower likelihood of missingness among men in the GLP-1RA group at 12 months (OR: 0.87; 95% CI: 0.78, 0.98; Table 2). People with HbA1c ≥ 7.5% at baseline were 12% (95% CI of OR: 0.83, 0.93) and 14% (95% CI of OR: 0.77, 0.97) less likely to have missing data at 6-month follow-up in the DPP-4i and GLP-1RA groups, respectively. However, this association disappeared at 12-month follow-up.
There was no significant difference in distribution of imputed HbA1c at 6 and 12 months based on the three imputation approaches in people treated with DPP-4i (Fig. 1A and B). In people treated with GLP-1RA, at 6 months of follow-up, although MICE indicated a slightly leptokurtic distribution due to its higher variability (SD = 1.4, Fig. 1C), there was no difference at 12 months (Fig. 1D) between the imputation methods. The distributions of imputed HbA1c were very similar to the distribution of HbA1c based on complete case analyses (Fig. 1).
Fig. 1.
Distribution of HbA1c (%) and change in HbA1c (∆ HbA1c %) at 6 months and 12 months for DPP-4i and GLP-1RA, respectively, for complete case, MICE, two-fold and MI-MCMC imputation
The estimates of unadjusted and adjusted changes in HbA1c at 6- and 12-months follow-up from baseline were also similar for all imputation approaches in both treatment groups. Also, there was no significant difference in these estimates with the CC analyses (Table 3 and Figs. 1E-H). People treated with DPP-4i had a higher mean HbA1c level at baseline (8.1% vs 7.7% in GLP-1RA) and maintained it during follow-up compared to those treated with GLP-1RA (Supplementary Fig. 1). No significant difference in the trajectories of HbA1c over 24 months of follow-up was observed between CC analyses and the three imputation techniques in both DPP-4i and GLP-1RA groups (Supplementary Fig. 1).
Table 3.
(i) Mean (SD) for HbA1c (%); (ii) change in HbA1c (%); (iii) among people with HbA1c ≥ 7.5% at baseline, the proportion with reduced HbA1c ≤ 7% at 6 and 12 months during follow-up; and (iv) odds ratio for HbA1c ≤ 7% in those treated with GLP-1RA compared to DPP-4i group at 6 and 12 months by treatment group for complete case and on imputed data by three imputation methods
| | DPP-4i, 6-month follow-up | GLP-1RA, 6-month follow-up | DPP-4i, 12-month follow-up | GLP-1RA, 12-month follow-up |
|---|---|---|---|---|
| HbA1c (%)† | | | | |
| CC | 7.1 (1.3) | 6.9 (1.3) | 7.2 (1.4) | 7.0 (1.4) |
| MICE | 7.1 (1.3) | 6.9 (1.4) | 7.2 (1.3) | 7.0 (1.4) |
| Two-fold | 7.1 (1.3) | 6.9 (1.3) | 7.2 (1.3) | 7.0 (1.4) |
| MI-MCMC | 7.1 (1.3) | 6.9 (1.3) | 7.2 (1.3) | 7.0 (1.4) |
| Change in HbA1c (%)‡, unadjusted | | | | |
| CC | −1.05 (−1.07, −1.02) | −0.89 (−0.93, −0.84) | −0.91 (−0.94, −0.89) | −0.73 (−0.77, −0.68) |
| MICE | −1.00 (−1.02, −0.98) | −0.84 (−0.87, −0.80) | −0.91 (−0.93, −0.89) | −0.71 (−0.75, −0.69) |
| Two-fold | −1.00 (−1.02, −0.98) | −0.84 (−0.87, −0.80) | −0.92 (−0.94, −0.90) | −0.72 (−0.76, −0.69) |
| MI-MCMC | −1.00 (−1.02, −0.98) | −0.85 (−0.88, −0.81) | −0.92 (−0.94, −0.90) | −0.72 (−0.76, −0.69) |
| Change in HbA1c (%)‡, adjusted for age, gender, baseline HbA1c (%), diabetes duration, and time to second-line ADD as measured on complete cases and on imputed data | | | | |
| CC | −1.14 (−1.16, −1.11) | −0.98 (−1.02, −0.94) | −1.00 (−1.03, −0.98) | −0.81 (−0.85, −0.76) |
| MICE | −1.09 (−1.11, −1.07) | −0.93 (−0.96, −0.89) | −1.00 (−1.02, −0.98) | −0.79 (−0.83, −0.75) |
| Two-fold | −1.09 (−1.11, −1.07) | −0.92 (−0.95, −0.89) | −1.01 (−1.03, −0.99) | −0.80 (−0.84, −0.77) |
| MI-MCMC | −1.09 (−1.11, −1.07) | −0.93 (−0.96, −0.89) | −1.00 (−1.02, −0.98) | −0.80 (−0.84, −0.77) |
| n (%) of people with HbA1c ≥ 7.5% at baseline achieving HbA1c ≤ 7% at follow-up§ | | | | |
| CC | 5193 (48) | 1043 (49) | 4768 (46) | 956 (48) |
| MICE | 6142 (45) | 1312 (47) | 5673 (43) | 1238 (45) |
| Two-fold | 6221 (45) | 1294 (46) | 5883 (43) | 1226 (45) |
| MI-MCMC | 6112 (45) | 1311 (47) | 5774 (43) | 1239 (45) |
| Odds ratio (95% CI)‡, adjusted for baseline HbA1c (%), age, gender, diabetes duration, time to second-line ADD and third-line ADD started within 6 (or 12) months or not | | | | |
| CC | Ref | 1.03 (1.00, 1.06) | Ref | 1.02 (0.99, 1.04) |
| MICE | | 1.03 (1.00, 1.06) | | 1.02 (1.00, 1.04) |
| Two-fold | | 1.03 (1.00, 1.06) | | 1.01 (1.00, 1.04) |
| MI-MCMC | | 1.03 (1.00, 1.05) | | 1.00 (0.99, 1.03) |
Unless stated otherwise; †: Mean (SD); ‡: Mean (95% CI); §: N (%)
Among people with clinically high HbA1c at baseline (≥ 7.5%), the proportions identified to have reduced HbA1c to a clinically acceptable level (≤ 7%) at 6- and 12-month follow-up were similar using all three imputation approaches (Table 3). In making clinical inference on the likelihood of reducing HbA1c below 7% in the GLP-1RA group, compared to those treated with DPP-4i, there was no disagreement among the three imputation approaches, and this inference was also in line with the analysis of complete cases (odds ratios and confidence intervals in Table 3).
The performance of the three imputation methods while imputing for artificially generated within-sample missing data is presented in Fig. 2. The MI-2F method provided a marginally smaller mean difference between observed and imputed data, with a relatively smaller standard error of difference, compared to the other imputation methods.
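The categorical outcome used here can be computed in a few lines. The sketch below counts, among people with baseline HbA1c of at least 7.5%, those whose follow-up value is at or below 7%, skipping missing follow-up values as a complete case analysis would. The function name and the data are hypothetical.

```python
def proportion_at_target(baseline, follow_up, high=7.5, target=7.0):
    """Count people with baseline HbA1c >= high whose follow-up value is <= target.

    Follow-up values of None (missing) are excluded, as in a complete case analysis.
    Returns (number at target, number eligible, proportion).
    """
    eligible = [f for b, f in zip(baseline, follow_up) if b >= high and f is not None]
    reached = sum(f <= target for f in eligible)
    return reached, len(eligible), reached / len(eligible)


# Hypothetical HbA1c (%) values for five people
baseline = [8.0, 7.6, 7.0, 9.1, 7.5]
follow_up = [6.8, 7.2, 6.5, None, 7.0]
reached, n, share = proportion_at_target(baseline, follow_up)
```

Running the same function on each imputed dataset, with the `None` values filled in, gives the imputed-data counterparts of this proportion.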
Fig. 2.
Mean difference by standard error difference between imputed versus observed complete case HbA1c (%) by different levels of missingness at 6 months and 12 months for MICE, two-fold and MI-MCMC imputations
Discussion
Missing data in EMRs is one of the most challenging issues pertaining to our ability to draw robust clinical inference from comparative effectiveness studies, particularly in chronic disease areas. Our extensive empirical assessment of statistical techniques to understand the factors influencing data missingness and to impute for missing risk factor data from EMRs will inform researchers in efficient design and analysis of pharmaco-epidemiological studies.
A novel component of this study is the investigation of how the likelihood of missingness of follow-up risk factor measures (HbA1c) is associated with demographic and clinical characteristics (age, sex, pre-existing comorbidities and disease severity (baseline HbA1c)). The results clearly indicated that missingness in the follow-up risk factor data is more likely in younger people, irrespective of the drug they are taking for glycaemic control. This also suggests that younger people tend to miss their scheduled clinic visits, which has significant adverse implications for population-level health management, given that people with diabetes at a young age have higher cardiovascular and mortality risk [33, 34]. We also observed that people with higher disease severity (HbA1c of 7.5% or above at baseline) are more likely to visit their care provider at 6 months post treatment intensification/therapy titration. This association disappears over longer follow-up, which is likely to be partially related to the effectiveness of the therapies in terms of better risk factor control.
Another novel component of our study is the comparative assessment of the usability and robustness of MI-imputed data for drawing clinical inferences in comparative effectiveness studies at the population level using large real-world EMRs. We observed that the inferences drawn on the risk factor changes in the context of a comparative effectiveness study are similar between CC and imputed data-based analyses. More importantly, the clinical contexts of evaluating the effectiveness of the therapies, using continuous measures of risk factors or clinical categorization of the risk factors for evaluating therapeutic achievements, were well supported, giving confidence in making robust inferences using different methods of imputation. The EMR database presents a formidable challenge in that the missing data have an intermittent (non-monotone) pattern of missingness over time, and inferences drawn on the basis of CC analyses could be biased, statistically inefficient or misleading [35]. While the possibility of non-random missingness cannot be ruled out, given the association of various factors with the patterns of missingness in the outcome variable of interest, these missingness patterns are more likely to be MAR.
One of the important aspects in this study was to evaluate whether the distribution of imputed HbA1c data reflects the anticipated distribution of HbA1c as observed from the complete case analyses. However, our intention was not to compare the effectiveness between the therapies.
The results of our study should be interpreted with caution, as we were unable to assess socioeconomic characteristics, the nature of insurance and other cultural drivers as missingness predictors due to lack of reliable data. Furthermore, while we do not account for number of clinic visits as a missing risk factor predictor, it would be interesting to investigate how many people have a missing HbA1c and at least 1 clinic visit and manually investigate the nature of visits by reviewing free text comments.
As expected in the EMRs, many people had missing HbA1c in the 6-monthly follow-up data over 24 months. We observed that in estimating changes in HbA1c at 6 and 12 months from baseline, MICE produced marginally lower estimates compared to the other two methods and a slightly leptokurtic density plot. In the extensive within-sample validation exercise, the two-fold MI method showed better consistency. The simplified nature of imputing missing values at a given time by using values at nearby times makes two-fold imputation a more attractive technique in the context of EMR databases. This reduces the complexity of the imputation models and mitigates collinearity and overfitting issues [36]. Our findings encourage further studies on the applicability of the two-fold method for imputing patient-level missing clinical or laboratory data that are measured over short time windows (days or weeks) in EMRs or patient-reported data.
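The "nearby times" idea behind two-fold imputation can be made concrete with a small sketch that lists, for each follow-up time point, which neighbouring time points would supply predictors. The window width of one time block is an assumption made for illustration, since the width used in the study is not stated in this section, and the function name is hypothetical. Time-independent covariates such as age and sex would enter every model in addition to these neighbours.

```python
def two_fold_predictors(time_points, width=1):
    """Map each time point to the neighbouring time points used to impute it."""
    plan = {}
    for i, t in enumerate(time_points):
        lo = max(0, i - width)
        hi = min(len(time_points), i + width + 1)
        # Neighbours within the window, excluding the time point itself
        plan[t] = [time_points[j] for j in range(lo, hi) if j != i]
    return plan


# Follow-up schedule in months, as used in the study
plan = two_fold_predictors([0, 6, 12, 18, 24])
```

Under this assumed width, the 12-month value is imputed from the 6- and 18-month values only, rather than from all four other time points, which is what keeps each imputation model small.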
The MAR approach assumes that, like the observed values, the missing observations are random samples generated from the same sampling distribution. In our study, the distributions of the imputed data obtained using three different imputation methodologies were similar to those of the CC data because the underlying theory behind these imputation techniques is the same, as multiple imputation is primarily based upon the MAR assumption. This is a fundamental weakness, as it is nearly impossible to distinguish between random and non-random missingness in EMRs. The missing outcome measure data in this context also raises the issue of indication bias, whereby people with better glycaemic control are under-represented in the follow-up outcome measure data. In this case, any analysis on the CC is highly likely to bias the result towards those who are doing poorly in terms of glycaemic control, as observed in this study.
Kim (2004) showed that under the regression model, the bias of the MI variance estimator decreases as the sample size increases [37]. Given the large sample size in this study, we had an almost unbiased variance estimator for the imputed data. Clearly, robust statistical techniques applied to imputed data are highly likely to produce robust and reliable clinical inferences, compared to those based on the CC analyses. In this context, while the use of all three MI techniques (MICE, two-fold and MI-MCMC) to impute for a relatively large proportion of missing outcome data is theoretically and empirically justified, their performance in prospectively designed comparative effectiveness studies with relatively small numbers of people needs to be evaluated. Finally, validation of the observed results should be conducted in different healthcare settings.
Conclusion
This extensive statistical exploratory study clearly suggests the suitability of using established multiple imputation techniques to impute for longitudinal missing risk factor or outcome data to achieve reliable inferences from EMR-based comparative effectiveness studies. This also provides a strong background on efficiently designing retrospective as well as prospective comparative effectiveness studies by evaluating the potential influences of various measurable factors on data missingness in primary outcome measures.
Acknowledgements
University of Melbourne gratefully acknowledges the support from the Australian Government’s National Collaborative Research Infrastructure Strategy (NCRIS) initiative through Therapeutic Innovation Australia. No separate funding was obtained for this study. JL acknowledges the Ph.D. scholarship from Royal Melbourne Institute of Technology, Australia, and her co-supervisors Prof. Charlie Xue and Associate Prof Tony Zhang of the same University.
Abbreviations
- ADD: Anti-diabetes drug
- CEMR: Centricity Electronic Medical Records
- CKD: Chronic kidney disease
- CC: Complete case
- CVD: Cardiovascular disease
- DPP-4i: Dipeptidyl peptidase 4 inhibitors
- EMRs: Electronic medical records
- GLP-1RA: Glucagon-like peptide-1 receptor agonists
- HbA1c: Glycated haemoglobin
- MAR: Missing at random
- MI: Multiple imputation
- MI-2F: Two-fold multiple imputation
- MI-MCMC: Multiple imputation with Monte Carlo Markov Chain
- MICE: Multiple imputation by chained equations
- NMAR: Not missing at random
- T2DM: Type 2 diabetes
Author Contribution
SKP and JL conceptualized the study. SKP, JL and MS were responsible for the primary design of the study. MS and JL extracted and analysed the data with input from OM, MS and SKP. The first draft of the manuscript was developed by SKP and JL, and all authors contributed to the finalization of the manuscript. SKP and OM had full access to all the data in the study and are the guarantors, taking responsibility for the integrity of the data and the accuracy of the data analysis.
Data Availability
The Centricity EMR data used for this study were licensed from GE Healthcare. Restrictions apply to the availability of these data.
Declarations
Ethics Approval
The CEMR database contains longitudinal information for de-identified individuals. Research studies using individual patient-level data from the CEMR database are exempt from institutional review board ethics approval and from informed consent under US Department of Health and Human Services Exemption 4 (CFR 46.101(b)(4)).
Conflict of Interest
SKP is currently a full-time employee of AstraZeneca. He has acted as a consultant and/or speaker for Novartis, GI Dynamics, Roche, AstraZeneca, Guangzhou Zhongyi Pharmaceutical and Amylin Pharmaceuticals LLC. He has received grants in support of investigator and investigator-initiated clinical studies from Merck, Novo Nordisk, AstraZeneca, Hospira, Amylin Pharmaceuticals, Sanofi-Aventis and Pfizer. JL, MS and OM have no conflict of interest to declare.
Footnotes
Sanjoy K. Paul and Joanna Ling are joint first authors.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. ElZarrad MK, Corrigan-Curay J. The US Food and Drug Administration’s real-world evidence framework: a commitment for engagement and transparency on real-world evidence. Clin Pharmacol Ther. 2019;106(1):33–35. doi: 10.1002/cpt.1389.
- 2. Hecht J. The future of electronic health records. Nature. 2019;573(7775):S114–S116. doi: 10.1038/d41586-019-02876-y.
- 3. Montvida O, Klein K, Kumar S, Khunti K, Paul SK. Addition of or switch to insulin therapy in people treated with glucagon-like peptide-1 receptor agonists: a real-world study in 66 583 patients. Diabetes Obes Metab. 2017;19(1):108–117. doi: 10.1111/dom.12790.
- 4. Montvida O, Shaw JE, Blonde L, Paul SK. Long-term sustainability of glycaemic achievements with second-line antidiabetic therapies in patients with type 2 diabetes: a real-world study. Diabetes Obes Metab. 2018;20(7):1722–1731. doi: 10.1111/dom.13288.
- 5. Zhao J, Feng Q, Wu P, et al. Learning from longitudinal data in electronic health record and genetic data to improve cardiovascular event prediction. Sci Rep. 2019;9(1):717. doi: 10.1038/s41598-018-36745-x.
- 6. Montvida O, Verma S, Shaw JE, Paul SK. Cardiometabolic risk factor control in black and white people in the United States initiating sodium-glucose co-transporter-2 inhibitors: a real-world study. Diabetes Obes Metab. 2020;22(12):2384–2397. doi: 10.1111/dom.14164.
- 7. Carroll OU, Morris TP, Keogh RH. How are missing data in covariates handled in observational time-to-event studies in oncology? A systematic review. BMC Med Res Methodol. 2020;20(1):134. doi: 10.1186/s12874-020-01018-7.
- 8. Biering K, Hjollund NH, Frydenberg M. Using multiple imputation to deal with missing data and attrition in longitudinal studies with repeated measures of patient-reported outcomes. Clin Epidemiol. 2015;7:91–106. doi: 10.2147/clep.s72247.
- 9. Thomas G, Klein K, Paul S. Statistical challenges in analysing large longitudinal patient-level data: the danger of misleading clinical inferences with imputed data. J Indian Soc Agric Stat. 2014;68(2):39–54.
- 10. Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:157–160. doi: 10.1136/bmj.b2393.
- 11. Kahn MG, Callahan TJ, Barnard J, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Washington, DC) 2016;4(1):1244. doi: 10.13063/2327-9214.1244.
- 12. Girman CJ, Ritchey ME, Zhou W, Dreyer NA. Considerations in characterizing real-world data relevance and quality for regulatory purposes: a commentary. Pharmacoepidemiol Drug Saf. 2019;28(4):439–442. doi: 10.1002/pds.4697.
- 13. Little RJ, D'Agostino R, Cohen ML, et al. The prevention and treatment of missing data in clinical trials. N Engl J Med. 2012;367(14):1355–1360. doi: 10.1056/NEJMsr1203730.
- 14. Wells BJ, Chagin KM, Nowacki AS, Kattan MW. Strategies for handling missing data in electronic health record derived data. EGEMS (Washington, DC) 2013;1(3):1035. doi: 10.13063/2327-9214.1035.
- 15. Madden JM, Lakoma MD, Rusinak D, Lu CY, Soumerai SB. Missing clinical and behavioral health data in a large electronic health record (EHR) system. J Am Med Inform Assoc. 2016. doi: 10.1093/jamia/ocw021.
- 16. Mackinnon A. The use and reporting of multiple imputation in medical research - a review. J Intern Med. 2010;268(6):586–593. doi: 10.1111/j.1365-2796.2010.02274.x.
- 17. Lin JH, Haug PJ. Exploiting missing clinical data in Bayesian network modeling for predicting medical problems. J Biomed Inform. 2008;41(1):1–14. doi: 10.1016/j.jbi.2007.06.001.
- 18. Spratt M, Carpenter J, Sterne JA, et al. Strategies for multiple imputation in longitudinal studies. Am J Epidemiol. 2010;172(4):478–487. doi: 10.1093/aje/kwq137.
- 19. Jerez JM, Molina I, García-Laencina PJ, et al. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med. 2010;50(2):105–115. doi: 10.1016/j.artmed.2010.05.002.
- 20. Bounthavong M, Watanabe JH, Sullivan KM. Approach to addressing missing data for electronic medical records and pharmacy claims data research. Pharmacotherapy. 2015;35(4):380–387. doi: 10.1002/phar.1569.
- 21. Carpenter J, Kenward M (2013) Multiple imputation and its application. Wiley
- 22. Montvida O, Dibato J, Paul SK (2020) Evaluating the representativeness of US Centricity electronic medical records with reports from Centers for Disease Control and Prevention: office visits and cardiometabolic conditions. JMIR Medical Informatics, in production
- 23. Centers for Disease Control and Prevention (2014) National diabetes statistics report: estimates of diabetes and its burden in the United States, 2014. Atlanta, GA: US Department of Health and Human Services
- 24. Paul SK, Bhatt DL, Montvida O. The association of amputations and peripheral artery disease in patients with type 2 diabetes mellitus receiving sodium-glucose cotransporter type-2 inhibitors: real-world study. Eur Heart J. 2020;42(18):1728–1738. doi: 10.1093/eurheartj/ehaa956.
- 25. Montvida O, Shaw J, Atherton JJ, Stringer F, Paul SK. Long-term trends in antidiabetes drug usage in the US: real-world evidence in patients newly diagnosed with type 2 diabetes. Diabetes Care. 2018;41(1):69–78. doi: 10.2337/dc17-1414.
- 26. Moreno-Iribas C, Sayon-Orea C, Delfrade J, et al. Validity of type 2 diabetes diagnosis in a population-based electronic health record database. BMC Med Inform Decis Mak. 2017;17(1):34. doi: 10.1186/s12911-017-0439-z.
- 27. Owusu Adjah ES, Montvida O, Agbeve J, Paul SK (2017) Data mining approach to identify disease cohorts from primary care electronic medical records: a case of diabetes mellitus. The Open Bioinformatics Journal 10(1)
- 28. StataCorp LLC. Stata multiple-imputation reference manual, release 17. Texas: Stata Press; 2021.
- 29. Welch C, Bartlett J, Petersen I. Application of multiple imputation using the two-fold fully conditional specification algorithm in longitudinal clinical data. Stata J. 2014;14(2):418–431. doi: 10.1177/1536867X1401400213.
- 30. Royston P, White IR. Multiple imputation by chained equations (MICE): implementation in Stata. J Stat Softw. 2011;45(4):20. doi: 10.18637/jss.v045.i04.
- 31. Lee KJ, Carlin JB. Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. Am J Epidemiol. 2010;171(5):624–632. doi: 10.1093/aje/kwp425.
- 32. Cattaneo MD. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics. 2010;155(2):138–154. doi: 10.1016/j.jeconom.2009.09.023.
- 33. Ellis DA, McQueenie R, McConnachie A, Wilson P, Williamson AE. Demographic and practice factors predicting repeated non-attendance in primary care: a national retrospective cohort analysis. The Lancet Public Health. 2017;2(12):e551–e559. doi: 10.1016/S2468-2667(17)30217-7.
- 34. Dibato JE, Montvida O, Zaccardi F, et al. Association of cardiometabolic multimorbidity and depression with cardiovascular events in early-onset adult type 2 diabetes: a multiethnic study in the US. Diabetes Care. 2020;44(1):231–239. doi: 10.2337/dc20-2045.
- 35. Little RJA, Rubin DB (2002) Statistical analysis with missing data. 2nd edn. Wiley-Interscience
- 36. Welch CA, Petersen I, Bartlett JW, et al. Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Stat Med. 2014;33(21):3725–3737. doi: 10.1002/sim.6184.
- 37. Kim JK. Finite sample properties of multiple imputation estimators. Ann Stat. 2004;32(2):766–783. doi: 10.1214/009053604000000175.