Abstract
Purpose
To assess the performance of different machine learning (ML) approaches in identifying risk factors for diabetic ketoacidosis (DKA) and predicting DKA.
Methods
This study applied flexible ML (XGBoost, distributed random forest [DRF] and feedforward network) and conventional ML approaches (logistic regression and least absolute shrinkage and selection operator [LASSO]) to 3400 DKA cases and 11 780 controls nested in adults with type 1 diabetes identified from Optum® de‐identified Electronic Health Record dataset (2007–2018). Area under the curve (AUC), accuracy, sensitivity and specificity were computed using fivefold cross validation, and their 95% confidence intervals (CI) were established using 1000 bootstrap samples. The importance of predictors was compared across these models.
Results
In the training set, XGBoost and feedforward network yielded higher AUC values (0.89 and 0.86, respectively) than logistic regression (0.83), LASSO (0.83) and DRF (0.81). However, the AUC values were similar (0.82) among these approaches in the test set (95% CI range, 0.80–0.84). While the accuracy values were >0.8 and the specificity values were >0.9 for all models, the sensitivity values were only about 0.4. The differences in these metrics across these models were minimal in the test set. All approaches selected some known risk factors for DKA among their top 10 features. XGBoost and DRF included more laboratory measurements or vital signs compared with conventional ML approaches, while feedforward network included more social demographics.
Conclusions
In our empirical study, all ML approaches demonstrated similar performance, and identified overlapping, but different, top 10 predictors. The difference in selected top predictors needs further research.
Keywords: AUC, diabetic ketoacidosis, least absolute shrinkage and selection operator, logistic regression, machine learning, prediction model
KEY POINTS.
- Flexible machine learning (ML) approaches do not automatically result in improved performance over conventional ML approaches
- The performances of flexible ML and conventional ML approaches were similar in predicting diabetic ketoacidosis (DKA) using an electronic health records data source
- Flexible ML and conventional ML approaches identified overlapping, but different, top 10 risk factors for DKA
- Flexible ML approaches only provided the relative importance for each predictor, while logistic regression could also estimate odds ratio for each predictor
- Interpretation of the findings of flexible ML approaches is challenging
1. INTRODUCTION
Artificial intelligence including machine learning (ML) has been increasingly used to analyze healthcare data including electronic health records (EHR). 1 ML is a natural extension to conventional analysis approaches such as logistic regression, and has been widely used to learn complex relationships or patterns from data to make accurate predictions. 1 , 2 Conventional analysis approaches are commonly used to identify risk factors for a disease or outcome in clinical epidemiology. However, there is limited data on performance of different ML approaches in such studies, and the theoretical superiority of more flexible ML such as XGBoost and distributed random forest (DRF) over conventional ML approaches has not been consistently observed in real‐world settings. 3
Diabetic ketoacidosis (DKA) is an acute life‐threatening but preventable complication of type 1 diabetes (T1D). 4 In the United States, DKA hospitalization increased by 54.9% from 2009 to 2014 after a decline in 2000–2009. 5 Because DKA is caused by insulin deficiency that is often precipitated by discontinuation of insulin or inadequate insulin treatment, it is important to identify risk factors for DKA and to predict DKA for effective T1D management. However, no prediction model for DKA has been developed and used in clinical practice. Given the growing use of ML in clinical prediction including pharmacoepidemiology 6 with limited model performance assessment, we conducted a study to assess the empirical performance of different ML approaches in identifying risk factors for DKA and predicting DKA in adults with T1D using an EHR database.
2. METHODS
Different ML approaches were applied in a case–control study nested in adults with T1D to identify risk factors and predict DKA. The nested case–control design can more readily and efficiently identify risk factors for DKA compared with a cohort design. Previous studies have demonstrated that this design can be used for development of clinical prediction models. 7 , 8
2.1. Data source
Optum de‐identified EHR database was used in this study. It currently encompasses approximately 80 million patients from all Census regions in the United States, with at least 5 million patients from each region. Data on the full spectrum of inpatient and outpatient treatments are collected from more than 140 000 physicians at more than 600 hospitals and 6500 clinics. On average, patients contribute 4 years of medical history to the database. Information such as patient‐reported symptoms and outcomes as well as treatment rationale is captured directly from providers' notes via natural language processing. Approximately 82% of patients in this database are part of an Integrated Delivery Network, which includes hospital and emergency care as well as outpatient visits. In addition, about 20% of patients can be linked with administrative claims data.
2.2. Study patients
Patients with T1D were identified from Optum EHR database between January 1, 2007 and September 30, 2018 by adapting the Klompas algorithm. 9 The positive predictive value (PPV) of T1D using the Klompas algorithm was 89% in the original publication and 94.5% in an external validation study. 9 , 10 For each patient with T1D, the start of follow up for DKA event was the later of the first date on which a patient met the diabetes surveillance algorithm criteria or when a patient turned 18 years. The end of follow up was the date of first hospitalized DKA event during the follow‐up, the date of last‐recorded clinical activity in the database, or September 30, 2018, whichever occurred earliest.
Hospitalized DKA events were identified using ICD‐9‐CM (250.1) or ICD‐10‐CM (E1X.1) diagnosis codes, which have been reported to have a PPV of 88.9%. 11 Because DKA is an emergency condition and patients with DKA rarely meet criteria to be safely discharged from emergency departments, 12 outpatient or emergency encounters without subsequent hospitalization were excluded to minimize false positive cases. The index date of a case was the date of the first DKA event that occurred during follow‐up. The DKA cases included in the study had to have ≥1 non‐emergency clinical activity encounter and insulin treatment within 365 days before the index date, T1D for ≥365 days before the index date, and ≥1 HbA1c measurement within 183 days before the index date. Cases who were pregnant and those who used antihyperglycemic agents indicated for type 2 diabetes only (except for metformin) within 365 days prior to or on the index date were excluded.
For each DKA case, up to 10 controls without DKA on the index date of a case were randomly selected from the T1D cohort using incidence‐density sampling without replacement and matched on whether a patient's EHR data was linked with claims data. The index date for controls was assigned as the date of DKA diagnosis of their matching case. The same inclusion and exclusion criteria for cases were also applied to controls.
2.3. Potential predictors
Seven groups of potential predictors (i.e., features in ML) and the T1D cohort entry year as well as year of the index date were explored. These groups included social demographics, lifestyle factors, health service use, treatment, chronic comorbidities, acute medical conditions as well as laboratory test results, vital signs and other common measurements (Table S1). The selection of predictors was guided by the background knowledge of DKA and data availability.
2.4. Statistical methods
The overall structure of patient data that served as input to the analysis algorithm is shown in Figure 1. All study subjects were first randomly split into a training set and a test set in a ratio of 4:1. The training set was used in the feature selection preprocessing and model building. The test set was used to assess each model's performance.
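The 4:1 split described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name, the use of integer subject IDs, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_train_test(subject_ids, test_fraction=0.2, seed=0):
    """Randomly split subject IDs into training and test sets (4:1 by default)."""
    ids = list(subject_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# 3400 cases + 11 780 controls = 15 180 subjects in total.
train_ids, test_ids = split_train_test(range(3400 + 11780))
print(len(train_ids), len(test_ids))
```

Splitting at the subject level before any preprocessing keeps the test set untouched by feature selection and imputation choices.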
FIGURE 1.

Data processing flowchart. DKA, diabetic ketoacidosis; LASSO, least absolute shrinkage and selection operator; T1D, type 1 diabetes
2.4.1. Feature selection preprocessing
We carried out the feature selection preprocessing on all potential predictors in the training set. First, any features with missing data ≥60% were removed. This threshold was chosen to retain as many laboratory measurements as possible, because their missing values were most likely due to an absence of testing. The remaining features with missing data are listed in Table S2. The missing records of these features were each imputed with the average value of the available records of that feature or grouped as “unknown” where applicable. Without imputation, many patients' records would have been too incomplete for the models to make predictions. 13 Second, to avoid collinearity among the features, which can cause unstable estimates and sign flipping of the coefficients, 14 , 15 the variance inflation factor (VIF) for each feature was calculated (Table S3). Features with VIF values ≥10 were flagged as highly correlated features. 16 Within each group of highly correlated features, or of features that caused sign flipping, only the one feature with higher clinical utility or a lower percentage of missing data was selected.
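A minimal sketch of the two preprocessing steps, under stated assumptions: missing values are represented as `None`, only numeric features are shown, and the helper names and toy data are hypothetical. The VIF helper uses the standard definition VIF = 1 / (1 − R²), where R² comes from regressing one feature on all the others; the regression itself is not shown.

```python
from statistics import mean

MISSING_THRESHOLD = 0.60  # features with >= 60% missing values are dropped

def missing_fraction(values):
    """Share of records whose value is missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def drop_and_impute(features):
    """Drop sparse numeric features, then mean-impute the remaining gaps."""
    kept = {}
    for name, values in features.items():
        if missing_fraction(values) >= MISSING_THRESHOLD:
            continue  # too sparse to keep
        observed_mean = mean(v for v in values if v is not None)
        kept[name] = [observed_mean if v is None else v for v in values]
    return kept

def vif_from_r_squared(r_squared):
    """Standard VIF definition, given R^2 of one feature regressed on the others."""
    return 1.0 / (1.0 - r_squared)

# Toy data, for illustration only.
toy = {
    "hba1c": [9.0, None, 8.0, 7.0, 8.0],        # 20% missing: kept and imputed
    "rare_lab": [None, None, None, 1.2, None],  # 80% missing: dropped
}
cleaned = drop_and_impute(toy)
```

Under this definition, the VIF ≥ 10 cut-off corresponds to a feature for which at least 90% of the variance is explained by the other features.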
2.4.2. Predictive modeling algorithms
Two conventional and three flexible ML approaches were utilized to identify important risk factors for DKA from the selected features and to build prediction models.
Conventional ML approaches included logistic regression and least absolute shrinkage and selection operator (LASSO). For logistic regression, backward selection with a cutoff p‐value of 0.05 was used to select statistically significant predictors; adjusted odds ratio (OR) and its 95% confidence interval (CI) was estimated for each selected predictor. LASSO is a technique of parameter regularization, which was applied to a logistic regression. During regularization, penalties are introduced to the model building process to avoid over‐fitting and reduce the number of covariates. 17
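For reference, the L1‐penalized logistic regression objective can be written in its standard textbook form. The notation is assumed here rather than taken from the study: n subjects, m candidate predictors, outcome y_i (1 for a case, 0 for a control), predictor vector x_i, and penalty weight λ, which would be tuned by cross validation.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i) \Big]
  + \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert,
\qquad
\pi_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```

Larger values of λ shrink more coefficients exactly to zero, which is how the penalty reduces the number of covariates; with λ = 0 the objective reduces to ordinary logistic regression.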
Flexible ML approaches included XGBoost, DRF and feedforward network. XGBoost is a supervised learning algorithm that implements a process called boosting to yield accurate models. 18 Boosting refers to the ensemble learning technique of building many models sequentially, with each new model attempting to correct for the deficiencies in the previous one. DRF generates a forest of classification trees, rather than a single classification or regression tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees can reduce the variance. The classification process takes the average prediction over all trees to make a final prediction. Neural networks are learning algorithms inspired by brain research. The feedforward network we used is the simplest form of neural network because information flows only in the forward direction, without circling back. 19 It consists of an input layer, an output layer, and several hidden layers in between. Each layer includes multiple nodes, and the weights on the connections between nodes, combined with the input, determine the output of the network. During the learning process, these weights are updated to minimize the loss function. 20
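The forward flow of information in a feedforward network can be illustrated with a very small sketch. It is a simplification, not the network used in the study: it has a single hidden layer, sigmoid activations are an assumption, and the weights are hypothetical numbers rather than trained values.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias, per node."""
    return [sum(w * x for w, x in zip(node_w, inputs)) + b
            for node_w, b in zip(weights, biases)]

def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (sigmoid) -> single output probability."""
    hidden = [sigmoid(z) for z in dense(features, hidden_w, hidden_b)]
    (logit,) = dense(hidden, [out_w], [out_b])
    return sigmoid(logit)

# Hypothetical weights for 2 inputs and 2 hidden nodes; training would adjust
# these values to minimize the loss function.
p = forward([0.5, -1.0],
            hidden_w=[[0.1, 0.2], [-0.3, 0.4]], hidden_b=[0.0, 0.0],
            out_w=[1.0, -1.0], out_b=0.0)
```

Each value is computed only from the layer before it, which is what "feedforward" means; a deeper network would repeat the hidden step once per hidden layer.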
2.4.3. Cross validation
During the training process, k‐fold cross validation was conducted for each model. That is, the training set was further split into a training subset and a validation subset in a ratio of 4:1 to tune hyperparameters. This process was repeated five times (k = 5). The final model was then built by aggregating the five cross validation models and evaluated in the full training data set. Further details are provided in Appendix I.
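The fivefold partitioning can be sketched as below. This shows only how rows are divided into training and validation subsets; the generator name is hypothetical, and it assumes the rows have already been shuffled. Model fitting and hyperparameter tuning are not shown.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, valid_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

With k = 5, each iteration trains on four fifths of the training set and validates on the remaining fifth, which matches the 4:1 ratio described above.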
2.4.4. Model performance assessment
The model performance was assessed using different metrics below in the test set.
Area under the receiver operating characteristic (ROC) curve (AUC): The ROC curve plots the true positive rate against the false positive rate, 21 and the AUC ranges from 0 to 1. A value of 1 indicates that the model predicts perfectly. Accuracy, specificity and sensitivity values were calculated based on a confusion matrix. To build a confusion matrix, a specific threshold value is required to determine whether a predicted probability is classified as a case or a control. 22 We chose the threshold value that optimized accuracy. Finally, to demonstrate the variability of the model predictions, the 95% CI of AUC, accuracy, specificity, and sensitivity values were established using 1000 bootstrap samples. 23
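The metrics above can be sketched in plain Python. Several choices here are assumptions rather than details from the study: the AUC is computed in its rank‐based form (the probability that a randomly chosen case scores higher than a randomly chosen control), candidate thresholds are the observed scores, and the bootstrap interval uses the percentile method. The labels and scores are toy values.

```python
import random

def auc(labels, scores):
    """AUC as P(case score > control score); ties count one half."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def confusion_metrics(labels, scores, threshold):
    """Accuracy, sensitivity and specificity when score >= threshold is called a case."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(labels, scores))
    tn = sum(y == 0 and s < threshold for y, s in zip(labels, scores))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {"accuracy": (tp + tn) / len(labels),
            "sensitivity": tp / n_pos,
            "specificity": tn / n_neg}

def accuracy_optimal_threshold(labels, scores):
    """Return the lowest observed score that maximizes accuracy as a cut-off."""
    return max(sorted(set(scores)),
               key=lambda t: confusion_metrics(labels, scores, t)["accuracy"])

def bootstrap_ci(labels, scores, metric, n_boot=1000, seed=0):
    """Percentile 95% CI from resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to compute the metric
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Toy data: 1 = case, 0 = control.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold = accuracy_optimal_threshold(labels, scores)
metrics = confusion_metrics(labels, scores, threshold)
ci_low, ci_high = bootstrap_ci(labels, scores, auc, n_boot=1000)
```

Note that the AUC does not depend on any threshold, whereas accuracy, sensitivity and specificity all change with the cut‐off that is chosen.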
2.4.5. Feature importance
For each approach, the feature importance percentage was determined by calculating the relative influence of each predictor. For conventional ML approaches, it was derived and ranked by the magnitude of the standardized coefficient of each selected statistically significant predictor. 24 For XGBoost and DRF, two factors were considered to determine the relative importance of each feature: whether the variable was used to split a decision tree node and how much the prediction error was reduced as a result of the split. That is, when a split on a feature contributed to a larger decrease in the squared error, that feature was regarded as having greater relative influence. 25 For feedforward network, the Gedeon method was used to calculate feature importance. 26 It considers the weights connecting the input features to the hidden layers.
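For the conventional approaches, ranking by standardized coefficient magnitude can be sketched as follows. Expressing each magnitude as a share of the total is an assumption about how the "importance percentage" was scaled, and the coefficients are hypothetical, not the study's estimates.

```python
def importance_percentages(standardized_coefs):
    """Rank predictors by |standardized coefficient|, as a share of the total."""
    total = sum(abs(b) for b in standardized_coefs.values())
    ranked = sorted(standardized_coefs.items(),
                    key=lambda kv: abs(kv[1]), reverse=True)
    return [(name, 100.0 * abs(b) / total) for name, b in ranked]

# Hypothetical coefficients, for illustration only.
example = {"hba1c": 0.50, "age": -0.30, "pulse_rate": 0.20}
ranking = importance_percentages(example)
for name, share in ranking:
    print(f"{name}: {share:.1f}%")
```

The sign of a coefficient is discarded in the ranking, so a strongly protective predictor ranks as high as an equally strong risk factor; the direction is recovered from the odds ratio instead.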
A model can be simplified by only including the top 10 features, and we assessed the AUC of the top 10 feature model for each approach.
The data management was conducted using Palantir Foundry system (https://www.palantir.com/palantir-foundry/) housed in Sanofi. The control selection process via incidence‐density sampling was conducted using SAS 9.4 (SAS Institute, Cary, NC), and all other statistical analyses were performed using R software version 3.4 (www.r-project.org). We used H2O R package to implement the conventional and flexible ML processes. 27
3. RESULTS
A total of 3400 DKA cases and 11 780 controls were selected for the final analysis (Table 1). After the feature selection preprocessing, 43 features were selected to predict DKA and are described in Table 2. Compared with controls, DKA cases were younger, had lower socioeconomic status and had more comorbidities. The mean HbA1c level was 9.3% for DKA cases and 8.0% for controls.
TABLE 1.
Study population attrition process
| Patient counts (January 01, 2007–September 30, 2018) | ||
|---|---|---|
| Individuals in Optum® de‐identified Electronic Health Record database | 95 823 300 | |
| Individuals with diabetes | 7 153 077 | |
| Individuals with type 1 diabetes | 169 779 | |
| Individuals aged ≥18 years and within an Integrated Delivery Network | 130 052 | |
| Individuals with at least 1 HbA1c measurement and at least 1 year of clinical activity any time | 105 816 | |
| DKA case and control selection | ||
| Potential candidates | Potential DKA cases, N = 15 454 | Potential controls a, N = 105 816 |
| After application of the study criteria on potential DKA cases before matching | 3400 | NA |
| Control selection via incidence density sampling based on 1:10 matching ratio | 3400 | 34 000 |
| After application of the same study criteria defined above on controls | 3400 | 11 780 |
Abbreviations: DKA, diabetic ketoacidosis; HbA1c, hemoglobin A1c.
a A control could not develop DKA before or at the matched index date but could become a case after the index date.
TABLE 2.
Selected characteristics of DKA cases and controls
| DKA cases N = 3400 | Controls N = 11 780 | p‐Value a | |
|---|---|---|---|
| Calendar year of T1D cohort entry (%) | |||
| 2007 | 554 (16.3) | 1831 (15.5) | 0.111 |
| 2008 | 578 (17.0) | 1910 (16.2) | |
| 2009 | 367 (10.8) | 1175 (10.0) | |
| 2010 | 414 (12.2) | 1469 (12.5) | |
| 2011 | 376 (11.1) | 1286 (10.9) | |
| 2012 | 370 (10.9) | 1257 (10.7) | |
| 2013 | 283 (8.3) | 1009 (8.6) | |
| 2014 | 228 (6.7) | 847 (7.2) | |
| 2015 | 131 (3.9) | 532 (4.5) | |
| 2016 | 80 (2.4) | 361 (3.1) | |
| 2017 | 19 (0.6) | 103 (0.9) | |
| Age, years, mean (SD) | 42.9 (16.5) | 45.4 (16.4) | <0.001 |
| Sex (%) | |||
| Female | 1818 (53.5) | 5594 (47.5) | <0.001 |
| Male | 1579 (46.4) | 6184 (52.5) | |
| Unknown | 3 (0.1) | 2 (0.0) | |
| Race (%) | |||
| Caucasian | 2929 (86.1) | 10 653 (90.4) | <0.001 |
| African American | 346 (10.2) | 584 (5.0) | |
| Asian | 12 (0.4) | 71 (0.6) | |
| Other/Unknown | 113 (3.3) | 472 (4.0) | |
| Annual household income, $, mean (SD) | 41 773 (8302) | 42 935 (8938) | <0.001 |
| Insurance type (%) | |||
| Commercial | 1306 (38.4) | 6342 (53.8) | <0.001 |
| Medicare | 621 (18.3) | 1519 (12.9) | |
| Medicaid | 514 (15.1) | 727 (6.2) | |
| Other payor type | 134 (3.9) | 353 (3.0) | |
| Uninsured | 202 (5.9) | 255 (2.2) | |
| Unknown | 623 (18.3) | 2584 (21.9) | |
| Geographic region (%) | |||
| Midwest | 2042 (60.1) | 6792 (57.7) | <0.001 |
| Northeast | 315 (9.3) | 1389 (11.8) | |
| South | 646 (19.0) | 2237 (19.0) | |
| West | 269 (7.9) | 1008 (8.6) | |
| Other/Unknown | 128 (3.8) | 354 (3.0) | |
| Lifestyle risk factors within 365 days before index date (%) | |||
| Alcohol abuse | 175 (5.1) | 226 (1.9) | <0.001 |
| Controlled substance abuse | 523 (15.4) | 454 (3.9) | <0.001 |
| Health service use within 365 days before index date (%) | |||
| Visit to endocrinologist | 1789 (52.6) | 7129 (60.5) | <0.001 |
| Visit to primary care | 1781 (52.4) | 4959 (42.1) | <0.001 |
| Chronic comorbidities any time between study start date and index date (%) | |||
| Cardiovascular disease | 2042 (60.1) | 5927 (50.3) | <0.001 |
| Diabetic microvascular complications | 1981 (58.3) | 5160 (43.8) | <0.001 |
| Chronic liver disease | 244 (7.2) | 435 (3.7) | <0.001 |
| Chronic kidney disease | 859 (25.3) | 1328 (11.3) | <0.001 |
| Dementia | 137 (4.0) | 217 (1.8) | <0.001 |
| Psychiatric disorder | 1743 (51.3) | 3580 (30.4) | <0.001 |
| Autoimmune disorders | 432 (12.7) | 1577 (13.4) | 0.315 |
| Cancer | 377 (11.1) | 1264 (10.7) | 0.575 |
| Acute medical conditions (%) | |||
| Infection within 7 days before index date | 230 (6.8) | 96 (0.8) | <0.001 |
| Major surgery within 7 days before index date | 45 (1.3) | 6 (0.1) | <0.001 |
| Non‐DKA hospitalization within 30 days before index date | 610 (17.9) | 168 (1.4) | <0.001 |
| Treatments (%) | |||
| Insulin pump within 7 days before index date | 154 (4.5) | 153 (1.3) | <0.001 |
| Insulin type within 7 days before index date | |||
| Intermediate/long‐acting insulin | 218 (6.4) | 309 (2.6) | <0.001 |
| Rapid/short‐acting insulin | 285 (8.4) | 512 (4.3) | <0.001 |
| Premixed insulin | 10 (0.3) | 15 (0.1) | 0.061 |
| Other medications within 30 days before index date | |||
| Systemic steroids | 81 (2.4) | 101 (0.9) | <0.001 |
| Diuretics | 132 (3.9) | 221 (1.9) | <0.001 |
| Antipsychotics | 60 (1.8) | 57 (0.5) | <0.001 |
| Laboratory test results or vital signs within 183 days before index date, mean (SD) b | |||
| HbA1c, % | 9.3 (1.8) | 8.0 (1.4) | <0.001 |
| Random blood glucose level, mg/dl | 194.6 (60.6) | 169.9 (66.3) | <0.001 |
| eGFR, ml/min/1.73m2 | 83.1 (36.9) | 96.4 (30.5) | <0.001 |
| Total cholesterol, mg/dl | 178.5 (46.7) | 173.3 (38.3) | <0.001 |
| Systolic blood pressure, mm Hg | 126.8 (15.6) | 124.4 (13.8) | <0.001 |
| BMI, kg/m2 | 26.4 (5.8) | 27.9 (5.7) | <0.001 |
| Height, cm | 169.6 (10.2) | 171.1 (10.2) | <0.001 |
| White blood cell count, x103 per microliter | 9.0 (3.3) | 7.6 (2.7) | <0.001 |
| Platelet count, x103 per microliter | 268.4 (79.7) | 254.5 (72.4) | <0.001 |
| Temperature, °C | 36.7 (0.3) | 36.7 (0.3) | <0.001 |
| Pulse rate, beats per minute | 85.0 (12.8) | 78.7 (11.9) | <0.001 |
| Respiratory rate, breaths per minute | 17.4 (2.1) | 16.7 (2.1) | <0.001 |
| Hemoglobin, g/dl | 12.6 (2.1) | 13.4 (1.9) | <0.001 |
| Oxygen saturation, SpO2 (pulse oximetry) | 97.6 (1.5) | 97.6 (1.5) | 0.264 |
Abbreviations: BMI, body mass index; DKA, diabetic ketoacidosis; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; SD, standard deviation.
a Based on univariate analysis.
b Based on non‐missing values.
3.1. Model performance with full set of features
In the training set, XGBoost outperformed the other 4 approaches with an AUC of 0.887, followed by feedforward network (AUC = 0.859), LASSO and logistic regression (AUC = 0.829 for each), and DRF (AUC = 0.808). In the test set, the AUC values ranged from 0.817 to 0.821 among these models and the difference decreased to 0.004 only (Table 3). The 95% CI of accuracy values ranged between 0.812 and 0.839. While the specificity values were higher than 0.9 for all models, the sensitivity values were only as high as 0.4. Consistent with the AUC findings, the differences in accuracy, sensitivity and specificity between flexible and conventional ML approaches were all minimal in the test set (Table 3). The confusion matrices are provided in Table S4.
TABLE 3.
The performance of study models with full set of features in the test data set
| Models | AUC (95% CI) | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) |
|---|---|---|---|---|
| Logistic regression | 0.821 (0.804–0.837) | 0.827 (0.814–0.839) | 0.409 (0.321–0.497) | 0.947 (0.925–0.969) |
| LASSO | 0.821 (0.805–0.838) | 0.827 (0.814–0.839) | 0.407 (0.318–0.496) | 0.948 (0.925–0.970) |
| XGBoost | 0.819 (0.802–0.836) | 0.825 (0.813–0.837) | 0.414 (0.311–0.518) | 0.944 (0.916–0.971) |
| DRF | 0.817 (0.799–0.834) | 0.827 (0.815–0.839) | 0.420 (0.319–0.522) | 0.944 (0.917–0.971) |
| Feedforward network | 0.817 (0.799–0.834) | 0.825 (0.812–0.837) | 0.400 (0.291–0.508) | 0.947 (0.920–0.975) |
Abbreviations: AUC, area under the receiver operating characteristic curve; CI, confidence interval; DRF, distributed random forest; LASSO, least absolute shrinkage and selection operator.
3.2. Feature importance
HbA1c level, non‐DKA hospitalization, and white blood cell count were each among the top 10 features in all 5 models (Table 4 and Figure S1). Logistic regression and LASSO identified the same top 10 features with slightly different ranks, and most of them are well‐established risk factors for DKA. XGBoost and DRF also identified almost the same top 10 features, of which eight were laboratory test results or vital signs, while feedforward network selected a very different set of top 10 features, of which six were social demographics. XGBoost and DRF each shared five of their top 10 features with the conventional ML approaches, while feedforward network shared four.
TABLE 4.
Top 10 features by each study model
| Rank | Logistic regression (conventional ML) | LASSO (conventional ML) | XGBoost (flexible ML) | DRF (flexible ML) | Feedforward network (flexible ML) |
|---|---|---|---|---|---|
| 1 | Insurance type – uninsured | HbA1c | HbA1c | HbA1c | Race – Asian |
| 2 | HbA1c | Non‐DKA hospitalization | Non‐DKA hospitalization | White blood cell count | Insurance type – uninsured |
| 3 | Non‐DKA hospitalization | Insurance type ‐ uninsured | White blood cell count | Non‐DKA hospitalization | Geographic region – Northeast |
| 4 | BMI | BMI | Hemoglobin | Hemoglobin | Race – African American |
| 5 | Pulse rate | Pulse rate | Pulse rate | Pulse rate | Geographic region – West |
| 6 | Psychiatric disorder | Psychiatric disorder | BMI | Random glucose level | Platelet count |
| 7 | Age | Age | Oxygen saturation | Respiratory rate | Gender ‐ female |
| 8 | Calendar year of diabetes cohort entry | Calendar year of diabetes cohort entry | Random glucose level | Platelet count | HbA1c |
| 9 | White blood cell count | White blood cell count | Platelet count | eGFR | Non‐DKA hospitalization |
| 10 | Acute infection | Acute infection | eGFR | BMI | White blood cell count |
Abbreviations: BMI, body mass index; DKA, diabetic ketoacidosis; DRF, distributed random forest; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; LASSO, least absolute shrinkage and selection operator.
In the logistic regression model, there were 16 positive predictors of DKA (i.e., with increased risk) such as higher HbA1c, non‐DKA hospitalization, and so on, and seven negative predictors (i.e., with decreased risk) such as older age, higher annual household income, and so on (Table S5).
For each approach, the AUC values of the top 10 feature model were close to that of the full model in the test set ranging from 0.785 to 0.802 (Figure S2).
4. DISCUSSION
We evaluated the performance of different ML approaches in a nested case–control study that used an EHR database to identify risk factors for DKA in adults with T1D. We found that prediction of DKA is achievable with either conventional or flexible ML approaches and that the differences in performance were minimal among these approaches. All approaches could consistently identify the known risk factors for DKA including HbA1c and non‐DKA hospitalization. XGBoost and DRF included more laboratory test results or vital signs in their top 10 features, while feedforward network included more social demographics. The flexible ML approaches only provided the relative importance for each predictor, while logistic regression could also estimate the OR for each predictor. Therefore, interpreting the findings of the flexible ML approaches is challenging.
In this study, we found that flexible ML approaches offered very limited improvement over conventional ones in predicting DKA using an EHR database. In a systematic review of performance comparison of logistic regression with ML for clinical prediction modeling, the AUC of logistic regression and ML models were similar when comparisons were restricted to studies with low risk of bias. 3 Other studies also reported comparable AUCs between flexible and conventional ML models. 28 , 29 Although the accuracy and specificity were above 0.8 in this study, sensitivity was only 0.4. The main reason for the low sensitivity is the probability threshold we used in constructing the confusion matrix. For simplicity, we chose overall accuracy to optimize when determining the threshold. As a result, the threshold was set relatively high, which led to low sensitivity and high specificity.
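The link between the accuracy‐based threshold and the low sensitivity can be checked with simple arithmetic, since accuracy is the case‐weighted average of sensitivity and specificity. The sketch below assumes that the test set's case fraction is close to the overall fraction of 3400 cases among 15 180 subjects; the exact test‐set counts are in Table S4 and are not reproduced here.

```python
def accuracy_from_rates(sensitivity, specificity, case_fraction):
    """Accuracy is the case-weighted average of sensitivity and specificity."""
    return sensitivity * case_fraction + specificity * (1.0 - case_fraction)

# Overall case fraction in the study population; the test set's fraction is
# assumed to be similar after a random 4:1 split.
case_fraction = 3400 / (3400 + 11780)

# Logistic regression point estimates from Table 3.
reported = accuracy_from_rates(0.409, 0.947, case_fraction)
# Reference point: a rule that labels every subject as a control.
all_controls = accuracy_from_rates(0.0, 1.0, case_fraction)
print(round(reported, 3), round(all_controls, 3))
```

Under this assumption, the reported sensitivity and specificity reproduce an accuracy of roughly 0.83, while labelling everyone as a control would already give roughly 0.78. Because controls make up more than three quarters of the subjects, an accuracy‐optimal cut‐off gains more from correctly classifying controls than from detecting cases.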
In general, the DKA risk factors that were most often selected by different models are clinically sensible as a triggering factor for insulin deficiency or discontinuation of insulin. Known risk factors for DKA include high HbA1c level, infection, surgery/trauma, younger age, female sex, low BMI, low socioeconomic status, and so on, 4 , 30 and either conventional or flexible ML approaches included some of these factors in their top 10 features. However, XGBoost and DRF included more laboratory test results or vital sign measurements compared with conventional models, while feedforward network included more social demographics. Because each model uses different theories and algorithms to determine the feature importance, direct comparison cannot be made to explain the different features selected by various models. Despite this, one possible explanation for the difference is that some of the identified laboratory test results or vital signs reflect underlying causes of DKA, for example increased white blood cell count and pulse rate with infection, or elevated hemoglobin with dehydration. Another possible explanation is that some test results and social demographics are largely interrelated, e.g., high random blood glucose level and uninsured status may both be associated with suboptimal diabetes management.
This study has several limitations that should be considered. First, misclassification of DKA was possible, because we could not retrieve medical records to validate the outcome. However, the PPV for the proposed approach of DKA identification was 89%. 11 Second, unlike administrative claims data, there is no patient enrollment information in EHR data. To minimize the possibility that the medical encounters in EHR data are incomplete, the study patients were selected among those with non‐emergency clinical activities recorded within 1 year prior to the index date, assuming all medical encounters were captured in the defined time window. In addition, we applied other inclusion and exclusion criteria which may limit the implementation of these predictive algorithms in a real‐world setting. Third, several laboratory measurements including white blood cell count had missing data close to 60% and the impact on model performance from the use of imputation for these features is unknown. However, the high percentages of these missing values were driven by controls, suggesting they were most likely to lack laboratory testing because controls had fewer medical conditions which could trigger laboratory testing than cases. Fourth, the predictor selection was based on the knowledge of DKA and data availability. This a priori feature selection may limit the learning potential of some of the more complex ML algorithms and affect the interpretation of feature importance results. Fifth, each model uses different theories and algorithms to determine the feature importance. Therefore, principled comparison cannot be made across models to explain the differences in feature selection. Last, the model performance was assessed based on one typical nested case–control study using an EHR database. This needs to be considered when interpreting the generalizability of these results.
Overall, the flexible ML approaches offered very limited performance improvement over conventional ones in predicting DKA using structured data recorded in EHR data source in this study. Both conventional and flexible ML approaches identified overlapping, but different top 10 risk factors for DKA. Further research is needed to determine the conditions under which the flexible ML approaches would outperform the conventional ones and vice versa and to better understand the reasons for differences in feature importance ranking among these approaches.
CONFLICT OF INTEREST
Lin Li, Chuang‐Chung Lee, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, and Juhaeri Juhaeri are employees of Sanofi. Kristen Sharma was an employee of Sanofi at the time of this study, and is a current employee of Astellas Pharmaceuticals, Inc. Chuntao Wu was an employee of Sanofi at the time of this study, and is a current employee of Alexion Pharmaceuticals, Inc.
AUTHOR CONTRIBUTIONS
Chuntao Wu, Cliona Molony, Lin Li, Chuang‐Chung Lee: Study concept and design. Chuang‐Chung Lee: Data acquisition and statistical analysis. Lin Li, Chuang‐Chung Lee, Chuntao Wu, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri: Interpretation of data. Lin Li: Drafting of the manuscript. Lin Li, Chuang‐Chung Lee, Chuntao Wu, Fang Liz Zhou, Cliona Molony, Zoran Doder, Evgeny Zalmover, Kristen Sharma, Juhaeri Juhaeri: Critical revision of the manuscript for important intellectual content. Chuntao Wu, Juhaeri Juhaeri: Study supervision. Lin Li, Chuang‐Chung Lee: Accountable for accuracy and integrity.
Supporting information
Appendix S1: Supporting Information
ACKNOWLEDGMENTS
The study was funded by Sanofi U.S. LLC. This study was designed as an analysis based on de‐identified Electronic Health Record data, and there was no active enrollment or active follow‐up of study subjects, and no data were collected directly from individuals.
Li L, Lee C‐C, Zhou FL, et al. Performance assessment of different machine learning approaches in predicting diabetic ketoacidosis in adults with type 1 diabetes using electronic health records data. Pharmacoepidemiol Drug Saf. 2021;30:610–618. 10.1002/pds.5199
Lin Li and Chuang‐Chung Lee contributed equally to this work.
This work was presented as a poster at the International Conference on Pharmacoepidemiology & Therapeutic Risk Management All Access in 2020.
Funding information: Sanofi
