Author manuscript; available in PMC 2023 Jun 1. Published in final edited form as: Med Care. 2022 Mar 30;60(6):470–479. doi: 10.1097/MLR.0000000000001720

Comparing Machine Learning to Regression Methods for Mortality Prediction using Veterans Affairs Electronic Health Record Clinical Data

Bocheng Jing 1,2,3, W John Boscardin 1,3,4, W James Deardorff 3, Sun Young Jeon 1,3, Alexandra K Lee 1,3, Anne L Donovan 5, Sei J Lee 1,3
PMCID: PMC9106858  NIHMSID: NIHMS1784509  PMID: 35352701

Abstract

Background:

It is unclear whether machine learning methods yield more accurate electronic health record (EHR) prediction models than traditional regression methods.

Objective:

To compare machine learning and traditional regression models for 10-year mortality prediction using EHR data.

Design:

Cohort Study

Setting:

Veterans Affairs (VA) EHR data

Participants:

Veterans aged ≥50 with a primary care visit in 2005, divided into separate training and testing cohorts (n=126,360 each).

Measurements and Analytic Methods:

The primary outcome was 10-year all-cause mortality. We considered 924 potential predictors across a wide range of EHR data elements including demographics (3), vital signs (9), medication classes (399), disease diagnoses (293), laboratory results (71), and healthcare utilization (149). We compared discrimination (c-statistics), calibration metrics and diagnostic test characteristics (sensitivity, specificity, and positive and negative predictive value) of machine learning and regression models.

Results:

The cohort's mean (SD) age was 68.2 (10.5) years; 93.9% were male and 39.4% died within 10 years. Models yielded testing cohort c-statistics between 0.827 and 0.837. Utilizing all 924 predictors, the Gradient Boosting model yielded the highest c-statistic (0.837, 95% CI: 0.835–0.839). The full (unselected) logistic regression model had the highest c-statistic among regression models (0.833, 95% CI: 0.830–0.835) but showed evidence of overfitting. The discrimination of the stepwise selection logistic model (101 predictors) was similar (0.832, 95% CI: 0.830–0.834), with minimal overfitting. All models were well calibrated and had similar diagnostic test characteristics.

Limitation:

Our results should be confirmed in non-VA EHRs.

Conclusion:

The difference in c-statistic between the best machine learning model (the 924-predictor Gradient Boosting model) and the 101-predictor stepwise logistic regression model for 10-year mortality prediction was modest, suggesting that stepwise regression remains a reasonable method for developing VA EHR mortality prediction models.

Précis:

In a large EHR setting for 10-year mortality prediction, machine learning models performed similarly to traditional regression models in prediction accuracy, test characteristics, and calibration.

INTRODUCTION

Many clinical decisions depend on the prediction of future events.1 For example, the decision to initiate treatment for hyperlipidemia depends in part on predicted risk of cardiovascular events.2 In addition, guidelines recommend targeting colorectal cancer screening to older adults with a life expectancy greater than 10 years.3,4 This focus on prediction to target interventions is conceptually appealing since most interventions have risks, burdens and/or costs. While high-risk patients are likely to benefit from an intervention, low-risk patients receiving the same intervention may be more likely to be harmed than benefit. Thus, accurate risk prediction through prediction models is a critical first step in individualized clinical decision-making.

There is tremendous interest in leveraging EHR clinical data to improve the accuracy of clinical predictions. By electronically organizing huge volumes of clinical data, EHRs represent an unparalleled data resource for clinical prediction. In addition, since EHRs can also guide decision-making through decision support tools, improved EHR prediction linked with decision support tools holds the promise of helping providers individualize clinical decisions. For example, automated cardiovascular risk prediction could be linked with a clinical reminder so that providers are prompted to consider a cholesterol lowering medication for those patients with high predicted cardiovascular risk.

Machine learning methods5,6 may be especially well suited to utilize the large volumes of clinical data available in EHRs. By automating the identification of patterns and multi-dimensional interactions in large-scale EHR datasets, machine learning algorithms may be able to predict clinical outcomes more accurately than traditional regression methods.7,8 Despite these theoretical advantages, studies comparing machine learning to traditional regression models have yielded mixed results. While some studies have concluded that machine learning methods outperformed logistic regression,9–15 a systematic review and other studies suggest that machine learning algorithms may perform similarly to logistic regression for clinical outcome prediction.16–18

Thus, our objective was to compare machine learning methods (Gradient Boosting, Random Forests, Neural Networks, the SuperLearner ensemble, and the Least Absolute Shrinkage and Selection Operator [LASSO]) with traditional regression methods (logistic regression and Weibull, Cox, and Gompertz survival models) for predicting 10-year all-cause mortality using Veterans Affairs (VA) EHR data. The term "statistical learning"6 explicitly recognizes that regression and machine learning methods lie on a spectrum of statistical prediction methods; in this paper, we adopt the somewhat artificial dichotomy of traditional regression versus machine learning both for simplicity and to mirror the terminology of other recent work in this area.9–17 We sought to compare these models across binary outcomes (i.e., 10-year mortality) as well as survival outcomes (time to death) by comparing prediction performance at 1, 2, and 5 years of follow-up. We included a wide variety of potentially predictive data elements from the VA EHR, including demographics, disease diagnoses, medications, healthcare utilization, laboratory results, and vital signs. To compare machine learning models with regression models, we examined discrimination (c-statistics), calibration metrics, and diagnostic test characteristics (sensitivity, specificity, and positive and negative predictive value).

METHODS

Participants

We identified a 10% random sample of VA patients aged 50 or older who had at least one VA primary care clinic visit in 2005 (n=126,360 for the training cohort and a separate n=126,360 for the testing cohort). We chose a 10% random sample for two reasons. First, we found that larger cohorts did not improve our models. Second, larger sample sizes were computationally intensive and led to impractical run times. Since these larger-sample, longer-run-time analyses yielded the same results as our 10% sample, we present the 10% random sample as our primary analysis. Each patient's first clinic visit date was identified as the index visit.

Outcome Measures:

Patients were followed for the outcome of interest through the end of 2017. For binary models, the outcome was all-cause mortality within 10 years of the index visit. For survival models, the outcome was time to death, with follow-up through the end of 2017. Date of death was obtained from the VA vital status file.

Predictor Measures:

We included 924 risk factors from six domains as potential predictors:

  1. Diagnoses: 293 predictors of patients’ diagnoses were obtained from both VA outpatient and inpatient encounters and further categorized by Clinical Classifications Software (CCS), coded as present or absent.

  2. Medications: 399 VA drug classes were extracted from the VA Pharmacy Benefits Management data (PBM), coded as present or absent.

  3. Laboratory tests: 71 lab tests were derived from the VA Lab Results data (LAR) and categorized as abnormal low, normal, abnormal high, and lab not performed.

  4. Utilization data: 149 types of clinic visits were captured by the VA outpatient stop codes and categorized as 0, 1, or >1 visits.

  5. Vital signs: 9 vital signs, including body mass index (BMI), pulse, pain, respiration rate, body temperature, weight range, weight change, systolic blood pressure, and pulse oximetry, were obtained from the VA Vital Sign data. Weight range and weight change were derived from repeated weight data and included as potential predictor variables.

  6. Demographics: 3 variables (age, sex, and race/ethnicity) were derived from the VA Vital Status Mini file.

Age at the index visit was coded as a restricted cubic spline with four knots placed at pre-specified percentiles (20%, 40%, 60%, and 80%) of the marginal age distribution: 52, 62, 74, and 84 years.19 Other predictor variables (diagnoses, medications, laboratory tests, healthcare utilization, and vital signs) were obtained for the 1 year prior to the index visit date. When more than one laboratory result was available in the 1-year lookback period, we used the result closest to the index visit date. We considered focusing on the most extreme or abnormal laboratory result but found that most laboratory tests had at most one value in the 1-year lookback period. For example, for the most common laboratory tests (potassium, sodium, creatinine, and glucose), nearly half of our patients did not have any results and nearly 90% had 0 or 1 result. Since repeat laboratory results were rare, focusing on the most extreme or abnormal laboratory value would be unlikely to yield different results than focusing on the result closest to the index visit date.
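
For illustration, the spline coding of age can be sketched as follows. This is a minimal Python example of Harrell-style restricted cubic spline basis terms with knots at 52, 62, 74, and 84; it is not the SAS/Stata/R code used in this study, and the function name and normalization are illustrative choices.

```python
import numpy as np

def rcs_basis(x, knots=(52.0, 62.0, 74.0, 84.0)):
    """Restricted cubic spline basis (Harrell's parameterization).

    With k knots the basis has k-1 columns: the linear term plus k-2
    truncated-cubic terms constrained to be linear beyond the outer knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    scale = (t[-1] - t[0]) ** 2            # standard normalization of the cubic terms

    def pos3(u):                           # truncated (positive-part) cube
        return np.maximum(u, 0.0) ** 3

    cols = [x]                             # linear term
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / (t[-1] - t[-2])
                + pos3(x - t[-1]) * (t[-2] - t[j]) / (t[-1] - t[-2]))
        cols.append(term / scale)
    return np.column_stack(cols)

# Example: four patients' ages expanded into 3 spline columns (linear + 2 nonlinear)
print(rcs_basis(np.array([55, 68, 79, 90])))
```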

We handled vital signs in the same way as laboratory results, focusing on the vital signs closest to the index visit date. While repeat vital signs were common, using extreme vital sign values led to models with a nonsignificant trend toward worse discrimination compared with models using the vital signs most proximate to the index visit date. Thus, we present the most-proximate vital sign models as our primary results. Additional details regarding the specification of the potential predictor variables are available in Supplemental Appendix Tables 1–5.

Statistical Analysis

We developed five traditional regression models and six machine learning models to predict all-cause mortality. Traditional regression methods included Gompertz, Weibull, and Cox survival models, as well as two logistic regression models: a full model incorporating all 924 predictors (the Logistic Full model) and a backward-selected model with a predictor retention criterion of p<0.0001 (the Logistic Selected model).20,21
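
As a schematic of the backward selection used for the Logistic Selected model (the study's regression models were fit in SAS and Stata; this Python sketch uses Wald p-values, and the function and variable names are illustrative), one can repeatedly refit the model and drop the least significant predictor until every remaining predictor satisfies p<0.0001:

```python
import statsmodels.api as sm

def backward_select(X, y, p_retain=1e-4):
    """Backward elimination for logistic regression.

    X: pandas DataFrame of candidate predictors; y: 0/1 outcome.
    Repeatedly drops the predictor with the largest Wald p-value
    until all remaining p-values are below p_retain.
    """
    kept = list(X.columns)
    while kept:
        fit = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0)
        pvals = fit.pvalues.drop("const")       # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] < p_retain:             # all retained predictors are significant
            return kept, fit
        kept.remove(worst)                      # drop the weakest predictor and refit
    return [], None
```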

Machine learning models included LASSO Logit and LASSO Cox models, Random Forest, Neural Networks, Gradient Boosting, and the SuperLearner ensemble. We followed published best practices for optimizing models with each machine learning method. Since previous studies have suggested that feature preselection can improve the signal-to-noise ratio and the performance of machine learning algorithms,6,22 we performed univariate analyses and identified the 200 predictors most strongly associated with the outcome. For each machine learning method, we compared performance with the full predictor set versus the preselected predictor set. Since feature preselection improved performance for the Neural Network and SuperLearner (which incorporated the Neural Network), we present Neural Network and SuperLearner results with feature preselection; results for the other machine learning algorithms use the full predictor set.
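
A minimal sketch of this preselection step is shown below (illustrative Python with scikit-learn and simulated data; the chi-square score is an assumed choice, since the specific univariate statistic is not detailed above).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 924))    # stand-in binary EHR indicators
y = rng.integers(0, 2, size=5000)           # stand-in 10-year mortality flag

# Score each predictor's univariate association with the outcome and keep
# the 200 strongest for the Neural Network and SuperLearner models.
selector = SelectKBest(score_func=chi2, k=200).fit(X, y)
X_preselected = selector.transform(X)       # 5000 x 200 preselected matrix
```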

For LASSO, we used 10-fold cross-validation to determine the optimal shrinkage parameter λ.23 For Random Forest and Gradient Boosting, we used bagging and boosting, respectively, to develop an ensemble of decision trees.6,24–26 For the Neural Network, we performed a grid search to identify the optimal number of hidden-layer units and adjusted the weights of the connections between the input and hidden layers to optimize output-layer classification. For SuperLearner, we selected and trained six algorithms and formed the final ensemble prediction by weighting each algorithm's predicted risk.27,28 Gradient Boosting, Random Forest, Neural Network, and SuperLearner ensemble methods were trained using 10-fold cross-validation, and Gradient Boosting, Random Forest, and Neural Network hyperparameters were tuned by grid search to optimize model discrimination.6
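
The cross-validation and grid-search steps can be sketched as follows. This is an illustrative Python/scikit-learn example on simulated data, not the glmnet/caret/xgboost implementations used in this study, and the grid values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 50)).astype(float)   # stand-in predictors
y = rng.integers(0, 2, size=2000)                        # stand-in binary outcome

# LASSO-penalized logistic regression: 10-fold CV over a path of shrinkage values
lasso = LogisticRegressionCV(Cs=20, cv=10, penalty="l1", solver="saga",
                             scoring="roc_auc", max_iter=5000).fit(X, y)

# Gradient boosting: hyperparameter grid search with 10-fold CV, keeping the
# configuration with the highest cross-validated AUC (discrimination)
param_grid = {"n_estimators": [200, 500],
              "learning_rate": [0.05, 0.1],
              "max_depth": [2, 3]}
gbm = GridSearchCV(GradientBoostingClassifier(), param_grid,
                   scoring="roc_auc", cv=10).fit(X, y)
print(gbm.best_params_, round(gbm.best_score_, 3))
```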

We determined model performance metrics in both the training and testing cohorts. We used the c-statistic (area under the receiver operating characteristic curve) as our measure of discrimination, assessing concordance between model predictions and the binary indicator of 10-year mortality. To compare diagnostic test characteristics across machine learning and regression models, we examined the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at different cutoff thresholds. To compare binary versus survival models, we calculated c-statistics for 1-, 2-, and 5-year binary mortality outcomes from the survival model predictions without retraining the models. We evaluated calibration both graphically and by computing the calibration plot intercept and slope.19 The Brier score was computed as an overall measure of model performance.29,30
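
For reference, these performance metrics can be computed from a vector of predicted probabilities and observed outcomes as in the following sketch. This is illustrative Python, not the study's code; estimating the calibration intercept and slope on the logit scale is one common approach consistent with the cited framework, and the function name is our own.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import brier_score_loss, roc_auc_score

def performance(y, p, threshold=0.5):
    """C-statistic, Brier score, test characteristics at a probability cutoff,
    and logit-scale calibration intercept and slope for binary predictions."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p, dtype=float), 1e-10, 1 - 1e-10)
    logit_p = np.log(p / (1 - p))

    pred = (p >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1)); fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1)); tn = np.sum((pred == 0) & (y == 0))

    # Calibration slope: coefficient from regressing the outcome on logit(p)
    slope = sm.Logit(y, sm.add_constant(logit_p)).fit(disp=0).params[1]
    # Calibration intercept: intercept with logit(p) entered as a fixed offset
    intercept = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial(),
                       offset=logit_p).fit().params[0]

    return {"c_statistic": roc_auc_score(y, p),
            "brier_score": brier_score_loss(y, p),
            "sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp), "npv": tn / (tn + fn),
            "calibration_intercept": intercept, "calibration_slope": slope}
```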

We used SAS 9.4 and Stata 15.1 to fit the Gompertz, Weibull, Cox, and logistic regression models and to obtain testing dataset predictions. We used R 3.6.4 to train and test the LASSO Logit, LASSO Cox, Random Forest, Gradient Boosting, Neural Network, and SuperLearner models: the "glmnet"31 package for LASSO Logit; the "coxnet"32 and "C060"33 packages to fit LASSO Cox and obtain its out-of-sample predictions, respectively; the "ranger"34 package for Random Forest; and the "caret"35 package for Gradient Boosting, Neural Network, and SuperLearner. Specifically, we used the "xgboost" library to train Gradient Boosting and the "nnet" library to train the Neural Network.

RESULTS

Baseline characteristics, including demographics, vital signs, medication use, chronic conditions, laboratory results, and clinic visit utilization, were similar between the training and testing cohorts (Table 1). The mean age in the training cohort was 68.2 years (SD 10.5), 93.9% were male, 35.5% had a BMI ≥30, and 37.3% had a systolic blood pressure ≥140 mmHg. Twenty-three percent of patients in the training cohort were on beta blockers, 56% had a diagnosis of essential hypertension, and 20.6% were diagnosed with diabetes without complications; 10-year mortality was 39.4%. Baseline characteristics were similar in the testing cohort.

Table 1:

Selected Baseline Cohort Characteristics

Training (n=126,360) Testing (n=126,360)
Demographics
 Age (SD) 68.2 (10.5) 68.2 (10.5)
 Male 93.9% 93.8%
Vital Signs
 BMI ≥30 35.5% 35.7%
 Pulse 76.3 (25.3) 76.3 (25.2)
 Pain 1.1 (3.1) 1.1 (3.0)
 Temperature 97.5 (1.3) 97.5 (1.4)
 SBP ≥140 mmHg 37.3% 37.1%
Medications
 Antilipemic Agents 35.1% 35.2%
 Ace Inhibitors 25.4% 25.6%
 Beta Blockers 23.2% 23.2%
 Gastric Medications 15.8% 15.9%
 Oral Hypoglycemics 11.7% 11.7%
Chronic Conditions
 Essential hypertension 56.3% 56.6%
 Disorders of lipid metabolism 46.1% 46.4%
 Genitourinary Cancer 39.5% 39.4%
 Diabetes mellitus without complication 20.6% 20.9%
Laboratory measurements
Glucose (Serum)
 Normal 19.4% 19.5%
 Abnormal High 29.0% 29.1%
 Abnormal Low 0.6% 0.5%
 Not Performed 51.0% 50.8%
PSA (Prostatic Specific Antigen)
 Normal 25.2% 25.3%
 Abnormal High 3.0% 2.9%
 Not Performed 71.1% 71.0%
Total Cholesterol
 Normal 32.1% 32.2%
 Abnormal High 12.5% 12.6%
 Not Performed 55.3% 55.2%
Clinic visits (1+ visits)
 Emergency Department 11.1% 11.1%
 Optometry 10.8% 10.7%
 Cardiology 5% 5.1%
 Physical Therapy 3.2% 3.1%
 Hospitalization 0.5% 0.5%
Mortality at follow-up
 Year 1 3.5% 3.5%
 Year 2 7.5% 7.4%
 Year 5 19.4% 19.3%
 Year 10 39.4% 39.3%

BMI is body mass index

Both machine learning and regression models yielded similar discrimination for 10-year all-cause mortality (Table 2 and Figure 1A). The Gradient Boosting model had the highest discrimination among machine learning models, with a testing cohort c-statistic of 0.837 (95% CI: 0.835–0.839). The unselected Logistic Full model had the highest discrimination among regression models, with a testing c-statistic of 0.833 (95% CI: 0.830–0.835). The Random Forest and Neural Network models yielded the lowest c-statistics (0.827, 95% CI: 0.825–0.829 and 0.827, 95% CI: 0.825–0.830, respectively). Given our sample size, the differences in c-statistic across models were statistically significant (Figure 2). However, the absolute range of c-statistics was small at 0.01 (from the lowest c-statistic of 0.827 for Random Forest and Neural Network to the highest of 0.837 for Gradient Boosting).

Table 2:

Training and Testing Cohort Discrimination for 10-Year Mortality Prediction

Model Training method Variables selected Training cohort c-statistic (95% CI) Testing cohort c-statistic (95% CI)
Survival Models
Gompertz Backwards selection with retention p-value=0.0001 131 0.833 (0.831, 0.836) 0.829 (0.827, 0.832)
Weibull Backwards selection with retention p-value=0.0001 134 0.833 (0.831, 0.836) 0.831 (0.828, 0.833)
Cox Backwards selection with retention p-value=0.0001 128 0.832 (0.830, 0.835) 0.829 (0.827, 0.832)
LASSO Cox 10-fold cross validation to optimize λ 261 0.835 (0.833, 0.837) 0.831 (0.829, 0.833)
Binary outcome Models
Logistic Selected Backwards selection with retention p-value=0.0001 101 0.835 (0.833, 0.837) 0.832 (0.830, 0.834)
Logistic Full No Selection 924 0.842 (0.840, 0.844) 0.833 (0.830, 0.835)
LASSO Logit 10-fold cross validation to optimize λ 293 0.837 (0.835, 0.839) 0.833 (0.830, 0.835)
Random Forest Hyperparameter Grid Search for the model that minimizes prediction error 924 0.829 (0.827, 0.832) 0.827 (0.825, 0.829)
Gradient Boosting Hyperparameters Grid Search with 10-fold cross validation for optimal AUC 924 0.839 (0.837, 0.841) 0.837 (0.835, 0.839)
Neural network 10-fold cross validation with predictor preselection 200 0.832 (0.830, 0.834) 0.827 (0.825, 0.830)
SuperLearner Ensemble of Logistic, LASSO, Random Forest, Gradient Boosting, and Neural Network with 10-fold cross-validation and predictor preselection 200 0.837 (0.834, 0.839) 0.835 (0.832, 0.837)

Machine Learning models are in italics

Figure 1. Discrimination Across Model Development Methods.


A. C-statistics Comparison for 10-year mortality prediction in Training and Testing cohorts across Machine Learning and Traditional Regression Models

B. C-statistics Comparison at Various Time Points across Machine Learning and Traditional Regression Models

Figure 2. C-Statistic Differences across Machine Learning and Traditional Regression Models.


a. Diagonal boxes (gray) show the c-statistics of each model

b. Off-diagonal boxes (red/blue) show the differences between c-statistics of the models (horizontal minus vertical).

c. Positive numbers (blue boxes) denote higher c-statistics for the horizontal (x-axis) model, compared to the vertical (y-axis) model.

d. Negative numbers (red boxes) denote lower c-statistics for the horizontal (x-axis) model, compared to the vertical (y-axis) model.

Machine learning models showed less overfitting than the full (unselected) logistic regression model (Table 2). Overfitting, measured as the difference in discrimination between the training and testing cohorts, was greater for the Logistic Full model (0.009; training cohort c-statistic 0.842 vs. testing cohort c-statistic 0.833) than for the Gradient Boosting model (0.002; training cohort c-statistic 0.839 vs. testing cohort c-statistic 0.837).

Stepwise-selected traditional models performed almost as well as non-selected models, with similar discrimination and less overfitting despite far fewer predictors (101 vs. 924). The backward stepwise selection logistic regression model (101 variables) yielded a testing cohort c-statistic of 0.832 (95% CI: 0.830–0.834), comparable to the testing cohort c-statistic of the full 924-predictor logistic model (0.833, 95% CI: 0.830–0.835). In addition, selected models were less overfit than non-selected models: while the c-statistic dropped by 0.009 (0.842 to 0.833) from the training to the testing cohort for the Logistic Full model, it decreased by only 0.003 for the Logistic Selected model (0.835 to 0.832) (Table 2). These results suggest that, in our VA EHR setting, stepwise backward selection identifies and removes spurious predictors, yielding more parsimonious and less overfit models.

Binary outcome models yielded discrimination similar to survival outcome models at each follow-up time point (Figure 1B). Specifically, the highest-discrimination survival model (Gompertz) had 10-, 5-, and 2-year c-statistics of 0.829 (95% CI: 0.827–0.832), 0.808 (95% CI: 0.805–0.811), and 0.803 (95% CI: 0.799–0.808), respectively. The Logistic Selected model had similar 10-, 5-, and 2-year c-statistics of 0.832 (95% CI: 0.830–0.835), 0.811 (95% CI: 0.808–0.814), and 0.805 (95% CI: 0.801–0.810).

Calibration plots indicated that most models were well calibrated, with calibration slopes ranging from 0.948 to 1.183 (ideal, 1.00) and calibration intercepts ranging from −0.017 to 0.095 (ideal, 0.00) (Figure 3). While all models showed acceptable to excellent calibration, the Random Forest model showed the most miscalibration, with a calibration slope of 1.183 and a calibration intercept of 0.086. The Neural Network model also showed evidence of miscalibration at the low and high ends of predicted risk, with a lowest predicted risk of 4% and a highest predicted risk of 87%.

Figure 3: Calibration Plots for Models.


Diagnostic test characteristics followed the pattern of the discrimination and calibration results, with machine learning and regression models showing similar test characteristics (Appendix Table 6). At a predicted probability threshold of 0.5, the Neural Network yielded the highest sensitivity (0.674) and negative predictive value (0.796), while the Gompertz model yielded the highest specificity (0.878) and positive predictive value (0.758). Additional diagnostic test characteristics at other predicted probability thresholds are shown in Appendix Table 6.

DISCUSSION

For 10-year all-cause mortality prediction using VA EHR data, we found that machine learning and traditional regression methods performed similarly, and both produced models with excellent discrimination and calibration. While our large sample size (n=126,360) provided the statistical power to identify the Gradient Boosting model as the highest-discrimination model, the differences in discrimination between models were modest. Specifically, the difference in c-statistic between the "best" model (Gradient Boosting: testing cohort c-statistic 0.837, 95% CI: 0.835–0.839) and the "worst" model (Random Forest: testing cohort c-statistic 0.827, 95% CI: 0.825–0.829) was 0.01, which seems unlikely to be clinically significant. Thus, our results suggest that a wide variety of machine learning and traditional regression methods are reasonable approaches for developing clinical prediction models using EHR data.

Our results add to the growing body of literature that suggests that traditional statistical methods perform well and are comparable to machine learning algorithms in “large N, small p” data settings. A recent simulation study by Austin and colleagues showed that unpenalized logistic regression performed similarly to a variety of machine learning algorithms (e.g., LASSO, Gradient Boosting and Random Forests) across a variety of simulated datasets and performance metrics. They concluded that traditional statistical methods perform well in large datasets with “large N, small p”.36 Our study adds to this growing literature by extending these results to a real-world EHR data setting and comparing survival versus binary outcome models. Previous studies suggest that machine learning algorithms are more likely to outperform traditional statistical methods in data settings with “small N, large p” as well as data settings with large interaction effects between a large number of predictors.6,37

Previous authors have expressed concern that automated stepwise selection methods lead to biased, overfit models by identifying and selecting the factors most strongly associated with the outcome, especially in small-sample settings.19,38 Such overfit models would be identified during validation through substantial decreases in discrimination. In our EHR setting, with a large N (126,360) and a sufficient number of outcomes per potential predictor (outcomes/predictor ≈54), we found that automated selection models validated with smaller decreases in discrimination (training c-statistic 0.835 − testing c-statistic 0.832 = 0.003) than non-selected regression models (training c-statistic 0.842 − testing c-statistic 0.833 = 0.009). Since most EHR settings will have ample sample sizes of many thousands or even millions of subjects, our results suggest that automated selection is likely to be a reasonable model development strategy for most EHR clinical prediction models.

Our results highlight two issues suggesting that the 101-predictor automated stepwise Logistic Selected model may be more practical, and in some ways superior, to the full 924-predictor Gradient Boosting model. First, our ability to reduce from 924 to 101 predictors with minimal loss in discrimination suggests that the removed predictors provided little additional predictive value and were more likely spurious "noise" predictors that add little to a prediction model. Second, while data may be relatively easy to access in EHR settings, additional predictors decrease model interpretability and can increase prediction variance. Thus, the 101-predictor Logistic Selected model (testing cohort c-statistic 0.832, 95% CI: 0.830–0.834) may be preferable in most situations to the slightly higher-discrimination 924-predictor Gradient Boosting model (testing cohort c-statistic 0.837, 95% CI: 0.835–0.839).

Machine learning methods often explicitly minimize overfitting by employing strategies such as bootstrap aggregation or coefficient shrinkage. Thus, it was not surprising that our machine learning models showed less overfitting than the unselected, full logistic regression model (difference between training and testing cohort c-statistics: 0.002 for the Gradient Boosting model vs. 0.009 for the unselected logistic regression model). However, we found that automated stepwise selection also reduced overfitting (difference between training and testing cohort c-statistics: 0.003 for the automated selection model).

While we anticipated 10-year mortality binary outcome models to compare favorably with survival models for the 10-year mortality outcome, we were surprised that 10-year mortality binary outcome models did nearly as well as survival models in predicting 5-, 2-, and 1-year mortality. We propose two potential explanations for the similar results between our survival and binary models. First, while the factors for 30-day mortality prediction may be quite different from the factors for 10-year prediction, the factors most important for 5-, 2- or 1-year prediction may be quite similar to the factors for 10-year prediction. Second, a growing body of literature suggests that survival modeling can be viewed as a binary outcome classification problem.39,40 This suggests that survival and binary models have important fundamental similarities, making our similar survival and binary model results less surprising.

Our result, that Gradient Boosting appears to outperform other methods by a small but measurable amount, is most likely a function of the characteristics of our dataset. Specifically, developing evidence suggests that while Neural Networks appear to outperform other algorithms with unstructured data such as free text or images, Gradient Boosting appears to be optimal in structured data settings.41,42 Our EHR dataset was dominated by structured data and Gradient Boosting appears to perform especially well in these settings.43 Since most real-life datasets will have both structured and unstructured data, optimal prediction may require trying several algorithms to determine empirically which methods work best in a given dataset.44

We found that LASSO models had excellent discrimination and compared favorably to other survival and binary outcome models. The survival LASSO Cox model had one of the highest discriminations of the survival models (testing cohort c-statistic 0.831, 95%CI: 0.829, 0.833). The binary outcome LASSO Logit was between the Gradient Boosting model and the Logistic Selected model in terms of testing cohort discrimination (gradient boosting 0.837 vs LASSO logit 0.833 vs logistic selected 0.832) and number of predictors (gradient boosting 924 vs LASSO logit 293 vs logistic selected 101). These results suggest that when sample size constraints make automated selection less desirable and parsimonious models are an important consideration, LASSO models may be an appropriate modeling strategy for EHR clinical prediction model development.

Our results should be interpreted in light of our study's strengths and limitations. A major strength is our ability to include a large number of potential predictor variables across many different types of data elements in a real-world EHR setting. Our study also has several limitations. First, we used VA EHR data, which may not be generalizable to non-VA EHR settings; replication studies will be needed to determine whether our conclusions hold in non-VA EHR settings. Second, we conducted data cleaning and data specification before modeling, and different degrees of data cleaning and specification may lead to different modeling methods over- or underperforming others. Finally, we did not examine unstructured data elements, such as text notes and imaging data.

In conclusion, we found that for all-cause mortality prediction in the VA EHR setting, machine learning methods yielded models with performance similar to traditional regression models. Although the Gradient Boosting model achieved higher discrimination than other models, the differences in c-statistics were small and unlikely to be clinically significant. An automated stepwise selection logistic regression model achieved excellent discrimination with minimal overfitting while using 101 predictors, compared with 924 predictors in the Gradient Boosting model. Since the large sample sizes common in EHR data settings minimize the weaknesses of stepwise selection methods, our results suggest that stepwise selection regression methods remain a reasonable EHR clinical prediction model development strategy.

Supplementary Material

Supplemental Data File (.doc, .tif, pdf, etc.)

Funding Source:

This study is supported by VA Health Services Research and Development Service (HSR&D) Investigator Initiated Research (IIR) grant 15-434.

Footnotes

Financial Disclosure: None reported.

REFERENCE

1. Karlawish J. Desktop medicine. JAMA. 2010;304(18):2061–2062.
2. Bibbins-Domingo K, Grossman D, Curry S, et al. Statin use for the primary prevention of cardiovascular disease in adults. JAMA. 2016;316(19):1997–2007.
3. Lin JS, Piper MA, Perdue LA, et al. Screening for colorectal cancer: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA. 2016;315(23):2576–2594.
4. Lee SJ, Leipzig RM, Walter LC. Incorporating lag time to benefit into prevention decisions for older adults. JAMA. 2013;310(24):2609–2610.
5. Samuel AL. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development. 1959;44(1.2):206–226.
6. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York, NY: Springer; 2009.
7. Scott IA. Machine learning and evidence-based medicine. Ann Intern Med. 2018;169(1):44–46.
8. Alpaydin E. Introduction to Machine Learning. Cambridge, MA: The MIT Press; 2014.
9. Ming C, Viassolo V, Probst-Hensch N, et al. Machine learning techniques for personalized breast cancer risk prediction: comparison with the BCRAT and BOADICEA models. Breast Cancer Res. 2019;123:860–867.
10. Aminian A, Zajichek A, Arterburn DE, et al. Predicting 10-year risk of end-organ complications of type 2 diabetes with and without metabolic surgery: a machine learning approach. Diabetes Care. 2020;43(3):dc102057.
11. Al-Mallah MH, Elshawi R, Ahmed AM, et al. Using machine learning to define the association between cardiorespiratory fitness and all-cause mortality (from the Henry Ford Exercise Testing Project). Am J Cardiol. 2017;120(11):2078–2084.
12. Desai RJ, Wang SV, Vaduganathan M, et al. Comparison of machine learning methods with traditional models for use of administrative claims with electronic medical records to predict heart failure outcomes. JAMA Netw Open. 2020;3(1):e1918962.
13. Lynch CM, Abdollahi B, Fuqua JD, et al. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int J Med Inform. 2017;108:1–8.
14. Taylor T, Sarik DA, Salyakina D. Development and validation of a web-based pediatric readmission risk assessment tool. Hosp Pediatr. 2020;10(3):246–256.
15. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS One. 2017;12(4):e0174944.
16. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.
17. Nusinovici S, Tham YC, Chak Yan MY, et al. Logistic regression was as good as machine learning for predicting major chronic diseases. J Clin Epidemiol. 2020;122:56–69.
18. Cowling TE, Cromwell DA, Bellot A, Sharples LD, van der Meulen J. Logistic regression and machine learning predicted patient mortality from large sets of diagnosis codes comparably. J Clin Epidemiol. 2021;133:43–52.
19. Harrell FE. Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. 1st ed. New York, NY: Springer; 2001.
20. Gompertz B. On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philosophical Transactions of the Royal Society of London. 1825;115:513–583.
21. Wilson DL. The analysis of survival (mortality) data: fitting Gompertz, Weibull, and logistic functions. Mech Ageing Dev. 1994;74(1–2):15–33. doi: 10.1016/0047-6374(94)90095-7
22. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y. Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010;26(3):392–398.
23. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58(1):267–288.
24. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
25. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–140.
26. Friedman J. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232.
27. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article 25.
28. Rose S. Mortality risk score prediction in an elderly population using machine learning. Am J Epidemiol. 2013;177(5):443–452.
29. Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biom J. 2008;50(4):457–479.
30. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–138.
31. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
32. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1–13.
33. Sill M, Hielscher T, Becker N, et al. c060: extended inference with lasso and elastic-net regularized Cox and generalized linear models. J Stat Softw. 2014;62(5):1–22.
34. Wright MN, Ziegler A. ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw. 2017;77(1):1–17.
35. Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28(5):1–26.
36. Austin PC, Harrell FE Jr, Steyerberg EW. Predictive performance of machine and statistical learning methods: impact of data-generating processes on external validity in the "large N, small p" setting. Stat Methods Med Res. 2021;30(6):1465–1483. doi: 10.1177/09622802211002867
37. Elith J, Leathwick JR, Hastie T. A working guide to boosted regression trees. J Anim Ecol. 2008;77(4):802–813.
38. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer; 2009.
39. Zhong C, Tibshirani R. Survival analysis as a classification problem. arXiv:1909.11171; 2019.
40. Kirasich K, Smith T, Sadler B. Random forest vs logistic regression: binary classification for heterogeneous datasets. SMU Data Science Review. 2018;1(3):Article 9.
41. Goldbloom A. What algorithms are most successful on Kaggle. 2016. Available at: https://www.kaggle.com/antgoldbloom/what-algorithms-are-most-successful-on-kaggle/notebook. Accessed October 17, 2021.
42. Morde V, Setty VA. XGBoost algorithm: long may she reign! Towards Data Science. April 7, 2019. Available at: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d. Accessed October 17, 2021.
43. Yadaw AS, Li YC, Bose S, Iyengar R, Bunyavanich S, Pandey G. Clinical features of COVID-19 mortality: development and validation of a clinical prediction model. Lancet Digit Health. 2020;2(10):e516–e525.
44. Wolpert D. The supervised learning no-free-lunch theorems. Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications; 2001:10–24.
