Diagnostic and Prognostic Research. 2024 Nov 5;8:15. doi: 10.1186/s41512-024-00179-z

The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration

Peter C Austin 1,2,3, Douglas S Lee 1,2, Bo Wang 4,5,6,7
PMCID: PMC11539735  PMID: 39501360

Abstract

Background

Machine learning methods are increasingly being used to predict clinical outcomes. Optimism is the difference in model performance between derivation and validation samples. The term “data hungriness” refers to the sample size needed for a modelling technique to generate a prediction model with minimal optimism. Our objective was to compare the relative data hungriness of different statistical and machine learning methods when assessed using calibration.

Methods

We used Monte Carlo simulations to assess the effect of number of events per variable (EPV) on the optimism of six learning methods when assessing model calibration: unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and stochastic gradient boosting machines using trees as the base learners. We performed simulations in two large cardiovascular datasets each of which comprised an independent derivation and validation sample: patients hospitalized with acute myocardial infarction and patients hospitalized with heart failure. We used six data-generating processes, each based on one of the six learning methods. We allowed the sample sizes to be such that the number of EPV ranged from 10 to 200 in increments of 10. We applied six prediction methods in each of the simulated derivation samples and evaluated calibration in the simulated validation samples using the integrated calibration index, the calibration intercept, and the calibration slope. We also examined Nagelkerke’s R2, the scaled Brier score, and the c-statistic.

Results

Across all 12 scenarios (2 diseases × 6 data-generating processes), penalized logistic regression displayed very low optimism even when the number of EPV was very low. Random forests and bagged trees tended to be the most data hungry and displayed the greatest optimism.

Conclusions

When assessed using calibration, penalized logistic regression was substantially less data hungry than methods from the machine learning literature.

Keywords: Machine learning, Monte Carlo simulations, Data-generating process, Random forests, Logistic regression, Generalized boosting methods, Penalized regression

Introduction

Clinical researchers are increasingly using machine learning methods to predict patient outcomes. An increasing number of studies have compared the relative predictive performance of conventional statistical prediction methods (e.g., logistic regression) with the performance of methods from the machine learning literature (e.g., random forests or generalized boosting methods). Two recent systematic reviews in the cardiovascular literature summarized such comparative studies in predicting outcomes in patients with acute myocardial infarction (AMI) and heart failure (HF) [1, 2]. Results from such comparisons have been inconsistent, with conventional statistical methods having superior performance in some studies, while machine learning methods were found to have superior performance in other studies. Frequently, such studies focus on only one aspect of model performance: discrimination. Discrimination refers to the ability of a prediction model to discriminate between those who do and do not experience the outcome of interest. Discrimination is frequently assessed using the c-statistic (which is equivalent to the area under the receiver operating characteristic (ROC) curve) [3].

Optimism is the difference in a model performance metric between the sample in which the model was derived and in an external sample in which it was validated. Van der Ploeg and colleagues coined the term “data hungriness” to refer to the sample size needed for a modelling technique to generate a prediction model with minimal optimism [4]. When using the c-statistic as the measure of model performance, they defined the data hungriness of a modelling technique as the minimum number of events per variable (EPV) at which the optimism of the generated model was < 0.01, which they admit to being an arbitrary threshold. We suggest that the threshold would likely vary depending on the performance metric. Using a variety of data-generating processes, each based on a different learning method, Van der Ploeg and colleagues found that logistic regression was the least data hungry method, attaining a stable estimate of the c-statistic with 20 to 50 EPV. The machine learning methods that they examined were all more data hungry than logistic regression. In other words, the machine learning methods required a large sample size, as measured using EPV, compared to logistic regression to achieve a stable estimate of the c-statistic.

Most comparisons of the performance of conventional statistical prediction methods with machine learning methods have focused on discrimination. However, there are two primary components to assessing the performance of a prediction model: discrimination and calibration [5]. Calibration refers to the agreement between observed and predicted probabilities. For a prediction model to be useful in clinical practice, there needs to be good agreement between predicted and observed probabilities. However, published comparisons of conventional statistical methods with machine learning methods have rarely examined calibration and have focused almost exclusively on discrimination.

In the current study, we considered three ensemble-based machine learning methods: random forests of classification trees, bootstrap aggregated (bagged) classification trees, and boosted trees. The rationale for including random forests is that two prior systematic reviews of the use of machine learning in the cardiovascular literature found that, of the tree-based machine learning methods, random forests were the most frequently used method [1, 2]. We included boosted trees as they have often been suggested as an alternative to random forests, while bagged trees are a simplification of random forests. The current research extends two previous studies. In the first, we used empirical analyses to compare the performance of tree-based ensemble methods with that of unpenalized logistic regression for predicting mortality in patients with acute myocardial infarction (AMI) or with heart failure (HF) [6]. In the second, we used Monte Carlo simulations with data-generating processes based on unpenalized logistic regression, penalized logistic regression, bagged classification trees, random forests of classification trees, and boosted trees [7]. In the current study, we restrict our focus to those prediction methods examined in second of these papers.

Our objective was to extend the framework of van der Ploeg and colleagues and assess the relative data hungriness of six different statistical and machine learning methods when assessing calibration. We considered six prediction methods for binary outcomes: unpenalized (or conventional) logistic regression, two forms of penalized logistic regression (ridge regression and lasso regression), bootstrap aggregated (bagged) classification trees, random forests of classification trees, and boosted trees. We considered six different data-generating processes, each based on a different machine or statistical learning method, and focused on optimism when assessing calibration in external samples. The paper is structured as follows: in the “Methods” section, we introduce the data on which the simulations are based, describe the six machine and statistical learning methods, and describe the design of our simulations. In the “Results” section, we report the results of these simulations. Finally, in the “Discussion” section, we summarize our findings and place them in the context of the existing literature.

Methods

We used data from a previous study in which we examined the relative performance of six different learning methods across six different data-generating processes [7]. As in the previous study, we performed simulations in two separate cohorts: patients hospitalized with AMI and patients hospitalized with HF. Within each cohort, six different data-generating processes were used, each based on fitting a different statistical or machine learning method to a derivation sample. Simulated binary outcomes were then generated in both the derivation sample and in an independent validation sample using the fitted model (the validation sample can also be referred to as the test sample; we will use the term “validation sample” throughout the paper). In this section, we describe the data, the models and algorithms used, the data-generating processes, and the statistical analyses that were conducted. In the previous study, the size of the derivation sample was fixed, and we did not consider the effect of varying the number of EPV in the derivation sample.

Data sources

We used data from The Enhanced Feedback for Effective Cardiac Treatment (EFFECT) Study [8], which collected data on patients hospitalized with heart disease during two distinct temporal periods. Detailed clinical data were collected on patients hospitalized with AMI and HF between April 1, 1999, and March 31, 2001, and between April 1, 2004, and March 31, 2005, in Ontario, Canada. We refer to the first and second temporal periods as EFFECT Phase 1 and EFFECT Phase 2, respectively. Data on patient demographics, vital signs and physical examination at presentation, medical history, and results of laboratory tests were collected for these samples. In our simulations, we consider external validation. We used the two EFFECT Phase 1 samples (AMI and HF) as derivation samples and the two EFFECT Phase 2 samples (AMI and HF) as validation samples. For the current study, after excluding individuals with missing data on any of the variables, complete data were available on 9484 and 7000 patients hospitalized with a diagnosis of AMI during the first and second phases of the study, respectively (8240 and 7608 for HF, respectively).

The outcome was a binary variable denoting whether the patient died within 1 year (365 days) of hospital admission. In the Phase 1 AMI sample, 1871 (19.7%) patients died within 1 year of hospital admission, while in the Phase 2 AMI sample 1372 (19.6%) patients died within 1 year of hospital admission. In the Phase 1 HF sample, 2698 (32.7%) patients died within 1 year of hospital admission, while in the Phase 2 HF sample, 2381 (31.3%) patients died within 1 year of hospital admission.

We considered 33 candidate predictor variables in the AMI sample. These consisted of demographic characteristics (age, sex); presentation characteristics (cardiogenic shock, acute congestive heart failure/pulmonary edema); vital signs on presentation (systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate); classic cardiac risk factors (diabetes, hypertension, current smoker, dyslipidemia, family history of coronary artery disease); comorbid conditions (cerebrovascular disease/transient ischemic attack, angina, cancer, dementia, peptic ulcer disease, previous AMI, asthma, depression, peripheral vascular disease, previous revascularization, congestive heart failure, hyperthyroidism, aortic stenosis); laboratory tests (hemoglobin, white blood count, sodium, potassium, glucose, urea, creatinine). The distribution of baseline covariates in the AMI sample is reported in Table 1.

Table 1.

Baseline and outcome variable in the EFFECT-AMI Phase 1 and Phase 2 samples

Variable EFFECT Phase 1
N = 9484
EFFECT Phase 2
N = 7000
Death within 1 year 1871 (19.7%) 1,372 (19.6%)
Age 69.0 (57.0–78.0) 70.0 (57.0–80.0)
Female 3411 (36.0%) 2590 (37.0%)
Cardiogenic shock 150 (1.6%) 22 (0.3%)
Acute congestive heart failure/pulmonary edema 537 (5.7%) 484 (6.9%)
Systolic blood pressure 146.0 (126.0–168.0) 142.0 (122.0–164.0)
Diastolic blood pressure 82.0 (70.0–95.0) 80.0 (68.0–92.0)
Heart rate 81.0 (68.0–98.0) 82.0 (69.0–99.0)
Respiratory rate 20.0 (18.0–23.0) 20.0 (18.0–22.0)
Diabetes 2491 (26.3%) 1945 (27.8%)
Hypertension 4365 (46.0%) 4086 (58.4%)
Current smoker 3061 (32.3%) 1908 (27.3%)
Dyslipidemia 2905 (30.6%) 3119 (44.6%)
Family history of CAD 2866 (30.2%) 2218 (31.7%)
Cerebrovascular disease/TIA 977 (10.3%) 861 (12.3%)
Angina 3131 (33.0%) 2115 (30.2%)
Cancer 291 (3.1%) 118 (1.7%)
Dementia 372 (3.9%) 395 (5.6%)
Peptic ulcer disease 524 (5.5%) 349 (5.0%)
Previous AMI 2189 (23.1%) 1683 (24.0%)
Asthma 519 (5.5%) 431 (6.2%)
Depression 686 (7.2%) 698 (10.0%)
Peripheral vascular disease 729 (7.7%) 600 (8.6%)
Previous revascularization 867 (9.1%) 863 (12.3%)
Congestive heart failure 472 (5.0%) 416 (5.9%)
Hyperthyroidism 121 (1.3%) 19 (0.3%)
Aortic stenosis 160 (1.7%) 138 (2.0%)
Hemoglobin 139.0 (127.0–151.0) 139.0 (125.0–151.0)
White blood count 9.6 (7.7–12.2) 9.8 (7.8–12.4)
Sodium 139.0 (137.0–141.0) 139.0 (137.0–141.0)
Potassium 4.1 (3.7–4.4) 4.1 (3.8–4.4)
Glucose 7.9 (6.4–10.9) 7.6 (6.3–10.3)
Urea 6.5 (5.1–8.7) 6.6 (5.1–9.1)
Creatinine 93.0 (78.0–115.0) 94.0 (80.0–119.0)

Continuous variables are reported as median (25th percentile–75th percentile); dichotomous variables are reported as N (%)

We considered 28 candidate predictor variables in the HF sample. These consisted of demographic characteristics (age, sex); vital signs on admission (systolic blood pressure, heart rate, respiratory rate); signs and symptoms (neck vein distension, S3, S4, rales > 50% of lung field, pulmonary edema, cardiomegaly); comorbid conditions (diabetes, cerebrovascular disease/transient ischemic attack, previous AMI, atrial fibrillation, peripheral vascular disease, chronic obstructive pulmonary disease, dementia, cirrhosis, cancer); Left bundle branch block; laboratory tests (hemoglobin, white blood count, sodium, potassium, glucose, urea, creatinine). The distribution of baseline variables in the HF sample is reported in Table 2.

Table 2.

Baseline and outcome variable in the EFFECT-CHF Phase 1 and Phase 2 samples

Variable EFFECT Phase 1
N = 8240
EFFECT Phase 2
N = 7608
Death within 1 year 2698 (32.7%) 2381 (31.3%)
Age 77.0 (70.0–84.0) 79.0 (70.0–85.0)
Female 4157 (50.4%) 3886 (51.1%)
Systolic blood pressure 146.0 (126.0–170.0) 144.0 (124.0–167.5)
Heart rate 92.0 (76.0–110.0) 90.0 (73.0–109.0)
Respiratory rate 24.0 (20.0–30.0) 24.0 (20.0–28.0)
Neck vein distension 4517 (54.8%) 4596 (60.4%)
S3 785 (9.5%) 466 (6.1%)
S4 302 (3.7%) 201 (2.6%)
Rales > 50% of lung field 903 (11.0%) 972 (12.8%)
Pulmonary edema 4218 (51.2%) 4603 (60.5%)
Cardiomegaly 2944 (35.7%) 3372 (44.3%)
Diabetes 2874 (34.9%) 2858 (37.6%)
Cerebrovascular disease/TIA 1374 (16.7%) 1401 (18.4%)
Previous AMI 3021 (36.7%) 2774 (36.5%)
Atrial fibrillation 2403 (29.2%) 2714 (35.7%)
Peripheral vascular disease 1082 (13.1%) 1026 (13.5%)
Chronic obstructive pulmonary disease 1405 (17.1%) 1747 (23.0%)
Dementia 642 (7.8%) 766 (10.1%)
Cirrhosis 63 (0.8%) 55 (0.7%)
Cancer 950 (11.5%) 880 (11.6%)
Left bundle branch block 1232 (15.0%) 1033 (13.6%)
Hemoglobin 124.0 (110.0–138.0) 123.0 (109.0–137.0)
White blood count 9.0 (7.1–11.6) 8.9 (7.0–11.5)
Sodium 139.0 (136.0–141.0) 139.0 (136.0–142.0)
Potassium 4.2 (3.9–4.6) 4.2 (3.9–4.6)
Glucose 7.5 (6.1–10.7) 7.3 (6.0–10.1)
Urea 8.4 (6.1–12.4) 8.4 (6.2–12.2)

Continuous variables are reported as median (25th percentile–75th percentile); dichotomous variables are reported as N (%)

Statistical and machine learning methods for predicting mortality

We considered six different methods for predicting the probability of 1-year mortality: unpenalized (or conventional) logistic regression, two forms of penalized logistic regression (ridge regression and lasso regression), bootstrap aggregated (bagged) classification trees, random forests of classification trees, and boosted trees [9–14]. We use the term “penalized logistic regression” or “penalized regression” to refer collectively to ridge regression and lasso regression. We considered these six methods as they were the methods examined in our previous paper [7].

When using unpenalized logistic regression to predict the probability of 1-year mortality, the regression model included all the variables listed above as main effects. The relationship between the log-odds of death and each continuous variable was modeled using restricted cubic smoothing splines [15]. For the unpenalized logistic regression, there was one hyper-parameter: the number of knots used when constructing restricted cubic splines. Both forms of penalized logistic regression (ridge regression and lasso regression) used all the variables included in the unpenalized logistic regression model (however, for continuous variables, only linear terms were considered). For bagged classification trees, a classification tree was grown in each of 500 bootstrap samples. A hyper-parameter was the minimum size of the terminal nodes. For random forests, 500 classification trees were grown. For random forests, there were two hyper-parameters: the minimum size of terminal nodes and the number of variables randomly sampled as candidate variables for defining each binary split. For boosted trees, we applied Friedman’s stochastic gradient boosting machines using trees as the base learners (referred to hereafter as boosted trees) [12, 16, 17]. For boosted trees, we considered sequences of 100 trees. There were two hyper-parameters: the interaction depth (specifying the maximum depth of each tree) and the shrinkage or learning rate parameter. Hyper-parameter tuning was performed in the EFFECT Phase 1 sample as described in our previous study [18]. The tuned values of the hyper-parameters were used for all subsequent analyses.

For all methods, we used implementations available in the R statistical software language (R version 3.6.3, R Foundation for Statistical Computing, Vienna, Austria). We used the lrm and rcs functions from the rms package (version 6.0–1) to estimate the unpenalized logistic regression model incorporating restricted cubic regression splines, with standard maximum likelihood used for model estimation. Ridge regression and lasso regression were implemented using the functions cv.glmnet (for estimating the λ parameter using tenfold cross-validation) and glmnet from the glmnet package (version 4.0–2). For bagging and random forests, we used the randomForest function from the randomForest package (version 4.6–14). When fitting bagged classification trees, the mtry parameter was set to 33 (AMI sample) or 28 (HF sample), so that all variables were considered at each split. The number of trees (500) was the default in this implementation. For boosted trees, we used the gbm function from the gbm package (version 2.1.5). The number of trees (100) was the default in this implementation.
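
The study itself was implemented in R. For readers working in Python, a rough analogue of the cv.glmnet/glmnet workflow above can be sketched with scikit-learn; the function name and settings below are our own illustration, not part of the study's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def fit_penalized_logistic(X, y, penalty="l2", folds=10):
    """Ridge ("l2") or lasso ("l1") logistic regression with the
    penalty strength chosen by k-fold cross-validation, loosely
    mirroring R's cv.glmnet/glmnet workflow described above."""
    solver = "liblinear" if penalty == "l1" else "lbfgs"
    return LogisticRegressionCV(
        penalty=penalty, solver=solver, cv=folds, Cs=20, max_iter=5000
    ).fit(X, y)
```

As in the paper, only linear terms for continuous predictors would enter the penalized models.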

Six data-generating processes for simulating outcomes

We considered six different data-generating processes for each of the two diseases (AMI and HF). We describe the approach in detail for the AMI sample. An identical approach was used with the HF sample. We used the EFFECT Phase 1 sample as the derivation sample and the EFFECT Phase 2 sample as the validation sample. For a given learning method (e.g., unpenalized logistic regression), the method was fit in the EFFECT Phase 1 sample.

We then created an enriched version of the EFFECT Phase 1 derivation sample containing four copies of each subject. This was done so that the sample size of the derivation sample would be large enough to assess a wide range of EPV. Without this enrichment, we would have been constrained to a maximum EPV of 56 in the AMI derivation sample.

The fitted model was then applied to both the enriched derivation sample (EFFECT Phase 1) and the validation sample (EFFECT Phase 2). Using the model fit in the original (unenriched) derivation sample, a predicted probability of the outcome was obtained for each subject in each of the two samples (the enriched Phase 1 (derivation sample) and Phase 2 (validation sample)). Using these predicted probabilities, a binary outcome was simulated for each subject using a Bernoulli distribution with the given subject-specific probability. Note that for the four copies of each subject in the enriched derivation sample, the simulated outcome was not necessarily identical across the four copies, as we generated a binary outcome for each of these subjects separately.
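
The outcome-simulation step can be sketched as follows (Python is used here for illustration; the study itself used R). Each subject, including each copy in the enriched sample, receives an independent Bernoulli draw at their model-based predicted probability; the risks below are hypothetical.

```python
import numpy as np

def simulate_outcomes(predicted_probs, rng):
    """One independent Bernoulli draw per subject at that subject's
    predicted probability of the outcome."""
    return rng.binomial(n=1, p=np.asarray(predicted_probs))

rng = np.random.default_rng(2024)
# Four enriched copies of each of two subjects (hypothetical risks).
p_enriched = np.repeat([0.2, 0.7], 4)
y_sim = simulate_outcomes(p_enriched, rng)
# Copies of the same subject need not share the same simulated outcome,
# since each copy is drawn independently.
```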

For the AMI sample, the required number of events is Nevents = 33 × EPV, where EPV is the specified number of events per variable, since there were 33 candidate predictor variables. The required sample size (comprising those with and without events) is N = Nevents/poutcome, where poutcome is the observed outcome rate in the original EFFECT Phase 1 AMI sample (0.196). The number of non-events is thus Nnon-events = N − Nevents. We stratified the enriched derivation sample into two strata: the first consisted of those for whom the simulated outcome occurred (Y = 1), and the second of those for whom the simulated outcome did not occur (Y = 0). From the first stratum, we sampled Nevents subjects with replacement; from the second stratum, we sampled Nnon-events subjects with replacement. These two samples were combined to form the new derivation sample with the required number of EPV. This process was repeated 100 times, resulting in 100 pairs of derivation and validation samples, with the derivation sample having the required number of EPV.

We repeated this process allowing the required number of EPV to range from 10 to 200 in increments of 10.
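
The sample-size arithmetic and stratified resampling described above can be sketched as follows (a Python illustration with our own variable names; the study used R):

```python
import numpy as np

def epv_derivation_sample(X, y, n_predictors, epv, outcome_rate, rng):
    """Draw a derivation sample with the required events per variable.

    n_events subjects are sampled with replacement from the stratum with
    simulated events (Y = 1) and n_nonevents from the stratum without
    (Y = 0), as described above.
    """
    n_events = n_predictors * epv                   # e.g., 33 x EPV for AMI
    n_total = int(round(n_events / outcome_rate))   # events + non-events
    n_nonevents = n_total - n_events
    events = np.flatnonzero(y == 1)
    nonevents = np.flatnonzero(y == 0)
    idx = np.concatenate([
        rng.choice(events, size=n_events, replace=True),
        rng.choice(nonevents, size=n_nonevents, replace=True),
    ])
    return X[idx], y[idx]
```

Repeating this draw 100 times for each EPV value from 10 to 200 yields the derivation samples used in the simulations.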

The above process was repeated for each of the six different statistical/machine learning methods. Thus, we had a data-generating process based on unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and boosted trees. This approach to simulating outcomes is similar to that used in our previous study and in van der Ploeg et al.’s study examining the “data hungriness” of different statistical and machine learning methods when assessed using discrimination [4, 18].

We did not use the original, unsimulated (i.e., observed) outcomes because, had we done so, we would not have known the true data-generating process; we would simply have been comparing the performance of different prediction methods in an empirical dataset, which was done in a previous study [6]. We wanted to assess the performance of each method under different known data-generating processes. Furthermore, had we used the observed outcomes, all replicates of a given subject would have shared the same outcome, and no additional stochastic variation would have been introduced into the enriched samples.

Determining data hungriness of different predictive methods under different data-generating processes

For a given pair of derivation and validation samples, we fit each of the six statistical/machine learning methods (unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and boosted trees) in the derivation sample and then applied the fitted model to both the derivation sample and the validation sample. In each of the derivation and validation samples, we obtained, for each subject, a predicted probability of the outcome for each of the six prediction methods. We assessed the performance of each model in both the derivation sample and the validation sample. Our primary focus was on assessing calibration, using three metrics: (i) the integrated calibration index (ICI); (ii) the calibration intercept; and (iii) the calibration slope. The ICI is a calibration metric that denotes the mean absolute difference between observed proportions and the predicted probability of the outcome. It is equivalent to the weighted difference between a smoothed calibration curve and the diagonal line denoting perfect calibration, averaged across the distribution of predicted risk [19, 20]. Values of the ICI closer to zero denote better calibration. The calibration intercept denotes the extent to which predictions are systematically too low or too high; it assesses what has been referred to as calibration-in-the-large [5]. Ideally, the calibration intercept should equal zero. The calibration slope, on the logit scale, assesses deviation between observed and expected probabilities of mortality across the range of predicted risk. Deviation of the calibration slope from one denotes miscalibration and indicates whether predicted probabilities needed to be shrunk at model development; ideally, the calibration slope should equal one.
The calibration intercept and slope are obtained by using logistic regression to regress the binary outcome on the linear predictor (log-odds) derived from the predicted probability. As secondary measures of model accuracy, we used Nagelkerke’s generalized R2 statistic and the scaled Brier score [3, 15, 19]. Brier’s score is defined as $\frac{1}{N}\sum_{i=1}^{N}(\hat{P}_i - Y_i)^2$, where $Y_i$ and $\hat{P}_i$ denote the observed outcome and predicted probability for the ith subject, respectively. The scaled Brier score is Brier’s score scaled by its maximum possible score, so that higher values of the scaled Brier score indicate greater predictive accuracy. Finally, to replicate the previous findings of van der Ploeg and colleagues in different analytic samples, we assessed discrimination using the c-statistic. For each performance metric, the difference in performance between the derivation and validation samples is denoted optimism (optimism = performance in derivation sample − performance in validation sample). For each performance measure, we computed the optimism for each of the 100 pairs of derivation and validation samples and then computed the mean optimism across the 100 pairs.
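
These computations can be sketched in Python as follows (the study used R's val.prob; the joint fit of intercept and slope below is one common convention, the ICI would additionally require a smoother, and all names are our own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_intercept_slope(y, p):
    """Regress the binary outcome on the linear predictor (log-odds of
    the predicted probabilities); the fitted intercept and slope are
    the calibration intercept and calibration slope."""
    lp = np.log(p / (1 - p)).reshape(-1, 1)
    fit = LogisticRegression(C=1e6).fit(lp, y)  # large C: ~unpenalized
    return fit.intercept_[0], fit.coef_[0, 0]

def brier_score(y, p):
    """Mean squared difference between outcomes and predictions."""
    return float(np.mean((np.asarray(p) - np.asarray(y)) ** 2))

def optimism(perf_derivation, perf_validation):
    """Optimism = performance in derivation sample - performance in
    validation sample."""
    return perf_derivation - perf_validation
```

When the outcomes are generated from the same model that produced the predictions, the fitted intercept and slope should be close to zero and one, respectively.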

These six performance measures were computed using the val.prob function from the rms package (version 5.1–3.1).

Results

AMI sample

The optimism in the estimated performance measures of the six prediction methods under the six data-generating processes is reported in Figs. 1, 2, 3, 4, 5 and 6. There is one figure for each of the six performance measures. Within each figure, there are six panels, one for each data-generating process. Note that for a given performance measure, the same vertical axis scale is used in each of the six panels. While van der Ploeg and colleagues used a threshold of 0.01 to denote minimal optimism, we interpret each figure qualitatively, since the threshold denoting minimal optimism likely varies across performance metrics.

Fig. 1. Optimism in the ICI across six data-generating processes (DGP) and six models: AMI sample

Fig. 2. Optimism in the calibration intercept across six data-generating processes (DGP) and six models: AMI sample

Fig. 3. Optimism in the calibration slope across six data-generating processes (DGP) and six models: AMI sample

Fig. 4. Optimism in R2 across six data-generating processes (DGP) and six models: AMI sample

Fig. 5. Optimism in the scaled Brier score across six data-generating processes (DGP) and six models: AMI sample

Fig. 6. Optimism in the c-statistic across six data-generating processes (DGP) and six models: AMI sample

Results for the ICI are reported in Fig. 1. Figure 1 consists of six panels, one for each of the data-generating processes. For instance, the top-left panel contains the results for the scenario in which outcomes were generated using unpenalized logistic regression, while the bottom-right panel contains the results for the scenario in which outcomes were generated using boosted trees. Within each panel are six curves, each representing the relationship between the number of EPV (horizontal axis) and optimism (vertical axis) for a given analysis method. For instance, the pink curve denotes the relationship between the number of EPV and optimism when random forests were used as the analysis method.

Lower values of the ICI indicate better calibration than do higher values. Thus, if the optimism is positive, calibration is worse in the derivation sample than in the validation sample. Conversely, if the optimism is negative, calibration is better in the derivation sample than in the validation sample. Intuitively, we would anticipate that calibration would deteriorate in the validation sample compared to in the derivation sample.

The three tree-based methods tended to have positive optimism, indicating the calibration was better in the validation sample than in the derivation sample, while the opposite tended to be observed for the three logistic regression-based methods. Ridge regression tended to display the least optimism in estimating the ICI across all six data-generating processes. Under the three logistic regression-based data-generating processes (top three panels), the use of ridge regression displayed optimism that was essentially equal to zero, even when the number of EPV was as low as 10. Under the three tree-based data-generating processes (bottom three panels), the use of ridge regression displayed an optimism that was essentially zero once the number of EPV was at least 40 (under the boosted trees data-generating process, its optimism was close to zero even when the number of EPV was 10). Under the three tree-based data-generating processes, the optimism for boosted trees tended to be similar to that of lasso regression. Across all six data-generating processes, the three logistic regression-based methods tended to reach a plateau in optimism once the number of EPV reached approximately 50 (and even lower for the two forms of penalized logistic regression). Across most data-generating processes and most values of EPV, bagged trees and random forests displayed greater optimism than did the other methods. Only when the data-generating process was based on either bagged trees or random forests and the number of EPV was very large, did the optimism of these two methods approach that of the other four methods. Finally, optimism decreased substantially with increasing EPV for bagged trees and random forests. In summary, when assessed using ICI, ridge regression was the least data hungry method across all six data-generating processes, while bagged trees and random forests tended to be the most data hungry methods.

Optimism in estimating the calibration intercept is reported in Fig. 2. Across all six data-generating processes, the three logistic regression-based methods displayed the lowest optimism. When the number of EPV was less than 40, unpenalized logistic regression had marginally higher optimism than did the two penalized logistic regression models. The optimism of the two penalized logistic regression models was virtually indistinguishable. The optimism of the three tree-based methods was greater than that of the three logistic regression-based methods. However, boosted trees displayed less optimism than did bagged trees and random forests. These latter two methods displayed large optimism even when the number of EPV was high. Results for optimism in estimating the calibration slope (Fig. 3) were qualitatively comparable to those for the calibration intercept.

Results for R2 are reported in Fig. 4. Higher values of R2 denote better model performance. Thus, positive values of optimism indicate that R2 was lower in the validation sample than in the derivation sample. In general, when the number of EPV was less than approximately 50, unpenalized logistic regression tended to display modestly greater optimism than did ridge regression and lasso regression. The two penalized forms of logistic regression tended to have very similar optimism to one another. When the data-generating process was based on one of the three logistic regression models and the number of EPV was less than approximately 120, then penalized logistic regression tended to result in estimates of R2 with the lowest optimism. When the number of EPV exceeded 120, then boosted trees tended to display the lowest optimism. However, differences between boosted trees and the three logistic regression-based methods were negligible when the number of EPV exceeded 120. Across all six data-generating processes, bagged trees and random forests produced estimates of R2 with the greatest optimism. In summary, the two forms of penalized logistic regression tended to be the least data hungry methods when assessed using R2.

Results for the scaled Brier score are reported in Fig. 5. Higher values of the scaled Brier score denote better model performance. Thus, positive values of optimism indicate that the scaled Brier score was lower in the validation sample than in the derivation sample. Results were qualitatively similar to those observed for R2.
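
The scaled Brier score compares the mean squared error of the predictions with that of an uninformative model that assigns everyone the overall event rate. A minimal sketch (our illustration, not the authors' code):

```python
def scaled_brier(y, p):
    """Scaled Brier score: 1 - Brier / Brier_null, where the null model
    predicts the overall event rate for everyone. 1 is perfect, 0 matches
    the uninformative model, and negative values are worse than it."""
    n = len(y)
    ybar = sum(y) / n
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n
    brier_null = sum((ybar - yi) ** 2 for yi in y) / n  # = ybar * (1 - ybar)
    return 1 - brier / brier_null
```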

Results for the c-statistic are reported in Fig. 6. Higher values of the c-statistic denote better model performance. Thus, positive values of optimism indicate that the c-statistic was lower in the validation sample than in the derivation sample. Results were qualitatively similar to those observed for R2 and the scaled Brier score.
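
The c-statistic has a simple pairwise interpretation: the proportion of event/non-event pairs in which the event received the higher predicted probability. As an illustrative sketch (our code, not the authors'):

```python
def c_statistic(y, p):
    """Concordance (c-statistic / area under the ROC curve): the fraction of
    event vs. non-event pairs in which the event has the higher predicted
    probability; ties count one half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = sum((pe > pn) + 0.5 * (pe == pn)
                     for pe in events for pn in nonevents)
    return concordant / (len(events) * len(nonevents))
```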

HF sample

The performance of the six prediction methods under the six data-generating processes is reported in Figs. 7, 8, 9, 10, 11, and 12. The figures are structured in the same way as those for the AMI sample, and the results tended to be qualitatively similar to those observed in the AMI sample.

Fig. 7. Optimism in the ICI across six data-generating processes (DGP) and six models: HF sample

Fig. 8. Optimism in the calibration intercept across six data-generating processes (DGP) and six models: HF sample

Fig. 9. Optimism in the calibration slope across six data-generating processes (DGP) and six models: HF sample

Fig. 10. Optimism in R2 across six data-generating processes (DGP) and six models: HF sample

Fig. 11. Optimism in the scaled Brier score across six data-generating processes (DGP) and six models: HF sample

Fig. 12. Optimism in the c-statistic across six data-generating processes (DGP) and six models: HF sample

Discussion

We compared the relative data hungriness of six statistical and machine learning methods across six data-generating processes and two cardiovascular diseases (for a total of 12 scenarios). When assessing calibration in an independent validation sample, our findings were relatively consistent across the 12 scenarios. First, penalized logistic regression (ridge regression and lasso regression) tended to display the least optimism, even at a very low number of EPV. Once the number of EPV reached 40, further increases had virtually no effect on optimism in estimating the three main calibration metrics (ICI, calibration intercept, and calibration slope). When estimating the ICI, ridge regression displayed modestly less optimism than lasso regression in some settings with a low number of EPV. Second, regardless of the data-generating process, bagged trees and random forests tended to display the greatest data hungriness, requiring a very large number of EPV for their optimism to approach that of the other methods. Third, of the tree-based methods, boosted trees tended to display the least data hungriness.

Van der Ploeg and colleagues coined the term “data hungry” when comparing the effect of the number of EPV on the optimism in estimating the c-statistic for five different statistical and machine learning methods [4]. We are unaware of any other authors who have examined this issue since their initial study. The design of our simulations was motivated by theirs, but there were important differences between the two studies. While van der Ploeg and colleagues focused on discrimination, as assessed using the c-statistic, the focus of the current study was calibration. A motivation for this focus is that calibration is frequently omitted when assessing the performance of machine learning methods. Their simulations were based on three cohorts: patients with head and neck cancer, patients with traumatic brain injury, and patients suspected of head injury who underwent a CT scan. The number of available candidate predictor variables in these three cohorts was 7 (1 continuous), 9 (4 continuous), and 12 (2 continuous), respectively. Thus, another important difference is the number of available candidate predictor variables: in the current study, there were 33 and 28 candidate predictor variables in the AMI and HF samples, respectively, several of which were continuous, reflective of what is frequently observed in clinical cardiovascular research. A final difference between the two studies is the learning methods considered. Van der Ploeg and colleagues considered five methods: unpenalized logistic regression, classical regression trees, support vector machines, neural networks, and random forests. We elected not to consider regression trees, as they have been shown to have poor performance in predicting outcomes in cardiovascular patients [6, 21, 22], and their use in applied research appears to have been supplanted by tree-based ensemble methods such as random forests and boosted trees, both of which were considered in the current study. We did not consider support vector machines, as they are rarely used for predicting outcomes in cardiovascular patients [1, 2]. We added two forms of penalized logistic regression to examine whether shrinkage of the regression coefficients would improve performance, and we did not include neural networks because of difficulties in fitting these models in samples with the observed event rates. Our observations for calibration reflect what van der Ploeg and colleagues observed for discrimination. They concluded that modern prediction methods may need over 10 times as many EPV as classical modelling techniques such as unpenalized logistic regression to achieve a small optimism. Similar to the current study, they also observed that random forests were often the most data hungry method across a range of datasets and data-generating processes. By including discrimination (assessed using the c-statistic) in the current study, we were able to complement their findings: like them, we observed that unpenalized logistic regression tended to be less data hungry than modern machine learning methods, but we also showed that penalized logistic regression was even less data hungry than unpenalized logistic regression. We further complemented the earlier study by considering scenarios with a larger number of candidate predictor variables.

Calibration is an important, albeit frequently omitted, aspect of assessing the performance of risk prediction models. Calibration matters in clinical medicine because risk prediction models are often used for risk stratification and to inform medical decision making; one therefore wants a model whose estimated risks correspond closely to observed risks. An important implication of our findings is that prediction models developed using machine learning methods in samples with a low number of EPV can have apparent performance that is substantially better than what would be observed in independent validation data. Readers of published studies that do not use an independent validation sample, or that do not provide optimism-corrected estimates of model performance, should be aware that the performance of machine learning-based prediction models developed in samples with a low number of EPV will likely be degraded upon subsequent validation in independent samples. The ability of a prediction model to maintain its predictive accuracy upon external validation is important to clinicians, who require demonstration of broad validation before adopting a model in the clinical setting [23, 24]. Since derivation and validation of a new model require data collection, a data hungry method that requires several-fold more patients and events may not be the optimal approach for clinical researchers facing finite resources. Net Benefit is a decision-analytic measure that incorporates the relative costs of false negatives and false positives when making decisions informed by a clinical prediction model [25]. Van Calster and Vickers, examining the impact of model miscalibration on Net Benefit, found that miscalibration decreased Net Benefit [26]. Thus, miscalibration can lower the clinical utility of a clinical prediction model. Our findings therefore have practical significance for medical decision making. We found that, for small to moderate sample sizes, tree-based machine learning methods tended to display greater optimism than penalized logistic regression models, so their apparent calibration will be better than their true calibration. Consequently, when clinicians assess the Net Benefit of a particular model, the true Net Benefit may be lower than the apparent Net Benefit, and this discrepancy may be greater when tree-based methods, rather than penalized logistic regression, were used with small to medium sample sizes.

Dhiman and colleagues conducted a systematic review of prognostic models in oncology that were developed using machine learning [27]. When models were developed using regression-based methods, the median number of EPV across studies was 8 (25th and 75th percentiles: 7.1 to 23.5); when using ensemble-based methods, the median number of EPV was 1.7 (25th and 75th percentiles: 1.1 to 6). Thus, in the large majority of models developed using ensemble-based methods in the oncology literature, the observed number of EPV was below the lowest EPV considered in the current study, and our results suggest that the optimism in the calibration metrics of these prognostic models is likely large. Furthermore, Dhiman and colleagues found that calibration was reported in only 11 of the 62 included studies. This suggests that, while assessing calibration is a key component of deriving and validating prognostic models, it is often neglected. Andaur Navarro and colleagues conducted a systematic review of the development of machine learning prediction models [28]. Of the 152 included studies, 28 (18.4%) provided information on sample size. The median number of events per candidate predictor variable was 12.5 (interquartile range: 5.7 to 27.7). Based on our findings, we would hypothesize that many of the developed models would report overly optimistic performance in the derivation sample. The authors of the review further reported that calibration was assessed for only 5.4% of the developed prediction models, reinforcing the finding of Dhiman and colleagues that calibration is often omitted when assessing machine learning methods. For the 11 models that reported an apparent calibration slope in the derivation sample, the median was 1.05; for the 15 models for which an optimism-corrected calibration slope was reported, the median was 1.3, suggesting that the naïve performance estimates may be overly optimistic.

In the current study, we found that penalized logistic regression (ridge regression and lasso regression) tended to result in estimates of model performance that displayed little optimism. However, it is worth noting that penalized regression models are not a panacea for all the problems that can arise with small sample sizes. Using both empirical analyses and simulations, Riley and colleagues found that penalized methods could produce unreliable clinical prediction models, particularly when sample sizes were small [29]. Part of the reason for the poor performance in small samples is that the tuning parameter may be imprecisely estimated. In the current study, the smallest EPV that we considered was 10, whereas Riley and colleagues examined EPV ranging from 2.5 to 25 in increments of 2.5. The variability in the estimated tuning parameter was greatest when the EPV was 2.5; once the number of EPV reached 10 (the smallest value in our study), further increases had a relatively modest impact on this variability. The relatively good performance of penalized regression in the current study may therefore reflect the fact that our design did not consider settings with a very low number of EPV.
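
The instability Riley and colleagues describe concerns estimating the tuning parameter; the fitting step itself is mechanically simple. As a hedged sketch (our illustration, not the authors' implementation, which used standard R routines; `lam` is assumed to be supplied externally, e.g., from cross-validation), ridge-penalized logistic regression can be fit by Newton-Raphson with the penalty added to the gradient and Hessian:

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ridge_logistic(X, y, lam, iters=50):
    """Ridge-penalized logistic regression via Newton-Raphson.
    Maximizes log-likelihood - (lam/2) * sum(beta_j^2) over the slopes;
    the intercept beta[0] is left unpenalized, as is conventional.
    X: rows of features (no intercept column); lam: tuning parameter."""
    k = len(X[0])
    rows = [[1.0] + list(xi) for xi in X]          # prepend intercept column
    beta = [0.0] * (k + 1)
    for _ in range(iters):
        g = [0.0] * (k + 1)                        # penalized gradient
        H = [[0.0] * (k + 1) for _ in range(k + 1)]  # observed information
        for xi, yi in zip(rows, y):
            eta = sum(b * x for b, x in zip(beta, xi))
            mu = 1 / (1 + math.exp(-eta))
            w = mu * (1 - mu)
            for j in range(k + 1):
                g[j] += (yi - mu) * xi[j]
                for l in range(k + 1):
                    H[j][l] += w * xi[j] * xi[l]
        for j in range(1, k + 1):                  # penalty (skip intercept)
            g[j] -= lam * beta[j]
            H[j][j] += lam
        beta = [b + d for b, d in zip(beta, solve(H, g))]
    return beta
```

A larger `lam` shrinks the slopes toward zero; with `lam` near zero the fit approaches unpenalized logistic regression.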

A limitation of the current study is its focus on predicting the probability of a binary outcome. However, predicting binary outcomes is common in clinical research, and many studies have applied machine learning methods for this purpose. Future research is necessary to compare the relative performance of these methods for predicting continuous and time-to-event outcomes. A second limitation is that, while the correlation structures of the candidate predictor variables reflected those observed for patients hospitalized with two different cardiovascular diseases, different results might be observed in settings with substantially different correlation structures. However, basing our simulations on hospitalized patients with cardiovascular disease strengthens the generalizability of our findings to future applications of statistical and machine learning methods in similar clinical contexts. A third limitation is that the number of candidate predictor variables was approximately the same in the AMI and HF samples. However, this number of predictor variables is consistent with what is often available in clinical research, and, by allowing the sample size to vary, we were able to consider a wide range of EPV. A fourth limitation was the omission of neural networks. In our earlier study examining the performance of different learning methods in the full EFFECT samples (i.e., without modifying the sample size to have a specified number of EPV), we were unable to successfully fit neural networks despite attempting two different R implementations [7], even though the number of EPV was 56 in the AMI sample. Problems with model estimation would only have been compounded at smaller numbers of EPV. In contrast to these limitations, a strength of the current study was its focus on external validation: by using derivation and validation samples from different time periods, we examined the performance of each method in an independent validation sample. Another strength is that our simulations were based on patients hospitalized with a specific cardiovascular condition, so the multivariate structure of the samples was reflective of data used in cardiovascular outcomes research.

Conclusion

Penalized logistic regression (ridge regression and lasso regression) tended to result in estimates of calibration metrics that displayed at most minor optimism. Both unpenalized and penalized logistic regression tended to be substantially less data hungry than ensemble-based methods from the machine learning literature. Among the three ensemble-based machine learning methods, boosted trees displayed less optimism than bagged trees and random forests. We encourage researchers to consider penalized logistic regression when developing clinical prediction models.

Acknowledgements

Not applicable.

Abbreviations

AMI

Acute myocardial infarction

EPV

Events per variable

EFFECT

Enhanced Feedback for Effective Cardiac Treatment

HF

Heart failure

ICI

Integrated calibration index

ROC

Receiver operating characteristic

Authors’ contributions

PCA conceived the study, design and conducted the simulations and statistical analyses, and drafted the manuscript. DSL and BW revised the manuscript for important intellectual content. All authors approved the final manuscript.

Funding

ICES is an independent, non-profit research institute funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC). This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC). This study also received funding from the Canadian Institutes of Health Research (CIHR) (PJT - 183898).

Data availability

As a prescribed entity under Ontario’s privacy legislation, ICES is authorized to collect and use health care data for the purposes of health system analysis, evaluation, and decision support. Secure access to these data is governed by policies and procedures that are approved by the Information and Privacy Commissioner of Ontario. The use of the data in this project is authorized under section 45 of Ontario’s Personal Health Information Protection Act (PHIPA) and does not require review by a Research Ethics Board. This document used data adapted from the Statistics Canada Postal CodeOM Conversion File, which is based on data licensed from Canada Post Corporation, and/or data adapted from the Ontario Ministry of Health Postal Code Conversion File, which contains data copied under license from ©Canada Post Corporation and Statistics Canada. Parts of this material are based on data and/or information compiled and provided by CIHI and the Ontario Ministry of Health. The analyses, conclusions, opinions, and statements expressed herein are solely those of the authors and do not reflect those of the funding or data sources; no endorsement is intended or should be inferred. The dataset from this study is held securely in coded form at ICES. While legal data sharing agreements between ICES and data providers (e.g., healthcare organizations and government) prohibit ICES from making the dataset publicly available, access may be granted to those who meet pre-specified criteria for confidential access, available at www.ices.on.ca/DAS (email: das@ices.on.ca).

Declarations

Ethics approval and consent to participate

The use of the data in this project is authorized under section 45 of Ontario’s Personal Health Information Protection Act (PHIPA) and does not require review by a Research Ethics Board.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Cho SM, Austin PC, Ross HJ, et al. Machine learning compared with conventional statistical models for predicting myocardial infarction readmission and mortality: a systematic review. Can J Cardiol. 2021;37:1207–14. 10.1016/j.cjca.2021.02.020.
  • 2. Shin S, Austin PC, Ross HJ, et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Heart Fail. 2021;8:106–15. 10.1002/ehf2.13073.
  • 3. Steyerberg EW. Clinical prediction models. 2nd ed. New York: Springer-Verlag; 2019.
  • 4. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137. 10.1186/1471-2288-14-137.
  • 5. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21:128–38.
  • 6. Austin PC, Lee DS, Steyerberg EW, et al. Regression trees for predicting mortality in patients with cardiovascular disease: what improvement is achieved by using ensemble-based methods? Biom J. 2012;54:657–73. 10.1002/bimj.201100251.
  • 7. Austin PC, Harrell FE Jr, Steyerberg EW. Predictive performance of machine and statistical learning methods: impact of data-generating processes on external validity in the “large N, small p” setting. Stat Methods Med Res. 2021;30:1465–83. 10.1177/09622802211002867.
  • 8. Tu JV, Donovan LR, Lee DS, et al. Effectiveness of public report cards for improving the quality of cardiac care: the EFFECT study: a randomized trial. J Am Med Assoc. 2009;302:2330–7.
  • 9. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
  • 10. Bühlmann P, Hothorn T. Boosting algorithms: regularization, prediction and model fitting. Stat Sci. 2007;22:477–505.
  • 11. Freund Y, Schapire R. Experiments with a new boosting algorithm. In: ICML'96: Proceedings of the Thirteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers Inc.; 1996. p. 148–56.
  • 12. Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion). Ann Stat. 2000;28:337–407.
  • 13. McCaffrey DF, Ridgeway G, Morral AR. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods. 2004;9:403–25.
  • 14. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York, NY: Springer; 2009.
  • 15. Harrell FE Jr. Regression modeling strategies. 2nd ed. New York, NY: Springer-Verlag; 2015.
  • 16. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38:367–78.
  • 17. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.
  • 18. Austin PC, Harrell FE Jr, Lee DS, et al. Empirical analyses and simulations showed that different machine and statistical learning methods had differing performance for predicting blood pressure. Sci Rep. 2022;12:9312. 10.1038/s41598-022-13015-5.
  • 19. Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 2019;38:4051–65. 10.1002/sim.8281.
  • 20. Austin PC, Steyerberg EW. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med. 2014;33:517–35. 10.1002/sim.5941.
  • 21. Austin PC. A comparison of regression trees, logistic regression, generalized additive models, and multivariate adaptive regression splines for predicting AMI mortality. Stat Med. 2007;26:2937–57.
  • 22. Austin PC, Tu JV, Lee DS. Logistic regression had superior performance compared to regression trees for predicting in-hospital mortality in patients hospitalized with heart failure. J Clin Epidemiol. 2010;63:1145–55. 10.1016/j.jclinepi.2009.12.004.
  • 23. Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006;144:201–9. 10.7326/0003-4819-144-3-200602070-00009.
  • 24. Lee DS, Straus SE, Farkouh ME, et al. Trial of an intervention to improve acute heart failure outcomes. N Engl J Med. 2023;388:22–32. 10.1056/NEJMoa2211680.
  • 25. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6. 10.1136/bmj.i6.
  • 26. Van Calster B, Vickers AJ. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Making. 2015;35:162–9. 10.1177/0272989X14547233.
  • 27. Dhiman P, Ma J, Andaur Navarro CL, et al. Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review. BMC Med Res Methodol. 2022;22:101. 10.1186/s12874-022-01577-x.
  • 28. Andaur Navarro CL, Damen JAA, van Smeden M, et al. Systematic review identifies the design and methodological conduct of studies on machine learning-based prediction models. J Clin Epidemiol. 2023;154:8–22. 10.1016/j.jclinepi.2022.11.015.
  • 29. Riley RD, Snell KIE, Martin GP, et al. Penalization and shrinkage methods produced unreliable clinical prediction models especially when sample size was small. J Clin Epidemiol. 2021;132:88–96. 10.1016/j.jclinepi.2020.12.005.
