Abstract
Background
The application of machine learning (ML) to analyze clinical data with the goal of predicting patient outcomes has garnered increasing attention. Ensemble learning has been used in conjunction with ML to improve predictive performance. Although stacked generalization (stacking), a type of heterogeneous ensemble of ML models, has emerged in clinical data analysis, it remains unclear how to define the best model combinations for strong predictive performance. This study develops a methodology to evaluate the performance of “base” learner models and their optimized combination using “meta” learner models in stacked ensembles, in order to accurately assess performance in the context of clinical outcomes.
Methods
De-identified COVID-19 data was obtained from the University of Louisville Hospital, where a retrospective chart review was performed from March 2020 to November 2021. Three differently-sized subsets using features from the overall dataset were chosen to train and evaluate ensemble classification performance. The number of base learners chosen from several algorithm families coupled with a complementary meta learner was varied from a minimum of 2 to a maximum of 8. Predictive performance of these combinations was evaluated in terms of mortality and severe cardiac event outcomes using area-under-the-receiver-operating-characteristic (AUROC), F1, balanced accuracy, and kappa.
Results
The results highlight the potential to accurately predict clinical outcomes, such as severe cardiac events with COVID-19, from routinely acquired in-hospital patient data. Meta learners Generalized Linear Model (GLM), Multi-Layer Perceptron (MLP), and Partial Least Squares (PLS) had the highest AUROC for both outcomes, while K-Nearest Neighbors (KNN) had the lowest. Performance trended lower in the training set as the number of features increased, and exhibited less variance in both training and validation across all feature subsets as the number of base learners increased.
Conclusion
This study offers a methodology to robustly evaluate ensemble ML performance when analyzing clinical data.
Keywords: Machine learning, Stacked ensemble, Stacked generalization, Meta learners, Clinical data analysis, COVID-19
1. Introduction
Due to the novelty of COVID-19, as well as the varied presentation of symptoms in infected patients, an emphasis has been placed on understanding the outcome differences between patients. Various types of machine learning (ML) models have been used to analyze COVID-19 data [1], [2], [3], [4], [5], [6], [7], as well as different combinations of models [5], [6], [8], referred to as an ensemble. These models have demonstrated the capability to discern between COVID-19 infection versus other illnesses or being healthy, as well as to predict patient outcomes post-infection. In particular, ensemble learning can improve the overall predictive performance by combining the predictive capabilities of two or more base “learner” models to create a model that can reduce the bias and variance of the base learners.
This study employs stacked generalization (i.e., stacking), a type of heterogeneous ensemble that applies a set of learning algorithms to the same dataset [9], [10], [11], in which the base learners and meta learners are chosen to represent common ML algorithm families. Stacked ensembles have generally proven to be more accurate prediction models than any one base learner alone in clinical contexts [12], [13], [14]. In particular, a large number of studies have used stacked ensembles to study COVID-19 data, with many of them focusing on mortality (e.g., [15], [16], [17], [18], [19], [20], [21]) and a few assessing cardiac events [22], [23]. In spite of this progress, it remains unclear how to define the best model combinations for strong performance when using stacked generalization. With the goal of improving the predictive accuracy of ensembles of ML models applied to clinical datasets, this study provides a methodology to accurately select and group base learners and meta learners for clinical data analysis, e.g., as in pre-risk assessment. COVID-19 patient data is used as a representative clinical dataset to create ML ensembles using stacked generalization and to predict outcomes related to severe disease (mortality and cardiac events).
Although there is no quantifiable definition of a “family” in the context of ML algorithms, one categorization is according to their common characteristics. Several such families have been identified by Das et al. [24], who compared three families of classification algorithms (Naïve Bayes, Support Vector Machines, and Decision Trees) in terms of training time, prediction time, and accuracy. In this study, eight algorithm families were considered based on their popularity and ease of implementation: regression, Bayesian, Decision Trees, regularization, dimensionality reduction, and Nearest-Neighbor, as well as neural networks and Support Vector Machines [24], [25], as described in the Supplementary Material. A representative model from each family was selected to be utilized as a base or meta learner. Different combinations of base and meta learners were investigated to determine overall classification performance, varying the number of features used for training and testing, as well as the number of base learners in the ensemble. The combinations were evaluated on their performance for different outcomes in the clinical data, using area under the receiver operating characteristic curve (AUROC), F1, balanced accuracy, and kappa.
2. Methods
All data pre-processing and subsequent analyses were performed in R version 4.1.1 with use of the caret and caretEnsemble packages along with packages for each imported model (Table 1 ).
Table 1.
Algorithm families, representative models, tuning hyper-parameters and corresponding R functions chosen for the ensemble analysis in this study.
| Algorithm Family | Representative Model | Abbreviation | Hyper-parameters | R library::R function | Version |
|---|---|---|---|---|---|
| Regression | Generalized Linear Model | GLM | None | stats::glm() | 3.6.2 |
| Bayesian | Naive Bayes | NB | laplace; usekernel; adjust | naivebayes::naive_bayes() | 0.1.2 |
| Decision tree | Random Forest | RF | mtry | randomForest::randomForest() | 4.6–14 |
| Regularization | Shrinkage Discriminant Analysis | SDA | diagonal; lambda | sda::sda() | 0.3.0 |
| Dimensionality reduction | Partial Least Squares | PLS | ncomp | pls::plsr() | 0.12.0 |
| Nearest-neighbor | K-Nearest Neighbor | KNN | k | class::knn() | 7.3–20 |
| Support vector machines | Support Vector Machine with class weights | SVM | sigma; c; weight | kernlab::svmRadialWeights() | 0.9–29 |
| Neural network | Multi-Layer Perceptron | MLP | size; decay | RSNNS::mlpWeightDecay() | 0.4–14 |
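For orientation, the sketch below shows how one representative model from Table 1 might be fitted through caret's common interface. This is a hedged illustration rather than the authors' code: the data frame `train_df` and its factor outcome column `mortality` are placeholder names, and the tuning settings mirror those described in Section 2.4.

```r
# Hedged sketch (not the authors' code): fitting one representative model from
# Table 1 through caret. `train_df` and the factor outcome `mortality` are
# placeholder names; tuning settings follow Section 2.4.
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

set.seed(1)
rf_fit <- train(mortality ~ ., data = train_df,
                method     = "rf",      # Random Forest; caret tunes mtry
                metric     = "Kappa",   # optimization metric used in this study
                tuneLength = 10,
                trControl  = ctrl)

rf_fit$bestTune                         # selected mtry value
```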
2.1. Representative clinical dataset characteristics
De-identified clinical data were obtained from the University of Louisville Hospital. A retrospective chart review was performed from March 2020 to November 2021 (n = 1099), where patients were monitored and treated for COVID-19 symptoms. No patient consent was required. The chart review was performed by trained resident clinicians with an M.D. degree, and the data were entered into a database with pre-defined data fields. The data entry was supervised by the clinician in charge of the clinical information (Dr. Suliman) under IRB 20.0404. Data from patients who tested positive for COVID-19 were included, with no other inclusion or exclusion criteria. These observational data were used to illustrate the methodology of the study. In addition to basic demographic information, the dataset also included clinically significant outcomes such as cardiac complications (“cardiac”) and overall mortality (“mortality”), which were investigated in this study by constructing binary classification models. The severe cardiac event label was assigned to all patients who experienced at least one of the following during the hospital stay: cardiac arrest, myocardial infarction, right ventricular failure, or new onset arrhythmia.
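A minimal sketch of how such a composite label might be derived is shown below; the condition column names are hypothetical placeholders for the chart-review fields, not the actual database schema.

```r
# Illustrative sketch only: deriving the binary severe cardiac event label.
# cardiac_arrest, mi, rv_failure, and new_arrhythmia are hypothetical logical
# columns standing in for the chart-review fields named above.
df$cardiac_event <- factor(
  ifelse(df$cardiac_arrest | df$mi | df$rv_failure | df$new_arrhythmia,
         "Yes", "No"),
  levels = c("No", "Yes"))
```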
2.2. Data pre-processing and train/validation split
The dataset underwent a data wrangling pipeline (Fig. 1 ). The data were first separated into numeric and categorical features. Numeric clinical features with >30% of their values missing were removed, and the corresponding rows were also removed from the categorical dataset. A threshold of 30% was chosen to allow for missing percentages exceeding 30% after the train/validation split. Of note, 40% missing has been identified as a threshold for imputation, where imputation of features with >40% missing should be considered only as hypothesis generating [26]. After selecting outcomes of interest, all patients who had missing data for one or both outcomes were removed. Any remaining categorical clinical features with missing data were removed as well. Numeric variables were checked for consistency by removing or replacing erroneous values. Potential discrepancies in the data were flagged by checking numeric values with z-scores >4 in either direction. If these extreme values were found to be correct upon review by the clinical team, they were left in the dataset; otherwise, the numeric values were corrected. Any unresolved values were coerced to NA (not available) to indicate a missing value. Categorical variables were factored and checked for consistency among the factor levels prior to analysis.
Fig. 1.
Diagram of the data pre-processing workflow. First, the data are separated into categorical and numeric features then split into training and validation data. Numeric data are imputed in the training and validation sets independently after a filtering step to remove features with > 30% missing values. Categorical data are filtered by removing features which contain any missing values. Columns between training and validation data are intersected to ensure common features between both splits. Finally, the categorical and numeric features are combined to represent the working datasets.
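A minimal sketch of the numeric filtering and consistency checks in this pipeline is given below; object names are assumptions and this is not the authors' code.

```r
# num_df: assumed data frame of numeric clinical features
miss_frac <- colMeans(is.na(num_df))
num_df    <- num_df[, miss_frac <= 0.30, drop = FALSE]  # drop features with >30% missing

# Flag potentially erroneous values for clinical review: |z-score| > 4
z_flags <- lapply(num_df, function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  which(abs(z) > 4)
})
# Values judged erroneous are corrected; unresolved values are coerced to NA
```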
After removing numeric features with >30% missing values, the dataset was split 70/30 into training and validation subsets, respectively. It was ensured that several key clinical features, including age, weight, BMI, race, sex, comorbidities, and the two outcomes of interest (mortality and severe cardiac events), were balanced between the training and validation sets (Table 2 ). Data were split randomly and resampled until each clinical feature of interest was adequately balanced between the two splits. A t-test was employed for numeric features and a chi-square test for categorical features, with p > 0.1 satisfying the balance condition (Table 2). The imputation methods k-nearest neighbors (KNN), Bayesian Principal Component Analysis (BPCA), Probabilistic PCA (PPCA), and MissForest [27], [28] were tested for relative accuracy after creating a complete subset of the original training data and introducing random missing values. After 100 iterations, MissForest resulted in the lowest average RMSE relative to the ground truth data and was therefore used to impute the training and validation datasets separately. Finally, the numeric and categorical datasets were combined to create the working datasets with total n = 247 (n TRAINING = 166; n VALIDATION = 81). The proportions of missing values in the pre-imputation training and validation data are shown in Supplementary Figs. 1 and 2, respectively.
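The split-and-impute step can be sketched as follows, assuming data frames num_df and cat_df with one row per patient; only age and sex are checked here for brevity, whereas the study checked all key features listed in Table 2.

```r
library(missForest)

set.seed(7)
repeat {
  idx       <- sample(seq_len(nrow(num_df)), size = round(0.7 * nrow(num_df)))
  split_lab <- ifelse(seq_len(nrow(num_df)) %in% idx, "train", "valid")

  p_age <- t.test(num_df$age[idx], num_df$age[-idx])$p.value   # numeric feature
  p_sex <- chisq.test(table(split_lab, cat_df$sex))$p.value    # categorical feature

  if (p_age > 0.1 && p_sex > 0.1) break   # every checked feature must give p > 0.1
}

# Impute training and validation numeric data independently with missForest
train_num <- missForest(num_df[idx, ])$ximp
valid_num <- missForest(num_df[-idx, ])$ximp
```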
Table 2.
Patient data clinical characteristics. Data were split randomly and resampled until each clinical feature of interest was adequately balanced between the two splits. A t-test was employed for numeric features and a chi-square test for categorical features, with p > 0.1 satisfying the balance condition. Socioeconomic status was not collected for these patients.
|  | Training data (N = 166) | Validation data (N = 81) | P-value |
|---|---|---|---|
| Age |  |  | 0.878 |
| Mean (SD; Range) | 60.9 (17.0; 21–95) | 60.5 (15.7; 22–89) |  |
| Sex |  |  | 0.840 |
| Female | 70 (42%) | 36 (44%) | |
| Male | 96 (58%) | 45 (56%) | |
| Race | |||
| Asian | 5 (3%) | 1 (1%) | 0.237 |
| Black | 52 (31%) | 26 (32%) | 0.845 |
| Other | 2 (1%) | 0 (0%) | 0.299 |
| Unknown | 3 (2%) | 2 (2%) | 0.450 |
| White | 104 (63%) | 52 (64%) | 0.831 |
| Comorbidities | |||
| Diabetes Mellitus (DM) | 67 (40%) | 32 (40%) | 1.000 |
| Chronic Obstructive Pulmonary Disease (COPD) | 30 (18%) | 17 (21%) | 0.672 |
| Hypertension | 95 (57%) | 48 (59%) | 0.868 |
| Congestive Heart Failure (CHF) | 24 (14%) | 10 (12%) | 0.798 |
| Coronary Artery Disease (CAD) | 22 (13%) | 7 (9%) | 0.397 |
| Chronic Kidney Disease (CKD) | 14 (8%) | 12 (15%) | 0.189 |
| Body-Mass Index |  |  | 0.862 |
| Mean (Range) | 31.0 (14.5–66.0) | 31.2 (16.5–66.7) |  |
| Mortality |  |  | 0.896 |
| Expired | 50 (30%) | 23 (28%) | |
| Recovered | 116 (70%) | 58 (72%) | |
| Severe Cardiac Event |  |  | 1.000 |
| No | 102 (61%) | 50 (62%) | |
| Yes | 64 (39%) | 31 (38%) |
2.3. Feature selection
To determine feature importance and the optimal number of features to utilize, each model was initially trained using the pre-processed training set, i.e., after filtering numeric features based on the missing-value percentage threshold, removal of categorical features with any missing values, and imputation. Hyperparameters and their respective ranges for each model are reported in Supplementary Table 1. Feature selection was performed using the varImp function from the caret package to calculate the importance of each feature with respect to the outcome of interest for each model. Generic ROC curve analysis was then conducted on each feature using a “filter” approach to ensure consistent feature subsets amongst all models for each outcome and to determine the optimal number of features. The feature subset for each outcome, ordered according to variable importance, is given in Supplementary Table 2. Three feature subsets were chosen to evaluate performance across all models, each subset having 20, 30, or 40 features from the overall dataset to train and evaluate ensemble performance. Count frequencies of categorical features are reported in Supplementary Table 3. Violin plots of numeric features are in Supplementary Fig. 3 and Supplementary Fig. 4.
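A sketch of the ROC-based “filter” ranking is shown below; caret's filterVarImp computes a per-feature AUROC against the outcome. Object names and the positive-class label “Yes” are assumptions, and only numeric predictors are shown for simplicity.

```r
# Hedged sketch of the ROC-based filter ranking (not the authors' code).
library(caret)

roc_imp <- filterVarImp(x = train_data[, numeric_predictors],
                        y = train_data$mortality)
ranked  <- rownames(roc_imp)[order(roc_imp[, "Yes"], decreasing = TRUE)]

# Three feature subsets used to train and evaluate the ensembles
subsets <- list(top20 = ranked[1:20], top30 = ranked[1:30], top40 = ranked[1:40])
```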
2.4. Ensemble creation and parameter tuning
The base learners were trained using the caretList function from package caretEnsemble by iterating through all possible combinations of base learners, using each feature subset for each outcome. To train the meta learners, each base learner was iterated as a meta learner through all combinations of base learners (minimum of 2 and maximum of 8) using the caretStack function from package caretEnsemble. Base and meta learners were each trained using repeated k-fold cross validation to find the optimal tuning parameters, with k = 5, 10 resampling repeats, and a tune length of 10. Kappa, used as the optimization metric during parameter tuning, is a measure of agreement between two raters:

κ = (p_o − p_e) / (1 − p_e)   (1)

where p_o is the relative observed agreement among raters (the total agreement probability) and p_e is the hypothetical probability of chance agreement, estimated from the data as the probability of each rater randomly guessing each category [29].
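As a brief worked example of Eq. (1), the following snippet computes kappa for a hypothetical 2 × 2 confusion matrix; the counts are illustrative only.

```r
# Worked example of Eq. (1) on a hypothetical 2 x 2 confusion matrix
# (rows = predicted class, columns = observed class); counts are illustrative.
cm <- matrix(c(40, 10,
                8, 42), nrow = 2, byrow = TRUE)
n   <- sum(cm)
p_o <- sum(diag(cm)) / n                     # observed agreement = 0.82
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # chance agreement   = 0.50
kappa_val <- (p_o - p_e) / (1 - p_e)         # kappa = 0.64
```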
The held-out validation data were used to test each ensemble's performance once, using the predict function with the tuned parameter values for each outcome and feature subset. Performance metrics including AUROC, F1, kappa, and balanced accuracy were calculated for both the training and validation datasets to obtain a comprehensive evaluation of ensemble performance.
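A hedged sketch of the ensemble construction with caretEnsemble is shown below; the study loops over all base-learner combinations, meta learners, outcomes, and feature subsets, whereas a single combination is shown here. The data frames train_top20 and valid_top20 (one feature subset plus the outcome factor) are assumed names.

```r
library(caret)
library(caretEnsemble)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10,
                     savePredictions = "final", classProbs = TRUE)

base_learners <- caretList(
  cardiac_event ~ ., data = train_top20,
  trControl  = ctrl,
  metric     = "Kappa",
  tuneLength = 10,
  methodList = c("glm", "rf", "naive_bayes", "pls")  # any 2 to 8 of the 8 families
)

# Meta learner stacked on the base learners' cross-validated predictions
stack <- caretStack(base_learners, method = "mlpWeightDecay",
                    metric = "Kappa", tuneLength = 10,
                    trControl = trainControl(method = "repeatedcv",
                                             number = 5, repeats = 10))

# Single pass over the held-out validation split
pred <- predict(stack, newdata = valid_top20, type = "prob")
```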
2.5. Decision trees
The FFTrees function from package FFTrees was used to generate fast-and-frugal decision trees for predicting mortality or severe cardiac events.
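A hedged sketch of such a fit is shown below, mirroring the package's standard interface; the data frame names and the outcome coding (logical TRUE for Expired) are assumptions.

```r
library(FFTrees)

# train_fft / valid_fft: assumed copies of the splits with the mortality
# outcome coded as logical (TRUE = Expired, FALSE = Recovered)
fft_mort <- FFTrees(
  formula         = mortality ~ .,
  data            = train_fft,
  data.test       = valid_fft,
  decision.labels = c("Recovered", "Expired")
)

plot(fft_mort, data = "test")  # tree diagram with validation-set performance
```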
2.6. Classification performance metrics and statistical analysis
AUROC, which averages sensitivity and specificity across all possible classification probability thresholds, was used as the primary performance metric for all statistical analyses due to its ubiquity in measuring model performance on well-balanced datasets such as the one in this study. The other performance metrics included F1, which measures model performance based on precision and recall and helps ensure a comprehensive evaluation; balanced accuracy, the average of sensitivity and specificity; and kappa, a measure of agreement accounting for random chance. The metric values for each outcome were compared across two different groupings (depending on the meta learner or the overall ensemble size) to assess the influence of including each of the eight models as a base or meta learner, as well as the number of learners, on the overall ensemble performance. Average, median, and range metric values for each grouping were calculated to directly compare ensemble performance. The Pearson correlation coefficient (r) was calculated between ensemble training and validation performance metric values for each grouping to evaluate the ensembles for overfitting to the training data.
Scoring in the top 20% for a particular statistic metric was chosen to represent “adequate” performance while scoring in the bottom 20% was considered as “poor” performance, relative to the range of values within a particular outcome.
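A hedged sketch of the metric calculations and the relative 20% performance bands follows; object names such as prob_pos, valid_outcome, and all_auroc are assumptions rather than the authors' code.

```r
# pROC supplies AUROC; caret::confusionMatrix yields F1, balanced accuracy,
# and kappa at the default 0.5 probability threshold.
library(pROC)
library(caret)

# prob_pos: numeric vector of predicted positive-class probabilities for the
# validation set; valid_outcome: factor of observed labels ("No"/"Yes")
pred_cls <- factor(ifelse(prob_pos > 0.5, "Yes", "No"), levels = c("No", "Yes"))

auroc   <- as.numeric(auc(roc(valid_outcome, prob_pos)))
cm      <- confusionMatrix(pred_cls, valid_outcome, positive = "Yes")
f1      <- cm$byClass["F1"]
bal_acc <- cm$byClass["Balanced Accuracy"]
kap     <- cm$overall["Kappa"]

# Overfitting check: Pearson r between matched training and validation metric
# values across ensembles
# r <- cor(train_auroc_vec, valid_auroc_vec)

# Relative bands across all ensembles for one outcome: top 20% "adequate",
# bottom 20% "poor"
cuts <- quantile(all_auroc, probs = c(0.2, 0.8))
band <- cut(all_auroc, breaks = c(-Inf, cuts, Inf),
            labels = c("poor", "middle", "adequate"))
```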
3. Results
3.1. Ensemble performance across meta learners and base learners
An outline of the ensemble workflow is shown in Fig. 2 . Using AUROC as the performance metric, GLM, PLS, and SVM meta learners overall performed adequately for predicting mortality with the training data, while GLM and SVM performed adequately for severe cardiac events. RF performed poorly for mortality while KNN performed poorly for both outcomes (Table 3 ). In contrast, GLM, MLP, and PLS performed adequately for predicting both mortality and severe cardiac events in the validation dataset, with poor performance by KNN, and by RF for mortality and SVM for cardiac event (Table 3). In terms of the F1 metric, KNN, RF, and SVM were adequate in predicting both mortality and severe cardiac events with the training data, with all the models noticeably under-performing for mortality compared to cardiac event (Supplementary Table 4), and GLM, NB, and PLS having the lowest values. With the validation dataset, all meta learners had a narrow range of values, with the models also underperforming for mortality compared to cardiac event. In contrast, the metrics balanced accuracy and kappa showed more consistent model performance between the two outcomes (Supplementary Tables 5 and 6).
Fig. 2.
Diagram of the ensemble model workflow. For each outcome of interest, stacked ensemble models are created by looping through every possible combination of base learners and meta learners, where each model is trained on different feature subsets. Models are trained by hyper-parameter tuning and repeated k-fold cross validation, then the validation set is used to make final predictions.
Table 3.
Average (x), median (x̃), and range (R) AUROC values for ensembles trained using particular meta learners across feature subsets and outcomes of interest.

| Meta Learner | # Features | Mortality, Training: x | Mortality, Training: x̃ (R) | Mortality, Validation: x | Mortality, Validation: x̃ (R) | Cardiac Event, Training: x | Cardiac Event, Training: x̃ (R) | Cardiac Event, Validation: x | Cardiac Event, Validation: x̃ (R) |
|---|---|---|---|---|---|---|---|---|---|
| GLM | 20 | 0.865 | 0.866 (0.196) | 0.870 | 0.874 (0.150) | 0.834 | 0.834 (0.147) | 0.740 | 0.743 (0.243) |
| 30 | 0.862 | 0.866 (0.196) | 0.872 | 0.876 (0.228) | 0.813 | 0.819 (0.114) | 0.702 | 0.701 (0.209) | |
| 40 | 0.853 | 0.862 (0.22) | 0.864 | 0.867 (0.225) | 0.816 | 0.823 (0.156) | 0.702 | 0.712 (0.185) | |
| KNN | 20 | 0.352 | 0.311 (0.540) | 0.352 | 0.305 (0.800) | 0.579 | 0.575 (0.47) | 0.484 | 0.477 (0.439) |
| 30 | 0.393 | 0.354 (0.522) | 0.386 | 0.336 (0.798) | 0.666 | 0.673 (0.419) | 0.502 | 0.503 (0.468) | |
| 40 | 0.443 | 0.430 (0.528) | 0.409 | 0.396 (0.683) | 0.673 | 0.687 (0.421) | 0.514 | 0.521 (0.429) | |
| MLP | 20 | 0.864 | 0.865 (0.205) | 0.854 | 0.857 (0.157) | 0.833 | 0.833 (0.147) | 0.727 | 0.728 (0.227) |
| 30 | 0.855 | 0.857 (0.204) | 0.869 | 0.873 (0.237) | 0.825 | 0.828 (0.154) | 0.677 | 0.683 (0.214) | |
| 40 | 0.850 | 0.857 (0.221) | 0.863 | 0.867 (0.236) | 0.824 | 0.830 (0.175) | 0.687 | 0.690 (0.182) | |
| NB | 20 | 0.855 | 0.856 (0.193) | 0.854 | 0.853 (0.149) | 0.818 | 0.820 (0.118) | 0.699 | 0.699 (0.265) |
| 30 | 0.851 | 0.853 (0.194) | 0.853 | 0.854 (0.211) | 0.810 | 0.814 (0.121) | 0.689 | 0.691 (0.210) | |
| 40 | 0.841 | 0.845 (0.211) | 0.861 | 0.862 (0.243) | 0.810 | 0.816 (0.175) | 0.684 | 0.690 (0.197) | |
| PLS | 20 | 0.865 | 0.865 (0.193) | 0.864 | 0.867 (0.15) | 0.831 | 0.832 (0.143) | 0.741 | 0.744 (0.242) |
| 30 | 0.859 | 0.864 (0.196) | 0.867 | 0.872 (0.228) | 0.812 | 0.818 (0.112) | 0.700 | 0.698 (0.210) | |
| 40 | 0.853 | 0.862 (0.216) | 0.863 | 0.865 (0.231) | 0.815 | 0.823 (0.153) | 0.700 | 0.711 (0.193) | |
| RF | 20 | 0.668 | 0.679 (0.453) | 0.618 | 0.616 (0.558) | 0.828 | 0.848 (0.411) | 0.653 | 0.659 (0.368) |
| 30 | 0.706 | 0.722 (0.423) | 0.675 | 0.683 (0.513) | 0.829 | 0.849 (0.442) | 0.628 | 0.634 (0.340) | |
| 40 | 0.679 | 0.699 (0.706) | 0.619 | 0.622 (0.765) | 0.829 | 0.850 (0.473) | 0.627 | 0.643 (0.378) | |
| SDA | 20 | 0.853 | 0.854 (0.195) | 0.853 | 0.853 (0.146) | 0.829 | 0.83 (0.147) | 0.730 | 0.735 (0.243) |
| 30 | 0.853 | 0.856 (0.195) | 0.860 | 0.863 (0.223) | 0.812 | 0.818 (0.113) | 0.696 | 0.697 (0.212) | |
| 40 | 0.850 | 0.854 (0.22) | 0.863 | 0.867 (0.236) | 0.815 | 0.822 (0.156) | 0.699 | 0.707 (0.186) | |
| SVM | 20 | 0.879 | 0.882 (0.302) | 0.816 | 0.818 (0.216) | 0.860 | 0.86 (0.239) | 0.642 | 0.647 (0.307) |
| 30 | 0.871 | 0.870 (0.302) | 0.799 | 0.800 (0.323) | 0.857 | 0.863 (0.262) | 0.608 | 0.614 (0.273) | |
| 40 | 0.845 | 0.849 (0.282) | 0.793 | 0.799 (0.262) | 0.838 | 0.841 (0.26) | 0.616 | 0.614 (0.226) | |
Although SVM as a meta learner performed adequately (AUROC metric) in the training dataset for predicting both mortality and cardiac events, it had a tendency to overfit, as shown by the low correlations between training and validation data AUROC results (Supplementary Table 7A). SVM also had weak correlations between the training and validation results for the F1, balanced accuracy, and kappa metrics for both mortality and severe cardiac events (Supplementary Table 8A–10A), again indicating overfitting.
3.2. Ensemble performance with varying number of base learners
Average, median, interquartile range, and outliers for AUROC values of ensemble models including various numbers of base learners for the training and validation data are shown in Fig. 3, Fig. 4, Fig. 5, Fig. 6 and Supplementary Figs. 5–8. For prediction of either outcome across both training and validation datasets, GLM, MLP, NB, PLS, and SDA remained largely consistent across the number of base learners included in the models, for all feature subsets. Performance of RF and SVM meta learners improved for mortality in the training dataset with increasing number of base learners, while that of KNN declined. For severe cardiac events, KNN had poor performance while RF had high variance. As the number of base learners increased with the validation dataset, KNN performance also declined, SVM remained consistent, and RF maintained high variance. While performances were similar for each meta learner for either clinical outcome, the range of values was higher for severe cardiac events compared to mortality. These trends are further highlighted when the data are shown as a heatmap (Fig. 7 ). The overall trends were consistent with the other metrics, except that variances were small for both outcomes with all models (data not shown).
Fig. 3.
Performance of GLM meta learner as a function of number of base learners included in the ensemble. “x” represents mean and horizontal line represents median of each distribution. Box boundaries represent interquartile range. Ends of whiskers represent maximum and minimum values, with points outside the whiskers being outliers.
Fig. 4.
Performance of MLP meta learner as a function of number of base learners included in the ensemble. “x” represents mean and horizontal line represents median of each distribution. Box boundaries represent interquartile range. Ends of whiskers represent maximum and minimum values, with points outside the whiskers being outliers.
Fig. 5.
Performance of PLS meta learner as a function of number of base learners included in the ensemble. “x” represents mean and horizontal line represents median of each distribution. Box boundaries represent interquartile range. Ends of whiskers represent maximum and minimum values, with points outside the whiskers being outliers.
Fig. 6.
Performance of KNN meta learner as a function of number of base learners included in the ensemble. “x” represents mean and horizontal line represents median of each distribution. Box boundaries represent interquartile range. Ends of whiskers represent maximum and minimum values, with points outside the whiskers being outliers.
Fig. 7.
Heatmap visualization for comparison of AUROC values using 8 different meta learners, 2–8 base learners, and the two outcomes of interest for each feature subset.
3.3. Ensemble performance across feature subsets
Ensemble performance for both outcomes and both datasets (training or validation) across subsets of 20, 30, or 40 features on average trended lower as the number of features increased, depending on the particular meta learner (Table 3, Supplementary Tables 4–6). These patterns are shown graphically for AUROC as the number of base learners was varied in Fig. 3, Fig. 4, Fig. 5, Fig. 6, and in Supplementary Figs. 5–8.
3.4. Ensemble performance compared to base learners alone
The AUROC performance of the base learners alone (Supplementary Tables 11 and 12) was compared to that of the meta learners (Table 3). In terms of AUROC, most base learners performed worse than the meta learners with the training dataset, with PLS, RF, and SVM performing adequately for mortality, and PLS, SDA and SVM for cardiac event. In the validation set, RF performed adequately for cardiac event while SDA and SVM were adequate with both outcomes, although every individual base learner struggled to predict cardiac events in comparison to mortality.
3.5. Fast and frugal decision trees
To assess the performance of a model with potentially greater explainability, fast and frugal decision trees were trained to perform predictions for mortality and severe cardiac event outcomes (Supplementary Figures 9 and 10, respectively). An explanation of each decision tree is as follows.
For the mortality outcome (training data): Decide Alive if FiO2 (day 10 after admission) is ≤ 42.71 or, if not, if FiO2 (day 7 after admission) is ≤ 40; otherwise, decide Expired.
For the mortality outcome (validation data): If FiO2 (day 10 after admission) is ≤ 42.71, FiO2 (day 7 after admission) is ≤ 40, FiO2 (day 4 after admission) is ≤ 44, and ICU sedation with Propofol did not occur, decide Alive; if at any point in this sequence a condition is not met, decide Expired.
For the severe cardiac outcome (training data): Decide No event if FiO2 (day 10 after admission) is ≤ 42.71 or, if not, if FiO2 (day 2 after admission) is ≤ 44 and ICU sedation with Propofol did not occur. Decide Yes if FiO2 (day 10 after admission) is > 42.71 and either FiO2 (day 2 after admission) is > 44 or, if ≤ 44, ICU sedation with Propofol occurred.
For the severe cardiac outcome (validation data): Decide No event if FiO2 (day 10 after admission) is ≤ 42.71, FiO2 (day 2 after admission) is ≤ 44, ICU sedation with Propofol did not occur, and FiO2 (day 4 after admission) is ≤ 44; if at any point in this sequence a condition is not met, decide Yes.
4. Discussion
This study establishes a rigorous methodology to accurately evaluate ensemble ML predictive performance when analyzing clinical data. Using COVID-19 patient data as a representative clinical dataset, several trends in the performance of stacked ensembles emerge. Ensemble average performance for both outcomes trended lower as the number of features increased in the training set (Table 3, Supplementary Tables 4–6). The results support the notion that stacked ensembles are effective at filtering noise and can produce good results with a limited number of features when making predictions on clinical data. Moreover, increasing the number of base learners decreased the variance in performance, including for meta learners performing poorly (Fig. 3, Fig. 4, Fig. 5, Fig. 6 and Supplementary Figs. 5–8). Larger ensembles had the tendency to improve mortality predictions with RF (Supplementary Fig. 6) and SVM meta learners on the training data (Supplementary Fig. 8). Stacked ensembles are expected to perform best when base learner predictions are uncorrelated and the number of base learners increases. However, given the limited number of ensembles that could be created using the models analyzed in this study, further analysis is warranted to establish under what conditions larger ensembles would be expected to outperform smaller ones.
Furthermore, the largest ensembles had the highest correlations between training and validation AUROC data results for both outcomes (Supplementary Table 7B), suggesting less overfitting than when a smaller number was included. This trend did not hold for the other metrics. F1 results (Supplementary Table 8B) indicated overfitting when the number of base learners was low (3–4 for mortality, 3–5 for cardiac event) or the highest (8). Both balanced accuracy (Supplementary Table 9B) and kappa (Supplementary Table 10B) also indicated overfitting for cardiac event with higher number of base learners (7–8). Additionally, balanced accuracy indicated overfitting for cardiac event with 3 base learners, while kappa indicated overfitting for mortality with 4–5 base learners or with the highest number (8). Overall, these results suggest that although AUROC indicated no overfitting with either low or high numbers of base learners, a more optimal number in between the extremes (e.g., 6 base learners) may be appropriate by considering the other metrics and the particular choice of meta learner.
Overall, three of the adequately performing meta learners for both outcomes were GLM, MLP, and PLS when using the optimal feature subset for each outcome (AUROC metric; Table 3). Therefore, a case could be made that models such as GLM, MLP, and PLS could serve as adequate meta learners when analyzing clinical data. The under-performance of KNN as a meta learner is surprising given its flexibility and stability, but it may explain the mixed performance of KNN in analyzing COVID-19 patients in the current literature [30], [31], [32], [33]. Interestingly, while SVM as a meta learner overfitted for cardiac event with the AUROC and F1 metrics (Supplementary Table 7A), NB overfitted for cardiac event with the F1 metric (Supplementary Table 8A) and KNN overfitted for mortality with kappa (Supplementary Table 10A). The differences observed between the classification performance metrics AUROC, F1, balanced accuracy, and kappa are consistent with the nature of these metrics. F1 and balanced accuracy are usually reported at a single class probability threshold (e.g., 0.5). Balanced accuracy is simply the average of sensitivity and specificity. F1 may be more sensitive to class imbalances. Kappa is a measure of agreement and accounts for random chance. In contrast, AUROC essentially averages model sensitivity and specificity at all probability thresholds and can reveal different trends. Altogether, these individual metrics offer complementary insights into the performance of the classification models, supporting their collective use to gain insight into ensemble performance.
As expected, the ensemble models generally outperformed the individual base learner models (Supplementary Tables 11 and 12). For the mortality outcome, external validation metrics were higher than training data metrics for the NB, PLS, RF, SDA, and SVM models. While validation set performance is generally lower than training set performance in classification models, this discrepancy can be explained by the homogeneous validation set used in this study, chosen to assess the reproducibility of the models rather than representing a heterogeneous test set from another institution or time period. Therefore, it is conceivable that some models will by chance perform slightly better on the validation data.
Given the trends found in this study, the selection of the meta learner can be argued to be the most important choice. The exclusion of KNN and RF as meta learners may be considered no matter the feature subset size, since they may produce highly varied ensemble performance with different combinations of base learners (Fig. 6, Supplementary Fig. 6). These models also had high AUROC variance, which may become more pronounced as the ensemble size increases, suggesting that RF and KNN may be suitable when creating small ensembles with a limited number of base learners but should potentially be avoided when consistent results are required. GLM, MLP, and PLS were adequate meta learners in terms of AUROC for both outcomes, without compromising the overall ensemble accuracy. It is more difficult to draw conclusions based on the other metrics, as models were considerably more stable across meta learners and varying numbers of base learners. However, since the positive and negative outcomes in the dataset were well balanced, AUROC is considered to be the more relevant performance metric for this study.
Although fast and frugal decision trees provide a more intuitive process for decision making, they underperformed relative to most of the ensemble models (Supplementary Figs. 9 and 10; for mortality: BalAccTRAIN = 75, BalAccVALIDATION = 72; for cardiac event: BalAccTRAIN = 74, BalAccVALIDATION = 62). The trade-off between model explainability and performance is apparent. Although fast and frugal decision trees can provide a quick overview of how important features can distinguish patients, they lack class probabilities, parameter tuning, and a cross-validation procedure. Unlike a stacked generalized ensemble, they are also unable to take advantage of various combinations of algorithms that can recognize linear or non-linear patterns hidden within the dataset.
Applying ML to analyze data from COVID-19 patients is by no means unique to this study, as there have been numerous publications on this topic since the pandemic began, e.g., as comprehensively reviewed by Alballa & Al-Turaiki for papers published between January 2020 and January 2022 regarding diagnosis, mortality, and severity risk prediction [34]. For example, Cornelius et al. proposed a clustered RF approach to predict COVID-19 patient mortality, and applied follow-up analysis with k-means clustering and neural network modeling to obtain insight into the associated magnitude and type of mortality risks [35]. Interestingly, Alballa & Al-Turaiki reported that most ML algorithms applied to these data were supervised learning algorithms. Although prognostic and diagnostic features from these models were consistent with medical evidence, most models were being used in experimental research rather than clinical application. Importantly, it was found that many of the existing applications used imbalanced datasets prone to selection bias [34]. In comparison, this study implements a robust methodology to identify a number of ensemble models that accurately predict COVID-19 patient mortality from in-hospital data.
To date, although several studies have included cardiac complications as features for predicting severe COVID-19 outcomes, few studies have investigated the prediction of cardiovascular complications arising during hospitalization. Thromboembolic events were predicted across several patient cohorts with adequate to poor accuracy, with AUROCTEST as high as 0.963 or as low as 0.511, depending on the cohort analyzed, with some cohorts having high class imbalances [36]. The performances of GLM (AUROCTEST 0.740), MLP (AUROCTEST 0.727), PLS (AUROCTEST 0.741), and SDA (AUROCTEST 0.730) in this study fall within this range (Table 3). Furthermore, the results showcase a set of models with adequate accuracy for predicting severe cardiac event outcomes while offering a detailed methodology that supports the consistency (robustness) of this accuracy.
The results in this study are consistent with previous work showing that blood chemistry and complete blood count (CBC) data offer the potential to predict COVID-19 patient outcomes. For example, Famiglini et al. employed representative models from four different families of ML algorithms (multi-layer perceptron, XGBoost, SVM, and decision trees) to implement a rigorous development procedure designed to minimize risks of bias and overfitting focused on CBC tests to predict patient ICU admission within 5 days [37]. The potential bias induced by the imputation procedure was first assessed for 50 different random imputations by comparing the differences in feature distributions before and after imputation. Since CBC data can be routinely obtained, the methodology could be applicable in resource-constrained settings and provide rapid and reliable indications. The high degree of intersection of measurements in our study with those in large cohort studies further demonstrates the generalizability of the results in the context of in-patient clinical data. For instance, Allyn et al. (n = 139,367) recently studied the frequencies of 10 blood chemistry and CBC measurements and other basic patient characteristics and demographics to assess the relationship between the use of routine laboratory tests and evolution of common laboratory parameters, many of which are included in this study’s dataset, such as potassium, bicarbonate (HCO3), creatinine, WBC count, age, gender, and ICU length of stay (days) [38]. Blanco et al. (n = 21,003) and Froom et al. (n = 10,308) each created a logistic regression model using clinical laboratory values. Some of these variables overlap, and 10 of the variables from this study’s dataset are accounted for [39], [40], including glucose, platelets, potassium, albumin, and blood urea nitrogen (BUN). Furthermore, all of the laboratory measurements in this dataset, excluding those for sedation and therapeutics, are included in the MIMIC-III dataset, which is a large (38,597 adult ICU patients) dataset used to teach medical analytics at multiple institutions [41].
The systematic analysis in this study shows that different algorithm families have distinct performances relative to each other, especially when used as meta-learners across a variety of ensemble models. However, there may be particular methods within each family that produce different results. The complexity and computational cost of evaluating every method within each family precludes an exhaustive analysis. To help overcome this limitation, this study deliberately chose specific algorithms from each family that could serve as representatives based on their availability in open source software packages (namely, R package caret) and a reasonable number of tuning parameters, which correspond to popularity and ease of implementation. In general, many methods within the same family include the same underlying assumptions and algorithms to interpret data, with only a few tuning parameters changed or added between them. In some cases, such as neural networks and SVM, differences can include the number of layers or nature of the kernel (linear, polynomial, radial, etc.).
This study adhered to the Checklist for assessment of medical AI (ChAMAI) [42] as discussed in [43] in several key areas. However, one potential limitation is the lack of an external validation set separated temporally or gathered from another institution relative to the training cohort. The findings will need evaluation in a larger cohort of patients at different institutions, since the dataset is small and may not provide a comprehensive representation of the potential patient populations. This study does not claim that the models are generalizable to different populations. Rather, the validation set was imputed independently from the training set, excluded from model training, and served only to assess reproducibility in a homogeneous patient cohort [44]. Another limitation is the inclusion of a large set of features, e.g., ABG (arterial blood gas) measurements and bloodwork, such as PaO2 (partial pressure of oxygen) and WBC (white blood cell) count, recorded up to day 10 after admission (Supplementary Table 2), which may not be available as predictive features in datasets from other institutions. However, this dataset was sufficient for evaluating ensemble methods in the context of in-patient clinical data due to the inclusion of common demographic information and routine laboratory measurements, such as blood chemistry, CBC, and ABG.
The ML methodology for clinical data analysis proposed here is complementary with the application of ensemble methods to other types of patient datasets, such as imaging (e.g., [45], [46], [47]). A more accurate evaluation of ensemble method performance using time varying or image based COVID-19 patient data such as hospitalization duration or CT (computed tomography) scans could pave the way for improved interpretation of ensemble behavior on datasets with similar clinical features and modalities. Overall, this study shows that greater insight into the accuracy of ensemble algorithms and their optimization may be possible, with the goal to provide a foundation to guide future clinical data analysis.
CRediT authorship contribution statement
Rianne Kablan: Data curation, Formal analysis, Investigation, Validation, Visualization, Writing – original draft. Hunter A. Miller: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – review & editing. Sally Suliman: Data curation, Methodology, Project administration, Writing – review & editing. Hermann B. Frieboes: Formal analysis, Methodology, Investigation, Project administration, Resources, Supervision, Visualization, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors gratefully acknowledge Marianna Weaver, Dipan Karmali, Apurv Agarwal, and Viral Desai for contributing to the retrospective chart data collection.
Summary Table
• Ensemble learning has been increasingly used with machine learning to improve predictive performance.
• Although stacked generalization, a type of heterogeneous ensemble of machine learning models, has emerged in clinical data analysis, it remains unclear how to define the best model combinations for strong predictive performance.
• This study systematically evaluates “base” learner models and their optimized combination using “meta” learner models in stacked ensembles to maximize performance in the context of clinical outcomes.
• A rigorous analysis based on evaluating combinations of “base” learners and the number of features included can identify “meta” learners that yield strong ensemble predictive performance when analyzing clinical data.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.ijmedinf.2023.105090.
Data Availability
The data for this study are available upon reasonable request.
References
1. Shaban W.M., Rabie A.H., Saleh A.I., Abo-Elsoud M.A. A new COVID-19 Patients Detection Strategy (CPDS) based on hybrid feature selection and enhanced KNN classifier. Knowl. Based Syst. 2020;205. doi: 10.1016/j.knosys.2020.106270.
2. Zhou X., Wang Z., Li S., Liu T., Wang X., Xia J., Zhao Y. Machine learning-based decision model to distinguish between COVID-19 and influenza: a retrospective, two-centered, diagnostic study. Risk Manag. Healthc. Policy. 2021;14:595–604. doi: 10.2147/RMHP.S291498.
3. Li W.T., Ma J., Shende N., Castaneda G., Chakladar J., Tsai J.C., Apostol L., Honda C.O., Xu J., Wong L.M., Zhang T., Lee A., Gnanasekar A., Honda T.K., Kuo S.Z., Yu M.A., Chang E.Y., Rajasekaran M.R., Ongkeko W.M. Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis. BMC Med. Inf. Decis. Making. 2020;20:247. doi: 10.1186/s12911-020-01266-z.
4. Li L., Qin L., Xu Z., Yin Y., Wang X., Kong B., Bai J., Lu Y., Fang Z., Song Q., Cao K., Liu D., Wang G., Xu Q., Fang X., Zhang S., Xia J., Xia J. Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy. Radiology. 2020;296:E65–E71. doi: 10.1148/radiol.2020200905.
5. Moulaei K., Shanbehzadeh M., Mohammadi-Taghiabad Z., Kazemi-Arpanahi H. Comparing machine learning algorithms for predicting COVID-19 mortality. BMC Med. Inf. Decis. Making. 2022;22:2. doi: 10.1186/s12911-021-01742-0.
6. Ehwerhemuepha L., Danioko S., Verma S., Marano R., Feaster W., Taraman S., Moreno T., Zheng J., Yaghmaei E., Chang A. A super learner ensemble of 14 statistical learning models for predicting COVID-19 severity among patients with cardiovascular conditions. Intell. Based Med. 2021;5. doi: 10.1016/j.ibmed.2021.100030.
7. Selva K.J., van de Sandt C.E., Lemke M.M., Lee C.Y., Shoffner S.K., Chua B.Y., Davis S.K., Nguyen T.H.O., Rowntree L.C., Hensen L., Koutsakos M., Wong C.Y., Mordant F., Jackson D.C., Flanagan K.L., Crowe J., Tosif S., Neeland M.R., Sutton P., Licciardi P.V., Crawford N.W., Cheng A.C., Doolan D.L., Amanat F., Krammer F., Chappell K., Modhiran N., Watterson D., Young P., Lee W.S., Wines B.D., Mark Hogarth P., Esterbauer R., Kelly H.G., Tan H.X., Juno J.A., Wheatley A.K., Kent S.J., Arnold K.B., Kedzierska K., Chung A.W. Systems serology detects functionally distinct coronavirus antibody features in children and elderly. Nat. Commun. 2021;12:2037. doi: 10.1038/s41467-021-22236-7.
8. Li C., Zhang S., Zhang H., Pang L., Lam K., Hui C., Zhang S. Using the K-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer. Comput. Math. Methods Med. 2012;2012. doi: 10.1155/2012/876545.
9. Naimi A.I., Balzer L.B. Stacked generalization: an introduction to super learning. Eur. J. Epidemiol. 2018;33:459–464. doi: 10.1007/s10654-018-0390-z.
10. Sesmero M.P., Ledezma A.I., Sanchis A. Generating ensembles of heterogeneous classifiers using stacked generalization. Wires Data Min. Knowl. 2015;5:21–34.
11. Wolpert D.H. Stacked generalization. Neural Netw. 1992;5:241–259.
12. Oguntimilehin A., Adetunmbi O., Osho I. Towards achieving optimal performance using stacked generalization algorithm: a case study of clinical diagnosis of malaria fever. Int. Arab. J. Inf. Techn. 2019;16:1074–1081.
13. Ma Z., Wang P., Gao Z., Wang R., Khalighi K. Ensemble of machine learning algorithms using the stacked generalization approach to estimate the warfarin dose. PLoS One. 2018;13:e0205872. doi: 10.1371/journal.pone.0205872.
14. Nguyen H., Byeon H. Prediction of Parkinson’s Disease Depression Using LIME-Based Stacking Ensemble Model. Mathematics. 2023;11:708.
15. Bhosale Y.H., Patnaik K.S. PulDi-COVID: Chronic obstructive pulmonary (lung) diseases with COVID-19 classification using ensemble deep convolutional neural network from chest X-ray images to minimize severity and mortality rates. Biomed. Signal Process. Control. 2023;81. doi: 10.1016/j.bspc.2022.104445.
16. Ikemura K., Bellin E., Yagi Y., Billett H., Saada M., Simone K., Stahl L., Szymanski J., Goldstein D.Y., Reyes Gil M. Using automated machine learning to predict the mortality of patients with COVID-19: prediction model development study. J. Med. Internet Res. 2021;23:e23458. doi: 10.2196/23458.
17. Saadatmand S., Salimifard K., Mohammadi R., Kuiper A., Marzban M., Farhadi A. Using machine learning in prediction of ICU admission, mortality, and length of stay in the early stage of admission of COVID-19 patients. Ann. Oper. Res. 2022:1–29. doi: 10.1007/s10479-022-04984-x.
18. Sottile P.D., Albers D., DeWitt P.E., Russell S., Stroh J.N., Kao D.P., Adrian B., Levine M.E., Mooney R., Larchick L., Kutner J.S., Wynia M.K., Glasheen J.J., Bennett T.D. Real-time electronic health record mortality prediction during the COVID-19 pandemic: a prospective cohort study. J. Am. Med. Inform. Assoc. 2021;28:2354–2365. doi: 10.1093/jamia/ocab100.
19. Sankaranarayanan S., Balan J., Walsh J.R., Wu Y., Minnich S., Piazza A., Osborne C., Oliver G.R., Lesko J., Bates K.L., Khezeli K., Block D.R., DiGuardo M., Kreuter J., O'Horo J.C., Kalantari J., Klee E.W., Salama M.E., Kipp B., Morice W.G., Jenkinson G. COVID-19 mortality prediction from deep learning in a large multistate electronic health record and laboratory information system data set: algorithm development and validation. J. Med. Internet Res. 2021;23:e30157. doi: 10.2196/30157.
20. Cui S., Wang Y., Wang D., Sai Q., Huang Z., Cheng T.C.E. A two-layer nested heterogeneous ensemble learning predictive method for COVID-19 mortality. Appl. Soft Comput. 2021;113. doi: 10.1016/j.asoc.2021.107946.
21. Schirato L., Makina K., Dwayne F., Pouriyeh S., Shahriar H. COVID-19 mortality prediction using machine learning techniques. 2021 IEEE International Conference on Digital Health (ICDH), Chicago, IL, 2021.
22. Saran Kumar A., Rekha R. An improved hawks optimizer based learning algorithms for cardiovascular disease prediction. Biomed. Signal Process. Control. 2023;81.
23. Hasan M., Bath P.A., Marincowitz C., Sutton L., Pilbery R., Hopfgartner F., Mazumdar S., Campbell R., Stone T., Thomas B., Bell F., Turner J., Biggs K., Petrie J., Goodacre S. Pre-hospital prediction of adverse outcomes in patients with suspected COVID-19: development, application and comparison of machine learning and deep learning methods. Comput. Biol. Med. 2022;151. doi: 10.1016/j.compbiomed.2022.106024.
24. Das K., Behera R.N. A survey on machine learning: Concept, algorithms and applications. Int. J. Innov. Res. Comput. Commun. Eng. 2017;5:1301–1309.
25. Olson R., La Cava W., Mustahsan Z., Varik A., Moore J. Data-driven advice for applying machine learning to bioinformatics problems. Pac. Symp. Biocomput. 2018;23:192–203.
26. Madley-Dowd P., Hughes R., Tilling K., Heron J. The proportion of missing data should not be used to guide decisions on multiple imputation. J. Clin. Epidemiol. 2019;110:63–73. doi: 10.1016/j.jclinepi.2019.02.016.
27. Waljee A.K., Mukherjee A., Singal A.G., Zhang Y., Warren J., Balis U., Marrero J., Zhu J., Higgins P.D. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open. 2013;3. doi: 10.1136/bmjopen-2013-002847.
28. Hong S., Lynn H.S. Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med. Res. Method. 2020;20:199. doi: 10.1186/s12874-020-01080-1.
29. Vieira S.M., Kaymak U., Sousa J.M.C. Cohen's kappa coefficient as a performance measure for feature selection. Proceedings 2010 IEEE World Congress on Computational Intelligence (IEEE CEC 2010), Barcelona, Spain, 2010:1–8.
30. Arslan H., Arslan H. A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier. Eng. Sci. Technol. Int. J. 2021;24:839–847.
31. Pourhomayoun M., Shakibi M. Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making. Smart Health. 2021;20. doi: 10.1016/j.smhl.2020.100178.
32. Mathew C., Asha P. Detection of Covid-19 from chest X-ray scans using machine learning. AIP Conference Proceedings. 2022;2463.
33. Hussain L., Nguyen T., Li H., Abbasi A.A., Lone K.J., Zhao Z., Zaib M., Chen A., Duong T.Q. Machine-learning classification of texture features of portable chest X-ray accurately classifies COVID-19 lung infection. Biomed. Eng. Online. 2020;19. doi: 10.1186/s12938-020-00831-x.
34. Alballa N., Al-Turaiki I. Machine learning approaches in COVID-19 diagnosis, mortality, and severity risk prediction: a review. Inform. Med. Unlock. 2021;24. doi: 10.1016/j.imu.2021.100564.
35. Cornelius E., Akman O., Hrozencik D. COVID-19 mortality prediction using machine learning-integrated random forest algorithm under varying patient frailty. Mathematics. 2021;9:2043.
36. Shade J.K., Doshi A.N., Sung E., Popescu D.M., Minhas A.S., Gilotra N.A., Aronis K.N., Hays A.G., Trayanova N.A. Real-time prediction of mortality, cardiac arrest, and thromboembolic complications in hospitalized patients with COVID-19. JACC Adv. 2022;1. doi: 10.1016/j.jacadv.2022.100043.
37. Famiglini L., Campagner A., Carobene A., Cabitza F. A robust and parsimonious machine learning method to predict ICU admission of COVID-19 patients. Med. Biol. Eng. Compu. 2022:1–13. doi: 10.1007/s11517-022-02543-x.
38. Allyn J., Devineau M., Oliver M., Descombes G., Allou N., Ferdynus C. A descriptive study of routine laboratory testing in intensive care unit in nearly 140,000 patient stays. Sci. Rep. 2022;12:21526. doi: 10.1038/s41598-022-25961-1.
39. Blanco N., Leekha S., Magder L., Jackson S.S., Tamma P.D., Lemkin D., Harris A.D. Admission laboratory values accurately predict in-hospital mortality: a retrospective cohort study. J. Gen. Intern. Med. 2020;35:719–723. doi: 10.1007/s11606-019-05282-2.
40. Froom P., Shimoni Z. Prediction of hospital mortality rates by admission laboratory tests. Clin. Chem. 2006;52:325–328. doi: 10.1373/clinchem.2005.059030.
41. Johnson A.E., Pollard T.J., Shen L., Lehman L.W., Feng M., Ghassemi M., Moody B., Szolovits P., Celi L.A., Mark R.G. MIMIC-III, a freely accessible critical care database. Sci. Data. 2016;3. doi: 10.1038/sdata.2016.35.
42. Wu S., Roberts K., Datta S., Du J., Ji Z., Si Y., Soni S., Wang Q., Wei Q., Xiang Y., Zhao B., Xu H. Deep learning in clinical natural language processing: a methodical review. J. Am. Med. Inform. Assoc. 2020;27:457–470. doi: 10.1093/jamia/ocz200.
43. Cabitza F., Campagner A. The need to separate the wheat from the chaff in medical informatics: introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int. J. Med. Inf. 2021;153. doi: 10.1016/j.ijmedinf.2021.104510.
44. Ramspek C.L., Jager K.J., Dekker F.W., Zoccali C., van Diepen M. External validation of prognostic models: what, why, how, when and where? Clin. Kidney J. 2021;14:49–58. doi: 10.1093/ckj/sfaa188.
45. Balasubramaniam S., Satheesh Kumar K. Optimal Ensemble learning model for COVID-19 detection using chest X-ray images. Biomed. Signal Process. Control. 2023;81. doi: 10.1016/j.bspc.2022.104392.
46. Elen A. Covid-19 detection from radiographs by feature-reinforced ensemble learning. Concurr. Comput. 2022;34:e7179. doi: 10.1002/cpe.7179.
47. Mouhafid M., Salah M., Yue C., Xia K. Deep ensemble learning-based models for diagnosis of COVID-19 from chest CT images. Healthcare (Basel). 2022;10. doi: 10.3390/healthcare10010166.