Abstract
Background
Reliable predictive modeling in high-dimensional biomedical data requires a balance between accuracy, interpretability, and computational efficiency. However, existing ensemble methods often overlook model diversity or rely on ad hoc feature-selection approaches, which limit generalizability. This study introduces a hybrid feature-selection and diversity-guided stacking framework designed to improve robustness and scalability across clinical and other data-intensive domains.
Methods
The proposed framework integrates a hybrid feature-selection pipeline—combining Variance Inflation Factor (VIF), Analysis of Variance (ANOVA), Sequential Backward Elimination (SBE), and Lasso regression—to reduce multicollinearity and overfitting. It also employs a diversity-aware stacking strategy that constructs sub-model sets based on pairwise diversity measures (Disagreement, Yule’s Q, and Cohen’s Kappa) and non-pairwise metrics (Entropy and Kohavi–Wolpert). Sixteen base classifiers and five meta-learners were trained using repeated 10-fold cross-validation. The framework was evaluated using data from 4,778 hospitalized COVID-19 patients with 116 clinical and laboratory attributes, preprocessed using robust scaling and ROSE-based class balancing.
Results
The optimal configuration, which stacked Random Forest and XGBoost models using a Neural Network meta-learner, achieved 91.4% accuracy (95% CI: 89.8–92.8), AUC = 0.955, F1 = 0.801, and MCC = 0.746, outperforming the best individual model (AdaBoost, 90.2%). Training time (~450 s) and per-case inference time (<0.2 s) demonstrated computational feasibility. Feature-importance analysis and SHAP-based interpretation confirmed clinical relevance and interpretability.
Conclusions
The hybrid feature-selection and diversity-guided stacking framework improves predictive accuracy and interpretability while maintaining computational efficiency. Although validated using COVID-19 mortality data, the approach is broadly applicable to biomedical, environmental, and engineering prediction tasks that require interpretable and scalable ensemble learning.
1 Introduction
Machine learning (ML) has become an essential component of modern biomedical research, enabling the discovery of complex, nonlinear patterns and supporting improved risk stratification across diverse clinical domains. The primary goal of ML research is to develop efficient, interpretable, and generalizable algorithms capable of delivering reliable performance across heterogeneous datasets [1]. Efficiency in ML encompasses not only time and memory requirements but also data utilization and interpretability—key considerations for deployment in high-stakes clinical environments, where transparency, reproducibility, and auditability are critical.
Traditional ML models, however, often encounter significant challenges, including data quality issues, overfitting, class imbalance, and limited interpretability, which restrict their usefulness in clinical applications. These limitations have driven growing interest in ensemble learning, in which multiple algorithms are combined to reduce both variance and bias, thereby improving predictive performance compared with individual models [2–4]. Among ensemble approaches, stacking has emerged as a particularly powerful technique. Stacking integrates diverse base learners and employs a meta-learner to combine their predictions, leveraging complementary error structures to improve accuracy and robustness [5–7].
Recent advancements in various disciplines highlight both the promise and the challenges of ensemble modeling. In cardiovascular diagnostics, Feng et al. (2023) developed a hybrid model that combined hemodynamic modeling with ML to achieve >90% accuracy in under two seconds per case, demonstrating that hybrid approaches can deliver high computational efficiency while maintaining strong performance [8]. Similarly, Wang et al. (2023) used the Super Learner algorithm to improve prediction of cumulative lead exposure, though at the cost of substantial computational load due to the combination of multiple algorithms [9]. In proteomics and chemical sciences, feature selection and dimensionality reduction have proven effective in enhancing prediction accuracy while reducing model complexity.
Xu et al. (2022) benchmarked 13 ML models for protein-level inference from RNA features across more than 2,500 samples and 20 datasets. Their findings demonstrated that combining appropriate feature selection with classical models and voting ensembles improved accuracy, although computation time varied widely [10].
Reda et al. (2023) applied variable selection with partial least squares regression to predict olive-oil quality parameters using near-infrared spectroscopy, showing that variable reduction improved accuracy and decreased computational demands [11].
In oncology, stacking and hybrid ensembles have also yielded substantial gains in predictive performance and generalization. Mohammed et al. (2021) applied a CNN-based stacking ensemble to multi-cancer RNA-Seq classification, achieving superior accuracy to single models while retaining computational feasibility [12]. Wang et al. (2025) designed a multimodal stacking framework that integrated radiomics and deep learning for head-and-neck cancer prognosis (C-index = 0.93), demonstrating the influence of meta-learner selection on scalability [13]. Kwon et al. (2019) found that gradient boosting performed best as a meta-learner for accuracy, while generalized linear models minimized error in breast cancer classification, highlighting the trade-offs between model complexity and efficiency [14]. Other architectures, such as the relevance-aware capsule network [15], deep convolutional neural networks [16], and U-Net–based MRI segmentation models [17], have demonstrated that improvements in accuracy commonly require substantially higher training time and memory, emphasizing the need to balance predictive strength with practicality and interpretability.
Ensemble learning has been widely adopted in other clinical areas. Abualnaja et al. [18] analyzed 32 studies involving 142,459 patients with meningiomas and reported that combined radiomic and clinical ensemble models achieved AUCs of 0.74–0.81, demonstrating robust multimodal representation; Lei et al. [19] reported similar findings. Other studies in cardiovascular disease have shown similar results. Dhingra et al. [20] developed an ensemble model (PRESENT-SHD) using 261,228 ECGs, achieving AUROC values of 0.85–0.90 across multiple hospitals, indicating strong cross-population stability. Tseng et al. [21] used XGBoost and random forest models to predict acute kidney injury following cardiac surgery (AUC = 0.843), demonstrating ensemble learning’s value in perioperative risk prediction.
In infectious diseases research, Sawesi et al. [22] reviewed 17 leptospirosis studies and found that ML and deep learning methods—including CNN ensembles—achieved high accuracy (80–98%), though most lacked external validation. Chiasakul et al. [23] reported that AI methods for venous thromboembolism prediction outperformed traditional risk scores (mean AUC 0.79 vs 0.61), although many studies exhibited bias and limited generalizability.
Ensemble learning has also consistently outperformed single models in forecasting outbreaks of dengue, influenza, Ebola, and COVID-19. Early COVID-19 mortality forecasts demonstrated that ensembles delivered greater accuracy and precision than individual models [24].
Stacking is particularly suited to heterogeneous clinical datasets, such as COVID-19 mortality prediction, which depends on numerous clinical, biochemical, and physiological indicators [25]. Berliana and Bustamam [4] demonstrated that a two-level stacking model achieved more than 97% accuracy with CT data and 99% with chest X-ray images. Cui et al. [26] introduced a nested heterogeneous ensemble integrating SVR, ELM, and logistic regression, achieving improved generalization. Li et al. [27] predicted early mortality using five base models and a genetic-algorithm optimization procedure, achieving an AUC of 0.907 in a cohort of 4,711 patients. Other studies demonstrated that hybrid ensembles incorporating supervised and unsupervised learning improved performance by over 10%, and that boosted models remained competitive with strong clinical relevance [28,29].
Despite these advances, several methodological limitations persist. Systematic reviews highlight widespread issues such as small and unrepresentative datasets, weak handling of missing data, lack of external validation, and overreliance on discrimination metrics alone [29–31]. Many studies also neglect calibration, effect-size estimation, or fairness analyses, reducing clinical interpretability [32,33]. Research shows a nonlinear relationship between predictive gain and computational cost: complex models often deliver higher accuracy but at significant increases in resource consumption [34–37]. While innovations such as subgraph learning [35] and simplified coronary models [8] can mitigate these burdens, a careful balance of accuracy, efficiency, and interpretability remains necessary.
Sample size adequacy is another concern. Many COVID-19 models are trained on datasets too small for their complexity, leading to instability and overfitting [38]. Class imbalance is also common in mortality modeling; although oversampling and weighting strategies are widely used, these must be validated to avoid artificial distortions [39,40].
Furthermore, many ensemble studies rely heavily on tree-based models such as random forest, XGBoost, and LightGBM, limiting diversity and restricting the full advantages of ensemble learning [41,42]. Quantitative measures of diversity—such as Yule’s Q, Disagreement, Cohen’s Kappa, or Double-Fault—remain rarely used in COVID-19 modeling despite consistent evidence that diversity improves generalization [42–44].
To address these limitations, this study introduces a computationally efficient, diversity-guided stacking ensemble framework that integrates heterogeneous base classifiers and interpretable meta-learners to predict COVID-19 mortality. Our approach incorporates:
Hybrid feature-selection using variance inflation factor (VIF) analysis, ANOVA, sequential backward elimination (SBE), and Lasso regression to control multicollinearity and enhance interpretability;
Controlled ensemble depth to balance predictive gain and computational feasibility; and
Lightweight meta-learners capable of capturing nonlinear dependencies among diverse base learners.
We constructed sub-model ensembles using multiple diversity metrics across 16 machine learning algorithms and assessed model performance using discrimination, calibration, and statistical significance tests, including Wilcoxon, McNemar, and DeLong analyses. Model interpretability was enhanced through SHAP-based explanation of global and local prediction behavior.
This study presents a generalizable diversity-aware ensemble framework designed to balance accuracy, interpretability, and computational cost. Although applied here to COVID-19 mortality prediction, the approach is suitable for a wide range of biomedical prediction problems that require robust, interpretable, and scalable machine learning solutions.
2 Materials and methods
2.1 Overview of the proposed framework
As illustrated in Fig 1, this study adopts a multi-stage framework for predicting mortality risk, integrating standard machine learning techniques with the proposed algorithmic innovations. The workflow contains two primary layers:
Fig 1. Research methodology of the proposed machine learning framework.
Foundational Stage – Data preprocessing, normalization, and training of base models using established machine learning procedures.
Algorithmic Stage – A diversity-guided stacking ensemble that integrates hybrid feature selection, explicit model diversity assessment, and a comparison of multiple meta-learners to optimize predictive performance, interpretability, and computational efficiency.
Data from 4,778 confirmed COVID-19 cases were cleaned through exclusion of incomplete records, iterative multivariate imputation, and normalization. The hybrid feature-selection process removed multicollinearity using Variance Inflation Factor (VIF), followed by Analysis of Variance (ANOVA), Sequential Backward Elimination (SBE), and Lasso regression to select 15 key predictors.
Sixteen machine learning classifiers were trained using stratified and balanced datasets and evaluated via repeated 10-fold cross-validation. To enhance predictive performance, ensemble sets were constructed based on correlation and statistical diversity metrics, then stacked using five different meta-learners. Models were evaluated based on discrimination and calibration performance and validated using significance tests. Interpretability was assessed using feature importance and SHAP-based analyses.
2.2 Data source and ethical approval
Data were obtained from 4,778 confirmed COVID-19 patients admitted to three general hospitals in Tehran, Iran, between March 2020 and March 2021. Demographic, clinical, laboratory, symptom, comorbidity, vital sign, and outcome information was extracted from clinical records reviewed by trained medical staff. Laboratory findings were collected on the first day of admission through the hospital information system, and COVID-19 diagnosis was confirmed using real-time polymerase chain reaction (RT-PCR) of nasal or oropharyngeal swab samples.
The study followed formal institutional requirements and received ethical approval from the Institutional Review Board (IRB) of Shahid Beheshti University of Medical Sciences (IR.SBMU.RIGLD.REC.1401.032).
Informed consent was waived due to the retrospective nature of the study, and data were anonymized prior to analysis in accordance with the Declaration of Helsinki. This dataset provides comprehensive temporal, demographic, and clinical information suitable for developing predictive models for COVID-19 mortality. Additional details on the epidemiological profile of the cohort are available in Hatamabadi et al. [45].
2.3 Data preprocessing
Missing data were assessed and addressed prior to model development. Patients with missing values in any categorical variable or more than two missing continuous variables were excluded. Among the 123 available variables (52 categorical and 71 numeric), no categorical variables contained missing values. Seven numerical variables with more than 5% missingness were removed to reduce risk of bias. Remaining missing values were imputed using an iterative multivariate approach implemented in Scikit-learn [46], which models each variable with missing entries as a function of all other features in a chained regression process. This preserves multivariate relationships and minimizes bias under the Missing at Random (MAR) assumption.
The resulting imputed dataset was previously validated by Hatamabadi et al. [47] to confirm realistic variable distributions and consistent multivariate relationships. To account for skewed distributions and sensitivity to outliers, continuous variables were standardized using robust scaling [48], which centers variables on the median and scales using the interquartile range (IQR). This approach enhances stability in clinical models by reducing the influence of extreme values.
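For concreteness, the imputation-plus-scaling steps described above can be sketched in Python with Scikit-learn. The data below are synthetic, and settings such as `max_iter` are illustrative assumptions rather than the study's exact configuration:

```python
# Minimal sketch of the preprocessing pipeline: iterative multivariate
# imputation (chained regression) followed by robust scaling (median/IQR).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% values missing at random

pipe = Pipeline([
    ("impute", IterativeImputer(max_iter=10, random_state=0)),
    ("scale", RobustScaler()),  # centers on the median, scales by the IQR
])
X_prep = pipe.fit_transform(X)
print(np.isnan(X_prep).sum())  # all missing entries have been imputed
```

Wrapping both steps in a `Pipeline` ensures the scaler is fit on imputed values, mirroring the order described in the text.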
2.4 Feature selection
Feature Selection (FS) is essential for managing high-dimensional clinical datasets by reducing redundant, irrelevant, or correlated predictors while improving model accuracy, computational efficiency, and interpretability [49,50].
A multi-stage hybrid feature-selection strategy was implemented to progressively eliminate multicollinearity, non-informative predictors, and weak contributors. Four complementary methods were applied:
Variance Inflation Factor (VIF): Used to detect and remove highly collinear continuous variables, thereby improving model stability and avoiding inflated variance estimates [51,52].
Analysis of Variance (ANOVA): Applied next to evaluate between-group differences using F-tests and eliminate non-discriminative features with minimal computational cost [53,54].
Sequential Backward Elimination (SBE): Iteratively removed the least informative features based on cross-validated model performance, preserving meaningful interactions and improving generalization [55].
Lasso regression: Imposed regularization to shrink weak coefficients to zero, providing sparse and stable model structures suited for correlated predictors [56].
This integrated pipeline—removing collinearity (VIF), filtering weak predictors (ANOVA), refining via model performance (SBE), and enforcing sparsity (Lasso)—produced a compact, interpretable feature set optimized for ensemble learning.
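The four stages can be sketched compactly in Python on synthetic data. The cutoffs used here (VIF < 10, p < 0.05, five SBE-retained features) are common defaults assumed for illustration; the study's own thresholds may differ:

```python
# Illustrative VIF -> ANOVA -> SBE -> Lasso pipeline on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (f_classif, SequentialFeatureSelector,
                                       SelectFromModel)
from sklearn.linear_model import LogisticRegression, LassoCV, LinearRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, random_state=0)

# 1) VIF filter: VIF_j = 1 / (1 - R^2) from regressing feature j on the rest
def vif(X, j):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / max(1.0 - r2, 1e-12)

keep = [j for j in range(X.shape[1]) if vif(X, j) < 10]
X1 = X[:, keep]

# 2) ANOVA F-test filter: keep features significant at p < 0.05
_, pvals = f_classif(X1, y)
X2 = X1[:, pvals < 0.05]

# 3) Sequential backward elimination guided by cross-validated performance
sbe = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=min(5, X2.shape[1] - 1),
    direction="backward", cv=5)
X3 = sbe.fit_transform(X2, y)

# 4) Lasso: shrink weak coefficients to zero for a sparse final set
sel = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X3, y)
X4 = sel.transform(X3)
print(X.shape[1], "->", X1.shape[1], "->", X2.shape[1],
      "->", X3.shape[1], "->", X4.shape[1])
```

Ordering cheap filters (VIF, ANOVA) before expensive wrapper methods (SBE) keeps the overall cost manageable, which is the rationale behind the staged design.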
2.5 Model training and selection
Sixteen base machine learning models were trained, including ten standard algorithms and six boosting or bagging methods, representing diverse methodological families. Hyperparameters were optimized through grid search using the caret package [57,58], with 10-fold cross-validation repeated 10 times [59] to balance bias and variance. Models that gain little from extensive tuning (e.g., GLM, LDA, CART, Naïve Bayes) retained default settings to maximize computational efficiency.
Hyperparameter ranges were informed by existing literature and empirical results from clinical prediction research. Table 1 summarizes the optimized settings for each model.
Table 1. Parameter settings for the 16 base machine learning models.
| Models | Tuned hyperparameters |
|---|---|
| Generalized Linear Model (GLM) | none |
| Linear Discriminant Analysis (LDA) | none |
| Regularized Regression (Lasso) | alpha = 1; lambda = 0.0014 |
| Ridge Regularized Regression (Ridge) | alpha = 0; lambda = 0.098 |
| Elastic Net Regularized Regression (Elastic Net) | alpha = 0.5; lambda = 0.047 |
| k-Nearest Neighbors (KNN) | k = 1 |
| Naïve Bayes (NB) | none |
| Support Vector Machine (SVM) | sigma = 0.15, C = 10 |
| Classification and Regression Trees (CART) | none |
| Neural Network (NN) | size = 10, decay = 0.1 |
| C5.0 | trials = 50, model = “tree,” winnow = FALSE |
| Stochastic Gradient Boosting (GBM) | n.trees = 250, interaction.depth = 5, shrinkage = 0.1, n.minobsinnode = 10 |
| Extreme Gradient Boosting (XGBoost) | nrounds = 250, max_depth = 5, eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 1 |
| Random Forest (RF) | mtry = 3, ntree = 700 |
| AdaBoost Random Forest | nIter = 100, method = “Adaboost.M1” |
| Bagged CART (Treebag) | none |
Model selection was based on three criteria:
Algorithmic diversity,
Demonstrated success in biomedical or COVID-19 prediction studies, and
Complementary bias–variance profiles.
The final pool (supported by systematic evidence, e.g., Bottino et al. [60]) included linear models (GLM, Lasso, Ridge, Elastic Net), probabilistic models (Naïve Bayes, LDA), instance-based learners (KNN), tree-based models (CART, C5.0, Random Forest, XGBoost, GBM, Treebag), and Neural Networks.
This diversity ensured coverage of linear and nonlinear relationships, uncertainty modeling, and hierarchical interactions common in clinical decision data.
Cross-validated performance guided final model selection, which was subsequently confirmed through independent test-set validation.
2.6 Diversity-guided sub-model construction
Traditional stacking often selects base models with low prediction correlation to ensure each contributes complementary information. Highly correlated models (>0.75) add redundancy and weaken ensemble gains [61].
Sixteen candidate models were initially generated using the caretList() function from the caretEnsemble package [57,58], and pairwise prediction correlations were calculated across 10 repeats of 10-fold cross-validation. Where a pair exceeded a correlation of 0.75, the less accurate model was removed.
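The pruning rule is language-agnostic and can be sketched as follows (the study's implementation uses caretEnsemble in R; model names, accuracies, and prediction vectors below are synthetic placeholders):

```python
# Drop the less accurate member of any model pair whose cross-validated
# prediction correlation exceeds a threshold (0.75 in the text).
import numpy as np

def prune_correlated(preds, acc, threshold=0.75):
    """preds: dict name -> array of predicted probabilities;
    acc: dict name -> accuracy. Models are visited best-first, and a model
    is kept only if it is not too correlated with any already-kept model."""
    names = sorted(preds, key=acc.get, reverse=True)  # most accurate first
    kept = []
    for name in names:
        if all(abs(np.corrcoef(preds[name], preds[k])[0, 1]) <= threshold
               for k in kept):
            kept.append(name)
    return kept

rng = np.random.default_rng(1)
base = rng.random(500)
preds = {
    "glm":   base + 0.05 * rng.random(500),  # nearly identical to "lasso"
    "lasso": base + 0.05 * rng.random(500),
    "rf":    rng.random(500),                # independent predictions
}
acc = {"glm": 0.83, "lasso": 0.85, "rf": 0.89}
print(prune_correlated(preds, acc))  # ['rf', 'lasso'] — "glm" is dropped
```

Visiting models in accuracy order guarantees that whenever two models are redundant, the stronger one survives, matching the rule stated above.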
To enhance diversity beyond correlation filtering, additional sub-model sets were constructed using explicit diversity metrics capturing complementary error patterns among classifiers. Pairwise measures (Disagreement, Yule’s Q, Cohen’s Kappa, Double-Fault) and non-pairwise metrics (Entropy, Kohavi–Wolpert) were computed following Tattar [62].
The contingency table defining model agreements (n11 and n00) and disagreements (n10 and n01) across N observations is presented in Table 2.
Table 2. Contingency table illustrating agreement and disagreement between two classifiers.
| | M1 correct (1) | M1 incorrect (0) |
|---|---|---|
| M2 correct (1) | n11 | n10 |
| M2 incorrect (0) | n01 | n00 |
2.6.1 Disagreement measure.
This quantifies the proportion of instances on which the two classifiers differ in their predictions (1):
$$D_{1,2} = \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}} \quad (1)$$
Higher values indicate greater diversity and reduced redundancy.
2.6.2 Yule’s Q-statistic.
Yule’s Q (or Q-statistic) assesses the strength and direction of association between two classifiers’ predictions (range: –1 to +1). Lower absolute values denote weaker association and thus higher diversity (2):
$$Q_{1,2} = \frac{n_{11}\,n_{00} - n_{01}\,n_{10}}{n_{11}\,n_{00} + n_{01}\,n_{10}} \quad (2)$$
2.6.3 Cohen’s Kappa statistic.
A widely used measure that evaluates inter-model agreement while adjusting for chance. Low or negative Kappa values suggest that classifiers make independent errors, which enhances ensemble robustness.
2.6.4 Double-fault measure.
Measures the proportion of cases where both classifiers misclassify the same instance. Smaller values indicate complementary error patterns and reduced correlated failures (3):
$$DF_{1,2} = \frac{n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}} \quad (3)$$
Two non-pairwise metrics were also used:
Entropy measure [63]: Reflects the overall variability of predictions across all classifiers, ranging from 0 (perfect agreement, no diversity) to 1 (maximum diversity).
Kohavi-Wolpert Measure [62]: Derived from error variance decomposition, it quantifies the dispersion of predictions across classifiers; higher values imply greater diversity and richer ensemble representation.
Pairwise metrics identified redundant learners, while non-pairwise metrics assessed overall heterogeneity within sub-model sets. This integrated diversity evaluation ensured complementary base learners and improved the generalization performance of the stacking ensemble.
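As a concrete illustration, the pairwise measures (1)–(3) can be computed directly from two classifiers' per-instance correctness indicators. The Python sketch below uses toy vectors rather than the study's predictions (the study followed the R implementation in Tattar [62]):

```python
# Pairwise diversity measures from correctness indicators
# (1 = classifier correct, 0 = incorrect) over the same test instances.
import numpy as np

def pairwise_diversity(c1, c2):
    c1, c2 = np.asarray(c1), np.asarray(c2)
    n11 = np.sum((c1 == 1) & (c2 == 1))  # both correct
    n00 = np.sum((c1 == 0) & (c2 == 0))  # both incorrect
    n10 = np.sum((c1 == 1) & (c2 == 0))
    n01 = np.sum((c1 == 0) & (c2 == 1))
    N = n11 + n10 + n01 + n00
    disagreement = (n10 + n01) / N                        # Eq (1)
    q = (n11 * n00 - n01 * n10) / max(n11 * n00 + n01 * n10, 1)  # Eq (2)
    double_fault = n00 / N                                # Eq (3)
    return disagreement, q, double_fault

c1 = [1, 1, 0, 1, 0, 1, 1, 0]
c2 = [1, 0, 1, 1, 0, 0, 1, 1]
d, q, df = pairwise_diversity(c1, c2)
print(round(d, 3), round(q, 3), round(df, 3))  # 0.5 -0.143 0.125
```

High disagreement with a low (or negative) Q and a small double-fault value is the pattern sought when assembling complementary sub-model sets.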
2.7 Meta-learner integration
Predictions from the sub-model sets were integrated using a stacking framework with five meta-learners:
Generalized Linear Model (GLM)
Linear Discriminant Analysis (LDA)
Random Forest (RF)
Gradient Boosting Machine (GBM)
Neural Network (NN)
Linear models (GLM, LDA) were chosen for their transparency and stable inference, while tree-based (RF, GBM) and neural meta-learners modeled nonlinear dependencies among base models. This balanced design supports both interpretability and computational efficiency, aligning with the study’s objective of developing a robust, clinically applicable framework for mixed-type healthcare data.
Stacking was implemented using the caretEnsemble package to fuse the diverse predictive outputs effectively.
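The study implements stacking in R with caretEnsemble; a rough Scikit-learn analogue of one configuration (random forest and gradient boosting fused by a small neural-network meta-learner) might look like the sketch below. All data and hyperparameters are illustrative assumptions:

```python
# Stacking sketch: two tree-based base learners whose out-of-fold
# probabilities are combined by a neural-network meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                      stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=0),
    stack_method="predict_proba",  # meta-learner sees class probabilities
    cv=5)                          # out-of-fold predictions avoid leakage
stack.fit(Xtr, ytr)
print(round(stack.score(Xte, yte), 3))
```

Using cross-validated (out-of-fold) base-model predictions to train the meta-learner is the key design choice that prevents the stack from overfitting the base models' training errors.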
2.8 Model evaluation and statistical analysis
Performance was assessed using an independent test dataset. Discrimination metrics included accuracy, sensitivity, specificity, precision, F1-score, Cohen’s Kappa, area under the ROC curve (AUC), and the Matthews correlation coefficient (MCC). These metrics collectively capture overall correctness, class-specific detection, and robustness under class imbalance—critical in mortality prediction, where false negatives carry severe clinical risk and false positives may strain resources. AUC quantified global discrimination across thresholds, while MCC provided a balanced evaluation under uneven outcome distributions [64].
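These discrimination metrics map directly onto standard library calls. The sketch below uses toy labels and scores purely to show the computations (values are illustrative, not study results):

```python
# Discrimination metrics on toy paired labels, predictions, and scores.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, cohen_kappa_score, matthews_corrcoef,
                             roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred  = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1]   # one FN, one FP
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6, 0.2, 0.95]

print("accuracy:   ", accuracy_score(y_true, y_pred))        # 0.8
print("sensitivity:", recall_score(y_true, y_pred))          # 0.8
print("precision:  ", precision_score(y_true, y_pred))       # 0.8
print("F1:         ", round(f1_score(y_true, y_pred), 3))    # 0.8
print("kappa:      ", round(cohen_kappa_score(y_true, y_pred), 3))  # 0.6
print("MCC:        ", round(matthews_corrcoef(y_true, y_pred), 3))  # 0.6
print("AUC:        ", round(roc_auc_score(y_true, y_score), 3))     # 0.96
```

Note that threshold-based metrics (accuracy, F1, MCC) consume hard predictions, while AUC consumes the continuous scores, which is why both are reported.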
Calibration was assessed using reliability curves to compare predicted probabilities against observed outcomes. Statistical comparisons employed:
Wilcoxon signed-rank tests for accuracy differences,
Effect size [65] interpreted per Cohen’s criteria (large = 0.5, medium = 0.3, small = 0.1) [66],
Holm correction to control the family-wise error rate [67],
McNemar’s test to compare classification outcomes between paired models,
DeLong’s test to assess the statistical significance of AUC differences under the null hypothesis of equal performance [68].
All tests were conducted using appropriate R packages, including rstatix [69] and pROC.
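As an illustration of the paired-model comparison, McNemar's test depends only on the discordant counts b and c (cases one model classifies correctly and the other does not). A minimal continuity-corrected sketch with made-up counts (the study itself used R):

```python
# Continuity-corrected McNemar test on the discordant cells of a
# paired-outcome contingency table; counts are illustrative.
from scipy.stats import chi2

def mcnemar(b, c):
    """b = cases model A got right and model B got wrong; c = the reverse.
    Returns the continuity-corrected chi-square statistic and its p-value."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

stat, p = mcnemar(b=40, c=20)
print(round(stat, 3), round(p, 4))
```

Because the concordant cells cancel out, McNemar's test isolates exactly where two classifiers disagree, which is why it complements overall-accuracy comparisons.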
2.9 Model interpretability
Feature importance and effect analyses were conducted to explain individual predictions and quantify how specific feature values influenced model outputs, using tools from the iml package [70]. To further enhance interpretability, we employed SHAP (SHapley Additive exPlanations), a model-agnostic framework that quantifies each feature’s contribution to predicted outcomes [71].
SHAP analysis was applied directly to the final stacked ensemble, treating it as a single predictive function. This approach decomposed predicted probabilities into feature-level attributions of the original clinical variables rather than intermediate model outputs, enabling transparent interpretation of global importance, feature interactions, and local instance-level effects driving mortality predictions.
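To make the additive decomposition concrete, exact Shapley values can be enumerated for a toy value function over three features. This illustrates the principle SHAP approximates; the feature names and contributions are hypothetical, and the study itself used the SHAP library on the stacked ensemble:

```python
# Exact Shapley attribution for a toy 3-feature "model" by enumerating
# all feature coalitions (feasible only for tiny feature counts).
from itertools import combinations
from math import factorial

def shapley(value, features):
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # weight of a coalition of size k in the Shapley average
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {f}) - value(set(S)))
        phi[f] = total
    return phi

# Hypothetical additive contributions to a predicted mortality risk
contrib = {"age": 0.30, "LDH": 0.20, "ferritin": 0.10}
value = lambda S: sum(contrib[f] for f in S)  # coalition's prediction
phi = shapley(value, list(contrib))
print(phi)  # for an additive model, Shapley values equal the contributions
```

The attributions sum to the difference between the full prediction and the empty-coalition baseline (the "efficiency" property), which is exactly the additive decomposition described in the text.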
3 Results
3.1 Data characteristics and preparation
The study analyzed 4,778 confirmed COVID-19 cases, comprising 116 clinical, laboratory, and demographic features. The overall mortality rate was 22% (1,050 patients). Males accounted for 59.6% of deaths. The mean age of deceased patients was 70.8 years (SD = 15.6), compared with 58.3 years (SD = 16.9) among survivors. Mortality showed significant associations with comorbidities such as hypertension, diabetes, and heart failure, highlighting their importance in COVID-19 risk assessment (Fig 2).
Fig 2. Comorbidity distribution and mortality associations in the study cohort.
Numeric features were standardized using the robust scaler based on the interquartile range (IQR) to reduce the influence of outliers. The dataset was split into training (70%, n = 3,345) and testing (30%, n = 1,433) subsets, maintaining consistent mortality rates (21.97%) across both. The original training set exhibited class imbalance (22% “Death” vs. 78% “Alive”), which was corrected using the ROSE package [72], resulting in a balanced 1:1 ratio (“Death”: 1,616; “Alive”: 1,729). This ensured robust model training and reliable performance evaluation.
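The split-and-balance procedure can be sketched as follows. ROSE generates synthetic balanced samples in R; the sketch approximates it with naive random oversampling purely for illustration, and all data below are synthetic:

```python
# Stratified 70/30 split preserving the outcome rate, then balancing the
# training set (naive oversampling as a stand-in for ROSE's synthetic sampling).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4778, 15))
y = (rng.random(4778) < 0.22).astype(int)  # ~22% minority ("Death") class

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.30,
                                      stratify=y, random_state=0)

# Oversample the minority class in the training set to a 1:1 ratio
minority = np.where(ytr == 1)[0]
extra = rng.choice(minority, size=(ytr == 0).sum() - minority.size,
                   replace=True)
Xbal = np.vstack([Xtr, Xtr[extra]])
ybal = np.concatenate([ytr, ytr[extra]])
print(round(ybal.mean(), 2))  # 0.5 — balanced classes
```

Balancing only the training set while leaving the test set at its natural prevalence, as done in the study, keeps the evaluation representative of deployment conditions.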
3.2 Feature selection and correlation analysis
VIF analysis removed highly collinear variables, reducing the dataset to 109 predictors. ANOVA eliminated 39 features with limited predictive value. SBE and Lasso further refined the selection to 15 key mortality predictors: age, neutrophil count (NEUT), lactate dehydrogenase (LDH), ferritin, phosphorus (P), ventilator oxygen saturation (O₂sat.Ventilator), total iron-binding capacity (TIBC), fasting blood sugar (FBS), procalcitonin, serum sodium (Na), muscle pain, chronic kidney disease (CKD), taste/smell loss, D-dimer, and erythrocyte sedimentation rate (ESR).
This selection achieved an optimal balance between model complexity and predictive performance, as adding more features did not improve cross-validation accuracy. The final feature set captured multiple pathophysiological domains relevant to COVID-19 outcomes, including inflammation (NEUT, ESR, ferritin), oxygenation (O₂sat.Ventilator), metabolism (FBS, Na, P, TIBC), and organ dysfunction (procalcitonin, CKD). Symptom-based predictors such as muscle pain and loss of taste/smell further enhanced discrimination between severe and non-severe disease.
Correlation analysis (S1–S2 Tables in S1 File) revealed generally weak associations among predictors, indicating low multicollinearity. Moderate correlations were observed between ferritin and ESR (r = 0.26), ferritin and TIBC (r = –0.34), and TIBC and ESR (r = –0.35), reflecting physiologically coherent inflammation–iron metabolism dynamics. Overall low inter-feature correlations support model stability and generalizability.
All selected predictors are routinely collected in clinical settings, ensuring clinical interpretability and feasibility for integration into real-world decision-support systems (Fig 3).
Fig 3. Correlation between selected features and the outcome (Death/Alive) in the training dataset.
3.3 Model performance evaluation
Fig 4 and S3 Table in S1 File summarize model performance using 10-fold cross-validation repeated 10 times across ten base classifiers and six boosting/bagging algorithms applied to all 15 predictive features. Accuracy estimates with 95% confidence intervals indicated that AdaBoost achieved the highest performance, with a mean accuracy of 92.81% (SD = 0.013). Optimized hyperparameters for each model are presented in S1 Fig in S1 File.
Fig 4. Performance comparison of base, boosting, and bagging machine learning algorithms using repeated 10-fold cross-validation on the training data.
3.4 Diversity-guided sub-model construction
The first sub-model set was generated using a traditional stacking approach, producing 16 candidate models with the caretList() function. Correlation analysis from repeated 10-fold cross-validation revealed strong dependencies among several learners. The Generalized Linear Model (GLM) demonstrated high correlations with LDA (0.940), Lasso (0.984), Ridge (0.966), and Elastic Net (0.966); similarly, C5.0 and Random Forest (RF) were highly correlated (0.806). AdaBoost also showed high correlations with NN (0.798), GBM (0.848), and XGBoost (0.950). To reduce redundancy, the less accurate model from each correlated pair was removed, retaining four classifiers—Ridge, KNN, CART, and AdaBoost—for the first sub-model set.
To improve model complementarity, additional sub-model sets were constructed using diversity metrics.
Pairwise analyses indicated that GLM, LDA, and other linear models exhibited high agreement, whereas GBM, NN, and RF displayed greater disagreement, suggesting complementary error patterns (Fig 5).
Fig 5. Disagreement metrics among classifier predictions on the test dataset.
Non-pairwise metrics further quantified ensemble heterogeneity. Entropy values ranged from 0.64 to 0.90, with the NN–GBM combination showing the greatest diversity. Yule’s Q ranged from 0.39 to 0.47, while Kohavi–Wolpert values ranged from 0.099 to 0.195. Negative Kappa values in some sets indicated substantial prediction disagreement, reinforcing diversity among classifiers (Fig 6, S4 Table in S1 File).
Fig 6. Inter-rater agreement among classifier predictions on the test dataset.
Ultimately, eight sub-model sets were selected based on these metrics (Table 3, Fig 7).
Table 3. Diversity metrics for the eight selected sub-model sets on the test dataset.
| Sub-Model Sets | Entropy measure | Disagreement measure (Q statistic) | Kohavi–Wolpert measure | Interrater agreement measure (Kappa) |
|---|---|---|---|---|
| NN, GBM | 0.903 | 0.451 | 0.113 | −0.028 |
| NB, GBM | 0.892 | 0.446 | 0.111 | −0.032 |
| SVM, GBM | 0.881 | 0.440 | 0.110 | −0.082 |
| KNN, AdaBoost | 0.822 | 0.411 | 0.103 | −0.043 |
| RF, XGB | 0.799 | 0.399 | 0.099 | 0.140 |
| NB, CART, AdaBoost, GBM | 0.640 | 0.395 | 0.148 | 0.100 |
| C5.0, NB, GBM | 0.634 | 0.422 | 0.141 | 0.049 |
| NN, RF, CART, GBM, XGB, Treebag | 0.722 | 0.469 | 0.195 | −0.029 |
Fig 7. Comparison of diversity metrics across eight selected sub-model sets on the test dataset.
This diversity-driven selection ensured inclusion of models differing in both architecture and error behavior, enhancing ensemble robustness and reducing correlated prediction errors.
3.5 Stacking model evaluation and statistical comparison
Stacking was performed using five meta-learners: Generalized Linear Model (GLM), Linear Discriminant Analysis (LDA), Neural Network (NN), Gradient Boosting Machine (GBM), and Random Forest (RF). Table 4 summarizes accuracies across stacking configurations on the independent test dataset.
Table 4. Accuracy of stacking sub-model sets using five different meta-learners on the test dataset.
| Stacked Sub-Model Sets | LDA Meta-Learner (95% CI) | NN Meta-Learner (95% CI) | GBM Meta-Learner (95% CI) | RF Meta-Learner (95% CI) | GLM Meta-Learner (95% CI) | Best Classifier from the Ensemble (95% CI) |
|---|---|---|---|---|---|---|
| Stacking NN, GBM | 0.826 (0.806, 0.845) | 0.823 (0.803, 0.843) | 0.826 (0.806, 0.845) | 0.809 (0.788, 0.829) | 0.803 (0.782, 0.823) | GBM: 0.825 (0.804, 0.844) |
| Stacking NB, GBM | 0.830 (0.809, 0.849) | 0.832 (0.812, 0.851) | 0.833 (0.813, 0.852) | 0.814 (0.793, 0.834) | 0.836 (0.816, 0.855) | GBM: 0.825 (0.804, 0.844) |
| Stacking SVM, GBM | 0.865 (0.846, 0.883) | 0.877 (0.859, 0.894) | 0.883 (0.865, 0.899) | 0.765 (0.742, 0.786) | 0.865 (0.846, 0.883) | SVM: 0.833 (0.813, 0.852) |
| Stacking KNN, AdaBoost | 0.848 (0.828, 0.866) | 0.840 (0.820, 0.859) | 0.876 (0.858, 0.893) | 0.852 (0.833, 0.870) | 0.848 (0.828, 0.866) | AdaBoost: 0.902 (0.885, 0.916) |
| Stacking RF, XGB | 0.895 (0.878, 0.910) | 0.914 (0.898, 0.928) | 0.652 (0.626, 0.676) | 0.878 (0.860, 0.894) | 0.897 (0.880, 0.913) | RF: 0.892 (0.875, 0.907) |
| Stacking RF, CART, NN, GBM, XGB, Treebag | 0.898 (0.881, 0.913) | 0.904 (0.898, 0.918) | 0.903 (0.886, 0.918) | 0.909 (0.892, 0.923) | 0.895 (0.878, 0.912) | RF: 0.892 (0.875, 0.907) |
| Stacking NB, CART, GBM, AdaBoost | 0.836 (0.816, 0.855) | 0.844 (0.823, 0.863) | 0.886 (0.868, 0.901) | 0.883 (0.866, 0.899) | 0.838 (0.818, 0.857) | AdaBoost: 0.902 (0.885, 0.916) |
| Stacking NB, C5.0, GBM | 0.886 (0.868, 0.901) | 0.908 (0.891, 0.922) | 0.911 (0.895, 0.926) | 0.897 (0.880, 0.912) | 0.888 (0.871, 0.904) | C5.0: 0.890 (0.872, 0.905) |
| Traditional stacking Ridge, KNN, CART, AdaBoost | 0.835 (0.815, 0.854) | 0.838 (0.818, 0.857) | 0.888 (0.870, 0.903) | 0.879 (0.860, 0.895) | 0.837 (0.817, 0.856) | AdaBoost: 0.902 (0.885, 0.916) |
Not all stacking configurations improved upon the strongest base classifier. Ensembles composed of highly correlated models (e.g., Ridge–KNN–CART–AdaBoost) yielded limited gains, indicating performance saturation. In contrast, heterogeneous combinations—such as NB + GBM, RF + XGB, and NB + C5.0 + GBM—achieved significant improvements (accuracy up to 0.914). The NN meta-learner consistently outperformed other meta-learners by capturing nonlinear relationships among base-model outputs.
Statistical analyses (Table 5) showed that AdaBoost remained superior to some stacking configurations, supported by significant Wilcoxon results favoring the single model.
Table 5. Results of significance tests comparing the best base classifier with the stacking model in each sub-model set.
| Compared Models | Wilcoxon Test | McNemar's Test (df = 1) | ROC Test (boot.n = 2000, boot.stratified = 1) |
|---|---|---|---|
| GBM vs stacking NN, GBM using the GBM meta-learner | V = 1982, Holm-adjusted p = 0.269; effect size r = 0.107 (small) | χ² = 0.304, p = 0.5808 | D = −1.043, p = 0.297 |
| GBM vs stacking NB, GBM using the GLM meta-learner | V = 77, Holm-adjusted p = 4.36e-16; effect size r = 0.828 (large) | χ² = 3.809, p = 0.05096 | D = 0.708, p = 0.478 |
| SVM vs stacking SVM, GBM using the GBM meta-learner | V = 2087, Holm-adjusted p = 0.133; effect size r = 0.151 (small) | χ² = 58.21, p = 2.361e-14 | D = −0.816, p = 0.414 |
| AdaBoost vs stacking KNN, AdaBoost using the GBM meta-learner | V = 5049, Holm-adjusted p < 2.2e-16; effect size r = 0.868 (large) | χ² = 3.822, p = 0.0505 | D = −7.768, p = 7.952e-15 |
| RF vs stacking RF, XGB using the Neural Network meta-learner | V = 5019, Holm-adjusted p < 2.2e-16; effect size r = 0.858 (large) | χ² = 10.75, p = 0.0010 | D = −0.446, p = 0.6552 |
| RF vs stacking RF, CART, NN, GBM, XGB, Treebag using the Random Forest meta-learner | V = 515, Holm-adjusted p = 4.87e-12; effect size r = 0.691 (large) | χ² = 12.41, p = 0.0004 | D = 0.384, p = 0.7007 |
| AdaBoost vs stacking NB, CART, GBM, AdaBoost using the GBM meta-learner | V = 5050, Holm-adjusted p < 2.2e-16; effect size r = 0.868 (large) | χ² = 3.431, p = 0.063 | D = −8.167, p = 3.161e-16 |
| C5.0 vs stacking NB, C5.0, GBM using the GBM meta-learner | V = 4922, Holm-adjusted p < 2.2e-16; effect size r = 0.824 (large) | χ² = 21.00, p = 4.581e-06 | D = −0.371, p = 0.71 |
| AdaBoost vs traditional stacking Ridge, KNN, CART, AdaBoost using the GBM meta-learner | V = 4533, Holm-adjusted p < 2.2e-16; effect size r = 0.690 (large) | χ² = 0.004, p = 0.9447 | D = −8.9812, p < 2.2e-16 |
Wilcoxon signed-rank, McNemar’s, and ROC-based tests compared each stacking model with its best-performing base learner, applying Holm-adjusted p-values to control family-wise error and reporting effect sizes (r). Several comparisons (e.g., GBM vs. the NN–GBM stack) showed small effects (r < 0.2, p > 0.05), indicating negligible practical gain. In contrast, combinations such as NB + GBM with the GLM meta-learner and RF + XGB with the NN meta-learner showed large, significant differences (r > 0.8) corresponding to meaningful accuracy gains, whereas the KNN + AdaBoost stack differed significantly from AdaBoost without surpassing it (Table 4).
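The paired tests above can be sketched as follows (a sketch only: the fold accuracies and the discordant-pair counts `b` and `c` below are made up for illustration, not taken from the study; the DeLong/bootstrap ROC test is omitted):

```python
# Wilcoxon signed-rank on paired per-fold accuracies, plus a hand-rolled
# continuity-corrected McNemar test on discordant prediction pairs.
import numpy as np
from scipy.stats import wilcoxon, chi2

# Hypothetical accuracies from 100 resampling folds (10x repeated 10-fold CV)
rng = np.random.default_rng(0)
acc_single = rng.normal(0.89, 0.01, size=100)
acc_stack = acc_single + rng.normal(0.02, 0.01, size=100)
stat, p_wilcoxon = wilcoxon(acc_single, acc_stack)

def mcnemar(b, c):
    """McNemar's chi-squared with continuity correction.
    b = cases only model 1 classified correctly; c = cases only model 2 did."""
    chi_sq = (abs(b - c) - 1) ** 2 / (b + c)
    return chi_sq, chi2.sf(chi_sq, df=1)

chi_sq, p_mcnemar = mcnemar(b=10, c=15)   # hypothetical discordant counts
```

Wilcoxon asks whether paired fold-level accuracies differ systematically; McNemar asks whether the two models' per-case errors are asymmetrically distributed, which is why both can disagree on the same comparison.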
The best-performing configuration—stacking RF and XGB with an NN meta-learner—achieved:
Accuracy: 0.914 (95% CI: 0.898–0.928)
AUC: 0.955
F1 score: 0.801
MCC: 0.746
This model outperformed both individual classifiers and other stacking variants (Tables 6–7, Fig 8). Wilcoxon tests showed large effect sizes (r > 0.5), confirming substantial performance improvements, whereas McNemar’s and DeLong’s tests indicated that some pairwise differences were not significant after correction. ROC curves (Fig 9) demonstrated strong sensitivity and specificity, and calibration plots (Fig 10) showed excellent alignment between predicted and observed outcomes.
Table 6. Performance evaluation of selected stacking models that outperform the most accurate individual algorithm in their respective combinations.
| Stacking Model | TN | FP | TP | FN | Accuracy (95% CI) | Kappa | Sensitivity | Specificity | Precision | F1 | MCC | ROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stacking NB, GBM using the GLM meta-learner | 952 | 166 | 250 | 65 | 0.838 (0.818, 0.856) | 0.577 | 0.793 | 0.851 | 0.599 | 0.683 | 0.587 | 0.885 |
| Stacking SVM, GBM using the GBM meta-learner | 1027 | 91 | 222 | 93 | 0.872 (0.853, 0.888) | 0.625 | 0.705 | 0.919 | 0.709 | 0.707 | 0.624 | 0.893 |
| Stacking RF, XGB using the Neural Network meta-learner | 1063 | 55 | 247 | 68 | 0.914 (0.898, 0.928) | 0.746 | 0.784 | 0.951 | 0.818 | 0.801 | 0.746 | 0.955 |
| Stacking RF, CART, NN, GBM, XGB, Treebag using the Random Forest meta-learner | 1061 | 57 | 241 | 74 | 0.909 (0.892, 0.923) | 0.728 | 0.765 | 0.949 | 0.809 | 0.786 | 0.728 | 0.949 |
| Stacking NB, C5.0, GBM using the GBM meta-learner | 1068 | 50 | 238 | 77 | 0.911 (0.895, 0.926) | 0.733 | 0.756 | 0.955 | 0.826 | 0.789 | 0.734 | 0.944 |
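As a consistency check, the summary metrics in Table 6 follow directly from the reported confusion-matrix counts; for the RF–XGB stack with the NN meta-learner (TN = 1063, FP = 55, TP = 247, FN = 68):

```python
# Re-deriving Table 6 metrics from the reported confusion matrix of the
# best stacking model (values taken from the table above).
import math

TN, FP, TP, FN = 1063, 55, 247, 68
n = TN + FP + TP + FN

accuracy = (TP + TN) / n                          # ~0.914
sensitivity = TP / (TP + FN)                      # recall for "Death", ~0.784
specificity = TN / (TN + FP)                      # ~0.951
precision = TP / (TP + FP)                        # ~0.818
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ~0.801
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))             # ~0.746
```

Each derived value matches the table to three decimal places, confirming internal consistency of the reported results.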
Table 7. Statistical comparison between stacked random forest (RF) and XGBoost (XGB) utilizing a neural network (NN) meta-learner and other stacking models.
| Compared Models | Wilcoxon Test | McNemar's Test (df = 1) | ROC Test (DeLong method) |
|---|---|---|---|
| stack.GLM.GBM.NB vs stack.NN.RF.XGB | V = 175, Holm-adjusted p < 2.2e-16; effect size r = 0.808 (large) | χ² = 58.27, p = 2.276e-14 | Z = −6.354, p = 2.099e-10 |
| stack.GBM.SVM.GBM vs stack.NN.RF.XGB | V = 969, Holm-adjusted p < 2.2e-16; effect size r = 0.535 (large) | χ² = 0.523, p = 0.469 | Z = −6.036, p = 1.578e-09 |
| stack.GBM.KNN.AdaBoost vs stack.NN.RF.XGB | V = 383, Holm-adjusted p = 1.80e-13; effect size r = 0.736 (large) | χ² = 2.472, p = 0.115 | Z = −4.485, p = 7.286e-06 |
| stack.RF.RF.CART.NN.GBM.XGB.Treebag vs stack.NN.RF.XGB | V = 5050, Holm-adjusted p < 2.2e-16; effect size r = 0.868 (large) | χ² = 0.18, p = 0.671 | Z = −2.270, p = 0.0231 |
| stack.GBM.NB.C5.0.GBM vs stack.NN.RF.XGB | V = 2728, Holm-adjusted p = 0.378; effect size r = 0.089 (small) | χ² = 2.913, p = 0.08783 | Z = −3.066, p = 0.002 |
Fig 8. Performance metrics of the selected stacked models on the training dataset.
Fig 9. ROC curves of the best-performing stacked models on the test dataset.
Stacking NB, GBM using the GLM meta-learner (stack.GLM.GBM.NB). Stacking SVM, GBM using the GBM meta-learner (stack.GBM.SVM.GBM). Stacking RF, CART, NN, GBM, XGB, Treebag using the Random Forest meta-learner (stack.RF.RF.CART.NN.GBM.XGB.Treebag). Stacking RF, XGB using the Neural Network meta-learner (stack.NN.RF.XGB). Stacking NB, C5.0, GBM using the GBM meta-learner (stack.GBM.NB.C5.0.GBM).
Fig 10. Calibration plot of the stacked Random Forest (RF) and XGBoost (XGB) model using a Neural Network (NN) meta-learner under repeated 10-fold cross-validation.
3.6 Computational complexity and training time
Computational complexity was assessed on a standard workstation (Intel Core i5 M520 @ 2.40 GHz, 8 GB RAM, Windows 10, 64-bit). Training times varied across models: XGB required ~98 s, RF ~306 s, and AdaBoost ~869 s. The stacked RF–XGB–NN model required ~450 s, faster than AdaBoost despite its greater complexity (Table 8).
Table 8. Training and prediction times for single models and stacking ensembles.
| Models | Training Time (sec) | Prediction Time per Patient (sec) |
|---|---|---|
| AdaBoost | 868.6 | 0.70 |
| Random Forest (RF) | 306.5 | 0.14 |
| XGBoost (XGB) | 98.0 | 0.01 |
| Stacked RF + XGB + NN | 450.2 | 0.17 |
Inference times were uniformly low, ranging from 0.01 s per patient (XGB) to 0.70 s (AdaBoost). The stacked model achieved 0.17 s per prediction, supporting deployment in near real-time clinical settings through electronic decision-support tools.
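Wall-clock training and per-case inference times like those in Table 8 can be measured with a simple timing harness (a sketch; the model, data size, and hardware here are illustrative, not those of the study):

```python
# Measuring training time and single-case inference latency with a
# monotonic wall clock.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)

t0 = time.perf_counter()
model.fit(X, y)
train_time = time.perf_counter() - t0          # seconds to train

t0 = time.perf_counter()
model.predict(X[:1])                           # one "patient" at a time
per_case_time = time.perf_counter() - t0       # seconds per prediction
```

For latency claims like "<0.2 s per patient", averaging many single-case predictions gives a more stable estimate than a single call.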
3.7 Model Interpretation
Feature importance analysis of the stacked RF–XGB–NN ensemble identified age as the most influential predictor of mortality, demonstrating strong predictive stability (variability ±0.06, permutation error 0.237) (Fig 11, S5 Table in S1 File).
Fig 11. Most influential predictors contributing to “Death” outcomes in the stacked RF–XGB model with an NN meta-learner.
Neutrophil count (NEUT), phosphorus levels, and oxygen saturation while on a ventilator (O2sat.Ventilator) followed as critical predictors, reflecting infection severity, metabolic status, and respiratory function, respectively. Additional features such as lactate and ferritin contributed meaningfully, consistent with their roles in sepsis, inflammation, and critical illness.
Model-agnostic SHAP analysis, applied to the final stacked model, decomposed individual predictions into feature-level contributions. SHAP analysis indicated that for the “Death” class, advancing age (φ = –0.18), reduced O₂sat.Ventilator (φ = –0.02), and elevated NEUT (φ = –0.08) were dominant mortality drivers (Fig 12, S6 Table in S1 File). Reduced sodium (NA, φ = –0.08) and phosphorus (P, φ = –0.04) also contributed to poor outcomes, whereas higher lactate levels had a modest positive contribution (φ = 0.02). Features such as muscle pain, taste/smell disturbances, CKD, FBS, ESR, ferritin, and D-dimer had minimal SHAP contributions.
Fig 12. SHAP-based interpretation of the stacked RF–XGB model using a Neural Network meta-learner.
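The SHAP attributions above rest on the Shapley decomposition, which for a small feature count can be computed exactly by enumerating feature subsets. This is a conceptual sketch only: the model `f`, inputs, and baseline below are hypothetical, and SHAP libraries approximate this sum efficiently rather than enumerating it.

```python
# Exact Shapley values by subset enumeration; "absent" features are held
# at a baseline value, as in interventional SHAP.
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """phi_i for model f at point x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley weight |S|! (n-|S|-1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy linear model: the Shapley value of feature i is w_i * (x_i - baseline_i)
f = lambda v: 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]
phi = shapley_values(f, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
# phi == [2.0, -2.0, 1.5]
```

The attributions sum to f(x) − f(baseline), which is the additivity property that makes SHAP summaries like Fig 12 interpretable as per-feature shares of a prediction.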
Interaction analysis (Fig 13, S7 Table in S1 File) showed that age exhibited strong interactions with O₂sat.Ventilator, NEUT, phosphorus, and ferritin, highlighting their synergistic effects on mortality risk. Together, these findings demonstrate that the stacking ensemble not only improves predictive accuracy but also maintains meaningful clinical interpretability.
Fig 13. Interaction effects between age and key clinical predictors influencing mortality in the stacked RF–XGB model with an NN meta-learner.
4 Discussion
This study introduced a hybrid feature-selection and diversity-guided stacking framework designed to improve predictive accuracy, interpretability, and computational efficiency in high-dimensional clinical data. Although demonstrated on a large cohort of 4,778 COVID-19 patients, the proposed approach is broadly applicable to biomedical, environmental, and engineering domains that require scalable and transparent ensemble learning.
Our framework integrates hybrid feature selection—combining VIF, ANOVA, SBE, and Lasso—with a diversity-based stacking strategy that systematically quantifies inter-model complementarity using both pairwise (e.g., Yule’s Q, Disagreement, Kappa) and non-pairwise (e.g., Entropy, Kohavi–Wolpert) diversity measures. This approach directly addresses major limitations of prior COVID-19 prognostic models, including small sample sizes, poor calibration, and redundant base learners [29–31]. It also mitigates the trade-offs observed in traditional models, wherein improved predictive performance often comes at the cost of computational demand or reduced interpretability—limitations frequently encountered in deep neural networks and boosting algorithms [34–37].
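One step of this pipeline, VIF screening, can be sketched briefly (a sketch with synthetic data, not the study's code; VIF = 1/(1 − R²), where R² comes from regressing each feature on the remaining ones, and the cutoff of 10 is a common rule of thumb, not necessarily the study's threshold):

```python
# Variance Inflation Factor screening for multicollinearity.
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_scores(X):
    """VIF per column of X (n_samples x n_features)."""
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        # R^2 of regressing feature i on all remaining features
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)     # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

vifs = vif_scores(X)
keep = vifs < 10                         # drop highly collinear features
```

The collinear pair (x1, x3) inflates each other's VIF far above the cutoff, while the independent x2 stays near 1; the remaining pipeline stages (ANOVA, SBE, Lasso) then prune the surviving features further.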
A key advantage of this study is the large sample size, which is substantially greater than that of many prior investigations. This enhances both the stability and generalizability of our findings. A multicenter study in Tehran reported a case fatality rate (CFR) of 10.05% across 19 hospitals [73]. Consistent with guidelines recommending at least 20 events per predictor variable for outcomes with low prevalence [74], we determined that a minimum of 3,000 patients would be required to reliably identify significant mortality predictors. Our dataset exceeded this threshold, reducing the risk of overfitting and supporting the robustness of the derived model.
Robust scaling for standardization and ROSE-based resampling for class balance were essential to model reliability; class imbalance is a common barrier in predictive modeling [75], particularly in the context of COVID-19 [76]. The original training dataset exhibited pronounced class imbalance, with “Death” cases constituting only 22% of the cohort. This imbalance skewed model learning toward the majority “Alive” class, inflating accuracy but suppressing sensitivity, a critical limitation in mortality prediction. ROSE resampling created a fully balanced 1:1 dataset, substantially improving sensitivity and enabling models to better recognize minority-class (Death) cases. As expected, this rebalancing slightly reduced specificity due to the presence of synthetic samples. Although ROSE improves minority-class recognition and stabilizes cross-validation metrics, synthetic oversampling may introduce mild calibration shifts or artificial patterns, as noted in previous studies [77,78]. To mitigate this, final model performance was strictly evaluated on the original unbalanced test set, ensuring that reported results reflect real clinical distributions rather than synthetic balance.
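The smoothed-bootstrap idea behind ROSE can be illustrated in a few lines (a rough sketch: ROSE chooses its kernel bandwidth from the data, whereas the `shrink` factor and synthetic data here are arbitrary assumptions):

```python
# ROSE-like oversampling: bootstrap minority-class rows, then jitter them
# with Gaussian noise so synthetic cases lie near observed ones rather
# than duplicating them exactly.
import numpy as np

def rose_like_oversample(X_min, n_new, shrink=0.25, seed=0):
    """Draw n_new synthetic minority samples via a smoothed bootstrap."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_min), size=n_new)   # bootstrap row indices
    bandwidth = shrink * X_min.std(axis=0)          # crude per-feature bandwidth
    noise = rng.normal(0.0, 1.0, size=(n_new, X_min.shape[1])) * bandwidth
    return X_min[idx] + noise

rng = np.random.default_rng(1)
X_death = rng.normal(loc=5.0, size=(100, 3))        # toy minority class
X_synth = rose_like_oversample(X_death, n_new=250)  # balance toward 1:1
```

Because the noise is centered on observed cases, the synthetic sample preserves the minority-class distribution's location while smoothing it, which is also why, as noted above, balanced training data must still be evaluated on the original unbalanced test set.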
Through iterative refinement, the dataset was reduced from 115 to 15 clinically interpretable features—such as age, neutrophil count, ferritin, and O₂ saturation—all of which are routinely available in electronic health records. These variables span immunologic, metabolic, and respiratory domains, collectively capturing key biological mechanisms underlying severe COVID-19 outcomes.
Traditional stacking methods that use highly correlated base learners provided only limited performance gains, confirming that redundancy constrains the potential benefits of ensemble modeling. In contrast, the proposed diversity-guided stacking approach achieved superior accuracy and generalization by leveraging heterogeneous classifiers with complementary error patterns.
For instance, Ribeiro et al. [79] demonstrated that stacking models can enhance predictive performance in COVID-19 outcomes. Their stack-ensemble model, which incorporated support vector regression, effectively forecasted mortality among 14,267 COVID-19 patients in Brazil. Similarly, our findings reinforce the notion that combining strong, heterogeneous classifiers through stacking is more effective than relying on a single best-performing classifier [53].
This observation aligns with emerging literature on ensemble learning. Hussain et al. [80] highlighted the superiority of hybrid classifier systems in improving prediction accuracy, and further reported an impressive AUC of 96.0% using a deep stacking neural network to predict mortality risk. Together, these studies support the effectiveness of diversity-aware ensemble approaches in high-stakes biomedical prediction.
Our stacking model significantly outperformed established machine learning approaches reported in the literature. For example, Yakovyna et al. [28] applied a combination of supervised and unsupervised learning techniques but did not achieve the level of discrimination observed in our stacking model. Rahmatinejad et al. [29] reported high Brier scores for Random Forest and improved precision and sensitivity for XGBoost; however, our RF–XGB–NN stacking configuration provided substantially higher accuracy and AUC, supported by rigorous statistical analyses, including Wilcoxon, McNemar, and DeLong tests. Furthermore, compared to a study using over 500 EHR variables to train an RF model for sepsis mortality prediction [81], our stacking method achieved notably better calibration and discrimination while maintaining computational efficiency.
Our findings also outperform those of de Paiva et al. [82], who analyzed 10,897 COVID-19 patients using various machine learning models—including FNet transformers, convolutional neural networks, support vector machines, LightGBM, and traditional statistical approaches such as LASSO and Generalized Additive Models (GAM). Their best models achieved an AUROC of 0.826 and a MacroF1 score of 65.4%. In contrast, our stacking framework delivered an F1 score of 80.1% and AUC of 0.955, demonstrating substantially improved predictive accuracy and reliability.
The Neural Network meta-learner emerged as the most effective combiner of base-model outputs. Neural meta-learning has previously been shown to outperform linear or tree-based approaches by capturing higher-order, nonlinear relationships among model outputs, particularly in biomedical prediction tasks [83–85]. Our cross-validation results confirmed this, with the NN meta-learner providing superior discrimination and calibration across diverse sub-model sets.
Feature importance analysis identified age as the strongest predictor of mortality, followed by neutrophil count (NEUT), phosphorus levels, oxygen saturation (O₂sat.Ventilator), and lactate. Elevated NEUT counts have been widely associated with severe disease and cytokine storm responses, while reduced oxygen saturation is a direct marker of respiratory compromise [86,87]. SHAP analysis further showed that advancing age and high NEUT levels significantly increased mortality risk. These results reinforce established clinical findings that older age and immune dysregulation critically influence COVID-19 severity [88].
Beyond performance metrics, the proposed framework emphasizes computational efficiency and real-world applicability. Although training complexity increased relative to single models, the computation remained manageable and feasible for operational clinical environments. Inference latency (<0.2 seconds per prediction) is sufficiently low for real-time or near-real-time decision support, including bedside applications and automated triage systems.
Overall, this study presents a generalizable, diversity-aware ensemble framework that balances accuracy, interpretability, and computational efficiency. While validated for COVID-19 mortality prediction, the approach is adaptable to broader biomedical domains where heterogeneous data, transparency, and performance stability are critical.
5 Limitations and biases
This study has several limitations that should be considered when interpreting the findings. First, the retrospective EHR-based design may introduce selection and information biases, as data collection depended on available hospital records, which may not fully capture all relevant clinical variables [89]. Retrospective EHR studies are particularly vulnerable to incomplete or non-standardized data, including missing values, heterogeneous logging practices, and inconsistent variable definitions across hospitals [90]. Moreover, implicit clinician bias, referral patterns, and disparities in diagnosis or treatment may have influenced the data used for training, thereby propagating systemic inequities into predictive algorithms [91].
Second, the dataset was limited to three Iranian hospitals, which constrains the generalizability and external validity of the model. Differences in genetic, sociodemographic, cultural, and healthcare system characteristics across populations may influence both feature distributions and outcome risks, and prior reviews have shown that COVID-19 prediction models often perform poorly outside their development setting [92]. For example, variations in comorbidity prevalence, access to intensive care resources, and laboratory reference ranges may affect model performance in non-Iranian populations. Without independent external validation, our results should be interpreted cautiously when applied elsewhere. Recent reviews further emphasize that prediction models can never be considered fully “validated,” as transportability depends on population, setting, and temporal context [78,93]. Accordingly, validation in multinational, diverse cohorts is essential before clinical translation.
Third, although we applied resampling (ROSE) to mitigate class imbalance, oversampling approaches may embed artificial patterns that distort calibration or inflate predictive accuracy. This limitation has been repeatedly recognized in both COVID-19 studies and broader clinical prognostic modeling research [77,78].
Fourth, although ensemble learning (stacking) enhanced predictive performance, it introduced computational complexity and challenges for clinical deployment. Even with SHAP-based interpretability, ensemble models remain partially opaque, and post-hoc explanations represent approximations rather than causal insights. This raises concerns about clinical trust and automation bias, particularly if model outputs are adopted uncritically [93].
Fifth, our models did not incorporate unmeasured contextual confounders such as evolving treatment regimens, viral variants, or social determinants of health. As shown in previous literature, omission of such factors can bias effect estimates and limit real-world applicability [91]. Relatedly, model performance drift is a potential risk, as COVID-19 epidemiology, therapeutic strategies, and patient characteristics have evolved over time, necessitating ongoing monitoring and recalibration [78].
Finally, several design-related limitations warrant consideration. Statistical literature highlights that non-random sampling and limited site representativeness reduce the generalizability of predictive models, particularly when outcome heterogeneity is present [92]. Furthermore, most machine learning studies—including our own—focus primarily on discrimination metrics (e.g., AUC), with less emphasis on calibration and fairness assessments, which limits their clinical interpretability and adoption [93].
Despite these limitations, we employed rigorous validation strategies, including repeated cross-validation, calibration evaluation, and effect size reporting, consistent with TRIPOD recommendations [94]. These measures enhance robustness and transparency; however, independent, prospective, multi-center validation in larger and more diverse populations remains essential before clinical implementation.
6 Conclusion
In conclusion, this study demonstrates that a diversity-guided stacking ensemble—integrating Random Forest, XGBoost, and a Neural Network meta-learner—can achieve high predictive accuracy and interpretability for COVID-19 mortality risk. By combining a hybrid feature selection pipeline with heterogeneous base learners, the framework effectively reduced redundancy, captured nonlinear interactions, and maintained computational efficiency suitable for near real-time deployment.
Key predictors such as age, neutrophil count, phosphorus, and oxygen saturation consistently aligned with known clinical mechanisms of severe COVID-19, supporting both the statistical and biological validity of the model. SHAP-based interpretation further illustrated how interactions among these variables shape mortality risk, helping to bridge predictive performance with clinical insight.
Nevertheless, this work has several limitations, including its retrospective design, reliance on data from a single regional health system, and the absence of external validation, all of which may restrict generalizability. Future research should extend this framework to multi-center or multi-disease cohorts, integrate multimodal data sources (e.g., imaging, genomics), and evaluate real-time performance in prospective clinical environments.
Ultimately, the proposed stacking strategy represents a scalable and interpretable modeling paradigm that can be readily adapted to a wide range of clinical prediction tasks beyond COVID-19, advancing the application of ensemble learning for precision medicine and healthcare decision support.
Supporting information
S1 File. This file contains supplementary tables, figures, and additional analyses, including correlation matrices, model tuning parameters, performance summaries, feature importance results, interaction analyses, and SHAP outputs.
(DOCX)
Acknowledgments
This article was part of the Ph.D. dissertation in epidemiology at Shahid Beheshti University of Medical Sciences (SBMU).
Data Availability
The dataset analyzed in this study contains sensitive patient information and cannot be shared publicly due to confidentiality restrictions. Access to de-identified data can be requested through the Shahid Beheshti University of Medical Sciences Ethics Committee (IR.SBMU.RIGLD.REC.1401.032) at urm@sbmu.ac.ir, which will review and, if appropriate, authorize data release. The corresponding author can provide further details on the application process. All analysis codes developed in this study are available from the corresponding author upon request.
Funding Statement
The author(s) received no specific funding for this work.
References
- 1. Das K, Behera RN. A survey on machine learning: concept, algorithms and applications. International Journal of Innovative Research in Computer and Communication Engineering. 2017;5(2):1301–9.
- 2. Windeatt T, Ghaderi R. Binary labelling and decision-level fusion. Information Fusion. 2001;2(2):103–12. doi: 10.1016/s1566-2535(01)00029-x
- 3. Ghasemieh A, Lloyed A, Bahrami P, Vajar P, Kashef R. A novel machine learning model with Stacking Ensemble Learner for predicting emergency readmission of heart-disease patients. Decision Analytics Journal. 2023;7:100242. doi: 10.1016/j.dajour.2023.100242
- 4. Berliana AU, Bustamam A. Implementation of Stacking Ensemble Learning for Classification of COVID-19 using Image Dataset CT Scan and Lung X-Ray. In: 2020 3rd International Conference on Information and Communications Technology (ICOIACT), 2020. 148–52. doi: 10.1109/icoiact50329.2020.9332112
- 5. Khyani D, Jakkula S, Gowda S, Anusha KJ, Swetha KR. An interpretation of stacking and blending approach in machine learning. Int Res J Eng Technol. 2021;8(07).
- 6. Mienye ID, Sun Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access. 2022;10:99129–49. doi: 10.1109/access.2022.3207287
- 7. Graczyk M, Lasota T, Trawiński B, Trawiński K. Comparison of bagging, boosting and stacking ensembles applied to real estate appraisal. In: Intelligent Information and Database Systems: Second International Conference, ACIIDS, Hue City, Vietnam, March 24-26, 2010 Proceedings, Part II, 2010.
- 8. Feng Y, Li B, Fu R, Hao Y, Wang T, Guo H, et al. A simplified coronary model for diagnosis of ischemia-causing coronary stenosis. Comput Methods Programs Biomed. 2023;242:107862. doi: 10.1016/j.cmpb.2023.107862
- 9. Wang X, Bakulski KM, Mukherjee B, Hu H, Park SK. Predicting cumulative lead (Pb) exposure using the Super Learner algorithm. Chemosphere. 2023;311(Pt 2):137125. doi: 10.1016/j.chemosphere.2022.137125
- 10. Xu W, He H, Guo Z, Li W. Evaluation of machine learning models on protein level inference from prioritized RNA features. Brief Bioinform. 2022;23(3):bbac091. doi: 10.1093/bib/bbac091
- 11. Reda R, Saffaj T, Bouzida I, Saidi O, Belgrir M, Lakssir B, et al. Optimized variable selection and machine learning models for olive oil quality assessment using portable near infrared spectroscopy. Spectrochim Acta A Mol Biomol Spectrosc. 2023;303:123213. doi: 10.1016/j.saa.2023.123213
- 12. Mohammed M, Mwambi H, Mboya IB, Elbashir MK, Omolo B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep. 2021;11(1):15626. doi: 10.1038/s41598-021-95128-x
- 13. Wang B, Liu J, Zhang X, Lin J, Li S, Wang Z, et al. A stacking ensemble framework integrating radiomics and deep learning for prognostic prediction in head and neck cancer. Radiat Oncol. 2025;20(1):127. doi: 10.1186/s13014-025-02695-8
- 14. Kwon H, Park J, Lee Y. Stacking Ensemble Technique for Classifying Breast Cancer. Healthc Inform Res. 2019;25(4):283–8. doi: 10.4258/hir.2019.25.4.283
- 15. Alhussen A, Anul Haq M, Ahmad Khan A, Mahendran RK, Kadry S. XAI-RACapsNet: Relevance aware capsule network-based breast cancer detection using mammography images via explainability O-net ROI segmentation. Expert Systems with Applications. 2025;261:125461. doi: 10.1016/j.eswa.2024.125461
- 16. Haq MA, Khan I, Ahmed A, Eldin SM, Alshehri ALI, Ghamry NA. DCNNBT: A novel deep convolution neural network-based brain tumor classification model. Fractals. 2023;31(06):2340102. doi: 10.1142/S0218348X23401023
- 17. Yousef R, Khan S, Gupta G, Siddiqui T, Albahlal BM, Alajlan SA, et al. U-Net-Based Models towards Optimal MR Brain Image Segmentation. Diagnostics (Basel). 2023;13(9):1624. doi: 10.3390/diagnostics13091624
- 18. Abualnaja SY, Morris JS, Rashid H, Cook WH, Helmy AE. Machine learning for predicting post-operative outcomes in meningiomas: a systematic review and meta-analysis. Acta Neurochir (Wien). 2024;166(1):505. doi: 10.1007/s00701-024-06344-z
- 19. Lei J, Zhai J, Zhang Y, Qi J, Sun C. Supervised Machine Learning Models for Predicting Sepsis-Associated Liver Injury in Patients With Sepsis: Development and Validation Study Based on a Multicenter Cohort Study. J Med Internet Res. 2025;27:e66733. doi: 10.2196/66733
- 20. Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Shankar SV, Coppi A, et al. Ensemble deep learning algorithm for structural heart disease screening using electrocardiographic images: PRESENT SHD. Journal of the American College of Cardiology. 2025;85(12):1302–13.
- 21. Tseng PY, Chen YT, Wang CH, Chiu KM, Peng YS, Hsu SP. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Critical Care. 2020;24(1):478.
- 22. Sawesi S, Jadhav A, Rashrash B. Machine Learning and Deep Learning Techniques for Prediction and Diagnosis of Leptospirosis: Systematic Literature Review. JMIR Med Inform. 2025;13:e67859. doi: 10.2196/67859
- 23. Chiasakul T, Lam BD, McNichol M, Robertson W, Rosovsky RP, Lake L, et al. Artificial intelligence in the prediction of venous thromboembolism: A systematic review and pooled analysis. Eur J Haematol. 2023;111(6):951–62. doi: 10.1111/ejh.14110
- 24. Cramer EY, Ray EL, Lopez VK, Bracher J, Brennen A, Castro Rivadeneira AJ, et al. Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proc Natl Acad Sci U S A. 2022;119(15):e2113561119. doi: 10.1073/pnas.2113561119
- 25. Dessie ZG, Zewotir T. Mortality-related risk factors of COVID-19: a systematic review and meta-analysis of 42 studies and 423,117 patients. BMC Infect Dis. 2021;21(1):855. doi: 10.1186/s12879-021-06536-3
- 26. Cui S, Wang Y, Wang D, Sai Q, Huang Z, Cheng TCE. A two-layer nested heterogeneous ensemble learning predictive method for COVID-19 mortality. Appl Soft Comput. 2021;113:107946. doi: 10.1016/j.asoc.2021.107946
- 27. Li J, Li X, Hutchinson J, Asad M, Liu Y, Wang Y, et al. An ensemble prediction model for COVID-19 mortality risk. Biol Methods Protoc. 2022;7(1):bpac029. doi: 10.1093/biomethods/bpac029
- 28. Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. doi: 10.1038/s41598-024-60637-y
- 28.Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. doi: 10.1038/s41598-024-60637-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Rahmatinejad Z, Dehghani T, Hoseini B, Rahmatinejad F, Lotfata A, Reihani H, et al. A comparative study of explainable ensemble learning and logistic regression for predicting in-hospital mortality in the emergency department. Sci Rep. 2024;14(1):3406. doi: 10.1038/s41598-024-54038-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020;369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Jamshidi MB, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, et al. Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment. IEEE Access. 2020;8:109581–95. doi: 10.1109/ACCESS.2020.3001973 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Sperrin M, Grant SW, Peek N. Prediction models for diagnosis and prognosis in Covid-19. BMJ. 2020. [DOI] [PubMed] [Google Scholar]
- 33.Banoei MM, Dinparastisaleh R, Zadeh AV, Mirsaeidi M. Machine-learning-based COVID-19 mortality prediction model and identification of patients at low and high risk of dying. Crit Care. 2021;25(1):328. doi: 10.1186/s13054-021-03749-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Saxena A, Nixon B, Boyd A, Evans J, Faraone SV. A systematic review of the application of graph neural networks to extract candidate genes and biological associations. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2025;:e33031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ji C, Yu N, Wang Y, Ni J, Zheng C. SGLMDA: A Subgraph Learning-Based Method for miRNA-Disease Association Prediction. IEEE/ACM Trans Comput Biol Bioinform. 2024;21(5):1191–201. doi: 10.1109/TCBB.2024.3373772 [DOI] [PubMed] [Google Scholar]
- 36.Wang J, Li J, Yue K, Wang L, Ma Y, Li Q. NMCMDA: neural multicategory MiRNA-disease association prediction. Brief Bioinform. 2021;22(5):bbab074. doi: 10.1093/bib/bbab074 [DOI] [PubMed] [Google Scholar]
- 37.Li J, Lin H, Wang Y, Li Z, Wu B. Prediction of potential small molecule-miRNA associations based on heterogeneous network representation learning. Front Genet. 2022;13:1079053. doi: 10.3389/fgene.2022.1079053 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Riley RD, Ensor J, Snell KIE, Archer L, Whittle R, Dhiman P, et al. Importance of sample size on the quality and utility of AI-based prediction models for healthcare. Lancet Digit Health. 2025;7(6):100857. doi: 10.1016/j.landig.2025.01.013 [DOI] [PubMed] [Google Scholar]
- 39.Yang Y, Khorshidi HA, Aickelin U. A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems. Front Digit Health. 2024;6:1430245. doi: 10.3389/fdgth.2024.1430245 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Salmi M, Atif D, Oliva D, Abraham A, Ventura S. Handling imbalanced medical datasets: review of a decade of research. Artif Intell Rev. 2024;57(10). doi: 10.1007/s10462-024-10884-2 [DOI] [Google Scholar]
- 41.Malhotra R, Khanna M. Particle swarm optimization-based ensemble learning for software change prediction. Information and Software Technology. 2018;102:65–84. doi: 10.1016/j.infsof.2018.05.007 [DOI] [Google Scholar]
- 42.Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. doi: 10.1038/s41598-024-60637-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Tian W, Jiang W, Yao J, Nicholson CJ, Li RH, Sigurslid HH, et al. Predictors of mortality in hospitalized COVID-19 patients: A systematic review and meta-analysis. J Med Virol. 2020;92(10):1875–83. doi: 10.1002/jmv.26050 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.de Paiva BBM, Pereira PD, de Andrade CMV, Gomes VMR, Souza-Silva MVR, Martins KPMP, et al. Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset. Sci Rep. 2023;13(1):3463. doi: 10.1038/s41598-023-28579-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Hatamabadi H, Sabaghian T, Sadeghi A, Heidari K, Safavi-Naini SAA, Looha MA. Epidemiology of COVID-19 in Tehran, Iran: A cohort study of clinical profile, risk factors, and outcomes. BioMed Research International. 2022;2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Klomp T. Iterative imputation in Python: a study on the performance of the package IterativeImputer. Utrecht University; 2022.
- 47. Barough SS, Safavi-Naini SAA, Siavoshi F, Tamimi A, Ilkhani S, Akbari S, et al. Generalizable machine learning approach for COVID-19 mortality risk prediction using on-admission clinical and laboratory features. Sci Rep. 2023;13(1):2399. doi: 10.1038/s41598-023-28943-z
- 48. Sharma V. A study on data scaling methods for machine learning. Int J Global Acad Sci Res. 2022;1(1). doi: 10.55938/ijgasr.v1i1.4
- 49. Mishra S, Pradhan RK. Analyzing the impact of feature correlation on classification accuracy of machine learning model. 2023.
- 50. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16–28. doi: 10.1016/j.compeleceng.2013.11.024
- 51. Alin A. Multicollinearity. Wiley Interdiscip Rev Comput Stat. 2010;2(3):370–4.
- 52. Daoud JI. Multicollinearity and regression analysis. J Phys Conf Ser. 2017.
- 53. Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. CRC Press; 2020.
- 54. Moorthy U, Gandhi UD. RETRACTED ARTICLE: a novel optimal feature selection technique for medical data classification using ANOVA based whale optimization. J Ambient Intell Human Comput. 2020;12(3):3527–38. doi: 10.1007/s12652-020-02592-w
- 55. Ladha L, Deepa T. Feature selection methods and algorithms. International Journal on Computer Science and Engineering (IJCSE). 2023;55.
- 56. Okser S, Pahikkala T, Airola A, Salakoski T, Ripatti S, Aittokallio T. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet. 2014;10(11):e1004754. doi: 10.1371/journal.pgen.1004754
- 57. Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28(5):1–26.
- 58. Kuhn M. Variable selection using the caret package. 2012.
- 59. Berrar D. Cross-validation. In: Encyclopedia of Bioinformatics and Computational Biology. Elsevier; 2019. p. 542–5.
- 60. Bottino F, Tagliente E, Pasquini L, Napoli AD, Lucignani M, Figà-Talamanca L, et al. COVID mortality prediction with machine learning methods: a systematic review and critical appraisal. J Pers Med. 2021;11(9):893. doi: 10.3390/jpm11090893
- 61. Brownlee J. Machine learning mastery with R: get started, build accurate models and work through projects step-by-step. Machine Learning Mastery; 2016.
- 62. Tattar PN. Hands-on ensemble learning with R: a beginner's guide to combining the power of machine learning algorithms using ensemble techniques. Packt Publishing Ltd; 2018.
- 63. Kuncheva LI. Combining pattern classifiers: methods and algorithms. John Wiley & Sons; 2014.
- 64. Wang L, Mo T, Wang X, Chen W, He Q, Li X, et al. A hierarchical fusion framework to integrate homogeneous and heterogeneous classifiers for medical decision-making. Knowl Based Syst. 2021;212:106517. doi: 10.1016/j.knosys.2020.106517
- 65. Sullivan GM, Feinn R. Using effect size-or why the P value is not enough. J Grad Med Educ. 2012;4(3):279–82. doi: 10.4300/JGME-D-12-00156.1
- 66. Fritz CO, Morris PE, Richler JJ. Effect size estimates: current use, calculations, and interpretation. J Exp Psychol Gen. 2012;141(1):2–18. doi: 10.1037/a0024338
- 67. Zhu Y, Guo W. Family-wise error rate controlling procedures for discrete data. Stat Biopharm Res. 2019;12(1):117–28. doi: 10.1080/19466315.2019.1654912
- 68. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC. Package 'pROC'. 2021.
- 69. Kassambara A. Comparing groups: numerical variables. Sydney, Australia: Datanovia; 2019.
- 70. Molnar C, Schratz P. Package 'iml'. R CRAN. 2020.
- 71. Roth AE. The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press; 1988.
- 72. Lunardon N, Menardi G, Torelli N. ROSE: a package for binary imbalanced learning. R J. 2014;6(1).
- 73. Zali A, Gholamzadeh S, Mohammadi G, Azizmohammad Looha M, Akrami F, Zarean E. Baseline characteristics and associated factors of mortality in COVID-19 patients; an analysis of 16000 cases in Tehran, Iran. Arch Acad Emerg Med. 2020;8(1):e70.
- 74. Ogundimu EO, Altman DG, Collins GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epidemiol. 2016;76:175–82. doi: 10.1016/j.jclinepi.2016.02.031
- 75. Chamseddine E, Mansouri N, Soui M, Abed M. Handling class imbalance in COVID-19 chest X-ray images classification: using SMOTE and weighted loss. Appl Soft Comput. 2022;129:109588. doi: 10.1016/j.asoc.2022.109588
- 76. Javidi M, Abbaasi S, Naybandi Atashi S, Jampour M. COVID-19 early detection for imbalanced or low number of data using a regularized cost-sensitive CapsNet. Sci Rep. 2021;11(1):18478. doi: 10.1038/s41598-021-97901-4
- 77. de Jong VMT, Rousset RZ, Antonio-Villa NE, Buenen AG, Van Calster B, Bello-Chavolla OY, et al. Clinical prediction models for mortality in patients with covid-19: external validation and individual participant data meta-analysis. BMJ. 2022;378:e069881. doi: 10.1136/bmj-2021-069881
- 78. Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1):70. doi: 10.1186/s12916-023-02779-w
- 79. Ribeiro MHDM, da Silva RG, Mariani VC, Coelho LDS. Short-term forecasting COVID-19 cumulative confirmed cases: perspectives for Brazil. Chaos Solitons Fractals. 2020;135:109853. doi: 10.1016/j.chaos.2020.109853
- 80. Hussain S, Songhua X, Aslam MU, Hussain F. Clinical predictions of COVID-19 patients using deep stacking neural networks. J Investig Med. 2024;72(1):112–27. doi: 10.1177/10815589231201103
- 81. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78. doi: 10.1111/acem.12876
- 82. de Paiva BBM, Pereira PD, de Andrade CMV, Gomes VMR, Souza-Silva MVR, Martins KPMP, et al. Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset. Sci Rep. 2023;13(1):3463. doi: 10.1038/s41598-023-28579-z
- 83. An N, Ding H, Yang J, Au R, Ang TFA. Deep ensemble learning for Alzheimer's disease classification. J Biomed Inform. 2020;105:103411. doi: 10.1016/j.jbi.2020.103411
- 84. Gupta A, Jain V, Singh A. Stacking ensemble-based intelligent machine learning model for predicting post-COVID-19 complications. New Gener Comput. 2022;40(4):987–1007. doi: 10.1007/s00354-021-00144-0
- 85. Kablan R, Miller HA, Suliman S, Frieboes HB. Evaluation of stacked ensemble model performance to predict clinical outcomes: a COVID-19 study. Int J Med Inform. 2023;175:105090. doi: 10.1016/j.ijmedinf.2023.105090
- 86. Liu Y, Du X, Chen J, Jin Y, Peng L, Wang HHX, et al. Neutrophil-to-lymphocyte ratio as an independent risk factor for mortality in hospitalized patients with COVID-19. J Infect. 2020;81(1):e6–12. doi: 10.1016/j.jinf.2020.04.002
- 87. Wu C, Chen X, Cai Y, Xia J, Zhou X, Xu S, et al. Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease 2019 pneumonia in Wuhan, China. JAMA Intern Med. 2020;180(7):934–43. doi: 10.1001/jamainternmed.2020.0994
- 88. Xu W, Sun N-N, Gao H-N, Chen Z-Y, Yang Y, Ju B, et al. Risk factors analysis of COVID-19 patients with ARDS and prediction based on machine learning. Sci Rep. 2021;11(1):2933. doi: 10.1038/s41598-021-82492-x
- 89. Sedgwick P. Retrospective cohort studies: advantages and disadvantages. BMJ. 2014;348:g1072. doi: 10.1136/bmj.g1072
- 90. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361:k1479. doi: 10.1136/bmj.k1479
- 91. Perets O, Stagno E, Yehuda EB, McNichol M, Anthony Celi L, Rappoport N, et al. Inherent bias in electronic health records: a scoping review of sources of bias. medRxiv. 2024. doi: 10.1101/2024.04.09.24305594
- 92. Degtiar I, Rose S. A review of generalizability and transportability. Annu Rev Stat Appl. 2023;10(1):501–24. doi: 10.1146/annurev-statistics-042522-103837
- 93. Collins GS, Dhiman P, Ma J, Schlussel MM, Archer L, Van Calster B, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ. 2024;384:e074819. doi: 10.1136/bmj-2023-074819
- 94. Collins GS, Reitsma JB, Altman DG, Moons KGM, TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Circulation. 2015;131(2):211–9. doi: 10.1161/CIRCULATIONAHA.114.014508