Abstract
Background
Reliable predictive modeling in high-dimensional biomedical data requires a balance between accuracy, interpretability, and computational efficiency. However, existing ensemble methods often overlook model diversity or rely on ad hoc feature-selection approaches, which limit generalizability. This study introduces a hybrid feature-selection and diversity-guided stacking framework designed to improve robustness and scalability across clinical and other data-intensive domains.
Methods
The proposed framework integrates a hybrid feature-selection pipeline—combining Variance Inflation Factor (VIF), Analysis of Variance (ANOVA), Sequential Backward Elimination (SBE), and Lasso regression—to reduce multicollinearity and overfitting. It also employs a diversity-aware stacking strategy that constructs sub-model sets based on pairwise diversity measures (Disagreement, Yule’s Q, and Cohen’s Kappa) and non-pairwise metrics (Entropy and Kohavi–Wolpert). Sixteen base classifiers and five meta-learners were trained using repeated 10-fold cross-validation. The framework was evaluated using data from 4,778 hospitalized COVID-19 patients with 116 clinical and laboratory attributes, preprocessed using robust scaling and ROSE-based class balancing.
Results
The optimal configuration, which stacked Random Forest and XGBoost models using a Neural Network meta-learner, achieved 91.4% accuracy (95% CI: 89.8–92.8), AUC = 0.955, F1 = 0.801, and MCC = 0.746, outperforming the best individual model (AdaBoost, 90.2%). Training time (~450 s) and per-case inference time (<0.2 s) demonstrated computational feasibility. Feature-importance analysis and SHAP-based interpretation confirmed clinical relevance and interpretability.
Conclusions
The hybrid feature-selection and diversity-guided stacking framework improves predictive accuracy and interpretability while maintaining computational efficiency. Although validated using COVID-19 mortality data, the approach is broadly applicable to biomedical, environmental, and engineering prediction tasks that require interpretable and scalable ensemble learning.
1 Introduction
Machine learning (ML) has become an essential component of modern biomedical research, enabling the discovery of complex, nonlinear patterns and supporting improved risk stratification across diverse clinical domains. The primary goal of ML research is to develop efficient, interpretable, and generalizable algorithms capable of delivering reliable performance across heterogeneous datasets [1]. Efficiency in ML encompasses not only time and memory requirements but also data utilization and interpretability—key considerations for deployment in high-stakes clinical environments, where transparency, reproducibility, and auditability are critical.
Traditional ML models, however, often encounter significant challenges, including data quality issues, overfitting, class imbalance, and limited interpretability, which restrict their usefulness in clinical applications. These limitations have driven growing interest in ensemble learning, in which multiple algorithms are combined to reduce both variance and bias, thereby improving predictive performance compared with individual models [2–4]. Among ensemble approaches, stacking has emerged as a particularly powerful technique. Stacking integrates diverse base learners and employs a meta-learner to combine their predictions, leveraging complementary error structures to improve accuracy and robustness [5–7].
Recent advancements in various disciplines highlight both the promise and the challenges of ensemble modeling. In cardiovascular diagnostics, Feng et al. (2023) developed a hybrid model that combined hemodynamic modeling with ML to achieve >90% accuracy in under two seconds per case, demonstrating that hybrid approaches can deliver high computational efficiency while maintaining strong performance [8]. Similarly, Wang et al. (2023) used the Super Learner algorithm to improve prediction of cumulative lead exposure, though at the cost of substantial computational load due to the combination of multiple algorithms [9]. In proteomics and chemical sciences, feature selection and dimensionality reduction have proven effective in enhancing prediction accuracy while reducing model complexity.
Xu et al. (2022) benchmarked 13 ML models for protein-level inference from RNA features across more than 2,500 samples and 20 datasets. Their findings demonstrated that combining appropriate feature selection with classical models and voting ensembles improved accuracy, although computation time varied widely [10].
Reda et al. (2023) applied variable selection with partial least squares regression to predict olive-oil quality parameters using near-infrared spectroscopy, showing that variable reduction improved accuracy and decreased computational demands [11].
In oncology, stacking and hybrid ensembles have also yielded substantial gains in predictive performance and generalization. Mohammed et al. (2021) applied a CNN-based stacking ensemble to multi-cancer RNA-Seq classification, achieving superior accuracy to single models while retaining computational feasibility [12]. Wang et al. (2025) designed a multimodal stacking framework that integrated radiomics and deep learning for head-and-neck cancer prognosis (C-index = 0.93), demonstrating the influence of meta-learner selection on scalability [13]. Kwon et al. (2019) found that gradient boosting performed best as a meta-learner for accuracy, while generalized linear models minimized error in breast cancer classification, highlighting the trade-offs between model complexity and efficiency [14]. Other architectures, such as the relevance-aware capsule network [15], deep convolutional neural networks [16], and U-Net–based MRI segmentation models [17], have demonstrated that improvements in accuracy commonly require substantially higher training time and memory, emphasizing the need to balance predictive strength with practicality and interpretability.
Ensemble learning has been widely adopted in other clinical areas. Abualnaja et al. [18] analyzed 32 studies involving 142,459 patients with meningiomas and reported that combined radiomic and clinical ensemble models achieved AUCs of 0.74–0.81, demonstrating robust multimodal representation; Lei et al. [19] reported similar findings. Other studies in cardiovascular disease have shown similar results. Dhingra et al. [20] developed an ensemble model (PRESENT-SHD) using 261,228 ECGs, achieving AUROC values of 0.85–0.90 across multiple hospitals, indicating strong cross-population stability. Tseng et al. [21] used XGBoost and random forest models to predict acute kidney injury following cardiac surgery (AUC = 0.843), demonstrating ensemble learning’s value in perioperative risk prediction.
In infectious diseases research, Sawesi et al. [22] reviewed 17 leptospirosis studies and found that ML and deep learning methods—including CNN ensembles—achieved high accuracy (80–98%), though most lacked external validation. Chiasakul et al. [23] reported that AI methods for venous thromboembolism prediction outperformed traditional risk scores (mean AUC 0.79 vs 0.61), although many studies exhibited bias and limited generalizability.
Ensemble learning has also consistently outperformed single models in forecasting outbreaks of dengue, influenza, Ebola, and COVID-19. Early COVID-19 mortality forecasts demonstrated that ensembles delivered greater accuracy and precision than individual models [24].
Stacking is particularly suited to heterogeneous clinical datasets, such as COVID-19 mortality prediction, which depends on numerous clinical, biochemical, and physiological indicators [25]. Berliana and Bustamam [4] demonstrated that a two-level stacking model achieved more than 97% accuracy with CT data and 99% with chest X-ray images. Cui et al. [26] introduced a nested heterogeneous ensemble integrating SVR, ELM, and logistic regression, achieving improved generalization. Li et al. [27] predicted early mortality using five base models and a genetic-algorithm optimization procedure, achieving an AUC of 0.907 in a cohort of 4,711 patients. Other studies demonstrated that hybrid ensembles incorporating supervised and unsupervised learning improved performance by over 10%, and that boosted models remained competitive with strong clinical relevance [28,29].
Despite these advances, several methodological limitations persist. Systematic reviews highlight widespread issues such as small and unrepresentative datasets, weak handling of missing data, lack of external validation, and overreliance on discrimination metrics alone [29–31]. Many studies also neglect calibration, effect-size estimation, or fairness analyses, reducing clinical interpretability [32,33]. Research shows a nonlinear relationship between predictive gain and computational cost: complex models often deliver higher accuracy but at significant increases in resource consumption [34–37]. While innovations such as subgraph learning [35] and simplified coronary models [8] can mitigate these burdens, a careful balance of accuracy, efficiency, and interpretability remains necessary.
Sample size adequacy is another concern. Many COVID-19 models are trained on datasets too small for their complexity, leading to instability and overfitting [38]. Class imbalance is also common in mortality modeling; although oversampling and weighting strategies are widely used, these must be validated to avoid artificial distortions [39,40].
Furthermore, many ensemble studies rely heavily on tree-based models such as random forest, XGBoost, and LightGBM, limiting diversity and restricting the full advantages of ensemble learning [41,42]. Quantitative measures of diversity—such as Yule’s Q, Disagreement, Cohen’s Kappa, or Double-Fault—remain rarely used in COVID-19 modeling despite consistent evidence that diversity improves generalization [42–44].
To address these limitations, this study introduces a computationally efficient, diversity-guided stacking ensemble framework that integrates heterogeneous base classifiers and interpretable meta-learners to predict COVID-19 mortality. Our approach incorporates:
Hybrid feature-selection using variance inflation factor (VIF) analysis, ANOVA, sequential backward elimination (SBE), and Lasso regression to control multicollinearity and enhance interpretability;
Controlled ensemble depth to balance predictive gain and computational feasibility; and
Lightweight meta-learners capable of capturing nonlinear dependencies among diverse base learners.
We constructed sub-model ensembles using multiple diversity metrics across 16 machine learning algorithms and assessed model performance using discrimination, calibration, and statistical significance tests, including Wilcoxon, McNemar, and DeLong analyses. Model interpretability was enhanced through SHAP-based explanation of global and local prediction behavior.
This study presents a generalizable diversity-aware ensemble framework designed to balance accuracy, interpretability, and computational cost. Although applied here to COVID-19 mortality prediction, the approach is suitable for a wide range of biomedical prediction problems that require robust, interpretable, and scalable machine learning solutions.
2 Materials and methods
2.1 Overview of the proposed framework
As illustrated in Fig 1, this study adopts a multi-stage framework for predicting mortality risk, integrating standard machine learning techniques with the proposed algorithmic innovations. The workflow contains two primary layers:
Fig 1. Research methodology of the proposed machine learning framework.
Foundational Stage – Data preprocessing, normalization, and training of base models using established machine learning procedures.
Algorithmic Stage – A diversity-guided stacking ensemble that integrates hybrid feature selection, explicit model diversity assessment, and a comparison of multiple meta-learners to optimize predictive performance, interpretability, and computational efficiency.
Data from 4,778 confirmed COVID-19 cases were cleaned through exclusion of incomplete records, iterative multivariate imputation, and normalization. The hybrid feature-selection process removed multicollinearity using Variance Inflation Factor (VIF), followed by Analysis of Variance (ANOVA), Sequential Backward Elimination (SBE), and Lasso regression to select 15 key predictors.
Sixteen machine learning classifiers were trained using stratified and balanced datasets and evaluated via repeated 10-fold cross-validation. To enhance predictive performance, ensemble sets were constructed based on correlation and statistical diversity metrics, then stacked using five different meta-learners. Models were evaluated based on discrimination and calibration performance and validated using significance tests. Interpretability was assessed using feature importance and SHAP-based analyses.
2.2 Data source and ethical approval
Data were obtained from 4,778 confirmed COVID-19 patients admitted to three general hospitals in Tehran, Iran, between March 2020 and March 2021. Demographic, clinical, laboratory, symptom, comorbidity, vital sign, and outcome information was extracted from clinical records reviewed by trained medical staff. Laboratory findings were collected on the first day of admission through the hospital information system, and COVID-19 diagnosis was confirmed using real-time polymerase chain reaction (RT-PCR) of nasal or oropharyngeal swab samples.
The study followed formal institutional requirements and received ethical approval from the Institutional Review Board (IRB) of Shahid Beheshti University of Medical Sciences (IR.SBMU.RIGLD.REC.1401.032).
Informed consent was waived due to the retrospective nature of the study, and data were anonymized prior to analysis in accordance with the Declaration of Helsinki. This dataset provides comprehensive temporal, demographic, and clinical information suitable for developing predictive models for COVID-19 mortality. Additional details on the epidemiological profile of the cohort are available in Hatamabadi et al. [45].
2.3 Data preprocessing
Missing data were assessed and addressed prior to model development. Patients with missing values in any categorical variable or more than two missing continuous variables were excluded. Among the 123 available variables (52 categorical and 71 numeric), no categorical variables contained missing values. Seven numerical variables with more than 5% missingness were removed to reduce risk of bias. Remaining missing values were imputed using an iterative multivariate approach implemented in Scikit-learn [46], which models each variable with missing entries as a function of all other features in a chained regression process. This preserves multivariate relationships and minimizes bias under the Missing at Random (MAR) assumption.
The resulting imputed dataset was previously validated by Hatamabadi et al. [47] to confirm realistic variable distributions and consistent multivariate relationships. To account for skewed distributions and sensitivity to outliers, continuous variables were standardized using robust scaling [48], which centers variables on the median and scales using the interquartile range (IQR). This approach enhances stability in clinical models by reducing the influence of extreme values.
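For concreteness, the imputation-plus-scaling steps described above can be sketched in Python with Scikit-learn. The data below are synthetic, and settings such as `max_iter` are illustrative assumptions rather than the study's exact configuration:

```python
# Minimal sketch of the preprocessing pipeline: iterative multivariate
# imputation (chained regression) followed by robust scaling (median/IQR).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% values missing at random

pipe = Pipeline([
    ("impute", IterativeImputer(max_iter=10, random_state=0)),
    ("scale", RobustScaler()),  # centers on the median, scales by the IQR
])
X_prep = pipe.fit_transform(X)
print(np.isnan(X_prep).sum())  # all missing entries have been imputed
```

Wrapping both steps in a `Pipeline` ensures the scaler is fit on imputed values, mirroring the order described in the text.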
2.4 Feature selection
Feature Selection (FS) is essential for managing high-dimensional clinical datasets by reducing redundant, irrelevant, or correlated predictors while improving model accuracy, computational efficiency, and interpretability [49,50].
A multi-stage hybrid feature-selection strategy was implemented to progressively eliminate multicollinearity, non-informative predictors, and weak contributors. Four complementary methods were applied:
Variance Inflation Factor (VIF): Used to detect and remove highly collinear continuous variables, thereby improving model stability and avoiding inflated variance estimates [51,52].
Analysis of Variance (ANOVA): Applied next to evaluate between-group differences using F-tests and eliminate non-discriminative features with minimal computational cost [53,54].
Sequential Backward Elimination (SBE): Iteratively removed the least informative features based on cross-validated model performance, preserving meaningful interactions and improving generalization [55].
Lasso regression: Imposed regularization to shrink weak coefficients to zero, providing sparse and stable model structures suited for correlated predictors [56].
This integrated pipeline—removing collinearity (VIF), filtering weak predictors (ANOVA), refining via model performance (SBE), and enforcing sparsity (Lasso)—produced a compact, interpretable feature set optimized for ensemble learning.
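The four stages can be sketched compactly in Python on synthetic data. The cutoffs used here (VIF < 10, p < 0.05, five SBE-retained features) are common defaults assumed for illustration; the study's own thresholds may differ:

```python
# Illustrative VIF -> ANOVA -> SBE -> Lasso pipeline on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (f_classif, SequentialFeatureSelector,
                                       SelectFromModel)
from sklearn.linear_model import LogisticRegression, LassoCV, LinearRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, random_state=0)

# 1) VIF filter: VIF_j = 1 / (1 - R^2) from regressing feature j on the rest
def vif(X, j):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / max(1.0 - r2, 1e-12)

keep = [j for j in range(X.shape[1]) if vif(X, j) < 10]
X1 = X[:, keep]

# 2) ANOVA F-test filter: keep features significant at p < 0.05
_, pvals = f_classif(X1, y)
X2 = X1[:, pvals < 0.05]

# 3) Sequential backward elimination guided by cross-validated performance
sbe = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=min(5, X2.shape[1] - 1),
    direction="backward", cv=5)
X3 = sbe.fit_transform(X2, y)

# 4) Lasso: shrink weak coefficients to zero for a sparse final set
sel = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X3, y)
X4 = sel.transform(X3)
print(X.shape[1], "->", X1.shape[1], "->", X2.shape[1],
      "->", X3.shape[1], "->", X4.shape[1])
```

Ordering cheap filters (VIF, ANOVA) before expensive wrapper methods (SBE) keeps the overall cost manageable, which is the rationale behind the staged design.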
2.5 Model training and selection
Sixteen base machine learning models were trained, including ten standard algorithms and six boosting or bagging methods, representing diverse methodological families. Hyperparameters were optimized through grid search using the caret package [57,58], with 10-fold cross-validation repeated 10 times [59] to balance bias and variance. Models that gain little from extensive tuning (e.g., GLM, LDA, CART, Naïve Bayes) retained default settings to maximize computational efficiency.
Hyperparameter ranges were informed by existing literature and empirical results from clinical prediction research. Table 1 summarizes the optimized settings for each model.
Table 1. Parameter settings for the 16 base machine learning models.
| Models | Tuned hyperparameters |
|---|---|
| Generalized Linear Model (GLM) | none |
| Linear Discriminant Analysis (LDA) | none |
| Regularized Regression (Lasso) | alpha = 1; lambda = 0.0014 |
| Ridge Regularized Regression (Ridge) | alpha = 0; lambda = 0.098 |
| Elastic Net Regularized Regression (Elastic Net) | alpha = 0.5; lambda = 0.047 |
| k-Nearest Neighbors (KNN) | k = 1 |
| Naïve Bayes (NB) | none |
| Support Vector Machine (SVM) | sigma = 0.15, C = 10 |
| Classification and Regression Trees (CART) | none |
| Neural Network (NN) | size = 10, decay = 0.1 |
| C5.0 | trials = 50, model = “tree,” winnow = FALSE |
| Stochastic Gradient Boosting (GBM) | n.trees = 250, interaction.depth = 5, shrinkage = 0.1, n.minobsinnode = 10 |
| Extreme Gradient Boosting (XGBoost) | nrounds = 250, max_depth = 5, eta = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 1 |
| Random Forest (RF) | mtry = 3, ntree = 700 |
| AdaBoost Random Forest | nIter = 100, method = “Adaboost.M1” |
| Bagged CART (Treebag) | none |
Model selection was based on three criteria:
Algorithmic diversity,
Demonstrated success in biomedical or COVID-19 prediction studies, and
Complementary bias–variance profiles.
The final pool (supported by systematic evidence, e.g., Bottino et al. [60]) included linear models (GLM, Lasso, Ridge, Elastic Net), probabilistic models (Naïve Bayes, LDA), instance-based learners (KNN), tree-based models (CART, C5.0, Random Forest, XGBoost, GBM, Treebag), and Neural Networks.
This diversity ensured coverage of linear and nonlinear relationships, uncertainty modeling, and hierarchical interactions common in clinical decision data.
Cross-validated performance guided final model selection, which was subsequently confirmed through independent test-set validation.
2.6 Diversity-guided sub-model construction
Traditional stacking often selects base models with low prediction correlation to ensure each contributes complementary information. Highly correlated models (>0.75) add redundancy and weaken ensemble gains [61].
Sixteen candidate models were initially generated using the caretList() function from the caretEnsemble package [57,58], and pairwise prediction correlations were calculated across 10 repeats of 10-fold cross-validation. Where a pair exceeded a correlation of 0.75, the less accurate model was removed.
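The pruning rule is language-agnostic and can be sketched as follows (the study's implementation uses caretEnsemble in R; model names, accuracies, and prediction vectors below are synthetic placeholders):

```python
# Drop the less accurate member of any model pair whose cross-validated
# prediction correlation exceeds a threshold (0.75 in the text).
import numpy as np

def prune_correlated(preds, acc, threshold=0.75):
    """preds: dict name -> array of predicted probabilities;
    acc: dict name -> accuracy. Models are visited best-first, and a model
    is kept only if it is not too correlated with any already-kept model."""
    names = sorted(preds, key=acc.get, reverse=True)  # most accurate first
    kept = []
    for name in names:
        if all(abs(np.corrcoef(preds[name], preds[k])[0, 1]) <= threshold
               for k in kept):
            kept.append(name)
    return kept

rng = np.random.default_rng(1)
base = rng.random(500)
preds = {
    "glm":   base + 0.05 * rng.random(500),  # nearly identical to "lasso"
    "lasso": base + 0.05 * rng.random(500),
    "rf":    rng.random(500),                # independent predictions
}
acc = {"glm": 0.83, "lasso": 0.85, "rf": 0.89}
print(prune_correlated(preds, acc))  # ['rf', 'lasso'] — "glm" is dropped
```

Visiting models in accuracy order guarantees that whenever two models are redundant, the stronger one survives, matching the rule stated above.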
To enhance diversity beyond correlation filtering, additional sub-model sets were constructed using explicit diversity metrics capturing complementary error patterns among classifiers. Pairwise measures (Disagreement, Yule’s Q, Cohen’s Kappa, Double-Fault) and non-pairwise metrics (Entropy, Kohavi–Wolpert) were computed following Tattar [62].
The contingency table defining model agreements (n11 and n00) and disagreements (n10 and n01) across N observations is presented in Table 2.
Table 2. Contingency table illustrating agreement and disagreement between two classifiers.
| | M1 correct (1) | M1 incorrect (0) |
|---|---|---|
| M2 correct (1) | n11 | n10 |
| M2 incorrect (0) | n01 | n00 |
2.6.1 Disagreement measure.
This quantifies the proportion of instances on which the two classifiers differ in their predictions (1):
$$D_{1,2} = \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}} \quad (1)$$
Higher values indicate greater diversity and reduced redundancy.
2.6.2 Yule’s Q-statistic.
Yule’s Q (or Q-statistic) assesses the strength and direction of association between two classifiers’ predictions (range: –1 to +1). Lower absolute values denote weaker association and thus higher diversity (2):
$$Q_{1,2} = \frac{n_{11}\,n_{00} - n_{01}\,n_{10}}{n_{11}\,n_{00} + n_{01}\,n_{10}} \quad (2)$$
2.6.3 Cohen’s Kappa statistic.
A widely used measure that evaluates inter-model agreement while adjusting for chance. Low or negative Kappa values suggest that classifiers make independent errors, which enhances ensemble robustness.
2.6.4 Double-fault measure.
Measures the proportion of cases where both classifiers misclassify the same instance. Smaller values indicate complementary error patterns and reduced correlated failures (3):
$$DF_{1,2} = \frac{n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}} \quad (3)$$
Two non-pairwise metrics were also used:
Entropy measure [63]: Reflects the overall variability of predictions across all classifiers, ranging from 0 (perfect agreement, no diversity) to 1 (maximum diversity).
Kohavi-Wolpert Measure [62]: Derived from error variance decomposition, it quantifies the dispersion of predictions across classifiers; higher values imply greater diversity and richer ensemble representation.
Pairwise metrics identified redundant learners, while non-pairwise metrics assessed overall heterogeneity within sub-model sets. This integrated diversity evaluation ensured complementary base learners and improved the generalization performance of the stacking ensemble.
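As a concrete illustration, the pairwise measures (1)–(3) can be computed directly from two classifiers' per-instance correctness indicators. The Python sketch below uses toy vectors rather than the study's predictions (the study followed the R implementation in Tattar [62]):

```python
# Pairwise diversity measures from correctness indicators
# (1 = classifier correct, 0 = incorrect) over the same test instances.
import numpy as np

def pairwise_diversity(c1, c2):
    c1, c2 = np.asarray(c1), np.asarray(c2)
    n11 = np.sum((c1 == 1) & (c2 == 1))  # both correct
    n00 = np.sum((c1 == 0) & (c2 == 0))  # both incorrect
    n10 = np.sum((c1 == 1) & (c2 == 0))
    n01 = np.sum((c1 == 0) & (c2 == 1))
    N = n11 + n10 + n01 + n00
    disagreement = (n10 + n01) / N                        # Eq (1)
    q = (n11 * n00 - n01 * n10) / max(n11 * n00 + n01 * n10, 1)  # Eq (2)
    double_fault = n00 / N                                # Eq (3)
    return disagreement, q, double_fault

c1 = [1, 1, 0, 1, 0, 1, 1, 0]
c2 = [1, 0, 1, 1, 0, 0, 1, 1]
d, q, df = pairwise_diversity(c1, c2)
print(round(d, 3), round(q, 3), round(df, 3))  # 0.5 -0.143 0.125
```

High disagreement with a low (or negative) Q and a small double-fault value is the pattern sought when assembling complementary sub-model sets.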
2.7 Meta-learner integration
Predictions from the sub-model sets were integrated using a stacking framework with five meta-learners:
Generalized Linear Model (GLM)
Linear Discriminant Analysis (LDA)
Random Forest (RF)
Gradient Boosting Machine (GBM)
Neural Network (NN)
Linear models (GLM, LDA) were chosen for their transparency and stable inference, while tree-based (RF, GBM) and neural meta-learners modeled nonlinear dependencies among base models. This balanced design supports both interpretability and computational efficiency, aligning with the study’s objective of developing a robust, clinically applicable framework for mixed-type healthcare data.
Stacking was implemented using the caretEnsemble package to fuse the diverse predictive outputs effectively.
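The study implements stacking in R with caretEnsemble; a rough Scikit-learn analogue of one configuration (random forest and gradient boosting fused by a small neural-network meta-learner) might look like the sketch below. All data and hyperparameters are illustrative assumptions:

```python
# Stacking sketch: two tree-based base learners whose out-of-fold
# probabilities are combined by a neural-network meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                      stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=0),
    stack_method="predict_proba",  # meta-learner sees class probabilities
    cv=5)                          # out-of-fold predictions avoid leakage
stack.fit(Xtr, ytr)
print(round(stack.score(Xte, yte), 3))
```

Using cross-validated (out-of-fold) base-model predictions to train the meta-learner is the key design choice that prevents the stack from overfitting the base models' training errors.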
2.8 Model evaluation and statistical analysis
Performance was assessed using an independent test dataset. Discrimination metrics included accuracy, sensitivity, specificity, precision, F1-score, Cohen’s Kappa, area under the ROC curve (AUC), and the Matthews correlation coefficient (MCC). These metrics collectively capture overall correctness, class-specific detection, and robustness under class imbalance—critical in mortality prediction, where false negatives carry severe clinical risk and false positives may strain resources. AUC quantified global discrimination across thresholds, while MCC provided a balanced evaluation under uneven outcome distributions [64].
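These discrimination metrics map directly onto standard library calls. The sketch below uses toy labels and scores purely to show the computations (values are illustrative, not study results):

```python
# Discrimination metrics on toy paired labels, predictions, and scores.
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, cohen_kappa_score, matthews_corrcoef,
                             roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred  = [1, 0, 1, 0, 0, 0, 1, 1, 0, 1]   # one FN, one FP
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6, 0.2, 0.95]

print("accuracy:   ", accuracy_score(y_true, y_pred))        # 0.8
print("sensitivity:", recall_score(y_true, y_pred))          # 0.8
print("precision:  ", precision_score(y_true, y_pred))       # 0.8
print("F1:         ", round(f1_score(y_true, y_pred), 3))    # 0.8
print("kappa:      ", round(cohen_kappa_score(y_true, y_pred), 3))  # 0.6
print("MCC:        ", round(matthews_corrcoef(y_true, y_pred), 3))  # 0.6
print("AUC:        ", round(roc_auc_score(y_true, y_score), 3))     # 0.96
```

Note that threshold-based metrics (accuracy, F1, MCC) consume hard predictions, while AUC consumes the continuous scores, which is why both are reported.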
Calibration was assessed using reliability curves to compare predicted probabilities against observed outcomes. Statistical comparisons employed:
Wilcoxon signed-rank tests for accuracy differences,
Effect size [65] interpreted per Cohen’s criteria (large = 0.5, medium = 0.3, small = 0.1) [66],
Holm correction to control the family-wise error rate [67],
McNemar’s test to compare classification outcomes between paired models,
DeLong’s test to assess the statistical significance of AUC differences under the null hypothesis of equal performance [68].
All tests were conducted using appropriate R packages, including rstatix [69] and pROC.
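As an illustration of the paired-model comparison, McNemar's test depends only on the discordant counts b and c (cases one model classifies correctly and the other does not). A minimal continuity-corrected sketch with made-up counts (the study itself used R):

```python
# Continuity-corrected McNemar test on the discordant cells of a
# paired-outcome contingency table; counts are illustrative.
from scipy.stats import chi2

def mcnemar(b, c):
    """b = cases model A got right and model B got wrong; c = the reverse.
    Returns the continuity-corrected chi-square statistic and its p-value."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

stat, p = mcnemar(b=40, c=20)
print(round(stat, 3), round(p, 4))
```

Because the concordant cells cancel out, McNemar's test isolates exactly where two classifiers disagree, which is why it complements overall-accuracy comparisons.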
2.9 Model interpretability
Feature importance and effect analyses were conducted to explain individual predictions and quantify how specific feature values influenced model outputs, using tools from the iml package [70]. To further enhance interpretability, we employed SHAP (SHapley Additive exPlanations), a model-agnostic framework that quantifies each feature’s contribution to predicted outcomes [71].
SHAP analysis was applied directly to the final stacked ensemble, treating it as a single predictive function. This approach decomposed predicted probabilities into feature-level attributions of the original clinical variables rather than intermediate model outputs, enabling transparent interpretation of global importance, feature interactions, and local instance-level effects driving mortality predictions.
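To make the additive decomposition concrete, exact Shapley values can be enumerated for a toy value function over three features. This illustrates the principle SHAP approximates; the feature names and contributions are hypothetical, and the study itself used the SHAP library on the stacked ensemble:

```python
# Exact Shapley attribution for a toy 3-feature "model" by enumerating
# all feature coalitions (feasible only for tiny feature counts).
from itertools import combinations
from math import factorial

def shapley(value, features):
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # weight of a coalition of size k in the Shapley average
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {f}) - value(set(S)))
        phi[f] = total
    return phi

# Hypothetical additive contributions to a predicted mortality risk
contrib = {"age": 0.30, "LDH": 0.20, "ferritin": 0.10}
value = lambda S: sum(contrib[f] for f in S)  # coalition's prediction
phi = shapley(value, list(contrib))
print(phi)  # for an additive model, Shapley values equal the contributions
```

The attributions sum to the difference between the full prediction and the empty-coalition baseline (the "efficiency" property), which is exactly the additive decomposition described in the text.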
3 Results
3.1 Data characteristics and preparation
The study analyzed 4,778 confirmed COVID-19 cases, comprising 116 clinical, laboratory, and demographic features. The overall mortality rate was 22% (1,050 patients). Males accounted for 59.6% of deaths. The mean age of deceased patients was 70.8 years (SD = 15.6), compared with 58.3 years (SD = 16.9) among survivors. Mortality showed significant associations with comorbidities such as hypertension, diabetes, and heart failure, highlighting their importance in COVID-19 risk assessment (Fig 2).
Fig 2. Comorbidity distribution and mortality associations in the study cohort.
Numeric features were standardized using the robust scaler based on the interquartile range (IQR) to reduce the influence of outliers. The dataset was split into training (70%, n = 3,345) and testing (30%, n = 1,433) subsets, maintaining consistent mortality rates (21.97%) across both. The original training set exhibited class imbalance (22% “Death” vs. 78% “Alive”), which was corrected using the ROSE package [72], resulting in a balanced 1:1 ratio (“Death”: 1,616; “Alive”: 1,729). This ensured robust model training and reliable performance evaluation.
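The split-and-balance procedure can be sketched as follows. ROSE generates synthetic balanced samples in R; the sketch approximates it with naive random oversampling purely for illustration, and all data below are synthetic:

```python
# Stratified 70/30 split preserving the outcome rate, then balancing the
# training set (naive oversampling as a stand-in for ROSE's synthetic sampling).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4778, 15))
y = (rng.random(4778) < 0.22).astype(int)  # ~22% minority ("Death") class

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.30,
                                      stratify=y, random_state=0)

# Oversample the minority class in the training set to a 1:1 ratio
minority = np.where(ytr == 1)[0]
extra = rng.choice(minority, size=(ytr == 0).sum() - minority.size,
                   replace=True)
Xbal = np.vstack([Xtr, Xtr[extra]])
ybal = np.concatenate([ytr, ytr[extra]])
print(round(ybal.mean(), 2))  # 0.5 — balanced classes
```

Balancing only the training set while leaving the test set at its natural prevalence, as done in the study, keeps the evaluation representative of deployment conditions.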
3.2 Feature selection and correlation analysis
VIF analysis removed highly collinear variables, reducing the dataset to 109 predictors. ANOVA eliminated 39 features with limited predictive value. SBE and Lasso further refined the selection to 15 key mortality predictors: age, neutrophil count (NEUT), lactate dehydrogenase (LDH), ferritin, phosphorus (P), ventilator oxygen saturation (O₂sat.Ventilator), total iron-binding capacity (TIBC), fasting blood sugar (FBS), procalcitonin, serum sodium (Na), muscle pain, chronic kidney disease (CKD), taste/smell loss, D-dimer, and erythrocyte sedimentation rate (ESR).
This selection achieved an optimal balance between model complexity and predictive performance, as adding more features did not improve cross-validation accuracy. The final feature set captured multiple pathophysiological domains relevant to COVID-19 outcomes, including inflammation (NEUT, ESR, ferritin), oxygenation (O₂sat.Ventilator), metabolism (FBS, Na, P, TIBC), and organ dysfunction (procalcitonin, CKD). Symptom-based predictors such as muscle pain and loss of taste/smell further enhanced discrimination between severe and non-severe disease.
Correlation analysis (S1–S2 Tables in S1 File) revealed generally weak associations among predictors, indicating low multicollinearity. Moderate correlations were observed between ferritin and ESR (r = 0.26), ferritin and TIBC (r = –0.34), and TIBC and ESR (r = –0.35), reflecting physiologically coherent inflammation–iron metabolism dynamics. Overall low inter-feature correlations support model stability and generalizability.
All selected predictors are routinely collected in clinical settings, ensuring clinical interpretability and feasibility for integration into real-world decision-support systems (Fig 3).
Fig 3. Correlation between selected features and the outcome (Death/Alive) in the training dataset.
3.3 Model performance evaluation
Fig 4 and S3 Table in S1 File summarize model performance using 10-fold cross-validation repeated 10 times across ten base classifiers and six boosting/bagging algorithms applied to all 15 predictive features. Accuracy estimates with 95% confidence intervals indicated that AdaBoost achieved the highest performance, with a mean accuracy of 92.81% (SD = 0.013). Optimized hyperparameters for each model are presented in S1 Fig in S1 File.
Fig 4. Performance comparison of base, boosting, and bagging machine learning algorithms using repeated 10-fold cross-validation on the training data.
3.4 Diversity-guided sub-model construction
The first sub-model set was generated using a traditional stacking approach, producing 16 candidate models with the caretList() function. Correlation analysis from repeated 10-fold cross-validation revealed strong dependencies among several learners. The Generalized Linear Model (GLM) demonstrated high correlations with LDA (0.940), Lasso (0.984), Ridge (0.966), and Elastic Net (0.966); similarly, C5.0 and Random Forest (RF) were highly correlated (0.806). AdaBoost also showed high correlations with NN (0.798), GBM (0.848), and XGBoost (0.950). To reduce redundancy, the less accurate model from each correlated pair was removed, retaining four classifiers—Ridge, KNN, CART, and AdaBoost—for the first sub-model set.
To improve model complementarity, additional sub-model sets were constructed using diversity metrics.
Pairwise analyses indicated that GLM, LDA, and other linear models exhibited high agreement, whereas GBM, NN, and RF displayed greater disagreement, suggesting complementary error patterns (Fig 5).
Fig 5. Disagreement metrics among classifier predictions on the test dataset.
Non-pairwise metrics further quantified ensemble heterogeneity. Entropy values ranged from 0.64 to 0.90, with the NN–GBM combination showing the greatest diversity. Yule’s Q ranged from 0.39 to 0.47, while Kohavi–Wolpert values ranged from 0.099 to 0.195. Negative Kappa values in some sets indicated substantial prediction disagreement, reinforcing diversity among classifiers (Fig 6, S4 Table in S1 File).
Fig 6. Inter-rater agreement among classifier predictions on the test dataset.
Ultimately, eight sub-model sets were selected based on these metrics (Table 3, Fig 7).
Table 3. Diversity metrics for the eight selected sub-model sets on the test dataset.
| Sub-Model Sets | Entropy measure | Disagreement measure (Q statistic) | Kohavi–Wolpert measure | Interrater agreement measure (Kappa) |
|---|---|---|---|---|
| NN, GBM | 0.903 | 0.451 | 0.113 | −0.028 |
| NB, GBM | 0.892 | 0.446 | 0.111 | −0.032 |
| SVM, GBM | 0.881 | 0.440 | 0.110 | −0.082 |
| KNN, AdaBoost | 0.822 | 0.411 | 0.103 | −0.043 |
| RF, XGB | 0.799 | 0.399 | 0.099 | 0.140 |
| NB, CART, AdaBoost, GBM | 0.640 | 0.395 | 0.148 | 0.100 |
| C5.0, NB, GBM | 0.634 | 0.422 | 0.141 | 0.049 |
| NN, RF, CART, GBM, XGB, Treebag | 0.722 | 0.469 | 0.195 | −0.029 |
Fig 7. Comparison of diversity metrics across eight selected sub-model sets on the test dataset.
This diversity-driven selection ensured inclusion of models differing in both architecture and error behavior, enhancing ensemble robustness and reducing correlated prediction errors.
3.5 Stacking model evaluation and statistical comparison
Stacking was performed using five meta-learners: Generalized Linear Model (GLM), Linear Discriminant Analysis (LDA), Neural Network (NN), Gradient Boosting Machine (GBM), and Random Forest (RF). Table 4 summarizes accuracies across stacking configurations on the independent test dataset.
Table 4. Accuracy of stacking sub-model sets using five different meta-learners on the test dataset.
| Stacked Sub-Model Sets | LDA Meta-Learner (95% CI) | NN Meta-Learner (95% CI) | GBM Meta-Learner (95% CI) | RF Meta-Learner (95% CI) | GLM Meta-Learner (95% CI) | Best Classifier from the Ensemble (95% CI) |
|---|---|---|---|---|---|---|
| Stacking NN, GBM | 0.826 (0.806, 0.845) | 0.823 (0.803, 0.843) | 0.826 (0.806, 0.845) | 0.809 (0.788, 0.829) | 0.803 (0.782, 0.823) | GBM: 0.825 (0.804, 0.844) |
| Stacking NB, GBM | 0.830 (0.809, 0.849) | 0.832 (0.812, 0.851) | 0.833 (0.813, 0.852) | 0.814 (0.793, 0.834) | 0.836 (0.816, 0.855) | GBM: 0.825 (0.804, 0.844) |
| Stacking SVM, GBM | 0.865 (0.846, 0.883) | 0.877 (0.859, 0.894) | 0.883 (0.865, 0.899) | 0.765 (0.742, 0.786) | 0.865 (0.846, 0.883) | SVM: 0.833 (0.813, 0.852) |
| Stacking KNN, AdaBoost | 0.848 (0.828, 0.866) | 0.840 (0.820, 0.859) | 0.876 (0.858, 0.893) | 0.852 (0.833, 0.870) | 0.848 (0.828, 0.866) | AdaBoost: 0.902 (0.885, 0.916) |
| Stacking RF, XGB | 0.895 (0.878, 0.910) | 0.914 (0.898, 0.928) | 0.652 (0.626, 0.676) | 0.878 (0.860, 0.894) | 0.897 (0.880, 0.913) | RF: 0.892 (0.875, 0.907) |
| Stacking RF, CART, NN, GBM, XGB, Treebag | 0.898 (0.881, 0.913) | 0.904 (0.898, 0.918) | 0.903 (0.886, 0.918) | 0.909 (0.892, 0.923) | 0.895 (0.878, 0.912) | RF: 0.892 (0.875, 0.907) |
| Stacking NB, CART, GBM, AdaBoost | 0.836 (0.816, 0.855) | 0.844 (0.823, 0.863) | 0.886 (0.868, 0.901) | 0.883 (0.866, 0.899) | 0.838 (0.818, 0.857) | AdaBoost: 0.902 (0.885, 0.916) |
| Stacking NB, C5.0, GBM | 0.886 (0.868, 0.901) | 0.908 (0.891, 0.922) | 0.911 (0.895, 0.926) | 0.897 (0.880, 0.912) | 0.888 (0.871, 0.904) | C5.0: 0.890 (0.872, 0.905) |
| Traditional stacking Ridge, KNN, CART, AdaBoost | 0.835 (0.815, 0.854) | 0.838 (0.818, 0.857) | 0.888 (0.870, 0.903) | 0.879 (0.860, 0.895) | 0.837 (0.817, 0.856) | AdaBoost: 0.902 (0.885, 0.916) |
Not all stacking configurations improved upon the strongest base classifier. Ensembles composed of highly correlated models (e.g., Ridge–KNN–CART–AdaBoost) yielded limited gains, indicating performance saturation. In contrast, heterogeneous combinations—such as NB + GBM, RF + XGB, and NB + C5.0 + GBM—achieved significant improvements (accuracy up to 0.914). The NN meta-learner consistently outperformed other meta-learners by capturing nonlinear relationships among base-model outputs.
Statistical analyses (Table 5) showed that AdaBoost remained superior to some stacking configurations, supported by significant Wilcoxon results favoring the single model.
Table 5. Results of significance tests comparing the best base classifier with the stacking model in each sub-model set.
| Compared Models | Wilcoxon Test | McNemar's Test (df = 1) | ROC Test (boot.n = 2000, boot.stratified = 1) |
|---|---|---|---|
| GBM vs stacking NN, GBM using the GBM meta-learner | V = 1982, Holm-adjusted p = 0.269; effect size r = 0.107 (small) | χ² = 0.304, p = 0.5808 | D = −1.043, p = 0.297 |
| GBM vs stacking NB, GBM using the GLM meta-learner | V = 77, Holm-adjusted p = 4.36e-16; effect size r = 0.828 (large) | χ² = 3.809, p = 0.05096 | D = 0.708, p = 0.478 |
| SVM vs stacking SVM, GBM using the GBM meta-learner | V = 2087, Holm-adjusted p = 0.133; effect size r = 0.151 (small) | χ² = 58.21, p = 2.361e-14 | D = −0.816, p = 0.414 |
| AdaBoost vs stacking KNN, AdaBoost using the GBM meta-learner | V = 5049, Holm-adjusted p < 2.2e-16; effect size r = 0.868 (large) | χ² = 3.822, p = 0.0505 | D = −7.768, p = 7.952e-15 |
| RF vs stacking RF, XGB using the Neural Network meta-learner | V = 5019, Holm-adjusted p < 2.2e-16; effect size r = 0.858 (large) | χ² = 10.75, p = 0.0010 | D = −0.446, p = 0.6552 |
| RF vs stacking RF, CART, NN, GBM, XGB, Treebag using the Random Forest meta-learner | V = 515, Holm-adjusted p = 4.87e-12; effect size r = 0.691 (large) | χ² = 12.41, p = 0.0004 | D = 0.384, p = 0.7007 |
| AdaBoost vs stacking NB, CART, GBM, AdaBoost using the GBM meta-learner | V = 5050, Holm-adjusted p < 2.2e-16; effect size r = 0.868 (large) | χ² = 3.431, p = 0.063 | D = −8.167, p = 3.161e-16 |
| C5.0 vs stacking NB, C5.0, GBM using the GBM meta-learner | V = 4922, Holm-adjusted p < 2.2e-16; effect size r = 0.824 (large) | χ² = 21.00, p = 4.581e-06 | D = −0.371, p = 0.71 |
| AdaBoost vs traditional stacking Ridge, KNN, CART, AdaBoost using the GBM meta-learner | V = 4533, Holm-adjusted p < 2.2e-16; effect size r = 0.690 (large) | χ² = 0.004, p = 0.9447 | D = −8.9812, p < 2.2e-16 |
Wilcoxon signed-rank, McNemar’s, and ROC-based tests compared each stacking model with its best-performing base learner, applying Holm-adjusted p-values to control family-wise error and reporting effect sizes (r). Several comparisons (e.g., GBM vs. the NN–GBM stack) showed small effects (r < 0.2, p > 0.05), indicating negligible practical gain. In contrast, combinations such as NB + GBM with the GLM meta-learner and RF + XGB with the NN meta-learner showed large, significant differences (r > 0.8) corresponding to meaningful accuracy gains, whereas the KNN + AdaBoost stack differed significantly from AdaBoost without surpassing it (Table 4).
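The paired tests above can be sketched as follows (a sketch only: the fold accuracies and the discordant-pair counts `b` and `c` below are made up for illustration, not taken from the study; the DeLong/bootstrap ROC test is omitted):

```python
# Wilcoxon signed-rank on paired per-fold accuracies, plus a hand-rolled
# continuity-corrected McNemar test on discordant prediction pairs.
import numpy as np
from scipy.stats import wilcoxon, chi2

# Hypothetical accuracies from 100 resampling folds (10x repeated 10-fold CV)
rng = np.random.default_rng(0)
acc_single = rng.normal(0.89, 0.01, size=100)
acc_stack = acc_single + rng.normal(0.02, 0.01, size=100)
stat, p_wilcoxon = wilcoxon(acc_single, acc_stack)

def mcnemar(b, c):
    """McNemar's chi-squared with continuity correction.
    b = cases only model 1 classified correctly; c = cases only model 2 did."""
    chi_sq = (abs(b - c) - 1) ** 2 / (b + c)
    return chi_sq, chi2.sf(chi_sq, df=1)

chi_sq, p_mcnemar = mcnemar(b=10, c=15)   # hypothetical discordant counts
```

Wilcoxon asks whether paired fold-level accuracies differ systematically; McNemar asks whether the two models' per-case errors are asymmetrically distributed, which is why both can disagree on the same comparison.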
The best-performing configuration—stacking RF and XGB with an NN meta-learner—achieved:
Accuracy: 0.914 (95% CI: 0.898–0.928)
AUC: 0.955
F1 score: 0.801
MCC: 0.746
This model outperformed both individual classifiers and other stacking variants (Tables 6–7, Fig 8). Wilcoxon tests showed large effect sizes (r > 0.5), confirming substantial performance improvements, whereas McNemar’s and DeLong’s tests indicated that some pairwise differences were not significant after correction. ROC curves (Fig 9) demonstrated strong sensitivity and specificity, and calibration plots (Fig 10) showed excellent alignment between predicted and observed outcomes.
Table 6. Performance evaluation of selected stacking models that outperform the most accurate individual algorithm in their respective combinations.
| Stacking Model | TN | FP | TP | FN | Accuracy (95% CI) | Kappa | Sensitivity | Specificity | Precision | F1 | MCC | ROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stacking NB, GBM using the GLM meta-learner | 952 | 166 | 250 | 65 | 0.838 (0.818, 0.856) | 0.577 | 0.793 | 0.851 | 0.599 | 0.683 | 0.587 | 0.885 |
| Stacking SVM, GBM using the GBM meta-learner | 1027 | 91 | 222 | 93 | 0.872 (0.853, 0.888) | 0.625 | 0.705 | 0.919 | 0.709 | 0.707 | 0.624 | 0.893 |
| Stacking RF, XGB using the Neural Network meta-learner | 1063 | 55 | 247 | 68 | 0.914 (0.898, 0.928) | 0.746 | 0.784 | 0.951 | 0.818 | 0.801 | 0.746 | 0.955 |
| Stacking RF, CART, NN, GBM, XGB, Treebag using the Random Forest meta-learner | 1061 | 57 | 241 | 74 | 0.909 (0.892, 0.923) | 0.728 | 0.765 | 0.949 | 0.809 | 0.786 | 0.728 | 0.949 |
| Stacking NB, C5.0, GBM using the GBM meta-learner | 1068 | 50 | 238 | 77 | 0.911 (0.895, 0.926) | 0.733 | 0.756 | 0.955 | 0.826 | 0.789 | 0.734 | 0.944 |
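As a consistency check, the summary metrics in Table 6 follow directly from the reported confusion-matrix counts; for the RF–XGB stack with the NN meta-learner (TN = 1063, FP = 55, TP = 247, FN = 68):

```python
# Re-deriving Table 6 metrics from the reported confusion matrix of the
# best stacking model (values taken from the table above).
import math

TN, FP, TP, FN = 1063, 55, 247, 68
n = TN + FP + TP + FN

accuracy = (TP + TN) / n                          # ~0.914
sensitivity = TP / (TP + FN)                      # recall for "Death", ~0.784
specificity = TN / (TN + FP)                      # ~0.951
precision = TP / (TP + FP)                        # ~0.818
f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ~0.801
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))             # ~0.746
```

Each derived value matches the table to three decimal places, confirming internal consistency of the reported results.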
Table 7. Statistical comparison between stacked random forest (RF) and XGBoost (XGB) utilizing a neural network (NN) meta-learner and other stacking models.
| Compared Models | Wilcoxon Test | McNemar's Test (df = 1) | ROC Test (DeLong method) |
|---|---|---|---|
| stack.GLM.GBM.NB vs stack.NN.RF.XGB | V = 175, Holm-adjusted p < 2.2e-16; effect size r = 0.808 (large) | χ² = 58.27, p = 2.276e-14 | Z = −6.354, p = 2.099e-10 |
| stack.GBM.SVM.GBM vs stack.NN.RF.XGB | V = 969, Holm-adjusted p < 2.2e-16; effect size r = 0.535 (large) | χ² = 0.523, p = 0.469 | Z = −6.036, p = 1.578e-09 |
| stack.GBM.KNN.AdaBoost vs stack.NN.RF.XGB | V = 383, Holm-adjusted p = 1.80e-13; effect size r = 0.736 (large) | χ² = 2.472, p = 0.115 | Z = −4.485, p = 7.286e-06 |
| stack.RF.RF.CART.NN.GBM.XGB.Treebag vs stack.NN.RF.XGB | V = 5050, Holm-adjusted p < 2.2e-16; effect size r = 0.868 (large) | χ² = 0.18, p = 0.671 | Z = −2.270, p = 0.0231 |
| stack.GBM.NB.C5.0.GBM vs stack.NN.RF.XGB | V = 2728, Holm-adjusted p = 0.378; effect size r = 0.089 (small) | χ² = 2.913, p = 0.08783 | Z = −3.066, p = 0.002 |
Fig 8. Performance metrics of the selected stacked models on the training dataset.
Fig 9. ROC curves of the best-performing stacked models on the test dataset.
Stacking NB, GBM using the GLM meta-learner (stack.GLM.GBM.NB). Stacking SVM, GBM using the GBM meta-learner (stack.GBM.SVM.GBM). Stacking RF, CART, NN, GBM, XGB, Treebag using the Random Forest meta-learner (stack.RF.RF.CART.NN.GBM.XGB.Treebag). Stacking RF, XGB using the Neural Network meta-learner (stack.NN.RF.XGB). Stacking NB, C5.0, GBM using the GBM meta-learner (stack.GBM.NB.C5.0.GBM).
Fig 10. Calibration plot of the stacked Random Forest (RF) and XGBoost (XGB) model using a Neural Network (NN) meta-learner under repeated 10-fold cross-validation.
3.6 Computational complexity and training time
Computational complexity was assessed on a standard workstation (Intel Core i5 M520 @ 2.40 GHz, 8 GB RAM, Windows 10, 64-bit). Training times varied across models: XGB required ~98 s, RF ~306 s, and AdaBoost ~869 s. The stacked RF–XGB–NN model required ~450 s, faster than AdaBoost despite its greater complexity (Table 8).
Table 8. Training and prediction times for single models and stacking ensembles.
| Models | Training Time (sec) | Prediction Time per Patient (sec) |
|---|---|---|
| AdaBoost | 868.6 | 0.70 |
| Random Forest (RF) | 306.5 | 0.14 |
| XGBoost (XGB) | 98.0 | 0.01 |
| Stacked RF + XGB + NN | 450.2 | 0.17 |
Inference times were uniformly low, ranging from 0.01 s per patient (XGB) to 0.70 s (AdaBoost). The stacked model achieved 0.17 s per prediction, supporting deployment in near real-time clinical settings through electronic decision-support tools.
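Wall-clock training and per-case inference times like those in Table 8 can be measured with a simple timing harness (a sketch; the model, data size, and hardware here are illustrative, not those of the study):

```python
# Measuring training time and single-case inference latency with a
# monotonic wall clock.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)

t0 = time.perf_counter()
model.fit(X, y)
train_time = time.perf_counter() - t0          # seconds to train

t0 = time.perf_counter()
model.predict(X[:1])                           # one "patient" at a time
per_case_time = time.perf_counter() - t0       # seconds per prediction
```

For latency claims like "<0.2 s per patient", averaging many single-case predictions gives a more stable estimate than a single call.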
3.7 Model Interpretation
Feature importance analysis of the stacked RF–XGB–NN ensemble identified age as the most influential predictor of mortality, demonstrating strong predictive stability (variability ±0.06, permutation error 0.237) (Fig 11, S5 Table in S1 File).
Fig 11. Most influential predictors contributing to “Death” outcomes in the stacked RF–XGB model with an NN meta-learner.
Neutrophil count (NEUT), phosphorus levels, and oxygen saturation while on a ventilator (O2sat.Ventilator) followed as critical predictors, reflecting infection severity, metabolic status, and respiratory function, respectively. Additional features such as lactate and ferritin contributed meaningfully, consistent with their roles in sepsis, inflammation, and critical illness.
Model-agnostic SHAP analysis, applied to the final stacked model, decomposed individual predictions into feature-level contributions. SHAP analysis indicated that for the “Death” class, advancing age (φ = –0.18), reduced O₂sat.Ventilator (φ = –0.02), and elevated NEUT (φ = –0.08) were dominant mortality drivers (Fig 12, S6 Table in S1 File). Reduced sodium (NA, φ = –0.08) and phosphorus (P, φ = –0.04) also contributed to poor outcomes, whereas higher lactate levels had a modest positive contribution (φ = 0.02). Features such as muscle pain, taste/smell disturbances, CKD, FBS, ESR, ferritin, and D-dimer had minimal SHAP contributions.
Fig 12. SHAP-based interpretation of the stacked RF–XGB model using a Neural Network meta-learner.
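The SHAP attributions above rest on the Shapley decomposition, which for a small feature count can be computed exactly by enumerating feature subsets. This is a conceptual sketch only: the model `f`, inputs, and baseline below are hypothetical, and SHAP libraries approximate this sum efficiently rather than enumerating it.

```python
# Exact Shapley values by subset enumeration; "absent" features are held
# at a baseline value, as in interventional SHAP.
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """phi_i for model f at point x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley weight |S|! (n-|S|-1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy linear model: the Shapley value of feature i is w_i * (x_i - baseline_i)
f = lambda v: 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[2]
phi = shapley_values(f, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
# phi == [2.0, -2.0, 1.5]
```

The attributions sum to f(x) − f(baseline), which is the additivity property that makes SHAP summaries like Fig 12 interpretable as per-feature shares of a prediction.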
Interaction analysis (Fig 13, S7 Table in S1 File) showed that age exhibited strong interactions with O₂sat.Ventilator, NEUT, phosphorus, and ferritin, highlighting their synergistic effects on mortality risk. Together, these findings demonstrate that the stacking ensemble not only improves predictive accuracy but also maintains meaningful clinical interpretability.
Fig 13. Interaction effects between age and key clinical predictors influencing mortality in the stacked RF–XGB model with an NN meta-learner.
4 Discussion
This study introduced a hybrid feature-selection and diversity-guided stacking framework designed to improve predictive accuracy, interpretability, and computational efficiency in high-dimensional clinical data. Although demonstrated on a large cohort of 4,778 COVID-19 patients, the proposed approach is broadly applicable to biomedical, environmental, and engineering domains that require scalable and transparent ensemble learning.
Our framework integrates hybrid feature selection—combining VIF, ANOVA, SBE, and Lasso—with a diversity-based stacking strategy that systematically quantifies inter-model complementarity using both pairwise (e.g., Yule’s Q, Disagreement, Kappa) and non-pairwise (e.g., Entropy, Kohavi–Wolpert) diversity measures. This approach directly addresses major limitations of prior COVID-19 prognostic models, including small sample sizes, poor calibration, and redundant base learners [29–31]. It also mitigates the trade-offs observed in traditional models, wherein improved predictive performance often comes at the cost of computational demand or reduced interpretability—limitations frequently encountered in deep neural networks and boosting algorithms [34–37].
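One step of this pipeline, VIF screening, can be sketched briefly (a sketch with synthetic data, not the study's code; VIF = 1/(1 − R²), where R² comes from regressing each feature on the remaining ones, and the cutoff of 10 is a common rule of thumb, not necessarily the study's threshold):

```python
# Variance Inflation Factor screening for multicollinearity.
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_scores(X):
    """VIF per column of X (n_samples x n_features)."""
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        # R^2 of regressing feature i on all remaining features
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)     # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

vifs = vif_scores(X)
keep = vifs < 10                         # drop highly collinear features
```

The collinear pair (x1, x3) inflates each other's VIF far above the cutoff, while the independent x2 stays near 1; the remaining pipeline stages (ANOVA, SBE, Lasso) then prune the surviving features further.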
A key advantage of this study is the large sample size, which is substantially greater than that of many prior investigations. This enhances both the stability and generalizability of our findings. A multicenter study in Tehran reported a case fatality rate (CFR) of 10.05% across 19 hospitals [73]. Consistent with guidelines recommending at least 20 events per predictor variable for outcomes with low prevalence [74], we determined that a minimum of 3,000 patients would be required to reliably identify significant mortality predictors. Our dataset exceeded this threshold, reducing the risk of overfitting and supporting the robustness of the derived model.
Robust scaling for standardization and ROSE-based resampling for class balance were essential to model reliability; class imbalance is a common barrier in predictive modeling [75], particularly in the context of COVID-19 [76]. The original training dataset exhibited pronounced class imbalance, with “Death” cases constituting only 22% of the cohort. This imbalance skewed model learning toward the majority “Alive” class, inflating accuracy but suppressing sensitivity, a critical limitation in mortality prediction. ROSE resampling created a fully balanced 1:1 dataset, substantially improving sensitivity and enabling models to better recognize minority-class (Death) cases. As expected, this rebalancing slightly reduced specificity due to the presence of synthetic samples. Although ROSE improves minority-class recognition and stabilizes cross-validation metrics, synthetic oversampling may introduce mild calibration shifts or artificial patterns, as noted in previous studies [77,78]. To mitigate this, final model performance was strictly evaluated on the original unbalanced test set, ensuring that reported results reflect real clinical distributions rather than synthetic balance.
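The smoothed-bootstrap idea behind ROSE can be illustrated in a few lines (a rough sketch: ROSE chooses its kernel bandwidth from the data, whereas the `shrink` factor and synthetic data here are arbitrary assumptions):

```python
# ROSE-like oversampling: bootstrap minority-class rows, then jitter them
# with Gaussian noise so synthetic cases lie near observed ones rather
# than duplicating them exactly.
import numpy as np

def rose_like_oversample(X_min, n_new, shrink=0.25, seed=0):
    """Draw n_new synthetic minority samples via a smoothed bootstrap."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_min), size=n_new)   # bootstrap row indices
    bandwidth = shrink * X_min.std(axis=0)          # crude per-feature bandwidth
    noise = rng.normal(0.0, 1.0, size=(n_new, X_min.shape[1])) * bandwidth
    return X_min[idx] + noise

rng = np.random.default_rng(1)
X_death = rng.normal(loc=5.0, size=(100, 3))        # toy minority class
X_synth = rose_like_oversample(X_death, n_new=250)  # balance toward 1:1
```

Because the noise is centered on observed cases, the synthetic sample preserves the minority-class distribution's location while smoothing it, which is also why, as noted above, balanced training data must still be evaluated on the original unbalanced test set.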
Through iterative refinement, the dataset was reduced from 115 to 15 clinically interpretable features—such as age, neutrophil count, ferritin, and O₂ saturation—all of which are routinely available in electronic health records. These variables span immunologic, metabolic, and respiratory domains, collectively capturing key biological mechanisms underlying severe COVID-19 outcomes.
Traditional stacking methods that use highly correlated base learners provided only limited performance gains, confirming that redundancy constrains the potential benefits of ensemble modeling. In contrast, the proposed diversity-guided stacking approach achieved superior accuracy and generalization by leveraging heterogeneous classifiers with complementary error patterns.
For instance, Ribeiro et al. [79] demonstrated that stacking models can enhance predictive performance in COVID-19 outcomes. Their stack-ensemble model, which incorporated support vector regression, effectively forecasted mortality among 14,267 COVID-19 patients in Brazil. Similarly, our findings reinforce the notion that combining strong, heterogeneous classifiers through stacking is more effective than relying on a single best-performing classifier [53].
This observation aligns with emerging literature on ensemble learning. Hussain et al. [80] highlighted the superiority of hybrid classifier systems in improving prediction accuracy, and further reported an impressive AUC of 96.0% using a deep stacking neural network to predict mortality risk. Together, these studies support the effectiveness of diversity-aware ensemble approaches in high-stakes biomedical prediction.
Our stacking model significantly outperformed established machine learning approaches reported in the literature. For example, Yakovyna et al. [28] applied a combination of supervised and unsupervised learning techniques but did not achieve the level of discrimination observed in our stacking model. Rahmatinejad et al. [29] reported high Brier scores for Random Forest and improved precision and sensitivity for XGBoost; however, our RF–XGB–NN stacking configuration provided substantially higher accuracy and AUC, supported by rigorous statistical analyses, including Wilcoxon, McNemar, and DeLong tests. Furthermore, compared to a study using over 500 EHR variables to train an RF model for sepsis mortality prediction [81], our stacking method achieved notably better calibration and discrimination while maintaining computational efficiency.
Our findings also outperform those of de Paiva et al. [82], who analyzed 10,897 COVID-19 patients using various machine learning models—including FNet transformers, convolutional neural networks, support vector machines, LightGBM, and traditional statistical approaches such as LASSO and Generalized Additive Models (GAM). Their best models achieved an AUROC of 0.826 and a MacroF1 score of 65.4%. In contrast, our stacking framework delivered an F1 score of 80.1% and AUC of 0.955, demonstrating substantially improved predictive accuracy and reliability.
The Neural Network meta-learner emerged as the most effective combiner of base-model outputs. Neural meta-learning has previously been shown to outperform linear or tree-based approaches by capturing higher-order, nonlinear relationships among model outputs, particularly in biomedical prediction tasks [83–85]. Our cross-validation results confirmed this, with the NN meta-learner providing superior discrimination and calibration across diverse sub-model sets.
Feature importance analysis identified age as the strongest predictor of mortality, followed by neutrophil count (NEUT), phosphorus levels, oxygen saturation (O₂sat.Ventilator), and lactate. Elevated NEUT counts have been widely associated with severe disease and cytokine storm responses, while reduced oxygen saturation is a direct marker of respiratory compromise [86,87]. SHAP analysis further showed that advancing age and high NEUT levels significantly increased mortality risk. These results reinforce established clinical findings that older age and immune dysregulation critically influence COVID-19 severity [88].
Beyond performance metrics, the proposed framework emphasizes computational efficiency and real-world applicability. Although training complexity increased relative to single models, the computation remained manageable and feasible for operational clinical environments. Inference latency (<0.2 seconds per prediction) is sufficiently low for real-time or near-real-time decision support, including bedside applications and automated triage systems.
Overall, this study presents a generalizable, diversity-aware ensemble framework that balances accuracy, interpretability, and computational efficiency. While validated for COVID-19 mortality prediction, the approach is adaptable to broader biomedical domains where heterogeneous data, transparency, and performance stability are critical.
5 Limitations and biases
This study has several limitations that should be considered when interpreting the findings. First, the retrospective EHR-based design may introduce selection and information biases, as data collection depended on available hospital records, which may not fully capture all relevant clinical variables [89]. Retrospective EHR studies are particularly vulnerable to incomplete or non-standardized data, including missing values, heterogeneous logging practices, and inconsistent variable definitions across hospitals [90]. Moreover, implicit clinician bias, referral patterns, and disparities in diagnosis or treatment may have influenced the data used for training, thereby propagating systemic inequities into predictive algorithms [91].
Second, the dataset was limited to three Iranian hospitals, which constrains the generalizability and external validity of the model. Differences in genetic, sociodemographic, cultural, and healthcare system characteristics across populations may influence both feature distributions and outcome risks, and prior reviews have shown that COVID-19 prediction models often perform poorly outside their development setting [92]. For example, variations in comorbidity prevalence, access to intensive care resources, and laboratory reference ranges may affect model performance in non-Iranian populations. Without independent external validation, our results should be interpreted cautiously when applied elsewhere. Recent reviews further emphasize that prediction models can never be considered fully “validated,” as transportability depends on population, setting, and temporal context [78,93]. Accordingly, validation in multinational, diverse cohorts is essential before clinical translation.
Third, although we applied resampling (ROSE) to mitigate class imbalance, oversampling approaches may embed artificial patterns that distort calibration or inflate predictive accuracy. This limitation has been repeatedly recognized in both COVID-19 studies and broader clinical prognostic modeling research [77,78].
Fourth, although ensemble learning (stacking) enhanced predictive performance, it introduced computational complexity and challenges for clinical deployment. Even with SHAP-based interpretability, ensemble models remain partially opaque, and post-hoc explanations represent approximations rather than causal insights. This raises concerns about clinical trust and automation bias, particularly if model outputs are adopted uncritically [93].
Fifth, our models did not incorporate unmeasured contextual confounders such as evolving treatment regimens, viral variants, or social determinants of health. As shown in previous literature, omission of such factors can bias effect estimates and limit real-world applicability [91]. Relatedly, model performance drift is a potential risk, as COVID-19 epidemiology, therapeutic strategies, and patient characteristics have evolved over time, necessitating ongoing monitoring and recalibration [78].
Finally, several design-related limitations warrant consideration. Statistical literature highlights that non-random sampling and limited site representativeness reduce the generalizability of predictive models, particularly when outcome heterogeneity is present [92]. Furthermore, most machine learning studies—including our own—focus primarily on discrimination metrics (e.g., AUC), with less emphasis on calibration and fairness assessments, which limits their clinical interpretability and adoption [93].
Despite these limitations, we employed rigorous validation strategies, including repeated cross-validation, calibration evaluation, and effect size reporting, consistent with TRIPOD recommendations [94]. These measures enhance robustness and transparency; however, independent, prospective, multi-center validation in larger and more diverse populations remains essential before clinical implementation.
6 Conclusion
In conclusion, this study demonstrates that a diversity-guided stacking ensemble—integrating Random Forest, XGBoost, and a Neural Network meta-learner—can achieve high predictive accuracy and interpretability for COVID-19 mortality risk. By combining a hybrid feature selection pipeline with heterogeneous base learners, the framework effectively reduced redundancy, captured nonlinear interactions, and maintained computational efficiency suitable for near real-time deployment.
Key predictors such as age, neutrophil count, phosphorus, and oxygen saturation consistently aligned with known clinical mechanisms of severe COVID-19, supporting both the statistical and biological validity of the model. SHAP-based interpretation further illustrated how interactions among these variables shape mortality risk, helping to bridge predictive performance with clinical insight.
Nevertheless, this work has several limitations, including its retrospective design, reliance on data from a single regional health system, and the absence of external validation, all of which may restrict generalizability. Future research should extend this framework to multi-center or multi-disease cohorts, integrate multimodal data sources (e.g., imaging, genomics), and evaluate real-time performance in prospective clinical environments.
Ultimately, the proposed stacking strategy represents a scalable and interpretable modeling paradigm that can be readily adapted to a wide range of clinical prediction tasks beyond COVID-19, advancing the application of ensemble learning for precision medicine and healthcare decision support.
Supporting information
S1 File. This file contains supplementary tables, figures, and additional analyses, including correlation matrices, model tuning parameters, performance summaries, feature importance results, interaction analyses, and SHAP outputs.
(DOCX)
Acknowledgments
This article was part of the Ph.D. dissertation in epidemiology at Shahid Beheshti University of Medical Sciences (SBMU).
Data Availability
The dataset analyzed in this study contains sensitive patient information and cannot be shared publicly due to confidentiality restrictions. Access to de-identified data can be requested through the Shahid Beheshti University of Medical Sciences Ethics Committee (IR.SBMU.RIGLD.REC.1401.032) at urm@sbmu.ac.ir, which will review and, if appropriate, authorize data release. The corresponding author can provide further details on the application process. All analysis codes developed in this study are available from the corresponding author upon request.
Funding Statement
The author(s) received no specific funding for this work.
References
- 1. Das K, Behera RN. A survey on machine learning: concept, algorithms and applications. International Journal of Innovative Research in Computer and Communication Engineering. 2017;5(2):1301–9.
- 2. Windeatt T, Ghaderi R. Binary labelling and decision-level fusion. Information Fusion. 2001;2(2):103–12. doi: 10.1016/s1566-2535(01)00029-x
- 3. Ghasemieh A, Lloyed A, Bahrami P, Vajar P, Kashef R. A novel machine learning model with Stacking Ensemble Learner for predicting emergency readmission of heart-disease patients. Decision Analytics Journal. 2023;7:100242. doi: 10.1016/j.dajour.2023.100242
- 4. Berliana AU, Bustamam A. Implementation of Stacking Ensemble Learning for Classification of COVID-19 using Image Dataset CT Scan and Lung X-Ray. In: 2020 3rd International Conference on Information and Communications Technology (ICOIACT), 2020. 148–52. doi: 10.1109/icoiact50329.2020.9332112
- 5. Khyani D, Jakkula S, Gowda S, Anusha KJ, Swetha KR. An interpretation of stacking and blending approach in machine learning. Int Res J Eng Technol. 2021;8(07).
- 6. Mienye ID, Sun Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access. 2022;10:99129–49. doi: 10.1109/access.2022.3207287
- 7. Graczyk M, Lasota T, Trawiński B, Trawiński K. Comparison of bagging, boosting and stacking ensembles applied to real estate appraisal. In: Intelligent Information and Database Systems: Second International Conference, ACIIDS, Hue City, Vietnam, March 24-26, 2010 Proceedings, Part II, 2010.
- 8. Feng Y, Li B, Fu R, Hao Y, Wang T, Guo H, et al. A simplified coronary model for diagnosis of ischemia-causing coronary stenosis. Comput Methods Programs Biomed. 2023;242:107862. doi: 10.1016/j.cmpb.2023.107862
- 9. Wang X, Bakulski KM, Mukherjee B, Hu H, Park SK. Predicting cumulative lead (Pb) exposure using the Super Learner algorithm. Chemosphere. 2023;311(Pt 2):137125. doi: 10.1016/j.chemosphere.2022.137125
- 10. Xu W, He H, Guo Z, Li W. Evaluation of machine learning models on protein level inference from prioritized RNA features. Brief Bioinform. 2022;23(3):bbac091. doi: 10.1093/bib/bbac091
- 11. Reda R, Saffaj T, Bouzida I, Saidi O, Belgrir M, Lakssir B, et al. Optimized variable selection and machine learning models for olive oil quality assessment using portable near infrared spectroscopy. Spectrochim Acta A Mol Biomol Spectrosc. 2023;303:123213. doi: 10.1016/j.saa.2023.123213
- 12. Mohammed M, Mwambi H, Mboya IB, Elbashir MK, Omolo B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep. 2021;11(1):15626. doi: 10.1038/s41598-021-95128-x
- 13. Wang B, Liu J, Zhang X, Lin J, Li S, Wang Z, et al. A stacking ensemble framework integrating radiomics and deep learning for prognostic prediction in head and neck cancer. Radiat Oncol. 2025;20(1):127. doi: 10.1186/s13014-025-02695-8
- 14. Kwon H, Park J, Lee Y. Stacking Ensemble Technique for Classifying Breast Cancer. Healthc Inform Res. 2019;25(4):283–8. doi: 10.4258/hir.2019.25.4.283
- 15. Alhussen A, Anul Haq M, Ahmad Khan A, Mahendran RK, Kadry S. XAI-RACapsNet: Relevance aware capsule network-based breast cancer detection using mammography images via explainability O-net ROI segmentation. Expert Systems with Applications. 2025;261:125461. doi: 10.1016/j.eswa.2024.125461
- 16. Haq MA, Khan I, Ahmed A, Eldin SM, Alshehri ALI, Ghamry NA. DCNNBT: A novel deep convolution neural network-based brain tumor classification model. Fractals. 2023;31(06):2340102. doi: 10.1142/S0218348X23401023
- 17. Yousef R, Khan S, Gupta G, Siddiqui T, Albahlal BM, Alajlan SA, et al. U-Net-Based Models towards Optimal MR Brain Image Segmentation. Diagnostics (Basel). 2023;13(9):1624. doi: 10.3390/diagnostics13091624
- 18. Abualnaja SY, Morris JS, Rashid H, Cook WH, Helmy AE. Machine learning for predicting post-operative outcomes in meningiomas: a systematic review and meta-analysis. Acta Neurochir (Wien). 2024;166(1):505. doi: 10.1007/s00701-024-06344-z
- 19. Lei J, Zhai J, Zhang Y, Qi J, Sun C. Supervised Machine Learning Models for Predicting Sepsis-Associated Liver Injury in Patients With Sepsis: Development and Validation Study Based on a Multicenter Cohort Study. J Med Internet Res. 2025;27:e66733. doi: 10.2196/66733
- 20. Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Shankar SV, Coppi A, et al. Ensemble deep learning algorithm for structural heart disease screening using electrocardiographic images: PRESENT SHD. Journal of the American College of Cardiology. 2025;85(12):1302–13.
- 21. Tseng PY, Chen YT, Wang CH, Chiu KM, Peng YS, Hsu SP. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Critical Care. 2020;24(1):478.
- 22. Sawesi S, Jadhav A, Rashrash B. Machine Learning and Deep Learning Techniques for Prediction and Diagnosis of Leptospirosis: Systematic Literature Review. JMIR Med Inform. 2025;13:e67859. doi: 10.2196/67859
- 23. Chiasakul T, Lam BD, McNichol M, Robertson W, Rosovsky RP, Lake L, et al. Artificial intelligence in the prediction of venous thromboembolism: A systematic review and pooled analysis. Eur J Haematol. 2023;111(6):951–62. doi: 10.1111/ejh.14110
- 24. Cramer EY, Ray EL, Lopez VK, Bracher J, Brennen A, Castro Rivadeneira AJ, et al. Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proc Natl Acad Sci U S A. 2022;119(15):e2113561119. doi: 10.1073/pnas.2113561119
- 25. Dessie ZG, Zewotir T. Mortality-related risk factors of COVID-19: a systematic review and meta-analysis of 42 studies and 423,117 patients. BMC Infect Dis. 2021;21(1):855. doi: 10.1186/s12879-021-06536-3
- 26. Cui S, Wang Y, Wang D, Sai Q, Huang Z, Cheng TCE. A two-layer nested heterogeneous ensemble learning predictive method for COVID-19 mortality. Appl Soft Comput. 2021;113:107946. doi: 10.1016/j.asoc.2021.107946
- 27. Li J, Li X, Hutchinson J, Asad M, Liu Y, Wang Y, et al. An ensemble prediction model for COVID-19 mortality risk. Biol Methods Protoc. 2022;7(1):bpac029. doi: 10.1093/biomethods/bpac029
- 28. Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. doi: 10.1038/s41598-024-60637-y
- 28.Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. doi: 10.1038/s41598-024-60637-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Rahmatinejad Z, Dehghani T, Hoseini B, Rahmatinejad F, Lotfata A, Reihani H, et al. A comparative study of explainable ensemble learning and logistic regression for predicting in-hospital mortality in the emergency department. Sci Rep. 2024;14(1):3406. doi: 10.1038/s41598-024-54038-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ. 2020;369. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Jamshidi MB, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, et al. Artificial Intelligence and COVID-19: Deep Learning Approaches for Diagnosis and Treatment. IEEE Access. 2020;8:109581–95. doi: 10.1109/ACCESS.2020.3001973 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Sperrin M, Grant SW, Peek N. Prediction models for diagnosis and prognosis in Covid-19. BMJ. 2020. [DOI] [PubMed] [Google Scholar]
- 33.Banoei MM, Dinparastisaleh R, Zadeh AV, Mirsaeidi M. Machine-learning-based COVID-19 mortality prediction model and identification of patients at low and high risk of dying. Crit Care. 2021;25(1):328. doi: 10.1186/s13054-021-03749-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Saxena A, Nixon B, Boyd A, Evans J, Faraone SV. A systematic review of the application of graph neural networks to extract candidate genes and biological associations. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2025;:e33031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ji C, Yu N, Wang Y, Ni J, Zheng C. SGLMDA: A Subgraph Learning-Based Method for miRNA-Disease Association Prediction. IEEE/ACM Trans Comput Biol Bioinform. 2024;21(5):1191–201. doi: 10.1109/TCBB.2024.3373772 [DOI] [PubMed] [Google Scholar]
- 36.Wang J, Li J, Yue K, Wang L, Ma Y, Li Q. NMCMDA: neural multicategory MiRNA-disease association prediction. Brief Bioinform. 2021;22(5):bbab074. doi: 10.1093/bib/bbab074 [DOI] [PubMed] [Google Scholar]
- 37.Li J, Lin H, Wang Y, Li Z, Wu B. Prediction of potential small molecule-miRNA associations based on heterogeneous network representation learning. Front Genet. 2022;13:1079053. doi: 10.3389/fgene.2022.1079053 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Riley RD, Ensor J, Snell KIE, Archer L, Whittle R, Dhiman P, et al. Importance of sample size on the quality and utility of AI-based prediction models for healthcare. Lancet Digit Health. 2025;7(6):100857. doi: 10.1016/j.landig.2025.01.013 [DOI] [PubMed] [Google Scholar]
- 39.Yang Y, Khorshidi HA, Aickelin U. A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems. Front Digit Health. 2024;6:1430245. doi: 10.3389/fdgth.2024.1430245 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Salmi M, Atif D, Oliva D, Abraham A, Ventura S. Handling imbalanced medical datasets: review of a decade of research. Artif Intell Rev. 2024;57(10). doi: 10.1007/s10462-024-10884-2 [DOI] [Google Scholar]
- 41.Malhotra R, Khanna M. Particle swarm optimization-based ensemble learning for software change prediction. Information and Software Technology. 2018;102:65–84. doi: 10.1016/j.infsof.2018.05.007 [DOI] [Google Scholar]
- 42.Yakovyna V, Shakhovska N, Szpakowska A. A novel hybrid supervised and unsupervised hierarchical ensemble for COVID-19 cases and mortality prediction. Sci Rep. 2024;14(1):9782. doi: 10.1038/s41598-024-60637-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Tian W, Jiang W, Yao J, Nicholson CJ, Li RH, Sigurslid HH, et al. Predictors of mortality in hospitalized COVID-19 patients: A systematic review and meta-analysis. J Med Virol. 2020;92(10):1875–83. doi: 10.1002/jmv.26050 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.de Paiva BBM, Pereira PD, de Andrade CMV, Gomes VMR, Souza-Silva MVR, Martins KPMP, et al. Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset. Sci Rep. 2023;13(1):3463. doi: 10.1038/s41598-023-28579-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Hatamabadi H, Sabaghian T, Sadeghi A, Heidari K, Safavi-Naini SAA, Looha MA. Epidemiology of COVID-19 in Tehran, Iran: A cohort study of clinical profile, risk factors, and outcomes. BioMed Research International. 2022;2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Klomp T. Iterative imputation in Python: a study on the performance of the package IterativeImputer. Utrecht University; 2022.
- 47. Barough SS, Safavi-Naini SAA, Siavoshi F, Tamimi A, Ilkhani S, Akbari S, et al. Generalizable machine learning approach for COVID-19 mortality risk prediction using on-admission clinical and laboratory features. Sci Rep. 2023;13(1):2399. doi: 10.1038/s41598-023-28943-z
- 48. Sharma V. A study on data scaling methods for machine learning. Int J Global Acad Sci Res. 2022;1(1). doi: 10.55938/ijgasr.v1i1.4
- 49. Mishra S, Pradhan RK. Analyzing the impact of feature correlation on classification accuracy of machine learning model. 2023.
- 50. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014;40(1):16–28. doi: 10.1016/j.compeleceng.2013.11.024
- 51. Alin A. Multicollinearity. Wiley Interdiscip Rev Comput Stat. 2010;2(3):370–4.
- 52. Daoud JI. Multicollinearity and regression analysis. J Phys Conf Ser. 2017.
- 53. Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. CRC Press; 2020.
- 54. Moorthy U, Gandhi UD. RETRACTED ARTICLE: a novel optimal feature selection technique for medical data classification using ANOVA based whale optimization. J Ambient Intell Human Comput. 2020;12(3):3527–38. doi: 10.1007/s12652-020-02592-w
- 55. Ladha L, Deepa T. Feature selection methods and algorithms. International Journal on Computer Science and Engineering (IJCSE). 2023;55.
- 56. Okser S, Pahikkala T, Airola A, Salakoski T, Ripatti S, Aittokallio T. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet. 2014;10(11):e1004754. doi: 10.1371/journal.pgen.1004754
- 57. Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28(5):1–26.
- 58. Kuhn M. Variable selection using the caret package. 2012.
- 59. Berrar D. Cross-validation. In: Encyclopedia of Bioinformatics and Computational Biology. Elsevier; 2019. p. 542–5.
- 60. Bottino F, Tagliente E, Pasquini L, Napoli AD, Lucignani M, Figà-Talamanca L, et al. COVID mortality prediction with machine learning methods: a systematic review and critical appraisal. J Pers Med. 2021;11(9):893. doi: 10.3390/jpm11090893
- 61. Brownlee J. Machine learning mastery with R: get started, build accurate models and work through projects step-by-step. Machine Learning Mastery; 2016.
- 62. Tattar PN. Hands-on ensemble learning with R: a beginner's guide to combining the power of machine learning algorithms using ensemble techniques. Packt Publishing Ltd; 2018.
- 63. Kuncheva LI. Combining pattern classifiers: methods and algorithms. John Wiley & Sons; 2014.
- 64. Wang L, Mo T, Wang X, Chen W, He Q, Li X, et al. A hierarchical fusion framework to integrate homogeneous and heterogeneous classifiers for medical decision-making. Knowl Based Syst. 2021;212:106517. doi: 10.1016/j.knosys.2020.106517
- 65. Sullivan GM, Feinn R. Using effect size-or why the P value is not enough. J Grad Med Educ. 2012;4(3):279–82. doi: 10.4300/JGME-D-12-00156.1
- 66. Fritz CO, Morris PE, Richler JJ. Effect size estimates: current use, calculations, and interpretation. J Exp Psychol Gen. 2012;141(1):2–18. doi: 10.1037/a0024338
- 67. Zhu Y, Guo W. Family-wise error rate controlling procedures for discrete data. Stat Biopharm Res. 2019;12(1):117–28. doi: 10.1080/19466315.2019.1654912
- 68. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC. Package 'pROC'. 2021.
- 69. Kassambara A. Comparing groups: numerical variables. Sydney, Australia: Datanovia; 2019.
- 70. Molnar C, Schratz P. Package 'iml'. R CRAN. 2020.
- 71. Roth AE. The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press; 1988.
- 72. Lunardon N, Menardi G, Torelli N. ROSE: a package for binary imbalanced learning. R J. 2014;6(1).
- 73. Zali A, Gholamzadeh S, Mohammadi G, Azizmohammad Looha M, Akrami F, Zarean E. Baseline characteristics and associated factors of mortality in COVID-19 patients; an analysis of 16000 cases in Tehran, Iran. Arch Acad Emerg Med. 2020;8(1):e70.
- 74. Ogundimu EO, Altman DG, Collins GS. Adequate sample size for developing prediction models is not simply related to events per variable. J Clin Epidemiol. 2016;76:175–82. doi: 10.1016/j.jclinepi.2016.02.031
- 75. Chamseddine E, Mansouri N, Soui M, Abed M. Handling class imbalance in COVID-19 chest X-ray images classification: using SMOTE and weighted loss. Appl Soft Comput. 2022;129:109588. doi: 10.1016/j.asoc.2022.109588
- 76. Javidi M, Abbaasi S, Naybandi Atashi S, Jampour M. COVID-19 early detection for imbalanced or low number of data using a regularized cost-sensitive CapsNet. Sci Rep. 2021;11(1):18478. doi: 10.1038/s41598-021-97901-4
- 77. de Jong VMT, Rousset RZ, Antonio-Villa NE, Buenen AG, Van Calster B, Bello-Chavolla OY, et al. Clinical prediction models for mortality in patients with covid-19: external validation and individual participant data meta-analysis. BMJ. 2022;378:e069881. doi: 10.1136/bmj-2021-069881
- 78. Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1):70. doi: 10.1186/s12916-023-02779-w
- 79. Ribeiro MHDM, da Silva RG, Mariani VC, Coelho LDS. Short-term forecasting COVID-19 cumulative confirmed cases: perspectives for Brazil. Chaos Solitons Fractals. 2020;135:109853. doi: 10.1016/j.chaos.2020.109853
- 80. Hussain S, Songhua X, Aslam MU, Hussain F. Clinical predictions of COVID-19 patients using deep stacking neural networks. J Investig Med. 2024;72(1):112–27. doi: 10.1177/10815589231201103
- 81. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78. doi: 10.1111/acem.12876
- 82. de Paiva BBM, Pereira PD, de Andrade CMV, Gomes VMR, Souza-Silva MVR, Martins KPMP, et al. Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset. Sci Rep. 2023;13(1):3463. doi: 10.1038/s41598-023-28579-z
- 83. An N, Ding H, Yang J, Au R, Ang TFA. Deep ensemble learning for Alzheimer's disease classification. J Biomed Inform. 2020;105:103411. doi: 10.1016/j.jbi.2020.103411
- 84. Gupta A, Jain V, Singh A. Stacking ensemble-based intelligent machine learning model for predicting post-COVID-19 complications. New Gener Comput. 2022;40(4):987–1007. doi: 10.1007/s00354-021-00144-0
- 85. Kablan R, Miller HA, Suliman S, Frieboes HB. Evaluation of stacked ensemble model performance to predict clinical outcomes: a COVID-19 study. Int J Med Inform. 2023;175:105090. doi: 10.1016/j.ijmedinf.2023.105090
- 86. Liu Y, Du X, Chen J, Jin Y, Peng L, Wang HHX, et al. Neutrophil-to-lymphocyte ratio as an independent risk factor for mortality in hospitalized patients with COVID-19. J Infect. 2020;81(1):e6–12. doi: 10.1016/j.jinf.2020.04.002
- 87. Wu C, Chen X, Cai Y, Xia J, Zhou X, Xu S, et al. Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease 2019 pneumonia in Wuhan, China. JAMA Intern Med. 2020;180(7):934–43. doi: 10.1001/jamainternmed.2020.0994
- 88. Xu W, Sun N-N, Gao H-N, Chen Z-Y, Yang Y, Ju B, et al. Risk factors analysis of COVID-19 patients with ARDS and prediction based on machine learning. Sci Rep. 2021;11(1):2933. doi: 10.1038/s41598-021-82492-x
- 89. Sedgwick P. Retrospective cohort studies: advantages and disadvantages. BMJ. 2014;348:g1072. doi: 10.1136/bmj.g1072
- 90. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361:k1479. doi: 10.1136/bmj.k1479
- 91. Perets O, Stagno E, Yehuda EB, McNichol M, Anthony Celi L, Rappoport N, et al. Inherent bias in electronic health records: a scoping review of sources of bias. medRxiv. 2024. doi: 10.1101/2024.04.09.24305594
- 92. Degtiar I, Rose S. A review of generalizability and transportability. Annu Rev Stat Appl. 2023;10(1):501–24. doi: 10.1146/annurev-statistics-042522-103837
- 93. Collins GS, Dhiman P, Ma J, Schlussel MM, Archer L, Van Calster B, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ. 2024;384:e074819. doi: 10.1136/bmj-2023-074819
- 94. Collins GS, Reitsma JB, Altman DG, Moons KGM, TRIPOD Group. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Circulation. 2015;131(2):211–9. doi: 10.1161/CIRCULATIONAHA.114.014508