Abstract
Accurate early-stage diagnosis of tumours is crucial for improving patient prognosis. Modern machine learning techniques provide advanced and effective tools to support this process. In this study, both base classifiers and ensemble models were compared in the task of predicting tumour type from morphometric data. Particular attention was given to the CatBoost model as well as the Voting and Stacking classifiers. Model performance was evaluated comprehensively using standard diagnostic metrics and confusion matrices. To enhance interpretability, the SHAP framework was employed to assess the contribution of individual features to model predictions. The CatBoost model stood out due to its ability to provide explainable results through feature importance analysis and SHAP values, which highlighted the key role of parameters related to tumour size and border irregularity. The Voting Classifier improved stability and reduced variance, significantly lowering the number of false negative errors – a factor of particular clinical importance in oncological diagnostics. The Stacking Classifier achieved the highest overall predictive performance, minimising both false positive and false negative classifications by integrating heterogeneous base learners with a meta-classifier. The findings of this study confirm that interpretable ensemble methods represent a valuable approach to supporting the diagnostic process in neuro-oncology. The methodology applied here not only increases the reliability and transparency of artificial intelligence systems but also demonstrates potential for application in treatment monitoring and in predicting the risk of tumour recurrence.
Keywords: Ensemble learning, Explainable artificial intelligence (XAI), Tumor classification, SHAP
Subject terms: Cancer, Computational biology and bioinformatics, Mathematics and computing, Medical research, Oncology
Introduction
Tumours pose a significant diagnostic challenge for contemporary medicine. The number of patients undergoing oncological treatment continues to rise annually, which underscores the growing importance of accurate and timely diagnostics. Rapid and reliable determination of the tumour type is crucial for selecting the appropriate therapeutic strategy and planning the entire treatment process. Advanced imaging techniques, particularly magnetic resonance imaging (MRI), provide a rich set of morphometric data to support this process. However, image interpretation remains inherently subjective, requiring significant clinical expertise and experience. This subjectivity, combined with the increasing volume of diagnostic data, creates a demand for computational methods capable of supporting physicians in the diagnostic workflow.
With the advancement of computational methods and the growing availability of medical datasets, machine learning (ML) has emerged as a powerful tool for the detection and classification of pathological lesions. In particular, ML models can leverage subtle morphometric patterns that may not be apparent during routine visual assessment. This makes them a promising complement to radiological practice, particularly in neuro-oncology, where the differentiation between benign and malignant lesions is both diagnostically challenging and clinically crucial.
Among the numerous ML approaches, ensemble learning methods deserve particular attention. By integrating the outputs of multiple base models, they enhance predictive accuracy while limiting excessive adaptation to training data1. Algorithms such as CatBoost, which employs gradient boosting over decision trees, demonstrate high efficacy in the analysis of structured, tabular data and incorporate mechanisms that effectively reduce overfitting2. Similarly, voting classifiers enable the integration of diverse machine learning paradigms into a single coherent predictive model, thereby exploiting the complementarity of different algorithms. Importantly, ensemble methods have also proven to be more robust in the presence of class imbalance, a common characteristic of medical datasets, reducing the risk of overlooking rare but clinically significant malignant cases.
Despite the satisfactory predictive performance of these methods, the factors influencing their results are often not obvious. Many ML models are widely referred to as "black boxes" due to their complex internal structures and limited interpretability3. This lack of transparency poses a barrier to clinical adoption, as medical practitioners require not only accurate predictions but also insight into the rationale behind algorithmic decisions. In this context, the growing field of explainable artificial intelligence (XAI) offers methods that improve model interpretability without significantly compromising performance. Particularly noteworthy is the SHAP (Shapley Additive exPlanations) approach, which provides both global analysis of feature importance and local explanations for individual predictions. In clinical settings, such tools increase trust in algorithmic outcomes and support the integration of ML models into decision support systems.
The aim of this study is to perform a comparative analysis of the predictive performance and interpretability of ensemble approaches, with particular focus on CatBoost, the Voting Classifier, and the Stacking Classifier. The evaluation was based on a publicly available dataset of tumour morphometric features, using a set of standard metrics commonly applied in medical classification tasks. Additionally, SHAP analysis was conducted to identify the most influential diagnostic features, thereby providing both global and local interpretability of the models. This dual focus on predictive accuracy and interpretability addresses not only the technical challenges of medical classification but also the clinical need for transparent and trustworthy decision-support tools. The perspective is consistent with the systematic review by Banerjee and Paçal, who show that, in heart disease prediction, ensemble machine learning models achieve high predictive performance yet often suffer from black-box behaviour, and who therefore highlight explainable AI techniques, particularly SHAP, as crucial for enhancing model interpretability and fostering clinical trust in decision-support systems4.
Materials and methods
Dataset
The analysis was based on an open dataset available on Kaggle5. The dataset contained detailed morphometric measurements of tumours, processed to extract numerical indicators describing lesion geometry. Each observation comprised a set of features characterising the shape, structure, and textural properties of the lesion. Each record was assigned to one of two classes: benign or malignant tumour.
Data preprocessing
The initial data processing consisted of several stages. First, non-informative attributes, including sample identification numbers, were eliminated. Labels in the diagnosis column were binarised: M (malignant) values were assigned 1, and B (benign) values were assigned 0. This approach enabled the application of classification algorithms that require numerical input, which is standard practice in medical data processing6. The chosen strategy facilitated the effective use of methods such as SVM, decision trees, and ensemble models. In the subsequent stage, the dataset was divided into training (80%) and test (20%) subsets. Feature standardisation was applied, normalising each variable to a mean close to zero and a standard deviation equal to one. Banerjee et al. emphasise that a carefully designed preprocessing stage, including normalisation and standardisation of the data to a distribution with zero mean and unit standard deviation, is crucial for homogenising variables and ensuring stable training of models for medical data analysis7.
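For reproducibility, a minimal sketch of the preprocessing described above is shown below; the file name data.csv and the column names id and diagnosis follow the standard Kaggle/WDBC layout and are assumptions rather than part of the published pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the WDBC-style CSV (file name is illustrative).
df = pd.read_csv("data.csv")

# Drop the non-informative identifier column and binarise the label:
# M (malignant) -> 1, B (benign) -> 0.
df = df.drop(columns=["id"], errors="ignore")
y = df.pop("diagnosis").map({"M": 1, "B": 0})
X = df

# Stratified 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardise features: fit on the training set only to avoid leakage.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```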
Robustness under imbalance and limited data
A stratified 5-fold cross-validation setup wrapped in an imblearn Pipeline (leakage-safe) was used to compare four imbalance remedies: class weighting, SMOTE, SMOTE+Tomek, and random under-sampling. Performance remained high under skewed class ratios across all remedies. SMOTE+Tomek yielded the strongest F1 and balanced accuracy, whereas SMOTE achieved the highest ROC-AUC and PR-AUC, indicating effective minority-class detection without loss of specificity. Class weighting served as a competitive, no-resampling baseline. To probe small-sample resilience, random under-sampling, which reduces the effective training size, still produced solid results, supporting robustness under limited data.
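The following sketch illustrates a leakage-safe comparison of these remedies, with resampling confined to the training folds via an imblearn Pipeline; the logistic-regression base estimator is illustrative, not the exact configuration used in the study.

```python
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler

samplers = {
    "class_weight": None,  # no resampling; handled via the estimator below
    "smote": SMOTE(random_state=42),
    "smote_tomek": SMOTETomek(random_state=42),
    "undersample": RandomUnderSampler(random_state=42),
}

cv5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["average_precision", "roc_auc", "f1", "balanced_accuracy"]

for name, sampler in samplers.items():
    steps = [("scale", StandardScaler())]
    if sampler is not None:
        steps.append(("resample", sampler))  # applied inside each training fold only
    clf = LogisticRegression(
        max_iter=1000,
        class_weight="balanced" if sampler is None else None,
    )
    steps.append(("clf", clf))
    scores = cross_validate(Pipeline(steps), X_train, y_train, cv=cv5, scoring=scoring)
    print(name, {m: round(float(np.mean(scores[f"test_{m}"])), 4) for m in scoring})
```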
Model training
Six distinct classification approaches were employed in the experimental part of the study, including both individual algorithms and ensemble techniques. The aim was to obtain a comprehensive comparison of methods representing different theoretical foundations and learning strategies. The models were selected for their documented effectiveness in the literature on binary classification of medical data.
K-Nearest Neighbours (KNN) is an algorithm based on distance measures that classifies observations based on the majority class among the k nearest neighbours in feature space. The implementation applies the classic Euclidean metric, which is one of the most commonly used distance measures in machine learning algorithms, particularly in neighbourhood-based methods, due to its intuitive geometric interpretation. The computational complexity of KNN primarily arises from the need to compute distances between the query point and all training samples during the classification phase8.
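For reference, the Euclidean distance between two feature vectors in the p-dimensional feature space is:

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
```

The predicted class is the majority label among the k = 5 nearest training samples under this metric (cf. Table 1).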
Support Vector Machine (SVM) constitutes a margin-maximisation method particularly effective in high-dimensional feature spaces. Its primary objective is to identify a hyperplane that separates the classes while maximising the margin between the closest data points of each class, referred to as support vectors. To capture non-linear relationships, the Radial Basis Function (RBF) kernel was applied, enabling the modelling of complex, non-linear decision boundaries in the feature space.
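The RBF kernel underlying this model is:

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\!\bigl(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\bigr)
```

where γ controls the locality of the decision boundary; since Table 1 lists no explicit value, scikit-learn's default gamma='scale' is assumed here.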
Random Forest (RF) represents an ensemble learning method in which a set of decision trees is generated, each trained on a randomly selected subset of training samples (bagging) and a randomly chosen subset of features. The final prediction is obtained through majority voting among the individual trees in the case of classification tasks, or by averaging their outputs in regression tasks9.
CatBoost is a gradient boosting algorithm based on decision trees. It is optimised for handling categorical features and reducing bias associated with feature importance estimation. The implementation employed the default logloss function dedicated to binary classification. In addition, the algorithm utilised the built-in automatic class weighting mechanism (auto_class_weights='Balanced'), which computes weights for each class and applies them as multipliers during training. This procedure mitigates the influence of the dominant class and thus alleviates the issue of class imbalance10.
The Voting Classifier integrates the predictions of several models through soft voting, averaging the probabilities predicted by the base models. In this study, the ensemble consisted of three models: Logistic Regression, SVC (Support Vector Classifier), and SGDClassifier.
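A minimal construction of this ensemble, using the hyperparameters listed in Table 1, might look as follows; note that SGDClassifier requires loss='log_loss' to expose predicted probabilities for soft voting.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC

# Soft voting averages the predicted class probabilities of the base models.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("svc", SVC(kernel="rbf", probability=True, class_weight="balanced",
                    random_state=42)),
        ("sgd", SGDClassifier(loss="log_loss", max_iter=1000,
                              class_weight="balanced", random_state=42)),
    ],
    voting="soft",
)
voting.fit(X_train_std, y_train)
```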
The Stacking Classifier is a hierarchical ensemble technique in which the predictions of base learners are used as input features for a higher-level (meta) model. This approach enables the exploitation of complementary properties of different algorithms and may lead to improved classification performance. In the present study, Logistic Regression was employed as the meta-learner: the second-level model in a stacking ensemble, which aggregates predictions from multiple base learners and identifies the optimal weighting scheme for improved generalisation. The configuration also included the transfer of the original input features, allowing the meta-model to utilise both the predictions of the base classifiers and the raw feature set.
Logistic regression was selected as the meta-learner owing to its low variance, good probability calibration, convex optimization (which yields stable behaviour and a unique optimum), and interpretable coefficients. In this stacking configuration, we deliberately use a simple linear meta-model to limit the capacity of the second layer and reduce the risk of overfitting while combining the base learners’ predictions.
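A corresponding sketch of the stacking configuration is given below. The exact base-learner set is not spelled out in the text, so the heterogeneous trio used here (linear, kernel, tree-based, consistent with the description in the next paragraph) is an assumption; passthrough=True implements the transfer of original input features to the meta-model.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stacking = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ("svc", SVC(kernel="rbf", probability=True, class_weight="balanced",
                    random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                      random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,  # meta-learner also sees the raw input features
    cv=5,              # out-of-fold base predictions train the meta-learner
)
stacking.fit(X_train_std, y_train)
```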
The ensemble design aligns with recent distributed imaging frameworks, in which outputs from multiple feature extractors are fused to improve generalisation on unseen data. By analogy, stacking heterogeneous base learners (SVM, tree-based, linear) helps mitigate dataset shift in multi-centre settings while keeping the meta-learner (logistic regression) calibrated and interpretable11.
Model evaluation
The effectiveness of the described models was assessed using a set of quantitative indicators suited to binary medical classification tasks. Precision determines the proportion of correctly identified malignant tumour cases among all samples classified as malignant. Sensitivity (recall), in turn, indicates the percentage of malignant cases correctly detected. The F1-score represents the harmonic mean of precision and sensitivity, providing a balanced metric particularly useful in the presence of class imbalance.
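Formally, with TP, FP, and FN denoting true positives, false positives, and false negatives:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```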
For each model, confusion matrices were generated to visualise the distribution of correct and incorrect diagnoses. Their analysis facilitates the identification of systematic error patterns and supports strategies aimed at reducing false negatives, which is critical for minimising the risk of missed disease cases.
The evaluation of probabilistic classifiers was complemented with Receiver Operating Characteristic (ROC) curves, and the Area Under the Curve (AUC) was calculated. ROC analysis provides a graphical representation of the trade-off between sensitivity and specificity across different decision thresholds, while the AUC quantifies the overall discriminative ability of the model to distinguish between benign and malignant cases. The computations were performed on the independent test set that had been separated during the data preprocessing stage. The configurations of the single and ensemble models used in this study are summarised in Table 1.
Table 1.
Configurations of single models.
| Model | Hyperparameters |
|---|---|
| kNN | n_neighbors=5 |
| Random Forest | n_estimators=100, class_weight=’balanced’, random_state=42 |
| SVM (RBF) | kernel=’rbf’, probability=True, class_weight=’balanced’, random_state=42 |
| CatBoost | auto_class_weights=’Balanced’, random_state=42, verbose=0 |
| Logistic regression | max_iter=1000, class_weight=’balanced’, random_state=42 |
| SGDClassifier | loss=’log_loss’, max_iter=1000, class_weight=’balanced’, random_state=42 |
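As a sketch of the ROC/AUC evaluation described above, the curve coordinates and the AUC can be computed from the predicted probabilities of any fitted model on the held-out test set; here the stacking ensemble and the split from the earlier sketches are reused.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probability of the positive (malignant) class on the test set.
proba = stacking.predict_proba(X_test_std)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

plt.plot(fpr, tpr, label=f"Stacking (AUC = {auc:.4f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```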
Cross-validation protocol
To obtain statistically robust estimates and mitigate optimistic bias, we performed stratified k-fold cross-validation (10 folds, shuffled, random_state=42) on the training set. We report the mean and standard deviation of ROC AUC, PR-AUC, F1, and Balanced Accuracy across folds. Model selection was based on mean ROC AUC in CV; the selected model was then refit on the full training set and evaluated once on the held-out test set.
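A compact sketch of this protocol is shown below, reusing the training split from the preprocessing step; the SVM (RBF) pipeline mirrors Table 1, and average_precision is scikit-learn's name for PR-AUC.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["roc_auc", "average_precision", "f1", "balanced_accuracy"]

pipe = Pipeline([
    ("scale", StandardScaler()),  # refit inside every training fold
    ("clf", SVC(kernel="rbf", probability=True, class_weight="balanced",
                random_state=42)),
])
scores = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scoring)
for m in scoring:
    vals = scores[f"test_{m}"]
    print(f"{m}: {vals.mean():.4f} ± {vals.std():.4f}")
```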
Handling class imbalance
Given the moderate class imbalance (37.3% malignant vs. 62.7% benign), we compared models using class weighting only (with class_weight=’balanced’ where applicable) against oversampling approaches (SMOTE, ADASYN). Rebalancing was implemented inside the CV pipeline to avoid data leakage. Results indicated that oversampling did not materially improve performance over the best non-oversampled configuration; therefore, we retained the latter as the primary model and report full CV and held-out test metrics. This choice is consistent with best-practice guidance for imbalanced settings, which recommends stratified cross-validation with in-fold resampling to prevent leakage, together with oversampling or reweighting when needed, and cross-population validation to assess fairness and generalisability12. The cross-validated and held-out test performance for the SVM (RBF) models under different class-balancing strategies is summarised in Table 2, and the held-out test performance of the selected configuration is reported in Table 3.
Table 2.
Cross-validated performance: stratified 10-fold CV (mean ± SD).
| Model | Balancing | ROC AUC | PR-AUC | F1 | Balanced Acc. |
|---|---|---|---|---|---|
| SVM (RBF) | None | – | – | – | – |
| SVM (RBF) | SMOTE | – | – | – | – |
| SVM (RBF) | ADASYN | – | – | – | – |
Table 3.
Held-out test set (final model: SVM (RBF), no oversampling).
| ROC AUC | PR-AUC | F1 | Balanced Acc. |
|---|---|---|---|
| 0.9971 | 0.9956 | 0.9535 | 0.9627 |
Explainable AI analysis with SHAP
The Shapley Additive exPlanations (SHAP) method was applied to enhance the interpretability of the predictive models. SHAP, grounded in cooperative game theory, enables a quantitative assessment of the contribution of each input feature to the final prediction.
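Formally, the Shapley value assigned to feature i is the weighted average of its marginal contributions over all subsets S of the feature set F, where f_S denotes the model's expected output conditioned on the features in S:

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}}
\frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
\left[ f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_S(x_S) \right]
```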
In the first stage, for models providing the feature_importances_ attribute, global rankings of feature relevance were computed. The values were sorted in descending order, and their visualisation facilitated the preliminary identification of variables with the strongest influence on the model’s decisions.
In the second stage, the SHAP library was employed in TreeExplainer mode, which allowed for the computation of Shapley values for all test set observations (a minimal sketch follows the list). The analysis included:
- Summary plots, presenting feature rankings based on the absolute SHAP values and the direction of their influence on model predictions.
- Waterfall plots for selected samples, illustrating how individual features "shift" the model output from the expected value (baseline) towards one of the classes. This form of visualisation is particularly valuable in clinical practice, as it enables physicians to link algorithmic decisions with concrete morphometric parameters.
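A minimal sketch of this stage is given below; catboost_model and X_test denote the fitted CatBoost classifier and the test features from the earlier steps.

```python
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles
# such as CatBoost.
explainer = shap.TreeExplainer(catboost_model)
shap_values = explainer(X_test)  # shap.Explanation for all test observations

# Global view: feature ranking by |SHAP| with direction of influence.
shap.plots.beeswarm(shap_values)

# Local view: how individual features shift one prediction from the baseline.
shap.plots.waterfall(shap_values[0])
```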
By combining global and local interpretability approaches, the analysis provided a comprehensive understanding of model behaviour. Such investigations are essential in biomedical applications, where transparency of the decision-making process directly impacts end-user trust.
Implementation details
All experiments were implemented in Python 3.13 using the scikit-learn machine learning framework (v1.6.1) for model training and evaluation, and the shap library (v0.48.0) for explainability analysis. Additional packages included NumPy (v2.1.3) and pandas (v2.2.3) for numerical operations and data preprocessing, and Matplotlib (v3.9.2) for visualization.
Models were built as scikit-learn Pipelines with in-fold standardisation (StandardScaler) and, where applicable, class weighting (class_weight=’balanced’). Stratified 10-fold cross-validation (StratifiedKFold, random_state=42) was used throughout. Ensembles were implemented via VotingClassifier (soft voting with logistic regression, RBF–SVM, SGD) and StackingClassifier with logistic regression as the meta-learner. SHAP values were computed for the CatBoost model using the shap library, and mean absolute SHAP values were used to derive the feature importance ranking and SHAP-top-k subsets. The full source code used in this study is openly available at https://github.com/wolakweronika/Tumor-type-prediction.git.
Results
Dataset overview
The study utilised a dataset comprising 569 tumour samples, each characterised by 30 morphometric features extracted from medical imaging. The binary target variable denoted the tumour type: malignant (M) or benign (B). The dataset included 212 malignant cases (37.3%) and 357 benign cases (62.7%), indicating a moderate class imbalance.
Thirty continuous morphometric features derived from ten base descriptors (radius, texture, perimeter, area, smoothness, compactness, concavity, concave_points, symmetry, fractal_dimension) are provided as mean, se, and worst variants; all predictors are numeric. The data correspond to the Wisconsin Diagnostic Breast Cancer (WDBC) set (569 samples; target diagnosis with labels M/B), with the identifier and an empty utility column removed and no missing values detected.

Preprocessing consisted of z-score standardization (StandardScaler) fitted strictly on the training folds to avoid leakage; labels were encoded as M=1 and B=0. Models included LogisticRegression and RandomForestClassifier with class_weight=’balanced’. Univariate exploration indicated that radius_worst, area_worst, and perimeter_worst showed the clearest separation (higher for M), whereas textural descriptors (smoothness, symmetry, fractal_dimension) exhibited greater overlap.

Evaluation used stratified 5-fold cross-validation, with all preprocessing confined within folds; for each fold we report Accuracy, Precision, Recall, F1, and ROC AUC, together with the mean ± SD across folds. ROC (and complementary precision–recall) curves were computed from predicted probabilities. In this setting, Random Forest feature importance prioritized geometry-related variables over purely textural measures. Fixed random seeds were used throughout. The positive class for all binary metrics was M (malignant), with probabilities taken from predict_proba for class M; metrics are reported as mean ± SD across the five stratified folds.
Statistical analysis demonstrated that geometric descriptors such as radius_worst, area_worst, and perimeter_worst were among the most discriminative predictors. Histogram analysis confirmed that these features exhibited broader ranges and clearer separation between classes. Malignant tumours were associated with higher values of radius, perimeter, and area, reflecting their more invasive and irregular growth patterns.
In contrast, textural features such as smoothness (local variation of grey-level intensities), symmetry (regularity of the tumour shape), and fractal_dimension (measure of contour complexity) contributed less to class separation, as their distributions were largely similar across malignant and benign groups. Such descriptors may therefore serve a supporting role in tumour classification.
The findings were further corroborated by feature importance analysis. Both CatBoost and Random Forest consistently indicated that geometry-related variables had the greatest influence on predictive outcomes.
ROC curves and AUC values
ROC curves are one of the most widely used tools for evaluating classifiers. They allow the analysis of the trade-off between sensitivity (TPR) and specificity (TNR = 1 − FPR) for different decision thresholds. As Fawcett notes, ROC graphs "depict relative tradeoffs between benefits (true positives) and costs (false positives)" and their interpretation is based on tracking changes in points in the ROC space as the decision threshold value changes13. These curves are particularly popular in biomedicine, where they are the standard for reporting the effectiveness of clinical decision support systems.
Based on the analysis, it can be concluded that all three ensemble models are characterised by exceptionally high predictive performance. The evaluated classifiers achieved AUC results exceeding 0.996, which confirms their high discriminatory power. Referring to the interpretation of diagnostic metrics, AUC values above 0.9 are considered excellent, and those close to 1.0 suggest almost perfect class separation.
The CatBoost model achieved an AUC score of 0.9971, which indicates an almost perfect ability to distinguish between malignant and benign tumours. It is worth noting that the classifier generated 88 unique probability values. This result indicates highly granular predictions and a rich distribution of probability assessments, suggesting that the model is sensitive to even the smallest differences in input features.
The AUC result for the Voting Classifier reached 0.9961, which should also be interpreted as excellent. The number of unique probability values was slightly lower than in the case of CatBoost and amounted to 80.
The best AUC result of 0.9974 was achieved by the Stacking Classifier, while generating 78 unique probability values. This suggests an exceptionally effective combination of base classifier predictions and slightly higher discriminatory power. It is worth noting that the meta-model aggregates information in a somewhat less fine-grained way than, for example, CatBoost. In the context of clinical applications, this can be both an advantage (lower risk of overfitting) and a limitation (reduced probability diversity, which may correspond to lower diagnostic signal richness).
DeLong’s test for correlated ROC curves was applied on the held-out test set (n = 114). All pairwise comparisons among CatBoost, Voting, and Stacking were conducted, with p-values adjusted using the Holm–Bonferroni method within this family of tests. In addition, paired t-tests were run on 5-fold cross-validated AUCs computed on identical folds for each model (Holm adjustment applied). Confidence intervals (CIs) for AUC obtained from DeLong’s method were truncated to the admissible range [0, 1] when necessary (CIs for ΔAUC were left unconstrained). A detailed summary of these results is provided in Table 4.
Table 4.
AUC, number of unique probability values, and DeLong test vs. Stacking (Holm-adjusted).
| Model | AUC | 95% CI | Unique probability values (total) | ΔAUC vs. Stacking (95% CI) | p vs. Stacking |
|---|---|---|---|---|---|
| CatBoost | 0.9971 | [0.992, 1.000] | 88 | −0.0003 [ , 0.002] | 0.82 (ns) |
| Voting | 0.9961 | [0.989, 1.000] | 80 | −0.0013 [ , 0.001] | 0.94 (ns) |
| Stacking | 0.9974 | [0.992, 1.000] | 78 | – | – |

Significant values are in bold. CIs truncated to [0, 1]. Holm correction applied across all pairwise comparisons.
Differences in AUC were not statistically significant (DeLong: all Holm-adjusted p > 0.05; CV paired t-test: all adjusted p > 0.05). For example, Voting vs. Stacking yielded ΔAUC = −0.0013 with a 95% CI whose upper bound was 0.001 (Holm-adjusted p = 0.94); CatBoost vs. Stacking yielded ΔAUC = −0.0003 with a 95% CI whose upper bound was 0.002 (Holm-adjusted p = 0.82).
Stratified 10-fold cross-validation was conducted on the training set, and the mean ± standard deviation was reported for ROC AUC, PR-AUC, F1, and balanced accuracy. The comparative study includes traditional machine learning baselines (SVM, Random Forest, Logistic Regression, kNN, SGD), ensemble methods (Voting, Stacking), and a representative deep-learning model for tabular data (TabNet). The cross-validated results for all models are summarised in Table 5. Model selection was based on the mean ROC AUC. The SVM with an RBF kernel achieved the highest cross-validated ROC AUC, was refit on the full training set, and subsequently evaluated once on the held-out test split. The resulting ROC AUC of 0.997 and PR-AUC of 0.996 indicate excellent discriminative performance, with the corresponding held-out test metrics reported in Table 6.
Table 5.
Stratified 10-fold cross-validation results (mean ± std).
| Model | ROC AUC | PR-AUC | F1 | Balanced Acc. |
|---|---|---|---|---|
| SVM (RBF) | 0.9967 ± 0.0053 | 0.9958 ± 0.0062 | 0.9676 ± 0.0244 | 0.9748 ± 0.0199 |
| MLP (sklearn) | 0.9961 ± 0.0084 | 0.9958 ± 0.0079 | 0.9759 ± 0.0227 | 0.9801 ± 0.0189 |
| Voting (soft) | 0.9959 ± 0.0083 | 0.9951 ± 0.0084 | 0.9643 ± 0.0224 | 0.9719 ± 0.0187 |
| Stacking | 0.9949 ± 0.0073 | 0.9933 ± 0.0086 | 0.9609 ± 0.0366 | 0.9690 ± 0.0309 |
| Logistic Regression | 0.9947 ± 0.0075 | 0.9930 ± 0.0089 | 0.9586 ± 0.0275 | 0.9685 ± 0.0226 |
| CatBoost | 0.9928 ± 0.0112 | 0.9917 ± 0.0113 | 0.9668 ± 0.0286 | 0.9725 ± 0.0249 |
| Random Forest | 0.9895 ± 0.0176 | 0.9880 ± 0.0169 | 0.9425 ± 0.0370 | 0.9525 ± 0.0316 |
| SGD (log_loss) | 0.9881 ± 0.0173 | 0.9837 ± 0.0205 | 0.9429 ± 0.0301 | 0.9579 ± 0.0195 |
| kNN (k = 5) | 0.9876 ± 0.0142 | 0.9821 ± 0.0170 | 0.9532 ± 0.0321 | 0.9569 ± 0.0285 |
| TabNet | 0.9793 ± 0.0200 | 0.9696 ± 0.0286 | 0.9134 ± 0.0604 | 0.9317 ± 0.0494 |
Significant values are in bold.
Table 6.
Held-out test performance of the selected model (SVM (RBF)).
| Accuracy | Precision | Recall | F1 | Balanced Acc. | ROC AUC | PR-AUC |
|---|---|---|---|---|---|---|
| 0.9649 | 0.9535 | 0.9535 | 0.9535 | 0.9627 | 0.9971 | 0.9956 |
Interpretation of the shape of ROC curves
The generated graphs show a characteristic rise of the ROC curve in the upper left corner. This behaviour indicates very high sensitivity with a minimal number of false alarms. Even at low decision thresholds, the classifiers correctly identified malignant tumours.
All three models achieved AUC scores that round to 1.00 on the test set (exact values are reported above), which demonstrates the stability of the obtained data representations (Fig. 1). As Dietterich notes, "By constructing an ensemble out of all of these accurate classifiers, the algorithm can 'average' their votes and reduce the risk of choosing the wrong classifier"14. This means that both the single model (CatBoost) and the ensemble methods could benefit from high-quality input features, virtually eliminating the risk of choosing a suboptimal classifier in this experiment.
Fig. 1.
ROC curves for the evaluated classifiers (x-axis: False positive rate, y-axis: True positive rate).
Factors contributing to these near-perfect results:
High discriminatory power of tumour geometric features. Some characteristic features, such as radius_worst, perimeter_worst, and area_worst, already showed strong discriminative potential at the exploratory analysis stage. The decision boundary between benign and malignant cases was relatively clear, and the models fully exploited this relationship. Algorithms based on decision trees (CatBoost, Random Forest) are particularly well suited to capturing non-linear and hierarchical dependencies.
Automatic class weighting. The use of class_weight='balanced' significantly reduced issues related to class imbalance. As a result, the models were not biased towards the dominant class and were able to correctly detect less frequent malignant cases. This improved classification sensitivity and enhanced result stability.
Cross-validation and hyperparameter tuning. Both cross-validation and hyperparameter optimisation reduced the risk of overfitting. It should be noted, however, that near-perfect AUC values may reflect a very strong fit to the specifics of the test dataset. Therefore, while the results are encouraging, external validation would be required to confirm their generalisability.
Similar experiments in the literature on the classification of cancerous lesions in medical imaging indicate that AUC values often exceed 0.95 but rarely approach perfection. For instance, Arevalo et al. (2016) reported results in the range of AUC = 0.82–0.86 for different feature representations in the classification of mammography lesions, highlighting the limitations of traditional approaches based on manually engineered descriptors15. It should be emphasised that AUC values as high as those obtained in this study (exceeding 0.996) are rarely reported in biomedical research. Exceptions can be found in studies employing deep learning. For example, Coudray et al.16 showed that neural networks classifying lung cancer subtypes based on challenging TBLB biopsies achieved an AUC of 0.99 in indeterminate cases and an AUC in the range of 0.94–0.99 across four independent test sets. The authors emphasized that such high performance was remarkable given that these cases are difficult even for experienced pathologists.
In this context, the results obtained in the present study should be interpreted as comparable to the best outcomes reported in the literature, confirming the exceptional informativeness of the geometric features included in the analysis.
Critical interpretation of results
Although an AUC approaching 1.00 is highly attractive from a theoretical research perspective, it must be interpreted with great caution in the context of clinical applications. Under real-world conditions, such as medical imaging or histopathological diagnostics, the frequency of misclassifications typically increases due to patient heterogeneity and the presence of noise in the data. Consequently, the results obtained in this study should be regarded as evidence of the potential of the applied models under controlled conditions, while emphasising the need for validation on independent external datasets to confirm their generalisability.
Ablation study
Voting-weight heatmaps
Across the voting-weight heatmaps (Table 7), the best cross-validated ROC–AUC occurs when the SGD component is given a low weight and the SVC receives the highest weight (top-right cell of the heatmap). This aligns with the single-model results, where the RBF–SVM is consistently strong. With the SGD weight kept low, the surface is fairly flat, with only small, second-order gains from up-weighting SVC and (to a lesser extent) LR. When the SGD weight dominates, the matrix is effectively constant, indicating a degenerate weighting regime in which the ensemble collapses to the over-weighted SGD and becomes insensitive to the other learners.
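A sketch of the grid search underlying such heatmaps is shown below, reusing the voting ensemble, training split, and CV splitter from the earlier sketches; the weight grid itself is illustrative.

```python
import itertools
from sklearn.model_selection import cross_val_score

# Grid over soft-voting weights (w_lr, w_svc, w_sgd); values are illustrative.
weight_grid = [0.5, 1.0, 2.0]
results = {}
for w_lr, w_svc, w_sgd in itertools.product(weight_grid, repeat=3):
    voting.set_params(weights=[w_lr, w_svc, w_sgd])
    auc = cross_val_score(voting, X_train_std, y_train, cv=cv,
                          scoring="roc_auc").mean()
    results[(w_lr, w_svc, w_sgd)] = auc

best = max(results, key=results.get)
print("best weights (lr, svc, sgd):", best, "mean ROC-AUC:", round(results[best], 4))
```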
Table 7.
Ablation summary with CIs and significance (10-fold stratified CV; 95% CI: t-based; AUC p-values: DeLong with Holm correction).
| Setting | ROC-AUC [95% CI] | PR-AUC [95% CI] | F1 (mean ± SD) | ΔAUC vs. base (p) |
|---|---|---|---|---|
| ALL (30) | 0.9990 [0.997, 1.000] | 0.9960 [0.993, 0.999] | 0.9762 ± 0.006 | +0.004 (0.41) |
| MEAN_only (10) | 0.9980 [0.996, 1.000] | 0.9950 [0.991, 0.999] | 0.9750 ± 0.007 | +0.003 (0.57) |
| SE_only (10) | 0.9450 [0.930, 0.961] | 0.9300 [0.910, 0.951] | 0.9000 ± 0.020 | (<0.001) |
| WORST_only (10) | 0.9985 [0.996, 1.000] | 0.9955 [0.992, 0.999] | 0.9758 ± 0.006 | +0.003 (0.49) |
| SHAP_top10 (10) | 0.9935 [0.989, 0.998] | 0.9905 [0.985, 0.996] | 0.9700 ± 0.008 | (0.62) |
| SHAP_top20 (20) | 0.9970 [0.994, 1.000] | 0.9960 [0.992, 0.999] | 0.9760 ± 0.006 | +0.001 (0.83) |
| Voting (30) | 0.9953 [0.991, 0.999] | 0.9930 [0.988, 0.998] | 0.9720 ± 0.008 | (0.55) |
| Stacking_full (30) | 0.9858 [0.973, 0.998] | 0.9820 [0.970, 0.994] | 0.9650 ± 0.012 | (0.09) |
Significant values are in bold.
Best model per feature set
The bar plot shows that using all features and using SHAP-selected subsets (top-20, top-15, top-10, top-5) achieves near-identical, very high ROC–AUC (up to 0.999, with overlapping 95% CIs). Sets restricted to a single descriptor family (e.g. SE_only) drop a few points (mid-0.94s), confirming that worst/mean morphology plus concavity/concave-points carry most of the discriminative signal.
SHAP top-k curve
The top-k analysis corroborates this: moving from the full-feature Stacking configuration, through the Voting variants (with and without SGD), the best-model curve gains steadily, peaking with the SVM–RBF on a SHAP-selected subset (highest mean ROC–AUC). The shaded confidence band overlaps across k, indicating no statistically meaningful loss when trimming from 30 to 10–20 SHAP-ranked features, while modestly improving mean AUC and simplifying the model. In practice, favour SVC-centric voting weights with a light touch on SGD; SHAP-guided selection to 10–20 features preserves (and can slightly improve) performance, whereas single-family feature subsets underperform, highlighting the value of combining size/shape descriptors with concavity-related ones.
Confusion matrices
Confusion matrices provide an excellent basis for the detailed evaluation of classifiers. They present the number of correct classifications and highlight the types of errors committed, particularly false negatives (FN) and false positives (FP). The matrices generated for the models based on advanced approaches, such as CatBoost, stacking, and voting, are of particular interest (Figs. 2, 3, 4).
Fig. 2.
Confusion matrix for the CatBoost model.
Fig. 3.
Confusion matrix for the Voting Classifier.
Fig. 4.
Confusion matrix for the Stacking Classifier.
Both the CatBoost and Random Forest models demonstrated an almost perfect classification distribution. Only one benign case was misclassified as malignant (FP), and two malignant cases were incorrectly assessed as benign (FN). This indicates that the tree-based models made highly effective use of the available features, particularly the geometric tumour descriptors, achieving near-optimal class separation.
Even better results were observed for the Stacking and SVM models, which completely eliminated false positive errors. All benign tumours were correctly classified. In both cases, only two FN errors were recorded. From a diagnostic perspective, this outcome is highly relevant. False positives, although less dangerous clinically, may generate unnecessary emotional distress, additional examinations, and invasive procedures. As highlighted in the analysis of breast cancer screening studies: “More than 200 women will experience important psychological distress including anxiety and uncertainty for years because of false positive findings”17. At the same time, the risk of underestimating malignant cases (FN errors) remains the more serious limitation. Petticrew et al. (2001) pointed out that the main consequence of false negatives is a false sense of security, which can lead to delayed diagnosis and treatment18.
The Voting Classifier demonstrated the greatest variability. Three benign tumours and one malignant tumour were misclassified. While the overall performance remained high, the soft-voting scheme did not fully exploit the potential of the base models. The results obtained in this study can be considered remarkable in the context of the biomedical literature. For comparison, data published in PDQ Screening and Prevention (2025) indicate that mammography, currently the standard diagnostic tool for breast cancer, may miss between 6% and 46% of malignant cases19. In this light, the results presented here demonstrate both the exceptional diagnostic usefulness of the applied machine learning methods and the high informativeness of the morphometric features employed.
Table of metrics
The highest accuracy among the analysed models (Accuracy = 0.9825) and perfect precision (Precision = 1.0) were achieved by the Stacking and SVM models. They did not generate any false positives, which makes them highly reliable in situations involving complex tumours.
CatBoost and Random Forest achieved very similar results (Accuracy = 0.9737, F1 = 0.9647), while maintaining a favourable balance between precision and sensitivity. Tree-based methods are particularly effective at capturing non-linear relationships and interactions between features.
The Voting Classifier was characterised by slightly lower precision (0.9333), but in turn achieved the highest sensitivity (Recall = 0.9767). KNN obtained the lowest results in all evaluated categories (Accuracy = 0.9474, F1 = 0.9302). The limited effectiveness of this model may be attributed to the fact that similarity-based methods are less robust to high-dimensional and complex data compared with ensemble or tree-based algorithms.
In the context of biomedicine, models with high sensitivity deserve particular attention. The overriding clinical goal is to minimise the risk of missing malignant tumours. In this respect, the Voting Classifier performs very well, even at the expense of reduced precision.
The results obtained in this study can be considered comparable to the best outcomes reported in the literature on brain tumour classification. For instance, Ilani et al. (2025) reported Accuracy = 98.56% and AUC = 99.8% for the U-Net architecture20, while Nahiduzzaman et al. (2025) achieved similar results (Accuracy = 98.56% and AUC = 99.8%) using the DarkNet model21. In turn, Alsubai (2022), using a hybrid CNN-LSTM architecture, achieved the highest performance reported so far, with Accuracy = 99.1%, Precision = 98.8%, and Recall = 98.9%22.
Beyond point estimates, the full ROC and precision–recall curves confirm a tight separation between the top models, with tree-based and ensemble methods (Random Forest, CatBoost, XGBoost, Stacking, Voting) and SVM consistently dominating across thresholds. In practice, Random Forest provides a strong accuracy–efficiency trade-off, combining high F1 with low-latency inference and a compact model, while Stacking and SVM deliver the very best F1 scores at the cost of higher inference overhead. CatBoost matches Random Forest in F1 and ROC–AUC, while achieving the fastest inference at the expense of a larger model, and XGBoost remains competitive albeit with slightly lower discrimination. Deep baselines (MLP, TabNet) were less competitive overall on this tabular dataset, with MLP matching Random Forest in F1 but slightly lower AUCs and TabNet lagging further behind. Taken together, these results support the use of ensembles or tree-based classifiers as robust candidates for reliable, real-time clinical decision support.
In line with recent guidance for medical AI evaluation, comprehensive reporting should go beyond accuracy to include prevalence-robust metrics (e.g. PR-AUC, balanced accuracy, MCC) and cross-validated analysis to ensure reliability across thresholds23. Our multi-metric, statistically validated benchmarking follows this recommendation and mirrors the article’s emphasis on reproducible, clinically meaningful assessment23. The resulting test-set performance across models is summarised in Table 8, while complementary computational complexity indicators are reported in Table 9.
Table 8.
Test-set performance across models.
| Model | Accuracy | Precision | Recall | F1 | PR-AUC | ROC-AUC |
|---|---|---|---|---|---|---|
| RandomForest | 0.9737 | 0.9762 | 0.9535 | 0.9647 | 0.9979 | 0.9987 |
| Stacking | 0.9825 | 1.0000 | 0.9535 | 0.9762 | 0.9964 | 0.9974 |
| CatBoost | 0.9737 | 0.9762 | 0.9535 | 0.9647 | 0.9955 | 0.9971 |
| SVM | 0.9825 | 1.0000 | 0.9535 | 0.9762 | 0.9948 | 0.9964 |
| Voting | 0.9649 | 0.9333 | 0.9767 | 0.9545 | 0.9944 | 0.9961 |
| MLP | 0.9737 | 0.9762 | 0.9535 | 0.9647 | 0.9900 | 0.9921 |
| XGBoost | 0.9561 | 0.9524 | 0.9302 | 0.9412 | 0.9898 | 0.9918 |
| TabNet | 0.9474 | 0.9512 | 0.9070 | 0.9286 | 0.9883 | 0.9915 |
| KNN | 0.9474 | 0.9302 | 0.9302 | 0.9302 | 0.9522 | 0.9764 |
Table 9.
Computational complexity metrics for evaluated models.
| Model | Predict total (s) | Predict/sample (ms) | Throughput (samples/s) | Model size (MB) | n_estimators | n_support | max depth | n params |
|---|---|---|---|---|---|---|---|---|
| KNN | 0.005069 | 0.04446 | 22,486.9 | 0.10839 | – | – | – | – |
| RandomForest | 0.015408 | 0.13516 | 7,399.1 | 0.28762 | 100.0 | – | – | – |
| SVM | 0.037202 | 0.32633 | 3,065.3 | 0.01594 | – | 58.0 | – | – |
| CatBoost | 0.001340 | 0.01175 | 85,081.9 | 1.07755 | – | – | – | – |
| Voting | 0.003105 | 0.02724 | 36,717.9 | 0.03028 | – | – | – | – |
| Stacking | 0.003444 | 0.03021 | 33,101.0 | 0.03105 | – | – | – | – |
| XGBoost | 0.001575 | 0.01382 | 72,346.4 | 0.27757 | 300.0 | – | 4.0 | – |
| MLP | 0.002212 | 0.01940 | 51,560.6 | 0.38396 | – | – | – | 12,289 |
| TabNet | 0.022741 | 0.19948 | 5,013.1 | 0.38842 | – | – | – | – |
Interpretability analysis
The interpretability of machine learning models enables the verification of algorithmic results in the context of clinical knowledge, thereby reducing the risk of so-called “black-box medicine”3. In this study, feature importance assessment was applied in the analysis of combined classical models. In addition, the SHAP technique was employed, providing both a global view of the impact of variables on predictions and a detailed explanation at the level of individual cases.
Feature importance (CatBoost)
An analysis of the CatBoost model’s feature importance (Fig. 5) revealed that morphometric parameters directly related to tumour size, shape, and structural irregularities had the greatest predictive impact. The most influential variables included texture_worst, concave points_worst, area_worst, and radius_worst. In clinical studies, these parameters are considered reliable indicators of tumour border irregularity and reflect the complexity of the observed lesions. These findings are consistent with previous reports in the literature. Similar observations were made, for example, in mammography analysis. Arevalo et al. (2015) compared manually extracted features (e.g. intensity, texture, shape) with features automatically learned by convolutional neural networks (CNNs)24. They demonstrated that traditional morphological indicators such as surface area, perimeter, or texture parameters are diagnostically valuable, but their utility is limited due to the need for manual segmentation and their weaker ability to generalise to unseen data.
Fig. 5.
Feature importance for the CatBoost model.
Morphometric predictors associated with size and shape irregularity not only reflect the biological aggressiveness of tumours but are also relatively straightforward to interpret clinically. Furthermore, such features can be applied not only in the initial diagnostic phase but also in monitoring disease progression.
Local case-based explanations using SHAP
In addition to the global feature importance analysis, local patient-level explanations were examined based on SHAP values. Table 10 presents two representative cases and the five most influential features driving the CatBoost model’s predictions. For the high-confidence benign case, low values of concavity- and size-related descriptors (e.g. concave points_worst, area_worst, radius_worst) strongly decreased the predicted probability of malignancy, which is reflected by negative SHAP values. In contrast, in the uncertain malignant case, higher values of lesion texture and concavity descriptors (such as texture_worst, texture_mean and concavity_worst) were associated with a higher predicted probability of malignancy, as indicated by positive SHAP values.
Table 10.
Top five features with the largest SHAP contributions for two CatBoost case studies.
| Case | ID | Rank | Feature | Std. value | SHAP | Pred. prob. | Pred. label | True label |
|---|---|---|---|---|---|---|---|---|
| high_conf_neg | 101 | 1 | concave points_worst | -1.750 | -0.038 | 0.001 | benign (0) | benign (0) |
| high_conf_neg | 101 | 2 | area_worst | -0.942 | -0.035 | 0.001 | benign (0) | benign (0) |
| high_conf_neg | 101 | 3 | concave points_mean | -1.270 | -0.034 | 0.001 | benign (0) | benign (0) |
| high_conf_neg | 101 | 4 | concavity_worst | -1.313 | -0.033 | 0.001 | benign (0) | benign (0) |
| high_conf_neg | 101 | 5 | radius_worst | -1.168 | -0.029 | 0.001 | benign (0) | benign (0) |
| Uncertain | 82 | 1 | texture_worst | 0.613 | 0.067 | 0.657 | malignant (1) | malignant (1) |
| Uncertain | 82 | 2 | texture_mean | 0.534 | 0.063 | 0.657 | malignant (1) | malignant (1) |
| Uncertain | 82 | 3 | concavity_worst | 0.288 | 0.057 | 0.657 | malignant (1) | malignant (1) |
| Uncertain | 82 | 4 | area_se | -0.025 | 0.038 | 0.657 | malignant (1) | malignant (1) |
| Uncertain | 82 | 5 | concave points_mean | 0.029 | 0.035 | 0.657 | malignant (1) | malignant (1) |
SHAP summary and SHAP waterfall analysis
The SHAP summary analysis enabled the ranking of features according to their impact on tumour classification and indicated the direction of their influence (i.e. whether they increased the likelihood of assigning a lesion to the malignant or benign class). The strongest contribution to the classification of malignant cases was associated with high values of variables such as concave points_worst, area_worst, and concavity_worst, whereas low values of these features were characteristic of benign cases (Fig. 6). As Lundberg and Lee (2017) emphasise, "SHAP values attribute to each feature the change in the expected model prediction when conditioning on that feature"25, confirming the ability to simultaneously assess both the importance and the direction of feature influence.
Fig. 6.
SHAP summary plot for the CatBoost model.
A detailed case-level analysis was also performed using a SHAP waterfall plot (Fig. 7). This examination showed that features such as concave points_mean, radius_worst, and area_worst strongly shifted the decision function towards the malignant class, whereas variables such as smoothness_worst slightly influenced the decision towards the benign class. Such local explanations not only help to understand the general mechanisms underlying the model’s behaviour but also provide justification for individual diagnostic decisions26. This approach is particularly valuable in clinical practice, where transparency and explainability directly influence trust in algorithmic support systems.
Fig. 7.
SHAP waterfall plot for a representative malignant case (CatBoost).
Building on these results, the expanded SHAP analysis, including dependence, interaction, and beeswarm summary plots (Fig. 8), adds two clinically meaningful layers. First, dependence and interaction plots reveal mostly monotonic, positive relationships for the key morphology and size descriptors (concave points mean and worst, area_worst, radius_worst, and concavity_worst), confirming that higher values consistently push predictions toward malignancy; the interactions indicate that elevated concavity_worst amplifies the effect of area_worst and radius_worst, while higher texture features can further strengthen the impact of concave points. Second, combining SHAP with model-confidence case studies (high-confidence malignant/benign, most-uncertain, and confident-error cases) shows how explanation patterns align with reliability. Confident malignant cases present concordantly high values of the core features with large positive SHAP contributions, while confident benign cases show uniformly low values with negative contributions. Uncertain cases display offsetting SHAP effects across features, and rare confident errors expose atypical feature constellations useful for targeted review (e.g. additional imaging or expert adjudication) and for iterative model improvement (feature engineering or calibration). Practically, this enables rule-of-thumb guidance for decision support: escalate when key features are high and both the SHAP-driven risk and the predicted probability are high; trigger a "second-look" workflow when SHAP contributions conflict and the probability is near 0.5; and routinely audit confident errors to refine the model and guard against systematic biases.
Fig. 8.
SHAP beeswarm summary plot for the CatBoost model.
Discussion
Both the CatBoost and Voting Classifiers achieved excellent performance in tumour prediction, and the underlying reasons for these outcomes can be better understood by examining the mechanisms of the respective approaches. CatBoost, a gradient boosting framework, was specifically designed to efficiently process heterogeneous and high-dimensional data. Its core innovation, ordered boosting, mitigates target leakage and substantially reduces the risk of overfitting, distinguishing it from other gradient boosting implementations. Moreover, CatBoost incorporates highly effective encoding of categorical variables, a property of particular relevance in biomedical applications, where feature spaces frequently exhibit non-linear and heterogeneous interactions. In the present study, CatBoost proved especially effective in modelling morphometric parameters that capture tumour size, irregularity, and structural heterogeneity.
The Voting Classifier, constructed as an ensemble of Logistic Regression, Support Vector Classifier (SVC), and Stochastic Gradient Descent (SGD) Classifier, was implemented in soft-voting mode, aggregating the class probability estimates of the base learners to generate the final prediction. This approach proved particularly effective in minimising false negative outcomes, thereby increasing sensitivity, a factor of paramount importance in oncological diagnostics, where the omission of malignant cases carries serious clinical consequences. The Stacking Classifier, in turn, employed a hierarchical architecture in which a meta-model was trained on the predictions of the base learners alongside the original features. This enabled the exploitation of complementary algorithmic strengths, leading to improved overall performance. Notably, Stacking effectively reduced both false positive and false negative errors, highlighting its capacity to balance sensitivity and precision while maintaining robust generalisation. Taken together, these approaches illustrate that advanced ensemble learning strategies provide highly accurate and reliable solutions for predictive modelling in neuro-oncology.
The SHAP analysis provided valuable insights into the decision-making process of the CatBoost algorithm and demonstrated that geometric features exerted the greatest influence on classification outcomes. In the clinical literature, such indicators have long been recognised as critical for differentiating between benign and malignant lesions27,28. Irregular tumour borders, increased size, and heterogeneous signal texture are widely acknowledged as characteristic markers of aggressive neoplasms28. The consistency between the findings of this study and established clinical knowledge confirms that the algorithm captures biologically relevant patterns rather than artefactual correlations. The interpretability afforded by SHAP not only enables a more comprehensive understanding of the model’s decisions but also fosters the trust required for the safe and effective integration of artificial intelligence into diagnostic workflows.
The CatBoost, Voting, and Stacking models therefore complement each other in a unique manner, combining interpretability with high diagnostic sensitivity. CatBoost provides clinically meaningful justifications for predictions by linking them to established biomarkers, Voting reduces the risk of overlooking malignant tumours, and Stacking increases robustness against overfitting. Collectively, these models form a coherent and effective decision-support framework for the classification of cancerous lesions. The effectiveness of such combined strategies has also been demonstrated in other diagnostic domains. For instance, ensemble models based on stacking and voting, when enhanced with SHAP-based interpretability, have been shown to outperform individual classifiers in the detection of heart disease29. The application of CatBoost and ensemble methods can therefore substantially strengthen the diagnostic workflow in oncology. These solutions align with the paradigm of personalised medicine, whose primary goal is to tailor therapy to the individual patient profile. Transparent ensemble models have the potential to act as a reliable “second opinion” by highlighting the most relevant diagnostic biomarkers and supporting clinical decision-making. Importantly, the literature emphasises that such approaches are not intended to replace physicians but to complement their expertise, reducing cognitive workload and accelerating the diagnostic process30.
This study yielded highly promising results. However, several important limitations should be considered when interpreting the findings. First, the tumour dataset used was carefully curated but originated from a single reference source. There is increasing consensus in the literature that the effectiveness of machine learning models often decreases when applied to new environments that differ from those in which they were trained. As highlighted in the review Key challenges for delivering clinical impact with artificial intelligence, “generalisation can be hard due to technical differences between sites [...] as well as variations in local clinical and administrative practices”31. This implies that differences in medical equipment, imaging protocols, and local clinical workflows can lead to significant variability in algorithmic performance. The same review further notes that “proper assessment of real-world clinical performance and generalisation requires appropriately designed external validation [...] This practice is currently rare in the literature and is of critical concern”31. The absence of external validation, i.e. testing models on data from other institutions, therefore represents a major limitation in assessing the true clinical utility of the proposed approach. Evidence from other fields underscores this problem: Zech et al. (2018) demonstrated that convolutional neural networks trained to detect pneumonia on chest X-rays achieved substantially higher performance on internal datasets compared to external test sets, warning that “estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance”32.
Another limitation is that the dataset employed in this study had a relatively simple structure: it was well described and largely free of artefacts. While this undoubtedly contributed to the high performance of the models, it does not fully reflect the complexity and heterogeneity encountered in routine clinical practice. Models trained on simplified datasets risk overfitting to the specificities of the training environment and may lose effectiveness in more challenging scenarios. An additional barrier to generalisability is the phenomenon of underspecification, whereby models adapt to spurious correlations in the data. As emphasised in Toward Generalisability in the Deployment of Artificial Intelligence in Radiology, “underspecification is a major obstacle to the generalisability of machine learning models and defines the inability of the pipeline to ensure that the model has encoded the inner logic of the system”33.
A further limitation is the lack of multimodal integration. In practice, tumour diagnosis is informed by imaging, histopathological, molecular, and clinical data. Models that rely solely on imaging or morphometric features may therefore have restricted predictive value. Recent research suggests that the integration of multimodal data significantly enhances diagnostic accuracy and robustness34,35. Finally, this analysis was based on retrospective data. Retrospective studies, while valuable, carry inherent risks related to limited control over data acquisition protocols and data quality. In the context of medical artificial intelligence, there is growing recognition of the need for prospective validation. Randomised controlled trials (RCTs) are considered the gold standard in modern medicine, including for AI interventions, as they enable direct evaluation of a model’s clinical impact rather than solely its technical performance36.
Clinical applicability and workflow integration can be made explicit by positioning the model as a decision-support layer in the radiology pipeline rather than a replacement for expert judgement. A pragmatic route is PACS/RIS integration with DICOM in/out and HL7/FHIR messaging, so that inference can run automatically when new MR studies are ingested, returning a structured report: predicted class, calibrated probability (with confidence interval), key SHAP features, and image-level overlays or region-of-interest flags where available. Thresholds should be site-tunable (e.g. sensitivity-first for triage vs. balanced F1 for routine reading), and outputs presented with human-in-the-loop controls: an “accept/override” button, links to the most influential features, and a short natural-language rationale derived from the patient-level SHAP explanation. To support safe adoption, the package should include calibration plots, expected calibration error, decision-curve analysis (net benefit vs. threshold), and fail-safes (e.g. abstaining when uncertainty or out-of-distribution scores exceed a limit). Operationally, continuous monitoring is needed: drift detection on input distributions, performance dashboards (AUROC/PR-AUC, sensitivity at fixed specificity), alerting when metrics degrade, and model/version governance with audit logs.
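As one illustration of the calibration and abstention components, the hedged sketch below computes expected calibration error over confidence bins and applies a simple confidence-based deferral rule. The 0.75 confidence floor and both function names are hypothetical, not part of the study.

```python
# Illustrative sketch (not the deployed pipeline): expected calibration
# error and a simple abstention rule for binary tumour classification.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    confidence = np.maximum(y_prob, 1.0 - y_prob)  # confidence in the predicted class
    predicted = (y_prob >= 0.5).astype(int)
    bins = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs((predicted[mask] == y_true[mask]).mean() - confidence[mask].mean())
            ece += mask.mean() * gap
    return ece

def predict_with_abstention(y_prob, threshold=0.5, min_confidence=0.75):
    """Label 1/0 at the decision threshold; return -1 (defer to the
    radiologist) when model confidence falls below min_confidence."""
    y_prob = np.asarray(y_prob)
    confidence = np.maximum(y_prob, 1.0 - y_prob)
    labels = (y_prob >= threshold).astype(int)
    return np.where(confidence >= min_confidence, labels, -1)
```

In a sensitivity-first triage setting, the decision threshold would be lowered and the abstention band widened, trading reviewer workload for fewer missed malignancies.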
Generalisation claims require rigorous external validation beyond retrospective single-centre testing. A multi-centre, prospective study should sample different vendors, field strengths, pulse sequences, and patient demographics, with predefined endpoints and a locked model. Practical steps include: cross-site harmonisation of radiomic features (e.g. intensity normalisation; ComBat or similar batch-effect correction), site-stratified calibration, and reporting per-site/per-scanner subgroup metrics to surface domain shift. When data-sharing is constrained, federated or swarm training can be explored to learn across institutions without centralising PHI. After deployment, periodic re-calibration and, if needed, controlled continual learning should follow a change-management plan (pre-specification, rollback, re-approval).
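A minimal sketch of the site-stratified reporting suggested above is given below; the DataFrame columns (`site`, `y_true`, `y_prob`) are assumed names, not the study's actual schema.

```python
# Hedged sketch: per-site subgroup metrics to surface domain shift.
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score, roc_auc_score

def per_site_report(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """One row of metrics per site; a low outlier suggests domain shift."""
    rows = []
    for site, grp in df.groupby("site"):
        y_pred = (grp["y_prob"] >= threshold).astype(int)
        # AUROC is undefined when a site contains only one class.
        auroc = (roc_auc_score(grp["y_true"], grp["y_prob"])
                 if grp["y_true"].nunique() > 1 else np.nan)
        rows.append({
            "site": site,
            "n": len(grp),
            "auroc": auroc,
            "sensitivity": recall_score(grp["y_true"], y_pred, zero_division=0),
        })
    return pd.DataFrame(rows).sort_values("auroc")
```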
Explainability and ethics go together. SHAP global summaries, interaction plots, and patient-level waterfall/decision plots should be embedded in the clinical UI to make reasoning traceable, but with careful communication that feature attributions explain the model’s behaviour, not the underlying biology. To limit automation bias, interfaces should highlight uncertainty, show alternative explanations (e.g. counterfactual “what would flip the prediction?”), and default to conservative recommendations when evidence is weak. Fairness checks are essential: evaluate performance and calibration across sex, age, scanner vendor, and site; document gaps and mitigations (reweighting, targeted augmentation). Privacy and security must meet GDPR/HIPAA requirements: on-premises or trusted-cloud deployment, encryption at rest and in transit, strict access control, and de-identification for any secondary use. Finally, ensure compliance with regulatory guidance (e.g. EU MDR, FDA SaMD): maintain a technical file, an intended-use statement, human-factors testing, and post-market surveillance. Together, these steps translate a high-performing model into a clinically credible, transparent, and auditable tool that can be safely integrated into diagnostic care.
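For the SHAP views mentioned above, the snippet below illustrates the standard SHAP API for a fitted tree ensemble; `model` and `X` are placeholders for the trained classifier and feature matrix, not the study's variable names.

```python
import shap  # assumes a fitted tree-based model `model` and feature matrix `X`

explainer = shap.TreeExplainer(model)  # exact attributions for tree ensembles
explanation = explainer(X)             # Explanation object for the whole cohort

shap.plots.beeswarm(explanation)       # global summary: which features drive predictions
shap.plots.waterfall(explanation[0])   # patient-level trace for a single case
```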
A natural next step is to move from single-source features to truly multi-modal cohorts that combine imaging with clinical, laboratory, and, where available, pathology or genomic data, so the model learns complementary signals and stays robust across sites. Practically, this means harmonised preprocessing, explicit fusion (early shared-latent vs. late modality-specific), and ablation studies to quantify each modality’s value; a late-fusion sketch follows this paragraph. Longitudinal extensions should use repeated studies and follow-ups to enable trajectory-aware predictions via sequence models (RNN/Transformer), time-aware boosting, or survival/competing-risk methods that handle irregular sampling and censoring. For multi-centre collaboration without sharing raw data, adopt federated training with aligned feature schemas, secure aggregation, and optional privacy safeguards; add site-level validation with shift detection, drift auditing, and clear versioning/rollback. Clinical translation then requires prospective, preregistered evaluations with predefined operating points, runtime safeguards (latency budgets, fail-safes, calibration checks), and continuous monitoring of calibration, subgroup performance, and fairness, under SOPs for threshold updates, periodic retraining, and ongoing quality control.
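The late-fusion variant referenced above could look like the following sketch, assuming two feature blocks (`X_img`, `X_clin`) that are hypothetical stand-ins for imaging and clinical modalities; the function name and fold count are likewise assumptions.

```python
# Hedged sketch of late (modality-specific) fusion: one model per modality,
# fused by a logistic meta-learner trained on out-of-fold probabilities.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def late_fusion_fit(X_img, X_clin, y):
    """Fit per-modality models; the meta-learner sees out-of-fold
    probabilities so the fusion weights are not biased by training fit."""
    m_img = CatBoostClassifier(verbose=0, random_state=42)
    m_clin = CatBoostClassifier(verbose=0, random_state=42)
    meta_X = np.column_stack([
        cross_val_predict(m_img, X_img, y, cv=5, method="predict_proba")[:, 1],
        cross_val_predict(m_clin, X_clin, y, cv=5, method="predict_proba")[:, 1],
    ])
    meta = LogisticRegression().fit(meta_X, y)
    # Refit each modality model on all data before returning.
    return m_img.fit(X_img, y), m_clin.fit(X_clin, y), meta
```

An ablation then amounts to refitting with each modality column removed from `meta_X` and comparing validation performance, quantifying the added value of each data source.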
Conclusions
This study presented a comprehensive analysis of machine learning methods, including ensemble models and interpretability techniques, in the context of tumour classification. The results demonstrated that both the CatBoost model and Voting classifiers achieved high diagnostic accuracy. All evaluated approaches were characterised by strong overall performance and a low number of false negative predictions. The application of SHAP analysis enabled a clear link between model decisions and clinically documented features. Owing to this interpretability, the developed solutions not only achieved high effectiveness but also provided physicians with transparent diagnostic insights, thereby increasing trust in the algorithms.
The findings underscore the considerable potential of artificial intelligence, particularly of interpretable ensemble models, in oncology. At the same time, the study highlights important limitations, including the homogeneity and simplicity of the input data and the need for validation in prospective, multicentre settings. The continued development of methods that combine high predictive accuracy with transparent decision-making may significantly enhance the diagnosis and monitoring of tumours in the future, ultimately serving as a valuable support tool in routine clinical practice.
Author contributions
Weronika Wolak (50%) – system implementation and preparation of the research environment; specification of the research problem and definition of research hypotheses; selection of methods and evaluation metrics; analysis and compilation of experimental results. Anna Plichta (25%) – literature review; preparation of the theoretical part of the work; development and editing of the preliminary version of the manuscript; linguistic correction. Hubert Orlicki (25%) – work concept and formulation of the research problem; development of methodology; coordination of team work; substantive supervision. All authors participated equally in the interpretation of the results and in the preparation of the final version of the manuscript, and approved it for publication. The authors declare that there are no conflicts of interest between them.
Data availability
The datasets analysed during the current study are publicly available in the Kaggle repository: Tumor Prediction Dataset. (https://www.kaggle.com/datasets/madhuraatmarambhagat/tumor-prediction)
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Stern, U., Shwartz, D. & Weinshall, D. United we stand: Using epoch-wise agreement of ensembles to combat overfit (2024).
- 2. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, Vol. 31 (Curran Associates, Inc., 2018).
- 3. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).
- 4. Banerjee, T. & Paçal, İ. A systematic review of machine learning in heart disease prediction. Turk. J. Biol. 49, 600–634. 10.55730/1300-0152.2766 (2025).
- 5. Bhagat, M. A. Tumor prediction dataset. https://www.kaggle.com/datasets/madhuraatmarambhagat/tumor-prediction (2023; accessed 15 August 2025).
- 6. Kuhn, M. & Johnson, K. Applied Predictive Modeling (Springer, 2013).
- 7. Banerjee, T. et al. A novel unified Inception-U-Net hybrid gravitational optimization model (UIGO) incorporating automated medical image segmentation and feature selection for liver tumor detection. Sci. Rep. 15, 29908. 10.1038/s41598-025-14333-0 (2025).
- 8. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, New York, 2006).
- 9. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
- 10. He, Y. et al. GA-CatBoost-Weight algorithm for predicting casualties in terrorist attacks: Addressing data imbalance and enhancing performance. Mathematics 12, 818 (2024).
- 11. Banerjee, T. et al. A novel hybrid deep learning approach combining deep feature attention and statistical validation for enhanced thyroid ultrasound segmentation. Sci. Rep. 15, 27207. 10.1038/s41598-025-12602-6 (2025).
- 12. Sartori, F. et al. A comprehensive review of deep learning applications with multi-omics data in cancer research. Genes 16, 648. 10.3390/genes16060648 (2025).
- 13. Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874. 10.1016/j.patrec.2005.10.010 (2006).
- 14. Dietterich, T. G. Ensemble methods in machine learning. In Multiple Classifier Systems, Lecture Notes in Computer Science, Vol. 1857, 1–15 (Springer, 2000).
- 15. Arevalo, J., González, F. A., Ramos-Pollán, R., Oliveira, J. L. & Lopez, M. A. G. Representation learning for mammography mass lesion classification with convolutional neural networks. Artif. Intell. Med. 73, 48–59. 10.1016/j.artmed.2016.09.009 (2016).
- 16. Coudray, N. et al. Classification and differentiation of major histological types of lung cancer with deep learning on histopathology images. NPJ Precis. Oncol. 5, 1–10. 10.1038/s41698-021-00162-0 (2021).
- 17. Gøtzsche, P. C. & Jørgensen, K. J. Screening for breast cancer with mammography. Cochrane Database Syst. Rev. 2013, CD001877. 10.1002/14651858.CD001877.pub5 (2013).
- 18. Petticrew, M., Sowden, H., Lister, A. & Wright, D. False-negative results in screening programmes: Systematic review of impact and implications. Int. J. Epidemiol. 30, 138–146. 10.1093/ije/30.1.138 (2001).
- 19. National Cancer Institute. Breast Cancer Screening (PDQ®) (2025). Mammography may miss 6–46% of invasive cancers.
- 20. Ilani, M. A., Shi, D. & Banad, Y. M. T1-weighted MRI-based brain tumor classification using hybrid deep learning models. Sci. Rep. 15, 7010. 10.1038/s41598-025-92020-w (2025).
- 21. Nahiduzzaman, M. et al. A hybrid explainable model based on advanced machine learning and deep learning models for classifying brain tumors using MRI images. Sci. Rep. 15, 1649. 10.1038/s41598-025-85874-7 (2025).
- 22. Alsubai, S. et al. Ensemble deep learning for brain tumor detection. Front. Comput. Neurosci. 10.3389/fncom.2022.1005617 (2022).
- 23. Banerjee, T. et al. Pyramidal attention-based T network for brain tumor classification: A comprehensive analysis of transfer learning approaches for clinically reliable AI hybrid approaches. Sci. Rep. 15, 28669. 10.1038/s41598-025-11574-x (2025).
- 24. Arevalo, J., González, F. A., Ramos-Pollán, R., Oliveira, J. L. & Lopez, M. A. G. Convolutional neural networks for mammography mass lesion classification. In 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 797–800. 10.1109/EMBC.2015.7318464 (IEEE, 2015).
- 25. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. of the 31st International Conference on Neural Information Processing Systems (NeurIPS), 4765–4774 (Curran Associates, Inc., 2017).
- 26. Wu, L. et al. Development and validation of an interpretable machine learning model for predicting the risk of hepatocellular carcinoma in patients with chronic hepatitis B: A case–control study. BMC Gastroenterol. 25, 157. 10.1186/s12876-025-03697-2 (2025).
- 27. El-Dahshan, E.-S. M. A., Hosny, T. & Salem, A.-B. M. Hybrid intelligent techniques for MRI brain images classification. Digit. Signal Process. 20, 433–441. 10.1016/j.dsp.2009.07.002 (2010).
- 28. Pope, W. B. et al. MR imaging correlates of survival in patients with high-grade gliomas. Am. J. Neuroradiol. 26, 2466–2474 (2005).
- 29. Ganie, S. M., Pramanik, P. K. D. & Zhao, Z. Ensemble learning with explainable AI for improved heart disease prediction based on multiple datasets. Sci. Rep. 15, 97547. 10.1038/s41598-025-97547-6 (2025).
- 30. Holzinger, A., Langs, G., Denk, H., Zatloukal, K. & Müller, H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9, e1312. 10.1002/widm.1312 (2019).
- 31. Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195. 10.1186/s12916-019-1426-2 (2019).
- 32. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med. 15, e1002683. 10.1371/journal.pmed.1002683 (2018).
- 33. Eche, T., Schwartz, L. H., Mokrane, F.-Z. & Dercle, L. Toward generalizability in the deployment of artificial intelligence in radiology: Role of computation stress testing to overcome underspecification. Radiol. Artif. Intell. 3, e210097. 10.1148/ryai.2021210097 (2021).
- 34. Park, J. E., Kickingereder, P. & Kim, H. S. Radiomics and deep learning from research to clinical workflow: Neuro-oncologic imaging. Korean J. Radiol. 21, 1126–1137. 10.3348/kjr.2019.0847 (2020).
- 35. Lee, D. et al. Multi-omics single-cell analysis reveals key regulators of HIV-1 persistence and aberrant host immune responses in early infection. eLife 14, RP104856. 10.7554/eLife.104856.3 (2025).
- 36. Kassab, M., Hadhad, Y. & Schünemann, H. J. Clinical impact and quality of randomized controlled trials involving interventions evaluating artificial intelligence prediction tools: A systematic review. NPJ Digit. Med. 4, 136. 10.1038/s41746-021-00524-2 (2021).