BMC Cancer. 2025 Jul 16;25:1178. doi: 10.1186/s12885-025-14462-9

An end-to-end interpretable machine-learning-based framework for early-stage diagnosis of gallbladder cancer using multi-modality medical data

Huiyu Zhao 1,#, Chuang Miao 1,#, Yidi Zhu 2,3, Yijun Shu 2,3, Xiangsong Wu 2,3, Ziming Yin 4, Xiao Deng 1, Wei Gong 2,3,, Ziyi Yang 2,3,, Weiwen Zou 1,
PMCID: PMC12265321  PMID: 40670935

Abstract

Background

The accurate early-stage diagnosis of gallbladder cancer (GBC) is regarded as one of the major challenges in the field of oncology. However, few studies have focused on the comprehensive classification of GBC based on multiple modalities. This study aims to develop a comprehensive diagnostic framework for GBC based on both imaging and non-imaging medical data.

Methods

This retrospective study reviewed 298 clinical patients with gallbladder disease or volunteers from two devices. A novel end-to-end interpretable diagnostic framework for GBC is proposed to handle multiple medical modalities, including CT imaging, demographics, tumor markers, coagulation function tests, and routine blood tests. To achieve better feature extraction and fusion of the imaging modality, a novel global-hybrid-local network, namely GHL-Net, has also been developed. The ensemble learning strategy is employed to fuse multi-modality data and obtain the final classification result. In addition, two interpretable methods are applied to help clinicians understand the model-based decisions. Model performance was evaluated through accuracy, precision, specificity, sensitivity, F1-score, area under the curve (AUC), and Matthews correlation coefficient (MCC).

Results

In both binary and multi-class classification scenarios, the proposed method outperformed the comparison methods on both datasets. In particular, in the binary classification scenario, the proposed method achieved the highest accuracy, sensitivity, specificity, precision, F1-score, ROC-AUC, PR-AUC, and MCC, at 95.24%, 93.55%, 96.87%, 96.67%, 95.08%, 0.9591, 0.9636, and 0.9051, respectively. The visualization results obtained with the interpretable methods also demonstrated high clinical relevance of the intermediate decision-making processes. Ablation studies then provided an in-depth understanding of our methodology.

Conclusion

The machine learning-based framework can effectively improve the accuracy of GBC diagnosis and is expected to have a more significant impact in other cancer diagnosis scenarios.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12885-025-14462-9.

Keywords: Gallbladder cancer, Deep learning, Machine learning, Multi-modality medical data, Interpretable

Introduction

Gallbladder cancer (GBC), a highly aggressive malignancy, poses significant challenges in the field of oncology due to its exceptionally poor prognosis and high mortality rates. The median survival time often barely exceeds one year from diagnosis [1, 2]. One of the most daunting aspects of managing GBC lies in the difficulty of differentiating it from benign gallbladder diseases, which are far more prevalent. The early stages of GBC can mimic benign conditions such as cholecystitis or cholesterol polyps, both in terms of symptoms and imaging findings. Current diagnostic modalities, including ultrasonography (US), computed tomography (CT), and magnetic resonance imaging (MRI), while useful, often fail to distinguish between benign and malignant lesions with sufficient certainty [3]. Therefore, there is an urgent need for novel diagnostic strategies that can improve the accuracy of differentiating GBC from benign gallbladder conditions based on a variety of medical modalities.

Recently, deep learning methods have demonstrated the ability to make accurate clinical diagnoses [4–6], including a number of studies on gallbladder cancer diagnosis. Single-modality data, such as tumor markers [7] and US [8–10], were first utilized in the diagnosis of GBC. Beyond the gallbladder region, the combined examination of peripheral regions such as the liver parenchyma adjacent to the gallbladder was shown to improve the performance of gallbladder lesion characterization [11]. Also, the combination of contrast-enhanced computed tomography (CECT) features and radiomic features was shown to effectively assist clinicians in the early diagnosis of GBC [12]. A survival probability model [13] was developed for patients after GBC surgery using multi-modality data, including CECT imaging, laboratory test results, and systemic treatments. Nevertheless, investigations of comprehensive and interpretable early-stage diagnosis of GBC based on multiple modalities remain limited, and several bottlenecks still hinder a more accurate and effective clinical diagnosis of GBC.

Firstly, tumors in image data may have large intra-class differences and small inter-class variations [14–16], as shown in Fig. 1(A) and Fig. 1(B). Moreover, not only the tumor region should be considered; changes in the gallbladder and its peripheral region may also affect the final diagnosis [11]. These complex scenarios limit the performance of traditional deep learning networks in extracting and utilizing objective features. Therefore, improved feature extraction and fusion methods that handle both large- and small-scale features are needed to enhance diagnostic accuracy. Secondly, most previous work focused only on single-modality processing, mainly US and CT, while the clinical diagnosis of GBC requires consideration of a comprehensive set of diagnostic aspects that single-modality data cannot fully cover [1, 14], as shown in Fig. 1(C). Besides imaging data, other modalities also have significant limitations. For example, demographic data are weakly associated with tumor malignancy; tumor markers and routine blood tests are non-organ-specific; and coagulation assessments are susceptible to inflammatory confounders. Furthermore, because of the huge difference in the number of features between different modalities, efficient methods of fusing multiple modalities to improve the overall diagnostic accuracy need to be investigated. Finally, most deep learning-based diagnostic methods only provide a true-or-false prediction, which does not help clinicians understand the decision-making process of the model [17, 18]. This is why deep learning has long been criticized as a "black box" [19]. Especially in the field of medical diagnosis, the intermediate processes of the network are worth exploring beyond the final prediction. Although the interpretability of US-based diagnosis was addressed in [17], there is no interpretable work on the comprehensive diagnosis of GBC based on multiple modalities.

Fig. 1.


A The challenge of large intra-class difference, B The challenge of small inter-class variation, C The data needed for current gallbladder cancer diagnosis

Inspired by previous literature and the above-mentioned bottlenecks, an end-to-end interpretable framework for GBC diagnosis using multi-modality medical data is designed. First, the top-ranked CT slices and their corresponding local patches are selected. Then, a novel deep learning algorithm is proposed for better feature extraction and fusion, which improves the diagnostic accuracy of the imaging data. The predicted values are further incorporated as a new variable alongside the other modalities. After that, ensemble learning consisting of multiple machine learning algorithms is employed to produce an accurate classification of GBC. Finally, two interpretable methods are applied in the framework to help clinicians better understand the algorithm-based decisions. While multimodal AI has advanced in other oncology domains such as gastrointestinal [20, 21], kidney [22], and colorectal [23] cancers, gallbladder cancer remains uniquely underserved. To the best of our knowledge, we are the first to investigate a comprehensive and interpretable framework for the early-stage diagnosis of GBC based on multi-modality clinical data.

The main contributions can be summarized as follows:

  1. A novel end-to-end framework is designed for the early-stage diagnosis of gallbladder cancer that achieves comprehensive and highly accurate classification. The modalities utilized in this framework include CT imaging, demographics, tumor markers, coagulation function tests, and routine blood tests.

  2. In the processing of image modality, a novel global-hybrid-local network (GHL-Net) is proposed that simultaneously considers the large-scale global information of the gallbladder region and the micro-variations of the tumors. The network includes an MSscEA block and an iAFF block for better multi-scale feature extraction and fusion.

  3. By visualizing the weights of the different variables and highlighting key imaging features, the designed framework is highly interpretable, which not only helps clinicians understand the reasons for model-based decisions but also assists them in making better-reasoned diagnostic decisions.

  4. The designed framework was validated on real clinical datasets. The classification results demonstrate the high performance, robustness, generalization ability, and clinical relevance of the framework.

Materials and methods

Clinical dataset

The gallbladder imaging datasets used in this study were obtained from a tertiary hospital affiliated with Shanghai Jiao Tong University School of Medicine. An internal dataset was collected using a 64-row contrast-enhanced multidetector CT (Siemens Somatom Definition Flash, Siemens Healthcare), which included 99 healthy volunteers and 126 patients with gallbladder disease. An additional dataset consisting of 53 patients with gallbladder disease and 20 healthy volunteers was acquired with another 64-row CT (Brilliance, Philips Medical Systems) and used as an external test set. The flowchart of patient inclusion is shown in Fig. S1 and the detailed characteristics of the datasets are listed in Table 1. The number of CT slices in the internal and external datasets was 44,550 and 13,650, respectively. Non-ionic iodinated contrast medium (Omnipaque 350 mg I/ml, GE Healthcare) was used with a dose of 1.5–2 ml/kg (body weight) and a flow rate of 2 ml/s via power injection. For each sample, the selected CT imaging data consisted of arterial phase images derived from enhanced CT scans with a slice thickness of 0.5 mm. Demographic information (such as sex and age) and clinical parameters (including routine blood tests, coagulation function assessments, and serum tumor marker levels) were retrieved from the subjects' medical records. This research was approved by the Committee for Ethics of Xinhua Hospital, Shanghai Jiao Tong University School of Medicine (SHEC-C-2022–104).

Table 1.

The characteristics of the clinical datasets. Results for age, routine blood test, coagulation function test and tumor markers are expressed as mean ± standard deviation

Internal dataset (n = 225)  External dataset (n = 73)
Sex (male/female) 94/119 44/29
Age, year, mean (SD) 63.43±11.84 64.45±11.17
Diagnosis, n (%)
 Normal gallbladder 99 (44.00) 20 (27.40)
 Benign lesion 64 (28.44) 30 (41.10)
 Malignant lesion 62 (27.56) 23 (31.50)
Routine blood test
 WBC, 10⁹/L 6.83±2.85 6.24±2.63
 Hb, g/L 124.25±21.86 124.25±18.72
 PLT, 10⁹/L 216.29±75.75 219.32±66.87
Coagulation function assessments
 PT, second 11.40±1.05 11.40±1.00
 APTT, second 31.51±3.15 31.00±3.19
 TT, second 14.30±1.32 14.54±1.52
 INR, 10⁻¹ 1.04±0.09 0.99±0.09
 Fib, g/L 3.43±0.74 3.40±0.93
 D-D, mg/L FEU 0.56±1.92 1.34±2.43
Tumor marker
 AFP, ng/mL 4.43±7.04 3.16±4.03
 CEA, ng/mL 8.11±21.30 7.61±19.79
 CA199, U/mL 427.12±1416.79 632.11±1924.41
 CA125, U/mL 41.41±133.39 55.57±164.15
 CA724, U/mL 17.29±72.56 15.22±40.30
 CA153, U/mL 15.55±46.01 14.87±22.38
 CA211, U/mL 12.22±95.40 6.63±11.83
 NSE, ng/mL 18.37±32.48 15.12±9.74

FEU fibrinogen equivalent units

Overall framework

The end-to-end interpretable framework for early-stage GBC diagnosis is shown in Fig. 2. The overall framework can be divided into four phases:

  • Phase one: A segmentation network is employed to filter the top N slices and corresponding patches that are most relevant to the tumor and gallbladder regions from a large amount of redundant image data.

  • Phase two: A three-branch (global-hybrid-local) network, namely GHL-Net, is adopted to extract single-slice features at different scales effectively. Then, the predicted probability values of the image data are obtained by integrating the results of selected slices.

  • Phase three: The prediction results from the image data are treated as a new variable and incorporated into the other laboratory variables. Then, an ensemble learning scheme consisting of multiple machine learning methods is applied to all modalities to obtain the final prediction for GBC diagnosis.

  • Phase four: The classification performance of the proposed framework is assessed. Two interpretable methods are embedded in the framework, which obtains the contribution weights of different variables and the key regions of the GHL-Net in the imaging modality, respectively.

Fig. 2.


The graphical illustration of the proposed end-to-end interpretable framework for gallbladder cancer diagnosis. C: concat

Imaging modality processing

N-Top imaging slices selection

The clinically acquired raw CT imaging contains a considerable number of non-relevant slices, which not only increases the overall processing and training time but also reduces the accuracy of the prediction due to the presence of irrelevant interfering information. Thus, the most relevant slices for GBC diagnosis are first selected based on tumor and gallbladder labels to reduce redundant information. Slice selection is also a critical step in applying a 2D deep learning network to 3D data. The ground-truth labels of the gallbladder and tumor were provided by two clinicians (with 13 and 8 years of experience, respectively), with validity confirmed by inter- and intra-observer agreement. A U-shaped network [24] is then adopted as the backbone for segmenting and filtering the N most relevant slices. The variable N depends on the z-axis resolution as well as the average volume of the gallbladder. Since the current z-axis resolution is 0.5 mm, N = 15 in this work excludes a large number of redundant slices while retaining the critical slices of the tumor and gallbladder. Moreover, GBC diagnosis requires comprehensive evaluation of both the gallbladder with its surrounding tissue and the tumor texture, while suppressing irrelevant background [11]. To effectively capture these features, we cropped square regions at 1/2 and 1/4 of the original image scale centered on the segmentation mask, generating global (whole-region) and local (tumor-focused) patches, respectively. This dual-scale approach simultaneously preserves contextual anatomy and fine-grained pathological details. Finally, a random translation on the x- and y-axes is applied to improve the robustness and generalization ability of the subsequent diagnosis network.
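The slice-selection and dual-scale cropping described above can be sketched as follows. This is a minimal numpy illustration, not the actual pipeline: ranking slices by segmentation-mask area and the boundary clamping are our assumptions, since the paper selects slices via the U-shaped segmentation network.

```python
import numpy as np

def select_top_slices(masks, n_top=15):
    """Rank axial slices by segmentation-mask area and keep the N largest.

    masks: (Z, H, W) binary array of gallbladder/tumor masks, one per slice.
    The mask pixel count is a hypothetical relevance criterion standing in
    for the paper's segmentation-based filtering.
    """
    areas = masks.reshape(masks.shape[0], -1).sum(axis=1)
    return np.argsort(areas)[::-1][:n_top]

def crop_centered(image, mask, scale):
    """Crop a square patch of side `scale * min(H, W)` centered on the mask."""
    h, w = image.shape
    side = int(min(h, w) * scale)
    ys, xs = np.nonzero(mask)
    cy, cx = int(ys.mean()), int(xs.mean())
    # Clamp the window so the crop stays inside the image.
    y0 = min(max(cy - side // 2, 0), h - side)
    x0 = min(max(cx - side // 2, 0), w - side)
    return image[y0:y0 + side, x0:x0 + side]
```

For a selected slice, `crop_centered(img, mask, 0.5)` and `crop_centered(img, mask, 0.25)` would yield the global and local patches, respectively.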

GHL-Net for imaging modality

The proposed global-hybrid-local network (GHL-Net), which contains three branches (global, hybrid, and local), is illustrated in Fig. 3. The global and local branches are architecturally identical and handle global slices and local patches, respectively. Specifically, both branches pass through one shallow module and four deep modules. The block numbers of the modules are 3, 3, 4, 6, and 3, consistent with ResNet-50 [25]. The modules are connected through average pooling layers. The blocks in the shallow module are called CBR blocks, each consisting of a 3×3 convolutional layer, a batch normalization (BN) layer, and a ReLU activation layer. In the deep modules, the multi-scale spatial and channel extraction and attention (MSscEA) block is proposed to address the large intra-class and small inter-class difference problems as well as features with varying locations and sizes [26–29]. The detailed structure of the MSscEA block is shown in Fig. S2(A). After passing through a 1×1 convolutional layer, a BN layer, and a ReLU activation layer, the feature map S ∈ ℝ^(H×W×C) enters a multi-scale feature-extraction layer, where it is evenly divided into four channel subsets s_i ∈ ℝ^(H×W×C/4) that are placed in four sub-branches b_i, i ∈ {1, …, 4}. The inputs of the last two sub-branches are the summation of s_j and the output of b_(j−1), where j ∈ {3, 4}, and all sub-features except s_1 pass through a 3×3 convolutional layer. In this way, the merged output feature s_MSB of the four sub-branches can be expressed as follows:

s_MSB = s_1 + conv_3(s_2) + conv_3(s_3 + conv_3(s_2)) + conv_3(s_4 + conv_3(s_3 + conv_3(s_2)))

where sub-feature map s_1 can be treated as the original feature. The remaining sub-feature maps all have a larger receptive field than s_1. Due to the combinatorial explosion effect, the output feature s_MSB contains features of multiple receptive-field sizes after combining the four sub-branches. The multi-scale features are then imported into the concurrent spatial and channel attention block, where the features are split into three parts after a dilated convolutional layer and a BN layer. The spatial extraction and attention (sEA) blocks, shown in Fig. S2(B), are positioned after max-pooling, with stride values adaptively configured for the distinct sub-feature hierarchies. Unlike conventional sE blocks [27], our architecture implements dual-channel processing to explicitly segregate and weight key anatomical regions from the background, enhancing discriminative feature learning through a foreground channel emphasizing tumor-associated patterns and a suppression channel attenuating irrelevant anatomical noise. Two learnable parameters, T_a and α, are used as a threshold and a ratio to control the proportion of the two features, as expressed below:

F_o = conv_3(α·(F_i × R_k) + (1 − α)·(F_i × R_b))

where R_k = T_a·σ(conv_1(F_i)) and R_b = 1 − T_a·σ(conv_1(F_i)) are the attention maps of the key region and the background region, respectively, σ(·) is the sigmoid function, and × denotes the Hadamard product. A residual connection is applied after the three sub-features are concatenated to stabilize the whole block. At last, a cE block [27] is used to focus on important channels in the feature map. The introduction of the multi-scale and attention mechanisms effectively enhances task-relevant features, which further improves the feature-extraction capability of the network.
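The multi-scale merge can be made concrete with a short numpy sketch. Here a 3×3 box filter stands in for the learned conv_3, and the four sub-branches are combined by elementwise summation as written in the s_MSB equation; the subsequent sEA/cE attention stages are omitted.

```python
import numpy as np

def conv3(x):
    """Stand-in for the 3x3 convolution: a 3x3 box filter with zero
    padding (the real block uses learned weights)."""
    h, w = x.shape[-2:]
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def msscea_merge(feature):
    """Multi-scale merge of the MSscEA block: the (C, H, W) map is split
    channel-wise into four subsets s1..s4; s1 passes through unchanged,
    while each later sub-branch adds the previous branch output before
    its own 3x3 convolution, growing the receptive field step by step."""
    s1, s2, s3, s4 = np.split(feature, 4, axis=0)
    b2 = conv3(s2)
    b3 = conv3(s3 + b2)
    b4 = conv3(s4 + b3)
    # Elementwise sum of the four sub-branches, as in the equation above.
    return s1 + b2 + b3 + b4
```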

Fig. 3.


The architecture of the proposed global-hybrid-local network (GHL-Net)

The hybrid branch sits between the global and local branches and merges the low-dimensional features of both. A hybrid strategy is often used in scenarios where multiple image inputs are analyzed simultaneously [30, 31]. Here we consider global slices and local patches as two feature types with different emphases and therefore add a hybrid structure to increase the collaboration between them. We employ the iterative Attentional Feature Fusion (iAFF) [32] method, as shown in Fig. S3, instead of simple concatenation. Through iAFF, the network realizes the fusion of global and local features at different scales as well as a comprehensive integration of semantic information, which makes the hybrid branch contain richer information. The hybrid branch then passes through one more deep module to obtain the final feature map of that branch.
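A heavily simplified, attention-gated fusion in the spirit of AFF can be sketched as below. This is not the actual iAFF block, which derives its gate from learned multi-scale channel attention and applies the fusion iteratively; the scalar `w` is a hypothetical stand-in for those learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aff_fuse(x, y, w=1.0):
    """Attention-gated fusion of a global (x) and a local (y) feature map:
    a sigmoid gate computed from the summed features decides, per element,
    how much of each branch to keep."""
    gate = sigmoid(w * (x + y))          # attention weights in (0, 1)
    return gate * x + (1.0 - gate) * y   # soft selection between branches
```

When one branch dominates (large activations), the gate saturates toward that branch; when both are weak, the fusion falls back to an even mixture.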

After obtaining the feature maps of the three branches separately, a feature fusion strategy is employed. Each branch ends with a fully connected layer that transforms the extracted features into a one-dimensional vector. The features from the three branches are then joined to produce a single-slice prediction. Note that the prediction results are probability values for each class rather than discrete labels. The cross-entropy loss is used as the loss function for training the imaging modality:

Loss_CE = −∑_{i=1}^{M} ∑_{c=1}^{K} y_{ic} · log h_θ(x_{ic})

where M represents the number of samples, K is the number of categories, y_{ic} is the one-hot coding of the target value, taking 1 if the true category of sample x_i equals c and 0 otherwise, and h_θ(x_{ic}) represents the predicted probability that sample x_i belongs to category c. The probability value is more indicative of the uncertainty in the prediction and is more informative for the subsequent process than discrete values. The image-based prediction for each case is then derived by averaging the probabilistic results of the multiple slices, expressed as (1/N)·∑_{j=1}^{N} ρ_j, where ρ_j stands for the predicted class-probability vector of slice j. Due to the high uncertainty of individual slice predictions, this averaging strategy may effectively improve the accuracy and robustness of the diagnosis based on the image modality.
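The per-case aggregation can be sketched as follows: softmax each selected slice's logits into class probabilities, then average over the N slices, matching (1/N)·∑ ρ_j above (the logits-to-probability softmax step is an assumption about the network head).

```python
import numpy as np

def case_prediction(slice_logits):
    """Case-level prediction from N selected slices.

    slice_logits: (N, K) array of per-slice class logits.
    Returns the averaged class-probability vector and the predicted class.
    """
    logits = np.asarray(slice_logits, dtype=float)
    z = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)                    # (K,) per-class average
    return mean_probs, int(mean_probs.argmax())
```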

Ensemble learning for multi-modalities

In order to solve the problem of large differences in the number of parameters of each modality and better match the clinical diagnostic process, we consider the prediction of the image data as an additional variable to be combined with other laboratory modalities and obtain a comprehensive diagnostic prediction. This step is consistent with the existing sequence of clinical diagnosis, in which a more comprehensive examination is usually performed only when it is deemed needed [1]. Also, this process ensures that the amount of data from different modalities is approximately the same. The full list of modalities and variables is listed in Table S1, which can be categorized as imaging, demographic, tumor marker, coagulation function test, and routine blood test.

Due to the relatively small number of parameters in the laboratory data compared to the images, general machine learning is able to deliver promising results with high interpretability. Here we employ a stacking classifier [33] to enhance diagnostic accuracy and model robustness through an ensemble learning strategy. This technique strategically combines heterogeneous base classifiers via a meta-classifier [34, 35], capitalizing on their complementary strengths while mitigating individual limitations, which is particularly crucial for handling the multimodal data disparities of GBC. As illustrated in phase three of Fig. 2, the proposed stacking classifier can be divided into two stages. The first stage consists of several machine learning classifiers, including CatBoost, support vector machine (SVM), Decision Tree, Random Forest, k-nearest neighbors (KNN), and multi-layer perceptron (MLP). These heterogeneous algorithms complement each other's inductive biases, ensuring maximum diversity among the stacked learners. In the second stage, the predictions of these classifiers are combined using Linear Regression (LR) as the meta-classifier to obtain the final prediction. LR provides an interpretable linear weighting of the base models, which aligns with the explanation of modality contributions described below. By fusing the prediction results of multiple base classifiers, the characteristics of the dataset can be captured from different aspects [36, 50]. In this way, not only is the overfitting problem of individual machine-learning models alleviated, but the overall classification performance is also improved.
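A two-stage stacking classifier of this shape can be sketched with scikit-learn. CatBoost is omitted here to keep the sketch dependency-free, scikit-learn's LogisticRegression stands in for the linear meta-learner (the paper names Linear Regression as the meta-classifier), and the synthetic features merely stand in for the fused laboratory-plus-imaging table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fused table: 20 laboratory variables plus the
# image-based probability as one extra column.
X, y = make_classification(n_samples=200, n_features=21, random_state=0)

# First stage: heterogeneous base classifiers; second stage: a linear
# meta-learner on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=0)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("mlp", MLPClassifier(max_iter=500, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
pred = stack.predict(X)
```

The `cv=5` argument makes the meta-learner train on out-of-fold base predictions, which is what mitigates the overfitting of individual base models.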

Interpretability of the method

The “black box” nature of deep learning makes it hard for researchers to understand the results produced by the network. This bottleneck becomes acute when clinicians need to understand the reasons behind diagnostic results [17, 18]. Recently, interpretable deep learning methods have received considerable attention [51]. Here, the interpretability of the overall architecture is addressed in two aspects: the contribution weights of the different variables and the key regions of the image.

First, SHapley Additive exPlanations (SHAP) [37] is applied to obtain the weights of the different modalities. Its game-theoretic foundations and unifying framework have led SHAP to be adopted by many researchers for interpretability tasks [38, 39, 49]. Using SHAP, each feature is assigned an importance value that indicates its impact on the model prediction when the feature is included. Specifically, two models are trained, with and without the inclusion of a specific feature, and their predictions on the current inputs are compared to obtain the marginal contributions. In this way, the Shapley value φ_i of feature i, computed over all feature subsets S of the feature set F that do not include feature i, is obtained as follows:

φ_i = ∑_{S ⊆ F∖{i}} [|S|! · (n − |S| − 1)! / n!] · (f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S))

where f(·) is the prediction model and n = |F| is the total number of features. The principle by which SHAP accounts for the importance of a particular feature i in the prediction for a single sample is as follows:

f(x) = g(z′) = φ_0 + ∑_{i=1}^{M} φ_i · z′_i

where g(·) is the interpretation model that maps back to the original function via the mapping x = h_x(z′), with z′ as the simplified input, and φ_0 = f(h_x(0)) is the baseline value with no input. Due to the global interpretability of SHAP, the overall behavior of the model is revealed. By calculating the SHAP values of the different variables, the importance of each variable in a given model can be obtained. Lastly, these importance values are normalized and summed to obtain the final weights of all modalities in the stacking classifier.
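The Shapley formula above can be checked with a brute-force implementation that enumerates every subset (tractable only for a handful of features; the SHAP library approximates this efficiently). The toy additive `model` in the usage below is our own illustration, not the paper's stacking classifier.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, model):
    """Exact Shapley values: for each feature i, enumerate every subset S
    of F without i and weight the marginal contribution
    f(S ∪ {i}) − f(S) by |S|!(n − |S| − 1)!/n!, as in the formula above.

    model: callable taking a set of feature names and returning a score.
    """
    n = len(features)
    phi = {}
    for i in features:
        rest = [f for f in features if f != i]
        total = 0.0
        for r in range(len(rest) + 1):
            for subset in combinations(rest, r):
                weight = (factorial(len(subset))
                          * factorial(n - len(subset) - 1) / factorial(n))
                total += weight * (model(set(subset) | {i}) - model(set(subset)))
        phi[i] = total
    return phi
```

For a purely additive model, each feature's Shapley value equals its own weight, which makes the implementation easy to sanity-check.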

For the presentation of important image features, a modified occlusion algorithm [40] is employed, as shown in Algorithm 1. The occlusion algorithm is a perturbation-based method that determines the importance of a region from the change in the prediction provided by the network when that region is masked. Its advantage is the ability to directly assess the importance of each region based on its impact on the results. Inspired by the multi-scale strategy in [41], we use three sliding-stride sizes S_1, S_2, S_3, with corresponding mask sizes, to obtain three attribution maps Y_1, Y_2, Y_3, respectively. The final merged attribution map is then Y = (Y_1 + Y_2 + Y_3)/3. By fusing the results from multiple scales, the poor robustness of a single scale can be effectively mitigated. Further, by associating the computed attribution maps with the ground-truth tumor masks, we confirm that the proposed machine-learning-driven automated detection method may also approach clinical criteria for accurate diagnosis.
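The multi-scale occlusion of Algorithm 1 can be sketched as below: a zero mask slides over the image at three (stride, mask-size) scales, the score drop is accumulated per pixel, and the three attribution maps are averaged as Y = (Y_1 + Y_2 + Y_3)/3. The concrete scale values and the zero-fill occlusion value are assumptions.

```python
import numpy as np

def occlusion_map(image, score_fn, stride, mask_size):
    """Single-scale occlusion: slide a zero mask over the image and record
    how much the model's score drops at each covered pixel."""
    h, w = image.shape
    attr = np.zeros((h, w))
    counts = np.zeros((h, w))
    base = score_fn(image)
    for y in range(0, h - mask_size + 1, stride):
        for x in range(0, w - mask_size + 1, stride):
            occluded = image.copy()
            occluded[y:y + mask_size, x:x + mask_size] = 0.0
            drop = base - score_fn(occluded)
            attr[y:y + mask_size, x:x + mask_size] += drop
            counts[y:y + mask_size, x:x + mask_size] += 1
    return attr / np.maximum(counts, 1)

def multiscale_occlusion(image, score_fn, scales=((2, 4), (4, 8), (8, 16))):
    """Average the attribution maps of three (stride, mask) scales:
    Y = (Y1 + Y2 + Y3) / 3."""
    maps = [occlusion_map(image, score_fn, s, m) for s, m in scales]
    return sum(maps) / len(maps)
```

Regions whose occlusion causes a large score drop receive high attribution, so a model relying on the tumor region yields a map that peaks there.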


Algorithm 1 Computing image classification attribution matrix based on the multi-scale occlusion methods

Implementation details

To ensure robust model evaluation, we implemented five-fold cross-validation on the internal dataset, selecting optimal parameters based on maximum validation accuracy for subsequent independent testing. To address class imbalance in the internal training set, we implemented a comprehensive data augmentation strategy involving random translation (±15% of image dimensions), rotation ([−45°, 45°]), and scaling (×[0.8, 1.2]) for all samples. After augmentation, we standardized the training set by balancing all classes to 100 cases each, ensuring equal representation during model optimization. All image and laboratory data were normalized before being fed into the model. In our experiments, the proposed network was optimized with the Adam algorithm [42]. The initial learning rate was set to 10⁻³. All methods were trained for 100 epochs, with the learning rate decaying every 20 epochs. The batch size was set to 4. All networks were implemented in PyTorch 2.1.0 and trained on two NVIDIA GeForce RTX 3090 GPUs.

Comparison settings

Given the absence of prior multimodal AI studies specifically targeting GBC, we rigorously evaluate our method against three state-of-the-art (SOTA) multimodal medical frameworks to demonstrate its methodological novelty and performance. We searched the literature published since 2020 and selected the following comparison methods: Wang et al. [43], Lin et al. [44], and Zhou et al. [45]. The configurations of these multi-modality medical diagnosis methods are as follows. Wang et al. [43] divided brain MRI images into multiple slices and fused them with non-imaging data after extracting features; an SVM was then used to obtain a diagnosis of Alzheimer's disease. Lin et al. [44] used a CNN and traditional feature extraction methods to extract two types of imaging features, which were then fused with one-hot metadata for skin lesion classification. Zhou et al. [45] designed the SCResNet and HoFN algorithms to extract features from images and EHRs, respectively, and predicted the severity of COVID-19 through a fully connected layer. To ensure fairness, all methods used the same datasets and experimental settings.

Evaluation metrics

The evaluation metrics include Accuracy (Acc), Precision (Pre), Specificity (Sp), Sensitivity (Se), F1-score, area under the receiver operating characteristic curve (ROC AUC), Precision-Recall AUC (PR AUC), and Matthews correlation coefficient (MCC) [46]. All of the above results are obtained from the confusion matrices of true positives, true negatives, false positives, and false negatives. Moreover, the Dice value is used to evaluate the segmentation accuracy and the similarity between the threshold-based focus regions and the clinician labels for gallbladder tumors or cholecystitis. The Dice coefficient is mathematically equivalent to the F1-score.
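For the binary case, all of these metrics follow directly from the confusion matrix. The sketch below derives them for a confusion matrix (tp = 58, fp = 2, tn = 62, fn = 4) that is consistent with the internal binary results reported in this paper; the helper name is ours.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Binary classification metrics derived from the confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp)
    se = tp / (tp + fn)                     # sensitivity / recall
    sp = tn / (tn + fp)                     # specificity
    f1 = 2 * pre * se / (pre + se)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Pre": pre, "Se": se, "Sp": sp, "F1": f1, "MCC": mcc}
```

With tp = 58, fp = 2, tn = 62, fn = 4, this reproduces Acc ≈ 95.24%, Se ≈ 93.55%, Sp ≈ 96.87%, Pre ≈ 96.67%, F1 ≈ 95.08%, and MCC ≈ 0.9051.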

Results

Segmentation results

The accuracy of gallbladder and tumor segmentation was first evaluated. The precision, sensitivity, and Dice values were 89.63%, 87.01%, and 0.883 for the gallbladder, and 86.99%, 83.43%, and 0.852 for the tumor, respectively. The segmentation process serves primarily to select relevant slices and extract corresponding patches. Given this methodological design and the additional spatial variation introduced by the random translation, the achieved segmentation quality proves adequate to support downstream classification tasks.

Classification performance in different scenarios

Two classification scenarios were considered in this work: the benign-malignant binary classification scenario and the normal-benign-malignant multi-class classification scenario. The binary scenario excludes normal cases and is more relevant to clinical diagnosis. The multi-class scenario, on the other hand, is more comprehensive and shows the capability of the methods to handle multiple classes simultaneously. The performance of the proposed method compared with the other methods is shown in Table 2, and the confusion matrix of the proposed method is shown in Fig. S4. In the binary classification scenario, the proposed method achieved 95.24% accuracy, with only 6 out of 126 cases misjudged. In contrast, the accuracies of the three comparison methods were much lower (82.91%, 86.51%, and 82.54%, respectively). The proposed method achieved 93.55%, 96.87%, 96.67%, and 95.08% for sensitivity, specificity, precision, and F1-score, respectively. Its superiority was also verified by a ROC AUC of 0.9591. In the multi-class classification scenario, the accuracy, F1-score, and ROC AUC were 92.44%, 92.18%, and 0.9190, respectively. Despite the decrease in performance after the introduction of normal cases, the proposed method still obtained the best results in all evaluations.

Table 2.

The classification evaluation results of various methods in both binary classification and multi-class classification scenarios

Binary classification results

| Dataset | Method | Acc (%) | Pre (%) | Sp (%) | Se (%) | F1-score (%) | ROC AUC | PR AUC | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Internal | Wang et al. [37] | 82.91 | 79.65 | 78.33 | 87.72 | 83.33 | 0.8714 | 0.8624 | 0.6623 |
| Internal | Lin et al. [38] | 86.51 | 86.89 | 87.50 | 85.48 | 86.18 | 0.8950 | 0.9072 | 0.7301 |
| Internal | Zhou et al. [39] | 82.54 | 81.25 | 81.25 | 83.87 | 82.54 | 0.8443 | 0.8486 | 0.6512 |
| Internal | Proposed | 95.24 | 96.67 | 96.87 | 93.55 | 95.08 | 0.9591 | 0.9636 | 0.9051 |
| External | Wang et al. [37] | 79.25 | 75.00 | 80.00 | 78.26 | 76.60 | 0.8216 | 0.8144 | 0.5801 |
| External | Lin et al. [38] | 84.91 | 80.00 | 83.33 | 86.96 | 83.33 | 0.8765 | 0.8672 | 0.6979 |
| External | Zhou et al. [39] | 77.36 | 70.37 | 73.33 | 82.61 | 76.00 | 0.8354 | 0.7953 | 0.5546 |
| External | Proposed | 92.45 | 88.00 | 90.00 | 95.65 | 91.67 | 0.9348 | 0.9215 | 0.8504 |

Multi-class classification results

| Dataset | Method | Acc (%) | Pre (%) | Sp (%) | Se (%) | F1-score (%) | ROC AUC | PR AUC | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Internal | Wang et al. [37] | 82.40 | 81.11 | 91.36 | 82.12 | 81.51 | 0.8108 | 0.8076 | 0.7363 |
| Internal | Lin et al. [38] | 85.33 | 84.45 | 92.73 | 85.07 | 84.72 | 0.8486 | 0.8642 | 0.7793 |
| Internal | Zhou et al. [39] | 85.78 | 83.94 | 92.47 | 84.41 | 84.17 | 0.8675 | 0.8869 | 0.7785 |
| Internal | Proposed | 92.44 | 92.20 | 96.15 | 92.16 | 92.18 | 0.9190 | 0.9326 | 0.8845 |
| External | Wang et al. [37] | 79.45 | 79.36 | 89.61 | 79.98 | 79.61 | 0.8403 | 0.8243 | 0.7055 |
| External | Lin et al. [38] | 82.19 | 81.93 | 91.09 | 82.75 | 82.16 | 0.8709 | 0.8688 | 0.7398 |
| External | Zhou et al. [39] | 83.56 | 83.53 | 91.68 | 84.20 | 83.80 | 0.8921 | 0.8837 | 0.7669 |
| External | Proposed | 90.41 | 90.46 | 95.08 | 90.65 | 90.52 | 0.9134 | 0.9346 | 0.8631 |
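The binary-classification metrics above all follow from a single 2×2 confusion matrix. As an illustrative sketch, the snippet below recomputes them from counts (TP = 58, FP = 2, TN = 62, FN = 4) that we inferred to be consistent with the reported 126 internal cases and 6 errors; these counts are an assumption for illustration, not published data.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Standard metrics derived from a 2x2 confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    se = tp / (tp + fn)            # sensitivity (recall)
    sp = tn / (tn + fp)            # specificity
    pre = tp / (tp + fp)           # precision
    f1 = 2 * pre * se / (pre + se)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, se, sp, pre, f1, mcc

# Counts inferred to match the reported internal binary results
# (126 cases, 6 misclassified) -- an illustrative assumption.
acc, se, sp, pre, f1, mcc = binary_metrics(tp=58, fp=2, tn=62, fn=4)
print(f"acc={acc:.4f} se={se:.4f} sp={sp:.4f} "
      f"pre={pre:.4f} f1={f1:.4f} mcc={mcc:.4f}")
```

With these counts the snippet reproduces the reported 95.24% accuracy, 93.55% sensitivity, 96.67% precision, 95.08% F1-score, and 0.9051 MCC (specificity 96.875% appears in the table rounded down to 96.87%).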

Further, the robustness and generalization ability of the different methods were tested on an independent external dataset. On this dataset, the proposed method achieved accuracy, sensitivity, specificity, precision, F1-score, and AUC of 92.45%, 95.65%, 90.00%, 88.00%, 91.67%, and 0.9348, respectively, in the binary classification scenario, and 90.41%, 90.65%, 95.08%, 90.46%, 90.52%, and 0.9134 in the multi-class classification scenario. Relative to the internal dataset, all methods showed some decrease in performance, likely caused by residual overfitting and dataset-specific bias. Even so, the proposed method showed the smallest degradation and the best performance across metrics in both scenarios, which illustrates its robustness.

Weights of different modalities

The absolute SHAP values of the twenty variables are illustrated in Fig. 4, including the impact plot (Fig. 4(A)) and the per-observation plot (Fig. 4(B)). By measuring the SHAP values of different variables, we can gain insight into the importance of each variable to the classification result. Overall, the two plots showed very similar outcomes, and the importance of each variable was generally consistent with the clinical diagnosis of GBC [47]. CT imaging emerged as the dominant predictor (mean absolute SHAP = 0.208), reflecting its established role in clinical diagnosis. Tumor markers, including CA211, CA724, CA153, CA199, CA125, and CEA, were the second most discriminative features for the classification task, with the combined contribution of multiple biomarkers exceeding the impact of the radiological features. Although the lack of tumor markers with high sensitivity and specificity in GBC makes judgments based on markers alone unreliable [40], tumor markers still provide an intuitive clinical indicator of tumor. Coagulation profiles and routine blood tests showed context-dependent utility, though variables such as PT, TT, and Hb also had a proportionate impact on the classification results. In real clinical use, importance results that are highly consistent with clinical diagnostic logic can confirm the validity of the machine-based results and highlight abnormal indicators in particular cases.
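A hedged sketch of how the Fig. 4(A)-style ranking is typically derived: average the absolute SHAP value of each feature over all observations, then sort. The feature names and SHAP values below are invented for illustration and are not the study's data.

```python
# Toy SHAP matrix: rows = observations, columns = features.
# Values and feature names are illustrative, not from the study.
shap_values = [
    [ 0.30, -0.05,  0.02],
    [-0.25,  0.10, -0.01],
    [ 0.20, -0.08,  0.03],
]
features = ["CT_imaging", "CA199", "PT"]

# Mean absolute SHAP per feature = global importance (bar-plot style).
importance = {
    name: sum(abs(row[j]) for row in shap_values) / len(shap_values)
    for j, name in enumerate(features)
}
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # CT_imaging first (0.25), then CA199 (~0.077), then PT (0.02)
```

Sorting by mean |SHAP| rather than raw SHAP keeps features important regardless of whether they push predictions toward the benign or the malignant class.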

Fig. 4.


The SHapley additive explanations (SHAP) summary of twenty features with the highest mean absolute SHAP values. A The impact plot; B The per observations plot for the proposed method

Vision feature map results of imaging

The attribution maps for malignant and benign cases obtained with the occlusion algorithm are shown in Fig. 5(A) and Fig. 5(B), and demonstrate a spatial decision logic consistent with clinical expertise. Since the ground-truth masks of malignant and benign tumors were labeled by clinicians in the internal dataset, we further set a threshold on the attribution map and used it to assess the performance of the occlusion algorithm as well as the proposed GHL-Net. When the normalized threshold was set to 0.75, the Dice values on malignant and benign cases were 0.526 and 0.612, respectively. It should be noted that attribution maps serve to verify the consistency between the network's attention and the clinically relevant regions, rather than to provide pixel-perfect segmentation. Together, the quantitative evaluation and visual analysis suggest that the key clinical features have been well captured by the proposed network. This anatomically grounded interpretability provides radiologists with targeted attention regions, bridging AI logic with real-world diagnostic workflows.
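The occlusion idea can be sketched as follows: slide a zero-filled patch over the image and record the drop in the model's output as that region's attribution. The toy "model" below, which simply sums intensities in a fixed 2×2 region, is an illustrative assumption and not the GHL-Net.

```python
# Minimal occlusion-attribution sketch on a toy 6x6 "image".
# The stand-in model scores images by intensity in a fixed 2x2 "tumor"
# region (rows/cols 2-3) -- an illustrative assumption only.

def model_score(img):
    return sum(img[r][c] for r in (2, 3) for c in (2, 3))

def occlusion_map(img, patch=2):
    n = len(img)
    base = model_score(img)
    attr = [[0.0] * n for _ in range(n)]
    for r in range(0, n, patch):
        for c in range(0, n, patch):
            occluded = [row[:] for row in img]
            for rr in range(r, r + patch):          # zero-fill the occluder
                for cc in range(c, c + patch):
                    occluded[rr][cc] = 0.0
            drop = base - model_score(occluded)     # score drop = attribution
            for rr in range(r, r + patch):
                for cc in range(c, c + patch):
                    attr[rr][cc] = drop
    return attr

img = [[1.0] * 6 for _ in range(6)]
attr = occlusion_map(img)
# Attribution peaks on the patch covering the "tumor" region.
print(attr[2][2], attr[0][0])  # 4.0 0.0
```

In practice the resulting map is normalized to [0, 1] before thresholding, which is how the 0.75 cut-off above is applied before computing Dice against the clinician masks.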

Fig. 5.


The visualization of important image features from (A) malignant samples and (B) benign samples when using the modified occlusion algorithm. The plots from left to right are the original image, the ground truth mask labeled by radiologists, and the attribution map provided by the occlusion algorithm with the corresponding Dice value. The 0.75 threshold lines are represented as black lines

Effectiveness of proposed feature extraction and fusion methods

To investigate the effect of different modules on classification, we compared the performance of multiple feature extraction methods in Fig. 6 and Table S2, and of feature fusion methods on the imaging modality in Fig. 7 and Table S3. As shown in Table S2, the designed MSscEA block improved the accuracy of the global slices and local patches by 7.93% and 3.96%, reaching 76.98% and 84.92%, respectively. The local branch's tighter focus on the tumor region is a likely reason for its better key-feature extraction and classification performance. We further compared the classification results of different data fusion methods, listed in Table S3. Merging the global and local features with a simple concatenation did not fit better than the local branch alone, indicating that plain concatenation does not help the network. In contrast, introducing a hybrid branch yielded better results than the combination of the global and local branches, as well as either branch individually. On this basis, we further introduced the iAFF fusion module to form the final fusion method, achieving an optimal classification accuracy of 91.27%. Lastly, the improvement brought by the N-top slice selection strategy is shown in Table S4. These outcomes demonstrate the effectiveness of the designed strategy and blocks for image data classification.
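The attentional-fusion idea behind the iAFF block can be illustrated with a scalar stand-in: a sigmoid gate computed from the initial fusion decides, elementwise, how to blend two feature vectors, instead of simply concatenating them. The gate parameter `w` below is a hypothetical stand-in for the learned multi-scale channel attention of the real block.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attentional_fuse(x, y, w):
    """Gate between two feature vectors: a*x + (1-a)*y, elementwise.
    `w` stands in for learned attention parameters; the real iAFF
    block computes `a` with multi-scale channel attention and then
    iterates the fusion a second time."""
    fused = []
    for xi, yi in zip(x, y):
        a = sigmoid(w * (xi + yi))   # attention from the initial fusion
        fused.append(a * xi + (1 - a) * yi)
    return fused

global_feat = [0.2, 0.9, -0.4]   # e.g. global-branch features (illustrative)
local_feat = [1.0, 0.1, 0.6]     # e.g. local-branch features (illustrative)
out = attentional_fuse(global_feat, local_feat, w=2.0)
print(out)  # each element lies between the two input values
```

Because each output is a convex combination of the two inputs, the fusion can adaptively favor whichever branch carries the stronger signal, which plain concatenation cannot do without extra learned layers.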

Fig. 6.


Comparison of multiple evaluation metrics with different feature extraction methods on local and global branches in the binary classification scenario

Fig. 7.


Comparison of multiple evaluation metrics with different feature fusion methods in the binary classification scenario

Effectiveness of multi-modalities and ensemble learning

After the ablation studies on the imaging modality, we compared the performance of each modality and of different machine learning methods. We categorized the data into image data and laboratory data and tested the AUCs of each type as well as their combination, as shown in Fig. 8. Although the AUC of each single clinical variable group was quite low, the performance of the combined clinical data (AUC = 0.8949 and 0.9034, respectively) was very similar to that of the image data (AUC = 0.9039 and 0.8895, respectively) in both the binary and multi-class classification scenarios. Nevertheless, the AUC values improved further (to 0.9591 and 0.9190, respectively) after the two modalities were fused, indicating that cases misclassified by one modality can be corrected by the addition of the other, leading to higher accuracy.
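The decision-level fusion step itself can be sketched in a few lines: the image branch's predicted probability is appended to the laboratory feature vector before it reaches the ensemble classifier. All variable names and values below are illustrative assumptions.

```python
# Decision-level fusion: the image branch's malignancy probability is
# appended to the laboratory variables as one extra feature, so the
# downstream ensemble sees both modalities on an equal footing.
# Names and values here are illustrative, not from the study.

def fuse_modalities(lab_features, image_probability):
    """Return the fused feature vector fed to the ensemble classifier."""
    assert 0.0 <= image_probability <= 1.0
    return lab_features + [image_probability]

lab = [37.2, 5.1, 0.8]   # e.g. CA199, CEA, PT values (illustrative)
p_img = 0.91             # image-branch malignancy probability (illustrative)
fused = fuse_modalities(lab, p_img)
print(fused)  # [37.2, 5.1, 0.8, 0.91]
```

This keeps the imaging contribution to a single scalar, sidestepping the orders-of-magnitude mismatch in dimensionality between raw CT features and laboratory variables.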

Lastly, we measured the results of different machine learning methods and of the stacking ensemble learning strategy, as listed in Table 3. The stacking ensemble module exhibited the best results, with accuracy, sensitivity, specificity, precision, F1-score, and AUC of 95.24%, 93.55%, 96.87%, 96.67%, 95.08%, and 0.9591, respectively. Through the ensemble learning strategy, we improved the accuracy, sensitivity, specificity, precision, F1-score, and AUC by 7.94%, 1.61%, 14.06%, 12.85%, 7.39%, and 0.0786, respectively, compared to CatBoost, the best of the single machine learning methods.

Table 3.

Comparison of multiple evaluation metrics with different machine learning methods in the binary classification scenario

| Method | Acc (%) | Pre (%) | Sp (%) | Se (%) | F1-score (%) | ROC AUC | PR AUC | MCC |
|---|---|---|---|---|---|---|---|---|
| CatBoost | 87.30 | 83.82 | 82.81 | 91.94 | 87.69 | 0.8805 | 0.8764 | 0.7498 |
| SVM | 79.37 | 80.00 | 81.25 | 77.42 | 78.69 | 0.8124 | 0.8043 | 0.5873 |
| Decision Tree | 84.12 | 81.82 | 81.25 | 87.10 | 84.38 | 0.8793 | 0.8702 | 0.6842 |
| Random Forest | 82.54 | 80.30 | 79.69 | 85.48 | 82.81 | 0.8627 | 0.8544 | 0.6524 |
| KNN | 77.78 | 78.33 | 79.69 | 75.81 | 77.05 | 0.8032 | 0.7842 | 0.5555 |
| MLP | 67.46 | 66.67 | 67.19 | 67.74 | 67.20 | 0.6844 | 0.6709 | 0.3493 |
| Ensemble learning | 95.24 | 96.67 | 96.87 | 93.55 | 95.08 | 0.9591 | 0.9538 | 0.9052 |
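A simplified sketch of the stacking idea: base learners are evaluated on held-out data and a meta-level combiner weights their votes. Here the base learners are fixed threshold rules and the combiner is a validation-accuracy-weighted majority vote, an illustrative stand-in for the paper's CatBoost/SVM/tree base models and trained meta-learner.

```python
# Simplified stacking-style ensemble. The study stacks CatBoost, SVM,
# decision tree, etc. under a trained meta-learner; the fixed rules and
# accuracy weighting below are an illustrative stand-in only.

def make_rule(idx, thr):
    return lambda x: 1 if x[idx] > thr else 0

base_learners = [make_rule(0, 0.5), make_rule(1, 0.5)]

# Toy validation set: label = 1 iff both features are high.
X = [[0.9, 0.8], [0.9, 0.2], [0.1, 0.9], [0.2, 0.1]]
y = [1, 0, 0, 0]

def accuracy(f, X, y):
    return sum(f(x) == t for x, t in zip(X, y)) / len(y)

# Meta level: weight each base learner by its validation accuracy.
weights = [accuracy(f, X, y) for f in base_learners]

def ensemble_predict(x):
    score = sum(w * f(x) for w, f in zip(weights, base_learners))
    return 1 if score > sum(weights) / 2 else 0  # weighted majority vote

preds = [ensemble_predict(x) for x in X]
print(preds, accuracy(ensemble_predict, X, y))  # [1, 0, 0, 0] 1.0
```

On this toy set each base rule alone reaches 75% accuracy, while the combined vote corrects both of their individual errors, mirroring how the stacking module outperforms every single learner in Table 3.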

Discussion

In this paper, an end-to-end interpretable framework for multi-modality gallbladder cancer diagnosis is proposed. First, the global slices containing the entire gallbladder and peripheral regions and the corresponding local patches of the tumor are filtered. Then, a global-hybrid-local network (GHL-Net) is designed for better extraction and fusion of imaging features across multiple scales. The prediction results of imaging are integrated into the laboratory data as an additional variable. After that, the ensemble learning strategy is employed to derive the final classification result of gallbladder cancer. To validate the effectiveness of the proposed framework, we conducted experiments with two real clinical datasets. The results demonstrate the high accuracy and generalization ability of our model classification compared to other state-of-the-art methods. Furthermore, two interpretable methods are utilized to verify the correlation between the model-based diagnostic results and the clinical reality, as well as illustrate the assistance of the method in the diagnosis of gallbladder cancer.

Early and accurate diagnosis of tumors has always been a critical focus of clinical research and a challenge for existing technology. First, tumors vary greatly in their imaging presentation, so methods that capture both detailed and global features need to be considered. Second, complex diseases often cannot be diagnosed from single-modality data but rather require a combination of multi-modality data, since each modality captures distinct but partial facets of pathophysiology; effective fusion methods across modalities therefore need to be explored. In addition, a bare deep-learning output is often not accepted by clinicians, so it is also critical to make methods understandable and supportive throughout the clinical decision-making process.

To address the above issues, the source images are first filtered based on the clinical labels. The screened data not only eliminates the disturbance to the classification model caused by the redundant data but also forms a global slice and corresponding local patches based on critical tumor features. Correspondingly, inspired by [30, 31, 48], a three-branch network called the global-hybrid-local network (GHL-Net) is proposed for the high-accuracy classification of image data over multiple scales. With the modified MSscEA block and the iAFF fusion block, the designed GHL-Net achieves outstanding classification results on single-slice images, as shown in Table S2 and Table S3. The MSscEA block provides a larger range of receptive fields and represents multi-scale features at a granular level, which improves the feature extraction and attention ability of a single branch. The iAFF block, on the other hand, fuses the feature information of different scales well, which could improve the final classification accuracy. In the separate global and local branches, there is an improvement of 7.93% and 3.96% in accuracy compared to the backbone. The accuracy is further improved to 91.27% after multi-branch fusion, which proves the effectiveness of the proposed blocks and network.

In order to better match the clinical diagnostic process of gallbladder cancer and to fully utilize the information from multiple modalities, the probability output of the imaging modality is treated as an additional variable alongside the laboratory modalities. This is similar to the idea in [49], which obtained two prediction results from image data and non-image data respectively and then fused them with the CatBoost method to obtain the final classification. This multimodal fusion achieves synergistic confidence enhancement by integrating complementary diagnostic evidence from the modalities and suppressing modality-specific noise, thereby improving diagnostic robustness. As demonstrated by the comparative differences in Table S3 and Table 3, multimodal fusion improves diagnostic accuracy by 3.97% and AUC by 0.0552, highlighting its critical role in clinical reliability. Decision-level fusion also mitigates the problem of parameter imbalance: the amount of data often varies by several orders of magnitude across clinical modalities, and feature-upscaling methods such as diagonal-matrix projection remain questionable in the face of such large differences in parameter counts. The results on the internal and external datasets presented in Table 2 validate the effectiveness of the designed framework against the comparison models. Moreover, the proposed framework avoids the missing-modality problem: as shown in Fig. 8, even a single image or laboratory modality can provide a satisfactory result in both binary and multi-class classification scenarios. Overall, the global-hybrid-local framework and its network architecture significantly enhance the prediction accuracy on imaging data, while the ensemble learning approach further improves both predictive accuracy and overall robustness through multimodal data fusion.

Fig. 8.


The performance of each modality and their combination in the internal dataset. The receiver operating characteristics are figured in orange, and 95%CIs in grey. A-C Performance in the binary classification scenario. A Image-modality only; B Clinical-modality; C Image and clinical modalities. D-F Performance in the multi-class classification scenario. D Image-modality only; E Clinical-modality; F Image and clinical modalities. AUC, area under the curve; CI, confidence interval; ROC, receiver operating characteristic

Finally, the designed framework incorporates two interpretable methods. Specifically, the importance of each modality and the key regions of the image data are demonstrated through SHAP and the modified occlusion method, respectively. As shown in Fig. 5, the visualizations of the important imaging features correlate strongly with the clinical annotations, and the Dice values remain comparable even without the guidance of clinical masks. Certainly, improvements in model generalization and more advanced visualization techniques may further enhance the consistency between attribution maps and clinical diagnosis. Even so, the results indicate that the proposed methods effectively extract the key features needed for higher classification performance. The introduction of interpretable methods not only improves the credibility of model-based results but also better assists clinicians in diagnosis.

Our study has several limitations. First, the CT data obtained from one center may introduce a certain bias. Future multicenter validation studies are warranted to rigorously assess the model's robustness across diverse patient demographics and imaging protocols, and more advanced techniques such as federated learning [52], domain adaptation, and transfer learning need to be developed to handle multi-center data. Second, more modalities, such as liquid biopsy, can be included in future studies to obtain more comprehensive and accurate diagnostic results. Finally, only normal, benign, and malignant classification scenarios have been investigated so far; as the amount of data increases, prediction of specific pathology types may become possible in the future.

Conclusion

In this paper, an end-to-end interpretable framework for multi-modality gallbladder cancer diagnosis is proposed. First, a global-hybrid-local network, namely GHL-Net, is designed with multi-scale feature extraction and feature fusion blocks to achieve better classification performance on CT imaging. Then, the prediction results of the imaging data are merged with the laboratory data, and the final classification of gallbladder cancer is obtained through the ensemble learning method. Finally, two interpretable methods are embedded in the framework to visualize the importance of the different variables and the key regions of the imaging. On real clinical datasets, the proposed framework demonstrates better classification accuracy and generalization ability than other state-of-the-art methods. At the same time, the consistency between the visualization results and the clinical labels confirms that the proposed framework extracts the clinically important features well. The ablation studies also validated the effectiveness of each block and strategy. Future work will concentrate on multi-center generalization and classification of more pathology types. Overall, the proposed method demonstrates the great potential of machine learning and deep learning for the comprehensive diagnosis of cancer.

Supplementary Information

Supplementary Material 1 (257.1KB, docx)

Acknowledgements

Not applicable.

Abbreviations

GBC

Gallbladder cancer

US

Ultrasonography

CT

Computed tomography

MRI

Magnetic resonance imaging

CECT

Contrast-enhanced computed tomography

GHL-Net

Global-hybrid-local network

BN

Batch normalization

MSscEA

Multi-scale spatial and channel extraction and attention

sEA

Spatial extraction and attention

cE

Channel extraction

iAFF

Iterative Attentional Feature Fusion

SVM

Support vector machine

KNN

K-nearest neighbors

MLP

Multi-layer perceptron

LR

Linear Regression

SHAP

SHapley additive explanations

Acc

Accuracy

Pre

Precision

Sp

Specificity

Se

Sensitivity

AUC

Area under the receiver operating characteristic curve

ROC

Receiver operating characteristic

Authors’ contributions

W.G., Z.Y., and W.Z.: the conception and design of the study. Y.Z., Y.S., X.W., Z.Y., W.G., and Z.Y. organized the database. H.Z., C.M., and X.D.: software and statistics. Y.Z., Y.S., X.W., Z.Y., W.G., and Z.Y.: data collection. H.Z., C.M., and X.D.: data analysis. W.G., Z.Y., and W.Z.: supervision. H.Z., C.M., and Z.Y.: writing of the original manuscript. W.G. and W.Z: revised the manuscript. All authors have read and approved the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (Grant No. T2225023); Interdisciplinary Program of Shanghai Jiao Tong University (YG2024QNA19, YG2022ZD009); Xinhua Hospital Funded Clinical Research (21XHDB10).

Data availability

The data that support the findings of this study are not openly available due to reasons of sensitivity and are available from the corresponding author upon reasonable request.

Declarations

Ethics approval and consent to participate

This research was approved by the Committee for Ethics of Xinhua Hospital, Shanghai Jiao Tong University School of Medicine (SHEC-C-2022–104). Because of the retrospective nature of the study, the need for informed consent was waived by the institutional review board (Committee for Ethics of Xinhua Hospital, Shanghai Jiao Tong University School of Medicine).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Huiyu Zhao and Chuang Miao contributed equally to this work and share first authorship.

Contributor Information

Wei Gong, Email: gongwei@xinhuamed.com.cn.

Ziyi Yang, Email: yangziyi@xinhuamed.com.cn.

Weiwen Zou, Email: wzou@sjtu.edu.cn.

References

  • 1.Roa JC, García P, Kapoor VK, Maithel SK, Javle M, Koshiol J. Gallbladder cancer. Nat Rev Dis Primers. 2022;8(1):69. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Yang Z, Wu Z, Xiong Y, Liu S, Cai C, Shao Z, Zhu Y, Song X, Shen W, Wang X, Wu X. Successful conversion surgery for locally advanced gallbladder cancer after gemcitabine and nab-paclitaxel chemotherapy. Front Oncol. 2022;12: 977963. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.van Dooren M, de Reuver PR. Gallbladder polyps and the challenge of distinguishing benign lesions from cancer. United Eur Gastroenterol J. 2022;10(7):625. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Aggarwal R, Sounderajah V, Martin G, Ting DS, Karthikesalingam A, King D, Ashrafian H, Darzi A. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med. 2021;4(1):65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Mahmood T, et al. Harnessing the power of radiomics and deep learning for improved breast cancer diagnosis with multiparametric breast mammography. Expert Syst Appl. 2024;249: 123747. [Google Scholar]
  • 6.Mahmood T, et al. Alzheimer’s disease unveiled: cutting-edge multi-modal neuroimaging and computational methods for enhanced diagnosis. Biomed Signal Process Control. 2024;97: 106721. [Google Scholar]
  • 7.Chang Y, Wu Q, Chi L, Huo H, Li Q. Adoption of combined detection technology of tumor markers via deep learning algorithm in diagnosis and prognosis of gallbladder carcinoma. J Supercomput. 2022;78(3):3955–75. [Google Scholar]
  • 8.Obaid AM, Turki A, Bellaaj H, Ksantini M, AlTaee A, Alaerjan A. Detection of gallbladder disease types using deep learning: an informative medical method. Diagnostics. 2023;13(10):1744. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Basu S, Gupta M, Rana P, Gupta P, Arora C. Surpassing the human accuracy: detecting gallbladder cancer from USG images with curriculum learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20886–20896). 2022.
  • 10.Basu S, Gupta M, Rana P, Gupta P, Arora C. Radformer: transformers with global–local attention for interpretable and accurate Gallbladder Cancer detection. Med Image Anal. 2023;83: 102676. [DOI] [PubMed] [Google Scholar]
  • 11.Yin Y, Yakar D, Slangen JJ, Hoogwater FJ, Kwee TC, de Haas RJ. The value of deep learning in gallbladder lesion characterization. Diagnostics. 2023;13(4):704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kinoshita M, Ueda D, Matsumoto T, Shinkawa H, Yamamoto A, Shiba M, Okada T, Tani N, Tanaka S, Kimura K, Ohira G. Deep learning model based on contrast-enhanced computed tomography imaging to predict postoperative early recurrence after the curative resection of a solitary hepatocellular carcinoma. Cancers. 2023;15(7):2140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Yin Z, Chen T, Shu Y, Li Q, Yuan Z, Zhang Y, Xu X, Liu Y. A gallbladder cancer survival prediction model based on multimodal fusion analysis. Dig Dis Sci. 2023;68(5):1762–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Feo CF, Ginesu GC, Fancellu A, Perra T, Ninniri C, Deiana G, Scanu AM, Porcu A. Current management of incidental gallbladder cancer: a review. Int J Surg. 2022;98: 106234. [DOI] [PubMed] [Google Scholar]
  • 15.Takahama S, Kurose Y, Mukuta Y, Abe H, Fukayama M, Yoshizawa A, Kitagawa M, Harada T. Multi-stage pathological image classification using semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10702–10711). 2019.
  • 16.Kong J, He Y, Zhu X, Shao P, Xu Y, Chen Y, Coatrieux JL, Yang G. BKC-net: bi-knowledge contrastive learning for renal tumor diagnosis on 3D CT images. Knowl-Based Syst. 2022;252: 109369. [Google Scholar]
  • 17.Van der Velden BH, Kuijf HJ, Gilhuijs KG, Viergever MA. Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med Image Anal. 2022;79: 102470. [DOI] [PubMed] [Google Scholar]
  • 18.Bai X, Wang X, Liu X, Liu Q, Song J, Sebe N, Kim B. Explainable deep learning for efficient and robust pattern recognition: a survey of recent developments. Pattern Recogn. 2021;120: 108102. [Google Scholar]
  • 19.Castelvecchi D. Can we open the black box of AI? Nature. 2016;538(7623):20. [DOI] [PubMed] [Google Scholar]
  • 20.Ahamed MF, et al. Detection of various gastrointestinal tract diseases through a deep learning method with ensemble ELM and explainable AI. Expert Syst Appl. 2024;256: 124908. [Google Scholar]
  • 21.Ahamed MF, et al. Interpretable deep learning architecture for gastrointestinal disease detection: a Tri-stage approach with PCA and XAI. Comput Biol Med. 2025;185: 109503. [DOI] [PubMed] [Google Scholar]
  • 22.Rehman A, Mahmood T, Saba T. Robust kidney carcinoma prognosis and characterization using Swin-ViT and DeepLabV3+ with multi-model transfer learning. Appl Soft Comput. 2025;170: 112518. [Google Scholar]
  • 23.Ahamed MF, et al. Automated detection of colorectal polyp utilizing deep learning methods with explainable AI. IEEE Access. 2024. 10.1109/ACCESS.2024.3402818. [Google Scholar]
  • 24.Ronneberger O, Fischer P, Brox T, U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III 1 (pp. 234–241). 2015; Springer International Publishing.
  • 25.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). 2016.
  • 26.Gao SH, Cheng MM, Zhao K, Zhang XY, Yang MH, Torr P. Res2net: A new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell. 2019;43(2):652–62. [DOI] [PubMed] [Google Scholar]
  • 27.Roy AG, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’in fully convolutional networks. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16–20, 2018, Proceedings, Part I (pp. 421–429). 2018; Springer International Publishing.
  • 28.Nam JH, Syazwany NS, Kim SJ, Lee SC. Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11480–11491). 2024.
  • 29.Lu X, Suganuma M, Okatani T. Cascaded Multi-Scale Attention for Enhanced Multi-Scale Feature Extraction and Interaction with Low-Resolution Images. arXiv preprint 2024; arXiv:2412.02197.
  • 30.Omeroglu AN, Mohammed HM, Oral EA, Aydin S. A novel soft attention-based multi-modal deep learning framework for multi-label skin lesion classification. Eng Appl Artif Intell. 2023;120: 105897. [Google Scholar]
  • 31.He X, Wang Y, Zhao S, Chen X. Co-attention fusion network for multimodal skin cancer diagnosis. Pattern Recogn. 2023;133: 108990. [Google Scholar]
  • 32.Dai Y, Gieseke F, Oehmcke S, Wu Y, Barnard K. Attentional feature fusion. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 3560–3569). 2021.
  • 33.Džeroski S, Ženko B. Is combining classifiers with stacking better than selecting the best one? Mach Learn. 2004;54:255–73. [Google Scholar]
  • 34.Malmasi S, Dras M. Native language identification with classifier stacking and ensembles. Comput Linguist. 2018;44(3):403–46. [Google Scholar]
  • 35.Kang H, Kang S. A stacking ensemble classifier with handcrafted and convolutional features for wafer map pattern classification. Comput Ind. 2021;129: 103450. [Google Scholar]
  • 36.Yadav SS, Kadam VJ, Jadhav SM. Comparative analysis of ensemble classifier and single base classifier in medical disease diagnosis. In Communication and Intelligent Systems: Proceedings of ICCIS 2019 (pp. 475–489). 2020; Springer Singapore.
  • 37.Shapley LS. A value for n‐person games. Contribution to the Theory of Games, 2. 1953.
  • 38.Lundberg S. A unified approach to interpreting model predictions. arXiv preprint; 2017. arXiv:1705.07874.
  • 39.Kannangara KPM, Zhou W, Ding Z, Hong Z. Investigation of feature contribution to shield tunneling-induced settlement using Shapley additive explanations method. J Rock Mech Geotech Eng. 2022;14(4):1052–63. [Google Scholar]
  • 40.Zeiler MD. Visualizing and Understanding Convolutional Networks. In European conference on computer vision/arXiv (Vol. 1311). 2014.
  • 41.Behzadi-Khormouji H, Rostami H. Fast multi-resolution occlusion: a method for explaining and understanding deep neural networks. Appl Intell. 2021;51(4):2431–55. [Google Scholar]
  • 42.Kingma DP. Adam: A method for stochastic optimization. arXiv preprint. 2014. arXiv:1412.6980.
  • 43.Wang C, Li Y, Tsuboshita Y, Sakurai T, Goto T, Yamaguchi H, Yamashita Y, Sekiguchi A, Tachimori H. A high-generalizability machine learning framework for predicting the progression of Alzheimer’s disease using limited data. NPJ Digit Med. 2022;5(1):43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Lin Q, Guo X, Feng B, Guo J, Ni S, Dong H. A novel multi-task learning network for skin lesion classification based on multi-modal clues and label-level fusion. Comput Biol Med. 2024;175: 108549. [DOI] [PubMed] [Google Scholar]
  • 45.Zhou J, Zhang X, Zhu Z, Lan X, Fu L, Wang H, Wen H. Cohesive multi-modality feature learning and fusion for COVID-19 patient severity prediction. IEEE Trans Circuits Syst Video Technol. 2021;32(5):2535–49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21:1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Palepu J, Endo I, Chaudhari VA, Murthy GVS, Chaudhuri S, Adam R, Smith M, de Reuver PR, Lendoire J, Shrikhande SV, De Aretxabala X. ‘IHPBA-APHPBA clinical practice guidelines’: international Delphi consensus recommendations for gallbladder cancer. HPB. 2024;26(11):1311–26. [DOI] [PubMed] [Google Scholar]
  • 48.Ali AM, et al. TESR: two-stage approach for enhancement and super-resolution of remote sensing images. Remote Sens. 2023;15(9): 2346. [Google Scholar]
  • 49.Qiu S, Miller MI, Joshi PS, Lee JC, Xue C, Ni Y, Wang Y, De Anda-Duran I, Hwang PH, Cramer JA, Dwyer BC. Multimodal deep learning for Alzheimer’s disease dementia assessment. Nat Commun. 2022;13(1):3404. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Dong X, Yu Z, Cao W, Shi Y, Ma Q. A survey on ensemble learning. Front Comput Sci. 2020;14:241–58. [Google Scholar]
  • 51.Salahuddin Z, Woodruff HC, Chatterjee A, Lambin P. Transparency of deep neural networks for medical image analysis: a review of interpretability methods. Comput Biol Med. 2022;140: 105111. [DOI] [PubMed] [Google Scholar]
  • 52.Ahamed MF, et al. A review on brain tumor segmentation based on deep learning methods with federated learning techniques. Comput Med Imaging Graph. 2023;110: 102313. [DOI] [PubMed] [Google Scholar]


