Scientific Reports. 2025 Apr 14;15:12812. doi: 10.1038/s41598-025-97297-5

ALL diagnosis: can efficiency and transparency coexist? An explainable deep learning approach

Dost Muhammad 1, Muhammad Salman 2, Ayse Keles 3, Malika Bendechache 3
PMCID: PMC11997075  PMID: 40229347

Abstract

Acute Lymphoblastic Leukemia (ALL) is a life-threatening malignancy characterized by its aggressive progression and detrimental effects on the hematopoietic system. Early and accurate diagnosis is paramount to optimizing therapeutic interventions and improving clinical outcomes. This study introduces a novel diagnostic framework that synergizes the EfficientNet-B7 architecture with Explainable Artificial Intelligence (XAI) methodologies to address challenges in performance, computational efficiency, and explainability. The proposed model achieves improved diagnostic performance, with accuracies exceeding 96% on the Taleqani Hospital dataset and 95.50% on the C-NMC-19 and Multi-Cancer datasets. Rigorous evaluation across multiple metrics, including Area Under the Curve (AUC), mean Average Precision (mAP), Accuracy, Precision, Recall, and F1-score, demonstrates the model's robustness and establishes its superiority over state-of-the-art architectures, namely VGG-19, InceptionResNetV2, ResNet50, DenseNet50 and AlexNet. Furthermore, the framework significantly reduces computational overhead, achieving up to 40% faster inference times, thereby enhancing its clinical applicability. To address the opacity inherent in deep learning (DL) models, the framework integrates advanced XAI techniques, including Gradient-weighted Class Activation Mapping (Grad-CAM), Class Activation Mapping (CAM), Local Interpretable Model-Agnostic Explanations (LIME), and Integrated Gradients (IG), providing transparent and explainable insights into model predictions. This fusion of high diagnostic precision, computational efficiency, and explainability positions the proposed framework as a transformative tool for ALL diagnosis, bridging the gap between cutting-edge AI technologies and practical clinical deployment.

Keywords: Explainable artificial intelligence, XAI for medical diagnosis, Explainable medical imaging, ALL detection, Decision support system, Responsible AI

Subject terms: Computational biology and bioinformatics, Engineering

Introduction

Acute Lymphoblastic Leukemia (ALL) is a hematological malignancy characterized by the uncontrolled proliferation of abnormal white blood cells, disrupting normal hematopoietic function. This aggressive progression, compounded by its impact on the immune system, makes ALL a significant global health challenge1. Projections from the World Health Organization (WHO) estimate a nearly 50% increase in cancer incidence and mortality rates by 2040 (https://platform.who.int/mortality/themes/theme-details/topics/indicator-groups/indicator-group-details/MDB/leukaemia). This alarming statistic underscores the critical need for advancements in diagnostic methodologies to facilitate timely and accurate detection2.

Traditional diagnostic methods, while effective, suffer from inherent limitations, including their invasive nature, time-intensive processes, and susceptibility to diagnostic inaccuracies. Microscopic differentiation between normal and ALL cells is further hindered by technical issues such as illumination errors and staining artifacts3. The advent of medical imaging and computational techniques offers a non-invasive alternative; however, manual interpretation of extensive datasets by radiologists remains labor-intensive and prone to errors, particularly when subtle cellular variations are critical for diagnosis.

DL-based methodologies have emerged as transformative approaches in the diagnosis and classification of ALL. Numerous studies2,4-6 have explored the potential of various architectures, including VGG-19, InceptionResNetV2, DenseNet-121, and EfficientNet-B3, among others. While these models have demonstrated promising diagnostic performance, their adoption in clinical settings is constrained by computational inefficiencies, which result in significant processing times7,8. This computational overhead poses a critical bottleneck, particularly in time-sensitive medical scenarios where rapid decision-making is essential.

Another critical limitation of existing DL models lies in their lack of explainability. The “black-box” nature of these models often leaves patients, clinicians, and stakeholders unable to comprehend the decision-making process, thereby undermining trust and adoption9,10. Addressing this gap is crucial for fostering confidence in DL-driven diagnostic tools, ensuring their effective integration into Clinical Decision Support Systems (CDSS).

Motivated by these challenges, this study proposes a novel diagnostic framework for ALL that utilises the EfficientNet-B7 architecture, augmented with XAI techniques. The proposed framework addresses key technical gaps by delivering state-of-the-art diagnostic accuracy while significantly reducing computational time. To enhance explainability, XAI methodologies, namely Grad-CAM11, CAM12, LIME13, and IG14, are integrated, providing transparent insights into model decisions and promoting trust among healthcare practitioners.

This study aims to address the following research questions:

  • R1: How can the diagnostic accuracy of ALL prediction models be improved while maintaining computational efficiency for real-time clinical applications?

  • R2: What are the key technical challenges in integrating XAI into ALL diagnosis, and how can these be overcome to enhance model explainability?

  • R3: To what extent can XAI methodologies improve trust and transparency in DL-based diagnostic frameworks for medical practitioners and patients?

To address these research questions, this study bridges gaps in diagnostic performance, computational efficiency, and model explainability through a novel framework. The contributions of this study are summarized as follows:

  • Development of a robust diagnostic framework using EfficientNet-B7, demonstrating improved performance compared to traditional architectures, namely VGG-19, InceptionResNetV2, ResNet50, DenseNet50 and AlexNet. The framework achieves high accuracy, precision, recall, F1-score, mAP, and AUC across multiple datasets.

  • Significant reduction in computational time, improving the framework’s viability for real-time clinical applications and addressing critical bottlenecks in medical diagnostics.

  • Integration of XAI methodologies, including Grad-CAM, CAM, LIME, and IG, to provide interpretable and transparent explanations of model predictions, thereby enhancing trust and confidence in automated diagnostic systems.

This paper is structured as follows: The section on relevant studies reviews the current advancements and identifies gaps in the field. The methodology section elaborates on the proposed framework and the benchmark models used for performance comparison. The experimental results and discussion section presents key findings, highlighting the strengths of the proposed approach. Finally, the conclusion reflects on the study’s contributions, acknowledges its limitations, and outlines exciting directions for future research to build upon this work.

Relevant studies

The extant literature was reviewed for research on the detection, identification, and classification of ALL, as well as the use of XAI in this domain.

Mondal et al.2 utilised a CNN for the automated detection of ALL by exploring a weighted ensemble to enhance classification. The model was trained and validated using the C-NMC-19 dataset by incorporating various augmentation methods. A non-invasive diagnosis of ALL from images was presented by5 using VGG-16 and efficient channel attention (ECA) to distinguish cancerous and healthy cells through better feature extraction. An AI-based system was presented by15 to automate blast cell detection in microscopic images and to classify cells as either healthy or leukemic. The proposed method was tested on the ALL_IBD and CNMC-19 datasets. Kasani et al.6 proposed a hybrid model that integrates various DL networks to identify leukemic B-lymphoblasts and enhances model performance using data augmentation with transfer learning strategies. Similarly, the researchers of16 utilised EfficientNet-B3 for ALL classification using the C-NMC-19 dataset with balancing and pre-processing techniques. Liu et al.17 employed deep bagging ensemble learning (DL-based) for the classification of ALL cells using microscopic images. The authors of19 combined the vision transformer model and CNN for the classification of cancer and normal cells. The dataset was balanced using the data enhancement method (DERS).

Abir et al.4 developed a model for automatically detecting ALL by integrating various transfer learning approaches such as Inception-V3, ResNetV2, VGG-19 and InceptionResNetV2, and presented LIME as the XAI method in their research work. Another study18 explored SHAP as an XAI approach and investigated new therapeutic opportunities for histone deacetylase (HDAC) inhibitors in childhood acute leukaemia using drug repurposing approaches and molecular analysis.

Several studies have explored advanced techniques in EEG recognition, fault diagnosis, and neuroimaging-based CAD to address domain-specific challenges. DMV-MAT20 introduced a subject-specific EEG recognition framework leveraging multiview networks and module adaptation transfer to improve feature diversity and transfer performance, while STS-HGCN-AL21 utilized a spatio-temporal-spectral hierarchical graph convolutional network for patient-specific seizure prediction, incorporating active learning to adapt preictal intervals. In fault diagnosis, FEM simulation combined with ELM was applied for gear fault classification22, and a MED-based CNN23 approach was used for axial piston pump fault detection through automatic feature learning. Similarly, the PM-SMEKLM algorithm extended transfer learning (TL) to neuroimaging-based CAD24, using sparse learning and cross-domain transformations to diagnose brain diseases. However, these approaches did not incorporate explainability methods to interpret model decisions, limiting their utility in sensitive applications such as healthcare. Additionally, computational time was not reported in these studies, raising concerns about their practicality for real-time applications. Recent advancements include ensemble learning (EL) frameworks for cervical cancer25 and COVID-19 detection26, showcasing techniques like feature fusion, ConvLSTM layers, and Grad-CAM-based interpretability. However, these methods lack computational efficiency and broader explainability integration.

This study explored various architectural methodologies and evaluation criteria, with the use of EfficientNet-B7 and the integration of XAI standing out as novel contributions. While the prior literature presented in Table 1 applied models like VGG-16/19, Inception-V3, ResNetV2/50, DenseNet121, and EfficientNet-B3, our choice of EfficientNet-B7 strikes a strategic balance between accuracy and efficiency. We employed comprehensive metrics, including AUC and mAP, and utilized the C-NMC-19, Multi-Cancer and Taleqani Hospital datasets, demonstrating improved performance. Unlike other studies, we integrated XAI techniques (Grad-CAM, CAM, LIME and IG), providing enhanced model explainability and addressing computational time, which has been largely overlooked in existing work. Thus, this study sets a potential new benchmark for XAI in ALL research.

Table 1.

Comparison with extant literature.

Researchers Architecture Perf-metrics Dataset XAI integration Computational time provided
Mondal et al.2 VGG-16, InceptionResNet-V2, MobileNet, DenseNet-121 Accuracy, AUC C-NMC-19 No No
Amin et al.5 ECA-Net utilizing VGG-16 Accuracy, sensitivity, specificity C-NMC-19 No No
Khandekar et al.15 ResNet50, ResNext50, VGG-16, YOLOv4 mAP, Recall, F1-score C-NMC-19 No No
Kasani et al.6 Aggregation-based architecture of VGG19 and NASNetLarge Accuracy, precision, recall, F1-score ISBI-19 No No
El-Ghani et al.16 EfficientNet-B3 Accuracy, precision, recall, F1-score C-NMC-19 No No
Ying et al.17 Deep Ensemble Learning F1-score ISBI-19 No No
Jiang et al.19 ViT-CNN Accuracy, precision ISBI-19 No No
Abir et al.4 Inception-V3 Accuracy, F1-score ISBI-19 LIME No
Uysal et al.18 Decision Tree Regressor Accuracy, F1-score ChEMBL version SHAP No
Our proposed EfficientNet-B7 Accuracy, precision, recall, F1-score, AUC, mAP C-NMC-19, Multi-cancer, Taleqani Hospital Grad-CAM, CAM, LIME, IG Yes

Materials and methods

The proposed workflow is illustrated in Fig. 1, highlighting the key steps. It involves diagnostic screening, data collection, pre-processing, and feature extraction using EfficientNet-B7. The model undergoes compilation and training, followed by evaluation based on several metrics and computational efficiency. Finally, XAI methods like Grad-CAM, CAM, LIME, and IG are integrated to provide interpretable visual explanations, ensuring the framework is both accurate and transparent, enabling health staff, patients, and tech enthusiasts to better understand the proposed model's decisions.

Fig. 1. Workflow of the proposed methodology.

Experimental environment and implementation setup

The experiments were implemented using Python, selected for its flexibility and extensive ecosystem of DL libraries such as TensorFlow and Keras, as presented in Table 2. Model training and evaluation were conducted on a high-performance computational setup equipped with an AMD Ryzen 7 5700X Eight-Core CPU and a 16GB NVIDIA GeForce RTX 4080 GPU, ensuring efficient processing of DL workloads. The operating system utilized was Ubuntu 20.04 LTS, which provided a stable and optimized environment for executing computational tasks. To manage dependencies, Python virtual environments were created, ensuring consistency across experimental setups.

Table 2.

Implementation setup overview.

Environmental details Specifications
Operating system Ubuntu 20.04 LTS
Processor AMD Ryzen 7 5700X Eight-Core
Architecture 64-bit
GPU 16GB NVIDIA GeForce RTX 4080
Programming language Python
Framework TensorFlow, Keras
Libraries utilised Pandas, Numpy, cv2, Matplotlib, Seaborn, Scikit-learn, os

The models were trained using the Adamax optimizer with a learning rate of 0.001, batch size of 32, and early stopping to prevent overfitting. Additionally, TensorBoard was employed for real-time monitoring of the training process, allowing visualization of metrics such as accuracy and loss curves.
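A minimal sketch of this training configuration is given below, assuming `model` is the compiled network and `x_train`, `y_train`, `x_val`, `y_val` are the preprocessed splits; the early-stopping patience is an illustrative choice rather than a reported value.

```python
import tensorflow as tf

# Compile with Adamax (learning rate 0.001) and binary cross-entropy, as described above.
model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop training when the validation loss stops improving (patience is assumed).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Log accuracy and loss curves for real-time monitoring in TensorBoard.
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=30, batch_size=32, callbacks=callbacks)
```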

Description of datasets

Our study utilized three diverse datasets, each offering unique characteristics suitable for validating the proposed framework:

  • C-NMC-1927: This dataset contains cancerous and normal cells derived from real-world microscopic images, consisting of 60% Acute Lymphoblastic Leukaemia (ALL) images and 40% normal images. Noise and illumination errors were corrected using a stain color normalization technique28, enhancing image quality. The dataset's reliability is underpinned by the ground truth provided by expert oncologists, making it suitable for validating computational models.

  • Taleqani Hospital dataset29: Collected at the bone marrow laboratory of Taleqani Hospital in Tehran, this dataset includes 3,256 peripheral blood smear (PBS) images from 89 patients suspected of having ALL. The dataset is imbalanced and comprises meticulously prepared and stained images by skilled laboratory staff. The images underwent preprocessing, including resizing and normalization, to ensure compatibility with the proposed model.

  • Multi Cancer dataset30: This dataset includes 10,000 peripheral blood smear (PBS) images containing both ALL and healthy samples. Similar to the Taleqani Hospital dataset, it is imbalanced and comprises carefully prepared and stained images by professional staff. Preprocessing steps such as resizing and normalization were applied to standardize the dataset.

The datasets were divided into two subsets: training (75%) and testing (25%) to ensure sufficient data for both model training and performance evaluation. Stratified splitting was used to maintain the class distribution across the subsets, mitigating the impact of dataset imbalance and ensuring robust model evaluation.
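A minimal sketch of this stratified 75/25 split, assuming `image_paths` and `labels` are parallel lists built while loading a dataset; the random seed is illustrative.

```python
from sklearn.model_selection import train_test_split

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.25,      # 25% of the images are held out for testing
    stratify=labels,     # preserve the class distribution in both subsets
    random_state=42,     # assumed seed for reproducibility
)
```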

Preprocessing

In the preprocessing stage, the considered datasets underwent a systematic filtration process to ensure uniformity and compatibility with the proposed framework. Let $\mathcal{D}$ represent the original dataset, where each image is denoted as $x_i$ and its corresponding file extension is given by $e(x_i)$. The filtration process is formulated in Eq. (1):

$$\mathcal{D}_{\text{filtered}} = \{\, x_i \in \mathcal{D} \mid e(x_i) \in E_{\text{valid}} \,\} \tag{1}$$

where only images with valid extensions are retained, ensuring that corrupted or irrelevant files are removed. This step enhances data quality and prevents anomalies that could adversely affect model training.

Following filtration, the dataset was partitioned into training and testing subsets. Given a dataset $\mathcal{D}$ of size $N$, a split ratio $r$ was applied to allocate images into training ($\mathcal{D}_{\text{train}}$) and testing ($\mathcal{D}_{\text{test}}$) sets:

$$|\mathcal{D}_{\text{train}}| = r \cdot N, \qquad |\mathcal{D}_{\text{test}}| = (1 - r) \cdot N \tag{2}$$

where $r$ is selected to ensure sufficient training samples while maintaining a representative test set. This partitioning process, described in Eq. (2), is crucial for model generalization and robust performance evaluation.

To meet the input dimensionality requirements of Convolutional Neural Networks (CNNs)31, all images were resized to a fixed dimension of $H' \times W'$ pixels. Let an image $I$ be represented as a matrix of pixel intensities, as presented in Eq. (3):

$$I \in \mathbb{R}^{H \times W \times C} \tag{3}$$

where $H$ and $W$ denote the original height and width, respectively, and $C$ represents the number of color channels. The resizing operation is performed as follows:

$$I' = \mathrm{Resize}\big(I, (H', W')\big) \tag{4}$$

where $I'$ is the transformed image with standardized dimensions $H' \times W' \times C$. The transformation in Eq. (4) ensures uniform input size across the dataset, minimizes computational overhead, and facilitates the utilization of pre-trained CNN architectures, which typically require fixed input dimensions.

Additionally, pixel intensities were normalized to the range $[0, 1]$ to improve training stability. If $p_{i,j,c}$ represents the pixel intensity at location $(i, j)$ in the $c$-th channel, normalization is applied as:

$$\hat{p}_{i,j,c} = \frac{p_{i,j,c}}{255} \tag{5}$$

ensuring that all pixel values lie within a standard range. The normalization process, as defined in Eq. (5), prevents saturation effects and accelerates convergence during network training.

These preprocessing steps, detailed in Eqs. (1)-(5), play a critical role in optimizing dataset quality and ensuring the robustness of the proposed diagnostic framework.
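A minimal sketch of this pipeline, assuming images are read from a single folder; the valid-extension set and target size are illustrative choices rather than the exact values used in the study.

```python
import os
import cv2
import numpy as np

VALID_EXT = {".jpg", ".jpeg", ".png", ".bmp"}   # E_valid in Eq. (1), assumed
TARGET_SIZE = (224, 224)                        # (H', W') in Eq. (4), assumed

def load_and_preprocess(image_dir):
    images = []
    for fname in sorted(os.listdir(image_dir)):
        if os.path.splitext(fname)[1].lower() not in VALID_EXT:
            continue                             # Eq. (1): drop irrelevant files
        img = cv2.imread(os.path.join(image_dir, fname))
        if img is None:
            continue                             # skip unreadable/corrupted files
        img = cv2.resize(img, TARGET_SIZE)       # Eq. (4): standardize dimensions
        images.append(img.astype(np.float32) / 255.0)  # Eq. (5): scale to [0, 1]
    return np.stack(images)
```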

Proposed model

In this study, we carefully selected methods and hyperparameters based on a combination of empirical practices and established best practices to optimize performance. The EfficientNet-B7 base model32, illustrated in Fig. 2 and pre-trained on the ImageNet dataset, was utilized via the Keras framework. By leveraging pre-trained weights, the baseline model benefited from the rich features of ImageNet, enabling significant improvements in image recognition performance. EfficientNet-B7 is a convolutional neural network (CNN) with feed-forward training that progresses from the input layer to the classification layer, as shown in Eq. (6). The back-propagation of error begins at the classification phase and proceeds to the input layer, completing one pass. Information is passed from neuron $i$ of the $(l-1)$-th layer to neuron $j$ of the $l$-th layer, where $w_{ij}^{(l)}$ is the connection weight between the two neurons within the $l$-th layer33,34:

$$a_j^{(l)} = f\!\left(\sum_{i} w_{ij}^{(l)}\, a_i^{(l-1)} + b_j^{(l)}\right) \tag{6}$$

The base model was configured without the top layer, allowing for a customized output layer tailored to the specific task requirements. To facilitate feature map reduction and abstraction, we applied max pooling after the base model output. Max pooling enhanced the most salient features through spatial down-sampling, which not only reduced computational load but also ensured robustness to variations in input images. Following the initial output, batch normalization was employed to normalize inputs for subsequent layers using adjustment and activation scaling.

Fig. 2. Proposed model (EfficientNet-B7) architecture.

Additionally, to address potential class imbalances in the datasets, we utilized class weighting during model training. This approach ensured that minority classes were appropriately represented in the learning process, reducing bias in predictions. Furthermore, balanced datasets were created as part of the preprocessing stage to mitigate inherent data bias.

To further optimize performance, a dense layer with L1 and L2 regularization35,36 was employed to reduce overfitting, as illustrated in Eq. (7). The regularization parameters were carefully chosen to balance model complexity and the fidelity of the training data.

$$L_{\text{reg}} = L + \lambda \left( \sum_{i=1}^{n} |w_i| + \sum_{i=1}^{n} w_i^2 \right) \tag{7}$$

Here, $w_i$ represents the weights, $n$ is the number of parameters in the model, and $\lambda$ is the regularization strength. Techniques such as batch normalization, regularization, and dropout37, as shown in Eq. (8), were integrated to fine-tune the balance between reducing overfitting and maintaining model complexity and integrity, aligning with the unique aspects of the training dataset36.

$$y_j^{(l)} = r_j^{(l)} \cdot f\!\left(\mathbf{w}_j^{(l)} \mathbf{x} + b_j^{(l)}\right), \qquad r_j^{(l)} \sim \mathrm{Bernoulli}(p) \tag{8}$$

where $y_j^{(l)}$ denotes the output of a neuron within the $l$-th layer employing the activation function $f$, $r_j^{(l)}$ is a random variable, and $\mathbf{x}$ is the input to this neuron. The EfficientNet-B7 model concludes with an output layer integrating a dense layer with a sigmoid activation function for classification. During the model compilation phase, the Adamax optimizer was employed due to its proven effectiveness in handling embeddings and improving model optimization38. The hyper-parameter settings used for training the proposed model are summarized in Table 3, where key parameters such as dropout rate, optimizer, learning rate, batch size, number of epochs, activation function, and loss function are specified to ensure optimal performance and convergence.

Table 3.

Hyper-parameter settings for proposed model training.

Hyper-parameters Details
Dropout 0.2
Optimizer Adamax
Learning rate 0.001
Batch size 32
Epochs 30
Activation function ReLU
Loss function Binary cross-entropy
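Using the hyper-parameters in Table 3, a minimal sketch of the proposed head on top of EfficientNet-B7 could look as follows; the input resolution, dense-layer width, and regularization strengths are illustrative assumptions, not the authors' exact values.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# ImageNet-pretrained EfficientNet-B7 backbone without its classification top.
base = tf.keras.applications.EfficientNetB7(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

x = layers.GlobalMaxPooling2D()(base.output)        # max pooling for feature reduction
x = layers.BatchNormalization()(x)                  # normalize activations
x = layers.Dense(128, activation="relu",            # dense layer with L1/L2, Eq. (7)
                 kernel_regularizer=regularizers.l1_l2(l1=1e-4, l2=1e-4))(x)
x = layers.Dropout(0.2)(x)                          # dropout, Eq. (8) and Table 3
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary ALL vs. healthy output

model = tf.keras.Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
```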

Benchmark architectures

In this study, several DL architectures, namely VGG-19, InceptionResNetV2, ResNet50, DenseNet50, and AlexNet, were trained and evaluated as benchmark models. These models have been widely adopted in ALL diagnosis due to their valuable performance.

VGG-19

VGG-19 is a deep convolutional neural network (CNN) architecture introduced by the Visual Geometry Group at the University of Oxford. It comprises 19 layers, including 16 convolutional layers, three fully connected layers, and a softmax classifier. The design of VGG-19 emphasizes simplicity by using fixed small $3 \times 3$ convolutional filters across all layers. This approach allows the model to capture spatial hierarchies while increasing depth.

The working mechanism of VGG-19 includes convolutional operations to extract feature representations, max pooling layers to reduce spatial dimensions, and fully connected layers to map features to class probabilities. Mathematically, the convolution operation for generating a feature map $F$ can be expressed as shown in Eq. (9):

$$F(i, j) = (I * K)(i, j) = \sum_{m}\sum_{n} I(i + m,\, j + n)\, K(m, n) \tag{9}$$

where $K$ represents the convolutional kernel, $I$ is the input image, and $F$ is the resulting feature map.
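A small self-contained sketch of the convolution in Eq. (9) (valid padding, unit stride), included only to make the feature-map computation concrete; the input and kernel values are toy data.

```python
import numpy as np

def conv2d(I, K):
    """Compute F(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n), as in Eq. (9)."""
    h, w = K.shape
    H, W = I.shape
    F = np.zeros((H - h + 1, W - w + 1))
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            F[i, j] = np.sum(I[i:i + h, j:j + w] * K)
    return F

I = np.random.rand(8, 8)     # toy input "image"
K = np.ones((3, 3)) / 9.0    # a 3x3 kernel, the filter size used throughout VGG-19
print(conv2d(I, K).shape)    # (6, 6) feature map
```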

InceptionResNetV2

InceptionResNetV2 combines the strengths of Inception modules and residual connections, achieving efficient multi-scale feature extraction while mitigating the vanishing gradient problem. The Inception module applies parallel convolution operations of varying kernel sizes, such as $1 \times 1$, $3 \times 3$, and $5 \times 5$, allowing the network to capture features at different spatial scales. Residual connections further enhance this architecture by introducing shortcut paths that bypass layers, ensuring gradient stability during backpropagation.

The mathematical formulation for a residual block in this architecture is given by:

$$y = \mathcal{F}(x) + x \tag{10}$$

where $\mathcal{F}(x)$ represents the transformations applied by the block, and $x$ is the residual connection. As shown in Eq. (10), the output $y$ is obtained by summing the transformed input and the original input, ensuring that the network can learn both identity mappings and complex transformations effectively.

ResNet50

ResNet50 is another widely used architecture designed to address the degradation problem in very deep networks. This model features 50 layers and employs residual learning through identity mappings. These residual connections allow input features to flow directly to later layers, enabling efficient gradient propagation and improving optimization.

A single residual block in ResNet50 can be expressed as:

$$y = \sigma\!\left(W_2\, \sigma(W_1 x + b) + b\right) + x \tag{11}$$

where $W_1$ and $W_2$ are the weights of the convolutional layers, $b$ is the bias, and $\sigma$ denotes the activation function (e.g., ReLU). The residual connection, represented by the term $x$, facilitates the flow of information and gradients through the network by directly adding the input to the output of the transformed block.

The introduction of batch normalization within ResNet50 further stabilizes training by normalizing intermediate layer activations, thereby enhancing the model's robustness and convergence properties, as shown in Eq. (11).
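A minimal Keras sketch of the residual block in Eqs. (10)-(11); the filter count and input shape are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                      # identity path
    y = layers.Conv2D(filters, 3, padding="same")(x)  # W1 x + b
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)                  # sigma(.)
    y = layers.Conv2D(filters, 3, padding="same")(y)  # W2 (.) + b
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                   # y = F(x) + x, Eq. (10)
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 32, 32, 64)
```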

DenseNet50

DenseNet50 is a CNN architecture that excels in medical image classification due to its efficient parameter usage and dense connectivity. Each layer in DenseNet50 connects to all subsequent layers within a dense block, enabling effective feature propagation and reuse. The input to the $\ell$-th layer is defined as:

$$x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big) \tag{12}$$

where $H_\ell$ includes batch normalization (BN), ReLU, and convolution, and $[\cdot]$ denotes concatenation. Dense blocks extract hierarchical features, with a growth rate $k$ determining the number of new features added per layer. The total number of output features at layer $\ell$ is given by:

$$k_\ell = k_0 + k \cdot \ell \tag{13}$$

where $k_0$ is the initial feature count. Transition layers are used between dense blocks to reduce the feature dimensions. These layers combine $1 \times 1$ convolutions and average pooling, as expressed in:

$$T(x) = P_{\text{avg}}\big(W_t\, x + b_t\big) \tag{14}$$

where $W_t$ and $b_t$ represent the weights and biases of the $1 \times 1$ convolution, and $P_{\text{avg}}$ denotes the average pooling operation.

Global Average Pooling (GAP) reduces the spatial dimensions of the feature maps to a feature vector, which is passed through a softmax function to compute class probabilities. This process is defined as:

$$p_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}} \tag{15}$$

where $z_c$ is the logit for class $c$, and $C$ is the total number of classes. During training, DenseNet50 minimizes the cross-entropy loss function, defined as:

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log(p_c) \tag{16}$$

where $y_c$ is the ground truth label.

Equation 12 highlights how each layer reuses the features from preceding layers, ensuring effective feature propagation. The growth rate of features, shown in Equation 13, demonstrates the model’s ability to incrementally build on extracted features. The dimensionality reduction in transition layers, described in Equation 14, ensures computational efficiency. Finally, Equations 15 and 16 describe the probability computation and loss function, respectively, which are central to the classification task.
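A minimal sketch of a dense block following Eq. (12), where each layer applies H_l (BN, ReLU, convolution) to the concatenation of all preceding feature maps; the growth rate and block depth are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    features = [x]
    for _ in range(num_layers):
        # Concatenate all preceding feature maps, Eq. (12).
        y = layers.Concatenate()(features) if len(features) > 1 else features[0]
        y = layers.BatchNormalization()(y)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)  # adds k new maps, Eq. (13)
        features.append(y)
    return layers.Concatenate()(features)

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = dense_block(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)  # (None, 32, 32, 16 + 4*12)
```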

AlexNet

AlexNet is a deep CNN designed for efficient image classification, making it suitable for medical imaging tasks. Input images, typically of size $227 \times 227 \times 3$, are normalized to ensure faster convergence during training. Feature extraction is performed in the convolutional layers using filters $W$, defined as:

$$F(i, j) = \sum_{m}\sum_{n} I(i + m,\, j + n)\, W(m, n) + b \tag{17}$$

where $F(i, j)$ represents the output at position $(i, j)$, and $b$ is the bias term. The convolutional output is passed through a rectified linear unit (ReLU) activation function to introduce non-linearity, expressed as:

$$f(x) = \max(0, x) \tag{18}$$

Max pooling is applied to reduce the spatial dimensions of the feature maps while retaining important features. If the pooling kernel size is $k \times k$, the pooling operation is given by:

$$P(i, j) = \max_{0 \le m, n < k} F(i + m,\, j + n) \tag{19}$$

The extracted features are flattened into a one-dimensional vector and passed through fully connected layers. The output of the fully connected layer is calculated as:

$$y = W x + b \tag{20}$$

where $W$ and $b$ are the weights and biases, and $x$ is the input vector. The final probabilities for each class are computed using the softmax function:

$$p_c = \frac{e^{z_c}}{\sum_{j=1}^{C} e^{z_j}} \tag{21}$$

where $z_c$ is the logit corresponding to class $c$, and $C$ is the number of classes. During training, AlexNet minimizes the cross-entropy loss:

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log(p_c) \tag{22}$$

where $y_c$ is the ground truth label.

Equation 17 describes the feature extraction process through convolutional filters, while Equation 18 introduces non-linearity to the network. Equation 19 demonstrates how max pooling reduces spatial dimensions. The fully connected layers and their computation are detailed in Equation 20, with the softmax function and cross-entropy loss provided in Equations 21 and 22, respectively.

AlexNet’s ability to extract hierarchical features and its effective regularization techniques, such as dropout, make it a robust model for medical image classification tasks.

Integration of XAI

EfficientNet-B7, with its complex architecture, extracts intricate features from the input cancerous images that are essential for high accuracy in ALL identification and classification. However, due to the deep architecture of this model, understanding the decision process is challenging for the non-technical community. In response to this gap, we employed XAI approaches, namely LIME39, Grad-CAM40, CAM12 and IG14, to provide visual explanations and interpretations of the image features influencing the model's prediction.

Grad-CAM

Grad-CAM utilises the gradients flowing into the last convolutional layer (top_activation) to generate a heatmap, illustrated in Eq. (23), that illuminates the critical regions within the image which affect the model's decision.

$$L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^{c}\, A^{k}\right) \tag{23}$$

Equation (23) represents the localization map for class $c$, where $L^{c}_{\text{Grad-CAM}}$ highlights important regions, $\mathrm{ReLU}$ ensures non-negative output, $\alpha_k^{c}$ indicates the importance of feature map $k$ for class $c$, and $A^{k}$ are the feature maps from the convolutional layers.
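A minimal sketch of this computation, assuming `model` is the trained Keras classifier with a sigmoid ALL output and `image` is a preprocessed H' x W' x 3 array; `last_conv` names the final convolutional layer (top_activation in EfficientNet-B7).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv="top_activation"):
    # Model that returns both the last convolutional feature maps A^k and the prediction.
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(last_conv).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]                        # predicted ALL score y^c
    grads = tape.gradient(score, conv_maps)        # d y^c / d A^k
    alphas = tf.reduce_mean(grads, axis=(1, 2))    # alpha_k^c: global-average-pooled gradients
    cam = tf.nn.relu(tf.reduce_sum(
        alphas[:, tf.newaxis, tf.newaxis, :] * conv_maps, axis=-1))  # Eq. (23)
    cam = cam[0] / (tf.reduce_max(cam) + 1e-8)     # normalize to [0, 1] for display
    return cam.numpy()
```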

CAM

CAM12 provides a method to visually interpret a convolutional neural network’s decisions by highlighting regions of an image that are critical for a specific class prediction. CAM achieves this by leveraging the global average pooling layer to directly connect feature maps from the last convolutional layer to the final output. The importance of these feature maps is determined by the weights of the fully connected layer.

The localization map for class $c$, denoted as $M_c$, is mathematically represented as:

$$M_c(x, y) = \sum_{k} w_k^{c}\, f_k(x, y) \tag{24}$$

where $w_k^{c}$ represents the weight of the fully connected layer corresponding to feature map $k$ for class $c$, and $f_k(x, y)$ denotes the activation at spatial location $(x, y)$ of the $k$-th feature map.

Equation (24) highlights the regions of the image that contribute most to the network's prediction for class $c$. By summing the weighted activations, CAM generates a class-specific heatmap that provides interpretability to convolutional neural network predictions, helping identify critical spatial regions in the input image.

LIME

LIME provides an explanation by perturbing the input images, observing the changes in the model prediction, and pinpointing the image features that substantially impact the model's prediction, as shown in Eq. (25)41.

$$\xi(x) = \operatorname*{arg\,min}_{g \in G}\; \mathcal{L}(f, g, \pi_x) + \Omega(g) \tag{25}$$

In Eq. (25), $\mathcal{L}(f, g, \pi_x)$ measures the difference between the original model $f$ and the interpretable model $g$, with $\pi_x$ as the proximity measure that weights the perturbed samples $z$. $\Omega(g)$ is the complexity penalty for $g$.
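A minimal sketch of applying LIME to one test image, assuming `model` is the trained classifier with a sigmoid ALL output and `image` is a preprocessed array; the perturbation budget and number of superpixels shown are illustrative.

```python
import numpy as np
from lime import lime_image

def predict_fn(batch):
    # LIME expects per-class probabilities for each perturbed sample; for a
    # sigmoid output we return [P(healthy), P(ALL)].
    p_all = model.predict(batch).reshape(-1, 1)
    return np.hstack([1.0 - p_all, p_all])

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image.astype("double"), predict_fn,
                                         top_labels=1, hide_color=0,
                                         num_samples=1000)
overlay, mask = explanation.get_image_and_mask(explanation.top_labels[0],
                                               positive_only=True, num_features=5)
```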

IG

IG is an XAI approach that attributes the model's prediction to its input features (notably pixels for images) by integrating the output gradients along a path from a baseline to the actual image, thereby highlighting the role of an individual pixel in the image analysis, as shown in Eq. (26).

$$\mathrm{IG}_i(x) = (x_i - x_i')\int_{0}^{1} \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha \tag{26}$$

Here, $\mathrm{IG}_i(x)$ is the integrated gradient for feature $i$, $x$ is the actual input, $x'$ is the baseline input, and $\alpha$ scales the difference between the inputs.
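A minimal sketch of this attribution, approximating the integral in Eq. (26) with a trapezoidal sum over interpolated inputs between a black baseline and the actual image; `model` is assumed to be the trained classifier and `image` a preprocessed float array.

```python
import tensorflow as tf

def integrated_gradients(model, image, steps=50):
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    baseline = tf.zeros_like(image)                    # x': all-black baseline image
    alphas = tf.linspace(0.0, 1.0, steps + 1)
    # Build the interpolation path x' + alpha * (x - x') for all alpha values.
    interpolated = baseline[tf.newaxis] + \
        alphas[:, tf.newaxis, tf.newaxis, tf.newaxis] * (image - baseline)[tf.newaxis]
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        preds = model(interpolated)[:, 0]              # ALL score along the path
    grads = tape.gradient(preds, interpolated)         # dF/dx_i at each alpha
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)  # trapezoidal rule
    return (image - baseline) * avg_grads              # (x_i - x_i') * integral, Eq. (26)
```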

Evaluation metrics

We utilized a series of evaluation metrics37 to gauge and benchmark the performance of the proposed model. Accuracy, Precision, Recall and F1-score stem from the confusion matrix. In addition, the area under the curve (AUC) and mean average precision (mAP) are well-known performance evaluation metrics for AI models42. The confusion matrix, a fundamental tool in performance assessment, consists of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), which are defined within the context of this study as follows (Fig. 3), with a brief computational sketch given after the definitions:

  • True positive (TP): Instances where the model correctly identifies an image as ALL, and the corresponding ground truth label confirms the presence of ALL. This indicates a successful classification of a pathological case where malignant lymphoblasts, characterized by their large nuclei and abnormal morphology, proliferate uncontrollably, disrupting normal hematopoiesis.

  • True negative (TN): Cases where the model accurately predicts an image as Normal (Healthy), and the actual ground truth also corresponds to a healthy subject. This represents a correct exclusion of disease presence, ensuring that the model does not mistakenly detect pathology in normal blood samples.

  • False positive (FP): Instances where the model incorrectly classifies a Normal (Healthy) image as ALL, despite the ground truth indicating the absence of leukemia. This type of misclassification contributes to an increased false alarm rate, potentially leading to unnecessary clinical interventions, psychological distress for patients, and additional diagnostic testing.

  • False negative (FN): Cases where the model fails to detect ALL and instead classifies an image as Normal (Healthy), despite the ground truth confirming the presence of leukemia. This is the most critical misclassification type, as it poses significant clinical risks by failing to identify diseased cases, potentially delaying necessary medical treatment. In ALL cases, the presence of dysfunctional white blood cells (lymphoblasts) disrupts the normal immune function and blood cell production, leading to severe complications if left undiagnosed.
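A minimal sketch of deriving these metrics from the confusion-matrix counts, assuming `y_true` holds the ground-truth labels (1 = ALL, 0 = healthy) and `y_score` the model's predicted ALL probabilities for the test set.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score,
                             average_precision_score)

y_pred = (np.asarray(y_score) >= 0.5).astype(int)          # threshold the sigmoid outputs
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # TN, FP, FN, TP as defined above

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "auc":       roc_auc_score(y_true, y_score),            # uses probabilities, not labels
    "map":       average_precision_score(y_true, y_score),  # mAP for the binary task
}
```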

Fig. 3. Comparison of the confusion matrices for the proposed model across different datasets: (A) Taleqani Hospital dataset, (B) C-NMC-19 dataset, and (C) Multi-Cancer dataset. Each subfigure illustrates the model's classification performance on the respective dataset.

Results and discussion

In this study, we employed EfficientNet-B7, a state-of-the-art CNN architecture, for the vital task of identifying and classifying ALL from microscopic images. Following the model's training and validation phase, we integrated XAI methodologies to explain the model's prediction decisions.

ALL diagnosis

Case study: Taleqani Hospital dataset

The evaluation metrics for diagnosing ALL using the Taleqani Hospital dataset are summarised in Table 4. These metrics include Accuracy, Precision, Recall, F1-score, Area Under the Curve (AUC), mean Average Precision (mAP), and computational time (in seconds), providing a thorough comparison of the benchmark models (VGG-19, InceptionResNetV2, ResNet50, DenseNet50, and AlexNet) with the proposed model.

Table 4.

Performance comparison of different models for ALL diagnosis on the Taleqani Hospital dataset (Total images: 3256).

Model Accuracy (%) Precision (%) Recall (%) F1-score (%) AUC (%) mAP (%) Computational time (s)
VGG-19 89.56 89.51 89.56 89.50 90.01 89.04 89.9
InceptionResNetV2 71.74 84.71 71.74 69.46 72.83 71.85 108.48
ResNet50 89.80 91.66 89.80 89.08 88.90 89.45 121.28
DenseNet50 86.41 82.66 84.13 86.50 88.17 88.15 129.27
AlexNet 84.71 83.16 84.13 86.50 88.17 84.19 121.33
Our proposed 96.78 97.00 96.25 96.57 99.85 99.56 59.28

VGG-19 achieved an accuracy of 89.56%, with Precision (89.51%) and Recall (89.56%) demonstrating balanced diagnostic performance. The F1-Score of 89.5% and AUC of 90.01% reflect a strong ability to differentiate classes, with an mAP of 89.04% indicating good overall performance. However, its computational time of 89.9 seconds highlights moderate efficiency, which could be a limiting factor for rapid diagnosis.

InceptionResNetV2 delivered suboptimal results, with an accuracy of 71.74% and an F1-Score of 69.46%, despite a relatively high Precision of 84.71%. Its Recall of 71.74% suggests a struggle in identifying all positive leukemia cases, and the AUC (72.83%) and mAP (71.85%) further underscore its limitations. Additionally, its computational time of 108.48 seconds makes it less suitable for clinical applications requiring timely results.

ResNet50 demonstrated better performance than VGG-19, achieving an accuracy of 89.8%, with Precision (91.66%) being the highest among the benchmark models. Its Recall (89.8%) and F1-Score (89.08%) indicate consistent diagnostic reliability, while the AUC (88.9%) and mAP (89.45%) reflect competitive class differentiation. However, with a computational time of 121.28 seconds, it is the least efficient model, presenting a trade-off between accuracy and computational cost.

DenseNet50 provides balanced performance across the evaluated metrics, achieving an accuracy of 86.41%. Its precision of 82.66% and recall of 84.13% result in an F1-Score of 86.50%. The AUC of 88.17% demonstrates reasonable discriminatory power. However, DenseNet50's computational time of 129.27 seconds is the highest among all the tested models, highlighting its computational inefficiency despite its balanced performance.

AlexNet, while being a foundational architecture, demonstrates modest performance in this study. It achieves an accuracy of 84.71%, with a precision of 83.16% and a recall of 84.13%, resulting in an F1-Score of 86.50%. The AUC of 88.17% is comparable to DenseNet50. However, its computational time of 121.33 seconds is high, indicating that while AlexNet provides satisfactory performance, it is not optimal for real-time or large-scale clinical applications.

The proposed model outperformed all benchmark models in every evaluated metric. It achieved an accuracy of 96.78%, with Precision (97%), Recall (96.25%), and an F1-Score (96.57%), demonstrating its ability to accurately and reliably diagnose ALL. Furthermore, the near-perfect AUC of 99.85% and mAP of 99.56% indicate exceptional class differentiation and prediction capability. Importantly, the proposed model required only 59.28 seconds for computation, making it the most efficient and effective solution for ALL diagnosis. These results highlight the superiority of the proposed model, particularly in its ability to combine high diagnostic accuracy with computational efficiency, making it suitable for clinical deployment.

The results in Table 4, combined with the Receiver Operating Characteristic (ROC) curves shown in Fig. 4, clearly illustrate the potential of the proposed model as a robust and efficient diagnostic tool for ALL. The ROC curves, representing the classification performance for healthy and cancerous images, demonstrate the model’s strong ability to distinguish between the two classes, with high true positive rates significantly outperforming the random baseline. These findings highlight the model’s capability to meet the stringent demands of clinical settings with high reliability and precision.

Fig. 4. AUC-ROC curves for healthy (red) and cancerous (black) image classification in the diagnosis of ALL using the Taleqani Hospital dataset.

The training and testing performance curves, as depicted in Fig. 5, further reinforce these results by demonstrating the model's stable convergence, minimal overfitting, and strong classification ability.

Fig. 9. Performance evaluation of the proposed model on the Multi-Cancer dataset over 30 epochs. The training loss (black) rapidly decreases, while the testing loss (red) remains stable at a low value, demonstrating effective learning and generalization. The training accuracy (green) shows a continuous upward trend, approaching 100%, while the testing accuracy (yellow) improves consistently, stabilizing above 95%. These trends indicate the model's robustness in classifying ALL with high precision and reliability.

Fig. 5. Performance evaluation of the proposed model on the Taleqani Hospital dataset over 30 epochs. The training loss (black) decreases rapidly, while the testing loss (red) remains stable, indicating effective generalization. The training accuracy (green) and testing accuracy (yellow) show consistent improvement, with testing accuracy reaching 96%.

Case study: C-NMC-19 dataset

Table 5 provides the performance evaluation of different DL models for diagnosing ALL using the C-NMC-19 dataset. The VGG-19 model delivered an accuracy of 87.3%, with a Precision of 87.54% and Recall of 87.3%. The F1-Score of 86.92% highlights a relatively balanced performance. The AUC value of 88.02% and mAP of 87.11% indicate strong predictive capability. However, the computational time of 1294.82 seconds makes it the least efficient model, suggesting that it may not be optimal for time-sensitive applications.

Table 5.

Performance comparison of different models for ALL diagnosis on the C-NMC-19 dataset (Total images: 4137).

Model Accuracy (%) Precision (%) Recall (%) F1-score (%) AUC (%) mAP (%) Computational time (s)
VGG-19 87.30 87.54 87.30 86.92 88.02 87.11 1294.82
InceptionResNetV2 65.01 61.17 65.01 64.34 65.11 65.77 1081.24
ResNet50 86.09 87.76 86.09 85.21 86.40 86.33 1091.64
DenseNet50 85.11 82.56 86.42 82.05 83.96 86.33 1129.39
AlexNet 85.32 85.37 85.32 84.90 83.17 86.44 1098.39
Our proposed 97.13 99.32 92.42 95.75 96.04 94.45 882.88

InceptionResNetV2 showed a significantly lower performance, with an accuracy of 65.01%, a Precision of 61.17%, and a Recall of 65.01%. Its F1-Score of 64.34%, AUC of 65.11%, and mAP of 65.77% suggest limited effectiveness in handling this dataset. Although it required less computational time than VGG-19, taking 1081.24 seconds, its lower performance metrics indicate it is less suitable for the given diagnostic task.

ResNet50 demonstrated slightly better performance compared to VGG-19 in terms of Precision, achieving 87.76%, along with an accuracy of 86.09% and a Recall of 86.09%. Its F1-Score of 85.21%, AUC of 86.4%, and mAP of 86.33% indicate a strong ability to classify correctly. However, it required 1091.64 seconds of computational time, which, while better than VGG-19, remains resource-intensive.

DenseNet50 achieves an accuracy of 85.11%, with precision and recall values of 82.56% and 86.42%, respectively. The F1-Score of 82.05% reflects a slightly lower ability to balance predictions compared to ResNet50. The model attains an AUC score of 83.96% and an mAP score of 86.33%, showcasing moderate performance. However, DenseNet50 requires a high computational time of 1,129.39 seconds, highlighting its inefficiency despite its reasonable performance metrics.

AlexNet achieves an accuracy of 85.32%, with precision and recall of 85.37% and 85.32%, respectively, leading to an F1-Score of 84.90%. The AUC score of 83.17% is similar to that of DenseNet50, while the mAP score of 86.44% demonstrates its moderate effectiveness in classification tasks. Its computational time of 1,098.39 seconds is slightly lower than DenseNet50 but remains higher than ResNet50.

The proposed model outperformed all the benchmark models across all performance metrics. It achieved the highest accuracy of 97.13%, with a Precision of 99.32% and a Recall of 92.42%, resulting in a robust F1-Score of 95.75%. Its AUC value of 96.04% and mAP of 94.45% highlight its superior ability to differentiate between classes accurately. Additionally, the computational time of 882.88 seconds is the lowest among the models, demonstrating remarkable efficiency. These results, along with Figs. 6 and 7, underscore the proposed model’s potential to be an exceptional tool for ALL diagnosis on the C-NMC-19 dataset, combining high accuracy with computational efficiency.

Fig. 6. AUC-ROC curves for healthy (orange) and cancerous (black) image classification in the diagnosis of ALL using the C-NMC-19 dataset.

Fig. 7. Performance evaluation of the proposed model on the C-NMC-19 dataset over 30 epochs. The training loss (black) rapidly decreases, while the testing loss (red) remains stable at a low value, indicating effective generalization. The training accuracy (green) steadily increases, nearing 100%, while the testing accuracy (yellow) also improves consistently, stabilizing above 97%. These trends highlight the model's strong learning capability and robust classification performance for ALL detection using the C-NMC-19 dataset.

Case study: Multi-Cancer dataset

Table 6 presents the performance evaluation of various DL models for ALL diagnosis using the Multi-Cancer dataset. The VGG-19 model achieved an accuracy of 82.21% with a Precision of 83.91% and Recall of 82.32%. The F1-Score of 87.89% reflects a balanced performance, supported by an AUC of 84.41% and an mAP of 81.92%. However, its computational time of 1035.31 seconds indicates moderate efficiency compared to the other models.

Table 6.

Performance comparison of different models for ALL diagnosis on the Multi cancer dataset (total images: 10000).

Model Accuracy (%) Precision (%) Recall (%) F1-score (%) AUC (%) mAP (%) Computational time (s)
VGG-19 82.21 83.91 82.32 87.89 84.41 81.92 1035.31
InceptionResNetV2 79.56 80.43 79.34 80.91 79.02 79.41 1021.03
ResNet50 85.78 84.69 85.54 83.22 85.83 84.77 1231.11
DenseNet50 85.78 84.69 83.74 81.32 84.13 84.17 1301.23
AlexNet 83.18 84.19 82.34 83.22 83.17 84.17 1251.23
Our proposed 95.86 96.11 95.61 95.82 96.72 96.23 912.13

InceptionResNetV2 showed a relatively lower performance with an accuracy of 79.56%, Precision of 80.43%, and Recall of 79.34%. Its F1-Score stood at 80.91%, with an AUC of 79.02% and an mAP of 79.41%. While its computational time of 1021.03 seconds is lower than VGG-19, its overall performance metrics are comparatively less favorable.

ResNet50 demonstrated improved accuracy at 85.78%, along with a Precision of 84.69% and Recall of 85.54%. Its F1-Score of 83.22% and AUC of 85.83% highlight its capability in classification tasks, supported by an mAP of 84.77%. However, its computational time of 1231.11 seconds reflects a significant trade-off between accuracy and efficiency.

DenseNet50 matches ResNet50 in accuracy at 85.78%, with a precision of 84.69% and a recall of 83.74%. The F1-Score of 81.32% and AUC score of 84.13% are slightly lower than ResNet50, while the mAP score of 84.17% reflects moderate overall performance. However, DenseNet50 requires 1,301.23 seconds for processing, the highest among all models, indicating significant computational inefficiency.

AlexNet achieves an accuracy of 83.18%, with a precision of 84.19% and a recall of 82.34%. This results in an F1-Score of 83.22%. Its AUC score of 83.17% is comparable to DenseNet50, while its mAP score of 84.17% demonstrates moderate effectiveness in classification tasks. However, AlexNet’s computational time of 1,251.23 seconds is among the highest, highlighting its inefficiency for large-scale diagnostic applications.

The proposed model exhibited outstanding performance, surpassing all other models across all evaluation metrics. It achieved an accuracy of 95.86%, Precision of 96.11%, and Recall of 95.61%, leading to a highly balanced F1-Score of 95.82%. The AUC of 96.72% and mAP of 96.23% indicate exceptional class discrimination and prediction capability. Additionally, the computational time of only 912.13 seconds demonstrates its superior efficiency, making it suitable for practical clinical applications. This result, along with Figs. 8 and 9, positions the proposed model as not only the most accurate but also the most efficient, showcasing its potential as an excellent diagnostic tool for ALL detection using the Multi-Cancer dataset.

Fig. 8. AUC-ROC curves for healthy (orange) and cancerous (black) image classification in the diagnosis of ALL using the Multi-Cancer dataset.

The improved performance of our proposed model, based on EfficientNet-B7, across all three datasets (Taleqani Hospital dataset, C-NMC-19 dataset, and Multi-Cancer dataset) can be attributed to its architectural efficiency, superior feature extraction capabilities, and optimized computational design. EfficientNet-B7 employs a compound scaling technique that balances network depth, width, and resolution to maximize feature extraction while minimizing computational costs. This scalability allows the model to capture fine-grained details crucial for microscopic image classification tasks. Unlike models such as VGG-19, which rely on fixed architectures, EfficientNet-B7 dynamically adjusts its design to ensure efficient feature utilization, explaining the consistently higher accuracy, precision, recall, and F1-scores achieved compared to other models.

EfficientNet-B7 utilises depthwise separable convolutions and squeeze-and-excitation blocks, which enhance its ability to focus on the most relevant features within ALL images. This attention mechanism gives it a distinct advantage over architectures like ResNet50 and DenseNet50, which lack explicit feature recalibration techniques. The improved AUC scores achieved by our proposed model across datasets highlight its ability to effectively distinguish between diagnostic labels (Healthy and ALL). Additionally, the proposed model demonstrates remarkable generalization across datasets of varying sizes and complexities, consistently outperforming benchmark architectures. While models such as InceptionResNetV2 and AlexNet show limited generalization, evidenced by their lower accuracy and AUC scores, EfficientNet-B7 maintains robust performance, showcasing its adaptability to diverse data distributions, a critical factor in medical diagnostics.

Our model achieves superior performance while being computationally efficient, with significantly lower computational times compared to DenseNet50, ResNet50, and AlexNet. This efficiency is due to its optimized architecture, which reduces redundant computations and uses lightweight operations like depthwise separable convolutions. For instance, on the Taleqani Hospital dataset, the proposed model processes images in just 59.28 seconds, compared to 129.27 seconds for DenseNet50 and 121.33 seconds for AlexNet, making it highly suitable for real-time applications in clinical settings. Furthermore, the consistently high mAP scores across datasets, such as 99.56% on the Taleqani Hospital dataset and 96.23% on the Multi-Cancer dataset, highlight the model's effectiveness in capturing both localization and classification accuracy. The F1-scores, which balance precision and recall, are also consistently higher, indicating the model's ability to handle imbalanced datasets effectively. In contrast, baseline models such as InceptionResNetV2 exhibit significant challenges with imbalanced data, resulting in lower F1-scores.

Prediction explanation

XAI approaches, namely Grad-CAM, CAM, LIME, and IG, were applied to the proposed model to explain its predictive decisions and enhance the trust of medical professionals and patients. Comparative visualizations of these techniques alongside the original images are illustrated in Figs. 10, 11 and 12.

Fig. 10. Performance comparison of XAI techniques on the C-NMC-19 dataset for ALL diagnosis. (A) Original image. (B) Grad-CAM (best performer) effectively highlights diagnostically relevant regions. (C) CAM (second-best) identifies key areas with slightly less precision. (D) LIME (moderate performance) highlights localized regions but lacks clarity. (E) IG (weak performer) provides dispersed and less relevant explanations. This demonstrates the superiority of Grad-CAM for reliable explainability.

Fig. 11. Performance comparison of XAI methods on the Taleqani Hospital dataset for ALL diagnosis. (F) Original image. (G) Grad-CAM (best performer) effectively highlights diagnostically significant regions with high clarity. (H) CAM (second-best) identifies relevant areas but with slightly reduced focus compared to Grad-CAM. (I) LIME (moderate performance) localizes some important regions but lacks precision. (J) IG (weak performer) provides dispersed and less informative explanations. This comparison underscores the reliability and superior explainability of Grad-CAM in this dataset.

Fig. 12. Performance comparison of XAI approaches on the Multi-Cancer dataset for ALL diagnosis. (K) Original image. (L) Grad-CAM (best performer) highlights diagnostically important regions with superior clarity and focus. (M) CAM (second-best) identifies relevant regions, though with slightly less precision than Grad-CAM. (N) LIME (moderate performance) captures some critical areas but exhibits lower accuracy. (O) IG (weak performer) provides dispersed and less relevant explanations, making it the least reliable method. This comparison demonstrates the consistent superiority of Grad-CAM across datasets.

The Grad-CAM-generated heatmaps in Figs. 10B, 11G and 12L offer a visual representation of the areas within the original ALL image that contribute most to the model's prediction. The process involves extracting the gradient values from the last convolutional layer (top_activation) with respect to the ALL class, which indicates the non-functional white cells in the leukaemia imagery. According to heatmap (B) in Fig. 10, the red areas indicate the highest contribution, whereas the areas in black contribute the least. Similarly, the heatmap (G) shown in Fig. 11 and (L) shown in Fig. 12 illustrate that the areas marked in white are the highest contributors to the model's decision-making process, whereas areas in other colors contribute less significantly.

The Class Activation Maps (CAM) shown in Figs. 10C, 11H, and 12M visually highlight the regions within the original ALL images that are most relevant to the model’s predictions. CAM works by utilizing the weighted activations of the final convolutional layer, mapping these activations back to the input space to identify areas of high importance for specific class predictions.

In the process, the feature maps are scaled by their associated weights from the fully connected layer, emphasizing the regions most associated with the model’s classification of ALL cancer. The heatmap in Fig. 10C demonstrates that the red regions signify the highest relevance to the prediction, while the black regions are less impactful. Similarly, the heatmaps in Figs. 11H and 12M indicate that white areas contribute the most to the decision-making process, whereas other colors signify regions with lower relevance. These CAM heatmaps offer valuable insights into the spatial features that the model relies on, thereby enhancing interpretability and supporting clinical decision-making in diagnosing ALL cancer.
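A corresponding CAM sketch is shown below. It assumes the classifier head consists of a global average pooling layer followed by a single dense layer; the layer name and indices are illustrative assumptions, not details taken from the paper’s implementation.

import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer="top_activation", class_index=0):
    # Minimal CAM sketch: weight the final feature maps by the dense-layer
    # weights of the chosen class (assumes GAP + single Dense head).
    feature_model = tf.keras.Model(model.inputs, model.get_layer(conv_layer).output)
    feature_maps = feature_model(image)[0]                    # (H', W', C)
    dense = model.layers[-1]                                  # final Dense layer
    class_weights = dense.get_weights()[0][:, class_index]    # (C,)
    cam = np.einsum("hwc,c->hw", feature_maps.numpy(), class_weights)
    cam = np.maximum(cam, 0)                                  # positive contributions only
    return cam / (cam.max() + 1e-8)                           # normalised heatmap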

In the LIME explanations in Figs. 10D, 11I and 12N, the yellow areas are identified as the key influencers steering the model’s decision. LIME produces an explanation by perturbing the input ALL image and observing the effect on the output. LIME was initially utilized to explain Inception-V3 by4; however, it has shown enhanced effectiveness with EfficientNet-B7.
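The following sketch illustrates how such an explanation could be produced with the lime-image package, assuming a trained model and a preprocessed RGB image are available; the perturbation budget of 1,000 samples and the number of highlighted superpixels are illustrative choices, not the paper’s settings.

import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

explainer = lime_image.LimeImageExplainer()

def predict_fn(batch):
    # LIME passes a batch of perturbed images; return class probabilities.
    return model.predict(np.asarray(batch), verbose=0)

explanation = explainer.explain_instance(image.astype("double"),
                                         predict_fn,
                                         top_labels=1,
                                         num_samples=1000)   # number of perturbations
label = explanation.top_labels[0]
img, mask = explanation.get_image_and_mask(label,
                                           positive_only=True,
                                           num_features=5,
                                           hide_rest=False)
overlay = mark_boundaries(img / 255.0, mask)   # outlines the influential regions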

In Figs. 10E, 11J, and 12O, the regions highlighted in green represent positive contributions to the model’s prediction, as determined by the Integrated Gradients (IG) method. IG works by computing the path integral of gradients along a straight-line path from a baseline input, such as a black image, to the actual input. This approach quantifies the contribution of each feature by accumulating gradients over the interpolation path, providing a clear attribution of the input features.
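A minimal IG sketch following this description is given below, assuming a black-image baseline and a 50-step interpolation; both parameters are illustrative rather than the paper’s configuration.

import tensorflow as tf

def integrated_gradients(model, image, class_index, steps=50):
    # Minimal IG sketch: accumulate gradients along a straight-line path
    # from an all-black baseline to the input, then scale by (input - baseline).
    # `image` has shape (1, H, W, 3).
    baseline = tf.zeros_like(image)                              # black-image baseline
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps + 1), (-1, 1, 1, 1))
    interpolated = baseline + alphas * (image - baseline)        # points along the path
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        preds = model(interpolated)
        score = preds[:, class_index]
    grads = tape.gradient(score, interpolated)
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)  # trapezoidal rule
    return ((image - baseline) * avg_grads)[0]                   # per-pixel attributions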

The visualizations generated by Grad-CAM, CAM, LIME, and IG were cross-verified against the ground truth provided in the dataset to evaluate their accuracy and reliability in identifying critical regions influencing the model’s predictions for ALL diagnosis. The results highlight varying degrees of effectiveness among these XAI methods in capturing relevant areas within the images.

Grad-CAM demonstrated the most accurate performance, effectively highlighting the critical regions when compared to the original image. Its ability to focus on the most relevant areas, particularly the non-functional white cells indicative of ALL, reinforces its reliability as an interpretability method. This alignment with the ground truth makes Grad-CAM a highly dependable tool for understanding the decision-making process of DL models.

CAM, while capturing the primary regions of interest, also included irrelevant areas in the prediction. Although it correctly identified the target cells, the inclusion of additional, less relevant cells reduced its precision compared to Grad-CAM. This highlights a potential limitation of CAM in distinguishing between the most and least relevant features within the complex cellular structures of ALL images.

LIME, on the other hand, struggled to maintain consistency with the ground truth. While it identified some relevant regions, it missed important features that were crucial for accurate predictions and, at the same time, included irrelevant areas. This inconsistency in capturing essential visual cues indicates that LIME may not be well-suited for tasks requiring high precision in localizing biologically significant features.

IG performed the weakest among the methods, failing to effectively capture the model’s complex decision-making behaviour. Its poor alignment with the ground truth highlights its limitations in interpreting the intricate patterns learned by DL models, particularly for tasks with high architectural complexity, such as ALL diagnosis.

In summary, the results demonstrate that Grad-CAM provides the most reliable and accurate visual explanations for model predictions, closely matching the ground truth. CAM offers reasonable explanations but lacks the precision of Grad-CAM, while LIME and Integrated Gradients fall short in capturing the relevant features necessary for accurate explainability. This cross-validation emphasizes the importance of selecting appropriate XAI methods based on the task and complexity of the underlying model.

Conclusion, limitations and future research directions

In this study, we have effectively demonstrated the integration of EfficientNet-B7 with XAI methods across three datasets (Taleqani Hospital, C-NMC-19, and the Multi-Cancer dataset) to enhance the ALL diagnosis process and provide explainable model decisions through Grad-CAM, CAM, LIME, and IG. The incorporation of comprehensive evaluation metrics, namely AUC, mAP, Accuracy, Precision, Recall, and F1-score, further validated the efficacy and reliability of our proposed framework, as presented in Tables 4, 5, and 6. These results establish a new benchmark for AI-driven diagnostics in ALL, emphasizing both performance and explainability.

The necessity of this study stems from the critical demand for diagnostic frameworks that balance computational efficiency and explainability, particularly in time-sensitive medical scenarios. Our experimental results highlight the superiority of the proposed framework in achieving significantly lower computational times across all three datasets compared to benchmark models such as VGG-19, InceptionResNetV2, ResNet50, DenseNet50 and AlexNet, ensuring timely diagnosis without compromising accuracy. Additionally, the explainability offered by the integrated XAI methods promotes trust and transparency, addressing a key challenge in the adoption of AI systems in clinical settings.

However, certain limitations of the proposed framework warrant further investigation. First, while our framework achieves superior diagnostic accuracy and computational efficiency, its performance has been validated only on haematological (microscopic image) datasets. Extending its applicability to other medical imaging domains and diseases will be crucial to generalize its utility. Second, the integration of multiple XAI methods, though beneficial for explainability, may introduce computational overhead that requires optimization in real-time clinical environments. Third, the reliance on labelled datasets poses challenges in scalability for regions with limited access to annotated medical data. Lastly, the smaller size of certain datasets could increase the risk of overfitting, despite mitigation strategies such as data augmentation and regularization.

Conclusively, this work not only proposed a novel framework for the diagnosis of ALL but also paved the way for future advancements in AI applications for medical diagnostics. By balancing computational efficiency with the imperative for explainability, clarity, and trust, this study sets a precedent for the integration of AI in healthcare. The results underscore the transformative potential of combining advanced DL architectures with XAI techniques to bridge the gap between technical performance and clinical applicability.

In future work, we aim to explore additional architectures (randomised and spiking neural networks) and XAI methodologies to extend this framework to a broader spectrum of haematological diseases. Furthermore, we plan to address the scalability challenge by utilising semi-supervised and unsupervised learning techniques, enabling the framework to adapt to scenarios with limited labelled data. Optimizing the computational overhead introduced by XAI methods will also be a priority to ensure real-time applicability in clinical settings. These directions will further enhance the diagnostic process, ensuring robust, explainable, and efficient solutions for diverse clinical challenges.

Acknowledgements

This research was supported by Science Foundation Ireland under grant numbers 18/CRT/6223 (SFI Centre for Research Training in Artificial Intelligence), 13/RC/2106 (ADAPT Centre), 13/RC/2094 (Lero Centre) and the College of Science and Engineering, UOG. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Author contributions

Conceptualization, D.M., M.S., M.B.; methodology, D.M., M.S.; software, D.M., M.S.; validation, D.M., A.K., M.B.; formal analysis, D.M., M.S.; investigation, D.M., M.B.; resources, D.M., M.B.; data curation, D.M., M.B.; writing - original draft preparation, D.M., M.S.; writing - review and editing, D.M., A.K., M.B.; visualization, D.M.; supervision, M.B.; project administration, M.B.; funding acquisition, D.M., M.B. All authors have read and agreed to the published version of the manuscript.

Data availability

The datasets used in this study are publicly available: Mourya, S., Kant, S., Kumar, P., Gupta, A., & Gupta, R. ALL Challenge dataset of ISBI 2019 (C-NMC 2019). The Cancer Imaging Archive, DOI: 10.7937/tcia.2019.dc64i46r (2019). https://doi.org/10.7937/tcia.2019.dc64i46r. Aria, M. et al. Acute lymphoblastic leukemia (ALL) image dataset. https://www.kaggle.com, DOI: 10.34740/KAGGLE/DSV/2175623 (2021). Naren, O. S. Multi Cancer dataset. Data set, DOI: 10.34740/KAGGLE/DSV/3415848 (2022).

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Gocher, A. M., Workman, C. J. & Vignali, D. A. Interferon-γ: teammate or opponent in the tumour microenvironment? Nature Reviews Immunology 22, 158–172 (2022).
2. Mondal, C. et al. Ensemble of convolutional neural networks to diagnose acute lymphoblastic leukemia from microscopic images. Informatics in Medicine Unlocked 27, 100794 (2021).
3. Ghaderzadeh, M. et al. A fast and efficient CNN model for B-ALL diagnosis and its subtypes classification using peripheral blood smear images. International Journal of Intelligent Systems 37, 5113–5133 (2022).
4. Abir, W. H. et al. Explainable AI in diagnosing and anticipating leukemia using transfer learning method. Computational Intelligence and Neuroscience 2022 (2022) (retracted).
5. Amin, M. M., Kermani, S., Talebi, A. & Oghli, M. G. Recognition of acute lymphoblastic leukemia cells in microscopic images using k-means clustering and support vector machine classifier. Journal of Medical Signals and Sensors 5, 49 (2015).
6. Kasani, P. H., Park, S.-W. & Jang, J.-W. An aggregated-based deep learning method for leukemic B-lymphoblast classification. Diagnostics 10, 1064 (2020).
7. Muhammad, D., Ahmed, I., Ahmad, M. O. & Bendechache, M. Randomized explainable machine learning models for efficient medical diagnosis. IEEE Journal of Biomedical and Health Informatics, 1–10, 10.1109/JBHI.2024.3491593 (2024).
8. Ali, M., Muhammad, D., Khalaf, O. I. & Habib, R. Optimizing mobile cloud computing: A comparative analysis and innovative cost-efficient partitioning model. SN Computer Science 6, 1–25 (2025).
9. Nasir, S., Khan, R. A. & Bai, S. Ethical framework for harnessing the power of AI in healthcare and beyond. arXiv preprint arXiv:2309.00064 (2023).
10. Muhammad, D. & Bendechache, M. Unveiling the black box: A systematic review of explainable artificial intelligence in medical image analysis. Computational and Structural Biotechnology Journal 24, 542–560, 10.1016/j.csbj.2024.08.005 (2024).
11. Kellener, E. et al. Utilizing segment anything model for assessing localization of Grad-CAM in medical imaging. arXiv preprint arXiv:2306.15692 (2023).
12. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
13. Shin, J. Feasibility of local interpretable model-agnostic explanations (LIME) algorithm as an effective and interpretable feature selection method: comparative fNIRS study. Biomedical Engineering Letters, 1–15 (2023).
14. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 3319–3328 (PMLR, 2017).
15. Khandekar, R., Shastry, P., Jaishankar, S., Faust, O. & Sampathila, N. Automated blast cell detection for acute lymphoblastic leukemia diagnosis. Biomedical Signal Processing and Control 68, 102690 (2021).
16. Abd El-Ghany, S., Elmogy, M. & El-Aziz, A. A. Computer-aided diagnosis system for blood diseases using EfficientNet-B3 based on a dynamic learning algorithm. Diagnostics 13, 404 (2023).
17. Liu, Y. & Long, F. Acute lymphoblastic leukemia cells image analysis with deep bagging ensemble learning. In ISBI 2019 C-NMC Challenge: Classification in Cancer Cell Imaging: Select Proceedings, 113–121 (Springer, 2019).
18. Uysal, I. & Kose, U. Development of a simulation environment for the importance of histone deacetylase in childhood acute leukemia with explainable artificial intelligence. BRAIN. Broad Research in Artificial Intelligence and Neuroscience 14, 254–286 (2023).
19. Jiang, Z., Dong, Z., Wang, L. & Jiang, W. Method for diagnosis of acute lymphoblastic leukemia based on ViT-CNN ensemble model. Computational Intelligence and Neuroscience 2021 (2021).
20. Cui, W. et al. Deep multiview module adaption transfer network for subject-specific EEG recognition. IEEE Transactions on Neural Networks and Learning Systems (2024).
21. Li, Y. et al. Spatio-temporal-spectral hierarchical graph convolutional network with semisupervised active learning for patient-specific seizure prediction. IEEE Transactions on Cybernetics 52, 12189–12204 (2021).
22. Liu, X., Huang, H. & Xiang, J. A personalized diagnosis method to detect faults in gears using numerical simulation and extreme learning machine. Knowledge-Based Systems 195, 105653 (2020).
23. Wang, S. & Xiang, J. A minimum entropy deconvolution-enhanced convolutional neural networks for fault diagnosis of axial piston pumps. Soft Computing 24, 2983–2997 (2020).
24. Fei, X., Wang, J., Ying, S., Hu, Z. & Shi, J. Projective parameter transfer based sparse multiple empirical kernel learning machine for diagnosis of brain disease. Neurocomputing 413, 271–283 (2020).
25. Bilal, O. et al. Differential evolution optimization based ensemble framework for accurate cervical cancer diagnosis. Applied Soft Computing 167, 112366 (2024).
26. Asif, S. et al. A deep ensemble learning framework for COVID-19 detection in chest X-ray images. Network Modeling Analysis in Health Informatics and Bioinformatics 13, 30 (2024).
27. Mourya, S., Kant, S., Kumar, P., Gupta, A. & Gupta, R. ALL Challenge dataset of ISBI 2019 (C-NMC 2019). The Cancer Imaging Archive, 10.7937/tcia.2019.dc64i46r (2019).
28. Dabass, M., Dabass, J., Vashisth, S. & Vig, R. A hybrid U-Net model with attention and advanced convolutional learning modules for simultaneous gland segmentation and cancer grade prediction in colorectal histopathological images. Intelligence-Based Medicine 7, 100094 (2023).
29. Aria, M. et al. Acute lymphoblastic leukemia (ALL) image dataset. https://www.kaggle.com, 10.34740/KAGGLE/DSV/2175623 (2021).
30. Naren, O. S. Multi Cancer dataset. Data set, 10.34740/KAGGLE/DSV/3415848 (2022).
31. Ghosh, S., Das, N. & Nasipuri, M. Reshaping inputs for convolutional neural network: Some common and uncommon methods. Pattern Recognition 93, 79–94 (2019).
32. Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, 6105–6114 (PMLR, 2019).
33. Ghosh, A., Soni, B. & Baruah, U. Transfer learning-based deep feature extraction framework using fine-tuned EfficientNet B7 for multiclass brain tumor classification. Arabian Journal for Science and Engineering, 1–22 (2023).
34. Muhammad, D., Keles, A. & Bendechache, M. Towards explainable deep learning in oncology: Integrating EfficientNet-B7 with XAI techniques for acute lymphoblastic leukaemia. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI) (Spain, 2024).
35. Albahar, M. A. & Binsawad, M. Deep autoencoders and feedforward networks based on a new regularization for anomaly detection. Security and Communication Networks 2020, 1–9 (2020).
36. Muhammad, D., Rafiullah & Bendechache, M. Improving diagnostic trust: an explainable deep learning framework for genitourinary cancer prediction. IET Conference Proceedings 2024, 47–54, 10.1049/icp.2024.3275 (2024). https://digital-library.theiet.org/doi/pdf/10.1049/icp.2024.3275.
37. Muhammad, D., Ahmad, I., Khalil, M. I., Khalil, W. & Ahmad, M. O. A generalized deep learning approach to seismic activity prediction. Applied Sciences 13, 1598 (2023).
38. Chauhan, T., Palivela, H. & Tiwari, S. Optimization and fine-tuning of DenseNet model for classification of COVID-19 cases in medical imaging. International Journal of Information Management Data Insights 1, 100020 (2021).
39. Vermeire, T., Brughmans, D., Goethals, S., de Oliveira, R. M. B. & Martens, D. Explainable image classification with evidence counterfactual. Pattern Analysis and Applications 25, 315–335 (2022).
40. Zhang, Y. et al. Grad-CAM helps interpret the deep learning models trained to classify multiple sclerosis types using clinical brain magnetic resonance imaging. Journal of Neuroscience Methods 353, 109098 (2021).
41. Bhattacharya, A. Applied Machine Learning Explainability Techniques: Make ML models explainable and trustworthy for practical applications using LIME, SHAP, and more (Packt Publishing Ltd, 2022).
42. Muhammad, D., Ahmed, I., Naveed, K. & Bendechache, M. An explainable deep learning approach for stock market trend prediction. Heliyon 10, e40095, 10.1016/j.heliyon.2024.e40095 (2024).
