Digital Health. 2026 Feb 11;12:20552076261423994. doi: 10.1177/20552076261423994

Unveiling the black box: Explainable transfer learning for ocular disorder diagnosis

Zaib un Nisa 1, Arfan Jaffar 1, Sohail Masood Bhatti 1, Ines Hilali Jaghdam 2, Tehseen Mazhar 3,4,, Muhammad Amir Khan 5,, Habib Hamam 6,7,8,9
PMCID: PMC12901956  PMID: 41696077

Abstract

Objective

To systematically evaluate transfer learning (TL) models for multiclass ocular disease diagnosis and assess their reliability using explainable artificial intelligence (AI).

Methods

Eight pretrained convolutional neural network (CNN) models were evaluated on a public dataset covering cataract, diabetic retinopathy, glaucoma, and normal classes under a unified protocol. Performance was measured using accuracy, precision, recall, and F1-score. Grad-CAM, LIME, and SHAP were used for interpretability, and the Friedman test assessed performance consistency.

Results

Several models achieved near-perfect performance for diabetic retinopathy. DenseNet121 and XceptionNet performed best for cataract detection, while glaucoma showed consistently weaker results, indicating the need for segmentation-based approaches. Despite similar accuracy, explainability revealed substantial differences in model attention. EfficientNetB3 produced the most clinically meaningful visual explanations.

Conclusions

Accuracy alone is insufficient for trustworthy medical AI. Explainable AI is essential for model selection. EfficientNetB3 offers the best balance between performance and interpretability, and glaucoma diagnosis requires more advanced, segmentation-aware pipelines.

Keywords: Ocular disorders, transfer learning, diabetic retinopathy, cataract, glaucoma, explainable AI

Introduction

Ocular disorders are the predominant cause of blindness. The World Health Organization reports that 2.2 billion individuals have visual impairment. 1 Although the eyes constitute a minor segment of the human anatomy, they are essential, and inadequate eyesight significantly diminishes quality of life. The sclera is the white part of the eye; it encircles the eye, as seen in Figure 1, and often becomes red as a result of blood vessel dilatation. Numerous forms of ocular disorders exist. One of the leading causes of adult-onset eye disorders is diabetic retinopathy. People with type 1 and type 2 diabetes may not realize the severity of their eye disease until it is too late, although it is preventable. Symptoms vary from minor ocular irritation to severe discomfort and visual impairment. To safeguard ocular health, early detection of eye illnesses is essential, and artificial intelligence algorithms are crucial in this context. The eye diseases we discuss in this study are cataract, diabetic retinopathy, and glaucoma. In cataract, the crystalline lens of the eye becomes clouded, resulting in blurred vision. It is commonly found in elderly persons; if this problem is not treated, it can cause complete blindness.

Figure 1.

Figure 1.

Anatomy of a normal eye.

Research gap and contribution

Transfer learning models are being rapidly adopted for ocular disorder diagnosis. However, most existing studies report the performance of a single model with limited interpretation of the results, and explainable artificial intelligence (AI) is often restricted to qualitative visualization with a single method. This yields limited insight into model- and disease-specific behavior and renders the decision-making process of deep learning ambiguous and unreliable. This study provides a comparative, explainability-based evaluation of eight widely adopted transfer learning models to assess their disease-specific reliability within a research-oriented decision support prototype. The main contributions of the study are as follows:

We offer a unified and systematic comparison of eight transfer learning (TL) models for multidisease classification of ocular disorders.

Disease-specific performance analysis, showing that the same TL model exhibits different performance for different eye diseases.

Explainable AI-based assessment using Grad-CAM (gradient-weighted class activation mapping), LIME (local interpretable model-agnostic explanations), and SHAP (SHapley Additive exPlanations) to provide multilevel interpretability at the spatial, instance, and pixel levels.

A cross-model consistency analysis, demonstrating that different models converge on clinically plausible visual cues for cataract and diabetic retinopathy, while glaucoma requires segmentation-based models.
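To make the spatial-level interpretability concrete, the core Grad-CAM aggregation step can be sketched as follows. This is a minimal illustration only: it assumes the last-layer feature maps and the gradients of the class score with respect to them have already been extracted from a trained network, and uses random stand-in arrays rather than real fundus data.

```python
import numpy as np

def grad_cam_heatmap(feature_maps, gradients):
    """Grad-CAM aggregation: weight each feature map by its pooled gradient.

    feature_maps: (H, W, K) activations of the last convolutional layer.
    gradients:    (H, W, K) gradients of the class score w.r.t. those maps.
    """
    # alpha_k: global-average-pooled gradient per channel
    weights = gradients.mean(axis=(0, 1))                    # shape (K,)
    # Weighted sum over channels, then ReLU keeps only positive evidence
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)
    # Scale to [0, 1] for overlaying on the fundus image
    return cam / cam.max() if cam.max() > 0 else cam

rng = np.random.default_rng(0)
feature_maps = rng.random((10, 10, 8))        # toy stand-in activations
gradients = rng.standard_normal((10, 10, 8))  # toy stand-in gradients
heatmap = grad_cam_heatmap(feature_maps, gradients)
print(heatmap.shape)  # (10, 10)
```

The resulting heatmap is upsampled to the input resolution and overlaid on the image; LIME and SHAP operate at the instance and pixel levels instead of on convolutional feature maps.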

Novelty

The main aim of this study is not to present a new methodology. Instead, it offers a disease-specific, explainability-grounded evaluation of transfer learning models within a unified experimental framework. It examines the differential behavior of the same model across various eye diseases, which helps identify significant issues for a research-oriented decision support prototype. Previous studies typically concentrate on a single eye disease and assess only aggregate accuracy, whereas this study applies explainable AI analysis across all transfer learning algorithms. Adequate accuracy alone is insufficient for effective decision making; interpretations based on explainable AI can provide medical professionals with valuable insights. The contribution is therefore evaluation-centric and interpretability-focused rather than methodological.

Related works

Current research

Normally, ocular disorders are not life-threatening, but they can significantly degrade quality of life over time. Cataracts are one of the major causes of vision impairment in the world, and early diagnosis can protect people before irreversible damage occurs. For early diagnosis, machine and deep learning algorithms play a vital role due to advancements in AI. Malik et al. used multiple machine learning algorithms for this purpose, including decision trees, random forests, naive Bayes, and neural networks. They extracted features from patient information such as age, illness history, and clinical observations. Random forest achieved the highest accuracy among all the ML classifiers, at 86.63%. 2 Abbas proposed a method for glaucoma detection that requires no manual segmentation, minimizing the need for domain expertise. He used a convolutional neural network (CNN) for feature extraction and a deep belief network for feature selection, achieving 99% accuracy, 98.01% specificity, and 84.5% sensitivity. The dataset consisted of 1200 images, which is small for deep learning algorithms, so he used multiple datasets to improve generalization. His method showed a large gap between specificity and sensitivity; in eye disease detection, good sensitivity is particularly important. He did not use explainable AI. 3

Saqib et al. 4 proposed a lightweight deep learning model based on MobileNetV1 and MobileNetV2 for cataract and glaucoma detection, demonstrating superior accuracy compared to other models using transfer learning techniques. Their work validated MobileNet's effectiveness in low-resource, mobile-based vision diagnostics.

Similarly, Guefrachi et al. 5 employed advanced CNN models such as InceptionResNetV2, MobileNetV2, and EfficientNet2L for diabetic retinopathy classification. Their multistage training strategy, involving feature extraction and fine-tuning, showed that integrating data augmentation with TL models significantly improves generalizability.

Jain et al. proposed a CNN model, LCDNet, for diagnosing diabetic retinopathy and retinitis pigmentosa using retinal fundus images. They did not perform any segmentation and used an automatic method for feature extraction and diagnosis. Their model achieved an accuracy of 99.7%. 4 However, they did not use any explainable AI tools to interpret their results, nor did they compare performance with other deep learning models. Muntaqim et al. introduced a multiscale deep learning approach for the diagnosis of ocular diseases. 5 They used a multistage deep learning architecture: the first stage extracts fine-grained information using convolutional layers; in stage 2, these attributes are further enhanced by two branches of convolutional blocks; and classification is conducted in stage 3. Although the model demonstrates exceptional performance, it does not use any explainable AI tools to elucidate the internal workings of the deep learning black box.

Beyond classification, segmentation-based approaches are critical, especially for glaucoma detection. Chen and Lv 6 proposed deep CNN methods for segmenting the optic disc and cup in fundus images, enhancing the precision of glaucoma diagnosis through boundary-aware attention mechanisms. Alkhaldi and Alabdulathim 7 developed a hybrid model combining clinical and segmentation features for improved glaucoma severity classification. These studies emphasize that precise boundary segmentation significantly impacts diagnosis quality.

GAN-based segmentation, such as the adversarial approach proposed by Liu et al. 8 improves robustness in distinguishing optic disc and cup boundaries by learning structural variations in fundus images.

Chen et al. 9 validated lightweight models such as MobileNetV3 for segmentation tasks, confirming their suitability for real-time medical imaging on edge devices.

Alzamil 10 employed a deep neural network (DNN) to classify retinal fundus diseases, comparing its performance with four pretrained models. The model achieved an accuracy of 94.10%, better than the transfer learning algorithms. However, classification was used to detect all the diseases, whereas some diseases need precise localization. Rachmawanto et al. proposed a CNN for the classification of four diseases and achieved a remarkable accuracy of 95%, but their model did not show how the deep learning model makes its decisions. Explainable AI helps practitioners and AI researchers understand which features and which parts of the image the deep learning model weighs most heavily. In a related advancement, Guefrachi et al. 11 proposed a deep learning-based clinical support tool using ResNet152V2 integrated into a graphical user interface for automated diabetic retinopathy screening. The system achieved 100% precision and recall, further confirming that certain TL models are mature for real-world deployment in ophthalmology. Kaushik et al. 12 introduced a stacked generalization approach for diabetic retinopathy classification, incorporating image normalization and ensemble CNNs to improve diagnostic robustness under varying illumination. In another domain, Shahwar et al. 13 explored hybrid classical-quantum transfer learning models using ResNet34 for Alzheimer's diagnosis, achieving significant improvements over classical models. Their findings highlight the potential of hybrid architectures in enhancing medical image classification pipelines. The study 14 evaluates the effectiveness of several DNN architectures for identifying ocular diseases using the Ocular Disease Intelligent Recognition (ODIR) dataset. To enhance image quality and draw attention to key elements, preprocessing methods such as bilateral filtering, unsharp masking, and region of interest (ROI) selection were employed. A variety of artificial neural network architectures, including CNNs, long short-term memory (LSTM) networks, and DNNs, were used to test several models, including ResNet50, InceptionV3, MobileNet, DenseNet121, NASNetMobile, Xception, and VGG19.

Singh et al. 15 presented a comparative analysis using deep learning models for ocular disease detection. Though the models show good performance, they relied only on performance metrics to interpret the results and did not use any explainable AI tools. Muntaqim et al. 16 presented a comparison between deep learning and machine learning algorithms and found that DenseNet-169 performed best among all the models; they also did not show how the deep learning black box actually works.

Methodology

In this research study, we classify eye images into four classes: diabetic retinopathy, cataract, glaucoma, and normal. In cataract, the lens develops an opacity that can vary in location and size, causing blurry or impaired vision. 15 Neovascularization, microaneurysms, and hemorrhages are typical signs of diabetic retinopathy. Mild nonproliferative abnormalities are the first stage of the disease, which develops into proliferative retinopathy and eventually causes blindness through widespread neovascularization and hemorrhages. 6 Inflammation and retinal neurodegeneration are also primary drivers of the pathogenesis of diabetic macular edema. Glaucoma is caused by progressive damage to the optic nerve and leads to irreversible loss of vision. 8 In this study, we use eight transfer learning algorithms for the classification of these diseases, as shown in Figure 2, which also illustrates the workflow of the research study. First, images are loaded into the workspace from the dataset “eye_disease_detection,” which is available on the public repository Kaggle: https://www.kaggle.com/datasets/gunavenkatdoddi/eye-diseases-classification.

Figure 2.

Figure 2.

Process flow diagram of this research study.

Dataset

The retinal scans in the collection are categorized into four groups: normal, diabetic retinopathy, cataract, and glaucoma, with about a thousand images per group. Sources such as IDRiD, Ocular Recognition, HRF, and others contribute to these images. In total, 4217 images belong to the four classes, and the images are evenly distributed among them, as shown in Figure 3. Because the dataset is balanced, heavy data augmentation is unnecessary; we applied only light augmentation and deliberately avoided aggressive techniques to prevent image distortion. Figure 4 shows some randomly chosen images from the dataset.

Figure 3.

Figure 3.

Percentage of images in each class in the dataset.

Figure 4.

Figure 4.

Random images from the dataset.

Data preprocessing

In the preprocessing step, the data is normalized by dividing all input pixel values by 255, scaling them to the range [0, 1]. This aids training and helps mitigate the vanishing gradient problem. As we iterate through the dataset, all elements are converted into NumPy arrays, which integrate easily with Python tools. We used a GPU-based system and a cloud platform for code execution; the link to the code is provided in the code availability section.
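This scaling step can be sketched as follows; the batch here is a random stand-in rather than the actual fundus images:

```python
import numpy as np

# Stand-in batch of four 300 x 300 RGB fundus images with 8-bit pixels
images = np.random.randint(0, 256, size=(4, 300, 300, 3), dtype=np.uint8)

# Divide by 255 to scale pixel values into [0, 1]; float32 matches the
# precision expected by most CNN frameworks
normalized = images.astype(np.float32) / 255.0

print(normalized.min() >= 0.0, normalized.max() <= 1.0)  # True True
```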

Data split

Seventy percent of the dataset is reserved for training, while the remaining 30% is used for testing, that is, a 70:30 split.
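A split of this kind is commonly produced with scikit-learn. The snippet below is a sketch with placeholder labels standing in for the actual image arrays; the `stratify` and `random_state` arguments are illustrative assumptions, as the paper does not state whether stratification or a fixed seed was used.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples evenly spread over the four classes
labels = ["cataract", "diabetic_retinopathy", "glaucoma", "normal"] * 250
images = list(range(len(labels)))  # stand-ins for image arrays

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))  # 700 300
```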

All the model architectures used in the current study, are described below.

Models architectures

The following models are used for transfer learning: EfficientNetB3, MobileNet, InceptionResNetV2, ResNet50, XceptionNet, DenseNet121, VGG19, and InceptionV3.

EfficientNetB3

This model is known for its scalability and efficiency. It is part of the EfficientNet family, which is specifically designed to balance accuracy and efficiency. It is used mostly in medical imaging, including brain tumor classification on magnetic resonance images, breast cancer classification, and leukemia detection.17,18 It is also used for plant disease classification and three-dimensional (3D) vehicle detection and segmentation, and has been optimized using genetic algorithms. 19

This research study includes EfficientNetB3 as the backbone of the transfer learning pipeline. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 5.

Figure 5.

Figure 5.

Summary of the transfer learning model in which EfficientNetB3 is used.

MobileNet

MobileNet is a family of CNNs specifically designed for mobile and embedded vision applications, providing strong performance with constrained resources. It is mostly used for image classification and object detection. 20 MobileNets employ depth-wise separable convolutions, which decrease the number of parameters as the model deepens, hence reducing computational complexity. This research study includes MobileNet as the backbone of the transfer learning pipeline. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 6.

Figure 6.

Figure 6.

Summary of the transfer learning model in which MobileNet is used.
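The parameter savings of depth-wise separable convolutions can be seen with a small worked example (bias terms omitted for simplicity): a standard convolution needs k·k·C_in·C_out weights, while the depth-wise plus point-wise factorization needs only k·k·C_in + C_in·C_out.

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel,
    # pointwise: 1 x 1 convolution mapping c_in -> c_out channels
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```

For a typical 3×3 layer with 64 input and 128 output channels, the separable form uses roughly 8× fewer weights, which is what makes MobileNet suitable for constrained devices.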

InceptionResNetV2

It is a CNN architecture that combines the strengths of the Inception and ResNet models and is designed to enhance efficiency in image classification tasks. In the domain of medical imaging, where prompt and correct diagnosis is critical, 21 it has shown excellent performance in classifying brain tumors and detecting plant diseases. 22

It handles multiscale feature extraction, which makes it ideal for complex image classification tasks, 5 and it is widely used in transfer learning models. This research study also includes InceptionResNetV2 as a component of the transfer learning pipeline. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 7.

Figure 7.

Figure 7.

Summary of the transfer learning model in which InceptionResNetV2 is used.

Figure 7 shows that InceptionResNetV2 has 54,737,380 total parameters, the largest among all the transfer learning models used in this study, indicating that this model is computationally expensive.

ResNet50

It is a deep residual network. Residual networks introduced the concept of residual learning to address the vanishing gradient problem, which otherwise degrades the performance of very deep networks. ResNet50 is extensively used in computer vision applications due to its ability to extract significant details from images, and it has shown exceptional efficacy in identifying gastrointestinal diseases and brain tumors. 23 ResNet50 is often used in transfer learning models with pretrained weights, and this research study uses it likewise. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 8.

Figure 8.

Figure 8.

Summary of the transfer learning model in which Resnet50 is used.

XceptionNet

It is a deep learning model widely used for image classification problems. It is known for its depth-wise separable convolutions, which reduce computational cost and enhance performance. It has been used for automatic skin cancer detection, deepfake detection, and COVID-19 diagnosis.

This research study uses XceptionNet for transfer learning. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 9.

Figure 9.

Figure 9.

Summary of XceptionNet.

DenseNet121

DenseNet is known for its dense connectivity pattern. It is widely used in various domains, including medical imaging, agriculture, and waste management. In medical imaging, it has been successfully used for diagnosing COVID-19 and brain tumors; in agriculture, it has been used effectively for identifying tomato plant diseases; and it has also been applied to deepfake detection. 24 This research study uses DenseNet121 for transfer learning. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 10.

Figure 10.

Figure 10.

Summary of the transfer learning model in which DenseNet121 is used.

VGG19

It is used in several applications because of its robustness in visual recognition tasks. It can extract hierarchical information from images, making it useful for processing complicated visuals. It consists of 19 layers: 16 convolutional layers and 3 fully connected layers. It has been used for facial recognition, human activity identification, and brain tumor detection. This research study also uses VGG19 for transfer learning. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 11.

Figure 11.

Figure 11.

Summary of the transfer learning model in which VGG19 is used.

InceptionV3

It is recognized for its robustness, attributed to multiscale feature extraction, and is mostly used in intricate pattern recognition tasks. Similar to other TL models, it is used in medical imaging, agriculture, and plant identification. The most significant part of this model is its use in real-time systems, namely the Alternating Least Squares system for patients.

The summary of InceptionV3 is presented in Figure 12.

Figure 12.

Figure 12.

Summary of the transfer learning model in which InceptionV3 is used.

Transfer learning setup

The same procedure is followed for all the algorithms in this research study. The backbone models are pretrained on the “ImageNet” dataset, and those pretrained weights are loaded. The top layers of the pretrained models are excluded so that we can use our own classification layers. The input image shape is (300, 300, 3). After the pretrained layers, a global max-pooling layer flattens the output so that our customized top layers can be attached.

Each model is trained for up to 40 epochs with a mini-batch size of 40. Patience is set to 1: if the accuracy or loss does not improve for 1 epoch, the learning rate is reduced. We incorporated regularization across all the transfer learning models, using L2 regularization, dropout, and batch normalization, as shown in the model summaries, to counter overfitting. Stop-patience is set to 3: if no improvement is observed after 3 epochs, training stops. The threshold is set to 0.9: while training accuracy is below 90%, the callback monitors training accuracy, and once it surpasses 90%, the focus shifts to validation loss.
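The training-control rules above can be sketched as a simplified, framework-free function. This is an illustrative reconstruction of the described callback logic, not the study's actual implementation; the function name and the tuple-based history format are assumptions.

```python
def lr_schedule(history, lr=1e-3, factor=0.5, patience=1,
                stop_patience=3, threshold=0.9):
    """Simplified sketch of the training-control logic described above.

    history: list of (train_accuracy, val_loss) tuples, one per epoch.
    """
    best, bad_epochs, stalls, prev_mode = None, 0, 0, None
    for train_acc, val_loss in history:
        # Below the threshold, monitor training accuracy; above it,
        # monitor validation loss (negated so "higher is better" everywhere).
        mode = "acc" if train_acc < threshold else "loss"
        if mode != prev_mode:          # reset the baseline when switching
            best, prev_mode = None, mode
        metric = train_acc if mode == "acc" else -val_loss
        if best is None or metric > best:
            best, bad_epochs = metric, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # reduce the learning rate
                lr *= factor
                bad_epochs = 0
                stalls += 1
                if stalls >= stop_patience:
                    break                # stop training entirely
    return lr

# Four stagnant epochs: three lr reductions, then early stop.
print(lr_schedule([(0.80, 1.0)] * 4))  # 0.000125
```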

CNN model

A CNN model is used for non-TL comparison. The architecture of the CNN model is presented in Figure 13.

Figure 13.

Figure 13.

Summary of convolutional neural network (CNN) model used for base-line comparison.

The model uses the same configuration as all the other models in the current study: the same dataset and data split, 40 epochs, patience of 1, stop-patience of 3, a threshold of 0.9, and a factor of 0.5.

Results

Experimental setup and evaluation protocol

The classification layers of all transfer learning models were customized, as described in the Methodology section. Model performance was evaluated using accuracy, precision, recall, and F1-score, reported both overall and class-wise. In addition, the nonparametric Friedman statistical test was applied to assess whether performance differences among models were statistically significant across disease classes. All experiments were conducted using a GPU-enabled environment on a cloud platform. The implementation code is provided in the Code Availability section.
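The Friedman test can be reproduced directly from the class-wise accuracies reported in Tables 1 to 4, treating the eight models as blocks and the four disease classes as related treatments. The snippet below is a sketch of that computation; whether the study applied the test to accuracies in exactly this orientation is an assumption.

```python
from scipy.stats import friedmanchisquare

# Class-wise accuracies of the eight TL models, in the order
# EfficientNetB3, MobileNet, InceptionResNetV2, ResNet50,
# DenseNet121, XceptionNet, InceptionV3, VGG19 (Tables 1-4).
cataract = [0.94, 0.94, 0.93, 0.96, 0.96, 0.96, 0.96, 0.94]
dr       = [0.98, 1.00, 0.99, 1.00, 1.00, 0.98, 1.00, 1.00]
glaucoma = [0.92, 0.89, 0.86, 0.91, 0.91, 0.89, 0.88, 0.89]
normal   = [0.94, 0.88, 0.90, 0.92, 0.89, 0.90, 0.91, 0.91]

# Blocks are models; treatments are the four disease classes.
stat, p = friedmanchisquare(cataract, dr, glaucoma, normal)
print(round(stat, 2), p < 0.05)
```

Because diabetic retinopathy ranks highest and glaucoma lowest for nearly every model, the test statistic far exceeds the chi-square critical value (7.81 at df = 3, alpha = 0.05), confirming that the per-disease differences are statistically significant.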

EfficientNetB3

EfficientNetB3 was trained for a maximum of 40 epochs, but early stopping terminated training at epoch 26. The best model was obtained at epoch 22, achieving a validation accuracy of 95% (Figure 14). The training and validation curves indicate the presence of overfitting. On the test dataset, the model achieved 96% accuracy, 96% precision, 93% recall, and 92.5% F1-score.

Figure 14.

Figure 14.

Accuracy and loss plots of EfficientNetB3. The model achieved 99% training and 95% validation accuracy.

MobileNet

MobileNet completed the full 40 training epochs, as early stopping was not triggered. The best performance was observed at epoch 38. The model achieved 100% training accuracy and 94% validation accuracy (Figure 15). The curves indicate mild overfitting. On the test dataset, MobileNet achieved 94% accuracy, 94% precision, 92% recall, and 93% F1-score.

Figure 15.

Figure 15.

Accuracy and loss plots of MobileNet. The model achieved 99% training and 94% validation accuracy.

InceptionResNetV2

InceptionResNetV2 training was stopped early at epoch 18 due to performance stabilization. The best epoch was identified as epoch 15, where the model achieved 99.5% training accuracy and 95% validation accuracy (Figure 16). The learning curves exhibit noticeable fluctuations and a gap between training and validation performance. On the test dataset, the model achieved 96% accuracy, 96% precision, 94% recall, and 95% F1-score.

Figure 16.

Figure 16.

Accuracy and loss plots of InceptionResNetV2. The model achieved 99.5% training and 95% validation accuracy.

Resnet50

ResNet50 was trained for up to 40 epochs and completed 35 epochs before training stabilization. As shown in Figure 17, both training and validation accuracy curves stabilized after approximately epoch 19. The model achieved 100% training accuracy and 94% validation accuracy. On the test dataset, ResNet50 obtained 93% accuracy, 92% precision, 90% recall, and 91% F1-score.

Figure 17.

Figure 17.

Accuracy and loss plots of ResNet50. The model achieved 100% training and 94% validation accuracy.

As noted in the transfer learning setup, we incorporated L2 regularization, dropout, and batch normalization across all the transfer learning models, as shown in the model summaries, to counter overfitting. For data augmentation we used only random horizontal flipping; more aggressive augmentation was deliberately avoided so that images would not be distorted. We also used early stopping with adaptive learning-rate scheduling via a custom callback, dynamically reducing the learning rate when performance stalled. Despite these measures, some overfitting remains, as is widely reported for fundus image benchmarks. We therefore emphasize that these results should not be used directly in clinical practice, but rather within a research-oriented decision support prototype.

DenseNet121

DenseNet121 exhibited fluctuations in validation performance during the initial training phase, followed by stabilization after approximately epoch 20 (Figure 18). The model achieved 100% training accuracy and 96% validation accuracy. On the test dataset, DenseNet121 achieved 94% accuracy, 96% precision, 93% recall, and 92.5% F1-score.

Figure 18.

Figure 18.

Accuracy and loss plots of DenseNet121. The model achieved 100% training and 96% validation accuracy.

XceptionNet

XceptionNet displayed noticeable fluctuations in the validation accuracy curve throughout training (Figure 19). The training curve reached 100% accuracy, while validation accuracy peaked at 95%, indicating a gap between training and validation performance.

Figure 19.

Figure 19.

Accuracy and loss plots of XceptionNet. The model achieved 100% training and 95% validation accuracy.

VGG19

VGG19 showed substantial fluctuations during early training, followed by smoother convergence in later epochs (Figure 20). The model achieved 95% training accuracy and 91% validation accuracy. These values reflect the final stabilized performance across training epochs.

Figure 20.

Figure 20.

Accuracy and loss plots of VGG19. The model achieved 95% training and 91% validation accuracy.

InceptionV3

InceptionV3 demonstrated relatively smooth training behavior with minor fluctuations in the validation curve (Figure 21). The model achieved 100% training accuracy and 94% validation accuracy. The learning curves show a visible gap between training and validation performance.

Figure 21.

Figure 21.

Accuracy and loss plots of InceptionV3. The model achieved 100% training and 94% validation accuracy.

Overall performance comparison

Figure 22 summarizes the comparative performance of all eight transfer learning models across the four disease classes. Several models achieved comparable overall performance levels. For cataract detection, ResNet50, DenseNet121, XceptionNet, and InceptionV3 each achieved an accuracy of 0.96 (Table 1). For diabetic retinopathy, near-perfect performance was observed across multiple models, with several architectures (MobileNet, ResNet50, DenseNet121, InceptionV3, and VGG19) achieving accuracy, precision, recall, and F1-score values of 1.00 (Table 2).

Figure 22.

Figure 22.

Performance of the transfer learning algorithms for detecting different diseases.

Table 1.

Performance of various transfer learning models for detecting cataract.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.94 0.98 0.94 0.96
MobileNet 0.94 0.98 0.92 0.95
InceptionResNetV2 0.93 0.97 0.90 0.94
ResNet50 0.96 0.97 0.95 0.96
DenseNet121 0.96 0.98 0.95 0.97
XceptionNet 0.96 0.98 0.95 0.97
InceptionV3 0.96 0.97 0.95 0.96
VGG19 0.94 0.99 0.91 0.95

Table 2.

Performance of various transfer learning models for detecting diabetic retinopathy.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.98 0.99 0.99 1.00
MobileNet 1.00 1.00 1.00 1.00
InceptionResNetV2 0.99 1.00 0.99 1.00
ResNet50 1.00 1.00 1.00 1.00
DenseNet121 1.00 1.00 1.00 1.00
XceptionNet 0.98 0.98 0.98 0.97
InceptionV3 1.00 1.00 1.00 1.00
VGG19 1.00 1.00 1.00 1.00

For glaucoma detection, accuracy values ranged between 0.86 and 0.92 across models (Table 3). For the normal class, accuracy values ranged between 0.88 and 0.94 (Table 4).

Table 3.

Performance of various transfer learning models for detecting glaucoma.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.92 0.89 0.90 0.90
MobileNet 0.89 0.86 0.92 0.89
InceptionResNetV2 0.86 0.88 0.87 0.88
ResNet50 0.91 0.92 0.90 0.91
DenseNet121 0.91 0.91 0.91 0.91
XceptionNet 0.89 0.91 0.88 0.89
InceptionV3 0.88 0.86 0.90 0.88
VGG19 0.89 0.85 0.93 0.89

Table 4.

Performance of various transfer learning models for detecting a healthy eye.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.94 0.88 0.91 0.90
MobileNet 0.88 0.88 0.89 0.88
InceptionResNetV2 0.90 0.85 0.94 0.89
ResNet50 0.92 0.90 0.93 0.91
DenseNet121 0.89 0.88 0.90 0.89
XceptionNet 0.90 0.86 0.94 0.90
InceptionV3 0.91 0.92 0.90 0.91
VGG19 0.91 0.91 0.91 0.91

Tables 1 to 4 report class-wise performance metrics (accuracy, precision, recall, and F1-score) for each model. To further summarize performance consistency, the mean and standard deviation of all models for each disease category were computed (Table 5). Diabetic retinopathy exhibited the highest mean accuracy (0.993) with low standard deviation (0.0091), whereas glaucoma showed comparatively lower mean accuracy (0.89) and higher variability.

Table 5.

Mean and standard deviation of the models.

Performance metric Cataract (mean ± SD) Diabetic retinopathy (mean ± SD) Glaucoma (mean ± SD) Normal (mean ± SD)
Accuracy 0.94 ± 0.012 0.993 ± 0.0091 0.89 ± 0.019 0.90 ± 0.018
Recall 0.97 ± 0.007 0.996 ± 0.0074 0.88 ± 0.026 0.88 ± 0.02
Precision 0.93 ± 0.020 0.995 ± 0.0075 0.90 ± 0.019 0.91 ± 0.019
F1-score 0.95 ± 0.010 0.996 ± 0.010 0.89 ± 0.0118 0.89 ± 0.011

Figure 23 presents the accuracy distribution of all models for the four disease categories. The figure shows variation in model performance across disease types. Figures 24 and 25 present the corresponding precision and recall distributions across models. Figure 26 shows the confusion matrices for all eight transfer learning models evaluated on the test dataset. The matrices illustrate the distribution of true and predicted labels for each class across models.
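The class-wise metrics reported in Tables 1 to 4 follow directly from each model's confusion matrix. As a minimal sketch (not the study's code), per-class precision, recall, and F1 can be computed from a confusion matrix as follows:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a multiclass confusion
    matrix `cm`, where cm[i, j] counts samples of true class i
    predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
    recall = tp / cm.sum(axis=1)     # row sums = true counts per class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-class example: class 0 is always recovered; one class-1 sample
# is misclassified as class 0.
cm = np.array([[5, 0],
               [1, 4]])
precision, recall, f1 = per_class_metrics(cm)
```

The same computation, applied to each of the four disease classes in turn, yields the values tabulated above.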

Figure 23.

Accuracies of all the pretrained models using the transfer learning (TL) approach for the four cases, cataract, diabetic retinopathy, glaucoma, and normal.

Figure 24.

Precision of all the pretrained models using the transfer learning (TL) approach for the four cases, cataract, diabetic retinopathy, glaucoma, and normal.

Figure 25.

Recall of all the pretrained models using the transfer learning (TL) approach for the four cases, cataract, diabetic retinopathy, glaucoma, and normal.

Figure 26.

Confusion matrices of the transfer learning models used in the current study, evaluated on the test data.

Baseline comparison with CNN

A conventional CNN model was implemented as a baseline using the same dataset, data split, and training configuration. The CNN model was trained for up to 40 epochs, with training terminated at epoch 31 due to stagnation in performance (Figure 27).

Figure 27.

Train, validation, and test results of convolutional neural network (CNN) model.

The CNN achieved an overall test accuracy of 82%. Figure 27 presents the training, validation, and test performance curves, while Figure 28 presents the confusion matrix for the CNN model evaluated on the test dataset. Class-wise performance indicated stronger performance for diabetic retinopathy compared to cataract and glaucoma.
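The stagnation-based termination at epoch 31 corresponds to the standard early-stopping patience rule. A minimal sketch of that rule (the patience and min_delta values here are illustrative assumptions, not the study's exact settings):

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=1e-3):
    """Return the 1-based epoch at which training stops: the first epoch
    after `patience` consecutive epochs without an improvement of at
    least `min_delta` over the best validation score so far."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            best, wait = score, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch       # stagnation: stop training here
    return len(val_scores)         # ran through all epochs without stopping
```

With three improving epochs followed by a plateau and a patience of 3, training would stop at epoch 6, mirroring the plateau-driven stop reported for the baseline CNN.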

Figure 28.

Confusion matrix of the convolutional neural network (CNN) model.

Friedman test

A nonparametric Friedman test was conducted to compare the performance of the eight transfer learning models across the four disease categories. The test was performed using the accuracy matrix reported in Table 6.

Table 6.

Accuracy matrix used for Friedman test.

Model Cataract Diabetic retinopathy Glaucoma Normal
EfficientNetB3 0.94 0.98 0.92 0.94
MobileNet 0.94 1.00 0.89 0.88
InceptionResNetV2 0.93 0.99 0.86 0.90
ResNet50 0.96 1.00 0.91 0.92
DenseNet121 0.96 1.00 0.91 0.89
XceptionNet 0.96 0.98 0.89 0.90
InceptionV3 0.96 1.00 0.88 0.91
VGG19 0.94 1.00 0.89 0.91

Two hypotheses were considered:

  • Null hypothesis (H₀): All models perform similarly, with no statistically significant differences.

  • Alternative hypothesis (H₁): At least one model performs significantly differently.

The Friedman test produced a chi-square statistic of 9.333 and a p-value of 0.2296 (Figure 29). Table 6 reports the accuracy values used for the test across all models and disease classes.

Figure 29.

Result of Friedman test.

The test also generated model ranking values based on average ranks across conditions. ResNet50 obtained the lowest (best) average rank (2.50), followed by DenseNet121 and EfficientNetB3, while InceptionResNetV2 showed the highest (least favorable) rank.
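Assuming SciPy is available, the reported statistic can be reproduced from the Table 6 accuracy matrix by treating the eight models as repeated measures over the four disease classes:

```python
from scipy.stats import friedmanchisquare

# Accuracy per model across the four classes
# (cataract, diabetic retinopathy, glaucoma, normal), from Table 6.
accuracies = {
    "EfficientNetB3":    [0.94, 0.98, 0.92, 0.94],
    "MobileNet":         [0.94, 1.00, 0.89, 0.88],
    "InceptionResNetV2": [0.93, 0.99, 0.86, 0.90],
    "ResNet50":          [0.96, 1.00, 0.91, 0.92],
    "DenseNet121":       [0.96, 1.00, 0.91, 0.89],
    "XceptionNet":       [0.96, 0.98, 0.89, 0.90],
    "InceptionV3":       [0.96, 1.00, 0.88, 0.91],
    "VGG19":             [0.94, 1.00, 0.89, 0.91],
}

stat, p = friedmanchisquare(*accuracies.values())
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

Since p > 0.05, the null hypothesis is retained: the observed accuracy differences between models are not statistically significant at the 5% level.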

Grad-CAM results

XAI is crucial for opening the black box of deep learning, and researchers increasingly use it to interpret model results. 17 For example, XAI has been used to highlight the image regions that drive breast cancer diagnosis, allowing clinicians to verify AI predictions and building trust and transparency; one such approach improved breast cancer identification through advanced image preprocessing, modifications to the DenseNet architecture, and a specialized fine-tuning process for histopathology imaging. 25 In ophthalmology, XAI likewise helps clinicians understand a model's decisions by marking clinically relevant areas in retinal images, supporting the diagnosis of eye disorders. In this study, Grad-CAM visualizations were generated for all eight transfer learning models using the same cataract image as input.

The heatmaps show varying spatial distributions of highlighted regions across models. Some models exhibit concentration of activation near the central region of the fundus image, while others display more diffuse or fragmented activation patterns. The intensity and localization of highlighted regions differ across architectures. Figure 30 presents the resulting heatmaps for each model.

Figure 30.

Grad-CAM of eight pretrained models used in this study by the transfer learning approach. The same image of cataract eye disease is used for all the models for generating their heatmaps.

Although all classifiers achieved excellent accuracy, the Grad-CAM heatmaps revealed substantial differences in their attention patterns. This shows that high accuracy alone cannot establish a model's reliability and underscores the importance of explainable AI for safe model selection. Among the eight TL models, EfficientNetB3 produced the most clinically plausible attention maps for eye disease detection and may also assist physicians in localizing the affected regions.
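Although the exact heatmaps depend on each backbone, the final Grad-CAM step is identical for all models: global-average-pool the class-score gradients to obtain one weight per channel, take the weighted sum of the feature maps, and apply a ReLU. A framework-agnostic NumPy sketch of that step (in practice the gradients come from the framework's autodiff):

```python
import numpy as np

def grad_cam_heatmap(feature_maps, gradients):
    """Combine last-conv-layer activations into a Grad-CAM heatmap.

    feature_maps: (H, W, K) activations of the last convolutional layer.
    gradients:    (H, W, K) gradients of the target class score w.r.t.
                  those activations.
    """
    # alpha_k: global-average-pooled gradient, one weight per channel
    alphas = gradients.mean(axis=(0, 1))                      # shape (K,)
    # channel-weighted sum of the feature maps, then ReLU
    cam = np.maximum((feature_maps * alphas).sum(axis=-1), 0.0)
    # normalize to [0, 1] for overlay on the fundus image
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting low-resolution map is then upsampled to the input size and overlaid on the fundus image, producing heatmaps like those in Figure 30.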

LIME and SHAP

LIME and SHAP analyses were conducted to further examine model behavior at the instance level. Figure 31 presents LIME and SHAP outputs for InceptionV3 as an illustrative example. The highlighted superpixels in LIME represent regions contributing positively to the prediction, while SHAP visualizations show positive and negative pixel contributions.

Figure 31.

LIME and SHAP for model InceptionV3.

Figure 32 presents the LIME visualizations across all eight models. The superpixel patterns differ across architectures, with some models producing broader contiguous regions and others producing more fragmented attribution regions.

Figure 32.

LIME results of the eight models used in the study.

Figure 33 presents SHAP visualizations across all models. The pixel-level attributions appear spatially distributed across the image rather than confined to a single localized region. The predicted probabilities associated with each model are reported alongside the corresponding visualizations.

Figure 33.

SHAP of all the models used in this study.
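The perturbation idea underlying LIME's superpixel attributions can be illustrated with a simpler occlusion sketch: grey out one patch at a time and record how much the model's score drops. Here `score_fn` is a stand-in for any trained classifier's class probability, and the image is assumed grayscale with dimensions divisible by the patch size:

```python
import numpy as np

def occlusion_attribution(image, score_fn, patch=4):
    """Perturbation-based attribution: zero out one patch at a time
    and record the resulting drop in the model score. Large drops mark
    regions the prediction depends on."""
    h, w = image.shape
    base = score_fn(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = 0.0  # occlude one patch
            heat[i // patch, j // patch] = base - score_fn(masked)
    return heat
```

LIME refines this idea by perturbing irregular superpixels and fitting a local linear surrogate model, while SHAP assigns signed per-pixel contributions, but the underlying question is the same: which regions change the prediction when removed?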


Discussion

Disease-specific behavior of transfer learning models

The experimental results indicate that the same TL architecture does not behave uniformly across ocular diseases. Instead, model performance varies systematically according to the visual characteristics of each condition and the dataset properties. This disease-dependent behavior underscores the need for class-level evaluation rather than relying only on global performance.

For diabetic retinopathy (DR), several models achieved near-perfect performance across evaluation metrics. This consistent outcome across diverse architectures suggests that the dataset contains visually distinctive patterns that are readily separable by deep convolutional models. Lesions commonly associated with DR, such as hemorrhages, microaneurysms, and exudates, introduce strong textural and structural contrasts that facilitate discrimination from normal images. Moreover, the highly curated nature of the dataset and relatively controlled imaging conditions likely contributed to the observed performance. Accordingly, DR appears comparatively less complex than the other classes within this benchmark, which helps explain the uniformly high scores.

Cataract detection also exhibited strong performance across multiple architectures. Several models, including ResNet50, DenseNet121, XceptionNet, and InceptionV3, achieved comparable accuracy values, with DenseNet121 and XceptionNet showing particularly balanced performance across accuracy, precision, and recall. Unlike DR, cataract is characterized by global image degradation and loss of clarity rather than localized pathological structures. This characteristic is consistent with the diffuse attribution patterns observed in the explainability analyses and helps explain why multiple architectures can achieve competitive performance despite architectural differences. However, differences across models suggest that sensitivity to cataract-related visual degradation remains architecture-dependent.

In contrast, glaucoma consistently produced lower performance across models. This behavior is clinically and technically expected, as glaucoma diagnosis relies primarily on structural assessment of the optic disc and optic cup, including features such as the cup-to-disc ratio, neuroretinal rim thinning, and localized excavation. These are subtle anatomical cues that are not optimally captured by global image classification approaches. Consequently, TL models operating directly on raw fundus images without region-of-interest extraction or structural modeling appear inherently limited for this task. This supports the need for optic disc/cup-focused pipelines (e.g. ROI learning or segmentation-aware modeling) for glaucoma.

Performance on the normal class was moderate compared with DR and cataract. This outcome likely reflects heterogeneity within the normal category and the presence of borderline cases that visually resemble early-stage pathology. This class overlap can make “normal vs subtle abnormal” discrimination more challenging than detecting advanced disease patterns.

Overall, these findings demonstrate that TL model performance is strongly disease-dependent and that the intrinsic visual characteristics of each condition play a decisive role in achievable accuracy. Therefore, disease-aware model design is preferable to a single generic classification strategy for all ocular disorders.

Why high accuracy is not sufficient: Interpretability and trust

Although several TL models achieved comparable—and in some cases near-perfect—performance metrics, the explainability analyses revealed substantial differences in how these models arrived at their predictions. This discrepancy underscores an important limitation of relying solely on quantitative indicators such as accuracy, precision, recall, and F1-score when evaluating medical AI systems. In medical imaging, correct predictions must also be supported by clinically plausible evidence.

Grad-CAM visualizations demonstrated that models with similar classification accuracy can exhibit markedly different attention patterns. Some architectures concentrated activation on plausible regions of the fundus image, whereas others produced diffuse, fragmented, or weak localization patterns. In certain cases, highlighted regions extended into anatomically irrelevant areas, suggesting that correct predictions may be influenced by spurious correlations rather than disease-relevant features. These differences indicate that numerical performance can mask instability or nonclinical feature reliance.

The instance-level analyses provided by LIME reinforced this observation. Across models, LIME generally identified broad regions contributing to predictions rather than highly localized pathological structures. However, the spatial coherence and granularity of these superpixel attributions differed between architectures. Some models produced overly coarse and contiguous regions, whereas others generated more structured and discriminative attribution patterns. Such variability implies that architectures prioritize different visual cues even when their outputs are similar.

SHAP-based pixel-level attributions further showed that many models relied on globally distributed features rather than sharply localized regions. This behavior is consistent with conditions such as cataract, where visual degradation affects the entire image. At the same time, the lack of focused attribution in tasks requiring structural discrimination highlights limitations of certain architectures for more complex diagnostic targets. Together, these results show that interpretability adds complementary evidence about whether predictions align with expected clinical signals.

Collectively, these findings demonstrate that explainable AI tools are not merely supplementary visualizations but constitute a critical evaluation layer for medical AI systems. Interpretability can help distinguish models that base decisions on clinically meaningful features from those that may rely on coincidental patterns. Consequently, model selection for healthcare applications should incorporate explainability analysis alongside traditional performance metrics.

Overfitting and generalization considerations

To mitigate overfitting across all TL models, regularization and training control strategies were applied, including L2 regularization, dropout, and batch normalization within the classification layers. Data augmentation was deliberately limited to mild transformations (primarily random horizontal flipping) to preserve anatomical integrity. Early stopping and adaptive learning rate scheduling were also employed to stabilize training and prevent unnecessary over-optimization.
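For reference, the two weight-level regularizers named above can be sketched in NumPy; the dropout rate and L2 coefficient here are illustrative, not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, rate=0.5, train=True):
    """Inverted dropout: at train time, zero a fraction `rate` of units
    and rescale the survivors by 1/(1-rate), so the expected activation
    is unchanged and no adjustment is needed at inference."""
    if not train or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def l2_penalty(weights, lam=1e-4):
    """L2 regularization term added to the task loss: lam * sum of
    squared weights, penalizing large parameter values."""
    return lam * sum((w ** 2).sum() for w in weights)
```

Batch normalization, early stopping, and the mild flip-only augmentation complete the regularization recipe described above.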

Despite these measures, learning curves for most models exhibited a noticeable gap between training and validation performance, indicating persistent overfitting. This behavior is widely reported in fundus image benchmarks, particularly when datasets are relatively small or curated under controlled acquisition conditions. Such datasets may not fully capture the diversity of imaging devices, patient demographics, illumination conditions, and disease severity encountered in real clinical environments.

The consistently near-perfect performance observed for DR further suggests that the dataset may simplify the classification task due to visually distinctive patterns and controlled image quality. While this facilitates strong benchmark performance, it also raises concerns regarding generalizability. Models trained and evaluated under these conditions may not maintain comparable performance on external datasets or real-world clinical settings.

These observations highlight the need for cautious interpretation of performance results. The current findings demonstrate the effectiveness of TL models within the evaluated framework and support their use for research-oriented decision support prototypes. However, translation to clinical deployment requires external validation and broader generalization-focused evaluation.

Practical implications for model selection

The combined analysis of quantitative performance metrics and explainability outcomes provides guidance for selecting models according to the target ocular disease. Rather than identifying a single universally optimal architecture, the findings indicate that model suitability is task-dependent.

For DR, several architectures demonstrated consistently high performance across evaluation metrics. In this context, model selection becomes less dependent on marginal accuracy differences and more influenced by considerations such as computational efficiency, stability, and deployment constraints. Accordingly, multiple TL models are reasonable candidates, with external validation as the key next step.

For cataract detection, multiple architectures also exhibited competitive performance. DenseNet121 and XceptionNet showed particularly balanced results across accuracy, precision, and recall, while explainability analyses indicated that EfficientNetB3 produced relatively coherent attribution patterns. Thus, cataract-oriented systems may prioritize models that combine strong metrics with consistent explanations.

In contrast, glaucoma detection presents different requirements. The comparatively lower metrics, combined with diffuse or weak attribution patterns, indicate that classification-only pipelines based on global fundus images are limited for this task. These findings support the need for pipelines that incorporate structural analysis, such as optic disc/cup segmentation, region-of-interest learning, or anatomically informed feature extraction.

More broadly, this study highlights that optimal model selection in medical imaging should not be based solely on headline accuracy values. Instead, a balanced evaluation framework integrating predictive performance, training stability, interpretability, and disease-specific compatibility offers a more reliable foundation for clinically meaningful AI development.

Summary of key findings

This study demonstrates that the effectiveness of TL models for ocular disease diagnosis is strongly dependent on the clinical task. While several architectures achieved near-perfect performance for DR and robust performance for cataract detection, glaucoma remained substantially more challenging under a classification-only paradigm. Explainability analyses revealed differences in model behavior that were not captured by performance metrics alone, highlighting the importance of interpretability for trustworthy evaluation. Overall, the findings support disease-aware modeling and evaluation frameworks that jointly consider performance and interpretability.

Limitations

This study has several limitations that should be acknowledged when interpreting the findings. First, the evaluation was conducted using a single train–test split (70/30), which may limit robustness of performance estimation. Future work should incorporate cross-validation and multiple data splits to provide more reliable and statistically stable evaluations.

Second, all experiments were performed on a single publicly available dataset. While this enabled controlled benchmarking across models, reliance on one data source may restrict generalizability. External validation using independent datasets collected across different clinical environments, imaging devices, and patient populations is needed to better assess real-world performance.

Third, although several models achieved near-perfect results for DR, such outcomes should be interpreted with caution. The dataset appears curated under relatively controlled conditions with visually distinctive patterns, which may simplify the classification task. Consequently, strong benchmark performance may not directly translate to equivalent performance in heterogeneous clinical settings.

Finally, hyperparameter optimization was not explored. Models were evaluated under a unified configuration to ensure fair comparison. While this supports consistency, further gains may be achievable through systematic hyperparameter tuning, advanced augmentation strategies, and architecture-specific optimization.

Conclusion

Recapitulation

Visual impairment resulting from ocular conditions necessitates early and accurate diagnosis to prevent irreversible vision loss. This study comprehensively evaluated eight widely used pretrained TL architectures under a unified experimental framework on the publicly available “eye-disease-detection” dataset, addressing multiclass classification of four ocular categories: diabetic retinopathy, cataract, glaucoma, and normal. The outputs of all models were systematically interpreted using explainable AI techniques.

The results indicate that TL models are remarkably effective in detecting diabetic retinopathy, with several models achieving near-perfect performance on the evaluated dataset. DenseNet121 and XceptionNet demonstrated particularly strong performance in cataract detection, while glaucoma detection proved more challenging, with relatively lower metrics and signs of overfitting in several models. Notably, the study did not apply hyperparameter fine-tuning or segmentation-based preprocessing, which may have contributed to the suboptimal performance in glaucoma diagnosis. Explainable AI tools such as Grad-CAM were employed to visualize and analyze the decision-making process of each TL model, revealing important disparities in interpretability and localization. Among all models evaluated, EfficientNetB3 not only delivered high accuracy but also produced the most clinically plausible heatmaps, making it the most promising candidate for real-world applications.

Future work

Building on this study, future work will focus on validation-oriented extensions that are necessary for clinical-grade deployment. For diabetic retinopathy, where multiple TL models already demonstrate consistently strong performance, the next step is developing a robust computer-aided diagnostic pipeline with external dataset validation and deployment-oriented evaluation. For glaucoma, the results indicate that classification-only modeling on raw fundus images is insufficient; therefore, future work will prioritize segmentation-aware and ROI-based approaches, including optic disc and optic cup delineation, and the integration of structural biomarkers.

Across all diseases, systematic hyperparameter optimization and disease-aware augmentation/preprocessing ablation studies will be conducted to quantify their impact relative to the baseline models. In addition, ensemble strategies that combine top-performing TL models will be investigated to improve robustness and generalizability across heterogeneous image characteristics. Finally, future evaluation will include area under the curve-based analyses alongside existing metrics, and the continued integration of explainable AI modules to support transparency and trust in clinical settings.

Footnotes

ORCID iD: Tehseen Mazhar https://orcid.org/0000-0002-4649-2376

Ethics approval: Not applicable.

Consent to participate: Not applicable.

Consent to publish: Not applicable.

Author contributions: Muhammad Amir Khan and Muhammad Usman Tariq performed the original writing, software, and methodology; Sheikh Muhammad Saqib and Tehseen Mazhar performed rewriting, investigation, methodology design, and conceptualization; Tariq Shahzad, Habib Hamam, and Abdul Khader Jilani Saudagar prepared the related-work section and managed the results and discussion; Abdul Khader Jilani Saudagar, Tehseen Mazhar, and Muhammad Amir Khan performed rewriting, methodology design, and visualization; Tariq Shahzad and Muhammad Saqib performed rewriting, methodology design, and visualization.

Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R845), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

  • 1.World Health Organization. Blindness and vision impairment, https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment (2023).
  • 2.Malik S, Kanwal N, Asghar MN, et al. Data driven approach for eye disease classification with machine learning. Appl Sci 2019; 9: 2789. [Google Scholar]
  • 3.Abbas Q. Glaucoma-deep: detection of glaucoma eye disease on retinal fundus images using deep learning. Int J Adv Comput Sci Appl 2017; 8. DOI: 10.14569/IJACSA.2017.080606. [DOI] [Google Scholar]
  • 4.Saqib SM, Iqbal M, Asghar MZ, et al. Cataract and glaucoma detection based on transfer learning using MobileNet. Heliyon 2024; 10: e36759. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Guefrachi S, Echtioui A, Hamam H. Diabetic retinopathy detection using deep learning multistage training method. Arab J Sci Eng 2025; 50: 1079–1096. [Google Scholar]
  • 6.Chen N, Lv X. Research on segmentation model of optic disc and optic cup in fundus. BMC Ophthalmol 2024; 24: 273. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Alkhaldi NA, Alabdulathim RE. Optimizing glaucoma diagnosis with deep learning-based segmentation and classification of retinal images. Appl Sci 2024; 14: 7795. [Google Scholar]
  • 8.Liu Y, Wu J, Zhu Y, et al. Combined optic disc and optic cup segmentation network based on adversarial learning. IEEE Access 2024. [Google Scholar]
  • 9.Chen Y, Liu Z, Meng Y, et al. Lightweight optic disc and optic cup segmentation based on MobileNetv3 convolutional neural network. Biomimetics 2024; 9: 637. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Alzamil ZS. Advancing eye disease assessment through deep learning: a comparative study with pre-trained models. Eng Technol Appl Sci Res 2024; 14: 14579–14587. [Google Scholar]
  • 11.Guefrachi S, Echtioui A, Hamam H. Automated diabetic retinopathy screening using deep learning. Multimed Tools Appl 2024; 83: 65249–65266. [Google Scholar]
  • 12.Kaushik H, Singh D, Kaur M, et al. Diabetic retinopathy diagnosis from fundus images using stacked generalization of deep models. IEEE Access 2021; 9: 108276–92. [Google Scholar]
  • 13.Shahwar T, Zafar J, Almogren A, et al. Automated detection of Alzheimer’s via hybrid classical quantum neural networks. Electronics (Basel) 2022; 11: 721. [Google Scholar]
  • 14.Singh DP, Banerjee T, Mahajan S, et al. A comprehensive study on deep learning models for the detection of diabetic retinopathy using pathological images. Arch Comput Methods Eng 2025: 1–30. [Google Scholar]
  • 15.Jain L, Murthy HS, Patel C, et al. , eds. Retinal eye disease detection using deep learning. In: 2018 14th International conference on information processing (ICINPRO), 2018: IEEE. [Google Scholar]
  • 16.Muntaqim MZ, Smrity TA, Miah ASM, et al. Eye disease detection enhancement using a multi-stage deep learning approach. IEEE Access 2024: 1–1. DOI: 10.1109/ACCESS.2024.3476412. [DOI] [Google Scholar]
  • 17.Nasra P, Gupta S, Kumar GR, eds. Leveraging EfficientNetB3 for accurate breast cancer classification: insights and innovations. In: 2024 9th International conference on communication and electronics systems (ICCES), 2024: IEEE. [Google Scholar]
  • 18.Satushe V, Vyas V, Metkar S, et al. AI in MRI brain tumor diagnosis: a systematic review of machine learning and deep learning advances (2010–2025). Chemom Intell Lab Syst 2025; 263: 105414. [Google Scholar]
  • 19.Kotwal JG, Kashyap R, Shafi PM. Artificial driving based EfficientNet for automatic plant leaf disease classification. Multimed Tools Appl 2024; 83: 38209–38240. [Google Scholar]
  • 20.Saqib SM, Iqbal M, Mazhar T, et al. Effectiveness of teachable machine, mobile net, and YOLO for object detection: a comparative study on practical applications. Egypt Inform J 2025; 30: 100680. [Google Scholar]
  • 21.Talukder MA, Islam MM, Uddin MA, et al. A deep ensemble learning framework for brain tumor classification using data balancing and fine-tuning. Sci Rep 2025; 15: 35251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Winarno W, Harjoko A. Enhancing deep learning model using whale optimization algorithm on brain tumor MRI. J Electron Electromed Eng Med Inform 2026; 8: 136–151. [Google Scholar]
  • 23.Cambay VY, Barua PD, Hafeez Baig A, et al. Automated detection of gastrointestinal diseases using resnet50*-based explainable deep feature engineering model with endoscopy images. Sensors 2024; 24: 7710. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Manoranjitham R, Swaroop SS, (eds). A comparative study of DenseNet121 and InceptionResNetV2 model for deepfake image detection. In: 2024 3rd International conference on applied artificial intelligence and computing (ICAAIC), 2024: IEEE. [Google Scholar]
  • 25.Talukder MA. An improved XAI-based DenseNet model for breast cancer detection using reconstruction and fine-tuning. Results Eng 2025; 26: 104802. [Google Scholar]

Articles from Digital Health are provided here courtesy of SAGE Publications
