Digital Health. 2026 Feb 11;12:20552076261423994. doi: 10.1177/20552076261423994

Unveiling the black box: Explainable transfer learning for ocular disorder diagnosis

Zaib un Nisa 1, Arfan Jaffar 1, Sohail Masood Bhatti 1, Ines Hilali Jaghdam 2, Tehseen Mazhar 3,4,, Muhammad Amir Khan 5,, Habib Hamam 6,7,8,9
PMCID: PMC12901956  PMID: 41696077

Abstract

Objective

To systematically evaluate transfer learning (TL) models for multiclass ocular disease diagnosis and assess their reliability using explainable artificial intelligence (AI).

Methods

Eight pretrained convolutional neural network (CNN) models were evaluated on a public dataset covering cataract, diabetic retinopathy, glaucoma, and normal classes under a unified protocol. Performance was measured using accuracy, precision, recall, and F1-score. Grad-CAM, LIME, and SHAP were used for interpretability, and the Friedman test assessed performance consistency.

Results

Several models achieved near-perfect performance for diabetic retinopathy. DenseNet121 and XceptionNet performed best for cataract detection, while glaucoma showed consistently weaker results, indicating the need for segmentation-based approaches. Despite similar accuracy, explainability revealed substantial differences in model attention. EfficientNetB3 produced the most clinically meaningful visual explanations.

Conclusions

Accuracy alone is insufficient for trustworthy medical AI. Explainable AI is essential for model selection. EfficientNetB3 offers the best balance between performance and interpretability, and glaucoma diagnosis requires more advanced, segmentation-aware pipelines.

Keywords: Ocular disorders, transfer learning, diabetic retinopathy, cataract, glaucoma, explainable AI

Introduction

Ocular disorders are the predominant cause of blindness. The World Health Organization reports that 2.2 billion individuals have visual impairment. 1 Although the eyes constitute a minor segment of the human anatomy, they are essential, and inadequate eyesight significantly diminishes quality of life. The sclera is the white part of the eye; it encircles the eye, as seen in Figure 1, and often becomes red as a result of blood vessel dilatation. Numerous forms of ocular disorders exist. One of the leading causes of adult-onset eye disorders is diabetic retinopathy. People with type 1 and type 2 diabetes may not realize the severity of their eye disease until it is too late, although it is preventable. Symptoms vary from minor ocular irritation to severe discomfort and visual impairment. To safeguard ocular health, early detection of eye illnesses is essential, and artificial intelligence algorithms are crucial in this context. The eye diseases we discuss in this study are cataract, diabetic retinopathy, and glaucoma. In cataract, the crystalline lens of the eye becomes clouded, resulting in blurred vision. It is commonly found in elderly persons; if this problem is not treated, it can cause complete blindness.

Figure 1.

Figure 1.

Anatomy of a normal eye.

Research gap and contribution

Transfer learning models are being rapidly adopted for ocular disorder diagnosis. However, most existing studies report the performance of a single model with limited interpretation of the results, and explainable artificial intelligence (AI) is often restricted to qualitative visualization with a single method. This yields limited insight into model- and disease-specific behavior and renders the decision-making process of deep learning ambiguous and unreliable. This study provides a comparative, explainability-based evaluation of eight widely adopted transfer learning models to assess their disease-specific reliability within a research-oriented decision support prototype. The main contributions of the study are as follows:

We offer a unified and systematic comparison of eight transfer learning (TL) models for multidisease classification of ocular disorders.

Disease-specific performance analysis, showing that the same TL model exhibits different performance for different eye diseases.

Explainable AI-based assessment using Grad-CAM (gradient-weighted class activation mapping), LIME (local interpretable model-agnostic explanations), and SHAP (SHapley Additive exPlanations) to provide multilevel interpretability at the spatial, instance, and pixel levels.

A cross-model consistency analysis, demonstrating that different models converge on clinically plausible visual cues for cataract and diabetic retinopathy, while glaucoma requires segmentation-based models.
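To make the spatial-level interpretability concrete, the core Grad-CAM aggregation step can be sketched as follows. This is a minimal illustration only: it assumes the last-layer feature maps and the gradients of the class score with respect to them have already been extracted from a trained network, and uses random stand-in arrays rather than real fundus data.

```python
import numpy as np

def grad_cam_heatmap(feature_maps, gradients):
    """Grad-CAM aggregation: weight each feature map by its pooled gradient.

    feature_maps: (H, W, K) activations of the last convolutional layer.
    gradients:    (H, W, K) gradients of the class score w.r.t. those maps.
    """
    # alpha_k: global-average-pooled gradient per channel
    weights = gradients.mean(axis=(0, 1))                    # shape (K,)
    # Weighted sum over channels, then ReLU keeps only positive evidence
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)
    # Scale to [0, 1] for overlaying on the fundus image
    return cam / cam.max() if cam.max() > 0 else cam

rng = np.random.default_rng(0)
feature_maps = rng.random((10, 10, 8))        # toy stand-in activations
gradients = rng.standard_normal((10, 10, 8))  # toy stand-in gradients
heatmap = grad_cam_heatmap(feature_maps, gradients)
print(heatmap.shape)  # (10, 10)
```

The resulting heatmap is upsampled to the input resolution and overlaid on the image; LIME and SHAP operate at the instance and pixel levels instead of on convolutional feature maps.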

Novelty

The main aim of this study is not to present a new methodology. Instead, it offers a disease-specific, explainability-grounded evaluation of transfer learning models within a unified experimental framework. It examines the differential behavior of the same model across various eye diseases, which helps identify significant issues for a research-oriented decision support prototype. Previous studies typically concentrate on a single eye disease and assess only aggregate accuracy, whereas this study applies explainable AI analysis across all transfer learning algorithms. Adequate accuracy alone is insufficient for effective decision making; interpretations based on explainable AI can provide medical professionals with valuable insights. The contribution is therefore evaluation-centric and interpretability-focused rather than methodological.

Related works

Current research

Normally, ocular disorders are not life-threatening, but they can significantly degrade quality of life over time. Cataracts are one of the major causes of vision impairment in the world, and early diagnosis can protect people before irreversible damage occurs. For early diagnosis, machine and deep learning algorithms play a vital role due to advancements in AI. Malik et al. used multiple machine learning algorithms for this purpose, including decision trees, random forests, naive Bayes, and neural networks. They extracted features from patient information such as age, illness history, and clinical observations. Random forest achieved the highest accuracy among all the ML classifiers, at 86.63%. 2 Abbas proposed a method for glaucoma detection that requires no manual segmentation, minimizing the need for domain expertise. He used a convolutional neural network (CNN) for feature extraction and a deep belief network for feature selection, achieving 99% accuracy, 98.01% specificity, and 84.5% sensitivity. The dataset consisted of 1200 images, which is small for deep learning algorithms, so he used multiple datasets to improve generalization. His method showed a large gap between specificity and sensitivity; in eye disease detection, good sensitivity is particularly important. He did not use explainable AI. 3

Saqib et al. 4 proposed a lightweight deep learning model based on MobileNetV1 and MobileNetV2 for cataract and glaucoma detection, demonstrating superior accuracy compared to other models using transfer learning techniques. Their work validated MobileNet's effectiveness in low-resource, mobile-based vision diagnostics.

Similarly, Guefrachi et al. 5 employed advanced CNN models such as InceptionResNetV2, MobileNetV2, and EfficientNet2L for diabetic retinopathy classification. Their multistage training strategy, involving feature extraction and fine-tuning, showed that integrating data augmentation with TL models significantly improves generalizability.

Jain et al. proposed a CNN model, LCDNet, for diagnosing diabetic retinopathy and retinitis pigmentosa using retinal fundus images. They did not perform any segmentation and used an automatic method for feature extraction and diagnosis. Their model achieved an accuracy of 99.7%. 4 However, they did not use any explainable AI tools to interpret their results, nor did they compare performance with other deep learning models. Muntaqim et al. introduced a multiscale deep learning approach for the diagnosis of ocular diseases. 5 They used a multistage deep learning architecture: the first stage extracts fine-grained information using convolutional layers; in stage 2, these attributes are further enhanced by two branches of convolutional blocks; and classification is conducted in stage 3. Although the model demonstrates exceptional performance, it does not use any explainable AI tools to elucidate the internal workings of the deep learning black box.

Beyond classification, segmentation-based approaches are critical, especially for glaucoma detection. Chen and Lv 6 proposed deep CNN methods for segmenting the optic disc and cup in fundus images, enhancing the precision of glaucoma diagnosis through boundary-aware attention mechanisms. Alkhaldi and Alabdulathim 7 developed a hybrid model combining clinical and segmentation features for improved glaucoma severity classification. These studies emphasize that precise boundary segmentation significantly impacts diagnosis quality.

GAN-based segmentation, such as the adversarial approach proposed by Liu et al. 8 improves robustness in distinguishing optic disc and cup boundaries by learning structural variations in fundus images.

Chen et al. 9 validated lightweight models such as MobileNetV3 for segmentation tasks, confirming their suitability for real-time medical imaging on edge devices.

Alzamil 10 employed a deep neural network (DNN) to classify retinal fundus diseases, comparing its performance with four pretrained models. The model achieved an accuracy of 94.10%, better than the transfer learning algorithms. However, classification was used to detect all the diseases, whereas some diseases need precise localization. Rachmawanto et al. proposed a CNN for the classification of four diseases and achieved a remarkable accuracy of 95%, but their model did not show how the deep learning model makes its decisions. Explainable AI helps practitioners and AI researchers understand which features and which parts of the image the deep learning model weighs most heavily. In a related advancement, Guefrachi et al. 11 proposed a deep learning-based clinical support tool using ResNet152V2 integrated into a graphical user interface for automated diabetic retinopathy screening. The system achieved 100% precision and recall, further confirming that certain TL models are mature for real-world deployment in ophthalmology. Kaushik et al. 12 introduced a stacked generalization approach for diabetic retinopathy classification, incorporating image normalization and ensemble CNNs to improve diagnostic robustness under varying illumination. In another domain, Shahwar et al. 13 explored hybrid classical-quantum transfer learning models using ResNet34 for Alzheimer's diagnosis, achieving significant improvements over classical models. Their findings highlight the potential of hybrid architectures in enhancing medical image classification pipelines. The study 14 evaluates the effectiveness of several DNN architectures for identifying ocular diseases using the Ocular Disease Intelligent Recognition (ODIR) dataset. To enhance image quality and draw attention to key elements, preprocessing methods such as bilateral filtering, unsharp masking, and region of interest (ROI) selection were employed. A variety of artificial neural network architectures, including CNNs, long short-term memory (LSTM) networks, and DNNs, were used to test several models, including ResNet50, InceptionV3, MobileNet, DenseNet121, NASNetMobile, Xception, and VGG19.

Singh et al. 15 presented a comparative analysis using deep learning models for ocular disease detection. Though the models show good performance, they relied only on performance metrics to interpret the results and did not use any explainable AI tools. Muntaqim et al. 16 presented a comparison between deep learning and machine learning algorithms and found that DenseNet-169 performed best among all the models; they also did not show how the deep learning black box actually works.

Methodology

In this research study, we classify eye images into four classes: diabetic retinopathy, cataract, glaucoma, and normal. In cataract, the lens develops an opacity that can vary in location and size, causing blurry or impaired vision. 15 Neovascularization, microaneurysms, and hemorrhages are typical signs of diabetic retinopathy. Mild nonproliferative abnormalities are the first stage of the disease, which develops into proliferative retinopathy and eventually causes blindness through widespread neovascularization and hemorrhages. 6 Inflammation and retinal neurodegeneration are also primary drivers of the pathogenesis of diabetic macular edema. Glaucoma is caused by progressive damage to the optic nerve and leads to irreversible loss of vision. 8 In this study, we use eight transfer learning algorithms for the classification of these diseases, as shown in Figure 2, which also illustrates the workflow of the research study. First, images are loaded into the workspace from the dataset “eye_disease_detection,” which is available on the public repository Kaggle: https://www.kaggle.com/datasets/gunavenkatdoddi/eye-diseases-classification.

Figure 2.

Figure 2.

Process flow diagram of this research study.

Dataset

The retinal scans in the collection are categorized into four groups: normal, diabetic retinopathy, cataract, and glaucoma, with about a thousand images per group. Sources such as IDRiD, Ocular Recognition, HRF, and others contribute to these images. In total, 4217 images belong to the four classes, and the images are evenly distributed among them, as shown in Figure 3. Because the dataset is balanced, heavy data augmentation is unnecessary; we applied only light augmentation and deliberately avoided aggressive techniques to prevent image distortion. Figure 4 shows some randomly chosen images from the dataset.

Figure 3.

Figure 3.

Percentage of images in each class in the dataset.

Figure 4.

Figure 4.

Random images from the dataset.

Data preprocessing

In the preprocessing step, the data is normalized by dividing all input pixel values by 255, scaling them to the range [0, 1]. This aids training and helps mitigate the vanishing gradient problem. As we iterate through the dataset, all elements are converted into NumPy arrays, which integrate easily with Python tools. We used a GPU-based system and a cloud platform for code execution; the link to the code is provided in the code availability section.
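This scaling step can be sketched as follows; the batch here is a random stand-in rather than the actual fundus images:

```python
import numpy as np

# Stand-in batch of four 300 x 300 RGB fundus images with 8-bit pixels
images = np.random.randint(0, 256, size=(4, 300, 300, 3), dtype=np.uint8)

# Divide by 255 to scale pixel values into [0, 1]; float32 matches the
# precision expected by most CNN frameworks
normalized = images.astype(np.float32) / 255.0

print(normalized.min() >= 0.0, normalized.max() <= 1.0)  # True True
```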

Data split

Seventy percent of the dataset is reserved for training, while the remaining 30% is used for testing, that is, a 70:30 split.
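A split of this kind is commonly produced with scikit-learn. The snippet below is a sketch with placeholder labels standing in for the actual image arrays; the `stratify` and `random_state` arguments are illustrative assumptions, as the paper does not state whether stratification or a fixed seed was used.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples evenly spread over the four classes
labels = ["cataract", "diabetic_retinopathy", "glaucoma", "normal"] * 250
images = list(range(len(labels)))  # stand-ins for image arrays

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42
)
print(len(X_train), len(X_test))  # 700 300
```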

All the model architectures used in the current study, are described below.

Models architectures

The following models are used for transfer learning: EfficientNetB3, MobileNet, InceptionResNetV2, ResNet50, XceptionNet, DenseNet121, VGG19, and InceptionV3.

EfficientNetB3

This model is known for its scalability and efficiency. It is part of the EfficientNet family, which is specifically designed to balance accuracy and efficiency. It is used mostly in medical imaging, including brain tumor classification on magnetic resonance images, breast cancer classification, and leukemia detection.17,18 It is also used for plant disease classification and three-dimensional (3D) vehicle detection and segmentation, and has been optimized using genetic algorithms. 19

This research study includes EfficientNetB3 as the backbone of the transfer learning pipeline. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 5.

Figure 5.

Figure 5.

Summary of the transfer learning model in which EfficientNetB3 is used.

MobileNet

MobileNet is a family of CNNs specifically designed for mobile and embedded vision applications, providing strong performance with constrained resources. It is mostly used for image classification and object detection. 20 MobileNets employ depth-wise separable convolutions, which decrease the number of parameters as the model deepens, hence reducing computational complexity. This research study includes MobileNet as the backbone of the transfer learning pipeline. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 6.

Figure 6.

Figure 6.

Summary of the transfer learning model in which MobileNet is used.
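The parameter savings of depth-wise separable convolutions can be seen with a small worked example (bias terms omitted for simplicity): a standard convolution needs k·k·C_in·C_out weights, while the depth-wise plus point-wise factorization needs only k·k·C_in + C_in·C_out.

```python
def standard_conv_params(k, c_in, c_out):
    # one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel,
    # pointwise: 1 x 1 convolution mapping c_in -> c_out channels
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```

For a typical 3×3 layer with 64 input and 128 output channels, the separable form uses roughly 8× fewer weights, which is what makes MobileNet suitable for constrained devices.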

InceptionResNetV2

It is a CNN architecture that combines the strengths of the Inception and ResNet models and is designed to enhance efficiency in image classification tasks. In the domain of medical imaging, where prompt and correct diagnosis is critical, 21 it has shown excellent performance in classifying brain tumors and detecting plant diseases. 22

It handles multiscale feature extraction, which makes it ideal for complex image classification tasks, 5 and it is widely used in transfer learning models. This research study also includes InceptionResNetV2 as a component of the transfer learning pipeline. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 7.

Figure 7.

Figure 7.

Summary of the transfer learning model in which InceptionResNetV2 is used.

Figure 7 shows that InceptionResNetV2 has 54,737,380 total parameters, the largest among all the transfer learning models used in this study, indicating that this model is computationally expensive.

ResNet50

It is a deep residual network. Residual networks introduced the concept of residual learning to address the vanishing gradient problem, which otherwise degrades the performance of very deep networks. ResNet50 is extensively used in computer vision applications due to its ability to extract significant details from images, and it has shown exceptional efficacy in identifying gastrointestinal diseases and brain tumors. 23 ResNet50 is often used in transfer learning models with pretrained weights, and this research study uses it likewise. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 8.

Figure 8.

Figure 8.

Summary of the transfer learning model in which Resnet50 is used.

XceptionNet

It is a deep learning model widely used for image classification problems. It is known for its depth-wise separable convolutions, which reduce computational cost and enhance performance. It has been used for automatic skin cancer detection, deepfake detection, and COVID-19 diagnosis.

This research study uses XceptionNet for transfer learning. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 9.

Figure 9.

Figure 9.

Summary of XceptionNet.

DenseNet121

DenseNet is known for its dense connectivity pattern. It is widely used in various domains, including medical imaging, agriculture, and waste management. In medical imaging, it has been successfully used for diagnosing COVID-19 and brain tumors; in agriculture, it has been used effectively for identifying tomato plant diseases; and it has also been applied to deepfake detection. 24 This research study uses DenseNet121 for transfer learning. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 10.

Figure 10.

Figure 10.

Summary of the transfer learning model in which DenseNet121 is used.

VGG19

It is used in several applications because of its robustness in visual recognition tasks. It can extract hierarchical information from images, making it useful for processing complicated visuals. It consists of 19 layers: 16 convolutional layers and 3 fully connected layers. It has been used for facial recognition, human activity identification, and brain tumor detection. This research study also uses VGG19 for transfer learning. The top layers of the pretrained network are excluded, allowing us to attach our own customized classification layers. The model's summary is shown in Figure 11.

Figure 11.

Figure 11.

Summary of the transfer learning model in which VGG19 is used.

InceptionV3

It is recognized for its robustness, attributed to multiscale feature extraction, and is mostly used in intricate pattern recognition tasks. Similar to other TL models, it is used in medical imaging, agriculture, and plant identification. The most significant part of this model is its use in real-time systems, namely the Alternating Least Squares system for patients.

The summary of InceptionV3 is presented in Figure 12.

Figure 12.

Figure 12.

Summary of the transfer learning model in which InceptionV3 is used.

Transfer learning setup

The same procedure is followed for all the algorithms in this research study. The backbone models are pretrained on the “ImageNet” dataset, and those pretrained weights are loaded. The top layers of the pretrained models are excluded so that we can use our own classification layers. The input image shape is (300, 300, 3). After the pretrained layers, a global max-pooling layer flattens the output so that our customized top layers can be attached.

Each model is trained for up to 40 epochs with a mini-batch size of 40. Patience is set to 1: if the accuracy or loss does not improve for 1 epoch, the learning rate is reduced. We incorporated regularization across all the transfer learning models, using L2 regularization, dropout, and batch normalization, as shown in the model summaries, to counter overfitting. Stop-patience is set to 3: if no improvement is observed after 3 epochs, training stops. The threshold is set to 0.9: while training accuracy is below 90%, the callback monitors training accuracy, and once it surpasses 90%, the focus shifts to validation loss.
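The training-control rules above can be sketched as a simplified, framework-free function. This is an illustrative reconstruction of the described callback logic, not the study's actual implementation; the function name and the tuple-based history format are assumptions.

```python
def lr_schedule(history, lr=1e-3, factor=0.5, patience=1,
                stop_patience=3, threshold=0.9):
    """Simplified sketch of the training-control logic described above.

    history: list of (train_accuracy, val_loss) tuples, one per epoch.
    """
    best, bad_epochs, stalls, prev_mode = None, 0, 0, None
    for train_acc, val_loss in history:
        # Below the threshold, monitor training accuracy; above it,
        # monitor validation loss (negated so "higher is better" everywhere).
        mode = "acc" if train_acc < threshold else "loss"
        if mode != prev_mode:          # reset the baseline when switching
            best, prev_mode = None, mode
        metric = train_acc if mode == "acc" else -val_loss
        if best is None or metric > best:
            best, bad_epochs = metric, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # reduce the learning rate
                lr *= factor
                bad_epochs = 0
                stalls += 1
                if stalls >= stop_patience:
                    break                # stop training entirely
    return lr

# Four stagnant epochs: three lr reductions, then early stop.
print(lr_schedule([(0.80, 1.0)] * 4))  # 0.000125
```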

CNN model

A CNN model is used for non-TL comparison. The architecture of the CNN model is presented in Figure 13.

Figure 13.

Figure 13.

Summary of convolutional neural network (CNN) model used for base-line comparison.

The model uses the same configuration as all the other models in the current study: the same dataset and data split, 40 epochs, patience of 1, stop-patience of 3, a threshold of 0.9, and a factor of 0.5.

Results

Experimental setup and evaluation protocol

The classification layers of all transfer learning models were customized, as described in the Methodology section. Model performance was evaluated using accuracy, precision, recall, and F1-score, reported both overall and class-wise. In addition, the nonparametric Friedman statistical test was applied to assess whether performance differences among models were statistically significant across disease classes. All experiments were conducted using a GPU-enabled environment on a cloud platform. The implementation code is provided in the Code Availability section.
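The Friedman test can be reproduced directly from the class-wise accuracies reported in Tables 1 to 4, treating the eight models as blocks and the four disease classes as related treatments. The snippet below is a sketch of that computation; whether the study applied the test to accuracies in exactly this orientation is an assumption.

```python
from scipy.stats import friedmanchisquare

# Class-wise accuracies of the eight TL models, in the order
# EfficientNetB3, MobileNet, InceptionResNetV2, ResNet50,
# DenseNet121, XceptionNet, InceptionV3, VGG19 (Tables 1-4).
cataract = [0.94, 0.94, 0.93, 0.96, 0.96, 0.96, 0.96, 0.94]
dr       = [0.98, 1.00, 0.99, 1.00, 1.00, 0.98, 1.00, 1.00]
glaucoma = [0.92, 0.89, 0.86, 0.91, 0.91, 0.89, 0.88, 0.89]
normal   = [0.94, 0.88, 0.90, 0.92, 0.89, 0.90, 0.91, 0.91]

# Blocks are models; treatments are the four disease classes.
stat, p = friedmanchisquare(cataract, dr, glaucoma, normal)
print(round(stat, 2), p < 0.05)
```

Because diabetic retinopathy ranks highest and glaucoma lowest for nearly every model, the test statistic far exceeds the chi-square critical value (7.81 at df = 3, alpha = 0.05), confirming that the per-disease differences are statistically significant.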

EfficientNetB3

EfficientNetB3 was trained for a maximum of 40 epochs, but early stopping terminated training at epoch 26. The best model was obtained at epoch 22, achieving a validation accuracy of 95% (Figure 14). The training and validation curves indicate the presence of overfitting. On the test dataset, the model achieved 96% accuracy, 96% precision, 93% recall, and 92.5% F1-score.

Figure 14.

Figure 14.

Accuracy and loss plots of EfficientNetB3. The model achieved 99% training and 95% validation accuracy.

MobileNet

MobileNet completed the full 40 training epochs, as early stopping was not triggered. The best performance was observed at epoch 38. The model achieved 100% training accuracy and 94% validation accuracy (Figure 15). The curves indicate mild overfitting. On the test dataset, MobileNet achieved 94% accuracy, 94% precision, 92% recall, and 93% F1-score.

Figure 15.

Figure 15.

Accuracy and loss plots of MobileNet. The model achieved 99% training and 94% validation accuracy.

InceptionResNetV2

InceptionResNetV2 training was stopped early at epoch 18 due to performance stabilization. The best epoch was identified as epoch 15, where the model achieved 99.5% training accuracy and 95% validation accuracy (Figure 16). The learning curves exhibit noticeable fluctuations and a gap between training and validation performance. On the test dataset, the model achieved 96% accuracy, 96% precision, 94% recall, and 95% F1-score.

Figure 16.

Figure 16.

Accuracy and loss plots of InceptionResNetV2. The model achieved 99.5% training and 95% validation accuracy.

Resnet50

ResNet50 was trained for up to 40 epochs and completed 35 epochs before training stabilization. As shown in Figure 17, both training and validation accuracy curves stabilized after approximately epoch 19. The model achieved 100% training accuracy and 94% validation accuracy. On the test dataset, ResNet50 obtained 93% accuracy, 92% precision, 90% recall, and 91% F1-score.

Figure 17.

Figure 17.

Accuracy and loss plots of ResNet50. The model achieved 100% training and 94% validation accuracy.

As noted in the transfer learning setup, we incorporated L2 regularization, dropout, and batch normalization across all the transfer learning models, as shown in the model summaries, to counter overfitting. For data augmentation we used only random horizontal flipping; more aggressive augmentation was deliberately avoided so that images would not be distorted. We also used early stopping with adaptive learning-rate scheduling via a custom callback, dynamically reducing the learning rate when performance stalled. Despite these measures, some overfitting remains, as is widely reported for fundus image benchmarks. We therefore emphasize that these results should not be used directly in clinical practice, but rather within a research-oriented decision support prototype.

DenseNet121

DenseNet121 exhibited fluctuations in validation performance during the initial training phase, followed by stabilization after approximately epoch 20 (Figure 18). The model achieved 100% training accuracy and 96% validation accuracy. On the test dataset, DenseNet121 achieved 94% accuracy, 96% precision, 93% recall, and 92.5% F1-score.

Figure 18.

Figure 18.

Accuracy and loss plots of DenseNet121. The model achieved 100% training and 96% validation accuracy.

XceptionNet

XceptionNet displayed noticeable fluctuations in the validation accuracy curve throughout training (Figure 19). The training curve reached 100% accuracy, while validation accuracy peaked at 95%, indicating a gap between training and validation performance.

Figure 19.

Figure 19.

Accuracy and loss plots of XceptionNet. The model achieved 100% training and 95% validation accuracy.

VGG19

VGG19 showed substantial fluctuations during early training, followed by smoother convergence in later epochs (Figure 20). The model achieved 95% training accuracy and 91% validation accuracy. These values reflect the final stabilized performance across training epochs.

Figure 20.

Figure 20.

Accuracy and loss plots of VGG19. The model achieved 95% training and 91% validation accuracy.

InceptionV3

InceptionV3 demonstrated relatively smooth training behavior with minor fluctuations in the validation curve (Figure 21). The model achieved 100% training accuracy and 94% validation accuracy. The learning curves show a visible gap between training and validation performance.

Figure 21.

Figure 21.

Accuracy and loss plots of InceptionV3. The model achieved 100% training and 94% validation accuracy.

Overall performance comparison

Figure 22 summarizes the comparative performance of all eight transfer learning models across the four disease classes. Several models achieved comparable overall performance levels. For cataract detection, ResNet50, DenseNet121, XceptionNet, and InceptionV3 each achieved an accuracy of 0.96 (Table 1). For diabetic retinopathy, near-perfect performance was observed across multiple models, with several architectures (MobileNet, ResNet50, DenseNet121, InceptionV3, and VGG19) achieving accuracy, precision, recall, and F1-score values of 1.00 (Table 2).

Figure 22.

Figure 22.

Performance of the transfer learning algorithms for detecting different diseases.

Table 1.

Performance of various transfer learning models for detecting cataract.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.94 0.98 0.94 0.96
MobileNet 0.94 0.98 0.92 0.95
InceptionResNetV2 0.93 0.97 0.90 0.94
ResNet50 0.96 0.97 0.95 0.96
DenseNet121 0.96 0.98 0.95 0.97
XceptionNet 0.96 0.98 0.95 0.97
InceptionV3 0.96 0.97 0.95 0.96
VGG19 0.94 0.99 0.91 0.95

Table 2.

Performance of various transfer learning models for detecting diabetic retinopathy.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.98 0.99 0.99 1.00
MobileNet 1.00 1.00 1.00 1.00
InceptionResNetV2 0.99 1.00 0.99 1.00
ResNet50 1.00 1.00 1.00 1.00
DenseNet121 1.00 1.00 1.00 1.00
XceptionNet 0.98 0.98 0.98 0.97
InceptionV3 1.00 1.00 1.00 1.00
VGG19 1.00 1.00 1.00 1.00

For glaucoma detection, accuracy values ranged between 0.86 and 0.92 across models (Table 3). For the normal class, accuracy values ranged between 0.88 and 0.94 (Table 4).

Table 3.

Performance of various transfer learning models for detecting glaucoma.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.92 0.89 0.90 0.90
MobileNet 0.89 0.86 0.92 0.89
InceptionResNetV2 0.86 0.88 0.87 0.88
ResNet50 0.91 0.92 0.90 0.91
DenseNet121 0.91 0.91 0.91 0.91
XceptionNet 0.89 0.91 0.88 0.89
InceptionV3 0.88 0.86 0.90 0.88
VGG19 0.89 0.85 0.93 0.89

Table 4.

Performance of various transfer learning models for detecting a healthy eye.

Model Accuracy Recall Precision F1-score
EfficientNetB3 0.94 0.88 0.91 0.90
MobileNet 0.88 0.88 0.89 0.88
InceptionResNetV2 0.90 0.85 0.94 0.89
ResNet50 0.92 0.90 0.93 0.91
DenseNet121 0.89 0.88 0.90 0.89
XceptionNet 0.90 0.86 0.94 0.90
InceptionV3 0.91 0.92 0.90 0.91
VGG19 0.91 0.91 0.91 0.91

Tables 1 to 4 report class-wise performance metrics (accuracy, precision, recall, and F1-score) for each model. To further summarize performance consistency, the mean and standard deviation of all models for each disease category were computed (Table 5). Diabetic retinopathy exhibited the highest mean accuracy (0.993) with low standard deviation (0.0091), whereas glaucoma showed comparatively lower mean accuracy (0.89) and higher variability.

Table 5.

Mean and standard deviation of the models.

Performance metric Cataract (mean ± SD) Diabetic retinopathy (mean ± SD) Glaucoma (mean ± SD) Normal (mean ± SD)
Accuracy 0.94 ± 0.012 0.993 ± 0.0091 0.89 ± 0.019 0.90 ± 0.018
Recall 0.97 ± 0.007 0.996 ± 0.0074 0.88 ± 0.026 0.88 ± 0.02
Precision 0.93 ± 0.020 0.995 ± 0.0075 0.90 ± 0.019 0.91 ± 0.019
F1-score 0.95 ± 0.010 0.996 ± 0.010 0.89 ± 0.0118 0.89 ± 0.011

Figure 23 presents the accuracy distribution of all models for the four disease categories. The figure shows variation in model performance across disease types. Figures 24 and 25 present the corresponding precision and recall distributions across models. Figure 26 shows the confusion matrices for all eight transfer learning models evaluated on the test dataset. The matrices illustrate the distribution of true and predicted labels for each class across models.
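The class-wise metrics reported in Tables 1 to 4 follow directly from each model's confusion matrix. As a minimal sketch (not the study's code), per-class precision, recall, and F1 can be computed from a confusion matrix as follows:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a multiclass confusion
    matrix `cm`, where cm[i, j] counts samples of true class i
    predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
    recall = tp / cm.sum(axis=1)     # row sums = true counts per class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-class example: class 0 is always recovered; one class-1 sample
# is misclassified as class 0.
cm = np.array([[5, 0],
               [1, 4]])
precision, recall, f1 = per_class_metrics(cm)
```

The same computation, applied to each of the four disease classes in turn, yields the values tabulated above.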

Figure 23.

Accuracies of all the pretrained models using the transfer learning (TL) approach for the four cases, cataract, diabetic retinopathy, glaucoma, and normal.

Figure 24.

Precision of all the pretrained models using the transfer learning (TL) approach for the four cases, cataract, diabetic retinopathy, glaucoma, and normal.

Figure 25.

Recall of all the pretrained models using the transfer learning (TL) approach for the four cases, cataract, diabetic retinopathy, glaucoma, and normal.

Figure 26.

Confusion matrices of the transfer learning models used in the current study, evaluated on the test data.

Baseline comparison with CNN

A conventional CNN model was implemented as a baseline using the same dataset, data split, and training configuration. The CNN model was trained for up to 40 epochs, with training terminated at epoch 31 due to stagnation in performance (Figure 27).

Figure 27.

Train, validation, and test results of convolutional neural network (CNN) model.

The CNN achieved an overall test accuracy of 82%. Figure 27 presents the training, validation, and test performance curves, while Figure 28 presents the confusion matrix for the CNN model evaluated on the test dataset. Class-wise performance indicated stronger performance for diabetic retinopathy compared to cataract and glaucoma.
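The stagnation-based termination at epoch 31 corresponds to the standard early-stopping patience rule. A minimal sketch of that rule (the patience and min_delta values here are illustrative assumptions, not the study's exact settings):

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=1e-3):
    """Return the 1-based epoch at which training stops: the first epoch
    after `patience` consecutive epochs without an improvement of at
    least `min_delta` over the best validation score so far."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            best, wait = score, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch       # stagnation: stop training here
    return len(val_scores)         # ran through all epochs without stopping
```

With three improving epochs followed by a plateau and a patience of 3, training would stop at epoch 6, mirroring the plateau-driven stop reported for the baseline CNN.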

Figure 28.

Confusion matrix of the convolutional neural network (CNN) model.

Friedman test

A nonparametric Friedman test was conducted to compare the performance of the eight transfer learning models across the four disease categories. The test was performed using the accuracy matrix reported in Table 6.

Table 6.

Accuracy matrix used for Friedman test.

Model Cataract Diabetic retinopathy Glaucoma Normal
EfficientNetB3 0.94 0.98 0.92 0.94
MobileNet 0.94 1.00 0.89 0.88
InceptionResNetV2 0.93 0.99 0.86 0.90
ResNet50 0.96 1.00 0.91 0.92
DenseNet121 0.96 1.00 0.91 0.89
XceptionNet 0.96 0.98 0.89 0.90
InceptionV3 0.96 1.00 0.88 0.91
VGG19 0.94 1.00 0.89 0.91

Two hypotheses were considered:

  • Null hypothesis (H₀): All models perform similarly, with no statistically significant differences.

  • Alternative hypothesis (H₁): At least one model performs significantly differently.

The Friedman test produced a chi-square statistic of 9.333 and a p-value of 0.2296 (Figure 29). Table 6 reports the accuracy values used for the test across all models and disease classes.

Figure 29.

Result of Friedman test.

The test also generated model ranking values based on average ranks across conditions. ResNet50 obtained the lowest (best) average rank (2.50), followed by DenseNet121 and EfficientNetB3, while InceptionResNetV2 showed the highest (least favorable) rank.
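Assuming SciPy is available, the reported statistic can be reproduced from the Table 6 accuracy matrix by treating the eight models as repeated measures over the four disease classes:

```python
from scipy.stats import friedmanchisquare

# Accuracy per model across the four classes
# (cataract, diabetic retinopathy, glaucoma, normal), from Table 6.
accuracies = {
    "EfficientNetB3":    [0.94, 0.98, 0.92, 0.94],
    "MobileNet":         [0.94, 1.00, 0.89, 0.88],
    "InceptionResNetV2": [0.93, 0.99, 0.86, 0.90],
    "ResNet50":          [0.96, 1.00, 0.91, 0.92],
    "DenseNet121":       [0.96, 1.00, 0.91, 0.89],
    "XceptionNet":       [0.96, 0.98, 0.89, 0.90],
    "InceptionV3":       [0.96, 1.00, 0.88, 0.91],
    "VGG19":             [0.94, 1.00, 0.89, 0.91],
}

stat, p = friedmanchisquare(*accuracies.values())
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

Since p > 0.05, the null hypothesis is retained: the observed accuracy differences between models are not statistically significant at the 5% level.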

Grad-CAM results

XAI is crucial for opening the black box of deep learning, and researchers increasingly use it to interpret model results. 17 For example, XAI has been used to highlight the image regions that drive breast cancer diagnosis, allowing clinicians to verify AI predictions and building trust and transparency; one such approach improved breast cancer identification through advanced image preprocessing, modifications to the DenseNet architecture, and a specialized fine-tuning process for histopathology imaging. 25 In ophthalmology, XAI likewise helps clinicians understand a model's decisions by marking clinically relevant areas in retinal images, supporting the diagnosis of eye disorders. In this study, Grad-CAM visualizations were generated for all eight transfer learning models using the same cataract image as input.

The heatmaps show varying spatial distributions of highlighted regions across models. Some models exhibit concentration of activation near the central region of the fundus image, while others display more diffuse or fragmented activation patterns. The intensity and localization of highlighted regions differ across architectures. Figure 30 presents the resulting heatmaps for each model.

Figure 30.

Grad-CAM of eight pretrained models used in this study by the transfer learning approach. The same image of cataract eye disease is used for all the models for generating their heatmaps.

Although all classifiers achieved excellent accuracy, the Grad-CAM heatmaps revealed substantial differences in their attention patterns. This shows that high accuracy alone cannot establish a model's reliability and underscores the importance of explainable AI for safe model selection. Among the eight TL models, EfficientNetB3 produced the most clinically plausible attention maps for eye disease detection and may also assist physicians in localizing the affected regions.
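Although the exact heatmaps depend on each backbone, the final Grad-CAM step is identical for all models: global-average-pool the class-score gradients to obtain one weight per channel, take the weighted sum of the feature maps, and apply a ReLU. A framework-agnostic NumPy sketch of that step (in practice the gradients come from the framework's autodiff):

```python
import numpy as np

def grad_cam_heatmap(feature_maps, gradients):
    """Combine last-conv-layer activations into a Grad-CAM heatmap.

    feature_maps: (H, W, K) activations of the last convolutional layer.
    gradients:    (H, W, K) gradients of the target class score w.r.t.
                  those activations.
    """
    # alpha_k: global-average-pooled gradient, one weight per channel
    alphas = gradients.mean(axis=(0, 1))                      # shape (K,)
    # channel-weighted sum of the feature maps, then ReLU
    cam = np.maximum((feature_maps * alphas).sum(axis=-1), 0.0)
    # normalize to [0, 1] for overlay on the fundus image
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting low-resolution map is then upsampled to the input size and overlaid on the fundus image, producing heatmaps like those in Figure 30.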

LIME and SHAP

LIME and SHAP analyses were conducted to further examine model behavior at the instance level. Figure 31 presents LIME and SHAP outputs for InceptionV3 as an illustrative example. The highlighted superpixels in LIME represent regions contributing positively to the prediction, while SHAP visualizations show positive and negative pixel contributions.

Figure 31.

LIME and SHAP for model InceptionV3.

Figure 32 presents the LIME visualizations across all eight models. The superpixel patterns differ across architectures, with some models producing broader contiguous regions and others producing more fragmented attribution regions.

Figure 32.

LIME results of the eight models used in the study.

Figure 33 presents SHAP visualizations across all models. The pixel-level attributions appear spatially distributed across the image rather than confined to a single localized region. The predicted probabilities associated with each model are reported alongside the corresponding visualizations.

Figure 33.

SHAP of all the models used in this study.
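The perturbation idea underlying LIME's superpixel attributions can be illustrated with a simpler occlusion sketch: grey out one patch at a time and record how much the model's score drops. Here `score_fn` is a stand-in for any trained classifier's class probability, and the image is assumed grayscale with dimensions divisible by the patch size:

```python
import numpy as np

def occlusion_attribution(image, score_fn, patch=4):
    """Perturbation-based attribution: zero out one patch at a time
    and record the resulting drop in the model score. Large drops mark
    regions the prediction depends on."""
    h, w = image.shape
    base = score_fn(image)
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = 0.0  # occlude one patch
            heat[i // patch, j // patch] = base - score_fn(masked)
    return heat
```

LIME refines this idea by perturbing irregular superpixels and fitting a local linear surrogate model, while SHAP assigns signed per-pixel contributions, but the underlying question is the same: which regions change the prediction when removed?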


Discussion

Disease-specific behavior of transfer learning models

The experimental results indicate that the same TL architecture does not behave uniformly across ocular diseases. Instead, model performance varies systematically according to the visual characteristics of each condition and the dataset properties. This disease-dependent behavior underscores the need for class-level evaluation rather than relying only on global performance.

For diabetic retinopathy (DR), several models achieved near-perfect performance across evaluation metrics. This consistent outcome across diverse architectures suggests that the dataset contains visually distinctive patterns that are readily separable by deep convolutional models. Lesions commonly associated with DR, such as hemorrhages, microaneurysms, and exudates, introduce strong textural and structural contrasts that facilitate discrimination from normal images. Moreover, the highly curated nature of the dataset and relatively controlled imaging conditions likely contributed to the observed performance. Accordingly, DR appears comparatively less complex than the other classes within this benchmark, which helps explain the uniformly high scores.

Cataract detection also exhibited strong performance across multiple architectures. Several models, including ResNet50, DenseNet121, XceptionNet, and InceptionV3, achieved comparable accuracy values, with DenseNet121 and XceptionNet showing particularly balanced performance across accuracy, precision, and recall. Unlike DR, cataract is characterized by global image degradation and loss of clarity rather than localized pathological structures. This characteristic is consistent with the diffuse attribution patterns observed in the explainability analyses and helps explain why multiple architectures can achieve competitive performance despite architectural differences. However, differences across models suggest that sensitivity to cataract-related visual degradation remains architecture-dependent.

In contrast, glaucoma consistently produced lower performance across models. This behavior is clinically and technically expected, as glaucoma diagnosis relies primarily on structural assessment of the optic disc and optic cup, including features such as the cup-to-disc ratio, neuroretinal rim thinning, and localized excavation. These are subtle anatomical cues that are not optimally captured by global image classification approaches. Consequently, TL models operating directly on raw fundus images without region-of-interest extraction or structural modeling appear inherently limited for this task. This supports the need for optic disc/cup-focused pipelines (e.g. ROI learning or segmentation-aware modeling) for glaucoma.

Performance on the normal class was moderate compared with DR and cataract. This outcome likely reflects heterogeneity within the normal category and the presence of borderline cases that visually resemble early-stage pathology. This class overlap can make “normal vs subtle abnormal” discrimination more challenging than detecting advanced disease patterns.

Overall, these findings demonstrate that TL model performance is strongly disease-dependent and that the intrinsic visual characteristics of each condition play a decisive role in achievable accuracy. Therefore, disease-aware model design is preferable to a single generic classification strategy for all ocular disorders.

Why high accuracy is not sufficient: Interpretability and trust

Although several TL models achieved comparable—and in some cases near-perfect—performance metrics, the explainability analyses revealed substantial differences in how these models arrived at their predictions. This discrepancy underscores an important limitation of relying solely on quantitative indicators such as accuracy, precision, recall, and F1-score when evaluating medical AI systems. In medical imaging, correct predictions must also be supported by clinically plausible evidence.

Grad-CAM visualizations demonstrated that models with similar classification accuracy can exhibit markedly different attention patterns. Some architectures concentrated activation on plausible regions of the fundus image, whereas others produced diffuse, fragmented, or weak localization patterns. In certain cases, highlighted regions extended into anatomically irrelevant areas, suggesting that correct predictions may be influenced by spurious correlations rather than disease-relevant features. These differences indicate that numerical performance can mask instability or nonclinical feature reliance.

The instance-level analyses provided by LIME reinforced this observation. Across models, LIME generally identified broad regions contributing to predictions rather than highly localized pathological structures. However, the spatial coherence and granularity of these superpixel attributions differed between architectures. Some models produced overly coarse and contiguous regions, whereas others generated more structured and discriminative attribution patterns. Such variability implies that architectures prioritize different visual cues even when their outputs are similar.

SHAP-based pixel-level attributions further showed that many models relied on globally distributed features rather than sharply localized regions. This behavior is consistent with conditions such as cataract, where visual degradation affects the entire image. At the same time, the lack of focused attribution in tasks requiring structural discrimination highlights limitations of certain architectures for more complex diagnostic targets. Together, these results show that interpretability adds complementary evidence about whether predictions align with expected clinical signals.

Collectively, these findings demonstrate that explainable AI tools are not merely supplementary visualizations but constitute a critical evaluation layer for medical AI systems. Interpretability can help distinguish models that base decisions on clinically meaningful features from those that may rely on coincidental patterns. Consequently, model selection for healthcare applications should incorporate explainability analysis alongside traditional performance metrics.

Overfitting and generalization considerations

To mitigate overfitting across all TL models, regularization and training control strategies were applied, including L2 regularization, dropout, and batch normalization within the classification layers. Data augmentation was deliberately limited to mild transformations (primarily random horizontal flipping) to preserve anatomical integrity. Early stopping and adaptive learning rate scheduling were also employed to stabilize training and prevent unnecessary over-optimization.
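For reference, the two weight-level regularizers named above can be sketched in NumPy; the dropout rate and L2 coefficient here are illustrative, not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, rate=0.5, train=True):
    """Inverted dropout: at train time, zero a fraction `rate` of units
    and rescale the survivors by 1/(1-rate), so the expected activation
    is unchanged and no adjustment is needed at inference."""
    if not train or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def l2_penalty(weights, lam=1e-4):
    """L2 regularization term added to the task loss: lam * sum of
    squared weights, penalizing large parameter values."""
    return lam * sum((w ** 2).sum() for w in weights)
```

Batch normalization, early stopping, and the mild flip-only augmentation complete the regularization recipe described above.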

Despite these measures, learning curves for most models exhibited a noticeable gap between training and validation performance, indicating persistent overfitting. This behavior is widely reported in fundus image benchmarks, particularly when datasets are relatively small or curated under controlled acquisition conditions. Such datasets may not fully capture the diversity of imaging devices, patient demographics, illumination conditions, and disease severity encountered in real clinical environments.

The consistently near-perfect performance observed for DR further suggests that the dataset may simplify the classification task due to visually distinctive patterns and controlled image quality. While this facilitates strong benchmark performance, it also raises concerns regarding generalizability. Models trained and evaluated under these conditions may not maintain comparable performance on external datasets or real-world clinical settings.

These observations highlight the need for cautious interpretation of performance results. The current findings demonstrate the effectiveness of TL models within the evaluated framework and support their use for research-oriented decision support prototypes. However, translation to clinical deployment requires external validation and broader generalization-focused evaluation.

Practical implications for model selection

The combined analysis of quantitative performance metrics and explainability outcomes provides guidance for selecting models according to the target ocular disease. Rather than identifying a single universally optimal architecture, the findings indicate that model suitability is task-dependent.

For DR, several architectures demonstrated consistently high performance across evaluation metrics. In this context, model selection becomes less dependent on marginal accuracy differences and more influenced by considerations such as computational efficiency, stability, and deployment constraints. Accordingly, multiple TL models are reasonable candidates, with external validation as the key next step.

For cataract detection, multiple architectures also exhibited competitive performance. DenseNet121 and XceptionNet showed particularly balanced results across accuracy, precision, and recall, while explainability analyses indicated that EfficientNetB3 produced relatively coherent attribution patterns. Thus, cataract-oriented systems may prioritize models that combine strong metrics with consistent explanations.

In contrast, glaucoma detection presents different requirements. The comparatively lower metrics, combined with diffuse or weak attribution patterns, indicate that classification-only pipelines based on global fundus images are limited for this task. These findings support the need for pipelines that incorporate structural analysis, such as optic disc/cup segmentation, region-of-interest learning, or anatomically informed feature extraction.

More broadly, this study highlights that optimal model selection in medical imaging should not be based solely on headline accuracy values. Instead, a balanced evaluation framework integrating predictive performance, training stability, interpretability, and disease-specific compatibility offers a more reliable foundation for clinically meaningful AI development.

Summary of key findings

This study demonstrates that the effectiveness of TL models for ocular disease diagnosis is strongly dependent on the clinical task. While several architectures achieved near-perfect performance for DR and robust performance for cataract detection, glaucoma remained substantially more challenging under a classification-only paradigm. Explainability analyses revealed differences in model behavior that were not captured by performance metrics alone, highlighting the importance of interpretability for trustworthy evaluation. Overall, the findings support disease-aware modeling and evaluation frameworks that jointly consider performance and interpretability.

Limitations

This study has several limitations that should be acknowledged when interpreting the findings. First, the evaluation was conducted using a single train–test split (70/30), which may limit robustness of performance estimation. Future work should incorporate cross-validation and multiple data splits to provide more reliable and statistically stable evaluations.

Second, all experiments were performed on a single publicly available dataset. While this enabled controlled benchmarking across models, reliance on one data source may restrict generalizability. External validation using independent datasets collected across different clinical environments, imaging devices, and patient populations is needed to better assess real-world performance.

Third, although several models achieved near-perfect results for DR, such outcomes should be interpreted with caution. The dataset appears curated under relatively controlled conditions with visually distinctive patterns, which may simplify the classification task. Consequently, strong benchmark performance may not directly translate to equivalent performance in heterogeneous clinical settings.

Finally, hyperparameter optimization was not explored. Models were evaluated under a unified configuration to ensure fair comparison. While this supports consistency, further gains may be achievable through systematic hyperparameter tuning, advanced augmentation strategies, and architecture-specific optimization.

Conclusion

Recapitulation

Visual impairment resulting from ocular conditions necessitates early and accurate diagnosis to prevent irreversible vision loss. This study comprehensively evaluated eight widely used pretrained TL architectures under a unified experimental framework on the publicly available “eye-disease-detection” dataset, addressing multiclass classification of four ocular categories: diabetic retinopathy, cataract, glaucoma, and normal. The outputs of all models were systematically interpreted using explainable AI techniques.

The results indicate that TL models are remarkably effective in detecting diabetic retinopathy, with several models achieving near-perfect performance on the evaluated dataset. DenseNet121 and XceptionNet demonstrated particularly strong performance in cataract detection, while glaucoma detection proved more challenging, with relatively lower metrics and signs of overfitting in several models. Notably, the study did not apply hyperparameter fine-tuning or segmentation-based preprocessing, which may have contributed to the suboptimal performance in glaucoma diagnosis. Explainable AI tools such as Grad-CAM were employed to visualize and analyze the decision-making process of each TL model, revealing important disparities in interpretability and localization. Among all models evaluated, EfficientNetB3 not only delivered high accuracy but also produced the most clinically plausible heatmaps, making it the most promising candidate for real-world applications.

Future work

Building on this study, future work will focus on validation-oriented extensions that are necessary for clinical-grade deployment. For diabetic retinopathy, where multiple TL models already demonstrate consistently strong performance, the next step is developing a robust computer-aided diagnostic pipeline with external dataset validation and deployment-oriented evaluation. For glaucoma, the results indicate that classification-only modeling on raw fundus images is insufficient; therefore, future work will prioritize segmentation-aware and ROI-based approaches, including optic disc and optic cup delineation, and the integration of structural biomarkers.

Across all diseases, systematic hyperparameter optimization and disease-aware augmentation/preprocessing ablation studies will be conducted to quantify their impact relative to the baseline models. In addition, ensemble strategies that combine top-performing TL models will be investigated to improve robustness and generalizability across heterogeneous image characteristics. Finally, future evaluation will include area under the curve-based analyses alongside existing metrics, and the continued integration of explainable AI modules to support transparency and trust in clinical settings.

Footnotes

ORCID iD: Tehseen Mazhar https://orcid.org/0000-0002-4649-2376

Ethics approval: Not applicable.

Consent to participate: Not applicable.

Consent to publish: Not applicable.

Author contributions: Muhammad Amir Khan and Muhammad Usman Tariq performed the original writing, software, and methodology; Sheikh Muhammad Saqib and Tehseen Mazhar performed rewriting, investigation, methodology design, and conceptualization; Tariq Shahzad, Habib Hamam, and Abdul Khader Jilani Saudagar prepared the related-work section and managed the results and discussion; Abdul Khader Jilani Saudagar, Tehseen Mazhar, and Muhammad Amir Khan performed rewriting, methodology design, and visualization; Tariq Shahzad and Muhammad Saqib performed rewriting, methodology design, and visualization.

Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R845), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

  • 1.World Health Organization. Blindness and vision impairment, https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment (2023).
  • 2.Malik S, Kanwal N, Asghar MN, et al. Data driven approach for eye disease classification with machine learning. Appl Sci 2019; 9: 2789. [Google Scholar]
  • 3.Abbas Q. Glaucoma-deep: detection of glaucoma eye disease on retinal fundus images using deep learning. Int J Adv Comput Sci Appl 2017; 8. DOI: 10.14569/IJACSA.2017.080606. [DOI] [Google Scholar]
  • 4.Saqib SM, Iqbal M, Asghar MZ, et al. Cataract and glaucoma detection based on transfer learning using MobileNet. Heliyon 2024; 10: e36759. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Guefrachi S, Echtioui A, Hamam H. Diabetic retinopathy detection using deep learning multistage training method. Arab J Sci Eng 2025; 50: 1079–1096. [Google Scholar]
  • 6.Chen N, Lv X. Research on segmentation model of optic disc and optic cup in fundus. BMC Ophthalmol 2024; 24: 273. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Alkhaldi NA, Alabdulathim RE. Optimizing glaucoma diagnosis with deep learning-based segmentation and classification of retinal images. Appl Sci 2024; 14: 7795. [Google Scholar]
  • 8.Liu Y, Wu J, Zhu Y, et al. Combined optic disc and optic cup segmentation network based on adversarial learning. IEEE Access 2024. [Google Scholar]
  • 9.Chen Y, Liu Z, Meng Y, et al. Lightweight optic disc and optic cup segmentation based on MobileNetv3 convolutional neural network. Biomimetics 2024; 9: 637. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Alzamil ZS. Advancing eye disease assessment through deep learning: a comparative study with pre-trained models. Eng Technol Appl Sci Res 2024; 14: 14579–14587. [Google Scholar]
  • 11.Guefrachi S, Echtioui A, Hamam H. Automated diabetic retinopathy screening using deep learning. Multimed Tools Appl 2024; 83: 65249–65266. [Google Scholar]
  • 12.Kaushik H, Singh D, Kaur M, et al. Diabetic retinopathy diagnosis from fundus images using stacked generalization of deep models. IEEE Access 2021; 9: 108276–92. [Google Scholar]
  • 13.Shahwar T, Zafar J, Almogren A, et al. Automated detection of Alzheimer’s via hybrid classical quantum neural networks. Electronics (Basel) 2022; 11: 721. [Google Scholar]
  • 14.Singh DP, Banerjee T, Mahajan S, et al. A comprehensive study on deep learning models for the detection of diabetic retinopathy using pathological images. Arch Comput Methods Eng 2025: 1–30. [Google Scholar]
  • 15.Jain L, Murthy HS, Patel C, et al. , eds. Retinal eye disease detection using deep learning. In: 2018 14th International conference on information processing (ICINPRO), 2018: IEEE. [Google Scholar]
  • 16.Muntaqim MZ, Smrity TA, Miah ASM, et al. Eye disease detection enhancement using a multi-stage deep learning approach. IEEE Access 2024: 1–1. DOI: 10.1109/ACCESS.2024.3476412. [DOI] [Google Scholar]
  • 17.Nasra P, Gupta S, Kumar GR, eds. Leveraging EfficientNetB3 for accurate breast cancer classification: insights and innovations. In: 2024 9th International conference on communication and electronics systems (ICCES), 2024: IEEE. [Google Scholar]
  • 18.Satushe V, Vyas V, Metkar S, et al. AI in MRI brain tumor diagnosis: a systematic review of machine learning and deep learning advances (2010–2025). Chemom Intell Lab Syst 2025; 263: 105414. [Google Scholar]
  • 19.Kotwal JG, Kashyap R, Shafi PM. Artificial driving based EfficientNet for automatic plant leaf disease classification. Multimed Tools Appl 2024; 83: 38209–38240. [Google Scholar]
  • 20.Saqib SM, Iqbal M, Mazhar T, et al. Effectiveness of teachable machine, mobile net, and YOLO for object detection: a comparative study on practical applications. Egypt Inform J 2025; 30: 100680. [Google Scholar]
  • 21.Talukder MA, Islam MM, Uddin MA, et al. A deep ensemble learning framework for brain tumor classification using data balancing and fine-tuning. Sci Rep 2025; 15: 35251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Winarno W, Harjoko A. Enhancing deep learning model using whale optimization algorithm on brain tumor MRI. J Electron Electromed Eng Med Inform 2026; 8: 136–151. [Google Scholar]
  • 23.Cambay VY, Barua PD, Hafeez Baig A, et al. Automated detection of gastrointestinal diseases using resnet50*-based explainable deep feature engineering model with endoscopy images. Sensors 2024; 24: 7710. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Manoranjitham R, Swaroop SS, (eds). A comparative study of DenseNet121 and InceptionResNetV2 model for deepfake image detection. In: 2024 3rd International conference on applied artificial intelligence and computing (ICAAIC), 2024: IEEE. [Google Scholar]
  • 25.Talukder MA. An improved XAI-based DenseNet model for breast cancer detection using reconstruction and fine-tuning. Results Eng 2025; 26: 104802. [Google Scholar]

Articles from Digital Health are provided here courtesy of SAGE Publications
